Datadog Boosts AI Query Agent Accuracy by 59% with LLM Autoresearch

What if an AI agent could run your experiments while you sleep? Datadog’s Database Monitoring team set up exactly that, and the results are turning heads across the AI development community.

The team used Andrej Karpathy’s autoresearch tool alongside Datadog LLM Observability Experiments to autonomously run 23 experiments overnight. The goal was to improve an AI-powered SQL query optimization agent, and it worked: the agent’s precision score jumped from P=0.54 to P=0.86, a 59% relative gain, without human intervention between sessions.
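The 59% figure is a relative gain over the starting precision, not a percentage-point change. A quick check using only the two precision values quoted above:

```python
# Precision before and after the autoresearch run, as quoted in the article.
baseline_precision = 0.54
final_precision = 0.86

# Absolute gain is measured in precision points; relative gain divides by the baseline.
absolute_gain = final_precision - baseline_precision
relative_gain = absolute_gain / baseline_precision

print(f"absolute gain: {absolute_gain:.2f}")
print(f"relative gain: {relative_gain:.1%}")
```

In other words, precision rose by 0.32 points, which is roughly 59% of the 0.54 starting value.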

The challenge, however, started much earlier. Datadog’s Database Monitoring (DBM) product already uses a precise, rule-based heuristic engine for query optimization. That engine scores P=0.903 on precision. The team then wondered: could an AI agent catch the patterns the engine misses?

When they fed an LLM a zero-shot prompt (no domain rules, just “analyze this SQL”), the agent found far more patterns. Still, nearly half its suggestions were wrong. Recall was strong at R=0.90, but precision collapsed to P=0.54. The task, therefore, became teaching the agent to be more precise without losing its breadth.

To do that properly, the team first built a solid evaluation dataset. They created 100 test cases across five categories: rewrites, missing indexes, anti-patterns, maintenance, and schema changes. Thirty percent of those cases were negative examples, queries that needed no optimization at all. They then configured precision, recall, and F1 evaluators using the LLM Observability SDK.
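To make the evaluators concrete, here is a minimal sketch in plain Python rather than the LLM Observability SDK, whose evaluator interface is not shown in this article. The `Case` class, the label names, and the choice of micro-averaging are all assumptions made for illustration; averaging per-case scores instead is an equally plausible choice and yields different numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """Hypothetical test case: a query plus expected and predicted labels."""
    query: str
    expected: set = field(default_factory=set)   # empty set marks a negative example
    predicted: set = field(default_factory=set)  # labels the agent suggested


def micro_scores(cases):
    """Micro-averaged precision, recall, and F1 across all cases."""
    tp = sum(len(c.expected & c.predicted) for c in cases)  # correct suggestions
    fp = sum(len(c.predicted - c.expected) for c in cases)  # spurious suggestions
    fn = sum(len(c.expected - c.predicted) for c in cases)  # missed optimizations
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data, not the team's dataset.
cases = [
    Case("q1", {"missing_index"}, {"missing_index", "rewrite"}),  # one hit, one spurious
    Case("q2", {"anti_pattern"}, {"anti_pattern"}),               # exact match
    Case("q3", set(), {"maintenance"}),                           # negative example, wrongly flagged
    Case("q4", {"schema_change"}, set()),                         # missed optimization
]

precision, recall, f1 = micro_scores(cases)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Negative examples matter here because any suggestion on a query that needs no optimization counts as a false positive, which is exactly the failure mode that drags precision down.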

Next came the autoresearch loop. Instead of designing each experiment manually, the team wrote a HANDOFF.md document defining goals, error analysis, and next hypotheses. A coding agent read the handoff, designed experiments, ran them via LLM Observability Experiments, analyzed per-case failures, and updated the handoff for the next session. The loop ran roughly four to eight experiments per session, fully automated.
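The shape of one such session can be sketched as follows. This is not Karpathy’s autoresearch code or Datadog’s driver: `run_session`, `propose`, and `run_experiment` are hypothetical names, the stubs stand in for the coding agent and the experiment runner, and the scores are placeholders rather than real results.

```python
import tempfile
from pathlib import Path


def run_session(handoff_path, propose, run_experiment, max_experiments=8):
    """One session: read the handoff, run experiments, write findings back."""
    handoff = handoff_path.read_text()
    results = []
    for _ in range(max_experiments):
        hypothesis = propose(handoff, results)  # the agent picks the next idea
        if hypothesis is None:                  # nothing left to try this session
            break
        results.append((hypothesis, run_experiment(hypothesis)))
    notes = "\n".join(f"- {h}: precision={p:.3f}" for h, p in results)
    handoff_path.write_text(handoff + "\n## Session results\n" + notes + "\n")
    return results


# Stubs standing in for the coding agent and the experiment runner.
def propose(handoff, results):
    queue = ["require tool evidence", "add negative examples"]
    return queue[len(results)] if len(results) < len(queue) else None


def run_experiment(hypothesis):
    placeholder_scores = {"require tool evidence": 0.7, "add negative examples": 0.8}
    return placeholder_scores[hypothesis]  # placeholder values, not measured results


with tempfile.TemporaryDirectory() as tmp:
    handoff_path = Path(tmp) / "HANDOFF.md"
    handoff_path.write_text("# Goal\nRaise precision without losing recall.\n")
    results = run_session(handoff_path, propose, run_experiment)
    updated_handoff = handoff_path.read_text()

print(updated_handoff)
```

The key design point is that the handoff file is both input and output, so the next session starts from what the previous one learned.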

The experiments ran in three phases. In Phase 1, the agent iterated on prompts and tool descriptions using Claude Sonnet 4.6. Requiring the model to cite tool evidence before suggesting optimizations, such as referencing explain plan costs or schema indexes, pushed precision from 0.54 to 0.83 almost immediately. Adding three worked examples of what not to optimize then pushed precision further to 0.878 while holding recall at 0.858.
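The team enforced the evidence requirement through the prompt. One way to picture the same idea is as a structural check on the agent’s output, where a suggestion with no cited tool evidence is dropped. The `Suggestion` class and its fields below are hypothetical, and the data is a toy example.

```python
from dataclasses import dataclass, field


@dataclass
class Suggestion:
    """Hypothetical agent output: an optimization plus the evidence behind it."""
    kind: str
    rationale: str
    evidence: list = field(default_factory=list)  # tool outputs the model cited


def drop_unsupported(suggestions):
    """Keep only suggestions that cite at least one piece of tool evidence."""
    return [s for s in suggestions if s.evidence]


suggestions = [
    Suggestion(
        "missing_index",
        "Filter column has no index",
        ["explain_plan: sequential scan on orders"],
    ),
    Suggestion("rewrite", "The query looks slow"),  # no evidence cited
]

supported = drop_unsupported(suggestions)
print([s.kind for s in supported])
```

The unsupported "rewrite" suggestion is the kind of plausible-sounding guess that inflates false positives.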

Phase 2 focused on compressing the gains into the smaller, cheaper Claude Haiku 4.5 model. A straightforward prompt transfer failed. So instead, the team applied a knowledge distillation approach. They compared Sonnet and Haiku traces on the same cases and found that Haiku was confusing missing indexes with stale statistics. They extracted four examples from Sonnet’s correct reasoning and embedded them into Haiku’s prompt. Both precision and recall improved as a result.
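A trace comparison of this kind can be sketched as a confusion count followed by example selection. The record layout, the label names, and the helper functions are assumptions for illustration, with "teacher" standing in for Sonnet and "student" for Haiku; the data is a toy example, not the team’s traces.

```python
from collections import Counter


def confusion_pairs(cases):
    """Count (expected, predicted) pairs where the student model was wrong."""
    return Counter(
        (c["expected"], c["student"]) for c in cases if c["student"] != c["expected"]
    )


def pick_teacher_examples(cases, pair, limit=4):
    """Cases with this confusion that the teacher got right: few-shot candidates."""
    expected, wrong = pair
    return [
        c for c in cases
        if c["expected"] == expected and c["student"] == wrong and c["teacher"] == expected
    ][:limit]


# Toy per-case labels for both models on the same inputs.
cases = [
    {"id": 1, "expected": "stale_statistics", "teacher": "stale_statistics", "student": "missing_index"},
    {"id": 2, "expected": "stale_statistics", "teacher": "stale_statistics", "student": "missing_index"},
    {"id": 3, "expected": "missing_index", "teacher": "missing_index", "student": "missing_index"},
    {"id": 4, "expected": "rewrite", "teacher": "rewrite", "student": "anti_pattern"},
]

top_pair, count = confusion_pairs(cases).most_common(1)[0]
examples = pick_teacher_examples(cases, top_pair)
print(top_pair, count, [c["id"] for c in examples])
```

The selected cases are where the larger model’s reasoning is most useful to copy into the smaller model’s prompt, because they target the specific confusion rather than general quality.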

Even so, Haiku’s results still fell just short of the team’s target of P≥0.85. The autoresearch driver then proposed a two-pass solution: a high-recall detector followed by a surgical verifier. After three iterations to calibrate the verifier’s aggressiveness, the final blind evaluation on 50 unseen cases landed at P=0.860, R=0.823, and F1=0.803.
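The two-pass idea can be sketched as a permissive detector whose candidates are filtered by a stricter verifier. Modeling the verifier’s aggressiveness as a confidence threshold is an assumption, as are the function names; the stubs return fixed toy values in place of model calls.

```python
def two_pass(query, detect, verify, threshold):
    """Pass 1 proposes candidates broadly; pass 2 keeps only confident ones."""
    candidates = detect(query)
    return [c for c in candidates if verify(query, c) >= threshold]


# Stubs standing in for the two model calls.
def detect(query):
    return ["missing_index", "rewrite", "maintenance"]  # high recall, some noise


def verify(query, candidate):
    toy_confidence = {"missing_index": 0.9, "rewrite": 0.6, "maintenance": 0.2}
    return toy_confidence[candidate]


lenient = two_pass("q1", detect, verify, threshold=0.5)
strict = two_pass("q1", detect, verify, threshold=0.8)
print(lenient)
print(strict)
```

Raising the threshold trades recall for precision, which is why calibrating the verifier took several iterations: too lenient and false positives survive, too strict and real optimizations are thrown away.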

Across 23 experiments, the autoresearch agent kept 17 improvements and discarded 6. Every experiment was recorded as a structured object in LLM Observability Experiments, complete with its hypothesis, configuration, and per-case results. That meant no scattered notes, no lost context between sessions, and no “I think we tried that” moments.
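As a rough picture of what a structured experiment record buys you, the sketch below serializes one record to JSON. The `ExperimentRecord` fields mirror the three items named above (hypothesis, configuration, per-case results) but are otherwise hypothetical and do not reflect the actual LLM Observability Experiments schema; the keep rule is a simplifying assumption.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    """Hypothetical record; not the LLM Observability Experiments schema."""
    name: str
    hypothesis: str
    config: dict
    per_case: list = field(default_factory=list)  # one dict per test case
    kept: bool = False


def decide(record, precision, best_precision):
    """Assumed rule: keep an experiment only if it beats the best precision so far."""
    record.kept = precision > best_precision
    return record


record = ExperimentRecord(
    name="exp-example",
    hypothesis="Citing tool evidence reduces false positives",
    config={"prompt_variant": "evidence_required"},
    per_case=[{"case_id": 1, "correct": True}, {"case_id": 2, "correct": False}],
)
decide(record, precision=0.6, best_precision=0.5)  # placeholder scores

line = json.dumps(asdict(record))  # one line per experiment in a log
print(line)
```

Because each record carries its own hypothesis and per-case outcomes, a later session can answer "did we try that?" by querying records instead of relying on memory.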

The methodology, notably, is not limited to SQL optimization. Any AI agent with an evaluation dataset, a task function, and measurable evaluators can be iterated on the same way. Datadog’s LLM Observability now includes a free tier for the first 40,000 LLM spans, making this kind of autoresearch-driven LLM experimentation accessible to teams of all sizes.

For teams still managing experiment history in terminal logs and half-remembered chats, this result makes a compelling case for infrastructure-driven AI research.