I asked what should have been a simple question: what's the best local LLM observability tool?
Ten names, some half-remembered feature bullets, and a recommendation that mostly reflected whatever the model had seen the most of. Authoritative tone, no reasoning. The usual.
So instead of asking for the answer, I asked for the work.
Research at least 50 tools. One report per tool. Build a feature and cost matrix. Find the top 10 most popular. Include at least five MIT-licensed options. End with an executive summary and a recommendation doc. Don't just tell me what's good — leave me with artifacts I can inspect.
That ended up working far better than the original question had any right to. Not because the model became smarter, but because the task stopped being "have an opinion" and became "do a survey, show your work, then synthesize."
I've reused that pattern enough times now that I have a skill at ~/.claude/skills/agentic-research/ that encodes the whole workflow — brief, parallel dispatch, per-system reports, synthesis, confidence markers. Fifteen research repos in my ~/code/ directory.
50 reports and a CSV
The LLM observability research left me with this:
research-llmobservability/
├── tools/
│   ├── 01-langfuse.md
│   ├── 02-langsmith.md
│   ├── 03-arize-phoenix.md
│   ├── ...
│   └── 50-arthur-ai.md
└── reports/
    ├── executive-summary.md
    ├── detailed-recommendations.md
    ├── llm-observability-tools-comparison.csv
    └── web/
        ├── index.html
        ├── app.js
        └── data.js
50 tool reports. A 60-row CSV with columns for license, self-hostability, free tier, tracing, evaluations, prompt management, OpenTelemetry support, GitHub stars. An executive summary with a ranked top 10 and a feature coverage matrix. A web UI to browse the results.
Each tool report followed the same template — overview, company info, license, key users, feature coverage, pricing, strengths, limitations, sources. The Langfuse report, for instance, had the company's funding history, their ClickHouse acquisition, the specific date they open-sourced their remaining features under MIT, and which organizations use them. Not because I needed all of that for my decision, but because the template forced completeness over skimming.
The CSV was the most useful artifact. I could filter by license, sort by GitHub stars, cross-reference features. The actual decision — Langfuse for my use case — took about ten minutes once the matrix existed. The "research" part that would have taken days of tab-hopping took a few hours of agent runtime.
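The filtering itself is trivial once the matrix exists. A sketch of what "filter by license, sort by stars" looks like over a CSV in roughly that shape — the column names and sample rows here are my assumptions, not the exact headers the run produced:

```python
import csv
import io

# Sample rows in the rough shape of the comparison CSV.
# Column names and values are illustrative, not the real output.
SAMPLE = """name,license,self_hostable,github_stars
Langfuse,MIT,yes,9800
LangSmith,Proprietary,no,0
Arize Phoenix,Elastic-2.0,yes,4100
Helicone,Apache-2.0,yes,2900
"""

def shortlist(csv_text, licenses=("MIT", "Apache-2.0")):
    """Keep self-hostable tools under a permissive license, sorted by stars."""
    rows = csv.DictReader(io.StringIO(csv_text))
    keep = [r for r in rows
            if r["license"] in licenses and r["self_hostable"] == "yes"]
    return sorted(keep, key=lambda r: int(r["github_stars"]), reverse=True)

for row in shortlist(SAMPLE):
    print(row["name"], row["license"], row["github_stars"])
```

Ten minutes of this kind of slicing against real rows is the whole decision step.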
Collection before synthesis
Most bad AI research output is bad because collection and synthesis happen simultaneously. The model starts making judgments before it has done enough collection, and by the time it reaches a conclusion it has baked the conclusion into the collection itself. Everything bends toward whatever answer it started drifting toward.
The fix is small and boring. Make the agent produce one report per system in a fixed format. Only after all reports exist, ask for synthesis. When 50 individual reports are sitting in a folder, the synthesis step becomes something you can inspect and challenge rather than just accept.
This is why the brief matters. Before any agent touches a keyboard, I write down:
- What decision this run needs to inform.
- What systems to look at.
- What each per-system report should contain (the template).
- What the synthesis should answer.
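The brief can be captured as plain data before any agent runs. A hypothetical sketch — the field names and contents are mine, not a fixed schema the skill requires:

```python
# A research brief captured as plain data before any agent runs.
# Field names and contents are illustrative, not a fixed schema.
BRIEF = {
    "decision": "Pick a self-hostable LLM observability tool for a small team",
    "systems": ["Langfuse", "LangSmith", "Arize Phoenix", "Helicone"],
    "report_template": [
        "Overview", "Company info", "License", "Key users",
        "Feature coverage", "Pricing", "Strengths", "Limitations", "Sources",
    ],
    "synthesis_questions": [
        "Which tools are MIT-licensed and self-hostable?",
        "Which has the best tracing story for a local-first stack?",
    ],
}

def collection_prompt(brief, system):
    """Render the per-system prompt every collection agent receives."""
    sections = "\n".join(f"## {s}" for s in brief["report_template"])
    return (f"Research {system} to inform this decision: "
            f"{brief['decision']}.\nFill in every section:\n{sections}")

print(collection_prompt(BRIEF, "Langfuse"))
```

Because every collection agent gets its prompt rendered from the same template, no agent can quietly decide which sections matter.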
The template is load-bearing infrastructure. Without it, agents produce reports of wildly varying depth and focus. With it, every report covers the same ground, and the comparison matrix writes itself.
Same shape, different domain
Once I had the workflow, I pointed it at everything.
Vector search databases — 72 systems, same structure. One report per database, an overview, a comparison, a recommendation doc.
research-vector-search-databases/
├── tools/
│   ├── annoy.md
│   ├── chroma.md
│   ├── faiss.md
│   ├── lancedb.md
│   ├── milvus.md
│   ├── pgvector.md
│   ├── pinecone.md
│   ├── qdrant.md
│   ├── weaviate.md
│   ├── ... (72 tools total)
│   └── zep.md
├── OVERVIEW.md
├── COMPARE.md
└── RECOMMENDATIONS.md
The recommendation doc was practical — it started with my actual requirements (local-only, lightweight, for prototyping AI features), then ranked options against those constraints with code examples. LanceDB won. I wouldn't have found it by asking "what's a good vector database?" because the answer to that question depends entirely on what you're building, and the model doesn't know what you're building unless you force it through a structured comparison first.
Agentic tools — a broader survey of coding-agent products and projects. 588 files. Messier, because the space is messier. Tools move fast, product pages say very little, the naming is muddy, and some things that look like products are wrappers around the same underlying ideas. The same workflow still helped: collect first, standardize second, synthesize last.
Installer scripts — I wanted to understand how projects handle the curl | sh install flow, partly because I wanted to do something similar for Token without cargo-culting whatever the first project I found happened to do. That run covered 120 installer scripts. You learn a lot from staring at how other people detect platforms, escalate privileges, verify downloads, and quietly cut corners.
Skill marketplaces — started as a curiosity, turned into a quality audit.
Most skills are bad
I'd suspected for a while that most skill marketplaces were full of thin prompt wrappers and generic fluff. I wanted to see if that impression survived a larger pass.
It did. From the analysis:
| Rating | Count | % | Description |
|---|---|---|---|
| 5 | 171 | 7.9% | Excellent |
| 4 | 415 | 19.2% | Good |
| 3 | 891 | 41.2% | Decent |
| 2 | 558 | 25.8% | Low quality |
| 1 | 129 | 6.0% | Spam/useless |
24,676 skills mapped across sources. 2,164 classified in detail by automated review. The low-end failure modes were consistent: skills that restate advice the base model already knows. Placeholder templates published as finished workflows. Walls of domain notes with no execution protocol. Duplicates of other skills with names swapped out.
The better ones were immediately different. They were specific about when to use them. They constrained output. They defined steps. They carried a quality standard. They handled unknowns honestly instead of pretending everything was knowable from a skim.
That distinction matters because if you're going to rely on skills at all, you need to know whether you're installing a workflow or just decorating the prompt stack.
Confidence markers
The skill that emerged from all of this forces one thing that turns out to matter more than it sounds: explicit confidence annotations.
Every section of every report gets a marker:
- 🟢 High — API docs, schema, or primary source available
- 🟡 Medium — inferred from help docs, UI screenshots, or demos
- 🔴 Low — guessed from marketing copy or reviews
"Data model: Not publicly documented. Inferred from UI screenshots and help articles." is more useful than a confident-sounding fabrication. One of the easiest ways to ruin this kind of workflow is to let every finding wear the same level of certainty. The model defaults to confident. You have to force it to hedge.
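Forcing the hedge can even be mechanical. A toy lint, assuming a report format where each section heading is followed by a marker line — the format here is my invention for illustration, not the skill's exact output:

```python
import re

# Toy report fragment: each "##" section should open with a confidence marker.
REPORT = """\
## Data model
🔴 Low - Not publicly documented. Inferred from UI screenshots.

## Pricing
🟢 High - Taken from the published pricing page.

## Tracing
Supports OpenTelemetry export.
"""

MARKERS = ("🟢", "🟡", "🔴")

def unmarked_sections(report):
    """Return section titles whose body doesn't open with a confidence marker."""
    missing = []
    # Split on headings; pair each title with the text that follows it.
    parts = re.split(r"^## ", report, flags=re.MULTILINE)[1:]
    for part in parts:
        title, _, body = part.partition("\n")
        if not body.lstrip().startswith(MARKERS):
            missing.append(title.strip())
    return missing

print(unmarked_sections(REPORT))  # → ['Tracing']
```

A check like this, run after collection and before synthesis, catches the sections where the model slipped back into unmarked confidence.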
Separating responsibilities
Once you start dispatching parallel research agents, the important question stops being "what prompt did you use?" and becomes "what responsibilities did you separate?"
If the same agent is trying to discover, judge, summarize, recommend, and implement all in one flow, quality drops fast. If you let multiple agents do the boring collection work in parallel and then synthesize over the results, the output gets more stable.
The pattern that keeps working:
- Collection agents — one per system, running in parallel, following the same template, using web search and web fetch to pull real data. No opinions. Just structured reports with confidence markers and source URLs.
- Synthesis agent — runs after all collection is done, reads all reports, builds the comparison matrix, answers the original research questions, produces recommendations.
This is token-hungry. A 50-system research run burns through context fast. But this is one of the few cases where spending more tokens buys something real — because each additional report adds data the synthesis can draw on, and the synthesis quality scales with the breadth of collection.
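The two-phase shape can be sketched in a few lines. Here `run_agent` is a stand-in for however you actually dispatch an agent (CLI call, API request, subprocess) — the point is the orchestration: all collection finishes before synthesis sees anything.

```python
from concurrent.futures import ThreadPoolExecutor

# run_agent is a placeholder for a real agent dispatch (CLI, API, subprocess).
def run_agent(prompt):
    return f"report for: {prompt}"

def research(systems, template, questions):
    # Phase 1: one collection agent per system, in parallel, same template.
    with ThreadPoolExecutor(max_workers=8) as pool:
        prompts = [f"Research {s} using template: {template}" for s in systems]
        reports = list(pool.map(run_agent, prompts))
    # Phase 2: a single synthesis agent reads every report at once.
    synthesis = run_agent(
        "Answer these questions from the reports below:\n"
        + "\n".join(questions) + "\n\n" + "\n---\n".join(reports)
    )
    return reports, synthesis

reports, synthesis = research(
    ["qdrant", "lancedb"],
    "overview/license/pricing",
    ["Which is best for local prototyping?"],
)
print(len(reports))  # → 2
```

The hard boundary between the two phases is what keeps the synthesis honest: it can only argue from reports that already exist on disk.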
Research as design input
The workflow doesn't just produce rankings and summaries. Once you have enough structured findings, implementation ideas fall out almost for free.
The installer-script research gave me practical patterns I could reuse or avoid when building install flows for Token. The skill-marketplace sweep clarified what a good skill actually needs — which fed directly back into the research skill itself. The vector database run didn't just tell me which database to use, it showed me the API shapes and usage patterns that informed how I structured my embedding code.
That recursive loop is part of why I keep doing this. Research starts as comparison, then turns into design input. The more structured the collection, the more the design implications become obvious during synthesis.
What fails
The easiest ways to ruin this workflow:
- start with a vague objective ("research AI tools" — which ones? for what decision?)
- collect too much with no fixed structure (50 reports of varying depth and focus are worthless)
- skip the synthesis pass because you're tired (the reports aren't the output — the synthesis is)
- treat all sources as equally trustworthy (marketing copy is not API docs)
- ask for a recommendation before the collection is stable
The result is a folder full of documents that feel impressive and say very little.
The fix is boring: define what decision the run needs to inform, keep the report template tight, mark confidence explicitly, and force the synthesis to answer the original questions.
The brief
Don't start by asking for 50 or 100 systems unless you have a reason. Start with 8–12 systems and one decision that matters.
You don't need to write the brief from scratch. Start with the rough idea and let the model turn it into a structured brief. Something like:
I'm evaluating background job frameworks for a Rails app that currently
uses Sidekiq. We're hitting scaling issues and I want to know what else
is out there. I care about Redis dependency, horizontal scaling, job
prioritization, dead letter handling, and whether it plays well with
Kubernetes. I'd also like to know what non-Ruby ecosystems use — Go
and Rust job systems might have patterns worth stealing.
Turn this into a research brief. Define the objective, list 10–15
systems to research (include 2–3 from adjacent ecosystems), write
the per-system report template, and define the synthesis questions
this research should answer.
The model is good at structuring your thinking. It will identify systems you forgot, suggest comparison dimensions you didn't think of, and produce a template that covers the right ground. Review the brief, adjust it, then use it to dispatch the actual research agents.
The two-pass approach — rough idea turned into a structured brief, then the brief dispatched as parallel research — is faster than trying to write the perfect brief yourself. The brief doesn't need to be perfect anyway. It needs to be specific enough that every report covers the same ground and the synthesis has clear questions to answer.
Then dispatch the agents in parallel. The workflow doesn't demand your full attention while it runs. You spend the energy up front defining what good collection looks like, then come back later to a pile of material worth thinking about.
One thing I learned after a few runs: always include 2–3 systems from adjacent domains, not just direct competitors. Sometimes the most useful patterns don't come from your category — they come from a neighboring one that's had to solve a similar problem under different constraints. The vector database research benefited from including general-purpose databases with vector extensions alongside purpose-built vector stores. The contrast was where the interesting insights lived.
If the run ends with "interesting notes" and no decision surface, it wasn't done yet.
The bar
Asking "what's the best tool?" produces polished mush. Asking for 50 structured reports produces something you can actually reason about.
I kept doing it because it turned research into something operational. Fifteen research repos, each with the same structure: per-system reports, comparison artifacts, synthesis documents. Not profound. Not magical. Just more useful than "top 10 tools in 2026."
The bar is simple: does the output help you make a decision you couldn't make before? A ranked list of names doesn't clear that bar. A folder with 50 structured reports and a filterable comparison matrix does. The difference is entirely in whether you asked for an opinion or asked for the work.
The research skill I use for this workflow is available as a gist. Drop it in ~/.claude/skills/agentic-research/SKILL.md and it'll trigger automatically when you ask Claude Code to research multiple systems.
