I built an Emacs major mode, added a --sandbox security flag, fixed a memory leak, and corrected documentation
that had been confidently wrong since day one — all because of feedback from people who don't exist. 305 of them,
spread across two simulated subreddits, tearing apart a Lisp interpreter I'd been building with AI agents.
The exercise worked well enough that I turned it into a reusable CLI tool called reddit-scrutinizer.
The Technique: Synthetic Peer Review
Most solo developers and small teams don't have a security researcher, a domain expert, and a hostile user all reviewing their code before launch. Synthetic peer review is a way to approximate that: use an LLM to generate realistic reviewer feedback from multiple personas, then treat each critique as a hypothesis and verify it against the codebase.
The workflow:
- Generate critiques from distinct personas — a security researcher, a domain expert, a skeptic, an enthusiast, a troll. Each approaches the project from a different angle with different incentives.
- Extract claims — turn each criticism into a checkable statement. "Your stdlib naming is inconsistent" becomes "audit naming conventions across all modules."
- Verify — reproduce or disprove each claim. Run tests, check docs, measure actual values, fuzz inputs.
- Fix what's real, discard what isn't, note what's interesting for later.
Half the output will be wrong — confidently wrong, in the way internet commenters are confidently wrong. That's fine. The workflow includes verification. The value is in the half you wouldn't have thought to check.
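It helps to treat each critique as a record with an explicit verdict rather than a loose to-do. A minimal sketch of that bookkeeping in Rust, with field names of my own invention rather than any tool's output format:

```rust
// One record per extracted criticism. Names here are illustrative.
enum Verdict {
    RealIssue,
    NotAnIssue { evidence: String }, // disproved, with how you checked
    WorthDiscussing,                 // not a bug, but worth a think
}

struct Claim {
    source_comment: String,   // which persona raised it
    statement: String,        // the checkable form of the criticism
    verdict: Option<Verdict>, // None until you've actually verified it
}
```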
Reddit threads turned out to be a particularly good format for this. Subreddit cultures have distinct personalities — r/rust is constructive but thorough, r/lisp cares about language semantics, r/programming is cynical about everything. Simulating a specific community gives the critiques coherent perspective instead of generic "here are some issues" output. It also makes the results more fun to read, which matters when you're asking yourself to audit 300 comments.
Here's how I tested this on a real project.
The Experiment
I'd been building Sema — a Lisp with first-class LLM primitives, implemented in Rust — and was drafting Reddit posts for r/lisp and r/programming. Both communities are sharp, opinionated, and good at spotting hand-waving. I wanted to know what they'd focus on before finding out in public.
I had Claude role-play as an entire Reddit community. Not a single "pretend you're a critic" prompt — a full simulation with distinct personas, voting patterns, nested reply chains, and the specific culture of each subreddit.
The setup:
- Two subreddits: r/lisp (language design focused) and r/programming (benchmark focused)
- Two draft posts: one pitching Sema's LLM primitives to the Lisp crowd, one leading with benchmark numbers for the general programming audience
- Persona archetypes: domain experts (lispm — an SBCL maintainer asking about referential transparency), skeptics (skeptical_schemist — questioning why not just use a Python SDK), trolls (mass_downvoter_9000 — "imagine using Lisp in 2026"), concerned users (genuinely_concerned_user — pointing out security issues), and enthusiasts (grug_brain_dev — appreciating the small codebase)
The result was a 305-comment thread rendered as a dark-mode Reddit-lookalike HTML page, complete with votes, flairs, awards, and nested replies. It looked real enough that I had to remind myself I'd generated all of it.
Then came the useful part: auditing every criticism against the actual codebase.
What Was Actually True
The value isn't that the AI is smarter than you. It's that each persona approaches the project from an angle you haven't considered. A simulated Emacs user thinks about editor integration. A simulated security researcher thinks about sandboxing. A simulated language implementer thinks about memory semantics.
Real Bugs and Gaps
Memory leaks. A simulated comment pointed out that recursive define calls would create Rc reference cycles —
lambda captures environment, environment contains lambda. This was correct. Long-running sessions would leak memory
because there was no cycle collector.
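Here is the shape of the problem in miniature. The types below are illustrative rather than Sema's actual ones, but they show why plain Rc can never free a lambda that lives inside the environment it captures:

```rust
use std::cell::RefCell;
use std::rc::Rc;

struct Env {
    bindings: Vec<(String, Value)>,
}

enum Value {
    Lambda { captured_env: Rc<RefCell<Env>> },
}

fn main() {
    let env = Rc::new(RefCell::new(Env { bindings: vec![] }));

    // (define f (lambda () ...)): the lambda captures `env`,
    // and `env` stores the lambda. That's a reference cycle.
    let f = Value::Lambda { captured_env: Rc::clone(&env) };
    env.borrow_mut().bindings.push(("f".into(), f));

    // Two strong references: the local `env` and the one inside the
    // lambda. When `env` goes out of scope the count only drops to 1,
    // so neither allocation is ever freed.
    println!("strong count: {}", Rc::strong_count(&env));
}
```

Breaking the cycle means a weak back-reference or a periodic cycle collector; reference counting alone never gets there.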
No sandbox mode. genuinely_concerned_user raised the concern that anyone running an untrusted .sema script
was giving it full access to the shell, the filesystem, and environment variables (including API keys). There was no
--sandbox flag. This was a real security gap.
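The fix is conceptually simple: every effectful builtin checks a capability flag before doing anything. A hedged sketch of that gating, with all names (Interpreter, builtin_shell, EvalError) invented for illustration rather than taken from Sema's source:

```rust
struct Interpreter {
    sandboxed: bool, // set by a --sandbox style flag at startup
}

#[derive(Debug)]
enum EvalError {
    SandboxViolation(&'static str),
    Io(std::io::Error),
}

impl Interpreter {
    fn check_capability(&self, what: &'static str) -> Result<(), EvalError> {
        if self.sandboxed {
            Err(EvalError::SandboxViolation(what))
        } else {
            Ok(())
        }
    }

    // Every builtin that touches the outside world asks first;
    // in sandbox mode the answer is always no.
    fn builtin_shell(&self, cmd: &str) -> Result<String, EvalError> {
        self.check_capability("shell")?;
        let out = std::process::Command::new("sh")
            .arg("-c")
            .arg(cmd)
            .output()
            .map_err(EvalError::Io)?;
        Ok(String::from_utf8_lossy(&out.stdout).into_owned())
    }
}
```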
Wrong documentation. The internals documentation claimed the Value enum was "a discriminant byte + up to 8
payload bytes." I ran std::mem::size_of::<Value>() — it was 16 bytes on aarch64. The docs were wrong, and the
kind of wrong that r/rust would have caught immediately.
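The correction is cheap to make permanent: pin the measured size in a test so the docs and the code can't silently drift apart again. A sketch, assuming the Value enum lives in the parent module:

```rust
#[cfg(test)]
mod value_size_tests {
    use super::Value;

    // 16 bytes is the value measured on a 64-bit target (aarch64);
    // this fails loudly if a future change grows the enum.
    #[test]
    fn value_fits_in_two_machine_words() {
        assert_eq!(std::mem::size_of::<Value>(), 16);
    }
}
```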
Naming inconsistencies. The stdlib used four different conventions simultaneously: string/trim (module/function),
string-append (kebab-case), substring (concatenated), and string->number (arrow notation). A simulated comment
called this out as "a stdlib designed by committee where the committee never met." Fair.
No schema validation in llm/extract. The structured extraction function had no way to validate that the LLM's
response actually matched the requested schema. A simulated commenter pointed out that garbage data could silently
pass through. I added a :validate option and retry logic.
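The shape of that fix is a small loop: call the model, validate the response against the schema, retry on mismatch. A sketch of the pattern in Rust, not Sema's actual implementation:

```rust
// Generic validate-then-retry loop for structured extraction.
fn extract_with_retry<T, E>(
    mut call_llm: impl FnMut() -> Result<String, E>,
    validate: impl Fn(&str) -> Result<T, E>,
    max_attempts: usize,
) -> Result<T, E> {
    let mut last_err = None;
    for _ in 0..max_attempts {
        let raw = call_llm()?;
        match validate(&raw) {
            Ok(parsed) => return Ok(parsed), // response matched the schema
            Err(e) => last_err = Some(e),    // garbage: ask again
        }
    }
    Err(last_err.expect("max_attempts must be at least 1"))
}
```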
Criticisms That Were Wrong
Not everything landed. Some simulated critics were confidently wrong, in the way real Reddit commenters often are:
"Rust internal names leak into stack traces" — The CallFrame struct correctly used Lisp function names, not
Rust symbol names. The simulation assumed a common mistake that I hadn't actually made.
"Your (load) function doesn't resolve relative paths" — It did. It used the calling file's directory as the
base, which is the correct behavior.
"The reader probably panics on malformed input" — Fuzz tests confirmed it returned Result errors safely. No
panics.
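For reference, the fuzzing behind a claim like this is only a few lines with cargo-fuzz. The target below is a sketch; the real entry point in Sema's reader may be named differently:

```rust
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    if let Ok(src) = std::str::from_utf8(data) {
        // A panic here fails the fuzz run; returning Err is fine.
        let _ = sema::reader::read_str(src);
    }
});
```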
"Your llm/batch is probably sequential under the hood" — It used join_all for concurrent requests. The
simulated skeptic assumed the lazy implementation; I'd done the right thing.
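For contrast, the concurrent shape the skeptic didn't expect looks roughly like this. ask_llm and the endpoint URL are stand-ins, not Sema's real client code:

```rust
use futures::future::join_all;

// Build every request future up front, then await them together so the
// HTTP calls run concurrently instead of one after another.
async fn llm_batch(prompts: &[String]) -> Vec<Result<String, reqwest::Error>> {
    let client = reqwest::Client::new();
    let requests = prompts.iter().map(|p| ask_llm(&client, p));
    join_all(requests).await
}

async fn ask_llm(client: &reqwest::Client, prompt: &str) -> Result<String, reqwest::Error> {
    client
        .post("https://api.example.com/v1/complete")
        .body(prompt.to_owned())
        .send()
        .await?
        .text()
        .await
}
```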
The distribution was roughly 50/50 — half the criticisms were valid issues I needed to fix, half were assumptions that didn't hold. This is close enough to real Reddit that it felt useful.
Feature Suggestions From Nobody
Some simulated comments didn't point out bugs — they suggested features. And the suggestions were good enough that I built them.
emacs_wizard_42 wrote:
Have you considered writing an Emacs major mode for .sema files? The playground's syntax highlighting looks good — porting that to Emacs would take maybe a day and would get you instant adoption from the Lisp community. We all live in Emacs.
This is the kind of comment that's easy to dismiss as noise. But it's right. The Lisp community does live in
Emacs. So I built the mode — sema-mode.el with syntax highlighting, indentation, and REPL integration via
comint. Then I built modes for Vim, Helix, and VS Code too. A fake persona, born of a simulated subreddit culture, drove a real expansion of the project's ecosystem.
The trick, as I described it at the time: "I tricked you into predicting failure modes by pretending to be other people that would look at this differently, and now we are gonna preemptively fix all that."
Turning It Into a Tool
The experiment worked well enough that I wanted to run it on other projects without spending an hour setting up personas and prompts each time. So I packaged the workflow into reddit-scrutinizer — a CLI tool that automates the entire pipeline.
It scans your project (file tree, README, config files), generates a realistic Reddit submission for the target subreddit, identifies the critique angles the community would focus on, then builds a threaded comment tree with votes, flairs, awards, and OP replies. Four Claude API calls in sequence, each building on the previous output.
Subreddit Vibe Packs
Each subreddit has a JSON "vibe pack" defining its personality:
- Tone — the baseline attitude (r/rust is constructive but thorough, r/programming is cynical, r/webdev is practical)
- Pet topics — things the community always brings up (r/rust: "have you considered using Arc instead of Rc?", r/lisp: "why not just use Common Lisp?")
- Taboos — things that get you downvoted (r/golang: criticizing error handling, r/haskell: calling monads burritos)
- Archetypes — commenter personas with consistent posting patterns (the senior dev who's seen it all, the enthusiastic beginner, the one-line snark account)
There are 22 built-in subreddits including cpp, golang, haskell, javascript, lisp, programming,
python, rust, typescript, webdev, reactjs, devops, gamedev, localllama, and more.
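Written out as types, the shape of a vibe pack is small. The field names below are my paraphrase of the structure described above, not necessarily the tool's exact JSON keys:

```rust
// Rough shape of a vibe pack; the tool's actual JSON keys may differ.
struct VibePack {
    tone: String,               // baseline attitude, e.g. "constructive but thorough"
    pet_topics: Vec<String>,    // things the community always brings up
    taboos: Vec<String>,        // things that reliably get downvoted
    archetypes: Vec<Archetype>, // recurring commenter personas
}

struct Archetype {
    name: String,            // e.g. "the senior dev who's seen it all"
    posting_pattern: String, // how (and how often) they comment
}
```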
Usage
# Install globally
npm install -g reddit-scrutinizer
# Or run directly without installing
npx reddit-scrutinizer ./my-project --subreddit rust
# Snarky r/programming with 60 comments, auto-open browser
reddit-scrutinizer ./my-project --subreddit programming --comments 60 --style snarky --open
# Reproducible run with a fixed seed
reddit-scrutinizer ./my-project --subreddit typescript --seed 42
# View a previous result
reddit-scrutinizer serve ./reddit-scrutiny.json --open
The output is a JSON file and an optional browser UI — the same dark-mode Reddit-lookalike that the original Sema
experiment used, now served via Bun.serve() and automatically opened in your browser.
I ran it on itself. The top-voted simulated comment called out the irony of using AI to simulate humans criticizing AI-generated code. The second-highest suggested that vibe packs were "just prompt engineering with extra steps." Both fair.
Applying This in Practice
If you want to try this yourself, the fastest workflow is to generate the comments with the CLI tool, then point a coding agent at the output to do the verification.
Here's the two-pass approach:
Pass 1: Generate and audit. Run reddit-scrutinizer on your project, then hand the output to a coding agent and ask it to verify each criticism against your actual codebase.
I ran reddit-scrutinizer on this project. The output is in ./reddit-scrutiny.json.
Read the simulated Reddit comments (in simulation.comments, each has body_md
with the comment text and score for how "important" the community considered it).
For each comment that makes a technical claim or criticism:
1. State the claim in one sentence
2. Check it against the actual codebase — read the relevant files, run tests
if needed, verify measurements
3. Classify as: REAL ISSUE, NOT AN ISSUE (with evidence), or WORTH DISCUSSING
Focus on the highest-scored comments first. Skip pure jokes, meta-commentary,
and style preferences. I want a table of findings when you're done.
Pass 2: Fix what's real. In the same conversation, ask the agent to act on the confirmed issues.
Good. Now fix every issue you classified as REAL ISSUE above.
For documentation claims, verify empirically before correcting — run the
code, measure sizes, check actual behavior. For code issues, add regression
tests where appropriate. Skip anything cosmetic or subjective.
The two-pass approach matters. If you ask an agent to "find and fix all the issues from this Reddit thread" in one shot, it'll treat every criticism as valid and start making changes you didn't ask for. The audit step forces verification before action — which is the same discipline that made the original experiment useful.
You don't need the CLI tool for this. The underlying technique works with any LLM and a well-structured prompt. But the tool handles the persona generation, subreddit voice matching, and comment threading — the parts that are tedious to set up manually and easy to get wrong.
Simulated vs Real
Simulated critics are better than real ones in some ways. They don't get distracted by your post title. They don't pile on because the first comment set a negative tone. They don't skip reading the README. They engage with the actual technical content — because that's all they have.
They're worse in all the ways that matter for long-term product development. They can't tell you what confused them during installation. They can't tell you that your API feels wrong after a week of daily use. They can't tell you that the feature you're most proud of is the one nobody needs.
Use both. Simulate before you ship. Then listen to the real humans after.
reddit-scrutinizer is MIT licensed at
github.com/HelgeSverre/reddit-scrutinizer. Install with
npm install -g reddit-scrutinizer or run directly with npx reddit-scrutinizer ./your-project --subreddit rust.
