Fine-Tuning Failed. Tools Won.

Sema is a Lisp I've been building in Rust. It has LLM primitives, a bytecode VM, pattern matching, async channels — about 58,000 lines of code across 16 crates. No model knows Sema's syntax. Every LLM I've tried writes Clojure-style (let [x 1] ...) instead of Sema's (let ((x 1)) ...), and hallucinates functions that don't exist.

I wanted to fix this. The plan: fine-tune a small model on Sema code so it knows the dialect. Then give it tools so it can verify its own code. Then benchmark the whole thing and see what actually works.

I didn't do any of this myself. I ran it through opencode powered by GLM 5.2 on Fireworks — the same model I'd later benchmark. The agent wrote the extraction scripts, built the grader, launched the RFT job, wrote the benchmark harness, ran all the comparisons, and produced the analysis. I gave it direction and made decisions. It did the work.

Total cost of the experiment: $29. The fine-tuning itself was free. The results were not what I expected.

The First Attempt: System Prompts

Before fine-tuning, the baseline approach: stuff a comprehensive Sema reference into the system prompt. About 1,500 tokens covering syntax, special forms, stdlib naming conventions, common patterns, and three example programs. Send it with every request to GLM 5.2 on Fireworks serverless.

This works reasonably well for simple tasks. GLM 5.2 is a 743B-parameter Mixture-of-Experts model with the reasoning capacity to read a system prompt and apply it. It can write factorial functions, Fibonacci with named-let TCO, and simple pattern matching. It scores 49% overall on the benchmark without any tools or fine-tuning.

But it consistently gets Sema-specific syntax wrong. It writes (let [x 1] ...) (Clojure vector syntax) instead of (let ((x 1)) ...) (Sema list syntax). It uses (map #(* % %) ...) correctly but then tries (string/substring ...) which doesn't exist in Sema. It knows of the language from the system prompt, but it doesn't feel it in its weights.

The system prompt approach also has a latency problem: every request sends 1,500 tokens of Sema documentation before the actual question. With prompt caching this is cheap, but it's still overhead. And the model can't verify its code — it generates a program, returns it, and hopes it's right.

The Fine-Tuning Plan

Fireworks.ai offers Reinforcement Fine-Tuning (RFT) for free on models under 16B parameters. The pitch is simple: give them a dataset of problems and a grader function, and they train the model with RL — the model generates answers, the grader scores them, and the model learns to maximize the score.

The agent proposed the approach, I approved it, and it built the pipeline.

Training Data Extraction

Sema has 1,232 test cases embedded in its Rust test suite as eval_tests! macros. Each test is a Sema expression paired with its expected value:

eval_tests! {
    big_int_add: "(let ((a 9000000000000)) (+ a a))" => Value::int(18000000000000),
    list_map: "(map (fn (x) (* x 2)) '(1 2 3))" => common::eval("'(2 4 6)"),
    get_in_nested: "(get-in {:a {:b 2}} [:a :b] 0)" => Value::int(2),
}

The expected values are Rust expressions constructing Value objects, not Sema printed forms. The agent wrote a parser that extracts the (test_name, input_string) pairs from the macro, then runs each input through sema eval --json to get the actual printed output:

Sema codebase (Rust test files)
        │
        ▼
  Parse eval_tests! macro
  ── extract (name, input) pairs
        │
        ▼
  Run each input through sema eval --json
  ── "(+ 1 2)" → {"ok": true, "value": "3"}
  ── "(map (fn (x) (* x 2)) '(1 2 3))" → {"ok": true, "value": "(2 4 6)"}
        │
        ▼
  Write JSONL training data
  ── {"prompt": "Evaluate: (+ 1 2)", "expected": "3"}
  ── {"prompt": "Evaluate: (map ...)", "expected": "(2 4 6)"}
        │
        ▼
  Split 90/10 → 1,037 train / 116 holdout

All 1,232 cases were extracted and verified by running them through the VM. After filtering out I/O and network tests and splitting 90/10, that left 1,037 training problems and 116 held-out test cases.
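
The extraction step can be sketched in Python roughly as follows. This is a minimal sketch, not the agent's actual script. The regex assumes each case has the `name: "input" => expected` shape shown in the macro above, and `run_sema` assumes the expression can be passed as a command-line argument to `sema eval --json`, which this post does not confirm. All function names are hypothetical.

```python
import json
import random
import re
import subprocess

# Matches `name: "input" =>` inside an eval_tests! block. The expected-value
# side (a Rust expression) is ignored because the VM supplies the real output.
CASE_RE = re.compile(r'(\w+)\s*:\s*"((?:[^"\\]|\\.)*)"\s*=>')


def extract_cases(rust_source):
    """Return (test_name, sema_input) pairs found in Rust test source."""
    pairs = []
    for name, raw in CASE_RE.findall(rust_source):
        # Undo the two Rust string escapes this sketch handles: \" and \\.
        pairs.append((name, re.sub(r'\\(["\\])', r"\1", raw)))
    return pairs


def run_sema(expr):
    """Evaluate one expression. ASSUMPTION: `sema eval --json <expr>` accepts
    the expression as an argument and prints {"ok": ..., "value": ...}."""
    proc = subprocess.run(["sema", "eval", "--json", expr],
                          capture_output=True, text=True, timeout=10)
    return json.loads(proc.stdout)


def build_records(pairs, evaluate=run_sema):
    """Keep only cases the VM evaluates successfully, as prompt/expected rows."""
    records = []
    for _name, expr in pairs:
        result = evaluate(expr)
        if result.get("ok"):
            records.append({
                "prompt": "Evaluate this Sema expression:\n" + expr,
                "expected": result["value"],
            })
    return records


def split_train_holdout(records, holdout_fraction=0.1, seed=0):
    """Shuffle deterministically and split into (train, holdout)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

Passing the evaluator in as a parameter keeps the parsing and splitting logic testable without the `sema` binary installed. The I/O and network filter is left out because the post does not say how those tests were identified.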

The training data looked like:

{"prompt": "Evaluate this Sema expression:\n(string/to-float \"3.14\")", "expected": "3.14"}
{"prompt": "Evaluate this Sema expression:\n(flatten '(1 (2 3) (4 5)))", "expected": "(1 2 3 4 5)"}

Why This Was The Wrong Approach

Looking at this now, the problem is obvious. The training data teaches the model to evaluate expressions — given code, return the value. That's a recognition task. Code generation is a production task — given a description, write the code. These are different skills.

It's like training someone to grade math tests and hoping they learn to solve the problems. They learn to recognize correct answers. They don't learn to produce them.

But at the time, the reasoning was: if the model knows what Sema expressions evaluate to, it must understand Sema syntax. And if it understands the syntax, it can generate it. That turned out to be wrong — understanding and producing are different capabilities, and RL training on one doesn't transfer to the other.

The Grader

The agent wrote a Python grader that Fireworks' RFT could call as an HTTP endpoint. The grader receives the model's completion and the expected output, then assigns one of three scores:

1.0 — exact match (the completion equals the expected value)
0.3 — partial (the code ran but produced the wrong output)
0.0 — error (parse failure, runtime error, or no code found)

The grader uses sema eval --json to execute the model's output through the actual Sema VM. No mock, no approximation — the real bytecode compiler and VM.
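
A minimal sketch of that scoring rule, assuming a hypothetical `evaluate` callable that wraps `sema eval --json` and returns its `{"ok": ..., "value": ...}` dictionary. The post does not say how the real grader orders its checks, so the order here (direct textual match first, then execution) is one plausible reading rather than the actual implementation. The HTTP wrapper is omitted.

```python
def grade(completion, expected, evaluate):
    """Score one rollout with the three-level scheme described above.

    `evaluate` is any callable that runs Sema code and returns a dict shaped
    like the `sema eval --json` output: {"ok": bool, "value": str}.
    """
    answer = completion.strip()
    if not answer:
        return 0.0  # no code found
    if answer == expected.strip():
        return 1.0  # completion already equals the expected printed value
    result = evaluate(answer)
    if not result.get("ok"):
        return 0.0  # parse failure or runtime error
    if result.get("value", "").strip() == expected.strip():
        return 1.0  # ran and produced the expected value
    return 0.3  # ran, but produced a different value
```

The 0.3 tier rewards completions that at least parse and execute, which gives the RL signal something between "perfect" and "broken".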

The Training Run

Qwen3-8B: 8.2B parameters, fine-tunable on Fireworks, free for RFT. LoRA rank 16, 2 epochs, 692 rollouts per epoch. The job ran for about 40 minutes.

The average reward went from 76.9% to 87.9% over 12 checkpoints. The median score was 100% — 604 out of 692 rollouts were perfect, 80 failed, 8 partial. By every metric Fireworks reported, the training was a success.

The model learned to evaluate Sema expressions. When asked "What does (+ 1 2) evaluate to?", it said 3. When asked about (filter odd? '(1 2 3 4 5)), it said (1 3 5). The RFT training worked.

Then I tried to use it for actual programming.

The Benchmark

The agent wrote 60 Sema coding tasks across 5 difficulty levels:

Level	Description	Example
L1	Single expression evaluation	"What does `(map (fn (x) (* x 2)) '(1 2 3))` evaluate to?"
L2	Write a small function	"Write a function that checks if a string is a palindrome"
L3	Multi-feature program	"Define a multimethod that dispatches on shape type"
L4	Full program	"Write an arithmetic expression tree evaluator"
L5	Advanced	"Write a `defmacro` that implements list comprehensions"

Grading is done by running the generated code through the Sema VM. If the model says the answer is 3 but the code produces 4, it's wrong. If the code doesn't parse, it's wrong. No subjective grading — the VM is the oracle.

Four configurations:

GLM 5.2 bare — 743B frontier model, system prompt only
Qwen3-8B RFT — the fine-tuned model, system prompt only
GLM 5.2 + tools — frontier model with eval_code and docs_search tools
Qwen3-8B RFT + tools — fine-tuned model with the same tools

The Tool-Augmented Loop

The tools are the key part. Sema already has an MCP server (sema mcp) that exposes an eval tool — send it Sema code and it runs it through the VM and returns the result. The agent added a docs_search tool backed by a vector store of 804 Sema documentation entries (embedded via Jina, cosine similarity search, reranked via Jina cross-encoder).

User: "Write a function that reverses a list"
        │
        ▼
  Model generates: (define (rev lst) (foldl ...))
        │
        ▼
  Model calls eval_code("(define (rev lst) ...)")
        │
        ▼
  Sema VM executes → returns "ok" or error
        │
  ┌─────┴─────┐
  │           │
  ▼           ▼
 ok         error
  │           │
  │           ▼
  │     Model reads error,
  │     fixes code, retries
  │           │
  │           ▼
  │     eval_code("(define (rev lst) ...)")
  │           │
  ▼           ▼
  Model returns    Model returns
  final answer     final answer

The model can call eval_code to test its code before answering, and docs_search to look up functions it doesn't know. The self-correction loop runs up to 3 rounds before forcing a final answer.
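
The loop in the diagram can be sketched as a plain function. This is a simplified sketch: in the actual harness the model decides when to call tools, whereas here every round is forced through `eval_code`. Both `generate` (the model call) and `eval_code` (the VM call) are hypothetical callables standing in for the real API clients.

```python
def solve_with_tools(task, generate, eval_code, max_rounds=3):
    """Generate code, test it, and retry on errors for up to `max_rounds`.

    generate(task, feedback) -> str   hypothetical model call; `feedback` is
                                      None on the first round, else the last
                                      error message.
    eval_code(code) -> dict           {"ok": bool, "value" or "error": str}
    Returns (final_code, rounds_used, succeeded).
    """
    feedback = None
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = generate(task, feedback)
        result = eval_code(code)
        if result.get("ok"):
            return code, round_no, True
        # Treat the error as diagnostic: feed it back into the next attempt.
        feedback = result.get("error", "unknown error")
    # Round budget exhausted: return the last attempt as the forced answer.
    return code, max_rounds, False
```

The important design choice is that an error never ends the loop early. It only becomes input to the next attempt.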

The Results

Level	GLM 5.2 bare	Qwen3-8B RFT	GLM 5.2 + tools	Qwen3-8B RFT + tools
L1: Trivial	58%	58%	72%	81%
L2: Simple	61%	37%	75%	9%
L3: Medium	65%	23%	78%	4%
L4: Complex	13%	26%	56%	0%
L5: Advanced	12%	6%	66%	6%
Overall	49%	36%	71%	24%

The fine-tuned model with tools scored 24% — the worst of all four approaches. Worse than the bare model. Worse than the fine-tuned model without tools. The RFT training didn't just fail to help — it made things worse.

What Went Wrong

The RFT training taught the model to evaluate Sema expressions. It did not teach it to write Sema code. And in teaching it to evaluate, it damaged the general coding ability that Qwen3-8B started with.

The failure modes were specific and consistent:

It generates prose instead of code. When asked to "Write a Sema function that checks if a number is prime," the model would submit English text to the eval_code tool — literally the sentence "The function checks if a number is prime by..." — and get Unbound variable: The back. Then it would give up.

It uses backticks in code. The RFT training data was formatted with markdown backticks around code blocks. The model learned to include backticks in its code output. Sema's reader interprets ` as quasiquote, so every program that starts with a backtick fails to parse with Reader error: quasiquote requires an expression.

It can't iterate. GLM 5.2 with tools calls eval_code, sees an error, reads the error message, fixes the code, and tries again. The fine-tuned Qwen3-8B calls eval_code once, gets an error, and stops. The self-correction loop that makes tool-augmented inference work — the entire reason tools help — doesn't function. The model treats the error as terminal instead of diagnostic.

It hallucinates functions that don't exist: string/equals?, add1, map/merge, map/empty, -inf.0. These sound like Sema functions. They are not Sema functions. The RFT training reinforced the model's tendency to invent plausible-sounding functions rather than look them up.
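
The backtick failure in particular is one a harness could guard against before code ever reaches the reader. The sketch below is not something this experiment did. It is a hypothetical pre-processing step that unwraps a markdown code fence, or a single pair of inline backticks, and otherwise leaves the text alone so that genuine quasiquote forms survive.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# An opening fence with an optional language tag, the body, a closing fence.
FENCED_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"[\w-]*[ \t]*\n(.*?)" + re.escape(FENCE), re.DOTALL)


def strip_markdown(completion):
    """Return bare code from a completion that may be wrapped in markdown."""
    text = completion.strip()
    block = FENCED_BLOCK_RE.search(text)
    if block:
        return block.group(1).strip()
    # Inline code: exactly one pair of backticks around the whole answer.
    inner = text[1:-1]
    if len(text) >= 2 and text[0] == "`" and text[-1] == "`" and "`" not in inner:
        return inner.strip()
    return text
```

A program that starts with a quasiquote but does not end with a backtick passes through unchanged, so the guard does not break valid Sema.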

The one thing the fine-tuned model did better than anything else: L1 eval-match tasks. At 81% with tools, it was the best of any configuration. When the task is "evaluate this expression and return the value," the RFT training shines — the model knows the syntax, knows the output format, and the eval_code tool verifies the answer. But the moment the task shifts from evaluation to generation, the training becomes a liability.

What Actually Works

GLM 5.2 with tools scored 71% — 22 percentage points better than the bare model, 47 points better than the fine-tuned model with tools. The tools did what fine-tuning couldn't.

The eval_code tool is the game-changer. The model generates code, calls eval_code to test it, reads the error if any, fixes the code, and tries again. L4 (complex programs) went from 13% to 56%. L5 (macros, async, lazy streams) went from 12% to 66%. These are the tasks where the model needs to iterate — where the first attempt is wrong and the model needs to see why it's wrong and try again.

The docs_search tool was used less — about 0.2 calls per task — but when the model did use it, it was looking up the right things: "how to reverse a list," "how to read file lines," "how to create a channel." The RAG pipeline (embed docs → cosine search → cross-encoder rerank → return top 5) gave the model the actual function names and signatures, which eliminated the hallucination problem.
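
The two-stage retrieval described above can be sketched without any external service. In this sketch `embed` and `rerank` are hypothetical callables standing in for the Jina embedding and cross-encoder calls, whose real signatures are not shown in this post, and the candidate count of 20 is an assumption. Only the top-5 cutoff comes from the experiment.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def docs_search(query, store, embed, rerank=None, candidates=20, top_k=5):
    """Two-stage retrieval over `store`, a list of (text, vector) pairs.

    embed(query) -> vector                  hypothetical embedding call
    rerank(query, texts) -> list of scores  hypothetical cross-encoder call
    """
    qvec = embed(query)
    # Stage 1: cheap vector search narrows the corpus to a candidate set.
    ranked = sorted(store, key=lambda item: cosine(qvec, item[1]), reverse=True)
    texts = [text for text, _vec in ranked[:candidates]]
    # Stage 2: the optional reranker reorders candidates by relevance.
    if rerank is not None and texts:
        scores = rerank(query, texts)
        order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
        texts = [texts[i] for i in order]
    return texts[:top_k]
```

Cosine search is fast but coarse. The reranker reads the query and each candidate together, so it is applied only to the short candidate list rather than to all 804 entries.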

Average tool usage: 1.7 calls per task. Most tasks need one eval_code call to verify. The harder tasks need 2-4 calls as the model iterates. The cost: about $0.02-0.05 per query on GLM 5.2 serverless.

The Six-Model Comparison

After seeing that tools beat fine-tuning, the agent ran the same 60-task benchmark with tools against six serverless models on Fireworks:

Model	Overall	L2	L4	L5	$/M tokens (input/output)
Kimi K2.6	60%	75%	59%	66%	$0.95/$4
DeepSeek V4 Pro	58%	81%	49%	18%	$1.74/$3.48
Kimi K2.7 Code	58%	75%	59%	60%	$0.95/$4
Qwen 3.7 Plus	56%	73%	69%	40%	$0.40/$1.60
GLM 5.2	54%	66%	59%	26%	$1.40/$4.40
DeepSeek V4 Flash	53%	61%	59%	6%	$0.14/$0.28

Kimi K2.6 is the overall winner, and it's not the most expensive model. DeepSeek V4 Pro writes the best functions (81% on L2) but collapses on advanced tasks (18% on L5). Qwen 3.7 Plus is the best value: 56% at roughly a third of GLM 5.2's price. DeepSeek V4 Flash at $0.14/$0.28 per million tokens is almost free and still scores 53%.

All models score around 30% on L1. This is a grader bug, not a model problem — the models get the right answer via eval_code but format their response with explanations like "The result is 3" and the answer extractor picks up the wrong text. With a better grader, every model would likely score 10-15 points higher.
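
One way a better extractor might work, sketched under assumptions: instead of parsing the model's prose, grade the value the VM returned on the last successful eval_code call, and fall back to the final line of text only when no call succeeded. The transcript shape (`tool_results` as a list of result dictionaries) and the list of prefixes are hypothetical. This is not the fix used in the experiment.

```python
def extract_answer(final_text, tool_results):
    """Pick the value to grade for an evaluation-style task.

    tool_results: list of dicts from eval_code calls, oldest first, each
    shaped like {"ok": bool, "value": str}.
    """
    # Prefer what the VM actually returned over how the model phrased it.
    for result in reversed(tool_results):
        if result.get("ok"):
            return result["value"].strip()
    # No successful tool call: fall back to the last non-empty line, minus a
    # leading "The result is"-style phrase if one is present.
    lines = [line.strip() for line in final_text.splitlines() if line.strip()]
    if not lines:
        return ""
    last = lines[-1]
    for prefix in ("The result is", "Result:", "Answer:"):
        if last.startswith(prefix):
            return last[len(prefix):].strip()
    return last
```

This only makes sense for L1-style tasks, where the last evaluated expression is the answer. For generation tasks the final code still has to be run against the task's own checks.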

The Infrastructure That Already Exists

The most practical finding: Sema already has the infrastructure that won.

sema mcp starts a JSON-RPC server over stdio that exposes eval, docs, fmt, run_file, and notebook tools. Any MCP-compatible LLM client — Claude Desktop, Cursor, Claude Code, anything that speaks the Model Context Protocol — can spawn sema mcp as a subprocess and get a Sema-aware coding assistant with self-correcting code evaluation.
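
For a sense of what travels over that stdio pipe, here is a sketch of the JSON-RPC message an MCP client sends to invoke a tool. The `tools/call` method and the `name`/`arguments` parameters come from the Model Context Protocol. The `"code"` argument key is an assumption about the eval tool's schema, and the `initialize` handshake a real client must perform first is omitted.

```python
import json


def tools_call_message(request_id, tool_name, arguments):
    """Serialize an MCP `tools/call` request as one newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # The stdio transport is newline-delimited, so the JSON must stay on one
    # line; json.dumps escapes newlines inside string values.
    return json.dumps(message) + "\n"


# ASSUMPTION: the eval tool takes its source under a "code" key.
line = tools_call_message(1, "eval", {"code": "(map (fn (x) (* x 2)) '(1 2 3))"})
```

In practice an MCP client library handles this framing. The sketch is only meant to show how little protocol sits between a model and the VM.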

The RAG primitives exist too: llm/embed for embeddings, vector-store/* for cosine similarity search with disk persistence, llm/rerank for cross-encoder reranking. The docs_search tool the agent built for this experiment is a 30-line deftool in a .sema file. Adding it to the MCP server is sema mcp docs-search.sema.

The 804-document vector store took about 10 minutes to build and caches to disk. The embeddings cost about $0.50 via Jina. This is not expensive infrastructure.

What I'd Do Differently

Fine-tuning isn't dead — it makes sense, but not the way I did it. Three mistakes:

Wrong training data format. I trained on eval-match pairs ("What does (+ 1 2) evaluate to?" → "3"). This teaches evaluation, not generation. The right format is code-generation pairs: "Write a function that reverses a list" → (define (my-reverse lst) (foldl (fn (acc x) (cons x acc)) '() lst)). I have 224 example files and 1,156 API doc entries that could be converted to this format.

Wrong base model. 8B parameters isn't enough to absorb domain knowledge without losing general coding ability. Qwen3-32B or GLM 5.1 would have enough capacity to learn Sema syntax while retaining their ability to write programs. Training cost on a 32B model: ~$15-30 for LoRA SFT on Fireworks.

RFT instead of SFT. Reinforcement learning teaches the model to maximize a reward signal. When the reward is "does the output match the expected value," the model learns to produce values, not code. Supervised fine-tuning on (prompt, code) pairs would teach it to generate code directly.

There's one more idea the agent came up with that I want to try: Sema has a grammar fuzzer that generates valid, varied programs with known expected outputs. Every program comes with a ; seed=N => EXPECTED_VALUE annotation. The fuzzer produces programs covering let, lambda, try/catch, async, match, foldl, channels, closures — guaranteed correct, deterministic from a seed.

The approach: generate 1,000 fuzzer programs, ask GLM 5.2 to write a one-sentence natural-language description of each, and get 1,000 (description → code) pairs where the code is verified correct by the fuzzer's differential oracle. That's synthetic training data with a correctness guarantee — the hardest part of building SFT datasets for code.
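
A sketch of how that pairing step might look. It assumes the annotation appears as a comment line of the form `; seed=N => EXPECTED_VALUE` in the fuzzer's output (its exact position is not stated here), and `describe` is a hypothetical callable wrapping the LLM request for a one-sentence description. The output field names are illustrative, not a required SFT schema.

```python
import re

# ASSUMPTION: the annotation is a comment line of the form
# `; seed=N => EXPECTED_VALUE` somewhere in the fuzzer's output.
ANNOTATION_RE = re.compile(r"^\s*;\s*seed=(\d+)\s*=>\s*(.*?)\s*$", re.MULTILINE)


def parse_fuzzer_program(text):
    """Split fuzzer output into seed, expected value, and the bare program."""
    match = ANNOTATION_RE.search(text)
    if match is None:
        raise ValueError("no '; seed=N => VALUE' annotation found")
    code = (text[:match.start()] + text[match.end():]).strip()
    return {"seed": int(match.group(1)), "expected": match.group(2), "code": code}


def to_sft_pair(program_text, describe):
    """Build one description -> code training pair.

    describe(code) -> str is a hypothetical LLM call that returns a
    one-sentence natural-language description of the program.
    """
    parsed = parse_fuzzer_program(program_text)
    return {
        "prompt": describe(parsed["code"]),
        "completion": parsed["code"],
        "seed": parsed["seed"],
        "expected": parsed["expected"],
    }
```

Keeping the seed and expected value alongside each pair means any training example can be regenerated and re-verified later. The description is the only part that is not mechanically checked.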

The right experiment, if I were to try again: SFT on Qwen3-32B with a mix of 30% fuzzer-generated eval pairs, 30% synthetic description→code pairs, 25% human example files, 15% API doc examples. Combined with eval_code + docs_search tools at inference. Target: 75-80% overall. Cost: ~$39.
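
Turning those percentages into per-source example counts is simple arithmetic, sketched below. The dictionary keys are illustrative labels for the four sources, and the dataset size is left as a parameter because no total is fixed here.

```python
# Proportions from the proposed mix, as whole percentages.
MIX_PERCENT = {
    "fuzzer_eval_pairs": 30,
    "synthetic_description_to_code": 30,
    "human_example_files": 25,
    "api_doc_examples": 15,
}


def mix_counts(total, mix=MIX_PERCENT):
    """Allocate `total` examples across sources using largest remainders."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {name: total * pct // 100 for name, pct in mix.items()}
    leftover = total - sum(counts.values())
    # Hand out examples lost to truncation, largest fractional part first.
    by_remainder = sorted(mix, key=lambda n: total * mix[n] % 100, reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

For a hypothetical 2,000-example dataset this gives 600, 600, 500, and 300 examples respectively.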

The Verdict

The entire experiment — research, code, training, benchmarking, analysis — was done by an AI agent running on GLM 5.2 via opencode. I directed the work and made decisions. The agent wrote the extraction scripts, built the grader, launched the RFT job, wrote the benchmark harness, ran all six model comparisons, and produced the results. The same model that scored 54% on Sema coding tasks with tools built the infrastructure to test itself.

Tool-augmented inference works because it doesn't require the model to know anything permanently. The model doesn't need Sema syntax baked into its weights when it can call eval_code to test its code and docs_search to look up functions. The knowledge stays external, updatable, and verifiable. The model just needs to be smart enough to read an error message and fix its code — which 743B-parameter models are, and 8B-parameter models, after RFT training on eval-match pairs, are not.

The experiment cost $29. The fine-tuning was free. The answer was worth more than the cost.

Sema is at github.com/helgesverre/sema-lisp. The full experiment code, benchmark tasks, and results are on the experiment/rft-qwen3-sema branch. The MCP server is sema mcp — try it with any MCP-compatible client.