Sema is a Scheme-like Lisp where prompts are s-expressions, conversations are immutable data structures, and LLM calls are just another form of evaluation. It's implemented in Rust across 6 crates, has 400+ builtins across 17 modules, and supports 11 LLM providers auto-configured from environment variables. The first commit was February 11th. Version 1.0.1 shipped February 15th.
The language, a documentation site, a WASM-powered browser playground with example programs, and a library of example scripts were all built over 5 days using Amp Code agents.
The Question
What if calling an LLM was as natural as calling a function? Not an HTTP request wrapped in error handling wrapped in JSON parsing — just evaluation. You write an expression, it evaluates, you get a result.
Lisp is the obvious answer. S-expressions already look like structured prompts. Conversations are just lists you can cons onto. Tool definitions map cleanly to function signatures. The data-as-code philosophy means you can manipulate prompts programmatically the same way you manipulate any other data structure.
Sema takes the Scheme core — lexical scoping, proper tail calls via trampolines — and adds
Clojure's ergonomic sugar: keywords (:foo), map literals ({:k v}), vector literals ([1 2 3]). Then it adds LLM
primitives as first-class language constructs.
Five Days
Day 1: Language Foundations (Feb 11)
The first day was about getting from nothing to a working Lisp. Lexer, parser, evaluator, REPL. The crate structure was decided upfront:
- sema-core — value types, environment, error handling
- sema-reader — lexer and parser
- sema-eval — evaluator with trampoline-based TCO
- sema-stdlib — 17 modules of builtins
- sema-llm — provider abstraction, tool execution, conversation values
- sema — CLI binary
By end of day: basic arithmetic, define, lambda, let, if, cond, begin, quote, quasiquote, string
operations, list operations. A Lisp you could actually write programs in.
The evaluator uses a trampoline for tail-call optimization — inspired by Guy Steele's 1978 "Rabbit" paper. Instead of
recursive Rust calls that blow the stack, tail-position expressions return a Bounce::Continue value that the trampoline
loop picks up:
;; This runs in constant stack space
(define (loop n)
  (if (= n 0)
      "done"
      (loop (- n 1))))

(loop 10000000) ;; => "done"
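For readers curious what this looks like on the Rust side, here is a minimal sketch of the trampoline pattern. Only the Bounce::Continue name comes from the article; the simplified types and functions are illustrative, not Sema's actual evaluator.

```rust
// Minimal trampoline sketch. `Bounce` is named in the article; the rest is illustrative.
enum Bounce {
    // A finished result.
    Done(i64),
    // A tail call: re-enter the loop with new arguments instead of recursing.
    Continue { n: i64 },
}

// One evaluation "step": either finish or ask the trampoline to go around again.
fn step(n: i64) -> Bounce {
    if n == 0 {
        Bounce::Done(0)
    } else {
        Bounce::Continue { n: n - 1 }
    }
}

// The trampoline itself is a plain loop, so stack depth stays constant.
fn trampoline(mut n: i64) -> i64 {
    loop {
        match step(n) {
            Bounce::Done(v) => return v,
            Bounce::Continue { n: next } => n = next,
        }
    }
}

fn main() {
    // Same spirit as (loop 10000000) above: no stack growth.
    assert_eq!(trampoline(10_000_000), 0);
    println!("done");
}
```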
Days 2-3: LLM Integration & Stdlib Expansion (Feb 12-13)
This is where Sema becomes more than just another Lisp. The prompt special form lets you write conversations as
s-expressions where role symbols are syntax:
(prompt
  (system "You are a helpful assistant.")
  (user "What is the capital of Norway?"))
;; => "The capital of Norway is Oslo."
prompt builds a conversation value — an immutable list of messages — sends it to the configured LLM provider, and
returns the response text. But conversations are also first-class values you can bind, extend, inspect, and fork:
(define conv
  (conversation
    (system "You are a pirate.")
    (user "Hello!")))

;; Extend without mutating the original
(define conv2 (conversation/append conv (user "Tell me about treasure.")))

;; Fork for parallel exploration
(define polite-conv (conversation/append conv (system "Be extra polite.")))
(define rude-conv (conversation/append conv (system "Be rude.")))
The provider system auto-configures from environment variables. Set OPENAI_API_KEY and you have OpenAI. Set
ANTHROPIC_API_KEY and you have Anthropic. All 11 providers — OpenAI, Anthropic, Google, Groq, Mistral, DeepSeek,
xAI, Cerebras, Perplexity, OpenRouter, Ollama — work the same way:
;; Switch providers with a single binding
(llm/with-provider "anthropic"
  (prompt (user "Hello from Claude!")))

(llm/with-provider "openai"
  (prompt (user "Hello from GPT!")))
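As a rough illustration of what environment-based auto-detection can look like in Rust: the variable names come from the article, while the types and function below are hypothetical, not Sema's actual provider code.

```rust
use std::env;

// Hypothetical provider record; illustrative only.
struct Provider {
    name: &'static str,
    api_key: String,
}

/// Collect every provider whose API key is present in the environment.
fn detect_providers() -> Vec<Provider> {
    let candidates = [
        ("openai", "OPENAI_API_KEY"),
        ("anthropic", "ANTHROPIC_API_KEY"),
        // ...the remaining nine providers follow the same pattern.
    ];

    candidates
        .into_iter()
        .filter_map(|(name, var)| {
            env::var(var).ok().map(|api_key| Provider { name, api_key })
        })
        .collect()
}

fn main() {
    for p in detect_providers() {
        println!("configured provider: {} (key length {})", p.name, p.api_key.len());
    }
}
```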
The stdlib grew rapidly: file I/O, HTTP client, JSON parsing, regex, math, string manipulation, hash maps, sorting, environment variables. Each module was a well-defined, independent task — the kind of thing an agent can pick up with minimal context.
Days 4-5: Tooling, Polish, Ecosystem (Feb 14-15)
The final push was about everything around the language: deftool and defagent, performance optimization, the
documentation site, the browser playground, and example programs.
deftool defines tools that LLMs can call during conversations. The tool execution loop is built into llm/chat — the
LLM sees the tool signatures, decides to call them, Sema executes the tool bodies, feeds results back, and the
conversation continues:
(deftool get-weather (location)
  "Get current weather for a location"
  {:location {:type "string" :description "City name"}}
  (format "Weather in {}: 22°C, sunny" location))

(prompt
  (system "You have access to a weather tool.")
  (user "What's the weather in Bergen?"))
;; LLM calls get-weather with "Bergen", gets result, responds naturally
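A minimal sketch of that execution loop, using stand-in types rather than Sema's real llm/chat internals:

```rust
// Illustrative tool-execution loop; all names here are stand-ins, not Sema's API.
enum LlmReply {
    Text(String),                             // final answer, loop ends
    ToolCall { name: String, args: String },  // model wants a tool result first
}

// Stand-in for the provider request: answers directly once it has seen a tool result.
fn call_llm(transcript: &[String]) -> LlmReply {
    if transcript.iter().any(|m| m.starts_with("tool ")) {
        LlmReply::Text("Weather in Bergen: 22°C, sunny.".into())
    } else {
        LlmReply::ToolCall { name: "get-weather".into(), args: "Bergen".into() }
    }
}

// Stand-in for evaluating a deftool body.
fn run_tool(name: &str, args: &str) -> String {
    format!("called {name} with {args}: 22°C, sunny")
}

/// Keep calling the model until it returns plain text,
/// executing tool calls and feeding results back in between.
fn chat_with_tools(mut transcript: Vec<String>) -> String {
    loop {
        match call_llm(&transcript) {
            LlmReply::Text(answer) => return answer,
            LlmReply::ToolCall { name, args } => {
                let result = run_tool(&name, &args);
                transcript.push(format!("tool {name}: {result}"));
            }
        }
    }
}

fn main() {
    let answer = chat_with_tools(vec!["user: What's the weather in Bergen?".into()]);
    println!("{answer}");
}
```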
defagent goes further — it bundles a system prompt, a set of tools, and model configuration into a reusable agent:
(defagent researcher
  :model "gpt-4o"
  :system "You are a research assistant. Use your tools to find information."
  :tools [search summarize])

(researcher "Find recent papers on transformer architectures")
Structured extraction was another key addition. llm/extract parses LLM output into typed Sema values:
(llm/extract
  (prompt (user "The meeting is Tuesday at 3pm with Alice and Bob"))
  {:day {:type "string"}
   :time {:type "string"}
   :attendees {:type "array" :items {:type "string"}}})
;; => {:day "Tuesday" :time "3pm" :attendees ["Alice" "Bob"]}
How Amp Code Was Used
The workflow was similar to building Token but compressed into 5 days instead of 10. Lisp interpreters have decades of academic prior art — SICP, Queinnec's "Lisp in Small Pieces", the R7RS spec — which meant agents had strong reference material to work from. Less time was spent explaining how to build and more time was spent deciding what to build.
How the Work Was Structured
My job isn't to write code anymore. It's to manage a team of agents and communicate what I want clearly. That means knowing what to ask for, knowing when to dig deeper into something I'm not sure about, and knowing when to let an agent run with a well-defined task.
A Lisp implementation has natural decomposition boundaries. The lexer doesn't need to know about the stdlib. The LLM module doesn't care about the evaluator internals. Most of the work was inherently sequential — you can't write stdlib functions before the evaluator exists — but the boundaries were clean enough that independent modules could be built in parallel when the time came. I'd typically run 2–3 agent sessions simultaneously in separate tabs: one doing code changes, one updating docs or the website, and a third running benchmarks or discovering test gaps. This works well until you push it too far — sometimes one agent breaks the build for the others, and the real bottleneck becomes me juggling too much context at once. The benefits flatten out once you're switching between more threads than you can hold in your head.
Where prior knowledge mattered most was in areas I was less familiar with. I had agents research Lisp implementation
strategies, survey how other interpreters handle tail-call optimization, and present me with options for things like
the environment representation. The important thing is knowing when to dig deeper — when an architectural choice has
implications you might not see until later. One failure of mine here: the original design used thread_local!
variables for evaluator state (call stack, module cache, eval depth). I didn't flag this as something to examine
more carefully early on. It worked, it was simple, and it avoided circular dependencies between crates. But it meant
you couldn't run multiple independent interpreter instances on the same thread — a problem for embedding Sema as a
library. I had to refactor to an explicit EvalContext struct later, touching ~13 files and ~80 call sites. The
refactor was straightforward, but it would have been cheaper to get right on day 1 if I'd thought harder about the
embedding use case upfront.
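For illustration, here is roughly what that refactor means in Rust terms. The field names below are guesses, not Sema's code; the point is the shift from hidden thread-local state to an explicit context you can instantiate more than once per thread.

```rust
use std::cell::Cell;

// Before: evaluator state in thread-locals. Simple and dependency-free,
// but only one interpreter can exist per thread.
thread_local! {
    static EVAL_DEPTH: Cell<usize> = Cell::new(0);
}

fn enter_eval_global() -> usize {
    EVAL_DEPTH.with(|d| {
        d.set(d.get() + 1);
        d.get()
    })
}

// After: the same state carried explicitly, so several independent
// interpreter instances can coexist on one thread (the embedding case).
struct EvalContext {
    eval_depth: usize,
}

impl EvalContext {
    fn new() -> Self {
        Self { eval_depth: 0 }
    }
    fn enter_eval(&mut self) -> usize {
        self.eval_depth += 1;
        self.eval_depth
    }
}

fn main() {
    let _ = enter_eval_global();

    // Two embedded interpreters side by side: trivial with explicit contexts.
    let mut a = EvalContext::new();
    let mut b = EvalContext::new();
    assert_eq!(a.enter_eval(), 1);
    assert_eq!(b.enter_eval(), 1);
}
```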
The Back-and-Forth
The work didn't split neatly into "I designed" and "agents implemented." It was a loop. I'd start a session with
explicit context — which crate, what Value looks like, naming conventions, what not to touch — and the agents would
return a patch or a plan. I'd accept it, redirect with tighter constraints, or ask a different question in a fresh
thread when the current one drifted.
For stdlib modules, the loop was short. A (string/split "a,b,c" ",") is a specification, not a conversation — here's
the signature, here's what it does, here are the edge cases. But anything touching architecture or tooling was
iterative by necessity.
The WASM playground was human constraints, agent execution. I knew up front the browser build needed conditional
compilation: no filesystem, no network, no live LLM calls. I knew the string interner needed a WASM-compatible
backend. Those constraints came from me. But when agents categorized all 61 functions that needed shimming —
splitting them into "trivial" (path ops are pure string manipulation), "medium" (in-memory virtual filesystem for
file/read and file/write), and "not feasible" (shell, exit, blocking stdin) — that categorization was useful
and saved me time. When they tried to bridge async fetch() into the synchronous evaluator and hit the expected wall,
I'd already decided on stub errors pointing to a future eval_async. The direction was mine; the mechanical work
of making 61 shims compile and pass was theirs.
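A hypothetical sketch of what one of those shims can look like, using cfg(target_arch = "wasm32") to swap implementations per category; the function names and error text are illustrative, not the real sema-stdlib code.

```rust
// Native build: real filesystem access.
#[cfg(not(target_arch = "wasm32"))]
fn file_read(path: &str) -> Result<String, String> {
    std::fs::read_to_string(path).map_err(|e| e.to_string())
}

// Browser build ("medium" category): reads are served from an in-memory
// map standing in for a virtual filesystem.
#[cfg(target_arch = "wasm32")]
fn file_read(path: &str) -> Result<String, String> {
    use std::cell::RefCell;
    use std::collections::HashMap;
    thread_local! {
        static VFS: RefCell<HashMap<String, String>> = RefCell::new(HashMap::new());
    }
    VFS.with(|fs| fs.borrow().get(path).cloned())
        .ok_or_else(|| format!("{path}: no such file in the playground VFS"))
}

// "Not feasible" category: shelling out has no browser equivalent,
// so the shim fails loudly instead of pretending to work.
#[cfg(target_arch = "wasm32")]
fn shell(_cmd: &str) -> Result<String, String> {
    Err("shell is not available in the browser playground".to_string())
}

fn main() {
    // Native builds read the real file; playground builds go through the VFS.
    println!("{:?}", file_read("Cargo.toml"));
}
```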
Benchmarks were another case where knowing what to ask for mattered. I wanted to compare Sema against other Lisps
under controlled conditions — not a flattering number, but something methodologically sound. Same Docker container,
same 10M-row input, same measurement approach, best of 3. Agents built the harness, wrote implementations for 14
other dialects, and generated the comparison tables. But I had to keep tightening the methodology: ensuring all
implementations used integer×10 parsing for fairness, switching the Dockerfile to build from local source so I could
test uncommitted optimizations, correcting drift when an implementation was accidentally benchmarking the parser
instead of the hot loop. The let* flattening optimization — reducing environment allocations from 3 per row to
1 — came from an agent analyzing the profile data, and it was the right call. But knowing to profile, knowing what
"fair" means across dialects, knowing when a 7.4× gap behind SBCL is respectable for a tree-walking interpreter
versus embarrassing — that's domain knowledge the agents didn't have.
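For the curious, the integer×10 trick works roughly like the sketch below (illustrative only, not the benchmark harness itself): readings such as "-12.3" are parsed as integer tenths so the hot loop never touches floating point.

```rust
/// Parse a reading with exactly one decimal digit into integer tenths.
fn parse_tenths(s: &str) -> i64 {
    let (neg, digits) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let mut value = 0i64;
    for b in digits.bytes() {
        if b != b'.' {
            value = value * 10 + (b - b'0') as i64;
        }
    }
    if neg { -value } else { value }
}

fn main() {
    assert_eq!(parse_tenths("22.0"), 220);
    assert_eq!(parse_tenths("-12.3"), -123);
    // min/mean/max accumulate as integers and are divided by 10 only when printing.
    println!("ok");
}
```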
And sometimes the best ideas came from the agents. BTreeMap for deterministic map ordering wasn't my idea. An
agent suggested it with a rationale — sorted iteration order makes debugging reproducible, which matters when you're
comparing LLM responses across providers. I accepted it because it matched what I cared about. The same happened with
error message design: I used the brainstorming skill, agents researched how Rust and Zig handle diagnostics,
proposed three tiers of improvement, and I picked the middle one — structured hints without full source-pointing
diagnostics. Their research was genuinely useful; my contribution was knowing which level of polish was worth the
complexity.
This is how most of the decisions were made. Not a clean division of labor, but a loop of specifying, reviewing, correcting, and occasionally being surprised.
Design Decisions
Keywords as Map Accessors
Borrowed from Clojure: keywords in function position are map lookups.
(define person {:name "Helge" :age 30 :city "Bergen"})

(:name person) ;; => "Helge"
(:age person)  ;; => 30

;; Works in higher-order contexts
(map :name [{:name "Alice"} {:name "Bob"}]) ;; => ("Alice" "Bob")
Deterministic Ordering
All maps use BTreeMap internally. This means iteration order is always sorted by key. It's slower than HashMap for
large maps, but it makes output deterministic — important when you're debugging LLM interactions and need reproducible
results.
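A small illustration of the trade-off in plain Rust (not Sema's internals): BTreeMap iterates in sorted key order on every run, HashMap does not.

```rust
use std::collections::{BTreeMap, HashMap};

fn main() {
    let pairs = [("name", "Helge"), ("city", "Bergen"), ("age", "30")];

    let sorted: BTreeMap<_, _> = pairs.into_iter().collect();
    // Always prints age, city, name, in that order, on every run.
    for (k, v) in &sorted {
        println!("{k}: {v}");
    }

    let unordered: HashMap<_, _> = pairs.into_iter().collect();
    // Iteration order here depends on the hasher and can change between runs,
    // which makes diffing debug output across providers noisy.
    println!("{} entries", unordered.len());
}
```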
Conversations as Immutable Values
A conversation is not a mutable session. It's a value, like a list or a map. You can bind it, pass it to functions, return it, store it in data structures. When you "extend" a conversation, you get a new value — the original is unchanged.
This matters for LLM workflows. You often want to try multiple approaches from the same conversation state, compare responses across providers, or build conversation trees. Immutable conversations make this natural:
(define base-conv
  (conversation
    (system "You are an expert programmer.")))

;; Ask the same question to different models
(define answers
  (map (lambda (provider)
         (llm/with-provider provider
           (prompt
             (conversation/append base-conv
               (user "Explain monads in one sentence.")))))
       '("openai" "anthropic" "google")))
Single-Threaded by Design
Sema is deliberately single-threaded. The string interner, module cache, LLM provider configuration — all thread-local
state. No Arc, no Mutex, no synchronization overhead. The evaluator state lives in an explicit EvalContext struct
(originally thread-local too, until the embedding use case forced a refactor). This simplified the implementation
enormously and is the right trade-off for a language whose primary bottleneck is network calls to LLM APIs.
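As a sketch of what single-threaded reference counting buys, assuming a simplified value type that is not Sema's real one:

```rust
use std::rc::Rc;

// Illustrative value type: Rc instead of Arc, so clones are cheap,
// non-atomic refcount bumps with no locks anywhere.
#[derive(Clone, Debug)]
enum Value {
    Int(i64),
    Str(Rc<str>),
    List(Rc<Vec<Value>>),
}

fn main() {
    let greeting = Value::Str(Rc::from("hello"));
    let list = Value::List(Rc::new(vec![greeting.clone(), Value::Int(42)]));

    // Cloning shares the underlying allocations; no Mutex, no atomic ops.
    let alias = list.clone();
    println!("{alias:?}");
}
```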
The Performance Story
I benchmarked Sema against 14 other Lisp dialects on the 1 Billion Row Challenge — processing semicolon-delimited temperature readings to compute min/mean/max per weather station. To keep run times manageable, all benchmarks were run on the 10 million row variant (not the full 1 billion) inside the same Docker container.
Starting Point
The naive implementation ran in about 29 seconds. For a tree-walking interpreter this young, that was expected but not impressive.
Optimization Passes
Each optimization was a focused agent session:
String interning — Sema symbols and keywords were being compared as heap-allocated strings. Switching to the lasso crate for interning meant symbol comparisons became integer comparisons. This was the single biggest win (see the sketch after this list).
Hash map swap — Replacing the standard library HashMap with hashbrown for the hot-path environment lookups.
SIMD line scanning — Using memchr for finding newlines in the input file instead of byte-by-byte iteration.
COW map mutation — Copy-on-write semantics for map operations in tight loops, avoiding unnecessary cloning.
Mini-evaluator — A specialized fast path in the evaluator for simple arithmetic and comparison expressions that skips the full trampoline machinery.
let* flattening — Compiler pass that flattens nested let* forms to reduce environment chain depth.
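Here is the interning sketch referenced above, assuming the lasso crate the article names; the surrounding usage is illustrative rather than Sema's actual symbol table.

```rust
use lasso::Rodeo;

fn main() {
    let mut interner = Rodeo::default();

    // Interning returns a small integer-like key (a `Spur`).
    let a = interner.get_or_intern("conversation/append");
    let b = interner.get_or_intern("conversation/append");
    let c = interner.get_or_intern("llm/extract");

    // Symbol equality is now a key comparison, not a string comparison.
    assert_eq!(a, b);
    assert_ne!(a, c);

    // The original text is still recoverable for printing and error messages.
    println!("{}", interner.resolve(&a));
}
```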
Results
After optimization: 9.6 seconds natively on Apple Silicon. In Docker under x86-64 emulation (for fair comparison against other implementations), Sema lands at 7.4x behind SBCL:
| Dialect | Time (ms) | vs SBCL | Type |
|---|---|---|---|
| SBCL | 2,108 | 1.0x | Native compiler |
| Chez Scheme | 2,889 | 1.4x | Native compiler |
| Fennel/LuaJIT | 3,658 | 1.7x | JIT |
| Gambit | 5,665 | 2.7x | Compiled via C |
| Clojure | 5,717 | 2.7x | JVM |
| Chicken | 7,631 | 3.6x | Compiled via C |
| PicoLisp | 9,808 | 4.7x | Interpreter |
| newLISP | 12,481 | 5.9x | Interpreter |
| Emacs Lisp | 13,505 | 6.4x | Bytecode VM |
| Janet | 14,000 | 6.6x | Bytecode VM |
| ECL | 14,915 | 7.1x | Compiled via C |
| Guile | 15,198 | 7.2x | Bytecode VM |
| Sema | 15,564 | 7.4x | Tree-walking interpreter |
| Kawa | 17,135 | 8.1x | JVM |
| Gauche | 23,082 | 10.9x | Bytecode VM |
The most interesting comparison is Janet (6.6x) — architecturally the closest to Sema. Both are embeddable, single-threaded, reference-counted scripting languages. Janet's bytecode VM is faster, but the gap is narrower than you'd expect given the architectural advantage of bytecode dispatch over tree-walking. The full benchmark writeup is at sema-lang.com/docs/internals/lisp-comparison.
Building the Ecosystem
The language is only part of the project. Alongside the language work, agents built:
Documentation Site
A VitePress site at sema-lang.com covering:
- Getting started guide
- Language reference (data types, special forms, macros)
- Every stdlib module documented with examples
- LLM integration guide
- Embedding API for using Sema as a library
Browser Playground
A WASM-compiled version of Sema running at sema.run with:
- Code editor (plain textarea — no heavy dependencies)
- Preloaded example programs
- Instant evaluation (no server, runs entirely in the browser)
- The full stdlib available (minus LLM calls and file I/O, for obvious reasons)
Once the 61 shims were in place and the WASM target compiled, the playground itself was straightforward — a Vite app that loads the WASM module and wires up the editor.
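Purely as a sketch of the shape of that boundary, an exported entry point might look like the following; the function name and signature are guesses, not Sema's actual WASM API.

```rust
use wasm_bindgen::prelude::*;

// Hypothetical export the playground's JavaScript could call.
#[wasm_bindgen]
pub fn eval_source(source: &str) -> String {
    // In the real crate this would parse and evaluate `source` with the
    // browser shims active; here we just echo to keep the sketch self-contained.
    format!("evaluated: {source}")
}
```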
Example Programs
Examples ranging from basics (fibonacci.sema, fizzbuzz.sema) to LLM-specific programs:
;; multi-provider-compare.sema
;; Ask the same question across providers and compare

(define question "Explain recursion to a 5-year-old.")
(define providers '("openai" "anthropic" "google"))

(for-each (lambda (provider)
            (display (format "\n--- {} ---\n" provider))
            (display
              (llm/with-provider provider
                (prompt (user question)))))
          providers)
;; code-reviewer.sema
;; An agent that reviews code and suggests improvements

(deftool read-file (path)
  "Read source code from a file"
  {:path {:type "string" :description "File path to read"}}
  (file/read path))

(defagent code-reviewer
  :model "claude-sonnet-4-20250514"
  :system "You review code for bugs, performance issues, and style.
           Be specific and cite line numbers."
  :tools [read-file])

(code-reviewer
  (format "Review the file: {}" (car *args*)))
The Cleanup
When you run multiple agent sessions across different parts of a codebase, each one develops its own micro-style.
One session uses // ====== Section ====== separators, another doesn't. One writes doc comments on everything, another
only on public functions. One prefers Value::String(Rc::new(...)), another uses the Value::string(...) helper.
This is the same problem any multi-contributor project has — style drift. It just happens faster with agents because each session starts fresh without memory of what the others did.
The cleanup pass took about an hour:
- Removed 128 section separator comments that had accumulated across modules
- Deleted redundant doc comments (a function called add doesn't need /// Adds two numbers)
- Standardized Value::string() constructor usage across the entire codebase
- Unified error handling patterns where different agents had chosen different approaches
This isn't about hiding anything. It's about not letting inconsistency accumulate into what people would eventually just dismiss as slop. Multi-agent codebases need the same kind of style normalization that any team project needs — you just need to do it more deliberately because the drift happens in hours instead of months.
What I Learned
Lisps are ideal AI agent projects. The implementation is well-documented in academic literature (SICP, Queinnec's "Lisp in Small Pieces", R7RS). Agents can reference these directly. The module boundaries are natural. Each stdlib function is independent. The evaluator is the only complex piece, and even that follows established patterns.
Keep it under a week. Long enough to build something real, short enough that architectural decisions made on day 1 don't need revisiting. The LLM integration design held up from initial sketch to final implementation.
Agents are a force multiplier, not a magic wand. Exceptional solo developers — a Tsoding, a Jonathan Blow — can absolutely build impressive things through raw skill and focus. AI doesn't make impossible things possible. What it does is take "that's a neat idea, maybe I'll build it someday" and turn it into a fuzzed, benchmarked, documented, tested product with a browser playground — in days instead of months. The barrier isn't lowered for toys. It's lowered for robust output.
Context management is the real skill. A single agent session has finite context. When it fills up or drifts, you need strategies: handoffs (Amp Code creates a new thread with relevant context carried forward), compaction (tools like Claude compress conversation history to reclaim context space), and planning documents that serve as shared memory across sessions. Being able to point a new agent at a previous conversation and say "continue this work" — or write a spec document that any agent can pick up cold — is more important than running ten agents at once.
Curation is the job. Agents suggest things constantly — some good, some not. No agent woke up and decided that conversations should be immutable values, or that keywords in function position should work as map accessors. The work is knowing which suggestions to accept, which to reject, and which questions to ask in the first place. You're not writing code — you're directing a project.
Why I Keep Building These
Sema is the third "big" project I've built this way. Token was a text editor in Rust. Lira is a systems language. Each one is deliberately ambitious — not because I need a Lisp interpreter or a text editor, but because they're stress tests. How far can one person push this workflow? Where does it break? What skills do you need to develop?
The answer so far: pretty far, and the skills are not what most people think.
It's not about prompting. It's about describing things clearly when agents — not humans — are the target consumer. It's about developing a repertoire of human-machine collaboration patterns. It's about spotting drift before it compounds into something unmanageable. It's about knowing when to fan out and when to go deep. These are new skills and we're all still learning them — in hobby projects and in professional settings.
The discomfort around "AI slop" and the anger at an LLM giving a bad answer to a vague prompt — these reactions are real, and usually rooted in something understandable: fear of losing craft, status, or agency to a tool that's moving too fast to feel negotiable. You see the same pattern in music right now. When tools like Suno ship, it's natural for musicians to feel threatened — not because they're anti-technology, but because identity and livelihood are tied to the process. The practical outcome tends to be the same: the tools don't disappear, they get integrated, and the differentiator shifts toward taste, direction, and the ability to shape raw output into something intentional.
I don't think the right response is e/acc cheerleading or doomer resignation. It's paying attention. The tooling is improving monthly. The workflows are maturing. The gap between "person who can direct AI agents effectively" and "person who can't" is going to matter more than the gap between "person who can write Rust" and "person who can't."
I'd rather be practicing now than scrambling later.
Sema is MIT licensed at github.com/HelgeSverre/sema. The documentation is at sema-lang.com and the playground is at sema.run.
