Sema is a Scheme-like Lisp where prompts are s-expressions, conversations are immutable data structures, and LLM calls are just another form of evaluation. It's implemented in Rust across 6 crates, has 400+ builtins across 17 modules, and supports 11 LLM providers auto-configured from environment variables. The first commit was February 11th. Version 1.0.1 shipped February 15th.
The language, a documentation site, a WASM-powered browser playground with example programs, and a library of example scripts were all built over 5 days using Amp Code agents.
The Question
What if calling an LLM was as natural as calling a function? Not an HTTP request wrapped in error handling wrapped in JSON parsing — just evaluation. You write an expression, it evaluates, you get a result.
Lisp is the obvious answer. S-expressions already look like structured prompts. Conversations are just lists you can cons onto. Tool definitions map cleanly to function signatures. The data-as-code philosophy means you can manipulate prompts programmatically the same way you manipulate any other data structure.
Sema takes the Scheme core — lexical scoping, proper tail calls via trampolines — and adds
Clojure's ergonomic sugar: keywords (:foo), map literals ({:k v}), vector literals ([1 2 3]). Then it adds LLM
primitives as first-class language constructs.
Five Days
Day 1: Language Foundations (Feb 11)
The first day was about getting from nothing to a working Lisp. Lexer, parser, evaluator, REPL. The crate structure was decided upfront:
- sema-core — value types, environment, error handling
- sema-reader — lexer and parser
- sema-eval — evaluator with trampoline-based TCO
- sema-stdlib — 17 modules of builtins
- sema-llm — provider abstraction, tool execution, conversation values
- sema — CLI binary
By end of day: basic arithmetic, define, lambda, let, if, cond, begin, quote, quasiquote, string
operations, list operations. A Lisp you could actually write programs in.
The evaluator uses a trampoline for tail-call optimization — inspired by Guy Steele's 1978 "Rabbit" paper. Instead of
recursive Rust calls that blow the stack, tail-position expressions return a Bounce::Continue value that the trampoline
loop picks up:
;; This runs in constant stack space
(define (loop n)
  (if (= n 0)
      "done"
      (loop (- n 1))))

(loop 10000000) ;; => "done"
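For readers curious what this looks like on the Rust side, here is a minimal sketch of the trampoline pattern. Only the Bounce::Continue name comes from the article; the simplified types and functions are illustrative, not Sema's actual evaluator.

```rust
// Minimal trampoline sketch. `Bounce` is named in the article; the rest is illustrative.
enum Bounce {
    // A finished result.
    Done(i64),
    // A tail call: re-enter the loop with new arguments instead of recursing.
    Continue { n: i64 },
}

// One evaluation "step": either finish or ask the trampoline to go around again.
fn step(n: i64) -> Bounce {
    if n == 0 {
        Bounce::Done(0)
    } else {
        Bounce::Continue { n: n - 1 }
    }
}

// The trampoline itself is a plain loop, so stack depth stays constant.
fn trampoline(mut n: i64) -> i64 {
    loop {
        match step(n) {
            Bounce::Done(v) => return v,
            Bounce::Continue { n: next } => n = next,
        }
    }
}

fn main() {
    // Same spirit as (loop 10000000) above: no stack growth.
    assert_eq!(trampoline(10_000_000), 0);
    println!("done");
}
```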
Days 2-3: LLM Integration & Stdlib Expansion (Feb 12-13)
This is where Sema becomes more than just another Lisp. The prompt special form lets you write conversations as
s-expressions where role symbols are syntax:
(prompt
  (system "You are a helpful assistant.")
  (user "What is the capital of Norway?"))
;; => "The capital of Norway is Oslo."
prompt builds a conversation value — an immutable list of messages — sends it to the configured LLM provider, and
returns the response text. But conversations are also first-class values you can bind, extend, inspect, and fork:
(define conv
  (conversation
    (system "You are a pirate.")
    (user "Hello!")))

;; Extend without mutating the original
(define conv2 (conversation/append conv (user "Tell me about treasure.")))

;; Fork for parallel exploration
(define polite-conv (conversation/append conv (system "Be extra polite.")))
(define rude-conv (conversation/append conv (system "Be rude.")))
The provider system auto-configures from environment variables. Set OPENAI_API_KEY and you have OpenAI. Set
ANTHROPIC_API_KEY and you have Anthropic. All 11 providers — OpenAI, Anthropic, Google, Groq, Mistral, DeepSeek,
xAI, Cerebras, Perplexity, OpenRouter, Ollama — work the same way:
;; Switch providers with a single binding
(llm/with-provider "anthropic"
  (prompt (user "Hello from Claude!")))

(llm/with-provider "openai"
  (prompt (user "Hello from GPT!")))
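As a rough illustration of what environment-based auto-detection can look like in Rust: the variable names come from the article, while the types and function below are hypothetical, not Sema's actual provider code.

```rust
use std::env;

// Hypothetical provider record; illustrative only.
struct Provider {
    name: &'static str,
    api_key: String,
}

/// Collect every provider whose API key is present in the environment.
fn detect_providers() -> Vec<Provider> {
    let candidates = [
        ("openai", "OPENAI_API_KEY"),
        ("anthropic", "ANTHROPIC_API_KEY"),
        // ...the remaining nine providers follow the same pattern.
    ];

    candidates
        .into_iter()
        .filter_map(|(name, var)| {
            env::var(var).ok().map(|api_key| Provider { name, api_key })
        })
        .collect()
}

fn main() {
    for p in detect_providers() {
        println!("configured provider: {} (key length {})", p.name, p.api_key.len());
    }
}
```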
The stdlib grew rapidly: file I/O, HTTP client, JSON parsing, regex, math, string manipulation, hash maps, sorting, environment variables. Each module was a well-defined, independent task — the kind of thing an agent can pick up with minimal context.
Days 4-5: Tooling, Polish, Ecosystem (Feb 14-15)
The final push was about everything around the language: deftool and defagent, performance optimization, the
documentation site, the browser playground, and example programs.
deftool defines tools that LLMs can call during conversations. The tool execution loop is built into llm/chat — the
LLM sees the tool signatures, decides to call them, Sema executes the tool bodies, feeds results back, and the
conversation continues:
(deftool get-weather (location)
  "Get current weather for a location"
  {:location {:type "string" :description "City name"}}
  (format "Weather in {}: 22°C, sunny" location))

(prompt
  (system "You have access to a weather tool.")
  (user "What's the weather in Bergen?"))
;; LLM calls get-weather with "Bergen", gets result, responds naturally
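A minimal sketch of that execution loop, using stand-in types rather than Sema's real llm/chat internals:

```rust
// Illustrative tool-execution loop; all names here are stand-ins, not Sema's API.
enum LlmReply {
    Text(String),                             // final answer, loop ends
    ToolCall { name: String, args: String },  // model wants a tool result first
}

// Stand-in for the provider request: answers directly once it has seen a tool result.
fn call_llm(transcript: &[String]) -> LlmReply {
    if transcript.iter().any(|m| m.starts_with("tool ")) {
        LlmReply::Text("Weather in Bergen: 22°C, sunny.".into())
    } else {
        LlmReply::ToolCall { name: "get-weather".into(), args: "Bergen".into() }
    }
}

// Stand-in for evaluating a deftool body.
fn run_tool(name: &str, args: &str) -> String {
    format!("called {name} with {args}: 22°C, sunny")
}

/// Keep calling the model until it returns plain text,
/// executing tool calls and feeding results back in between.
fn chat_with_tools(mut transcript: Vec<String>) -> String {
    loop {
        match call_llm(&transcript) {
            LlmReply::Text(answer) => return answer,
            LlmReply::ToolCall { name, args } => {
                let result = run_tool(&name, &args);
                transcript.push(format!("tool {name}: {result}"));
            }
        }
    }
}

fn main() {
    let answer = chat_with_tools(vec!["user: What's the weather in Bergen?".into()]);
    println!("{answer}");
}
```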
defagent goes further — it bundles a system prompt, a set of tools, and model configuration into a reusable agent:
(defagent researcher
  :model "gpt-4o"
  :system "You are a research assistant. Use your tools to find information."
  :tools [search summarize])

(researcher "Find recent papers on transformer architectures")
Structured extraction was another key addition. llm/extract parses LLM output into typed Sema values:
(llm/extract
  (prompt (user "The meeting is Tuesday at 3pm with Alice and Bob"))
  {:day {:type "string"}
   :time {:type "string"}
   :attendees {:type "array" :items {:type "string"}}})
;; => {:day "Tuesday" :time "3pm" :attendees ["Alice" "Bob"]}
How Amp Code Was Used
The workflow was similar to building Token but compressed into 5 days instead of 10. Lisp interpreters have decades of academic prior art — SICP, Queinnec's "Lisp in Small Pieces", the R7RS spec — which meant agents had strong reference material to work from. Less time was spent explaining how to build and more time was spent deciding what to build.
How the Work Was Structured
My job isn't to write code anymore. It's to manage a team of agents and communicate what I want clearly. That means knowing what to ask for, knowing when to dig deeper into something I'm not sure about, and knowing when to let an agent run with a well-defined task.
A Lisp implementation has natural decomposition boundaries. The lexer doesn't need to know about the stdlib. The LLM module doesn't care about the evaluator internals. Most of the work was inherently sequential — you can't write stdlib functions before the evaluator exists — but the boundaries were clean enough that independent modules could be built in parallel when the time came. I'd typically run 2–3 agent sessions simultaneously in separate tabs: one doing code changes, one updating docs or the website, and a third running benchmarks or discovering test gaps. This works well until you push it too far — sometimes one agent breaks the build for the others, and the real bottleneck becomes me juggling too much context at once. The benefits flatten out once you're switching between more threads than you can hold in your head.
Where prior knowledge mattered most was in areas I was less familiar with. I had agents research Lisp implementation
strategies, survey how other interpreters handle tail-call optimization, and present me with options for things like
the environment representation. The important thing is knowing when to dig deeper — when an architectural choice has
implications you might not see until later. One failure of mine here: the original design used thread_local!
variables for evaluator state (call stack, module cache, eval depth). I didn't flag this as something to examine
more carefully early on. It worked, it was simple, and it avoided circular dependencies between crates. But it meant
you couldn't run multiple independent interpreter instances on the same thread — a problem for embedding Sema as a
library. I had to refactor to an explicit EvalContext struct later, touching ~13 files and ~80 call sites. The
refactor was straightforward, but it would have been cheaper to get right on day 1 if I'd thought harder about the
embedding use case upfront.
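For illustration, here is roughly what that refactor means in Rust terms. The field names below are guesses, not Sema's code; the point is the shift from hidden thread-local state to an explicit context you can instantiate more than once per thread.

```rust
use std::cell::Cell;

// Before: evaluator state in thread-locals. Simple and dependency-free,
// but only one interpreter can exist per thread.
thread_local! {
    static EVAL_DEPTH: Cell<usize> = Cell::new(0);
}

fn enter_eval_global() -> usize {
    EVAL_DEPTH.with(|d| {
        d.set(d.get() + 1);
        d.get()
    })
}

// After: the same state carried explicitly, so several independent
// interpreter instances can coexist on one thread (the embedding case).
struct EvalContext {
    eval_depth: usize,
}

impl EvalContext {
    fn new() -> Self {
        Self { eval_depth: 0 }
    }
    fn enter_eval(&mut self) -> usize {
        self.eval_depth += 1;
        self.eval_depth
    }
}

fn main() {
    let _ = enter_eval_global();

    // Two embedded interpreters side by side: trivial with explicit contexts.
    let mut a = EvalContext::new();
    let mut b = EvalContext::new();
    assert_eq!(a.enter_eval(), 1);
    assert_eq!(b.enter_eval(), 1);
}
```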
The Back-and-Forth
The work didn't split neatly into "I designed" and "agents implemented." It was a loop. I'd start a session with
explicit context — which crate, what Value looks like, naming conventions, what not to touch — and the agents would
return a patch or a plan. I'd accept it, redirect with tighter constraints, or ask a different question in a fresh
thread when the current one drifted.
For stdlib modules, the loop was short. A (string/split "a,b,c" ",") is a specification, not a conversation — here's
the signature, here's what it does, here are the edge cases. But anything touching architecture or tooling was
iterative by necessity.
The WASM playground was human constraints, agent execution. I knew up front the browser build needed conditional
compilation: no filesystem, no network, no live LLM calls. I knew the string interner needed a WASM-compatible
backend. Those constraints came from me. But when agents categorized all 61 functions that needed shimming —
splitting them into "trivial" (path ops are pure string manipulation), "medium" (in-memory virtual filesystem for
file/read and file/write), and "not feasible" (shell, exit, blocking stdin) — that categorization was useful
and saved me time. When they tried to bridge async fetch() into the synchronous evaluator and hit the expected wall,
I'd already decided on stub errors pointing to a future eval_async. The direction was mine; the mechanical work
of making 61 shims compile and pass was theirs.
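A hypothetical sketch of what one of those shims can look like, using cfg(target_arch = "wasm32") to swap implementations per category; the function names and error text are illustrative, not the real sema-stdlib code.

```rust
// Native build: real filesystem access.
#[cfg(not(target_arch = "wasm32"))]
fn file_read(path: &str) -> Result<String, String> {
    std::fs::read_to_string(path).map_err(|e| e.to_string())
}

// Browser build ("medium" category): reads are served from an in-memory
// map standing in for a virtual filesystem.
#[cfg(target_arch = "wasm32")]
fn file_read(path: &str) -> Result<String, String> {
    use std::cell::RefCell;
    use std::collections::HashMap;
    thread_local! {
        static VFS: RefCell<HashMap<String, String>> = RefCell::new(HashMap::new());
    }
    VFS.with(|fs| fs.borrow().get(path).cloned())
        .ok_or_else(|| format!("{path}: no such file in the playground VFS"))
}

// "Not feasible" category: shelling out has no browser equivalent,
// so the shim fails loudly instead of pretending to work.
#[cfg(target_arch = "wasm32")]
fn shell(_cmd: &str) -> Result<String, String> {
    Err("shell is not available in the browser playground".to_string())
}

fn main() {
    // Native builds read the real file; playground builds go through the VFS.
    println!("{:?}", file_read("Cargo.toml"));
}
```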
Benchmarks were another case where knowing what to ask for mattered. I wanted to compare Sema against other Lisps
under controlled conditions — not a flattering number, but something methodologically sound. Same Docker container,
same 10M-row input, same measurement approach, best of 3. Agents built the harness, wrote implementations for 14
other dialects, and generated the comparison tables. But I had to keep tightening the methodology: ensuring all
implementations used integer×10 parsing for fairness, switching the Dockerfile to build from local source so I could
test uncommitted optimizations, correcting drift when an implementation was accidentally benchmarking the parser
instead of the hot loop. The let* flattening optimization — reducing environment allocations from 3 per row to
1 — came from an agent analyzing the profile data, and it was the right call. But knowing to profile, knowing what
"fair" means across dialects, knowing when a 7.4× gap behind SBCL is respectable for a tree-walking interpreter
versus embarrassing — that's domain knowledge the agents didn't have.
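For the curious, the integer×10 trick works roughly like the sketch below (illustrative only, not the benchmark harness itself): readings such as "-12.3" are parsed as integer tenths so the hot loop never touches floating point.

```rust
/// Parse a reading with exactly one decimal digit into integer tenths.
fn parse_tenths(s: &str) -> i64 {
    let (neg, digits) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let mut value = 0i64;
    for b in digits.bytes() {
        if b != b'.' {
            value = value * 10 + (b - b'0') as i64;
        }
    }
    if neg { -value } else { value }
}

fn main() {
    assert_eq!(parse_tenths("22.0"), 220);
    assert_eq!(parse_tenths("-12.3"), -123);
    // min/mean/max accumulate as integers and are divided by 10 only when printing.
    println!("ok");
}
```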
And sometimes the best ideas came from the agents. BTreeMap for deterministic map ordering wasn't my idea. An
agent suggested it with a rationale — sorted iteration order makes debugging reproducible, which matters when you're
comparing LLM responses across providers. I accepted it because it matched what I cared about. The same happened with
error message design: I used the brainstorming skill, agents researched how Rust and Zig handle diagnostics,
proposed three tiers of improvement, and I picked the middle one — structured hints without full source-pointing
diagnostics. Their research was genuinely useful; my contribution was knowing which level of polish was worth the
complexity.
This is how most of the decisions were made. Not a clean division of labor, but a loop of specifying, reviewing, correcting, and occasionally being surprised.
Design Decisions
Keywords as Map Accessors
Borrowed from Clojure: keywords in function position are map lookups.
(define person {:name "Helge" :age 30 :city "Bergen"})

(:name person) ;; => "Helge"
(:age person)  ;; => 30

;; Works in higher-order contexts
(map :name [{:name "Alice"} {:name "Bob"}]) ;; => ("Alice" "Bob")
Deterministic Ordering
All maps use BTreeMap internally. This means iteration order is always sorted by key. It's slower than HashMap for
large maps, but it makes output deterministic — important when you're debugging LLM interactions and need reproducible
results.
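A small illustration of the trade-off in plain Rust (not Sema's internals): BTreeMap iterates in sorted key order on every run, HashMap does not.

```rust
use std::collections::{BTreeMap, HashMap};

fn main() {
    let pairs = [("name", "Helge"), ("city", "Bergen"), ("age", "30")];

    let sorted: BTreeMap<_, _> = pairs.into_iter().collect();
    // Always prints age, city, name, in that order, on every run.
    for (k, v) in &sorted {
        println!("{k}: {v}");
    }

    let unordered: HashMap<_, _> = pairs.into_iter().collect();
    // Iteration order here depends on the hasher and can change between runs,
    // which makes diffing debug output across providers noisy.
    println!("{} entries", unordered.len());
}
```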
Conversations as Immutable Values
A conversation is not a mutable session. It's a value, like a list or a map. You can bind it, pass it to functions, return it, store it in data structures. When you "extend" a conversation, you get a new value — the original is unchanged.
This matters for LLM workflows. You often want to try multiple approaches from the same conversation state, compare responses across providers, or build conversation trees. Immutable conversations make this natural:
(define base-conv
  (conversation
    (system "You are an expert programmer.")))

;; Ask the same question to different models
(define answers
  (map (lambda (provider)
         (llm/with-provider provider
           (prompt
             (conversation/append base-conv
               (user "Explain monads in one sentence.")))))
       '("openai" "anthropic" "google")))
Single-Threaded by Design
Sema is deliberately single-threaded. The string interner, module cache, LLM provider configuration — all thread-local
state. No Arc, no Mutex, no synchronization overhead. The evaluator state lives in an explicit EvalContext struct
(originally thread-local too, until the embedding use case forced a refactor). This simplified the implementation
enormously and is the right trade-off for a language whose primary bottleneck is network calls to LLM APIs.
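As a sketch of what single-threaded reference counting buys, assuming a simplified value type that is not Sema's real one:

```rust
use std::rc::Rc;

// Illustrative value type: Rc instead of Arc, so clones are cheap,
// non-atomic refcount bumps with no locks anywhere.
#[derive(Clone, Debug)]
enum Value {
    Int(i64),
    Str(Rc<str>),
    List(Rc<Vec<Value>>),
}

fn main() {
    let greeting = Value::Str(Rc::from("hello"));
    let list = Value::List(Rc::new(vec![greeting.clone(), Value::Int(42)]));

    // Cloning shares the underlying allocations; no Mutex, no atomic ops.
    let alias = list.clone();
    println!("{alias:?}");
}
```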
The Performance Story
I benchmarked Sema against 14 other Lisp dialects on the 1 Billion Row Challenge — processing semicolon-delimited temperature readings to compute min/mean/max per weather station. To keep run times manageable, all benchmarks were run on the 10 million row variant (not the full 1 billion) inside the same Docker container.
Starting Point
The naive implementation ran in about 29 seconds. For a tree-walking interpreter this young, that was expected but not impressive.
Optimization Passes
Each optimization was a focused agent session:
String interning — Sema symbols and keywords were being compared as heap-allocated strings. Switching to the lasso crate for interning meant symbol comparisons became integer comparisons. This was the single biggest win (see the sketch after this list).
Hash map swap — Replacing the standard library HashMap with hashbrown for the hot-path environment lookups.
SIMD line scanning — Using memchr for finding newlines in the input file instead of byte-by-byte iteration.
COW map mutation — Copy-on-write semantics for map operations in tight loops, avoiding unnecessary cloning.
Mini-evaluator — A specialized fast path in the evaluator for simple arithmetic and comparison expressions that skips the full trampoline machinery.
let* flattening — Compiler pass that flattens nested let* forms to reduce environment chain depth.
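Here is the interning sketch referenced above, assuming the lasso crate the article names; the surrounding usage is illustrative rather than Sema's actual symbol table.

```rust
use lasso::Rodeo;

fn main() {
    let mut interner = Rodeo::default();

    // Interning returns a small integer-like key (a `Spur`).
    let a = interner.get_or_intern("conversation/append");
    let b = interner.get_or_intern("conversation/append");
    let c = interner.get_or_intern("llm/extract");

    // Symbol equality is now a key comparison, not a string comparison.
    assert_eq!(a, b);
    assert_ne!(a, c);

    // The original text is still recoverable for printing and error messages.
    println!("{}", interner.resolve(&a));
}
```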
Results
After optimization: 9.6 seconds natively on Apple Silicon. In Docker under x86-64 emulation (for fair comparison against other implementations), Sema lands at 7.4x behind SBCL:
| Dialect | Time (ms) | vs SBCL | Type |
|---|---|---|---|
| SBCL | 2,108 | 1.0x | Native compiler |
| Chez Scheme | 2,889 | 1.4x | Native compiler |
| Fennel/LuaJIT | 3,658 | 1.7x | JIT |
| Gambit | 5,665 | 2.7x | Compiled via C |
| Clojure | 5,717 | 2.7x | JVM |
| Chicken | 7,631 | 3.6x | Compiled via C |
| PicoLisp | 9,808 | 4.7x | Interpreter |
| newLISP | 12,481 | 5.9x | Interpreter |
| Emacs Lisp | 13,505 | 6.4x | Bytecode VM |
| Janet | 14,000 | 6.6x | Bytecode VM |
| ECL | 14,915 | 7.1x | Compiled via C |
| Guile | 15,198 | 7.2x | Bytecode VM |
| Sema | 15,564 | 7.4x | Tree-walking interpreter |
| Kawa | 17,135 | 8.1x | JVM |
| Gauche | 23,082 | 10.9x | Bytecode VM |
The most interesting comparison is Janet (6.6x) — architecturally the closest to Sema. Both are embeddable, single-threaded, reference-counted scripting languages. Janet's bytecode VM is faster, but the gap is narrower than you'd expect given the architectural advantage of bytecode dispatch over tree-walking. The full benchmark writeup is at sema-lang.com/docs/internals/lisp-comparison.
Building the Ecosystem
The language is only part of the project. Alongside the language work, agents built:
Documentation Site
A VitePress site at sema-lang.com covering:
- Getting started guide
- Language reference (data types, special forms, macros)
- Every stdlib module documented with examples
- LLM integration guide
- Embedding API for using Sema as a library
Browser Playground
A WASM-compiled version of Sema running at sema.run with:
- Code editor (plain textarea — no heavy dependencies)
- Preloaded example programs
- Instant evaluation (no server, runs entirely in the browser)
- The full stdlib available (minus LLM calls and file I/O, for obvious reasons)
Once the 61 shims were in place and the WASM target compiled, the playground itself was straightforward — a Vite app that loads the WASM module and wires up the editor.
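Purely as a sketch of the shape of that boundary, an exported entry point might look like the following; the function name and signature are guesses, not Sema's actual WASM API.

```rust
use wasm_bindgen::prelude::*;

// Hypothetical export the playground's JavaScript could call.
#[wasm_bindgen]
pub fn eval_source(source: &str) -> String {
    // In the real crate this would parse and evaluate `source` with the
    // browser shims active; here we just echo to keep the sketch self-contained.
    format!("evaluated: {source}")
}
```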
Example Programs
Examples ranging from basics (fibonacci.sema, fizzbuzz.sema) to LLM-specific programs:
;; multi-provider-compare.sema
;; Ask the same question across providers and compare

(define question "Explain recursion to a 5-year-old.")
(define providers '("openai" "anthropic" "google"))

(for-each (lambda (provider)
            (display (format "\n--- {} ---\n" provider))
            (display
              (llm/with-provider provider
                (prompt (user question)))))
          providers)
;; code-reviewer.sema
;; An agent that reviews code and suggests improvements

(deftool read-file (path)
  "Read source code from a file"
  {:path {:type "string" :description "File path to read"}}
  (file/read path))

(defagent code-reviewer
  :model "claude-sonnet-4-20250514"
  :system "You review code for bugs, performance issues, and style.
           Be specific and cite line numbers."
  :tools [read-file])

(code-reviewer
  (format "Review the file: {}" (car *args*)))
The Cleanup
When you run multiple agent sessions across different parts of a codebase, each one develops its own micro-style.
One session uses // ====== Section ====== separators, another doesn't. One writes doc comments on everything, another
only on public functions. One prefers Value::String(Rc::new(...)), another uses the Value::string(...) helper.
This is the same problem any multi-contributor project has — style drift. It just happens faster with agents because each session starts fresh without memory of what the others did.
The cleanup pass took about an hour:
- Removed 128 section separator comments that had accumulated across modules
- Deleted redundant doc comments (a function called add doesn't need /// Adds two numbers)
- Standardized Value::string() constructor usage across the entire codebase
- Unified error handling patterns where different agents had chosen different approaches
This isn't about hiding anything. It's about not letting inconsistency accumulate into what people would eventually just dismiss as slop. Multi-agent codebases need the same kind of style normalization that any team project needs — you just need to do it more deliberately because the drift happens in hours instead of months.
What I Learned
Lisps are ideal AI agent projects. The implementation is well-documented in academic literature (SICP, Queinnec's "Lisp in Small Pieces", R7RS). Agents can reference these directly. The module boundaries are natural. Each stdlib function is independent. The evaluator is the only complex piece, and even that follows established patterns.
Keep it under a week. Long enough to build something real, short enough that architectural decisions made on day 1 don't need revisiting. The LLM integration design held up from initial sketch to final implementation.
Agents are a force multiplier, not a magic wand. Exceptional solo developers — a Tsoding, a Jonathan Blow — can absolutely build impressive things through raw skill and focus. AI doesn't make impossible things possible. What it does is take "that's a neat idea, maybe I'll build it someday" and turn it into a fuzzed, benchmarked, documented, tested product with a browser playground — in days instead of months. The barrier isn't lowered for toys. It's lowered for robust output.
Context management is the real skill. A single agent session has finite context. When it fills up or drifts, you need strategies: handoffs (Amp Code creates a new thread with relevant context carried forward), compaction (tools like Claude compress conversation history to reclaim context space), and planning documents that serve as shared memory across sessions. Being able to point a new agent at a previous conversation and say "continue this work" — or write a spec document that any agent can pick up cold — is more important than running ten agents at once.
Curation is the job. Agents suggest things constantly — some good, some not. No agent woke up and decided that conversations should be immutable values, or that keywords in function position should work as map accessors. The work is knowing which suggestions to accept, which to reject, and which questions to ask in the first place. You're not writing code — you're directing a project.
Why I Keep Building These
Sema is the third "big" project I've built this way. Token was a text editor in Rust. Lira is a systems language. Each one is deliberately ambitious — not because I need a Lisp interpreter or a text editor, but because they're stress tests. How far can one person push this workflow? Where does it break? What skills do you need to develop?
The answer so far: pretty far, and the skills are not what most people think.
It's not about prompting. It's about describing things clearly when agents — not humans — are the target consumer. It's about developing a repertoire of human-machine collaboration patterns. It's about spotting drift before it compounds into something unmanageable. It's about knowing when to fan out and when to go deep. These are new skills and we're all still learning them — in hobby projects and in professional settings.
The discomfort around "AI slop" and the anger at an LLM giving a bad answer to a vague prompt — these reactions are real, and usually rooted in something understandable: fear of losing craft, status, or agency to a tool that's moving too fast to feel negotiable. You see the same pattern in music right now. When tools like Suno ship, it's natural for musicians to feel threatened — not because they're anti-technology, but because identity and livelihood are tied to the process. The practical outcome tends to be the same: the tools don't disappear, they get integrated, and the differentiator shifts toward taste, direction, and the ability to shape raw output into something intentional.
I don't think the right response is e/acc cheerleading or doomer resignation. It's paying attention. The tooling is improving monthly. The workflows are maturing. The gap between "person who can direct AI agents effectively" and "person who can't" is going to matter more than the gap between "person who can write Rust" and "person who can't."
I'd rather be practicing now than scrambling later.
Sema is MIT licensed at github.com/HelgeSverre/sema. The documentation is at sema-lang.com and the playground is at sema.run.
