Helge Sverre | All-stack Developer
Bergen, Norway
est. 2012  |  300+ repos  |  4000+ contributions
Sema After the First Week: VM, NaN-Boxing, and the Real Project
February 24, 2026

Part 1 covered shipping Sema v1.0.1 in five days — a tree-walking Lisp with LLM primitives, a documentation site, and a browser playground. That was February 15th.

It's now February 24th. Sema is at v1.11.0. There have been 350 more commits, 9 crates instead of 6, 25 stdlib modules instead of 19, and two execution backends instead of one. The project didn't end after the first week — it turned out the first week was just the foundation.

Why I Kept Going

The v1.0.1 release proved the core idea worked: LLM calls as s-expressions, conversations as immutable values, tool definitions as data. But it also exposed the limits of a tree-walking interpreter. The 1BRC benchmarks showed Sema at 7.4x behind SBCL — respectable for a tree-walker, but the architecture had a hard ceiling. Every expression evaluation walked the AST, every variable lookup chased an environment chain, every function call allocated.

The question after v1.0 wasn't "does this language make sense?" It was "how far can I push it?"

The Brainstorming Pipeline

After v1.0, I developed a workflow for figuring out what to build next — and it started by accident.

I was on my phone, scrolling Twitter, and saw a tweet asking for a tool to chat with a folder of PDFs:

I opened the Claude app and asked it to implement this using Sema — just gave it the sema-lang.com URL and the problem. It produced a ~60 line script:

;; pdf-rag-agent.sema — the script Claude produced from a tweet and a URL

(define pdf-dir (if (> (length (sys/args)) 3) (nth (sys/args) 3) "./pdfs"))
(define store-name "pdf-knowledge")
(define embed-model {:model "text-embedding-3-small"})

;; Create or reload the vector store
(if (file/exists? "pdf-knowledge.json")
  (vector-store/open store-name "pdf-knowledge.json")
  (vector-store/create store-name))

;; Ingest every PDF: extract pages, embed, store
(for-each
  (lambda (filename)
    (define pages (pdf/extract-text-pages (string-append pdf-dir "/" filename)))
    (define page-num 0)
    (for-each
      (lambda (page-text)
        (set! page-num (+ page-num 1))
        (when (> (string-length page-text) 50)
          (vector-store/add store-name
            (format "~a::p~a" filename page-num)
            (llm/embed page-text embed-model)
            {:text page-text :file filename :page page-num})))
      pages))
  (filter (fn (f) (string/ends-with? f ".pdf")) (file/list pdf-dir)))

(vector-store/save store-name "pdf-knowledge.json")

;; The "tool" Steve is asking for
(deftool search-docs
  "Search the uploaded PDF documents. Returns the most relevant passages."
  {:query {:type :string :description "A natural language search query"}}
  (lambda (query)
    (string/join
      (map (fn (hit)
        (format "[~a p.~a | score: ~a]\n~a"
          (:file (:metadata hit)) (:page (:metadata hit))
          (:score hit) (:text (:metadata hit))))
        (vector-store/search store-name (llm/embed query embed-model) 5))
      "\n\n---\n\n")))

;; Wrap it in an agent
(defagent pdf-assistant
  {:system "You answer questions about uploaded PDFs. Always use search-docs first."
   :tools [search-docs] :model "claude-sonnet-4-20250514" :max-turns 5})

;; Interactive loop
(define (repl)
  (display "You: ")
  (define input (read-line))
  (when (and input (> (string-length input) 0))
    (println (format "\nAssistant: ~a\n" (agent/run pdf-assistant input)))
    (repl)))
(repl)

The original had two minor errors — list-ref (doesn't exist in Sema, should be nth) and string/length (should be string-length) — the kind of hallucination you get when an LLM infers API names from conventions rather than documentation. Two-line fix. The structure, the use of deftool, defagent, vector store operations, PDF extraction — all correct. That's the thing about Sema's design: the APIs are regular enough that an LLM can mostly guess them right from the docs site.

But the interesting part wasn't the script — it was what happened next. The conversation drifted from "implement this" to "what's missing from Sema that would make this better?" to "what would a web server look like?" to "suggest 10 more feature ideas" to "how would a package manager work?" One brainstorming session on my phone, over the course of an evening, produced the entire post-v1.0 roadmap.

The pattern that emerged:

  1. Brainstorm with Claude.ai — long, freeform conversations. "Look at my language. What's missing? Where are the gaps? What would make someone choose this over LangChain?" These sessions produced massive design documents — 200-500 lines of code examples, architecture decisions, and rationale.
  2. Store as GitHub issues — I was doing these sessions on the Claude app on my phone, and GitHub issues were the easiest way to file the output somewhere that agents could access later via gh CLI. Each brainstorming output became an issue — not a bug report, but a design document. Issue #6 was 20 ergonomic improvements with priority rankings. Issue #7 was a complete web server API design. Issue #8 was 10 feature ideas ranked by competitive impact. Issues #9-12 covered sema build, the package manager, metaprogramming, and prompt combinators.
  3. Score and prioritize with Amp — I'd point agents at the issues and ask them to evaluate effort vs. impact, flag dependencies, and suggest implementation order. Issue #6's ergonomic improvements got ranked into four phases by effort/gain ratio: Phase 1 (string interpolation, threading macros — low effort, high gain), Phase 2 (get-in, short lambdas), Phase 3 (destructuring, pattern matching — higher effort, very high gain), Phase 4 (regex literals, named arguments — backlog).
  4. Create implementation plans — agents turned the scored issues into concrete plan documents with numbered tasks, checkboxes, and dependencies. These plans became the shared memory across agent sessions.
  5. Execute — agents worked through the plans, often in parallel.

This loop — brainstorm → issue → score → plan → implement — was how most post-v1.0 features were born. No agent decided that Sema needed a web server or a package manager. Those ideas came from directed conversations about gaps and competitive positioning. But the agents did the work of turning "this would be cool" into a prioritized backlog with estimated effort, and then executing against it.

The best example was issue #6 (ergonomic improvements). Claude.ai generated 23 items — from f-strings and threading macros to pattern matching and multimethods. Amp scored them, slotted them into phases, and agents implemented all four phases in three days. Every item from the original brainstorm that wasn't deferred shipped: f-strings, destructuring, pattern matching, short lambdas, threading macros, when-let/if-let, match, defmulti/defmethod, regex literals, REPL improvements. The design documents didn't even need much editing — they were already written as specifications, not conversations.

The Bytecode VM (v1.3.0 — Feb 17)

Two days after v1.0.1, Sema had a bytecode VM.

The pipeline: macro expansion → CoreExpr lowering → slot resolution → bytecode compilation → VM execution. The compiler translates the AST into a flat instruction sequence — LoadLocal, CallGlobal, JumpIfFalse, TailCall — and the VM executes it in a dispatch loop. No more tree walking for the hot path.

The VM was opt-in from the start: sema --vm script.sema. Both backends share the same stdlib, the same environment, the same LLM integration. You can switch between them with a flag, which made correctness testing straightforward — dual_eval_tests! runs every test through both backends and asserts identical results.

True tail-call optimization came naturally with the VM. Instead of the trampoline that the tree-walker uses (return a Trampoline::Eval and loop), the VM just overwrites the current call frame's locals and jumps back to the top of the dispatch loop. No allocation, no stack growth.
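The shape of that dispatch loop — and what "overwrite the frame and jump back to the top" means for tail calls — can be sketched in a few lines of Rust. The opcodes and single-slot frame here are invented for illustration; this is not Sema's actual instruction set, just the mechanism.

```rust
// A deliberately tiny stack VM sketching the dispatch-loop shape.
// Opcode names and layout are illustrative, not Sema's real ISA.

#[derive(Clone, Copy)]
enum Op {
    LoadLocal(usize), // push a local slot onto the operand stack
    SubConst(i64),    // subtract a constant from top of stack
    JumpIfPos(usize), // if top of stack > 0, jump to instruction index
    TailCall,         // overwrite locals with top of stack; restart
    Return,           // return top of stack
}

fn run(code: &[Op], arg: i64) -> i64 {
    let mut locals = vec![arg];
    let mut stack: Vec<i64> = Vec::new();
    let mut pc = 0;
    loop {
        match code[pc] {
            Op::LoadLocal(slot) => stack.push(locals[slot]),
            Op::SubConst(c) => {
                let v = stack.pop().unwrap();
                stack.push(v - c);
            }
            Op::JumpIfPos(target) => {
                if *stack.last().unwrap() > 0 {
                    pc = target;
                    continue;
                }
            }
            Op::TailCall => {
                // The essence of VM-level TCO: no new frame is pushed.
                // The current frame's locals are overwritten and execution
                // jumps back to the top of the function.
                locals[0] = stack.pop().unwrap();
                stack.clear();
                pc = 0;
                continue;
            }
            Op::Return => return stack.pop().unwrap(),
        }
        pc += 1;
    }
}

fn main() {
    // Bytecode for: (define (count-down n) (if (> n 0) (count-down (- n 1)) n))
    let code = [
        Op::LoadLocal(0),
        Op::JumpIfPos(3),
        Op::Return,      // n <= 0: return it
        Op::SubConst(1), // n > 0: compute n - 1 ...
        Op::TailCall,    // ... and "call" ourselves without stack growth
    ];
    // A million-deep recursion runs in constant stack space.
    println!("{}", run(&code, 1_000_000));
}
```

The tree-walker's trampoline achieves the same semantics, but it pays for a `Trampoline::Eval` allocation and an extra match on every tail call; the VM version is just two stores and a jump.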

What Made It Hard

Closure semantics. The tree-walker captures the entire environment by reference — closures just hold an Rc<Env> and lookup works. The VM uses a flat stack with numbered local slots, so closures need to explicitly capture upvalues. Getting this right — especially for closures that capture variables from multiple nesting levels — took several rounds of bug fixes. Self-referential closures (a lambda that calls itself via a define in its enclosing scope) needed special injection at the compilation level.

Interop with the stdlib was the other challenge. Sema's stdlib is implemented as native Rust functions that take Vec<Value> arguments. The VM needs to bridge between its stack-based calling convention and these native functions. The solution was a NativeFn fallback path — when the VM encounters a call to a native function, it pops arguments from the stack, builds a Vec, calls the Rust function, and pushes the result.
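A minimal sketch of that fallback path, with `Value` and the `fn(Vec<Value>) -> Value` signature as simplified stand-ins for Sema's actual types:

```rust
// Sketch of the VM-to-native bridge: pop the arguments off the operand
// stack, hand them to the Rust function as a Vec, push the result back.

#[derive(Clone, Debug, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
}

type NativeFn = fn(Vec<Value>) -> Value;

fn call_native(stack: &mut Vec<Value>, f: NativeFn, argc: usize) {
    // split_off keeps the arguments in call order.
    let args = stack.split_off(stack.len() - argc);
    stack.push(f(args));
}

// A stand-in stdlib function with the native calling convention.
fn native_add(args: Vec<Value>) -> Value {
    let sum = args
        .iter()
        .map(|v| match v {
            Value::Int(n) => *n,
            _ => panic!("+ expects integers"),
        })
        .sum();
    Value::Int(sum)
}

fn main() {
    let mut stack = vec![Value::Int(1), Value::Int(2), Value::Int(3)];
    call_native(&mut stack, native_add, 3);
    assert_eq!(stack, vec![Value::Int(6)]);
    println!("{:?}", stack);
}
```

The cost of this bridge — the `Vec` allocation in particular — is exactly what the later intrinsics work eliminates for the hottest builtins.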

NaN-Boxing (v1.4.0 — Feb 17)

The same day the VM shipped, I started NaN-boxing.

The Value type went from a 24-byte Rust enum (tag + payload + padding) to a single 8-byte u64. The trick: IEEE 754 doubles have a massive space of NaN representations — any double where the exponent bits are all 1 and the significand is non-zero is NaN. There are 2^52 such values. We only need one for "actual NaN." The rest become tag space for integers, booleans, nil, symbols, and pointers to heap-allocated objects.
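To make the encoding concrete, here's a minimal NaN-boxing sketch. The tag constants and payload layout are invented for illustration — Sema's actual encoding differs in detail — but the mechanism is the same: non-canonical quiet-NaN bit patterns become tag space.

```rust
// Any f64 whose exponent bits are all 1 and whose significand is non-zero
// is a NaN. We keep one canonical NaN for "actual NaN" and use the rest
// as tagged storage for other types, all in a single u64.

const QNAN: u64 = 0x7ff8_0000_0000_0000; // quiet-NaN prefix
const TAG_INT: u64 = 0x0001_0000_0000_0000; // illustrative tag bits
const TAG_NIL: u64 = 0x0002_0000_0000_0000;

#[derive(Debug, PartialEq)]
enum Decoded {
    Float(f64),
    Int(i32),
    Nil,
}

fn box_float(f: f64) -> u64 {
    f.to_bits() // plain doubles are stored as themselves
}

fn box_int(i: i32) -> u64 {
    QNAN | TAG_INT | (i as u32 as u64) // payload lives in the low 32 bits
}

fn box_nil() -> u64 {
    QNAN | TAG_NIL
}

fn decode(bits: u64) -> Decoded {
    // Anything that isn't a tagged quiet NaN is just a plain double.
    if bits & QNAN != QNAN {
        return Decoded::Float(f64::from_bits(bits));
    }
    match bits & (TAG_INT | TAG_NIL) {
        t if t == TAG_INT => Decoded::Int(bits as u32 as i32),
        t if t == TAG_NIL => Decoded::Nil,
        _ => Decoded::Float(f64::NAN), // the one canonical "actual NaN"
    }
}

fn main() {
    assert_eq!(decode(box_float(3.25)), Decoded::Float(3.25));
    assert_eq!(decode(box_int(-7)), Decoded::Int(-7));
    assert_eq!(decode(box_nil()), Decoded::Nil);
    assert_eq!(std::mem::size_of::<u64>(), 8); // every value is 8 bytes
    println!("ok");
}
```

Pointers to heap objects fit the same way: a 48-bit address goes in the payload bits, which is why this trick works on current 64-bit hardware.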

The immediate benefit was cache locality. Values on the VM stack went from 24 bytes to 8 bytes each — 3x more values per cache line. For the VM's tight dispatch loop, this mattered. Benchmarks showed 8-12% improvement on the VM path.

The cost: NaN-boxing added overhead under x86-64 emulation (Docker on Apple Silicon). The bit manipulation that's cheap on native ARM became expensive under Rosetta translation. This is why the Docker benchmark numbers got worse even as native performance improved — a trade-off I'd make again, since the Docker benchmarks are for comparison purposes and native is what users actually run.

VM Optimizations (v1.7.0 — v1.9.0)

After NaN-boxing, the VM got progressively faster through a series of targeted optimizations:

Intrinsic recognition (v1.9.0) — The compiler recognizes calls to common builtins (+, -, *, /, <, >, =, not, etc.) and emits specialized inline opcodes instead of CallGlobal. This eliminates the global hash lookup, Rc downcast, argument Vec allocation, and function pointer dispatch for the most frequent operations. The *Int variants include NaN-boxed fast paths that operate directly on raw u64 bits without ever constructing a Value. TAK benchmark: 4,352ms → 1,250ms (−71%).

Specialized slot opcodes (v1.7.0) — LoadLocal0 through LoadLocal3 are single-byte instructions that skip operand decoding for the first four local variable slots — the ones used most often.

Fused CallGlobal (v1.7.0) — Combines LOAD_GLOBAL + CALL into a single instruction for non-tail calls to global functions. Avoids pushing and popping the function value on the stack.

Constant folding (v1.11.0) — A post-lowering optimization pass that folds constant arithmetic, comparisons, boolean operations, and dead code in begin blocks at compile time.
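A toy version of such a folding pass, over a deliberately miniature IR (the real CoreExpr is far richer, and the variant names here are invented):

```rust
// Recursive bottom-up constant folding: fold children first, then
// collapse any operation whose operands are now both constants.

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn fold(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => match (fold(*a), fold(*b)) {
            (Expr::Const(x), Expr::Const(y)) => Expr::Const(x + y),
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        },
        Expr::Mul(a, b) => match (fold(*a), fold(*b)) {
            (Expr::Const(x), Expr::Const(y)) => Expr::Const(x * y),
            // Algebraic identity: (* x 1) => x
            (a, Expr::Const(1)) | (Expr::Const(1), a) => a,
            (a, b) => Expr::Mul(Box::new(a), Box::new(b)),
        },
        other => other, // constants and variables fold to themselves
    }
}

fn main() {
    // (+ (* 2 3) n)  =>  (+ 6 n)
    let e = Expr::Add(
        Box::new(Expr::Mul(Box::new(Expr::Const(2)), Box::new(Expr::Const(3)))),
        Box::new(Expr::Var("n".into())),
    );
    assert_eq!(
        fold(e),
        Expr::Add(Box::new(Expr::Const(6)), Box::new(Expr::Var("n".into())))
    );
    println!("ok");
}
```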

Stdlib intrinsics (v1.11.0) — car, cdr, cons, null?, pair?, length, append, get, contains? and more compiled as inline opcodes, bringing the total intrinsified operations to 23.

The Performance Story, Revisited

The Part 1 benchmarks showed the v1.0.1 tree-walker at 15.5s (Docker) / 9.6s (native). Here's where things stand now:

Mode                     Docker x86-64   Native Apple Silicon   vs SBCL (Docker)
Tree-walker (v1.0.1)     15,564ms        9,600ms                7.4x
Tree-walker (v1.11.0)    46,291ms        ~28,400ms              22.4x
Bytecode VM (v1.11.0)    23,117ms        ~15,900ms              11.2x

The tree-walker got slower. NaN-boxing's bit manipulation overhead is amplified under x86-64 emulation, and the mini-evaluator (a specialized fast path for simple arithmetic) was removed to unblock VM development. Natively, the regression is smaller but still present.

The VM is the intended execution path going forward. At 11.2x behind SBCL in Docker (and ~15.9s natively), it's competitive with Janet (a bytecode VM written in C) and faster than Gauche and Kawa. For a language whose primary bottleneck is network calls to LLM APIs, this is more than sufficient.

The most honest thing I can say about the performance story is that it's messy. Optimizing for one metric (native throughput) sometimes hurts another (emulated throughput). NaN-boxing was the right architectural choice for the VM's future, but it made the tree-walker's Docker numbers look terrible. If I'd been optimizing for benchmark optics, I'd have kept the mini-evaluator and skipped NaN-boxing. Instead I optimized for the execution model I actually believe in.

The Web Server

Issue #7 was a complete web server design that came out of a brainstorming session about what Sema needed to be more than a CLI scripting tool. The design constraints were explicit from the start:

  • Requests are maps. Responses are maps. No special types.
  • Middleware is function wrapping. No middleware protocol.
  • Routes are data — vectors in a list.
  • No ORM, no template engine, no session management. JSON APIs only. It's 2026.

The target feel was "Ring (Clojure) meets Flask (Python) meets 'oh wait, I can just call llm/complete in my handler.'" The result:

(http/serve
  (http/router
    [[:get "/api/analyze" (fn (req)
       (let [text (:text (:query req))
             result (llm/extract
                      {:sentiment {:type :string}
                       :topics {:type :array :items {:type :string}}}
                      text)]
         (http/ok result)))]
     [:get "/health" (fn (_) (http/ok {:status "ok"}))]
     [:static "/assets" "./public"]])
  {:port 3000})

The implementation uses Axum under the hood with a channel-bridged architecture — the Axum server runs on a Tokio async runtime while Sema handlers execute synchronously on the main thread. The channel bridge was necessary because Sema is single-threaded with Rc (not Arc), so handlers can't run on Tokio worker threads directly.
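The bridge pattern is easier to see without the web framework in the way. Here's a minimal sketch using plain std threads and channels — the spawned thread stands in for the Tokio/Axum side, and the request/response types are simplified stand-ins:

```rust
// Channel-bridged serving in miniature: worker threads forward requests
// over a channel; the single interpreter thread handles them and replies
// on a per-request channel. The interpreter's Rc values never cross threads.
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

struct Request {
    path: String,
    reply: mpsc::Sender<String>, // per-request reply channel
}

// Interpreter-side dispatch: plain synchronous code, safe to keep on the
// single interpreter thread alongside its non-Send values.
fn handle(routes: &HashMap<&str, fn() -> String>, path: &str) -> String {
    routes.get(path).map(|h| h()).unwrap_or_else(|| "404".to_string())
}

fn main() {
    let (tx, rx) = mpsc::channel::<Request>();

    // "Server" side: stands in for Tokio worker threads. They send work
    // to the interpreter instead of calling it directly.
    let server = thread::spawn(move || {
        let mut responses = Vec::new();
        for path in ["/health", "/nope"] {
            let (reply_tx, reply_rx) = mpsc::channel();
            tx.send(Request { path: path.into(), reply: reply_tx }).unwrap();
            responses.push(reply_rx.recv().unwrap());
        }
        responses
    });

    // Interpreter side: a single-threaded handler loop. Ends when all
    // senders are dropped (i.e. the server thread finishes).
    let mut routes: HashMap<&str, fn() -> String> = HashMap::new();
    routes.insert("/health", || r#"{"status":"ok"}"#.to_string());

    for req in rx {
        let body = handle(&routes, &req.path);
        req.reply.send(body).unwrap();
    }

    let responses = server.join().unwrap();
    assert_eq!(responses, vec![r#"{"status":"ok"}"#.to_string(), "404".to_string()]);
    println!("ok");
}
```

The trade-off is that handler execution is serialized — one request at a time through the interpreter — which is fine for a scripting language whose handlers mostly wait on LLM APIs anyway.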

SSE streaming and WebSocket support followed naturally from the channel design. http/stream returns an SSE response with a send callback. http/websocket upgrades the connection and gives you ws/send and ws/recv.

The Package Manager

The package manager story is worth telling in detail because it demonstrates the full prototype-first workflow.

Phase 1: The Design (Claude.ai → Issue #10)

The brainstorming session that produced issue #10 explored how other languages handle packages. The conclusion was to follow Go's pre-modules approach: packages are URLs, sema pkg add github.com/user/repo clones into ~/.sema/packages/, and (import "github.com/user/repo") resolves from there. No registry, no SAT solver, no version resolution. Git refs (@v1.0, @main, @abc123) are your version pins. Simple enough for a language with a tiny community, extensible later.
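The resolution logic is simple enough to sketch in a few lines. This is a hypothetical helper, not Sema's actual code — the cache layout and default ref are assumptions:

```rust
// Go-style package resolution sketch: split an optional "@ref" suffix off
// the import URL and map it into a per-ref checkout under ~/.sema/packages/.
use std::path::PathBuf;

fn resolve_package(spec: &str, home: &str) -> (PathBuf, String) {
    // "github.com/user/repo@v1.0" -> ("github.com/user/repo", "v1.0")
    let (url, git_ref) = match spec.rsplit_once('@') {
        Some((url, r)) => (url, r.to_string()),
        None => (spec, "main".to_string()), // assumed default when unpinned
    };
    let mut dir = PathBuf::from(home);
    dir.push(".sema/packages");
    dir.push(url);
    dir.push(&git_ref); // one checkout per pinned ref
    (dir, git_ref)
}

fn main() {
    let (dir, r) = resolve_package("github.com/user/repo@v1.0", "/home/me");
    assert_eq!(
        dir,
        PathBuf::from("/home/me/.sema/packages/github.com/user/repo/v1.0")
    );
    assert_eq!(r, "v1.0");

    let (dir, r) = resolve_package("github.com/user/repo", "/home/me");
    assert_eq!(r, "main");
    assert!(dir.ends_with("repo/main"));
    println!("ok");
}
```

Because the ref is part of the on-disk path, two scripts pinning different versions of the same package never collide — which is the whole version story, and why no SAT solver is needed.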

Phase 2: The Prototypes (AI-Generated Screens)

Before writing any backend code, I had agents create HTML prototypes for the package registry — what the eventual hosted service would look like. Five pages:

  • Homepage — hero search bar, featured packages grid, recently updated list
  • Search results — filterable package cards with download counts and star ratings
  • Package detail — two-column layout with README/code examples on the left, metadata sidebar (version, license, dependencies, install command) on the right
  • Login — tab-switching login/signup with GitHub OAuth
  • Account dashboard — API token management, published packages list

These were single-file HTML pages with a shared dark-theme CSS design system (Cormorant serif headings, JetBrains Mono for code, gold #c8a855 accent). They included Shiki syntax highlighting for Sema code blocks using a custom TextMate grammar. All AI-generated, all static — no backend, no JavaScript framework. Just HTML and CSS that showed exactly what the final thing should look like.

This prototype-first approach meant that when agents started on the real implementation, the design decisions were already made. The registry backend was scaffolded as an Axum app with SQLite storage and Askama templates — chosen specifically so the prototypes could translate almost directly into server-rendered pages with no frontend build step.

Phase 3: The Backend

The implementation plan had 10 tasks: scaffold → database migrations → auth → API tokens → publish endpoint → read endpoints → ownership → web UI → GitHub OAuth → Docker. Agents worked through them sequentially, with me reviewing after each task.

The registry went live on Fly.io at pkg.sema-lang.com — a single Axum binary with SQLite on a persistent volume, auto-scaling to zero when idle. $5/month. The CLI commands (sema pkg add, sema pkg publish, sema pkg search) talk to it via a simple REST API.

Phase 4: The Lock File

sema.lock was a later addition for reproducible builds — recording exact commit SHAs for Git packages and SHA256 checksums for registry packages. sema pkg install --locked fails if the lock is out of sync with sema.toml, which is the behavior you want in CI.
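The --locked check is essentially a symmetric-difference test between manifest and lock. A sketch with invented data shapes (the real files carry more than a name-to-ref map):

```rust
// Fail when sema.toml and sema.lock disagree, in either direction —
// a package added but never locked, or locked but since removed.
use std::collections::BTreeMap;

fn check_locked(
    manifest: &BTreeMap<String, String>, // package -> requested ref
    lock: &BTreeMap<String, String>,     // package -> resolved commit SHA
) -> Result<(), String> {
    for pkg in manifest.keys() {
        if !lock.contains_key(pkg) {
            return Err(format!("{pkg} missing from sema.lock; lock is out of date"));
        }
    }
    for pkg in lock.keys() {
        if !manifest.contains_key(pkg) {
            return Err(format!("{pkg} in sema.lock but not in sema.toml"));
        }
    }
    Ok(())
}

fn main() {
    let mut manifest = BTreeMap::new();
    manifest.insert("github.com/user/repo".to_string(), "v1.0".to_string());

    let mut lock = BTreeMap::new();
    assert!(check_locked(&manifest, &lock).is_err()); // not yet locked

    lock.insert("github.com/user/repo".to_string(), "abc123".to_string());
    assert!(check_locked(&manifest, &lock).is_ok());
    println!("ok");
}
```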

What Else Shipped

Beyond the VM, performance work, web server, and package manager, v1.1.0 through v1.11.0 added:

Custom LLM providers (v1.1.0) — llm/define-provider lets you define providers entirely in Sema with a :complete lambda. llm/configure accepts any OpenAI-compatible endpoint via :base-url, so self-hosted models, proxy endpoints, and new providers work without waiting for native support.

Sandboxing (v1.2.0, v1.8.0) — --sandbox for capability-based permission denial, --allowed-paths for filesystem restriction with canonicalized path checks. WASM VFS quotas (1MB/file, 16MB total, 256 files max) prevent runaway memory in the browser playground.

Bytecode serialization (v1.7.0) — .semac binary format with a 24-byte header, deduplicated string table, and function table. sema compile produces bytecode files, sema disasm inspects them. The VM auto-detects .semac files on load.

Standalone executables (v1.11.0) — sema build traces imports recursively, bundles source into a VFS archive appended to the binary. The result is a single executable that runs without the Sema runtime installed. Cross-compilation via --target linux|macos|windows.

Code formatter (v1.11.0) — sema fmt with Lisp-aware indentation, comment preservation, and configurable style via sema.toml. A whole new crate (sema-fmt) that needed trivia token support in the lexer — comments and whitespace that the parser normally discards had to be preserved for formatting. Exposed in the WASM playground as a "Fmt" button.

Language features — Destructuring bind in let/define/lambda. Pattern matching (match). Multimethods (defmulti/defmethod). F-strings (f"Hello ${name}"). Threading macros (->, ->>, some->). Short lambdas (#(+ %1 %2)). Regex literals (#"pattern"). Auto-gensym (foo#) for hygienic macros. while loops.

Editor support — Tree-sitter grammar with an external scanner for nested block comments, Zed extension with Go to Symbol, VS Code/Vim/Emacs/Helix syntax files. Shell completions via sema completions --install.

Distribution — Homebrew tap (brew install helgesverre/tap/sema-lang), cargo-dist for multi-platform binaries, npm packages (@sema-lang/sema, @sema-lang/sema-wasm) for JavaScript embedding with pluggable VFS backends (Memory, LocalStorage, SessionStorage, IndexedDB).

Error messages — Colorized output with ANSI colors. Source line snippets with --> location markers and ^ caret pointers. Type errors show the offending value. Arity errors show the call form. Unbound variable errors suggest similar names using Levenshtein distance plus "veteran hints" — typing setq or funcall tells you the Sema equivalent.
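The "did you mean" part is a standard Levenshtein search over the bound names. A self-contained sketch of that piece (the distance cutoff and the hint table are assumptions on my part):

```rust
// "Did you mean?" suggestions: compute edit distance against every known
// binding and suggest the closest one, but only if it's plausibly a typo.

fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    // Classic two-row dynamic program: prev holds the previous row.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut cur = vec![i];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur.push((prev[j] + 1).min(cur[j - 1] + 1).min(prev[j - 1] + cost));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn suggest<'a>(unknown: &str, known: &[&'a str]) -> Option<&'a str> {
    known
        .iter()
        .map(|k| (levenshtein(unknown, k), *k))
        .filter(|(d, _)| *d <= 2) // only close matches are worth suggesting
        .min_by_key(|(d, _)| *d)
        .map(|(_, k)| k)
}

fn main() {
    let known = ["define", "string-length", "for-each", "println"];
    assert_eq!(suggest("defien", &known), Some("define"));
    assert_eq!(suggest("stirng-length", &known), Some("string-length"));
    assert_eq!(suggest("xyzzy", &known), None); // too far from everything
    println!("ok");
}
```

The veteran hints are the complement: an exact-match table checked first, so `setq` maps to a curated message rather than a fuzzy guess.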

How the Agent Workflow Evolved

The workflow from Part 1 — 2-3 agent sessions in parallel tabs — continued, but the nature of the work changed.

The Brainstorm-to-Backlog Pipeline

The biggest workflow evolution was using Claude.ai as a brainstorming partner and Amp as an execution engine. Claude.ai sessions were conversational — "look at my language, what's missing, what would you add?" — and produced the raw material. Then I'd create GitHub issues from the outputs, point Amp agents at the issues, and have them score items by effort/gain, identify dependencies, and produce implementation plans with numbered tasks.

This split worked because the two tools have different strengths. Claude.ai is better at freeform exploration — "what if we added a pipe operator? How would that interact with threading macros?" — while Amp is better at structured execution against a plan. Using both in sequence meant ideas got vetted before being implemented, and implementation had clear success criteria.

Agents Got Better at Architecture

The v1.0 work was mostly "implement this well-defined function." Post-v1.0, the tasks got more architectural: "design a bytecode instruction set," "add NaN-boxing to the Value type without breaking the stdlib," "implement upvalue capture for closures." These required more context, more iteration, and more of my attention on the design side.

The bytecode VM was the best example. I couldn't just say "build a VM" — I had to specify the instruction set design philosophy (register-free stack machine, sized operands, explicit tail call instructions), the compilation pipeline stages, and how native function interop should work. The agent did the implementation, but the architecture was a conversation.

The Dual-Eval Pattern

Once the VM existed, every new feature had to work in both backends. The dual_eval_tests! macro was the agents' idea — one test definition that runs through both the tree-walker and VM, asserting identical results. This caught dozens of subtle divergences: the VM returning nil where the tree-walker threw an error, match guard fallthrough behaving differently, prompt building values through different code paths.

This pattern — testing against two independent implementations of the same semantics — is something I'd do again for any project with multiple execution backends.

Security Hardening Was Agent-Driven

The bytecode serialization work (v1.7.0) is where agent-driven security review proved its value. I asked agents to review the .semac deserialization code for safety, and they found real issues: unchecked allocation sizes (DoS vector), missing section boundary enforcement, an unsafe Spur transmute that could produce dangling pointers. The fixes were methodical — recursion depth limits, allocation caps, section payload consumption verification, operand bounds checking. I wouldn't have been as thorough reviewing this manually.
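The allocation-cap fix follows a standard pattern worth showing: never allocate based on a length field you haven't validated. A sketch with illustrative limits and format details (not the actual .semac layout):

```rust
// Bounded deserialization of a length-prefixed string: validate the
// claimed length against both a hard cap and the remaining input
// *before* allocating anything.

const MAX_STRING_LEN: usize = 1 << 20; // illustrative 1 MiB cap

fn read_string(input: &[u8], pos: &mut usize) -> Result<String, String> {
    let len_bytes: [u8; 4] = input
        .get(*pos..*pos + 4)
        .ok_or("truncated length field")?
        .try_into()
        .unwrap();
    let len = u32::from_le_bytes(len_bytes) as usize;
    *pos += 4;

    // The DoS fix: a 4-byte field can claim up to 4 GiB. Reject anything
    // over the cap, and anything that doesn't fit in the remaining bytes.
    if len > MAX_STRING_LEN {
        return Err(format!("string length {len} exceeds cap"));
    }
    let bytes = input
        .get(*pos..*pos + len)
        .ok_or("string payload runs past end of section")?;
    *pos += len;
    String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())
}

fn main() {
    let mut data = 5u32.to_le_bytes().to_vec();
    data.extend_from_slice(b"hello");
    let mut pos = 0;
    assert_eq!(read_string(&data, &mut pos).unwrap(), "hello");

    // Hostile input: claims a 4 GiB string that isn't there.
    let evil = u32::MAX.to_le_bytes().to_vec();
    let mut pos = 0;
    assert!(read_string(&evil, &mut pos).is_err());
    println!("ok");
}
```

The same validate-before-allocate discipline applies to every variable-length section in the format: string tables, function tables, and operand arrays.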

Prototype-First for UI Work

The package registry prototypes taught me that static HTML mockups are an excellent shared artifact between me and agents. I describe the vibe ("dark theme, serif headings, gold accent, minimal"), agents produce complete pages with real content and styling, and those pages become the ground truth for the real implementation. No Figma, no design tokens, no component library — just HTML files that look exactly like the final product should look.

Fighting Documentation Drift

One meta-lesson: hardcoded counts in documentation go stale fast when agents are shipping features daily. The docs originally said "460+ builtins across 22 modules" — which was accurate for about three hours before the next feature merged. The fix was deliberate: a single commit replaced every specific count across 18 documentation files with durable phrasing. "460+ builtins" became "hundreds of built-in functions." "22 modules" became "a comprehensive standard library." Specific numbers were moved to auto-generated reference pages where they could be verified programmatically.

This is a small thing, but it matters. When you're shipping 10+ features per day with agents, anything that requires manual updating will be wrong within hours. Design your documentation for that cadence.

What I'd Do Differently

Start with the VM. The tree-walker was the right choice for the first five days — it's simpler to implement, easier to debug, and you get a working language faster. But if I'd known the project would continue, I'd have designed the value representation for VM execution from the start. NaN-boxing after the fact meant touching every crate, every pattern match on Value, every constructor call. It was a clean migration (the agents handled the mechanical parts well), but it would have been cheaper as a day-1 decision.

Design the module system earlier. The package manager and module imports were bolted on late. If I'd designed (import "pkg-name") resolution and sema.toml manifests earlier, several downstream features (build system, VFS interception) would have been simpler.

Keep the benchmark numbers honest. Part 1 presented the v1.0.1 benchmarks as the performance story. When NaN-boxing made the Docker numbers worse, there was a temptation to just not talk about it. The better approach: show both numbers, explain the trade-off, and let readers decide if native performance or emulated benchmark parity matters more to them.

Close the GitHub issues. Several issues (#7, #9, #10) are substantially implemented but still show as open because the original brainstorm documents contained more ideas than were implemented. The open issues give the wrong impression that these features don't exist. Better to close with a comment listing what shipped and what's deferred.

Where It's Going

Sema is a project I use for two things: as a practical tool for scripting LLM workflows, and as a testbed for human-agent collaboration patterns. Both continue.

On the language side, the package registry is live but needs more polish — GitHub OAuth for publishing, download counts, a proper search index. The VM needs more optimization passes. I'm exploring async evaluation for non-blocking LLM calls. The web server support opens up Sema as a backend scripting language, not just a CLI tool. And the brainstorming backlog in issues #8 and #11 still has ambitious items: defapi for auto-generating tools from OpenAPI specs, defpipe for typed LLM pipelines, and LLM-assisted macros that use models during code generation.

On the workflow side, every version of Sema teaches me something about working with agents at scale. The Part 1 lessons still hold — context management matters more than parallelism, curation is the job, architectural decisions need human attention. But the post-v1.0 work added new lessons: the brainstorm-to-backlog pipeline as a repeatable process, the value of static prototypes as shared artifacts, dual-eval testing for multi-backend correctness, agent-driven security review, and the importance of designing documentation to survive high-velocity development.

350 commits in 10 days. The tools keep getting better. The projects keep getting more ambitious. The skills keep shifting.


Sema is MIT licensed at github.com/HelgeSverre/sema. The documentation is at sema-lang.com and the playground is at sema.run.



