Helge Sverre, All-stack Developer (Bergen, Norway)
Building sql-splitter: Correctness Is the Product
February 24, 2026

sql-splitter shipped nine subcommands in 48 hours. Split, merge, analyze, validate, sample, shard, convert, diff, redact — all working, all tested. AI agents are excellent at building new commands when the architecture is clean.

That's the fast part. It makes for a good demo. But shipping fast doesn't mean shipping correctly, and correctly is the only thing that matters when someone points your tool at a production database dump.

This is about what happened after the fast part.

The Origin

The project started in October 2025 as a Go tool. A simple need: split a large mysqldump file into individual table files. The Go version got to 314 MB/s on the first night — fast enough to be useful, not interesting enough to keep working on.

Two and a half months later I came back to it with a different ambition. Not just MySQL — PostgreSQL and SQLite too, with MSSQL following later. Not just splitting — the full lifecycle of working with SQL dump files. And I wanted streaming I/O that could handle files larger than RAM without breaking a sweat.

The Go implementation was deleted. The Rust rewrite started December 20th.

The Fast Part

v1.0.0 through v1.6.0 shipped on December 20th. v1.7.0 through v1.10.0 shipped December 21st. Nine subcommands plus multi-dialect and compression support in two days:

Version   Command    What It Does
v1.0.0    split      Split dump files into per-table files
v1.0.0    analyze    Statistics: table count, INSERT count, bytes per table
v1.1.0               Multi-dialect: MySQL, PostgreSQL, SQLite
v1.3.0               Compressed files: gzip, bzip2, xz, zstd (auto-detected)
v1.4.0    merge      Combine split files back into a single dump
v1.5.0    sample     FK-aware sampling for dev/test databases
v1.6.0    shard      Extract tenant-specific data from multi-tenant dumps
v1.7.0    convert    Convert between MySQL, PostgreSQL, and SQLite dialects
v1.8.0    validate   Check dump integrity, FK consistency, data type validation
v1.9.0    diff       Compare two dumps: schema changes, data changes
v1.10.0   redact     Anonymize PII with 7 strategies (null, hash, mask, fake…)

Each command was a well-defined Build session — the kind of work AI agents handle cleanly. I'd write a spec with the CLI flags, the input/output contract, and the edge cases, then let the agent implement it. The streaming architecture made this possible: every command reads from the same parser and writes through the same buffered writer pool. Adding a new command meant adding a new consumer, not a new pipeline.

v1.11.0 (graph — ERD generation) and v1.12.0 (query — embedded DuckDB for SQL analytics on dump files) followed within the week. By December 27th, sql-splitter had 12 subcommands in src/cmd/, plus utility commands like completions and schema. The codebase was around 54,000 lines of Rust across 929 tests.

The architecture was clean. The tests were green. None of this meant it worked on real SQL files.

The Benchmark Story

I benchmarked sql-splitter against competitor tools in Docker for reproducibility. The suite started with 6 tools in late December and grew to 10 by late January as I discovered more competitors. The results were humbling.

100MB Test File (February 2026, 10 tools)

Tool                            Mean      Throughput   Relative
mysqldbsplit (PHP)              84 ms     1232 MB/s    1.00 (fastest)
mysql-dump-splitter (Go)        95 ms     1091 MB/s    1.13x slower
mysqldump-splitter (Rust)       108 ms    960 MB/s     1.28x slower
mysqldumpsplit (Go)             150 ms    689 MB/s     1.79x slower
sql-splitter (Rust)             226 ms    457 MB/s     2.70x slower
mysql_splitdump (csplit)        264 ms    392 MB/s     3.14x slower
mysqldumpsplit (Node.js)        424 ms    244 MB/s     5.06x slower
mysql-dump-split (Ruby)         919 ms    112 MB/s     10.9x slower
mysqldumpsplitter (Bash/awk)    956 ms    108 MB/s     11.4x slower
extract-mysql-dump (Python)     1363 ms   76 MB/s      16.2x slower

A PHP tool is the fastest splitter in the benchmark. Not marginally — 2.7x faster than sql-splitter and faster than every compiled tool I tested. I've verified this across multiple runs over two months. It's real.

The reason: mysqldbsplit doesn't parse SQL. It scans for mysqldump's comment markers (-- Table structure for table) and splits on those boundaries. It's a string search, not a parser. That's extremely fast — and it works perfectly on mysqldump output.
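The marker-scanning approach can be sketched in a few lines. This is a hypothetical simplification for illustration, not mysqldbsplit's actual code: look for mysqldump's comment marker and start a new output chunk at each boundary.

```rust
/// Split dump text on mysqldump's "-- Table structure for table" markers.
/// A sketch of marker-based splitting: a string search, not a parser.
/// Fast, but it only works on dumps that contain these exact comments.
fn split_on_markers(dump: &str) -> Vec<(String, String)> {
    const MARKER: &str = "-- Table structure for table";
    let mut chunks: Vec<(String, String)> = Vec::new();
    for line in dump.lines() {
        if let Some(rest) = line.strip_prefix(MARKER) {
            // The table name appears backtick-quoted after the marker.
            let name = rest.trim().trim_matches('`').to_string();
            chunks.push((name, String::new()));
        }
        // Lines before the first marker (header comments) are skipped.
        if let Some((_, body)) = chunks.last_mut() {
            body.push_str(line);
            body.push('\n');
        }
    }
    chunks
}
```

Point this at anything that lacks the marker comments and it returns zero chunks, which is exactly the failure mode described below.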

5GB Stress Test (December 2025, 6 tools)

Tool                            Time     Throughput   Relative
sql-splitter (Rust)             18.4 s   283 MB/s     1.00 (fastest)
mysqldumpsplit (Go)             27.1 s   191 MB/s     1.47x slower
mysqldumpsplit (Node.js)        28.7 s   181 MB/s     1.56x slower
mysqldumpsplitter (Bash/awk)    55.5 s   94 MB/s      3.02x slower
mysql_splitdump (csplit)        82.5 s   63 MB/s      4.48x slower
mysql-dump-split (Ruby)         103 s    50 MB/s      5.60x slower

At 5GB, sql-splitter is the fastest tool. The Go competitor that was faster at smaller sizes buffers everything in memory — at scale, that strategy falls apart. The Go tool also deadlocks on non-interleaved dumps (all INSERTs for table A, then all for table B); I had to fork and patch it to even include it in the benchmarks.

sql-splitter uses streaming I/O: 64KB read buffer, 256KB write buffers per table, periodic flushes. For streaming commands like split and analyze, peak memory stays around 10-15MB regardless of file size. Commands that need broader context — validate with FK checking, diff comparing two dumps — use more, but the core splitting pipeline scales linearly.
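The shape of that pipeline can be sketched with the buffer sizes above. This is an illustrative reduction, not sql-splitter's actual code, and `table_for_line` is a hypothetical stand-in for the real parser's table detection:

```rust
use std::collections::HashMap;
use std::io::{BufRead, BufReader, BufWriter, Write};

/// Streaming split sketch: a 64KB read buffer feeding per-table 256KB
/// write buffers, so peak memory stays flat regardless of input size.
/// Output goes to in-memory Vec<u8> sinks here for simplicity; the
/// real tool writes per-table files.
fn stream_split<R: std::io::Read>(
    input: R,
    table_for_line: impl Fn(&str) -> Option<String>,
) -> std::io::Result<HashMap<String, Vec<u8>>> {
    let reader = BufReader::with_capacity(64 * 1024, input);
    let mut writers: HashMap<String, BufWriter<Vec<u8>>> = HashMap::new();
    let mut current: Option<String> = None;
    for line in reader.lines() {
        let line = line?;
        if let Some(t) = table_for_line(&line) {
            current = Some(t); // routing switches when a new table starts
        }
        if let Some(table) = &current {
            let w = writers
                .entry(table.clone())
                .or_insert_with(|| BufWriter::with_capacity(256 * 1024, Vec::new()));
            writeln!(w, "{}", line)?;
        }
    }
    // into_inner flushes each per-table buffer and hands back the bytes.
    let mut out = HashMap::new();
    for (table, w) in writers {
        out.insert(table, w.into_inner().map_err(|e| e.into_error())?);
    }
    Ok(out)
}
```

Because every command consumes the same reader-side loop, adding a command means adding a new consumer, not a new pipeline.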

The Real Differentiator

But speed isn't the actual differentiator. This is:

Every competitor only works with standard mysqldump output. They scan for comment markers that mysqldump generates. Point them at a TablePlus export, a DBeaver export, a pg_dump file, or a sqlite3 .dump — they produce zero tables.

sql-splitter parses actual SQL statements. CREATE TABLE, INSERT INTO, COPY FROM stdin, GO batch separators. It works on any valid SQL file from any tool in any of the four supported dialects. That's slower than scanning for comments, but it's the only approach that generalizes.
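Dispatching on the statement itself rather than on tool-specific comments can be sketched like this (a simplified illustration, not the real parser, which handles far more statement kinds):

```rust
/// Dialect-agnostic statement classification: key off the statement's
/// leading keyword, so TablePlus, DBeaver, pg_dump, and sqlite3 output
/// all route the same way. Sketch only; real detection is byte-level.
#[derive(Debug, PartialEq)]
enum Stmt {
    CreateTable,
    Insert,
    CopyFrom, // PostgreSQL COPY ... FROM stdin
    Batch,    // MSSQL GO separator
    Other,
}

fn classify(stmt: &str) -> Stmt {
    let upper = stmt.trim_start().to_ascii_uppercase();
    if upper.starts_with("CREATE TABLE") {
        Stmt::CreateTable
    } else if upper.starts_with("INSERT INTO") {
        Stmt::Insert
    } else if upper.starts_with("COPY ") && upper.contains("FROM STDIN") {
        Stmt::CopyFrom
    } else if upper.trim_end() == "GO" {
        Stmt::Batch
    } else {
        Stmt::Other
    }
}
```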

Publishing these benchmarks — including the ones where I lose — was a deliberate choice. If you're evaluating tools and you only need mysqldump format on files under 1GB, mysqldbsplit is genuinely the better tool. I'd rather tell you that and earn trust than hide the numbers.

What Real-World Testing Found

Generated test data is clean. It has uniform encoding, consistent quoting, no surprises. Real SQL dumps have all the surprises.

sql-splitter's real-world test suite downloads 27 public SQL dumps — MySQL's Sakila, the PostgreSQL Pagila port, Chinook, Northwind, Employees, AdventureWorks — and runs split, validate, convert, query, graph, and redact against each one. The bugs this found were not the kind you catch with unit tests.

The 375-900x Regression

The query command loads SQL into an embedded DuckDB instance for analytics. On PostgreSQL's Pagila dataset, it was taking 15-27 seconds. The same file should process in about 0.03 seconds.

Root cause: an accidental O(n²) path triggered by pg_dump's formatting. pg_dump puts comments before COPY blocks:

--
-- Data for Name: actor; Type: TABLE DATA; Schema: public
--

COPY public.actor (actor_id, first_name, ...) FROM stdin;
1	PENELOPE	GUINESS	2006-02-15 09:34:33

Semicolons inside those comments were treated as statement terminators. COPY mode was detected too late. The parser ended up repeatedly re-processing a growing buffer. The file "worked" — it just got catastrophically slower as input grew, showing clear O(n²) behavior: 1.85 seconds at 20k lines, 8.71 seconds at 30k lines.

The fix touched multiple interacting pieces: comment tracking in the statement reader, proactive table-existence checks to skip COPY data for missing tables, explicit COPY mode management, and leading comment stripping. The regression test suite grew by 16 PostgreSQL COPY edge cases: comments before COPY, schema-prefixed table names, single-column tables, escape sequences, unicode data, empty values vs NULLs.
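The heart of the comment-tracking fix can be sketched as a small state machine. This is far simpler than the real statement reader, which also tracks string literals, block comments, and COPY mode:

```rust
/// Find the byte offset of the first semicolon that actually terminates
/// a statement, skipping semicolons inside `--` line comments -- the
/// pg_dump comments above contain semicolons that must not count.
fn statement_end(sql: &str) -> Option<usize> {
    let bytes = sql.as_bytes();
    let mut in_line_comment = false;
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\n' => in_line_comment = false, // comments end at the newline
            b'-' if !in_line_comment && bytes.get(i + 1) == Some(&b'-') => {
                in_line_comment = true;
            }
            b';' if !in_line_comment => return Some(i),
            _ => {}
        }
        i += 1;
    }
    None
}
```

Without the comment flag, the semicolon in "-- Data for Name: actor; ..." splits the buffer mid-comment, and the reader keeps rescanning a growing remainder, which is where the O(n²) came from.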

The BIGINTernal_note Bug

The query command converts MySQL types to DuckDB types via regex — INT → INTEGER, TINYINT → TINYINT, and so on. The regex matched substrings in column names:

-- Input
CREATE TABLE tickets (
  id INT PRIMARY KEY,
  internal_note TEXT
);

-- Output (broken)
CREATE TABLE tickets (
  id INTEGER PRIMARY KEY,
  BIGINTernal_note TEXT  -- oops
);

This is the kind of bug real-world dumps surface immediately. Verification against production dumps — taskflow_production.sql (62 tables), boatflow_latest_2.sql (52 tables) — exposed it. Generated test fixtures don't have columns named after SQL types. Real databases do.

The fix was ensuring the type conversion regex only matched complete type tokens, not substrings inside identifiers. Eight new tests cover column names containing type substrings.
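The boundary check can be sketched without a regex engine: a type keyword only counts as a match when the characters on both sides are not identifier characters. This is a simplified, case-sensitive stand-in for the actual fix:

```rust
/// Replace a type keyword only when it appears as a complete token,
/// never as a substring of an identifier like `internal_note`.
/// Sketch: case-sensitive, one keyword per call; the real conversion
/// uses word-boundary regexes over a full type table.
fn replace_type_token(sql: &str, from: &str, to: &str) -> String {
    let is_ident = |c: char| c.is_ascii_alphanumeric() || c == '_';
    let mut out = String::new();
    let mut rest = sql;
    let mut consumed = 0; // absolute offset of `rest` within `sql`
    while let Some(pos) = rest.find(from) {
        let abs = consumed + pos;
        // A token boundary: no identifier character directly before or after.
        let before_ok = abs == 0 || !is_ident(sql[..abs].chars().next_back().unwrap());
        let after = abs + from.len();
        let after_ok = after >= sql.len() || !is_ident(sql[after..].chars().next().unwrap());
        out.push_str(&rest[..pos]);
        out.push_str(if before_ok && after_ok { to } else { from });
        rest = &rest[pos + from.len()..];
        consumed = after;
    }
    out.push_str(rest);
    out
}
```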

The Lost Data

PostgreSQL's COPY format uses tab-separated values, which means rows from a single-column table contain no tabs at all:

COPY single_col_table FROM stdin;
value1
value2
\.

The looks_like_copy_data() function checked for the presence of tab characters. Single-column data has no tabs, so it was classified as non-COPY data. The data was silently dropped. Subsequent SQL statements that referenced those rows would fail with cryptic errors.

This was found by the postgres-periodic test case — a small dataset with lookup tables that have single-column foreign key references.
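The robust fix is stateful: once a COPY ... FROM stdin; header has been seen, every line up to the \. terminator is data, tabs or no tabs. A minimal sketch of that state machine (illustrative, not the actual implementation):

```rust
/// Count data rows in COPY blocks by tracking COPY mode explicitly
/// instead of sniffing for tab characters -- so rows from single-column
/// tables (which contain no tabs) are still recognized as data.
fn count_copy_rows(dump: &str) -> usize {
    let mut in_copy = false;
    let mut rows = 0;
    for line in dump.lines() {
        if in_copy {
            if line == "\\." {
                in_copy = false; // end-of-data terminator
            } else {
                rows += 1; // every line inside COPY mode is data
            }
        } else if line.starts_with("COPY ") && line.trim_end().ends_with("FROM stdin;") {
            in_copy = true;
        }
    }
    rows
}
```

The broken looks_like_copy_data() heuristic answered "is this line data?" per line; the state machine answers it per block, which is the only question that has a reliable answer.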

SQLite AUTOINCREMENT

SQLite dumps contain INTEGER PRIMARY KEY AUTOINCREMENT. DuckDB doesn't support AUTOINCREMENT. Every SQLite table with an auto-incrementing primary key failed to import.

Not a subtle bug — but not one that generated test data would catch, because the fixture generator uses DuckDB-compatible syntax.

The Pattern

Every one of these bugs has the same shape: generated test data doesn't contain it, real SQL dumps do.

The fix isn't just patching each bug. It's building a test suite that exercises the full surface area of real SQL. The 27 public dumps in the real-world test suite are there because:

  • Sakila/Pagila cover MySQL and PostgreSQL with foreign keys, views, triggers, stored procedures
  • Employees is large enough to exercise streaming (300K+ employees with dependent tables)
  • Northwind has every data type: dates, decimals, binary, long text
  • Chinook tests cross-dialect conversion (available in all four dialects)
  • AdventureWorks has schema-prefixed tables, unicode data, complex constraints

Each bug found through real-world testing becomes a regression test. The test suite grows monotonically. Today it has 929 tests, and the real-world subset runs against every PR in CI.

Product Decisions

The command list didn't grow randomly. Each addition had a specific use case:

sample --preserve-relations exists because every team that works with production dumps needs a smaller version for local development. Naive sampling breaks foreign keys — you sample 10% of orders but the referenced customers aren't in the sample. FK-aware sampling walks the dependency graph and includes parent rows automatically.

redact exists because GDPR. You need to anonymize production data before sharing it with developers or third parties. Seven strategies — null it, hash it, mask it (show first/last N characters), replace with fake data, shuffle within column, skip the table entirely — cover most anonymization requirements without a separate tool.

query exists because sometimes you need to answer a question about a dump without importing it into a running database. "How many orders are in this backup?" shouldn't require spinning up a MySQL instance. DuckDB is embedded and compiled into the binary — zero external dependencies.

convert exists because database migrations happen. Converting a MySQL dump to PostgreSQL syntax — backtick quoting to double-quote, AUTO_INCREMENT to SERIAL, TINYINT(1) to BOOLEAN, backslash escaping to dollar-quoting — is mechanical but error-prone. Getting it right for all edge cases across four dialects is exactly the kind of exhaustive work that agents handle well.

diff exists because deployments need verification. Compare the dump before and after a migration: which tables changed, which columns were added, which rows were modified. Schema diff plus data diff in a single command.

How It Was Built

The methodology was the same framework described in Building Token — Build/Improve/Sweep modes, feature specs before implementation, gap documents for partially-complete features. But the project history shaped the architecture in ways that a greenfield build wouldn't have.

The Rust rewrite inherited Go's lessons. The original Go implementation hit 314 MB/s on its first night — fast enough to validate the approach. Over two months of occasional use, it revealed which optimizations actually mattered: Peek/Discard on the read buffer for zero-copy scanning (19% improvement over naive reads), hand-rolled byte scanning for CREATE TABLE/INSERT INTO markers (4.9x faster than regex-only parsing), and specific buffer sizes (64KB read, 256KB write) tuned for CPU cache behavior. When the Rust rewrite started, these weren't things to discover — they were things to port. The Rust architecture used fill_buf/consume from BufRead (the equivalent of Go's Peek/Discard), memchr for SIMD-accelerated byte searching, and ahash for the writer pool lookups. The Go implementation was deleted, not abandoned — its optimizations lived on in different syntax.

The architecture had natural command boundaries. Each subcommand is a module in src/cmd/ that consumes the shared parser. Adding redact doesn't touch validate. Adding query doesn't touch convert. This meant Build sessions could run with minimal context — just the parser API, the command spec, and the test patterns from existing commands. More parallel, less coordination. Most new commands were a single Build session with a spec listing CLI flags, input/output contract, and edge cases — the agent implemented it against the parser API without needing to understand how other commands worked.

The query command broke this pattern. Embedding DuckDB meant building a second transformation pipeline: SQL-to-DuckDB type conversion, MySQL/PostgreSQL/SQLite syntax stripping, bulk loading via the Appender API. This pipeline had its own bugs independent of the parser — the BIGINTernal_note regex, the SQLite AUTOINCREMENT stripping, the COPY performance regression. The query command accounted for most of the v1.12.x bugfix releases because it was the most complex command and the last one to get real-world testing. Every other command consumed the parser's output directly; query transformed it into a different database's dialect, which doubled the surface area for bugs.

Real-world testing replaced gap documents. For Token, gap documents tracked what was partially working. For sql-splitter, the real-world test suite served the same purpose — but better, because it found gaps I didn't know existed. I never would have written "test column names that contain SQL type keywords" in a gap document. The Sakila database found it for me.

What I'd Do Again

Benchmark against competitors early. The competitive benchmark suite forced honesty about performance and identified sql-splitter's actual value proposition — not speed, but format compatibility and streaming architecture. If I hadn't benchmarked, I'd probably be optimizing the wrong things.

Download real SQL dumps. Generated test data is necessary but not sufficient. The 27-dump real-world test suite caught bugs that no amount of unit testing would find. The cost of maintaining it (download caching, CI bandwidth) is trivial compared to the bugs it catches.

Publish honest numbers. Showing that a PHP tool beats you builds more credibility than hiding it. The people evaluating your tool will benchmark it themselves anyway — you might as well show them you already know.

What I'd Change

Run real-world tests against every code path. There was a real-world verification script from day one — a bash script that downloaded 11 public SQL dumps and ran split and analyze against them. The split command worked fine on real data from the start. But when the query command shipped on December 26th with its own SQL-to-DuckDB transformation pipeline, I didn't run the Sakila dump through it until December 27th. Every real-world bug that week was in the new code path, not the original one.

Invest in profiling infrastructure earlier. The memory profiling script (scripts/profile-memory.sh) with size tiers from tiny (0.5MB) to giga (10GB) should have existed before the first optimization, not after. Profiling without reproducible fixtures is guessing.

Fewer versions, more testing between them. Shipping v1.0 through v1.10 in 48 hours meant each version had minimal testing before the next feature landed. December 27th saw six releases in a single day — v1.12.1 through v1.12.6 — three fixing real-world bugs in the query command, three adding MSSQL support and completing redact functionality. That density suggests the preceding releases moved too fast. Velocity is not velocity if you're shipping bugs.


sql-splitter is MIT licensed at github.com/HelgeSverre/sql-splitter. The documentation and benchmarks are at sql-splitter.dev.




<!-- generated with nested tables and zero regrets -->