Earlier today phpmlkit/onnxruntime appeared on my radar — a brand new PHP library providing FFI bindings to Microsoft's ONNX Runtime. The repo had been public for just a few hours. It's designed for running ML models — think image classification, embeddings, that sort of thing. Simple stuff: feed in a tensor, get a tensor back.
But I immediately had a dumb question: could you run an actual LLM with it?
Not through an API. Not by shelling out to Python. But token-by-token autoregressive text generation, entirely in PHP, using the ONNX Runtime C API through FFI.
The answer is yes. It's slow. But it works.
Why This Is Harder Than It Sounds
Running an LLM isn't like running a classifier. A classifier is one forward pass: input goes in, prediction comes out. An LLM generates text one token at a time in a loop:
- Feed the prompt tokens into the model
- Get back logits (probability scores for every token in the vocabulary)
- Pick the next token
- Feed it back in, along with cached state from the previous step
- Repeat until you hit the end token or max length
That "cached state" is the KV-cache — key and value tensors from every attention layer in the transformer. For Granite 3.0 2B, that's 40 layers × 2 tensors (key + value) = 80 tensors that need to be passed between each inference call.
So each step involves calling InferenceSession::run() with 82 input tensors and receiving 81 output tensors.
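As a sanity check, that input count can be derived mechanically. A sketch (hypothetical helper, not part of the package) that builds the full input-name list for one decoding step, using the layer count described above:

```php
<?php

// Builds the list of input tensor names for one decoding step.
// 40 layers is the Granite 3.0 2B configuration described above.
function kvInputNames(int $numLayers = 40): array
{
    $names = ['input_ids', 'attention_mask'];
    for ($i = 0; $i < $numLayers; $i++) {
        $names[] = "past_key_values.$i.key";
        $names[] = "past_key_values.$i.value";
    }
    return $names;
}

// 2 non-cache inputs + 40 layers x 2 tensors = 82 inputs total
echo count(kvInputNames()), "\n"; // 82
```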
That's the kind of thing the PHP FFI bindings weren't specifically designed for, but nothing about it is technically impossible.
The Pieces
There are three things the onnxruntime PHP package doesn't ship that you need for text generation:
1. A BPE Tokenizer
LLMs work with token IDs, not raw text. Granite uses ByteLevel BPE (same family as GPT-2), so the first piece is a tokenizer that loads HuggingFace's tokenizer.json format directly:
$tokenizer = BpeTokenizer::fromFile('models/granite/tokenizer.json');
$ids = $tokenizer->encode("Hello, world!");
// [8279, 30, 5788, 19]
$text = $tokenizer->decode($ids);
// "Hello, world!"
The implementation handles the byte-to-unicode mapping, iterative BPE merges, special token splitting, and the full encode/decode round-trip. About 250 lines of PHP, no external dependencies.
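The merge step at the heart of that algorithm fits in a few lines. A minimal sketch (hypothetical helper, not the library's actual implementation), where `$merges` maps space-separated symbol pairs to a priority rank and lower ranks merge first:

```php
<?php

// Iteratively merge the highest-priority adjacent pair until no
// mergeable pairs remain. This is the core of BPE encoding.
function bpeMerge(array $symbols, array $merges): array
{
    while (count($symbols) > 1) {
        // Find the adjacent pair with the best (lowest) merge rank
        $bestRank = PHP_INT_MAX;
        $bestIdx = -1;
        for ($i = 0; $i < count($symbols) - 1; $i++) {
            $pair = $symbols[$i] . ' ' . $symbols[$i + 1];
            if (isset($merges[$pair]) && $merges[$pair] < $bestRank) {
                $bestRank = $merges[$pair];
                $bestIdx = $i;
            }
        }
        if ($bestIdx === -1) {
            break; // no mergeable pairs left
        }
        // Replace the pair with a single merged symbol
        $merged = $symbols[$bestIdx] . $symbols[$bestIdx + 1];
        array_splice($symbols, $bestIdx, 2, [$merged]);
    }
    return $symbols;
}

// With merges ['l o' => 0, 'lo w' => 1, 'h e' => 2],
// "hello" collapses to ['he', 'l', 'lo']
print_r(bpeMerge(['h', 'e', 'l', 'l', 'o'], ['l o' => 0, 'lo w' => 1, 'h e' => 2]));
```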
2. A Sampling Strategy
The model outputs raw logits — a float array with 49,155 entries (one per token in the vocabulary). You need to convert that into a single token ID. Options:
- Greedy: just pick the highest value (argmax)
- Temperature: scale the logits before sampling to control randomness
- Top-K: only consider the K most likely tokens
- Top-P (nucleus): sample from the smallest set of tokens whose cumulative probability exceeds a threshold
All of these are pure PHP math on arrays. Nothing exotic.
// Greedy (deterministic)
$sampler = Sampler::greedy();
// Creative (temperature + top-p)
$sampler = Sampler::creative(temperature: 0.7, topP: 0.9);
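Under the hood, the creative sampler boils down to a softmax and a cumulative cutoff. A sketch of temperature plus top-p sampling as a standalone function (hypothetical name, not the package's Sampler class):

```php
<?php

// Temperature + top-p (nucleus) sampling over raw logits.
function sampleTopP(array $logits, float $temperature = 0.7, float $topP = 0.9): int
{
    // Softmax with temperature; subtract the max for numerical stability
    $max = max($logits);
    $probs = [];
    foreach ($logits as $id => $logit) {
        $probs[$id] = exp(($logit - $max) / $temperature);
    }
    $sum = array_sum($probs);
    foreach ($probs as $id => $p) {
        $probs[$id] = $p / $sum;
    }

    // Keep the smallest set of tokens whose cumulative probability >= topP
    arsort($probs);
    $cumulative = 0.0;
    $nucleus = [];
    foreach ($probs as $id => $p) {
        $nucleus[$id] = $p;
        $cumulative += $p;
        if ($cumulative >= $topP) {
            break;
        }
    }

    // Sample from the renormalized nucleus
    $r = mt_rand() / mt_getrandmax() * array_sum($nucleus);
    foreach ($nucleus as $id => $p) {
        $r -= $p;
        if ($r <= 0) {
            return $id;
        }
    }
    return array_key_first($nucleus);
}
```

At temperature 0 this degenerates toward greedy; at top-p 1.0 it's plain temperature sampling.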
3. The Autoregressive Loop with KV-Cache
This is the core of it. The generation loop needs to:
- Create empty KV-cache tensors for the first step (shape [1, 8, 0, 64], i.e. zero sequence length)
- After each step, take the present.*.key / present.*.value output tensors and feed them back as the past_key_values.*.key / past_key_values.*.value inputs
- Only feed the newly generated token as input_ids (not the entire sequence again; that's what the KV-cache is for)
- Extend the attention_mask by one position each step
// Simplified version of the generation loop
for ($step = 0; $step < $maxTokens; $step++) {
$inputs = buildInputs($currentToken, $kvCache, $pastSeqLen);
$outputs = $session->run($inputs);
$logits = $outputs['logits']->toArray();
$nextToken = $sampler->sample($logits[0]);
if ($nextToken === $eosTokenId) break;
// The KV-cache output becomes the next step's input
$kvCache = extractKvCache($outputs);
$currentToken = $nextToken; // only the new token goes back in
$pastSeqLen += 1;
}
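The name-shuttling helper used in that loop is mostly bookkeeping. A sketch of extractKvCache, assuming $outputs is an array of tensors keyed by output name (the helper name and signature are illustrative):

```php
<?php

// Maps this step's present.{N}.key/value outputs to the
// past_key_values.{N}.key/value inputs expected by the next step.
function extractKvCache(array $outputs, int $numLayers = 40): array
{
    $cache = [];
    for ($i = 0; $i < $numLayers; $i++) {
        $cache["past_key_values.$i.key"]   = $outputs["present.$i.key"];
        $cache["past_key_values.$i.value"] = $outputs["present.$i.value"];
    }
    return $cache;
}
```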
The Model
I used IBM Granite 3.0 2B Instruct in ONNX format from the onnx-community organization on HuggingFace. Specifically the model_bnb4.onnx variant: a 4-bit bitsandbytes quantized version at 1.77GB as a single self-contained file. The KV-cache tensors are float32 in this variant, which makes them straightforward to work with through the FFI bindings.
The model has a chat template with special tokens:
<|start_of_role|>system<|end_of_role|>You are helpful.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>What is PHP?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
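Rendering that template from an array of messages is a simple string fold. A sketch (hypothetical helper; the library's formatChat presumably does something equivalent), which leaves the assistant turn open so the model completes it:

```php
<?php

// Formats chat messages into Granite's chat template using the
// special tokens shown above.
function formatGraniteChat(array $messages): string
{
    $prompt = '';
    foreach ($messages as $m) {
        $prompt .= "<|start_of_role|>{$m['role']}<|end_of_role|>{$m['content']}<|end_of_text|>\n";
    }
    // Open the assistant turn; generation continues from here
    $prompt .= '<|start_of_role|>assistant<|end_of_role|>';
    return $prompt;
}
```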
Running It
$generator = TextGenerator::fromFiles(
modelPath: 'models/granite/model_bnb4.onnx',
tokenizerPath: 'models/granite/tokenizer.json',
);
$prompt = $generator->formatChat([
['role' => 'system', 'content' => 'Be concise.'],
['role' => 'user', 'content' => 'What is PHP?'],
]);
$result = $generator->generate(
prompt: $prompt,
maxTokens: 100,
sampler: Sampler::greedy(),
onToken: fn($token) => print($token), // Stream to stdout
);
Performance
Let's be honest about this: it's not fast. On an M1 MacBook Pro:
- Model loading: ~1.5 seconds
- Token generation: ~3 tokens/second
- A 27-token response: 8.3 seconds
- A 149-token JSON extraction: 49.6 seconds
For comparison, Ollama runs Granite 3.3 2B at 50+ tokens/second on the same hardware using optimized GGUF format with Metal acceleration. We're about 15x slower.
But speed isn't really the point. The point is that it works at all — pure PHP, no Python runtime, no external services, fully offline. And there's significant room for improvement: adding IO binding support to the FFI bindings would eliminate the per-token memory allocation overhead, and the ONNX Runtime itself supports CoreML/Metal acceleration on macOS which could close much of the gap.
PDF Data Extraction
To make this practical, you can combine it with spatie/pdf-to-text for a data extraction pipeline:
// Extract text from PDF
$pdfText = (new Pdf())->setPdf('invoice.pdf')->text();
// Ask the LLM to extract structured data
$prompt = $generator->formatChat([
['role' => 'system', 'content' => 'Extract data as JSON. No explanation.'],
['role' => 'user', 'content' => "Extract: company, date, total, line_items.\n\n$pdfText"],
]);
$json = $generator->generate($prompt, maxTokens: 512);
$data = json_decode($json, true);
This gives you a fully offline document processing pipeline. No API keys, no network, no external dependencies beyond
pdftotext and the ONNX model file.
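One practical caveat: small models sometimes wrap their JSON in prose or markdown fences, which makes a bare json_decode return null. A defensive sketch (hypothetical helper) that pulls out the first balanced-looking {...} span before decoding:

```php
<?php

// Extracts and decodes the first {...} block from model output,
// tolerating surrounding prose or markdown code fences.
function extractJson(string $output): ?array
{
    $start = strpos($output, '{');
    $end = strrpos($output, '}');
    if ($start === false || $end === false || $end < $start) {
        return null;
    }
    $candidate = substr($output, $start, $end - $start + 1);
    $data = json_decode($candidate, true);
    return is_array($data) ? $data : null;
}
```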
What I Learned
The ONNX Runtime C API is more capable than the PHP bindings expose. The PHP library wraps 76 of 212 available C API functions. For basic inference this is fine, but for LLM generation, you're missing IO binding — the ability to pre-allocate tensor memory and reuse it across calls. Without it, every token generation allocates and deallocates 80+ tensors. Adding IO binding support (about 11 C API functions) would significantly improve throughput.
ByteLevel BPE tokenization is simpler than it looks. The core algorithm is ~50 lines: build a byte mapping, split on
a regex, iteratively merge character pairs using a priority list. The HuggingFace tokenizer.json format is
self-contained and easy to parse.
KV-cache management is the hard part. Not algorithmically hard — just tedious. You need to create correctly-shaped
empty tensors for the first step, then shuttle 80 tensors between steps, matching input names
(past_key_values.{N}.key) to output names (present.{N}.key).
PHP's FFI is genuinely useful for this. The ability to create typed C buffers, pass them to native code, and get results back without serialization overhead makes this viable. You couldn't do this with a pure PHP ONNX parser.
Should You Use This?
Probably not in production. But there are legitimate use cases:
- Air-gapped environments where you can't call external APIs
- Edge processing where you want to keep data local
- Prototyping ML pipelines in PHP before moving to a more optimized stack
- Learning how LLM inference actually works under the hood
The code is on a branch at HelgeSverre/onnxruntime if you want to try it yourself.
What's Next
The obvious improvement is adding IO binding support to the PHP FFI wrapper — this would let you pre-allocate all tensor
memory once and reuse it across the generation loop, eliminating the allocation overhead per token. The 11 missing C API
functions (CreateIoBinding, BindInput, BindOutput, RunWithBinding, etc.) are well-documented and straightforward
to wrap.
Beyond that, someone could build a proper onnxruntime-genai binding for PHP, which would handle the generation loop at
the C level instead of in PHP. But honestly, the pure PHP approach has a certain charm to it.
