Helge Sverre, All-stack Developer
Bergen, Norway
Running LLMs in PHP with ONNX Runtime: Offline AI Text Generation Without Python
March 30, 2026

Earlier today phpmlkit/onnxruntime appeared on my radar — a brand new PHP library providing FFI bindings to Microsoft's ONNX Runtime. The repo had been public for just a few hours. It's designed for running ML models — think image classification, embeddings, that sort of thing. Simple stuff: feed in a tensor, get a tensor back.

But I immediately had a dumb question: could you run an actual LLM with it?

Not through an API. Not by shelling out to Python. But token-by-token autoregressive text generation, entirely in PHP, using the ONNX Runtime C API through FFI.

The answer is yes. It's slow. But it works.

Why This Is Harder Than It Sounds

Running an LLM isn't like running a classifier. A classifier is one forward pass: input goes in, prediction comes out. An LLM generates text one token at a time in a loop:

  1. Feed the prompt tokens into the model
  2. Get back logits (probability scores for every token in the vocabulary)
  3. Pick the next token
  4. Feed it back in, along with cached state from the previous step
  5. Repeat until you hit the end token or max length

That "cached state" is the KV-cache — key and value tensors from every attention layer in the transformer. For Granite 3.0 2B, that's 40 layers × 2 tensors (key + value) = 80 tensors that need to be passed between each inference call.

So each step involves calling InferenceSession::run() with 82 input tensors (the 80 cache tensors plus input_ids and attention_mask) and receiving 81 outputs (the 80 updated cache tensors plus the logits). That's not the kind of workload the PHP FFI bindings were specifically designed for, but nothing about it is technically impossible.

The Pieces

There are three things the onnxruntime PHP package doesn't ship that you need for text generation:

1. A BPE Tokenizer

LLMs work with token IDs, not raw text. Granite uses ByteLevel BPE (same family as GPT-2), so the first piece is a tokenizer that loads HuggingFace's tokenizer.json format directly:

$tokenizer = BpeTokenizer::fromFile('models/granite/tokenizer.json');

$ids = $tokenizer->encode("Hello, world!");
// [8279, 30, 5788, 19]

$text = $tokenizer->decode($ids);
// "Hello, world!"

The implementation handles the byte-to-unicode mapping, iterative BPE merges, special token splitting, and the full encode/decode round-trip. About 250 lines of PHP, no external dependencies.
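The merge step at the heart of that implementation can be sketched in a few lines. This is an illustrative standalone version, not the library's code: it assumes `$ranks` maps `"left right"` pair strings to their position in tokenizer.json's `merges` list (lower rank merges first), and it skips the byte-to-unicode mapping and pre-tokenization regex that the full tokenizer also needs.

```php
<?php

// Iteratively merge the best-ranked adjacent pair until none remain.
function bpeMerge(array $symbols, array $ranks): array
{
    while (count($symbols) > 1) {
        // Find the adjacent pair with the lowest (best) merge rank
        $bestRank = PHP_INT_MAX;
        $bestIdx = -1;
        for ($i = 0; $i < count($symbols) - 1; $i++) {
            $pair = $symbols[$i] . ' ' . $symbols[$i + 1];
            if (isset($ranks[$pair]) && $ranks[$pair] < $bestRank) {
                $bestRank = $ranks[$pair];
                $bestIdx = $i;
            }
        }
        if ($bestIdx === -1) {
            break; // no mergeable pair left
        }

        // Collapse the winning pair into a single symbol
        $merged = $symbols[$bestIdx] . $symbols[$bestIdx + 1];
        array_splice($symbols, $bestIdx, 2, [$merged]);
    }
    return $symbols;
}
```

With ranks `['h e' => 0, 'he l' => 1, 'l o' => 2]`, the symbols `h e l l o` collapse to `hel` and `lo` over three passes, which is exactly the greedy-by-rank behavior the tokenizer.json format specifies.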

2. A Sampling Strategy

The model outputs raw logits — a float array with 49,155 entries (one per token in the vocabulary). You need to convert that into a single token ID. Options:

  • Greedy: just pick the highest value (argmax)
  • Temperature: scale the logits before sampling to control randomness
  • Top-K: only consider the K most likely tokens
  • Top-P (nucleus): only consider tokens whose cumulative probability exceeds a threshold

All of these are pure PHP math on arrays. Nothing exotic.

// Greedy (deterministic)
$sampler = Sampler::greedy();

// Creative (temperature + top-p)
$sampler = Sampler::creative(temperature: 0.7, topP: 0.9);
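Under the hood, "pure PHP math on arrays" looks roughly like this. A minimal sketch of temperature plus top-p sampling, with illustrative function names rather than the library's actual Sampler internals:

```php
<?php

// Temperature-scaled softmax over raw logits.
function softmax(array $logits, float $temperature = 1.0): array
{
    $scaled = array_map(fn($l) => $l / $temperature, $logits);
    $max = max($scaled); // subtract max for numerical stability
    $exp = array_map(fn($l) => exp($l - $max), $scaled);
    $sum = array_sum($exp);
    return array_map(fn($e) => $e / $sum, $exp);
}

// Nucleus (top-p) sampling: draw from the smallest set of tokens
// whose cumulative probability reaches the threshold.
function sampleTopP(array $logits, float $temperature, float $topP): int
{
    $probs = softmax($logits, $temperature);
    arsort($probs); // sort descending, keeping token IDs as array keys

    $cumulative = 0.0;
    $candidates = [];
    foreach ($probs as $tokenId => $p) {
        $candidates[$tokenId] = $p;
        $cumulative += $p;
        if ($cumulative >= $topP) {
            break;
        }
    }

    // Draw from the renormalized candidate distribution
    $r = mt_rand() / mt_getrandmax() * $cumulative;
    foreach ($candidates as $tokenId => $p) {
        $r -= $p;
        if ($r <= 0) {
            return $tokenId;
        }
    }
    return array_key_first($candidates);
}
```

Greedy falls out as the degenerate case: with a very small top-p the candidate set collapses to the argmax token.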

3. The Autoregressive Loop with KV-Cache

This is the core of it. The generation loop needs to:

  • Create empty KV-cache tensors for the first step (shape [1, 8, 0, 64] — zero sequence length)
  • After each step, take the present.*.key / present.*.value output tensors and feed them back as past_key_values.*.key / past_key_values.*.value inputs
  • Only feed the newly generated token as input_ids (not the entire sequence again — that's what the KV-cache is for)
  • Extend the attention_mask by one position each step

// Simplified version of the generation loop
for ($step = 0; $step < $maxTokens; $step++) {
    $inputs = buildInputs($currentToken, $kvCache, $pastSeqLen);
    $outputs = $session->run($inputs);

    $logits = $outputs['logits']->toArray();
    $nextToken = $sampler->sample($logits[0]);

    if ($nextToken === $eosTokenId) break;

    // The KV-cache output becomes the next step's input
    $kvCache = extractKvCache($outputs);
    $pastSeqLen += 1;

    // Feed the sampled token back in as the next step's input
    $currentToken = $nextToken;
}
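The loop leans on a buildInputs() helper. A hedged sketch of what it has to assemble, with each tensor represented here as a plain [shape, flat data] pair since the actual tensor construction depends on the FFI bindings' API:

```php
<?php

// Illustrative sketch of buildInputs(), not the real implementation.
// $kvCache already uses past_key_values.{N}.{key,value} names.
function buildInputs(int $currentToken, array $kvCache, int $pastSeqLen): array
{
    $inputs = [
        // After the first step only the new token is fed in —
        // the KV-cache carries the rest of the sequence.
        'input_ids'      => [[1, 1], [$currentToken]],
        // The attention mask covers all past tokens plus the current one.
        'attention_mask' => [[1, $pastSeqLen + 1], array_fill(0, $pastSeqLen + 1, 1)],
    ];

    // The 80 cached tensors: past_key_values.{layer}.{key,value}
    // for each of the 40 layers.
    foreach ($kvCache as $name => $tensor) {
        $inputs[$name] = $tensor;
    }
    return $inputs;
}
```

Counting confirms the 82 inputs per step: 80 cache tensors plus input_ids and attention_mask.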

The Model

I used IBM Granite 3.0 2B Instruct in ONNX format from the onnx-community on HuggingFace. Specifically the model_bnb4.onnx variant — a 4-bit bitsandbytes quantized version at 1.77GB as a single self-contained file. The KV-cache tensors are float32 in this variant, which makes them straightforward to work with through the FFI bindings.

The model has a chat template with special tokens:

<|start_of_role|>system<|end_of_role|>You are helpful.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>What is PHP?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
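Rendering that template is just string concatenation. A sketch of what formatChat() presumably produces (illustrative, not the library's code) — note the trailing open assistant turn, which is what prompts the model to continue:

```php
<?php

// Render messages into the Granite chat template shown above.
function formatChat(array $messages): string
{
    $out = '';
    foreach ($messages as $m) {
        $out .= "<|start_of_role|>{$m['role']}<|end_of_role|>"
              . $m['content']
              . "<|end_of_text|>\n";
    }
    // Open the assistant turn so generation picks up from here
    $out .= '<|start_of_role|>assistant<|end_of_role|>';
    return $out;
}
```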

Running It

$generator = TextGenerator::fromFiles(
    modelPath: 'models/granite/model_bnb4.onnx',
    tokenizerPath: 'models/granite/tokenizer.json',
);

$prompt = $generator->formatChat([
    ['role' => 'system', 'content' => 'Be concise.'],
    ['role' => 'user', 'content' => 'What is PHP?'],
]);

$result = $generator->generate(
    prompt: $prompt,
    maxTokens: 100,
    sampler: Sampler::greedy(),
    onToken: fn($token) => print($token), // Stream to stdout
);

Performance

Let's be honest about this: it's not fast. On an M1 MacBook Pro:

  • Model loading: ~1.5 seconds
  • Token generation: ~3 tokens/second
  • A 27-token response: 8.3 seconds
  • A 149-token JSON extraction: 49.6 seconds

For comparison, Ollama runs Granite 3.3 2B at 50+ tokens/second on the same hardware using optimized GGUF format with Metal acceleration. We're about 15x slower.

But speed isn't really the point. The point is that it works at all — pure PHP, no Python runtime, no external services, fully offline. And there's significant room for improvement: adding IO binding support to the FFI bindings would eliminate the per-token memory allocation overhead, and the ONNX Runtime itself supports CoreML/Metal acceleration on macOS which could close much of the gap.

PDF Data Extraction

To make this practical, you can combine it with spatie/pdf-to-text for a data extraction pipeline:

// Extract text from PDF
$pdfText = (new Pdf())->setPdf('invoice.pdf')->text();

// Ask the LLM to extract structured data
$prompt = $generator->formatChat([
    ['role' => 'system', 'content' => 'Extract data as JSON. No explanation.'],
    ['role' => 'user', 'content' => "Extract: company, date, total, line_items.\n\n$pdfText"],
]);

$json = $generator->generate($prompt, maxTokens: 512);
$data = json_decode($json, true);

This gives you a fully offline document processing pipeline. No API keys, no network, no external dependencies beyond pdftotext and the ONNX model file.
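One practical wrinkle, not covered above: small models sometimes wrap the JSON in prose or markdown fences, so json_decode() on the raw output can fail. A defensive sketch (my own addition, with a hypothetical helper name) that pulls out the outermost object before decoding:

```php
<?php

// Extract and decode the first {...} span from model output.
// Returns null if nothing decodable is found.
function extractJson(string $output): ?array
{
    $start = strpos($output, '{');
    $end = strrpos($output, '}');
    if ($start === false || $end === false || $end < $start) {
        return null;
    }
    $data = json_decode(substr($output, $start, $end - $start + 1), true);
    return is_array($data) ? $data : null;
}
```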

What I Learned

The ONNX Runtime C API is more capable than the PHP bindings expose. The PHP library wraps 76 of 212 available C API functions. For basic inference this is fine, but for LLM generation, you're missing IO binding — the ability to pre-allocate tensor memory and reuse it across calls. Without it, every token generation allocates and deallocates 80+ tensors. Adding IO binding support (about 11 C API functions) would significantly improve throughput.

ByteLevel BPE tokenization is simpler than it looks. The core algorithm is ~50 lines: build a byte mapping, split on a regex, iteratively merge character pairs using a priority list. The HuggingFace tokenizer.json format is self-contained and easy to parse.

KV-cache management is the hard part. Not algorithmically hard — just tedious. You need to create correctly-shaped empty tensors for the first step, then shuttle 80 tensors between steps, matching input names (past_key_values.{N}.key) to output names (present.{N}.key).
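The present-to-past renaming is mechanical enough to sketch in full. An illustrative version of extractKvCache(), assuming $outputs maps output names to tensors of whatever type the bindings return:

```php
<?php

// Rename present.{N}.{key,value} outputs to the matching
// past_key_values.{N}.{key,value} input names for the next step.
function extractKvCache(array $outputs): array
{
    $cache = [];
    foreach ($outputs as $name => $tensor) {
        if (preg_match('/^present\.(\d+)\.(key|value)$/', $name, $m)) {
            $cache["past_key_values.{$m[1]}.{$m[2]}"] = $tensor;
        }
    }
    return $cache;
}
```

The logits output falls through the filter, so only the 80 cache tensors get carried forward.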

PHP's FFI is genuinely useful for this. The ability to create typed C buffers, pass them to native code, and get results back without serialization overhead makes this viable. You couldn't do this with a pure PHP ONNX parser.

Should You Use This?

Probably not in production. But there are legitimate use cases:

  • Air-gapped environments where you can't call external APIs
  • Edge processing where you want to keep data local
  • Prototyping ML pipelines in PHP before moving to a more optimized stack
  • Learning how LLM inference actually works under the hood

The code is on a branch at HelgeSverre/onnxruntime if you want to try it yourself.

What's Next

The obvious improvement is adding IO binding support to the PHP FFI wrapper — this would let you pre-allocate all tensor memory once and reuse it across the generation loop, eliminating the allocation overhead per token. The 11 missing C API functions (CreateIoBinding, BindInput, BindOutput, RunWithBinding, etc.) are well-documented and straightforward to wrap.

Beyond that, someone could build a proper onnxruntime-genai binding for PHP, which would handle the generation loop at the C level instead of in PHP. But honestly, the pure PHP approach has a certain charm to it.



