Norwegian å characters can render broken in PDFs generated by DomPDF (and similar non-shaping engines) from rich-text input. A tiny ring drifts above a bare "a", or a missing-glyph fallback appears where the å should be. The fix is two passes: Unicode NFC normalization, then one regex. The trap is that the obvious version of that regex silently mangles Arabic, Hebrew, Hindi, and Thai.
What it looks like
The visible symptom: a Norwegian å renders correctly in the rich-text editor, correctly in the HTML
preview, and breaks the moment the document is passed through DomPDF. The PDF shows
either a bare "a" with a tiny ring floating above the next character, or the missing-glyph fallback where the å
should be.
Pulling the rows out of the database showed two paragraphs with what looked like the same word:
row 1: må hex: 6d 61 cc 8a
row 2: må hex: 6d c3 a5
Identical visible text. Different bytes. The first paragraph stored m + a + U+030A (COMBINING RING ABOVE). The
second stored m + U+00E5 (the single precomposed å). The editor displayed them identically. DomPDF didn't.
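The two byte dumps can be reproduced without a database. This is a minimal sketch, assuming PHP 7 or later for the `\u{...}` escapes; `hexBytes` is a hypothetical helper name, not part of the original code:

```php
<?php
// Hypothetical helper: format a string's raw UTF-8 bytes as spaced hex pairs.
function hexBytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

$decomposed  = "m\u{0061}\u{030A}"; // m + a + COMBINING RING ABOVE
$precomposed = "m\u{00E5}";         // m + precomposed å

echo hexBytes($decomposed), PHP_EOL;    // 6d 61 cc 8a
echo hexBytes($precomposed), PHP_EOL;   // 6d c3 a5
var_dump($decomposed === $precomposed); // bool(false): same visible text, different bytes
```

A plain string comparison already tells the two rows apart, which is why a byte dump is a quicker diagnostic than looking at the rendered text.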
Not a font issue
Swapping fonts is the obvious first instinct, and it doesn't help. The base a renders fine. The combining ring renders
fine. They just render in two different places. The font already has both glyphs. What's missing is the layout step
that stacks them.
Two ways to spell må
Unicode has two ways to encode å. The precomposed form is a single codepoint, U+00E5. The decomposed form is two
codepoints: U+0061 (the letter a) followed by U+030A (a non-spacing combining mark).
UAX #15 defines the canonical composition that maps one to the other; the
UnicodeData.txt entry for U+00E5 lists its
decomposition as exactly U+0061 U+030A.
The two forms have names: NFC (precomposed, "canonical composition") and NFD (decomposed, "canonical decomposition").
Either is a legal way to spell å. If everything in the pipeline supports both, it doesn't matter which one you use.
[Interactive demo: type or paste text, toggle between NFC and NFD, and tick "Simulate DomPDF" to see what happens when a non-shaping renderer tries to place a combining mark. Default sample: "Blåbærsyltetøy på smørbrød", 26 codepoints in NFC.]
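The NFC/NFD toggle can be approximated offline. The sketch below assumes the intl and mbstring extensions are loaded; it converts the sample sentence into both forms and counts codepoints in each:

```php
<?php
// Assumes ext-intl (Normalizer) and ext-mbstring (mb_strlen) are available.
// "Blåbærsyltetøy på smørbrød", written with escapes so the literal is NFC.
$sample = "Bl\u{00E5}b\u{00E6}rsyltet\u{00F8}y p\u{00E5} sm\u{00F8}rbr\u{00F8}d";

$nfc = Normalizer::normalize($sample, Normalizer::FORM_C);
$nfd = Normalizer::normalize($sample, Normalizer::FORM_D);

echo mb_strlen($nfc, 'UTF-8'), PHP_EOL; // 26
echo mb_strlen($nfd, 'UTF-8'), PHP_EOL; // 28: each of the two å became a + U+030A

// Composing the decomposed form gives back the original NFC string.
var_dump(Normalizer::normalize($nfd, Normalizer::FORM_C) === $nfc); // bool(true)
```

Only the two å grow in NFD. The letters æ and ø have no canonical decomposition, so they stay as single codepoints in both forms.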
Why the editor shows it fine
Browsers render combining marks correctly because they run text through a text-shaping engine. Chromium and Edge use
HarfBuzz (via Blink), Safari shapes through CoreText on macOS, Firefox uses HarfBuzz
directly. Text shaping is the step between "here are some Unicode codepoints" and "here are pixels": it consults the
font's GSUB and GPOS tables, figures out which glyph IDs to draw, and positions combining marks over the base
character they belong to. That last step is the one that matters here.
That's why pasting må (NFD) into a contenteditable looks fine and saves to the database looking fine. The browser
positioned the ring over the a at render time. The bytes in the DOM never changed.
Why DomPDF doesn't
DomPDF doesn't have a shaping engine. It maps codepoints to glyph IDs one-to-one, draws each glyph at the next
horizontal position, and moves on. No GPOS, no mark stacking, no ligature substitution.
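The one-to-one mapping can be illustrated with a toy model. This is not DomPDF's code: `glyphRequests` is a hypothetical function that only lists the codepoints such a renderer would look up, one glyph per codepoint, in order. It assumes valid UTF-8 input and PHP 7.2 or later for `mb_ord`:

```php
<?php
// Toy model only, NOT DomPDF's implementation. Assumes ext-mbstring is loaded.
// A renderer without shaping asks the font for one glyph per codepoint.
function glyphRequests(string $text): array
{
    $requests = [];
    // Split on codepoint boundaries; preg_split returns false on invalid UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($chars as $char) {
        $requests[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return $requests;
}

echo implode(' ', glyphRequests("ma\u{030A}")), PHP_EOL; // U+006D U+0061 U+030A
echo implode(' ', glyphRequests("m\u{00E5}")), PHP_EOL;  // U+006D U+00E5
```

In the decomposed case the ring is a separate glyph request with nothing tying it to the a before it. In the precomposed case there is no separate mark to misplace.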
The DomPDF maintainers have acknowledged this in long-running issues (dompdf#553, discussion #3049). The recommended workaround is always the same: ensure your input uses precomposed characters, and that the font has a glyph at that exact codepoint. In other words: don't send NFD into DomPDF.
DomPDF isn't unique in this. The picture for PHP/HTML-to-PDF stacks, per each engine's documentation:
| Engine | Combining marks |
|---|---|
| DomPDF | No shaping — relies on precomposed glyphs |
| mPDF (pre-6) | Limited; mPDF docs recommend precomposed characters |
| mPDF (6.0+) | Supports OpenType layout (GSUB/GPOS) |
| TCPDF | Hardcoded diacritic + Arabic shaping tables; no general OpenType layout |
| wkhtmltopdf | Shapes correctly via QtWebKit |
| Browsershot (headless Chromium) | Shapes correctly via Chromium's pipeline |
These rows come from each engine's documentation, not from independent rendering tests.
Where the NFD comes from
There's no public Microsoft Word spec stating that copy emits NFD. The closest authoritative source on the macOS side is Apple's Technical Note TN1150, the canonical "macOS uses NFD" reference. But it specifies HFS+ filenames, not the clipboard. APFS, the modern macOS filesystem, accepts and preserves whatever normalization form you give it; it's normalization-insensitive at the lookup level but doesn't force NFD on storage.
Observed behavior: pasting Norwegian text from Word for Mac into a Tiptap editor produces NFD byte sequences in the underlying field. Same with Google Docs in Safari and Chrome on macOS. The same Word document opened on Windows and copy-pasted into the same editor produces NFC. Consistent with the OS-layer theory: macOS-side text plumbing tends to emit NFD, Windows-side tends to emit NFC. Behavior, not spec.
The definitive source on the storage side is TN1150's decomposition table. Determining what any specific clipboard pipeline does requires testing it directly.
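One way to test a paste pipeline directly is to classify what actually landed in storage. This is a sketch, assuming the intl extension is loaded; `normalizationForm` is a hypothetical helper built on `Normalizer::isNormalized`:

```php
<?php
// Assumes ext-intl is loaded. normalizationForm() is a hypothetical helper.
function normalizationForm(string $text): string
{
    $isNfc = Normalizer::isNormalized($text, Normalizer::FORM_C);
    $isNfd = Normalizer::isNormalized($text, Normalizer::FORM_D);

    if ($isNfc && $isNfd) {
        return 'both';  // e.g. plain ASCII, where the two forms coincide
    }
    if ($isNfc) {
        return 'NFC';
    }
    if ($isNfd) {
        return 'NFD';
    }
    return 'mixed';     // neither form (invalid UTF-8 also ends up here)
}

echo normalizationForm("m\u{00E5}"), PHP_EOL;         // NFC
echo normalizationForm("ma\u{030A}"), PHP_EOL;        // NFD
echo normalizationForm("abc"), PHP_EOL;               // both
echo normalizationForm("\u{00E5}a\u{030A}"), PHP_EOL; // mixed
```

Running a helper like this over text pasted from each source (Word for Mac, Word for Windows, Google Docs in each browser) turns the OS-layer theory into something that can be checked per pipeline.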
The naive fix
PHP's intl extension ships a Normalizer class that implements UAX #15. Composing the rendered HTML to NFC before
handing it to DomPDF is a one-liner:
<?php
// app/Util/UnicodeHelper.php (first version)
namespace App\Util;

use Normalizer;

class UnicodeHelper
{
    public static function normalizeForPdf(string $text): string
    {
        // Compose to NFC. normalize() returns false on failure, so fall back to the input.
        $normalized = Normalizer::normalize($text, Normalizer::FORM_C);

        return $normalized === false ? $text : $normalized;
    }
}
Call it between rendering the view and handing the HTML to DomPDF:
$html = view('pdf.document', ['document' => $document])->render();
$html = UnicodeHelper::normalizeForPdf($html);
$pdf = Pdf::loadHTML($html);
return $pdf->stream('document.pdf');
NFC composition is a no-op on already-precomposed text, so it's safe to call on every render.
A failing test that asserted the rendered HTML contained må (NFC) and not the bare U+030A turned green. The
paragraph from row 1 above now rendered with the å stacked correctly.
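The check that test made can be sketched without a test framework. This assumes the intl extension is loaded; the HTML fragment is an illustrative stand-in for the rendered view, not the real template:

```php
<?php
// Assumes ext-intl is loaded. The <p> fragment is illustrative only.
$stored   = "<p>m\u{0061}\u{030A}</p>"; // NFD, as found in row 1
$expected = "<p>m\u{00E5}</p>";         // NFC

$composed = Normalizer::normalize($stored, Normalizer::FORM_C);

var_dump($composed === $expected);                 // bool(true)
var_dump(strpos($composed, "\u{030A}") === false); // bool(true): no bare ring left

// Running NFC again changes nothing, which is why it is safe on every render.
var_dump(Normalizer::normalize($composed, Normalizer::FORM_C) === $composed); // bool(true)
```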
But the å was still broken
A second paste pattern triggers the same visible failure even after NFC composition. The codepoint sequence in the database looks like:
p (U+0070) + å (U+00E5) + ̊ (U+030A)
A precomposed å. Followed by a free-standing combining ring. NFC composition couldn't help: U+00E5 is already the
precomposed form, and there's no Unicode character for "letter a with two rings above". Normalizer::FORM_C saw the
precomposed base, looked for something to merge the trailing combining mark into, found nothing it could compose, and
left both codepoints in place. DomPDF drew the å, then tried to draw U+030A as a standalone glyph, didn't find one
in the font's regular glyph table, and fell back to the missing-glyph placeholder.
The first fix solved the case where two codepoints could be composed into one. It didn't solve the case where the second codepoint had nowhere to go.
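The orphan case is easy to reproduce in isolation. This is a minimal sketch, assuming the intl and mbstring extensions are loaded; the string is built by hand rather than taken from the original database:

```php
<?php
// Assumes ext-intl and ext-mbstring are loaded.
$orphaned = "p\u{00E5}\u{030A}"; // p + precomposed å + stray COMBINING RING ABOVE

$afterNfc = Normalizer::normalize($orphaned, Normalizer::FORM_C);

var_dump($afterNfc === $orphaned);            // bool(true): NFC had nothing to compose
var_dump(mb_strlen($afterNfc, 'UTF-8'));      // int(3)
var_dump(preg_match('/\p{Mn}/u', $afterNfc)); // int(1): the stray mark is still there
```

The string is already in NFC, so normalization alone can never remove the trailing ring.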
How did the orphan get there
There's no reliable reproduction. The plausible mechanism: a paste deposits an NFD å, the browser shapes it, then a
later edit (IME quirk, paste-over-paste, autocorrect, input rule) leaves an extra U+030A next to the already-shaped
å. Re-saving preserves both codepoints. Once the bytes land in the database, the rendering bug is downstream of
whatever produced them.
The smarter fix
The first attempt at handling the orphan was to strip every non-spacing mark left after NFC composition:
$normalized = Normalizer::normalize($text, Normalizer::FORM_C);
return preg_replace('/\p{Mn}/u', '', $normalized);
\p{Mn} is the Unicode category "Mark, Nonspacing": every floating combining diacritic in every script. After NFC
composition, anything still in that category is by definition a mark that couldn't compose into a precomposed character.
DomPDF can't draw it. Stripping it produces readable text.
It also destroys vocalized Arabic.
Arabic vocalization marks (harakat: fatha, damma, kasra, shadda, sukun) are \p{Mn}. Hebrew niqqud are \p{Mn}. The
Devanagari virama, which joins consonants into conjunct clusters, is \p{Mn}. Thai tone marks, the ones that
distinguish น้ำ ("water") from นำ ("to lead"), are \p{Mn}. A blanket strip silently turns each of these into
broken text.
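The category claims can be checked with the same regex engine the strip uses. This is a sketch; `isNonspacingMark` is a hypothetical helper, and the specific codepoints are examples chosen for illustration:

```php
<?php
// Hypothetical helper: true when the single character is in category Mn.
function isNonspacingMark(string $char): bool
{
    return preg_match('/^\p{Mn}$/u', $char) === 1;
}

$marks = [
    'Combining ring above' => "\u{030A}",
    'Arabic fatha'         => "\u{064E}",
    'Arabic shadda'        => "\u{0651}",
    'Hebrew patah'         => "\u{05B7}",
    'Devanagari virama'    => "\u{094D}",
    'Thai mai tho'         => "\u{0E49}",
];

foreach ($marks as $name => $char) {
    printf("%-22s %s\n", $name, isNonspacingMark($char) ? 'Mn' : 'not Mn');
}

// The blanket strip on Thai "water": U+0E19 U+0E49 U+0E33 loses its tone mark.
$water = "\u{0E19}\u{0E49}\u{0E33}";
var_dump(preg_replace('/\p{Mn}/u', '', $water) === "\u{0E19}\u{0E33}"); // bool(true)
```

Every entry in the list reports Mn, so a pattern that targets the stray Latin ring by category alone targets all of them.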
The bug in question (an orphan combining mark following a Latin base character) has a much narrower signature. Scoping the strip with a Latin-script lookbehind keeps the fix and drops the collateral damage:
<?php
// app/Util/UnicodeHelper.php (final)
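namespace App\Util;

use Normalizer;

class UnicodeHelper
{
    public static function normalizeForPdf(string $text): string
    {
        // Pass 1: compose to NFC; keep the input if normalization fails.
        $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
        if ($normalized === false) {
            $normalized = $text;
        }

        // Pass 2: strip nonspacing marks left directly after a Latin letter.
        // preg_replace() returns null on error, so fall back to the NFC string.
        return preg_replace('/(?<=\p{Latin})\p{Mn}+/u', '', $normalized) ?? $normalized;
    }
}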
(?<=\p{Latin})\p{Mn}+ matches one-or-more nonspacing marks that immediately follow a Latin letter. Arabic harakat sit
after Arabic letters: no match. Hebrew niqqud sit after Hebrew letters: no match. Devanagari and Thai marks sit after
their own consonants: no match. Emoji variation selectors (U+FE0F) sit after Symbol codepoints: no match. The only
thing that matches is the actual pattern the bug uses: a stray combining mark next to a Latin letter, after NFC has
already done everything it could.
What the lookbehind actually saves
The interactive version of this article runs both versions of the regex on the same 11 strings. Wherever the Naive row
diverges from the Scoped row, that's the article's whole concern made concrete. Vocalized Arabic loses every harakat. Thai น้ำ ("water") drops to
นำ ("to lead"). Hebrew loses its niqqud. Devanagari loses the virama that holds conjunct clusters together. The Latin
rows show the reassuring half of the story: for the scripts the bug actually affects, both regexes produce the same
fix.
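A small offline version of that comparison can be sketched as follows. The five sample strings are illustrative choices, not the article's original set, and `stripNaive` and `stripScoped` are hypothetical wrappers around the two patterns:

```php
<?php
// Hypothetical wrappers around the two patterns discussed above.
function stripNaive(string $s): string
{
    return preg_replace('/\p{Mn}/u', '', $s) ?? $s;
}

function stripScoped(string $s): string
{
    return preg_replace('/(?<=\p{Latin})\p{Mn}+/u', '', $s) ?? $s;
}

// Illustrative samples (assumed, not the article's list).
$samples = [
    'Latin orphan'        => "p\u{00E5}\u{030A}",         // å + stray ring
    'Arabic with fatha'   => "\u{0628}\u{064E}",          // ba + fatha
    'Hebrew with patah'   => "\u{05D1}\u{05B7}",          // bet + patah
    'Devanagari conjunct' => "\u{0915}\u{094D}\u{0937}",  // ka + virama + ssa
    'Thai "water"'        => "\u{0E19}\u{0E49}\u{0E33}",  // no nu + mai tho + sara am
];

foreach ($samples as $label => $text) {
    printf(
        "%-20s naive: %-8s scoped: %s\n",
        $label,
        stripNaive($text) === $text ? 'kept' : 'CHANGED',
        stripScoped($text) === $text ? 'kept' : 'CHANGED'
    );
}
```

Only the Latin orphan should report CHANGED in both columns. For the other four samples the naive column reports CHANGED and the scoped column reports kept.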
One known weakness: if a paste artifact produces a stray combining mark following a non-Latin base (e.g., an Arabic
letter followed by U+030A), the lookbehind won't match and the orphan will survive into the PDF. No such case has
surfaced in practice. If one does, the regex grows another alternative, and the unit tests will surface anything else
the change breaks.
What it means
If your stack is DomPDF, or mPDF before 6, or TCPDF, and you have international users, you have this bug. The bytes that
work in your editor don't work in your renderer because the renderer doesn't have a shaping engine. One line of
Normalizer::normalize($html, Normalizer::FORM_C) covers most of it; the Latin-scoped strip above handles the rest.
Inserted right before the HTML reaches the PDF library, it repairs the output for every legacy row in your database
without a backfill, because the stored bytes are never rewritten.
The lookbehind scope is what makes the helper safe to call on user content of unknown origin. Without it, the same fix
strips legitimate vowel marks from Arabic and Hebrew, the virama from Devanagari (which collapses conjunct clusters),
and tone marks from Thai, where tones are lexically distinctive and น้ำ ("water") becomes นำ ("to lead"). The scope
is the difference between a normalization helper and a silent corruption layer.
