Why combining marks break in DomPDF

Norwegian å characters can render broken in PDFs generated by DomPDF (and similar non-shaping engines) from rich-text input. A tiny ring drifts above a bare "a", or a missing-glyph box appears where the å should be. The fix is one pass of Unicode normalization plus one regex. The trap is that the obvious version of that regex silently mangles Arabic, Hebrew, Hindi, and Thai.

What it looks like

The visible symptom: Norwegian å renders correctly in the rich-text editor and in the HTML preview, then breaks the moment the document is passed through DomPDF. The PDF shows either a bare "a" with a tiny ring floating above the next character, or the missing-glyph fallback where the å should be.

Pulling the rows out of the database showed two paragraphs with what looked like the same word:

row 1:  må     hex: 6d 61 cc 8a
row 2:  må     hex: 6d c3 a5

Identical visible text. Different bytes. The first paragraph stored m + a + U+030A (COMBINING RING ABOVE). The second stored m + U+00E5 (the single precomposed å). The editor displayed them identically. DomPDF didn't.
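The two rows are easy to reproduce outside the database. The following is a minimal sketch with illustrative variable names; the strings use \u{...} escapes so that the source file's own normalization cannot interfere.

```php
<?php
// Two spellings of "må" that look identical on screen.
$decomposed  = "ma\u{030A}"; // m + a + COMBINING RING ABOVE (row 1)
$precomposed = "m\u{00E5}";  // m + precomposed å (row 2)

// bin2hex() shows the raw UTF-8 bytes.
echo bin2hex($decomposed), PHP_EOL;  // 6d61cc8a
echo bin2hex($precomposed), PHP_EOL; // 6dc3a5

// A byte-wise comparison sees two different strings.
var_dump($decomposed === $precomposed); // bool(false)
```

Any equality check, unique index, or search that works on raw bytes will treat these as different words.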

Not a font issue

Swapping fonts is the obvious first instinct, and it doesn't help. The base a renders fine. The combining ring renders fine. They just render in two different places. The font already has both glyphs. What's missing is the layout step that stacks them.

Two ways to spell må

Unicode has two ways to encode å. The precomposed form is a single codepoint, U+00E5. The decomposed form is two codepoints: U+0061 (the letter a) followed by U+030A (a non-spacing combining mark). UAX #15 defines the canonical composition that maps one to the other; the UnicodeData.txt entry for U+00E5 lists its decomposition as exactly U+0061 U+030A.

Form	Codepoints	Browser	DomPDF
NFC (precomposed)	å U+00E5 (Ll)	må	må
NFD (decomposed)	a U+0061 (Ll) + U+030A (Mn)	må	ma˚ (ring drifts free)

The two forms have names: NFC (precomposed, "canonical composition") and NFD (decomposed, "canonical decomposition"). Either is a legal way to spell å. If everything in the pipeline supports both, it doesn't matter which one you use.
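Converting between the two forms takes one call each way. This sketch assumes the intl and mbstring extensions are loaded; the variable names are illustrative.

```php
<?php
// Requires the intl extension (Normalizer) and mbstring (mb_strlen).
$nfc = "\u{00E5}";  // å as one codepoint
$nfd = "a\u{030A}"; // a + COMBINING RING ABOVE

// Canonical composition: two codepoints become one.
var_dump(Normalizer::normalize($nfd, Normalizer::FORM_C) === $nfc); // bool(true)

// Canonical decomposition: one codepoint becomes two.
var_dump(Normalizer::normalize($nfc, Normalizer::FORM_D) === $nfd); // bool(true)

// mb_strlen() counts codepoints, not bytes.
echo mb_strlen($nfc, 'UTF-8'), ' vs ', mb_strlen($nfd, 'UTF-8'), PHP_EOL; // 1 vs 2
```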

Here is the codepoint view of a sample phrase, Blåbærsyltetøy på smørbrød, in NFC (26 codepoints). In NFD each å splits into a + U+030A, and that free-standing ring is what a non-shaping renderer such as DomPDF misplaces:

Hex	Char	Cat	Name
U+0042	B	Lu	LATIN CAPITAL LETTER B
U+006C	l	Ll	LATIN SMALL LETTER L
U+00E5	å	Ll	LATIN SMALL LETTER A WITH RING ABOVE
U+0062	b	Ll	LATIN SMALL LETTER B
U+00E6	æ	Ll	LATIN SMALL LETTER AE
U+0072	r	Ll	LATIN SMALL LETTER R
U+0073	s	Ll	LATIN SMALL LETTER S
U+0079	y	Ll	LATIN SMALL LETTER Y
U+006C	l	Ll	LATIN SMALL LETTER L
U+0074	t	Ll	LATIN SMALL LETTER T
U+0065	e	Ll	LATIN SMALL LETTER E
U+0074	t	Ll	LATIN SMALL LETTER T
U+00F8	ø	Ll	LATIN SMALL LETTER O WITH STROKE
U+0079	y	Ll	LATIN SMALL LETTER Y
U+0020	␣	Zs	SPACE
U+0070	p	Ll	LATIN SMALL LETTER P
U+00E5	å	Ll	LATIN SMALL LETTER A WITH RING ABOVE
U+0020	␣	Zs	SPACE
U+0073	s	Ll	LATIN SMALL LETTER S
U+006D	m	Ll	LATIN SMALL LETTER M
U+00F8	ø	Ll	LATIN SMALL LETTER O WITH STROKE
U+0072	r	Ll	LATIN SMALL LETTER R
U+0062	b	Ll	LATIN SMALL LETTER B
U+0072	r	Ll	LATIN SMALL LETTER R
U+00F8	ø	Ll	LATIN SMALL LETTER O WITH STROKE
U+0064	d	Ll	LATIN SMALL LETTER D
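A listing like this can be produced for any string. The sketch below assumes the intl and mbstring extensions; dumpCodepoints is a hypothetical helper name, and it prints only the hex value and character name.

```php
<?php
// Requires the intl (IntlChar, Normalizer) and mbstring extensions.

/**
 * List each codepoint of a UTF-8 string as [hex, name].
 *
 * @return list<array{0: string, 1: string}>
 */
function dumpCodepoints(string $text): array
{
    $rows = [];
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        $rows[] = [sprintf('U+%04X', $cp), IntlChar::charName($cp) ?? ''];
    }
    return $rows;
}

// Normalize first so the result does not depend on how this file was saved.
$phrase = Normalizer::normalize('Blåbærsyltetøy på smørbrød', Normalizer::FORM_C);
$nfd    = Normalizer::normalize($phrase, Normalizer::FORM_D);

foreach (dumpCodepoints($phrase) as [$hex, $name]) {
    echo $hex, "\t", $name, PHP_EOL;
}

echo count(dumpCodepoints($phrase)), PHP_EOL; // 26
echo count(dumpCodepoints($nfd)), PHP_EOL;    // 28: each å gains a U+030A
```

Only the two å characters change between forms; æ and ø have no canonical decomposition, so they stay single codepoints in NFD.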

Why the editor shows it fine

Browsers render combining marks correctly because they ship a text-shaping engine. Chromium-based browsers (Chrome, Edge) and Firefox use HarfBuzz. Safari uses Apple's CoreText on macOS and iOS — a separate framework, but one that does the same job. Text shaping is the step between "here are some Unicode codepoints" and "here are pixels": it consults the font's GSUB and GPOS tables, figures out which glyph IDs to draw, and positions combining marks over the base character they belong to. That last step is the one that matters here.

That's why pasting må (NFD) into a contenteditable looks fine and saves to the database looking fine. The browser's shaping engine stacked the ring over the a at render time. The bytes in the DOM never changed.

Why DomPDF doesn't

DomPDF doesn't have a shaping engine. It maps codepoints to glyph IDs one-to-one, draws each glyph at the next horizontal position, and moves on. No GPOS, no mark stacking, no ligature substitution.
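A toy model makes the failure easier to picture. This is not DomPDF's code: the fixed advance width, the zero advance for nonspacing marks, and the function name layoutWithoutShaping are all simplifying assumptions made for illustration.

```php
<?php
// Toy model only. Requires the mbstring extension.

/**
 * Place each codepoint at the current pen position, left to right.
 * Assumption: nonspacing marks advance the pen by zero, everything
 * else by a fixed width.
 *
 * @return list<array{char: string, x: int}>
 */
function layoutWithoutShaping(string $text, int $advance = 10): array
{
    $pen = 0;
    $placed = [];
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $placed[] = ['char' => $char, 'x' => $pen];
        // A shaping engine would use GPOS to move a mark back over its
        // base. Here the mark is drawn wherever the pen happens to be.
        $isMark = preg_match('/^\p{Mn}$/u', $char) === 1;
        $pen += $isMark ? 0 : $advance;
    }
    return $placed;
}

$glyphs = layoutWithoutShaping("ma\u{030A} ok");
foreach ($glyphs as $glyph) {
    printf("U+%04X at x=%d\n", mb_ord($glyph['char'], 'UTF-8'), $glyph['x']);
}
```

In this model the ring and the character after it share the same x position, which matches the symptom of a ring floating above the next character. With the precomposed å there is only one glyph to place, so nothing can drift.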

The DomPDF maintainers have acknowledged this in long-running issues (dompdf#553, discussion #3049). The recommended workaround is always the same: ensure your input uses precomposed characters, and that the font has a glyph at that exact codepoint. In other words: don't send NFD into DomPDF.

DomPDF isn't unique in this. The picture for PHP/HTML-to-PDF stacks, per each engine's documentation:

Engine	Combining marks
DomPDF	No shaping — relies on precomposed glyphs
mPDF (pre-6)	Limited; mPDF docs recommend precomposed characters
mPDF (6.0+)	Supports OpenType layout (GSUB/GPOS)
TCPDF	Hardcoded diacritic + Arabic shaping tables; no general OpenType layout
wkhtmltopdf	Shapes correctly via QtWebKit
Browsershot (headless Chromium)	Shapes correctly via Chromium's pipeline

These rows come from each engine's documentation, not from independent rendering tests.

Where the NFD comes from

There's no public Microsoft Word spec stating that copy emits NFD. The closest authoritative source on the macOS side is Apple's Technical Note TN1150, the canonical "macOS uses NFD" reference. But it specifies HFS+ filenames, not the clipboard. APFS, the modern macOS filesystem, accepts and preserves whatever normalization form you give it; it's normalization-insensitive at the lookup level but doesn't force NFD on storage.

Observed behavior: pasting Norwegian text from Word for Mac into a Tiptap editor produces NFD byte sequences in the underlying field. Same with Google Docs in Safari and Chrome on macOS. The same Word document opened on Windows and copy-pasted into the same editor produces NFC. Consistent with the OS-layer theory: macOS-side text plumbing tends to emit NFD, Windows-side tends to emit NFC. Behavior, not spec.

The definitive source on the storage side is TN1150's decomposition table. Determining what any specific clipboard pipeline does requires testing it directly.
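One way to test a pipeline directly is to capture what lands in the field and ask which form it is in. A minimal sketch, assuming the intl extension; describeNormalization is a hypothetical helper name.

```php
<?php
// Requires the intl extension.

/**
 * Report which canonical normalization form a captured string is in.
 * Text with nothing to compose or decompose is in both forms at once.
 */
function describeNormalization(string $text): string
{
    $isNfc = Normalizer::isNormalized($text, Normalizer::FORM_C);
    $isNfd = Normalizer::isNormalized($text, Normalizer::FORM_D);

    if ($isNfc && $isNfd) {
        return 'both';
    }
    if ($isNfc) {
        return 'NFC';
    }
    if ($isNfd) {
        return 'NFD';
    }
    return 'mixed';
}

echo describeNormalization("m\u{00E5}"), PHP_EOL;            // NFC
echo describeNormalization("ma\u{030A}"), PHP_EOL;           // NFD
echo describeNormalization("m\u{00E5} ma\u{030A}"), PHP_EOL; // mixed
echo describeNormalization("plain ascii"), PHP_EOL;          // both
```

Logging this value next to the browser and operating system of each paste would turn the observed behavior above into data for a specific pipeline.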

The naive fix

PHP's intl extension ships a Normalizer class that implements UAX #15. Composing the rendered HTML to NFC before handing it to DomPDF is a one-liner:

<?php
// app/Util/UnicodeHelper.php (first version)

namespace App\Util;

use Normalizer;

class UnicodeHelper
{
    public static function normalizeForPdf(string $text): string
    {
        $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
        return $normalized === false ? $text : $normalized;
    }
}

Call it between rendering the view and handing the HTML to DomPDF:

$html = view('pdf.document', ['document' => $document])->render();

$html = UnicodeHelper::normalizeForPdf($html);

$pdf = Pdf::loadHTML($html);
return $pdf->stream('document.pdf');

NFC composition is a no-op on already-precomposed text, so it's safe to call on every render.

A failing test that asserted the rendered HTML contained må (NFC) and not the bare U+030A turned green. The paragraph from row 1 above now rendered with the å stacked correctly.
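The checks in that test can be sketched in plain PHP. The fixture string is hypothetical, and normalizeForPdfFirstVersion is a stand-in that mirrors the first version of the helper so the example runs on its own; it assumes the intl extension.

```php
<?php
// Requires the intl extension. Stand-in for the first version of
// UnicodeHelper::normalizeForPdf.
function normalizeForPdfFirstVersion(string $text): string
{
    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
    return $normalized === false ? $text : $normalized;
}

// Hypothetical fixture: a paragraph stored in decomposed form.
$stored = "<p>Det ma\u{030A} gj\u{00F8}res</p>";
$html   = normalizeForPdfFirstVersion($stored);

// The precomposed word is present and no bare U+030A remains.
var_dump(strpos($html, "m\u{00E5}") !== false); // bool(true)
var_dump(strpos($html, "\u{030A}") === false);  // bool(true)

// Already-composed input passes through unchanged.
var_dump(normalizeForPdfFirstVersion($html) === $html); // bool(true)
```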

But the å was still broken

A second paste pattern triggers the same visible failure even after NFC composition. The byte sequence in the database looks like:

p (U+0070) + å (U+00E5) + ̊ (U+030A)

A precomposed å. Followed by a free-standing combining ring. NFC composition couldn't help: U+00E5 is already the precomposed form, and there's no Unicode character for "letter a with two rings above". Normalizer::FORM_C saw the precomposed base, looked for something to merge the trailing combining mark into, found nothing it could compose, and left both codepoints in place. DomPDF drew the å, then tried to draw U+030A as a standalone glyph, didn't find one in the font's regular glyph table, and fell back to the missing-glyph box.

The first fix solved the case where two codepoints could be composed into one. It didn't solve the case where the second codepoint had nowhere to go.
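The orphan case is easy to confirm in isolation. A minimal sketch, assuming the intl and mbstring extensions; the variable names are illustrative.

```php
<?php
// Requires the intl and mbstring extensions.
$orphaned = "p\u{00E5}\u{030A}"; // p + precomposed å + stray COMBINING RING ABOVE

$composed = Normalizer::normalize($orphaned, Normalizer::FORM_C);

// NFC has nothing to merge the ring into, so the string is unchanged.
var_dump($composed === $orphaned); // bool(true)

// List whatever nonspacing marks survived composition.
preg_match_all('/\p{Mn}/u', $composed, $matches);
foreach ($matches[0] as $mark) {
    printf("U+%04X\n", mb_ord($mark, 'UTF-8')); // U+030A
}
```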

How did the orphan get there

There's no reliable reproduction. The plausible mechanism: a paste deposits an NFD å, the browser shapes it, then a later edit (IME quirk, paste-over-paste, autocorrect, input rule) leaves an extra U+030A next to the already-shaped å. Re-saving preserves both codepoints. Once the bytes land in the database, the rendering bug is downstream of whatever produced them.

The smarter fix

The first attempt at handling the orphan was to strip every non-spacing mark left after NFC composition:

$normalized = Normalizer::normalize($text, Normalizer::FORM_C);
return preg_replace('/\p{Mn}/u', '', $normalized);

\p{Mn} is the Unicode category "Mark, Nonspacing": every floating combining diacritic in every script. After NFC composition, anything still in that category is by definition a mark that couldn't compose into a precomposed character. DomPDF can't draw it. Stripping it produces readable text.

It also destroys vocalized Arabic.

Arabic vocalization marks (harakat: fatha, damma, kasra, shadda, sukun) are \p{Mn}. Hebrew niqqud are \p{Mn}. The Devanagari virama, which joins consonants into conjunct clusters, is \p{Mn}. Thai tone marks, the ones that distinguish น้ำ ("water") from นำ ("to lead"), are \p{Mn}. A blanket strip silently turns each of these into broken text.
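The category claims can be checked with the same \p{Mn} class the naive strip uses. This sketch picks one representative mark per script; the array keys are descriptive labels, not official character names.

```php
<?php
// Each entry is a mark the naive strip would remove.
$marks = [
    'Arabic fatha'      => "\u{064E}",
    'Arabic shadda'     => "\u{0651}",
    'Hebrew qamats'     => "\u{05B8}",
    'Devanagari virama' => "\u{094D}",
    'Thai mai tho'      => "\u{0E49}",
];

foreach ($marks as $label => $mark) {
    $isMn = preg_match('/^\p{Mn}$/u', $mark) === 1;
    echo $label, ': ', $isMn ? 'Mn' : 'not Mn', PHP_EOL;
}

// The Thai example from the text: no nu + mai tho + sara am.
$water  = "\u{0E19}\u{0E49}\u{0E33}";
$toLead = preg_replace('/\p{Mn}/u', '', $water);
var_dump($toLead === "\u{0E19}\u{0E33}"); // bool(true)
```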

The bug in question (an orphan combining mark following a Latin base character) has a much narrower signature. Scoping the strip with a Latin-script lookbehind keeps the fix and drops the collateral damage:

<?php
// app/Util/UnicodeHelper.php (final)

namespace App\Util;

use Normalizer;

class UnicodeHelper
{
    public static function normalizeForPdf(string $text): string
    {
        $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
        if ($normalized === false) {
            $normalized = $text;
        }
        return preg_replace('/(?<=\p{Latin})\p{Mn}+/u', '', $normalized) ?? $normalized;
    }
}

(?<=\p{Latin})\p{Mn}+ matches one-or-more nonspacing marks that immediately follow a Latin letter. Arabic harakat sit after Arabic letters: no match. Hebrew niqqud sit after Hebrew letters: no match. Devanagari and Thai marks sit after their own consonants: no match. Emoji variation selectors (U+FE0F) sit after Symbol codepoints: no match. The only thing that matches is the actual pattern the bug uses: a stray combining mark next to a Latin letter, after NFC has already done everything it could.

What the lookbehind actually saves

Both versions of the regex are compared below on the same 11 strings. Look at the Naive output wherever it diverges from Scoped: that's the article's whole concern made concrete. Vocalized Arabic loses every harakat. Thai น้ำ ("water") drops to นำ ("to lead"). Hebrew loses its niqqud. Devanagari loses the virama that holds conjunct clusters together. The Latin rows show the reassuring half of the story: for the scripts the bug actually affects, both regexes produce the same fix.

11 strings: naive strip vs Latin-scoped strip

Case	Input	Naive	Scoped	Result	Note
Norwegian (NFD)	Det må gjøres	Det må gjøres	Det må gjøres	Identical output	Both compose a + U+030A → å. Naive happens to be safe here.
Norwegian (orphan ring)	på̊	på	på	Identical output	Both strip the stray U+030A after the Latin å.
French (NFD)	café résumé	café résumé	café résumé	Identical output	Both compose e + U+0301 → é.
Vietnamese (NFD)	tiếng	tiếng	tiếng	Identical output	Both compose e + U+0302 + U+0301 → ế.
Vocalized Arabic	اَلسَّلامُ	السلام	اَلسَّلامُ	Naive corrupts	Naive strips all harakat; every short-vowel mark is gone.
Hebrew with niqqud	שָׁלוֹם	שלום	שָׁלוֹם	Naive corrupts	Naive strips every niqqud; the vocalization disappears.
Hindi (Devanagari)	नमस्ते	नमसत	नमस्ते	Naive corrupts	Naive strips the virama and the vowel sign; consonant clusters fall apart.
Thai (with tone)	น้ำ	นำ	น้ำ	Naive corrupts	Naive strips the tone mark: "water" (น้ำ) becomes "to lead" (นำ).
Simplified Chinese	你好世界	你好世界	你好世界	Identical output	Ideographs are atomic; neither strip touches them.
Korean Jamo (decomposed)	안	안	안	Identical output	Both compose the conjoining jamo U+110B + U+1161 + U+11AB → 안 (Hangul composition is its own thing).
Emoji + VS16	❤️	❤	❤️	Naive corrupts	Naive strips VS16 (it's Mn). Scoped leaves it alone because U+FE0F follows a Symbol, not a Latin letter.
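The comparison can be reproduced with two small functions. Both are stand-ins written for this sketch (only the regexes come from the text above), and the sample inputs are shortened to one base character plus one mark per script. It assumes the intl extension.

```php
<?php
// Requires the intl extension.
function stripNaive(string $text): string
{
    $nfc = Normalizer::normalize($text, Normalizer::FORM_C);
    $nfc = $nfc === false ? $text : $nfc;
    return preg_replace('/\p{Mn}/u', '', $nfc) ?? $nfc;
}

function stripScoped(string $text): string
{
    $nfc = Normalizer::normalize($text, Normalizer::FORM_C);
    $nfc = $nfc === false ? $text : $nfc;
    return preg_replace('/(?<=\p{Latin})\p{Mn}+/u', '', $nfc) ?? $nfc;
}

$cases = [
    'Norwegian orphan'   => "p\u{00E5}\u{030A}",         // på + stray ring
    'Arabic with fatha'  => "\u{0628}\u{064E}",          // beh + fatha
    'Hebrew with qamats' => "\u{05E9}\u{05B8}",          // shin + qamats
    'Hindi with virama'  => "\u{0938}\u{094D}\u{0924}",  // sa + virama + ta
    'Emoji with VS16'    => "\u{2764}\u{FE0F}",          // heart + VS16
];

foreach ($cases as $label => $input) {
    $same = stripNaive($input) === stripScoped($input);
    echo $label, ': ', $same ? 'identical output' : 'naive corrupts', PHP_EOL;
}
```

Only the first case should report identical output; in the other four the scoped version returns its input unchanged while the naive version drops the mark.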

One known weakness: if a paste artifact produces a stray combining mark following a non-Latin base (e.g., an Arabic letter followed by U+030A), the lookbehind won't match and the orphan will survive into the PDF. No such case has surfaced in practice. If one does, the regex grows another alternative, and the unit tests will surface anything else the change breaks.
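For illustration only, here is what one such extra alternative might look like. It is a hypothetical extension, not part of the helper described above. The assumption is that after an Arabic letter only marks from the Combining Diacritical Marks block (U+0300 to U+036F) count as paste debris, so the harakat, which live in the Arabic block, are left alone. The function name stripOrphansExtended is invented for this sketch, and it expects input that has already been composed to NFC.

```php
<?php
// Hypothetical extension of the scoped strip. Not the article's helper.
function stripOrphansExtended(string $nfcText): string
{
    // First alternative: the original Latin-scoped strip.
    // Second alternative (assumption): generic Latin-style diacritics
    // that directly follow an Arabic letter.
    $pattern = '/(?<=\p{Latin})\p{Mn}+|(?<=\p{Arabic})[\x{0300}-\x{036F}]+/u';
    return preg_replace($pattern, '', $nfcText) ?? $nfcText;
}

// Arabic beh followed by a stray COMBINING RING ABOVE: the ring is removed.
var_dump(stripOrphansExtended("\u{0628}\u{030A}") === "\u{0628}"); // bool(true)

// Arabic beh followed by fatha: untouched.
var_dump(stripOrphansExtended("\u{0628}\u{064E}") === "\u{0628}\u{064E}"); // bool(true)
```

Each script added this way needs its own decision about which marks are legitimate, which is a good reason to wait for a real case before extending the pattern.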

What it means

If your stack is DomPDF, or mPDF before 6, or TCPDF, and you have international users, you have this bug. The bytes that work in your editor don't work in your renderer because the renderer doesn't have a shaping engine. One line of Normalizer::normalize($html, Normalizer::FORM_C) covers most of it; the Latin-scoped strip above handles the rest. Inserted right before the HTML reaches the PDF library, it repairs every legacy row in your database without a backfill.

The lookbehind scope is what makes the helper safe to call on user content of unknown origin. Without it, the same fix strips legitimate vowel marks from Arabic and Hebrew, the virama from Devanagari (which collapses conjunct clusters), and tone marks from Thai, where tones are lexically distinctive and น้ำ ("water") becomes นำ ("to lead"). The scope is the difference between a normalization helper and a silent corruption layer.