Fixing a Complicated PDF Bug

UPDATE: I have a new and improved solution described here.

I am generating PDF files using the Linux Libertine fonts and XeTeX. When I view the files with an ordinary PDF reader, they appear fine. However, when I open them with the PDF.js viewer built into Firefox, the ligatures appear as odd foreign characters.

This problem appears to be known, as it is discussed in this bug report. However, that thread shows little progress toward a fix. I haven’t pinned down the exact cause, but I did at least find a workaround.

The bug is incredibly specific (though, thankfully, easily reproducible). It occurs only when I compile a document on OS X and view it with the PDF.js viewer in Firefox on OS X. It does not show up in any of the following situations that I have tested:

  • Compiling the document on Linux
  • Viewing the document in Firefox on Windows
  • Viewing the document with PDF.js in Safari on a Mac

The erroneous characters seem to be a result of the PDF font subsetting algorithm. On Linux, when the fonts are subsetted, the encoding slots continue to match the original font (and also the Unicode code points, since the fonts are encoded consistently with Unicode). On OS X, however, each glyph's slot in the subsetted font is its position in the original font counted with unused slots skipped. This means that glyphs far down in the font (ligatures, for example) end up with slot numbers far lower than their Unicode code points. The erroneously displayed glyphs appear to be the characters whose code points equal those reassigned slot numbers.
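The difference between the two numbering schemes can be sketched in a few lines of Perl. This is only a toy model of the behavior described above, not XeTeX's actual subsetting code: the code points in @populated are hypothetical, and compact_slots is a name made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical font: the code points that actually have glyphs.
# The gaps between them stand in for the unused slots.
my @populated = (0x0041, 0x0042, 0x0068, 0xE042, 0xE049);

# Compacted numbering: a glyph's slot is its rank among the populated
# slots, so gaps in the original encoding are skipped.
sub compact_slots {
    my @code_points = sort { $a <=> $b } @_;
    my %slot_for;
    $slot_for{ $code_points[$_] } = $_ for 0 .. $#code_points;
    return %slot_for;
}

my %compact = compact_slots(@populated);
for my $cp (sort { $a <=> $b } keys %compact) {
    # Preserved numbering (the Linux behavior) keeps slot == code point.
    printf "U+%04X  preserved slot 0x%04x  compacted slot 0x%04x\n",
        $cp, $cp, $compact{$cp};
}
```

In this toy font the glyphs in the 0xE0xx range end up in single-digit slots under the compacted scheme, which is the same kind of downward shift described above.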

As an example: the “Th” ligature is Unicode code point 0xe049. However, in the Linux Libertine Roman font, it is the 0x095f’th glyph listed, not counting blank slots. Unicode character 0x095f is the Devanagari letter Yya: य़. That is exactly the character wrongly displayed in the bug report in place of the “Th” of the word “The.”
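The character lookup is easy to double-check with Perl's core charnames module. This only confirms which Unicode character shares its number with the reassigned slot; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();   # core module; viacode() maps a code point to its Unicode name

my $slot = 0x095f;  # slot number quoted above for the "Th" ligature

# Treating the slot number as a code point yields a Devanagari letter.
my $name = charnames::viacode($slot);
printf "U+%04X is %s\n", $slot, $name;   # prints "U+095F is DEVANAGARI LETTER YYA"
```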

Strangely, though, the problem appears to affect only ligatures. The Linux Libertine character at Unicode 0xe042 is the 0x0958’th glyph, only a few slots away from the problematic Th ligature. There is also a Devanagari character at that position, but PDF.js displays the Linux Libertine character fine.

After some testing, I discovered that simply changing the glyph names for the ligatures would solve the problem. The ligatures in Linux Libertine are named with underscores between letters (e.g., f_f_i or T_h). Merely deleting the underscores corrected the problem entirely. It’s not clear why that is so, but I have noticed that OS X seems to have some special cases for handling ligature characters, and perhaps that is related.
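As a minimal sketch of the renaming itself: the sample line below is an illustrative TTX-style fragment rather than a line copied from the actual Linux Libertine dump, and strip_underscores is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop the underscores from a ligature glyph name.
sub strip_underscores {
    my ($glyph_name) = @_;
    (my $renamed = $glyph_name) =~ tr/_//d;
    return $renamed;
}

# Illustrative TTX-style line; a real dump may be formatted differently.
my $line = '<Ligature components="f,i" glyph="f_f_i"/>';

for my $old ('f_f_i', 'T_h') {
    my $new = strip_underscores($old);
    print "$old -> $new\n";
    # \Q...\E quotes the name so it is matched literally, and \b keeps
    # the match from starting or ending in the middle of a longer name.
    $line =~ s/\b\Q$old\E\b/$new/g;
}
print "$line\n";
```

Because the underscore counts as a word character, the \b anchors prevent a shorter name such as f_f from matching inside f_f_i.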

The following Perl script automatically renames the ligature glyphs in the fonts:

#!/usr/bin/perl
use strict;
use warnings;

# Make directories
mkdir "old-ttx" or die "mkdir old-ttx: $!";
mkdir "new-ttx" or die "mkdir new-ttx: $!";
mkdir "new-otf" or die "mkdir new-otf: $!";

# Convert fonts to ttx
system("ttx", "-d", "old-ttx", @ARGV) == 0 or die "ttx failed: $?";

# Fix each file
for my $old_ttx (glob "old-ttx/*.ttx") {
    my $new_ttx = $old_ttx;
    $new_ttx =~ s/^old/new/;
    print "Fixing $old_ttx\n";
    fixfile($old_ttx, $new_ttx);
}

# Convert the fixed ttx files back to fonts
system("ttx", "-d", "new-otf", glob("new-ttx/*")) == 0 or die "ttx failed: $?";

sub fixfile {
    my ($oldttx, $newttx) = @_;

    # First pass: collect the ligature glyph names that contain underscores
    open my $old, '<', $oldttx or die "open $oldttx: $!";
    my %ligatures = ();
    while (<$old>) {
        if (/^ *<Ligature .* glyph="([^"]*_[^"]*)"\/>$/) {
            my $lig = $1;
            my $modlig = $lig; $modlig =~ s/_//g;
            $ligatures{$lig} = $modlig;
        }
    }

    # Second pass: rewrite every occurrence, longest names first
    my @ligatures = sort { length($b) <=> length($a) } keys %ligatures;
    seek $old, 0, 0 or die "seek $oldttx: $!";
    open my $new, '>', $newttx or die "open $newttx: $!";
    while (<$old>) {
        for my $lig (@ligatures) {
            my $modlig = $ligatures{$lig};
            s/\b\Q$lig\E\b/$modlig/g;
        }
        print $new $_;
    }
    close $new or die "close $newttx: $!";
    close $old;
}

The script requires the ttx command-line tool (part of fontTools) to be installed.

To run, paste the above contents into a file, make it executable, and run it with the arguments being all the OTF files for the Linux Libertine fonts. The program will create three new directories for you; the one called “new-otf” is the one of interest. That folder will contain the new, corrected font files.

I hope that someone actually determines the source of this bug, rather than relying on this admittedly hackish solution.