Fixing a PDF Bug: Part 2

I previously wrote about my difficulties with the Linux Libertine font in PDF files displayed by the PDF.js viewer. In brief: PDF files using the Linux Libertine font, as compiled by XeTeX on OS X and displayed in PDF.js via Firefox, show ligatures as various incorrect characters, depending on where the ligatures were slotted in the original font’s table of characters. I fixed the problem by renaming the ligature glyphs.

However, this technique led to a new problem: text using ligatures could no longer be searched or copied. PDF files ordinarily record how to decompose ligatures into individual characters in a table, stored in the PDF file, called ToUnicode. With the original glyph names, the ToUnicode table was generated correctly, but once I renamed the glyphs, the ligatures no longer showed up in the table.
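To make the role of that table concrete, here is a rough sketch of what one ToUnicode entry for a ligature looks like. It assumes the common `bfchar` form, where a character code is paired with the UTF-16BE hex of the text it stands for. The helper name `bfchar_entry` and the code `0x0BB8` are made up for illustration, and the sketch only handles characters in the Basic Multilingual Plane.

```perl
use strict;
use warnings;

# Hypothetical helper: format one bfchar-style ToUnicode entry.
# $code is the character code used in the PDF (made up here);
# $text is the string a viewer should yield on search or copy.
sub bfchar_entry {
    my ($code, $text) = @_;
    # Each character becomes four hex digits (UTF-16BE, BMP only).
    my $hex = join '', map { sprintf('%04X', ord($_)) } split //, $text;
    return sprintf('<%04X> <%s>', $code, $hex);
}

print bfchar_entry(0x0BB8, 'ffl'), "\n";   # prints <0BB8> <00660066006C>
```

When an entry like this is missing for a ligature, the viewer has nothing to map the glyph back to, which matches the broken search and copy behavior described above.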

It was strange that they were not showing up in the table. One would think that XeTeX would be able to determine the composition of ligatures, since each font contains data indicating how individual characters are aggregated into a ligature. So it would seem logical that the ToUnicode table would simply be a translation of the font’s ligature data. This turned out not to be the case.

As I explained in the previous post, the ligature glyphs were originally named for their individual characters, separated by underscores. So the “ffl” ligature was named “f_f_l” in the font. This follows an Adobe convention for naming ligatures after their component characters.

Surprisingly, XeTeX (or whatever PDF-generating library it uses) relies on that convention: instead of using the ligature data in the font to generate the ToUnicode table, it uses the underscore-separated glyph name. So, for example, if I renamed the “ffl” ligature to “X_Y_Z” and generated a PDF with the word “waffles” in it, then copying that word would place the text “waXYZes” in the clipboard.
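The behavior is easy to mimic. The following is only a sketch of the naming convention, not XeTeX's actual code: the helper `glyph_name_to_text` is hypothetical, and it assumes every underscore-separated component is a literal character.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the copy-paste text for a ligature glyph
# purely from its underscore-separated name.
sub glyph_name_to_text {
    my ($name) = @_;
    return join '', split /_/, $name;
}

# The word "waffles" typeset as "wa" + ligature glyph + "es".
for my $name ('f_f_l', 'X_Y_Z') {
    print $name, ' -> wa', glyph_name_to_text($name), "es\n";
}
# prints:
# f_f_l -> waffles
# X_Y_Z -> waXYZes
```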

This meant that I was back at square one: if I put the underscores in the ligature names, then PDF.js would display characters incorrectly, but if I used different names, then searching and copy-paste would not work. So I needed a different solution.

Remember how I said that the incorrectly displayed characters were based on the position where the ligatures were slotted in the font? It occurred to me that if I changed the slots for the ligatures, I might change the results and perhaps even solve the problem. Looking at some other fonts suggested I was on the right track: those fonts placed ligatures around position 300 in the font, while Linux Libertine placed them at about position 3000.

So I just moved the ligatures into approximately slot 300 and up. And it worked!
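The move itself amounts to reordering a list of glyph names. Here is a minimal sketch of the idea on a toy glyph order. The helper `move_ligatures` is hypothetical, and it assumes that any glyph with an underscore in its name is a ligature and that the anchor glyph (here `zcaron`) sits near the desired slot.

```perl
use strict;
use warnings;

# Hypothetical helper: return a new glyph order in which every
# underscore-named glyph sits immediately after the glyph named $anchor.
sub move_ligatures {
    my ($anchor, @order) = @_;
    my @ligatures = grep { /_/ } @order;
    my @result;
    for my $glyph (@order) {
        next if $glyph =~ /_/;                       # drop from its old slot
        push @result, $glyph;
        push @result, @ligatures if $glyph eq $anchor;
    }
    return @result;
}

my @order = ('.notdef', 'a', 'zcaron', 'b', 'c', 'f_f_l', 'f_i');
print join(' ', move_ligatures('zcaron', @order)), "\n";
# prints: .notdef a zcaron f_f_l f_i b c
```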

Here is the updated script that will convert the fonts, again using ttx.

#!/usr/bin/perl -w

# Make directories
mkdir "new-ttx" or die "mkdir new-ttx: $!";
mkdir "new-otf" or die "mkdir new-otf: $!";

# Convert fonts to ttx
unless (-d 'old-ttx') {
    mkdir "old-ttx" or die "mkdir old-ttx: $!";
    system "ttx", "-d", "old-ttx", @ARGV;
}

# Fix each file
for my $old_ttx (<old-ttx/*.ttx>) {
    my $new_ttx = $old_ttx;
    $new_ttx =~ s/^old/new/;
    print "Fixing $old_ttx\n";
    fixfile($old_ttx, $new_ttx);
}

# Compile the fixed ttx files back into fonts
system "ttx", "-d", "new-otf", glob("new-ttx/*");

sub fixfile {
    my ($oldttx, $newttx) = @_;

    # First pass: collect the GlyphID lines of all underscore-named glyphs
    open my $old, '<', $oldttx or die "open $oldttx: $!";
    my @ligatures = ();
    while (<$old>) {
        if (/^ *<GlyphID .* name="[^"]*_[^"]*"\/>$/) {
            push @ligatures, $_;
        }
    }

    # Second pass: copy the file, moving the ligatures to a new slot
    seek $old, 0, 0;
    open my $new, '>', $newttx or die "open $newttx: $!";
    while (<$old>) {
        if (/^ *<GlyphID .* name="(zcaron|uniFFFD)"\/>$/) {
            # Insert the ligatures right after this glyph
            print $new $_;
            print $new @ligatures;
            @ligatures = ();
        } elsif (/^ *<GlyphID .* name="[^"]*_[^"]*"\/>$/) {
            # Skip the ligature in its original slot
        } else {
            print $new $_;
        }
    }
    close $old;
    close $new or die "close $newttx: $!";
    warn "LIGATURES NOT PRINTED" if @ligatures;
}