A Comparative Study of PDF Generation Methods: Measuring Loss of Fidelity When Converting Arabic and Persian MS Word Files to PDF
Converting files to Portable Document Format (PDF) is popular due to the format's many
advantages. For example, PDF allows an author to control or preserve the rendering of a digital
document, distribute it to other systems, and ensure that it displays in a viewer as intended.
From the perspective of Human Language Technology (HLT), however, PDFs are problematic.
PDF is a display-oriented digital document format; the point of PDF is to preserve the
appearance of a document, not to preserve the original electronic text. We observed errors in
PDF-extracted text indicating that either the PDF generator or extractor, or both, mishandled the
document structure, character data, and/or entire textual objects. We also learned that other HLT
researchers had reported data loss when extracting electronic text from PDFs. These observations
motivated further study of digital document data exchange using PDFs.
MITRE conducted an exploratory study of data exchange using PDF in order to investigate the
data loss phenomenon. We limited our study to Middle Eastern electronic text, specifically
Arabic and Persian. The study included a test for scoring PDF generation methods that (a) used a
common, best-practice setup to generate PDFs and extract text, and (b) used character accuracy
to quantify the quality of PDF-extracted text. We ranked 8 methods according to the resulting
accuracy scores. The 8 methods map to 3 core PDF generation classes. At best, the Microsoft
Word class resulted in 42% Overall Accuracy. Best scores for the PDFMaker and Acrobat
Distiller/PScript5.dll classes were 95% and 96%, respectively.
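The abstract does not state the exact formula behind the character accuracy scores. As an illustration only, the sketch below assumes a common definition: one minus the character-level edit distance between the source text and the PDF-extracted text, divided by the length of the source text, floored at zero. The function names `edit_distance` and `character_accuracy` are hypothetical and are not taken from the study, and no Unicode normalization is applied.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance counted in Unicode code points."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the extracted text
                curr[j - 1] + 1,     # spurious character in the extracted text
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical example: a spurious space inside a Persian token.
source_text = "سلام دنیا"       # 9 characters in the original document
extracted_text = "سلا م دنیا"   # 10 characters after extraction from the PDF
score = character_accuracy(source_text, extracted_text)
```

Under this assumed definition, the single spurious space counts as one error against nine reference characters, so the score is 8/9. A study-grade scorer would also need to decide how to treat normalization forms, presentation-form characters, and whitespace, none of which this sketch addresses.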
This paper explains our tests and discusses the results, including evidence that using PDF for
data exchange of typical Arabic and Persian documents results in a loss of important electronic
text content. This loss confuses human language technologies such as search engines, machine
translation engines, computer-assisted translation tools, named entity recognizers, and
Furthermore, most of the spurious newlines, spurious spaces in tokens, spurious character
substitutions, and entity errors observed in the study were due to the PDF generation method,
rather than the PDF text extractor. Consequently, using a common configuration to convert reliable
electronic text to PDF for data exchange causes irretrievable loss of electronic text on the
Keywords: Digital Documents, File Conversion, Reliable Electronic Text, Human Language Technology, Portable Document Format, PDF, Microsoft Word, DOCX, Arabic, Persian, Character Error Rate, Data Exchange