What's in a PDF? The Challenges of the Popular Portable Document Format
Editor's note: AccessWorld Solutions, the consulting arm of the American Foundation for the Blind, has worked with Adobe since August 2003 to help improve the accessibility and usability of several Adobe products for people with disabilities, including versions 6.0 and 7.0 of Adobe Acrobat and Adobe Reader. Jamal Mazrui is not affiliated with AccessWorld Solutions or the American Foundation for the Blind.
Portable Document Format (PDF) is an electronic file format developed by Adobe Systems of San Jose, California. PDF has become one of the most popular file formats for publishing documents on the Web and is thus a common medium for the dissemination of knowledge. This article identifies features behind the popularity of PDF, analyzes their impact on accessibility, and discusses the use of the Adobe Reader program with a screen reader, such as JAWS or Window-Eyes.
Adobe publishes an official specification of PDF, which has evolved over the years to version 1.6 at present. Compared to other formats that can be used for storing and distributing documents electronically, such as HTML or Microsoft Word, PDF is distinguished by at least four features: visual fidelity, compact storage, security settings, and cross-platform portability.
By preparing a document in PDF, one can be reasonably confident that the precise visual appearance that is intended is presented to the reader, including layout, fonts, colors, and pictures. This is true whether the output is displayed on the computer screen or printed as hard copy. Since a PDF file is internally divided into pages of output, each page of an author's work will have the look and feel that he or she wants to convey. This visual fidelity is a reason why PDF is widely used for distributing publications in electronic form.
A document in HTML format is typically divided into multiple files that are presented as separate pages on a web site. Moreover, pictures are further separated as graphics files that are linked to the text pages. Thus, distributing a document in HTML usually involves collecting various files at the source and placing them in an appropriate arrangement at the destination for the document to be coherent.
If a document is prepared in PDF, on the other hand, all the text and graphics are bound in a single file. In addition, this file is compressed: Techniques are used for storing repeating sequences of data in more compact ways, thus reducing the total size. The software for viewing a PDF file automatically decompresses the data as it presents the content in readable form. This compact storage means that a web site can store each publication as a single file, a user can download it faster, and both sending and receiving are easier.
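The effect of storing repeated sequences compactly can be seen with a general-purpose compressor. The following is a minimal sketch using Python's standard zlib module, which implements the Deflate method (one of the compression filters that PDF permits). The sample text is invented for illustration and is far more repetitive than a typical document, so real savings will be smaller.

```python
import zlib

# Invented sample: one phrase repeated many times.
original = b"Portable Document Format " * 200

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before compression")
print(len(compressed), "bytes after compression")
print("round trip intact:", restored == original)
```

The viewing program performs the equivalent of the decompress step automatically, so the reader never sees the compact form.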
PDF contains optional settings that an author can incorporate to limit how a PDF file is used. Without such restrictions, the Adobe Reader program permits a user to view a PDF file on the screen, print it, copy it to the clipboard, and save it to disk in plain text format. With security settings, however, any of the uses besides on-screen viewing may be blocked completely or limited in some way. For example, only a portion may be copied to the clipboard or only a range of pages may be printed once a week. Stricter settings can prevent a PDF file from being viewed on any computer that does not contain a license key for a specific PDF file. The mechanism is similar to those that are sometimes used to prevent unauthorized copying of software to other computers. These security settings mean that authors can choose to limit who uses their documents and how.
An integral piece of PDF support is the free software that Adobe develops for viewing PDF files on several computer platforms and operating systems, including Microsoft Windows, Apple Macintosh, UNIX, and handheld personal digital assistants. The Adobe Reader program ensures that a PDF file can be viewed with the same visual fidelity on almost any type of computer. Since the program may be obtained without charge, cost is not an obstacle to viewing a document that is available in PDF. This cross-platform portability means that authors can disseminate their works widely.
The popularity of PDF as a means of distributing publications has some benefit for people who are blind or have impaired vision. In general, electronic publications offer more potential for accessible, independent reading than do print publications, since computer programs can produce output in flexible and alternative ways, including synthetic speech, braille, and magnified text. This means that an intermediary sighted assistant is not needed, thus providing convenience and privacy. The benefits of PDF, previously discussed, help to increase the amount of reading material that is published in electronic form. In addition, someone who is visually impaired benefits directly, as others do, from particular PDF features, such as compact storage.
Yet, some PDF features that provide benefits of a general nature have had inadvertent adverse side effects for nonvisual readers. To understand why, this section explains some technical inner workings of PDF. The specification for the current version 1.6 is over 1,200 pages long. To keep within the scope of this article, the discussion will necessarily simplify a technical explanation of the format, focusing on the concepts most relevant to accessibility.
The PostScript Language
PDF originates in a specialized programming language, called PostScript, developed by Adobe in the 1980s. Part of the power of PostScript derives from its flexibility about the order in which parts of output are placed on a page. The order does not have to be from left to right and top to bottom. A PostScript-enabled printer produces output a page at a time. Each page of output is transmitted as a batch after all drawing operations on it are complete. An observer of the visual page may guess, but does not actually know, in what order the output was drawn.
Three Components of Output
Producing output may be subdivided into drawing three components: textual characters, vector graphics, and photographic images. How these different objects are used and combined has implications for accessibility, as explained later.
Textual characters are based on a font table: a set of associations between the visible form of a character and its numeric value in a system called Unicode. The historically popular code called ASCII (American Standard Code for Information Interchange) defines 128 characters, or 256 in its common 8-bit extensions, which typically suffice for expressing English and other European languages. Unicode, by comparison, defines tens of thousands of characters in order to support numerous written languages of the world, as well as many specialized symbols used in particular subject areas. A PostScript program draws a string of characters on a page by using the Unicode value of each character and looking up its associated shape in a font table.
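The difference in range between the two codes can be checked directly. The sketch below assumes the 7-bit ASCII range of code values 0 through 127; the describe function is a hypothetical helper written for this illustration.

```python
def describe(ch: str) -> str:
    """Report a character's Unicode code point and whether 7-bit ASCII covers it."""
    code = ord(ch)
    coverage = "ASCII" if code < 128 else "beyond ASCII"
    return f"U+{code:04X} ({coverage})"

samples = [
    "A",        # Latin capital letter A
    "\u00e9",   # Latin small letter e with acute accent
    "\u20ac",   # euro sign
    "\u2813",   # braille pattern dots 1-2-5
]

for ch in samples:
    print(ch, describe(ch))
```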
Besides textual characters, many other kinds of shapes may be drawn on a page based on mathematical calculations. Such shapes--called vector graphics--may be straight or curved lines, geometric designs such as circles or squares, or filled areas according to a pattern. In fact, PostScript can draw vector graphics to create a picture of almost anything on a page.
A third component of output is a photographic image, which may be thought of as an array of colored dots that create a literal picture. PostScript does not know the internal structure of an image, so it essentially copies the image to a particular location on the page rather than generating it. Such images are typically defined in a format called TIFF (Tagged Image File Format).
The PDF File Type
Adobe built PDF as a file type on the foundation of PostScript as a printing language. PDF is a way that documents can be viewed on the screen and exchanged among users, not just printed onto paper. PDF uses the same "imaging model" as PostScript for describing how a page looks. A PDF file contains an abbreviated set of PostScript instructions: basically, a sequence of drawing operations without other programming constructs such as conditions and loops.
Hence, a PDF document is a file that contains PostScript instructions and the data they use. The commands and data follow certain rules that Adobe has defined as the specification for Portable Document Format. As opposed to a file format whose internal structure is only known by its developers, the PDF specification is published and open rather than private and proprietary. It is copyrighted and controlled by Adobe, but anyone is free to use it for developing software that either creates or views PDF files within general licensing terms. Adobe also publishes a free viewing and printing program for many different devices so that all understand PDF in the same way. Adobe has, therefore, established the combination of a file format and software interpreter that enables authors to publish documents with a certain look and feel for potential readers in a broad variety of environments.
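The sequence of drawing operations on a page can be pictured as a flat list of operands followed by operators. The sketch below is a simplified illustration: the sample uses real PDF text operators (BT, Tf, Td, Tj, ET), but it is hand-written and uncompressed, and the small extractor is a hypothetical helper that ignores escaped characters, hexadecimal strings, and other forms that real files use. Because the second Td moves the text position upward, the string drawn second appears higher on the page, which shows how drawing order can differ from visual order.

```python
import re

# Hand-written, uncompressed page content for illustration only.
content_stream = """
BT
/F1 12 Tf
72 712 Td
(Second line drawn first) Tj
0 14 Td
(First line drawn second) Tj
ET
"""

def shown_strings(stream: str) -> list:
    """Return the literal strings passed to the Tj operator, in drawing order."""
    return re.findall(r"\(([^()\\]*)\)\s*Tj", stream)

print(shown_strings(content_stream))
```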
Three Types of PDF Files
PDF files may be subdivided into three types: image-only, searchable image, and formatted text and graphics. These types differ in their use of the different components just described--textual characters, vector graphics, and photographic images.
An image-only PDF contains a photographic image representing each page, and virtually no textual characters or vector graphics. Although text may appear on a page, the text is actually a surface picture without underlying characters. Individual characters are needed for translation into speech or braille, so an image-only PDF file is inaccessible.
Image-only PDF files are usually created by scanning hard-copy documents into a computer with attached scanning equipment. Essentially, the system takes a picture of each printed page and then packages the pages in a PDF file. It is possible to use optical character recognition (OCR) software to create textual characters in the PDF file, but this is often not done because the process takes much longer: minutes for OCR compared to seconds for photographic snapshots. Another reason for avoiding OCR is that the resulting text usually contains recognition errors that require manual proofreading and correction to be accurate, thereby involving more staff time and skill.
Scanning documents into image-only PDF files has been a common way of storing information for archival purposes because electronic media are much smaller and less cumbersome than paper storage. The more that documents originate in electronic, rather than hard-copy, form, the less likely it is that they need to be scanned to be archived. Thus, as authors rely more on computers as the original source of documents, the accessibility problem of image-only PDF may lessen over time.
Searchable-image PDF also contains an image for each page, but this type includes a text layer as well. The textual characters are produced from an OCR process, which analyzes each image for what appear to be characters. Wherever characters are recognized in the image, the software draws a layer of text under them. An observer of the page sees the surface image only, as with image-only PDF.
The text layer enables a PDF file to be searched for phrases of interest to a reader who is viewing the document. This text also enables PDF files to be indexed with keywords in a collection of electronic documents, thus permitting a researcher to find particular ones worth further study.
Adding a text layer increases the size of a PDF file, so text may be omitted if compactness is of primary importance. Usually, however, the ability to search, for sighted as well as visually impaired readers, outweighs the cost in extra size, especially since the text is compressed, as previously mentioned. Since nonvisual access to PDF content requires text, adding searchability to a PDF file also benefits accessibility.
Formatted Text and Graphics
A third PDF type, called formatted text and graphics, minimizes the use of photographic images in favor of textual characters and vector graphics. No image layer rests on top of a text layer. Instead, textual characters and vector graphics are drawn wherever they can represent the content of a page. Photographic images are used only when they are pictures that cannot be generated from building blocks of textual characters and vector graphics. This type of PDF is usually the result of conversion from another electronic file format, such as Microsoft Word. This type is the most compact (often 10% of an image-only file with the same content). Also, since this type is built from more structured components, it may be used more flexibly for other purposes. For example, such a PDF file might be converted to HTML for display as web pages or converted to Microsoft Word for editing as part of another document.
A PDF file composed as formatted text and graphics is likely to be more accessible than one composed as searchable image. Although both types contain textual characters, the quality of the text is almost necessarily better in the former type because it serves the purpose of presentation as well as searchability. If the PDF file was created by scanning, more work has probably been done than with the searchable-image type in order to correct OCR errors and achieve presentable text. If the PDF file was created by converting another electronic format, then the textual components are probably more complete, since they derive directly from character fonts rather than indirectly from recognized images. Despite the accessibility potential of this PDF type, however, issues of a structural nature may still pose significant accessibility problems, as subsequently explained.
Textual characters are a necessary condition for the accessibility of PDF, but they are not sufficient on their own. Some PDF-creation tools do not leave enough information about the fonts used for a PDF viewing program to decipher all the characters in terms of a well-understood computer alphabet. The viewing program sees shapes that it knows are characters drawn on the page. The program then has to do a back-translation of their drawing operations, looking up the Unicode value for each shape and rendering it as a standard screen character. If the original font table is embedded in the PDF file, the viewing program can decode the characters. Decoding is also possible if a common font was used, such as one built into the operating system. Without an available font table, however, the viewing program does not know what textual characters exist because it does quick table lookups rather than sophisticated OCR.
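The table lookup described above can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any particular viewing program: the glyph codes, the table contents, and the decode_glyphs function are all hypothetical, and the Unicode replacement character (U+FFFD) stands in for any shape that cannot be identified.

```python
from typing import Dict, List, Optional

UNKNOWN = "\ufffd"  # Unicode replacement character

def decode_glyphs(codes: List[int], font_table: Optional[Dict[int, str]]) -> str:
    """Map each glyph code to a character, or to UNKNOWN when no mapping exists."""
    if font_table is None:
        # No table is available, so nothing can be looked up.
        return UNKNOWN * len(codes)
    return "".join(font_table.get(code, UNKNOWN) for code in codes)

# Hypothetical font whose internal glyph codes differ from Unicode values.
embedded_table = {1: "P", 2: "D", 3: "F"}
drawn_codes = [1, 2, 3]

print(decode_glyphs(drawn_codes, embedded_table))  # table available
print(decode_glyphs(drawn_codes, None))            # table missing
```

With the table, the three shapes decode to readable letters; without it, a screen reader receives only placeholders, even though the page looks identical on the screen.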
Even if complete character decoding is possible, a PDF file may be inaccessible because of problems in "reading order." This term refers to the order of words, sentences, and paragraphs. Can they be extracted from the text of the PDF file in a coherent, linear order, or are they mixed together in disconnected, confusing ways?
For example, the text of a PDF file may appear visually like newspaper columns, where a line stops midway across the page and continues underneath, rather than continuing across to the right margin. Visually, on a screen or printout, the structure of the document is apparent because of extra spacing or a border line that indicates where one column of text ends and another begins. Information about this document structure, however, must be represented in the PDF file for the reading order to be rendered in an intelligible manner by assistive technology. Without structural information that groups and separates regions of the page, the document may be inaccessible to nonvisual readers.
Since PDF is frequently chosen for publications that are intended to look fancier than single-column text, PDF files often contain irregular page layouts with multiple columns, sidebars, and picture captions. If these files lack an internal structure, a nonvisual interpretation of them necessarily involves guesses about reading order, and mistakes can seriously undermine the comprehension of their content.
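The effect of a wrong guess about reading order can be shown with a small sketch. The page fragments, coordinates, and function names below are invented for illustration. The first function reads strictly top to bottom and left to right, as a program might when it has no structural information; the second is told where the column boundary lies.

```python
from typing import List, NamedTuple

class Fragment(NamedTuple):
    x: float   # left edge of the text, measured from the left of the page
    y: float   # distance from the top of the page
    text: str

# Invented two-column page: left column at x=72, right column at x=320.
page = [
    Fragment(72, 100, "Left column, line 1."),
    Fragment(320, 100, "Right column, line 1."),
    Fragment(72, 114, "Left column, line 2."),
    Fragment(320, 114, "Right column, line 2."),
]

def naive_order(fragments: List[Fragment]) -> List[str]:
    """Top to bottom, then left to right, ignoring columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]

def column_order(fragments: List[Fragment], column_boundary: float) -> List[str]:
    """Everything left of the boundary first, then everything to its right."""
    ordered = sorted(fragments, key=lambda f: (f.x >= column_boundary, f.y, f.x))
    return [f.text for f in ordered]

print(naive_order(page))
print(column_order(page, column_boundary=300))
```

The naive version interleaves the two columns line by line, which is the kind of disconnected output a nonvisual reader may encounter in an unstructured file.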
Tagging PDF Files
To address such accessibility problems, Adobe introduced an extension to PDF called "tagging." The concept is similar to tags in the HTML format. As background, the World Wide Web Consortium (W3C) did pioneering work with HTML tags to incorporate the document structure that was needed for accessibility as the HTML standard evolved.
HTML encloses portions of text with markers that indicate the structure or purpose of the text. For example, a phrase may be tagged as the heading of a section, the caption of an image, or a cell within a table. Some tags are necessary for proper visual display in a web browser that interprets HTML files, whereas other tags--although still a standard part of the HTML language--are recommended specifically to aid accessibility. For example, accessibility tags include an indication of the row and column labels of a table, which enables a screen reader to tell the user about the context of each cell. The cell information may be useless or confusing without knowing the associated row and column labels. Collectively, the HTML tags that are needed for accessibility are sometimes called "accessible markup."
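The value of row and column labels can be sketched as follows. The table data and the describe_cell function are invented for illustration, and the sketch assumes the simplest case, in which the first row holds the column labels and the first column holds the row labels. Actual screen readers word their announcements differently.

```python
from typing import List

def describe_cell(table: List[List[str]], row: int, col: int) -> str:
    """Build a phrase that gives a data cell together with its labels."""
    column_label = table[0][col]   # first row holds column labels
    row_label = table[row][0]      # first column holds row labels
    return f"{row_label}, {column_label}: {table[row][col]}"

# Invented data.
table = [
    ["Region", "2004", "2005"],
    ["North", "10", "12"],
    ["South", "7", "9"],
]

print(describe_cell(table, 1, 2))
print(describe_cell(table, 2, 1))
```

Without the labels, the listener would hear only "12" or "7", with no indication of which region or year the number describes.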
The tagged PDF that Adobe developed provides similar functionality. Tags mark portions of PDF content and are organized in a sequence that conveys the suggested reading order. Whereas HTML files are readable text with tags as words enclosed in brackets, however, PDF files are in a compressed, binary form with tags that can be viewed only with special software, such as Adobe Acrobat.
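One way to picture how a sequence of tags conveys reading order is as a tree that is read from top to bottom. The sketch below is a simplified model, not the actual internal layout of a tagged PDF file: the Tag class and reading_order function are hypothetical, and the tag names are used only as illustrative labels.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tag:
    kind: str                      # illustrative labels such as "H1" or "P"
    text: str = ""
    children: List["Tag"] = field(default_factory=list)

def reading_order(tag: Tag) -> List[str]:
    """Depth-first walk: a tag's own text, then its children in sequence."""
    lines = [f"{tag.kind}: {tag.text}"] if tag.text else []
    for child in tag.children:
        lines.extend(reading_order(child))
    return lines

# Invented document with a heading, a two-column section, and a picture.
document = Tag("Document", children=[
    Tag("H1", "Title"),
    Tag("Sect", children=[
        Tag("P", "Left column text"),
        Tag("P", "Right column text"),
    ]),
    Tag("Figure", "Chart of results"),
])

for line in reading_order(document):
    print(line)
```

Because the order comes from the tree rather than from positions on the page, the left column is read in full before the right column, whatever order the drawing operations used.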
Accessibility Standards and Incentives
The W3C has defined standards for accessible markup, called the "Web Content Accessibility Guidelines" (WCAG 1.0). The U.S. government has also defined accessibility standards for web sites, software, and other information technology in regulations that were first published in 2001 to implement Section 508 of the Rehabilitation Act, as amended. (See For More Information at the end of this article for a link to these regulations.) Section 508 mandates that federal agencies provide information to people with disabilities in a manner that is comparable to that provided to people without disabilities.
Section 508 does not require software manufacturers to make accessible products, but it does provide them with significant market incentives to do so because the federal government is a large customer that is interested in products that meet minimum standards of accessibility. Indeed, Congress adopted Section 508 partly with the stated purpose of creating voluntary market incentives to develop technologies that benefit people across a broad range of physical characteristics, not just those with typical levels of eyesight, hearing, manual dexterity, and other traits.
Adobe, like other companies that sell to the federal government, has noticeably increased the accessibility of its products in recent years, and its web site includes information on compliance with Section 508 standards. The tagged PDF format is an accessibility innovation that the company introduced in 2001. Besides the free program for viewing PDF files, called Adobe Reader, Adobe sells a commercial program for creating PDF files, including tagged PDF files, called Adobe Acrobat. The program is available in both a Standard and Professional version, with the latter having the most tagging features and being recommended by Adobe to customers who are concerned with accessibility.
The basic content and layout of a PDF document is usually created and revised using a word-processing program, such as Microsoft Word or Corel WordPerfect, and is then converted to PDF to create the final form, exploiting features like visual fidelity, compact storage, security settings, and cross-platform portability, as previously described. Adobe Acrobat enables one to convert a document into PDF from other formats, including plain text, HTML, and popular word-processing programs. It lets one combine multiple source documents into a single PDF file, such as a report consisting of a Microsoft Word narrative and a Microsoft Excel spreadsheet. It then allows the author or designer to touch up the appearance for the precise presentation that is desired.
Adobe Acrobat includes a feature that analyzes the accessibility of a PDF file. It reports potential problems, such as characters that are unidentifiable, structure that is ambiguous, or pictures that are unlabeled. A related feature adds tags when this can be done with a high degree of certainty about what markup is appropriate in the context of the document. For example, it may associate each page footer with a corresponding tag when the analysis finds significant space between the rest of the page and the last line of text and that line contains a page number.
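The footer example can be expressed as a simple rule. This is a rough sketch of the kind of test described, not Acrobat's actual analysis: the function name, the gap threshold of 24 points, and the pattern used to spot a page number are all assumptions made for illustration.

```python
import re

def looks_like_footer(last_line: str, gap_above: float, min_gap: float = 24.0) -> bool:
    """Guess that the last line of a page is a footer.

    The guess is positive only when the line is set apart from the text
    above it by at least min_gap and contains a standalone number.
    """
    has_page_number = re.search(r"\b\d+\b", last_line) is not None
    return gap_above >= min_gap and has_page_number

print(looks_like_footer("Page 12", gap_above=36.0))
print(looks_like_footer("Page 12", gap_above=6.0))
print(looks_like_footer("continued overleaf", gap_above=36.0))
```

A rule this simple will sometimes be wrong, for example when the last sentence of a short page happens to contain a number, which is why automatic tagging is limited to cases of high certainty and manual review remains necessary.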
Adobe Acrobat cannot tell what a picture contains, so an author needs to enter a caption tag for the picture manually. Tables also present a challenge. Does the left column of the table consist of labels for the rows to the right, or does it consist of actual data in a table with column labels but no row labels?
The accessibility report that is produced by Acrobat identifies potential problems that one typically corrects by selecting a portion of the document and picking a tag to indicate its purpose. This manual tagging process may involve significant time and skill, depending on the complexity of the document.
Using Adobe Reader
Adobe and Screen Readers
Assistive technology companies, such as Freedom Scientific, the developer of JAWS, and GW Micro, the developer of Window-Eyes, have worked with Adobe to make their screen readers understand the tags of a PDF file that is viewed in Adobe Reader (or Acrobat) and thereby render more accessible output in speech or braille. At the time of this writing, the latest release of Adobe Reader is version 7.0.3, which requires Windows 2000 or XP. When Adobe Reader is launched, it detects whether a screen reader is running. If so, it presents a dialog box of configuration options that affect accessibility and sets the default choices to ones that Adobe Reader finds are the most likely to work best.
The most significant accessibility setting is called "infer reading order from document." With this setting active, Adobe Reader will analyze an untagged PDF file and add temporary tags to optimize its reading order. The analysis examines spacing between blocks of text, for example, to decide whether there are multiple columns of information.
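The idea of examining spacing to detect columns can be sketched as a search for wide horizontal gaps between blocks of text. This is not Adobe Reader's algorithm, which is not described here; the function name, the block coordinates, and the 18-point threshold are assumptions made for illustration.

```python
from typing import List, Tuple

def find_column_gaps(blocks: List[Tuple[float, float]],
                     min_gap: float = 18.0) -> List[Tuple[float, float]]:
    """Return horizontal gaps of at least min_gap between text blocks.

    Each block is given as (left, right), its horizontal extent on the page.
    """
    gaps = []
    reach = None  # rightmost edge covered by the blocks seen so far
    for left, right in sorted(blocks):
        if reach is not None and left - reach >= min_gap:
            gaps.append((reach, left))
        reach = right if reach is None else max(reach, right)
    return gaps

# Invented page with two columns of text blocks.
two_columns = [(72, 290), (72, 288), (320, 540), (322, 538)]
one_column = [(72, 540), (72, 500)]

print(find_column_gaps(two_columns))
print(find_column_gaps(one_column))
```

A gap found this way suggests a column boundary, but a table or a wide indent can produce the same spacing, so the result remains an inference rather than a fact about the document.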
Although the automatic tagging process is beneficial for reading order, it has three drawbacks. First, with a large PDF file, containing more than 50 pages, the process may take a few minutes or more to complete, depending on the complexity of the document and the speed of the computer. Second, one may not be able to work with other programs while a document is being tagged because the tagging process may slow the other programs to an unusable crawl. Third, the tagging process does not signal when it is complete, so one has to keep checking with the screen reader to determine whether the file is ready for reading.
Because of the drawbacks of automatic tagging, Adobe Reader asks the user to confirm whether to add tags before initiating the process each time it opens a file. The user will usually want the tagging for better reading order. If the extra confirmation step seems inefficient or annoying, however, one can turn it off. The downside is that the computer will then become unusable for a few minutes whenever a large PDF file is opened and automatic tagging occurs for the whole file. This tagging process occurs even if the same file has been opened before--such tags are temporary and not saved by Adobe Reader from one session to another.
If the confirmation setting is on and the user declines to add tags to the whole file up front, the user can still read a large PDF file a page at a time. Whenever the user navigates to a new page, however, there is a pause of a few seconds while Adobe Reader adds temporary tags for that page and communicates them to the screen reader.
The many configuration settings of Adobe Reader are located in the Preferences dialog box under the Edit menu. A hot key for this dialog box is Control-K. Users of JAWS versions prior to 6.1 should note that pressing its bypass key, Insert-3, may be necessary before pressing Control-K because JAWS uses Control-K for other purposes.
Accessibility-related settings of Adobe Reader are located primarily in two tab pages of the Preferences dialog box, those named Accessibility and Reading. Adobe Reader also groups most accessibility settings in another dialog box, however, called the Accessibility Setup Assistant, which is a choice on the Help menu. This convenient dialog box lets you configure screen-reader settings, screen-magnifier settings, or both. It lets you either accept all recommended settings or customize settings through a series of wizard pages. It is suggested that you accept all recommended settings initially and then explore possible modifications later if your results are unsatisfactory.
Since screen reader users rely on common hot keys, rather than pointing and clicking with the mouse, an application may be more challenging if it involves nonstandard keystrokes. This is partly true of the screen reader interface to Adobe Reader. For example, one has to learn that Control-Shift-PageUp, rather than Control-Home, goes to the top of the document. Configuration options are on the Edit menu, rather than on the View or Tools menu. Some unconventional interface elements may exist because Adobe makes versions of its Reader software for several operating systems, and so may trade some Windows conventions for cross-platform consistency.
The problem of an unconventional interface, however, is also due to screen reader adjustments made to accommodate the two different tag modes available: single page or whole document. Using the example above, Control-Home is, in fact, the hot key for going to the top of a document in Adobe Reader, just as in other Windows programs. When a screen reader is running, however, it uses Control-Home to go to either the top of the document or the top of the page, depending on whether document or page mode is active. Therefore, Control-Shift-PageUp is implemented as a way to always go to the top of the document.
Useful Hot Keys
Some nonstandard but useful hot keys of Adobe Reader are as follows:
- Control-PageDown or Control-PageUp: Go to the next or previous page
- Control-Shift-PageDown or Control-Shift-PageUp: Go to the bottom or top of the document
- Control-K: Go to the Preferences dialog box
- Control-D: Display document properties, including security settings and tagged status that affect accessibility
- Control-Shift-6: Check for accessible reading order
- Alt-F then V: Save to text
- Alt-H then T: Accessibility Setup Assistant
JAWS vs. Window-Eyes
Accessibility comparisons between JAWS and Window-Eyes are often challenging to make because each program may adopt and add to features that the other started six months before. Both companies claim to provide support for Adobe Reader that is comparable to their support for Internet Explorer. With JAWS 6.20 and Window-Eyes 5.0, we observed progress toward this end.
The table navigation commands of JAWS, which previously worked with web pages in Internet Explorer, now also work with PDF files in Adobe Reader. The Adobe Reader Find command, invoked with Control-F, does not work with JAWS. It does work with Window-Eyes, but after a noticeable delay. Both screen readers, however, have implemented alternate Find commands that work better: Control-Insert-F using JAWS or Control-Shift-F using Window-Eyes. Neither screen reader fully identifies security settings in the Document Properties window without requiring navigation of the window using mouse-simulation keys.
In general, both screen readers are sluggish in Adobe Reader, enough so that we sometimes felt frustrated by the experience of inefficiency (when run under Windows 2000 on a Pentium 4 computer at 1.9 GHz).
The Bottom Line
PDF files are widespread, and people who are blind or have impaired vision need to be able to read them. Although the original format made accessibility difficult, the newer, tagged format holds promise, and recent versions of Adobe Reader work better with screen readers.
For More Information
Adobe Systems accessibility page: <www.adobe.com/accessibility>.
Adobe page on Section 508 compliance: <www.adobe.com/enterprise/accessibility/section508.html>.
Adobe Reader download page: <www.adobe.com/products/acrobat/alternate.html>.
Using Accessible PDF Documents with Adobe Reader 7.0: A Guide for People with Disabilities: <www.adobe.com/enterprise/accessibility/reader/main.html>.
Online PDF conversion tool: <www.adobe.com/products/acrobat/access_onlinetools.html>.
Creating Accessible PDF Documents with Adobe Acrobat 7.0: <www.adobe.com/enterprise/accessibility/pdfs/acro7_pg_ue.pdf>.
Web content accessibility guidelines: <www.w3.org/WAI/GL/2005/06/f2f-agenda.html>.
Section 508 regulations: <www.access.gpo.gov/nara/cfr/waisidx_04/36cfr1194_04.html>.
Section 508 technical assistance: <www.accessboard.gov/sec508/guide/1194.22.htm>.
The opinions expressed in this article are those of the author and do not necessarily represent the views of the Federal Communications Commission or the United States Government.
Copyright © 2005 American Foundation for the Blind. All rights reserved. AccessWorld is a trademark of the American Foundation for the Blind.