PDF Analysis

Abstract

A description of the PDF Analysis baseline: where it comes from, what it does, and what we plan for it.

Introduction

Adobe’s PDF has long been a popular format on the World Wide Web. First released in 1993 as a successor to PostScript, the ubiquitous page description language for print, PDF is a document description format created to encapsulate so-called “rich” information easily.

Whereas HTML is a language for describing content by marking up sections as paragraphs, headers, and so forth, PDF describes documents, i.e. how they will look on screen or, more often, when printed.

A PDF document can, at one end of the scale, preserve the original look of its source to a high degree of precision, and is therefore the format of choice for material where the visual presentation is important.

Books destined for print, historical manuscripts, educational texts, brochures and similar documents are prime examples of PDF usage.

This emphasis on visual presentation can make PDF difficult to use in non-visual contexts, for example for people with low or no eyesight. Fortunately, PDF as a format has evolved to meet this challenge.

As of version 1.3, PDF files can be structured by way of short textual elements known as tags, just as in HTML. These describe the nature of the information in the document, as well as overall structure such as tables of contents and outlines.

Testing PDF Accessibility

Both the W3C, as part of its WCAG 2 effort, and Adobe, through the PDF/UA activity, have worked on guidelines and recommendations for improving PDF accessibility.

No single set of guidelines exists today. For a majority of testers, however, it is likely that WCAG 2 is the goal to aim for.

PDF and siteSifter

As of May 2012, siteSifter uses a combination of techniques to test PDF files. The tests are collected in the PDF Analysis baseline, which is available to all clients, regardless of subscription.

The baseline, while fully operational, is also evolving. One example is checkpoint 3, which examines whether images have textual alternatives. At present only a subset of images is processed, but we will add more features in revisions to come.

We are also working on implementing more of the WCAG 2 techniques. As usual, we cannot automate them all. Some, such as PDF4 “Hiding decorative images with the Artifact tag”, cannot be programmatically determined: what, after all, is a decorative image?

Even so, we can detect images that are marked as Artifact and flag them for review.
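As an illustration of the kind of detection described above, here is a minimal sketch in Python. It is not siteSifter's implementation. It assumes the page content stream has already been decompressed, and it only looks for marked-content sequences that open with the Artifact tag (either "/Artifact BMC" or "/Artifact << ... >> BDC"). The function names find_artifact_offsets and review_notes are hypothetical, chosen for this example only.

```python
import re
from typing import List

# A marked-content sequence tagged as an artifact opens with either
#   /Artifact BMC              (no property list)
#   /Artifact << ... >> BDC    (inline property list)
# Assumption: the content stream is already decompressed plain bytes.
_ARTIFACT_START = re.compile(rb"/Artifact\s*(?:<<.*?>>\s*BDC|BMC)\b", re.DOTALL)


def find_artifact_offsets(content_stream: bytes) -> List[int]:
    """Return the byte offset of every Artifact marked-content opening."""
    return [m.start() for m in _ARTIFACT_START.finditer(content_stream)]


def review_notes(content_stream: bytes) -> List[str]:
    """Produce one review note per artifact; a human decides if it is decorative."""
    return [
        f"REVIEW: artifact-marked content at byte {offset}; confirm it is decorative"
        for offset in find_artifact_offsets(content_stream)
    ]
```

The sketch only reports where artifacts occur. Whether the marked content really is decorative remains a question for a human reviewer, which is why the result is a review note rather than a pass or a failure.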

Other issues can be flagged as outright errors, such as PDF16 “Setting the default language … ”. If a Lang tag is nowhere to be found, the language has not been set and we can report a failure to meet that particular test.
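The default-language test lends itself to a short sketch. The Python below is an illustration under stated assumptions, not the baseline's actual code. It scans raw, uncompressed PDF bytes for a /Lang entry whose value is a literal string, such as "/Lang (en-GB)". It will miss entries stored in compressed object streams or written as hexadecimal strings, and it does not distinguish a document-level entry from one set on an individual structure element. The names find_default_language and check_default_language are hypothetical.

```python
import re
from typing import Optional

# Matches a /Lang entry with a literal string value, e.g. /Lang (en-GB).
# Assumption: the relevant dictionary is present uncompressed in the bytes.
_LANG_ENTRY = re.compile(rb"/Lang\s*\(([^)]*)\)")


def find_default_language(pdf_bytes: bytes) -> Optional[str]:
    """Return the first /Lang value found in the raw bytes, or None."""
    match = _LANG_ENTRY.search(pdf_bytes)
    if match is None:
        return None
    value = match.group(1).decode("latin-1").strip()
    # An empty /Lang () is treated the same as a missing entry.
    return value or None


def check_default_language(pdf_bytes: bytes) -> str:
    """Report a failure when no language is set, a pass otherwise."""
    language = find_default_language(pdf_bytes)
    if language is None:
        return "FAIL: no default language set"
    return f"PASS: default language is {language}"
```

Unlike the Artifact case, the outcome here is binary: either a language is declared or it is not, so the check can report an error without human review.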

PDF Analysis reports are normally created in the Generic HTML format. A complete description of the various tests involved in the baseline is available in the user manual.

References

Please note that these references are not necessarily endorsed by Greytower. Links are valid as of the 23rd of May 2012.

Adobe PDF History
http://www.adobe.com/products/acrobat/adobepdf.html
Adobe, undated

PDF Accessibility
http://webaim.org/techniques/acrobat/
WebAIM, undated

What are “PDF tags” and why should I care?
http://acrobatusers.com/tutorials/what-are-pdf-tags-and-why-should-i-care
Duff Johnson, 1st of February 2006

Acrobat Accessibility Full Check — Common Repair Hints
http://acrobatusers.com/tutorials/acrobat-accessibility-full-check-common-repair-hints
Kurt Foss, 1st of June 2009

WCAG 2.0 and PDF/UA
http://blogs.adobe.com/accessibility/2012/05/wcag-2-0-and-pdfua.html
Adobe, 7th of May 2012

PDF Techniques for WCAG 2.0
http://www.w3.org/WAI/GL/WCAG20-TECHS/pdf.html
W3C, 10th of March 2011