Preparing searchable PDFs for the Fauna of New Zealand monograph series

Trevor K. Crosby
Series editor, Fauna of New Zealand
Landcare Research, Private Bag 92170, Auckland 1142, New Zealand
CrosbyT@LandcareResearch.co.nz

15 May 2009; updated 20 July 2010

Background

At the January 2008 meeting of the Fauna of New Zealand Editorial Board, it was decided that PDFs of all numbers would be made available on the Fauna of New Zealand website at no charge to the user. These were to be facsimile PDFs, but of a lower quality than the printed version. It was expected that the information contained within the monographs would become more widely available and consequently more used and cited, especially by students and casual users. The decision was also an endorsement by the Editorial Board of the principle of Open Access to scientific information on New Zealand invertebrates. On 16 July 2010 the final group of PDFs was posted, covering 4 numbers published in 1994 and 1995.

Preparing the PDFs

Numbers 1–29.

Dr Stephen Pawson provided image files in PNG (Portable Network Graphics) format prepared as part of the BUGZ project funded through TFBIS (see project details in Didham, R. K.; Pawson, S. M. 2007. The launch of the ‘BUGZonline’ database — the bibliography of New Zealand terrestrial invertebrates 1775–1993 online. The Weta 34: 1–10).

Numbers 30–39, and 41.

These were provided by Dr Stephen Pawson as PDF files. Additional preliminary and end pages were scanned and saved either as TIFF images, which the series editor then converted to PNG format with Adobe Photoshop CS, or as PDF files.

The series editor imported these images into OmniPage Professional 16.1 and used OCR (Optical Character Recognition) to recognise the text. The recognised text was checked and corrected so searches could successfully retrieve instances of the search terms in the PDF: this included taxon names embedded in illustrations. Mis-spelt words in the original publication were not changed.

The main problems in text recognition included:

the symbols for Male and Female were not recognised consistently, so they were changed to [m] and [f] respectively;
when words in the original text were close together they resulted in words being joined, for example, “along” rather than “a long”;
when an adjoining “r” and an “n” had no space between the characters in a word, this often was recognised as an “m”;
the multiplication (times) symbol “×” was frequently recognised as an “x” or the Greek “chi” symbol;
when characters had an extraneous speck printed about them they may have been converted to accented characters;
subscripted or superscripted characters were sometimes not recognised correctly; and
the numeral “1” and lower case letter “l” were sometimes misrecognised, especially in words set in sans serif type, and shown as lower case “l”, numeral “1”, or capital letter “I”.
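
Some of the problems listed above leave patterns that can be searched for before manual checking. The following Python sketch is illustrative only and is not the procedure used by the series editor: the pattern names and the function flag_suspects are hypothetical, and the patterns cover only the digit and letter confusion, the “x” or “chi” substitution for “×”, and stray accented characters. Joined words and “rn” read as “m” would need a word list and are not attempted here.

```python
import re

# Hypothetical heuristics for three of the listed OCR problems.
# They only flag candidates; a person still has to check each one.
SUSPECT_PATTERNS = {
    # A token that mixes letters and digits, e.g. "l0" for "10".
    "digit/letter mix (1, l, I)": re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b"),
    # A letter x or Greek chi between two numbers, e.g. "3.5 x 1.2".
    "x or chi between numbers": re.compile(r"\d\s*[xXχ]\s*\d"),
    # Any accented Latin-1 letter (this also flags genuine accents in names).
    "unexpected accented character": re.compile(r"[À-ÖØ-öø-ÿ]"),
}


def flag_suspects(text):
    """Return (label, matched text) pairs that a proofreader should check."""
    hits = []
    for label, pattern in SUSPECT_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "Body 3.5 x 1.2 mm; tibia l0 segmented; antenné long"
    for label, found in flag_suspects(sample):
        print(f"{label}: {found!r}")
```

Legitimate text will also be flagged, for example accented letters in author names or codes that mix letters and digits, so the output is a checklist rather than a list of errors.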

Hyphenated words sometimes remained hyphenated with an added space after the hyphen. In that case a search will not retrieve the full word, and extracted text will show the word with a hyphen followed by a space.

The PDFs were printed at resolutions of 150 dpi (screen quality) and 300 dpi (printer, medium quality). Adobe Acrobat 8.0 Standard was then used to add Properties, to reduce the file size (for reading in Acrobat Reader versions 6.0 onwards), and to convert them to Fast Web View.
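
For text extracted from these PDFs, the hyphen followed by a space can be closed up before searching or indexing. The Python sketch below is an assumption about how a user might do this, not part of the preparation workflow; the function name search_variants is hypothetical. Because it cannot be known whether the original word was hyphenated or merely broken at a line end, both forms are returned.

```python
import re

# A word fragment, a hyphen, one or more spaces, then a lower-case continuation.
HYPHEN_SPACE = re.compile(r"(\w+)-\s+([a-z]\w*)")


def search_variants(text):
    """Return the text with each 'hyphen- ated' break closed up in two ways."""
    joined = HYPHEN_SPACE.sub(r"\1\2", text)       # "meta- thorax" -> "metathorax"
    hyphenated = HYPHEN_SPACE.sub(r"\1-\2", text)  # "meta- thorax" -> "meta-thorax"
    return joined, hyphenated


if __name__ == "__main__":
    for variant in search_variants("the meta- thorax is broad"):
        print(variant)
```

Deliberate suspended hyphens, as in “pre- and post-”, are also closed up by this pattern, so the result is better suited to building a search index than to replacing the extracted text.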

Numbers 40, and 42 onwards.

These were printed to PDF from Adobe PageMaker 7.0 files using Acrobat Distiller 3011.104, at resolutions of 144 dpi (screen quality) and 300 dpi (printer, medium quality) with images at the optimised subsampling setting. The final page providing information about use of the PDF was appended as a PDF page using Adobe Acrobat 8.0 Standard when the information in the Properties of the PDF was added.

The underlying text is accurate and all occurrences of particular words can be found in a search. However, taxon names embedded in illustrations are not available for searching: only those names in captions are searchable. Note that special symbols, such as the symbols for Male and Female, cannot be searched for in Adobe Acrobat Reader: the underlying text for these shows as “%” and “&” respectively, and searches need to be made using these characters. These characters are also substituted in any text that is extracted from the PDF.
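
The substitution described above can be reversed in extracted text, or applied to a search term, with a simple character mapping. The following Python sketch is illustrative; the function names restore_symbols and search_term are hypothetical, and the mapping assumes that every “%” and “&” in the text stands for a Male or Female symbol.

```python
# Placeholder characters reported for Numbers 40, and 42 onwards.
# Assumption: "%" and "&" never occur with their ordinary meaning.
TO_SYMBOL = str.maketrans({"%": "\u2642", "&": "\u2640"})       # male, female
TO_PLACEHOLDER = str.maketrans({"\u2642": "%", "\u2640": "&"})


def restore_symbols(extracted):
    """Replace the placeholders in extracted text with the sex symbols."""
    return extracted.translate(TO_SYMBOL)


def search_term(query):
    """Convert a query containing sex symbols to what the PDF text holds."""
    return query.translate(TO_PLACEHOLDER)


if __name__ == "__main__":
    print(restore_symbols("Holotype %, 3 &"))
    print(search_term("\u2642"))
```

A genuine percent sign or ampersand, for example in a publisher name, would also be converted by restore_symbols, so the result should be checked where these characters are expected.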

In addition, for all PDFs the formatting of the original, such as the use of italic and bold, may not be retained in extracted text. As well, each extracted line is displayed in a word-processing program as a separate paragraph.
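
Where each extracted line arrives as a separate paragraph, the lines can be rejoined before further use. The Python sketch below is a minimal example under a stated assumption: that a blank line separates the original paragraphs in the extracted text. If no blank lines are present, the whole extract is returned as a single paragraph. The function name rejoin_lines is hypothetical.

```python
def rejoin_lines(extracted):
    """Merge consecutive non-blank lines into one paragraph each.

    Assumption: a blank line marks the end of an original paragraph.
    """
    paragraphs, current = [], []
    for line in extracted.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    print(rejoin_lines("The underlying text\nis accurate.\n\nSecond\nparagraph."))
```

Italic and bold formatting lost during extraction is not recovered by this step.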

Personalising a downloaded PDF

Users are able to annotate downloaded PDFs using appropriate software, either a commercial program such as Adobe Acrobat 8.0 Standard, or a free program such as PDF-XChange Viewer from Softonic. One author, the late Ian McLellan (1924–2008), provided an annotated PDF of his Fauna of New Zealand 27 volume incorporating errata, and made this available to users as his personalised PDF.