The George Washington Papers: Building the Digital Collection

The George Washington Papers at the Library of Congress is the first manuscript collection
to be digitized in its entirety from the Library's vast collection of microfilm produced
by the Library of Congress Photoduplication Service. The George Washington Papers was microfilmed
in 1964 as part of a larger project, the Presidential Papers Project, which was instituted
by Congress in 1957. The goal of this program was to microfilm and disseminate the papers
of presidents held by the Library of Congress. The 124-reel Washington collection, captured
on 35 millimeter roll microfilm, was produced as a part of this program.

Microfilm collections of historical documents present a number of issues for digitization
resulting from the quality of the microfilm being scanned. In addition, there are the issues
of original document condition, a wide range of tonal values, document sizes, and document
orientation on the microfilm. For optimal capture of detail, the Washington Papers microfilm
was raster scanned from a duplicate negative microfilm, which was generated for this purpose.
Scanning from a negative can reduce the appearance in the digital images of flaws, such as dust,
that may be present on the film being scanned. The negative was printed directly from the archival
microfilm and produced for scanning by both the scanning contractor, Preservation Resources,
and the Library of Congress Photoduplication Service. Great care was taken in the duplication
process in order to compensate for the high density range of the master microfilm.

The scanning was performed offsite by Preservation Resources in Bethlehem, Pennsylvania,
under contract to the National Digital Library Program.

The digital images were produced in JPEG File Interchange Format (JFIF), a compressed
grayscale format often used in digitizing historical manuscript documents because of its
ability to capture and display a wide range of tonal variations from those in the document
paper itself to diverse qualities of pencil and ink. This 8-bit grayscale capture was also
found to suppress the bleedthrough typical of handwritten documents in the collection.
Grayscale GIF images were then created for preview access online. The great majority of
GIFs were created from grayscale TIFF images by Preservation Resources, digitizers of the
George Washington Papers. National Digital Library Program staff created GIFs from delivered
JPEGs for Series 2 and parts of Series 1 and 4. Four-bit grayscale GIF images provide a
quick-loading preview with maximum legibility, since the JPEG archival image requires
considerable time to download. All of
the original capture "master" TIFF images, to which LZW lossless compression was applied,
were transferred to the NDL via magneto-optical disks and now reside in the NDL digital
file repository for American Memory.
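
As a rough illustration of what reducing an 8-bit grayscale capture to a four-bit preview involves, the sketch below maps 256 gray levels onto 16. The function names and the simple bit-shift quantization are assumptions made for illustration. The tools and palette actually used to derive the GIFs are not described here.

```python
def to_four_bit(pixels):
    """Map 8-bit gray values (0-255) onto 16 levels (0-15)."""
    # Dropping the low four bits keeps only the 16 coarsest gray steps.
    return [[value >> 4 for value in row] for row in pixels]


def to_display(levels):
    """Spread the 16 levels back over 0-255 for display (15 * 17 == 255)."""
    return [[level * 17 for level in row] for row in levels]


# A tiny made-up "page" of 8-bit gray values, for illustration only.
page = [[0, 15, 16, 128], [200, 240, 255, 255]]
levels = to_four_bit(page)
print(levels)              # [[0, 0, 1, 8], [12, 15, 15, 15]]
print(to_display(levels))  # [[0, 0, 17, 136], [204, 255, 255, 255]]
```

Values that differ only slightly, such as 0 and 15, collapse into one level, which is the tonal detail a preview gives up in exchange for a smaller file.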

The total number of digital images that compose the George Washington Papers is approximately
456,000: that is, 152,000 each of JPEG, GIF, and TIFF files. The complete collection of
digital files occupies approximately 300 GB of server space.
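
The figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal gigabytes, and the resulting average is a blended figure across all three formats rather than the typical size of any one of them.

```python
FILES_PER_FORMAT = 152_000
FORMATS = ("JPEG", "GIF", "TIFF")
TOTAL_BYTES = 300 * 10**9  # assumption: 1 GB = 10**9 bytes

total_files = FILES_PER_FORMAT * len(FORMATS)
average_bytes = TOTAL_BYTES / total_files

print(total_files)                  # 456000
print(round(average_bytes / 1000))  # 658 (average kB per file, all formats)
```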

In the George Washington Papers, the majority of booklike materials, such as letterbooks,
account books, and the like, were originally filmed in open-book format with two pages
to a frame. In digitization the frame was split into single-page images to improve visual
access. In a few exceptions, such as account books in which splitting would obscure the meaning
of the content, the frame was not split. Splitting of two-page formats of booklike materials,
which are uniform in presentation, does not compromise the viewer's sense of the original
artifact. This is not the case in individual manuscript letters or memoranda. Individual
manuscript leaves, originally folded to make two to four pages or writing surfaces, have
not been split.
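
The frame-splitting step can be pictured with a minimal sketch. It treats a frame as a list of pixel rows and assumes, purely for illustration, that the book gutter falls at the horizontal midpoint unless a column is given. The function name and the sample data are hypothetical and do not describe the contractor's actual workflow.

```python
def split_frame(frame, gutter=None):
    """Split a two-page frame (a list of pixel rows) into left and right pages."""
    width = len(frame[0])
    if gutter is None:
        gutter = width // 2  # assumption: the gutter sits mid-frame
    left = [row[:gutter] for row in frame]
    right = [row[gutter:] for row in frame]
    return left, right


# A made-up 2 x 6 frame: 1s stand for the left page, 2s for the right page.
frame = [
    [1, 1, 1, 2, 2, 2],
    [1, 1, 1, 2, 2, 2],
]
left, right = split_frame(frame)
print(left)   # [[1, 1, 1], [1, 1, 1]]
print(right)  # [[2, 2, 2], [2, 2, 2]]
```

A frame that should stay whole, such as an account-book spread whose entries run across both pages, would simply skip this step.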

Custom cropping was applied to the varying formats in the Washington Papers, which range
from journals, commonplace books, and account books to individual manuscripts mounted in
bound volumes by conservators. Occasionally, a cropping margin does not exist on film,
and the rule of cropping the digital image to a 1-inch margin around the document is unattainable.
All available document and text captured on the microfilm appears in this digital collection.

Book or manuscript pages containing text not oriented for reading in the microfilm were
re-oriented for reading as digital images. Pages containing texts oriented in a variety
of directions were left in their original orientation.

Preservation Resources staff used Photoshop's "unsharp mask filter" tool to enhance
ink-to-background contrast in the images of the manuscript volume pages in Series 1b.
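
The general idea behind unsharp masking is to blur a copy of the image, subtract it from the original, and add the difference back to exaggerate edges. The sketch below applies that idea to a single scanline. It is a simplification under stated assumptions: it uses a box blur and hypothetical `amount` and `radius` parameters, and it does not reproduce Photoshop's filter or the settings used for Series 1b.

```python
def box_blur(row, radius=1):
    """Average each value with its neighbours within `radius` positions."""
    blurred = []
    for i in range(len(row)):
        lo = max(0, i - radius)
        hi = min(len(row), i + radius + 1)
        window = row[lo:hi]
        blurred.append(sum(window) / len(window))
    return blurred


def unsharp_mask(row, amount=1.0, radius=1):
    """Return original + amount * (original - blurred), clamped to 0-255."""
    blurred = box_blur(row, radius)
    sharpened = []
    for original, soft in zip(row, blurred):
        value = original + amount * (original - soft)
        sharpened.append(max(0, min(255, round(value))))
    return sharpened


# Made-up scanline: light paper (200) meeting darker ink (80).
scanline = [200, 200, 200, 80, 80, 80]
print(unsharp_mask(scanline))  # [200, 200, 240, 40, 80, 80]
```

At the boundary the paper side becomes lighter and the ink side darker, which is the increase in ink-to-background contrast that the filter is used for. Flat areas are left unchanged.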

Preservation Resources also produced digital images from the National Archives and Records
Administration's microfilm of letterbooks 28, 29, and 30 in Series 2 of the George Washington
Papers, employing the same scanning, cropping, splitting, and orientation specifications
described above. Negative format photostatic copies of these letterbooks had originally
been microfilmed with the George Washington Papers. Digital images of these photostats were
replaced with images from the National Archives' microfilm of these letterbooks.

Most of the items from Series 9, the Addenda to the George Washington Papers, were scanned on an i2S Digibook scanner in the Information Technology Services Digital Scan Center at the Library of Congress. Oversize materials were scanned by an overhead Phase One camera. The original items were digitized as 300-dpi grayscale images, which were compressed using JPEG compression, producing images in the JPEG File Interchange Format (JFIF). GIF images were also created.

The digital images reflect the original physical condition of the Addenda items. Some of the manuscripts are discolored or have faded ink. Others may have tears, holes, and fold marks. Several documents received conservation treatment before digitization. The Digital Scan Center staff took great care in the handling of the manuscripts.

This collection reproduces page images and searchable texts from Donald Jackson and Dorothy Twohig, eds., The Diaries of George Washington, 6 vols. (Charlottesville:
University Press of Virginia, 1976-79), a series of The Papers of George Washington.
Copyright is held by the Rector and Visitors of the University of Virginia and use is by
permission of the publisher. The publisher is not responsible for the correctness and completeness
of the images and texts as they appear in this online collection.

The printed volumes of the Diaries, The Writings of Washington, and Letters to Washington were digitized by Systems Integration Group
(SIG) of Lanham, Maryland. Each volume was reproduced as facsimile page images. The image
capture took place at the Library of Congress. The master or archival version of the textual pages (containing typography and line art) is a 300-dots-per-inch (dpi) bitonal image in
the TIFF format, with ITU Group IV compression. For the Diaries, pages with printed halftone illustrations,
finely detailed line drawings, and color frontispieces were captured as 8-bit grayscale or 24-bit color images, as appropriate, and stored in the JFIF image format, with JPEG
compression. The browser-display images for all volume pages are in the GIF format. GIFs
were created by National Digital Library Program staff from the master TIFFs and JPEGs. Searchable text for The Diaries of George Washington was created as described below.

Text transcriptions from The Writings of Washington from the Original Manuscript
Sources, 1745-1799 (39 vols.; Washington, D.C.: Government Printing Office, 1931-44),
Letters to Washington and Accompanying Papers (5 vols.; Boston; New York: Houghton
Mifflin and Company; Cambridge: Riverside Press, 1898), and The Diaries of George Washington
(6 vols.; Charlottesville: University Press of Virginia, 1976-79) were converted at an
accuracy rate of 99.95 percent and encoded with Standard Generalized Markup Language (SGML)
according to the American Memory DTD. All text was translated with an OmniMark program to
HTML 3.2 for indexing and viewing with
Web browsers.
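
To give a sense of scale for the stated accuracy rate, the sketch below converts a percentage into an expected number of errors. It assumes the rate is measured per character, which the text does not specify, and the function name and page length are hypothetical.

```python
from decimal import Decimal


def expected_errors(characters, accuracy_percent="99.95"):
    """Expected number of wrong characters at a given accuracy rate."""
    # Decimal avoids binary floating-point noise in 100 - 99.95.
    error_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return error_rate * characters


# Hypothetical page length of 2,000 characters.
print(expected_errors(2_000))
print(expected_errors(10_000))
```

Under that assumption, the rate corresponds to about one error in every 2,000 characters, or five in every 10,000.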

Linking from text transcriptions in The Writings of Washington and Letters
to Washington to individual manuscript documents in the Washington Papers was accomplished
by the insertion of a unique identifier from the encoded text into the bibliographic database
record for the document images.

Access to the George Washington Papers is through a database created from the printed
Index to the microfilm edition of the George Washington Papers and through searchable
text transcriptions noted above. Every record in the database contains the name of the
author of the document, the associated date, and a link to the set of document images.
In addition, three other fields capture appropriate information: the correspondence recipient's
name, brief explanatory notes and a link to a transcription where available.
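
The record layout described above can be sketched as a small data structure. The class, field names, helper function, and sample values below are all hypothetical. They illustrate the three required and three optional fields, with the transcription link modeled as the unique identifier taken from the encoded text, and do not reflect the actual database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentRecord:
    # Present in every record.
    author: str
    date: str
    images_link: str
    # Captured only where appropriate.
    recipient: Optional[str] = None
    notes: Optional[str] = None
    transcription_id: Optional[str] = None  # identifier from the encoded text


def find_by_author(records, name):
    """Return records whose author field contains `name`, ignoring case."""
    needle = name.lower()
    return [r for r in records if needle in r.author.lower()]


# Placeholder records, not real entries from the collection.
records = [
    DocumentRecord(
        author="Author A",
        date="undated",
        images_link="images/0001",
        recipient="Recipient B",
        transcription_id="tx-0001",
    ),
    DocumentRecord(author="Author C", date="undated", images_link="images/0002"),
]
matches = find_by_author(records, "author a")
print([r.images_link for r in matches])  # ['images/0001']
```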

The George Washington Papers has been presented online in seven releases, the first six from 1998 through 2000. The First Release, of Series 2, was in February 1998 (about 28,000 images); the Second Release, of Series 3 and 5, was in August 1998 (together, about 50,900 images); and the Third Release, of the first installment of Series 4 and all of Series 6, 7, and 8, was in February 1999 (together, about 46,000 images). The Fourth Release, an update of Series 4, was in June 1999 (about 86,600 images), and the Fifth Release, of Series 1, was in November 1999 (about 8,400 images), which brought the total number of images online to approximately 219,900. The Sixth Release in September 2000 consisted of an update of Series 4 General Correspondence (about 84,200 images) and the release of The Diaries of George Washington (6 vols.; Charlottesville: University Press of Virginia, 1976-79). The six-volume Diaries, which is not part of the Library's Washington Papers proper, consists of 5,818 page images, with searchable text for each volume. The Seventh Release completed the online presentation of the George Washington Papers with the addition of a selection from the Addenda to the George Washington Papers Series, forty-five items totaling 224 images. The George Washington Papers online consists of approximately 65,000 items, which comprise 304,000 digital images, including both file formats, GIF and JPEG, and approximately 13,000 text transcriptions.
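
The running totals quoted for the releases can be reproduced by simple addition. The sketch below uses only the approximate manuscript image counts given above. The dictionary labels are illustrative, and the Diaries page images and Addenda images are left out.

```python
# Approximate manuscript image counts per release, as stated in the text.
releases = {
    "First (February 1998)": 28_000,
    "Second (August 1998)": 50_900,
    "Third (February 1999)": 46_000,
    "Fourth (June 1999)": 86_600,
    "Fifth (November 1999)": 8_400,
    "Sixth (September 2000)": 84_200,
}

counts = list(releases.values())
through_fifth = sum(counts[:5])
through_sixth = sum(counts)

print(through_fifth)  # 219900
print(through_sixth)  # 304100
```

The first figure matches the stated total of approximately 219,900 after the Fifth Release, and the second is in line with the approximately 304,000 digital images reported for the collection.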