Musings on a Multimodal Analysis of Scissors-and-Paste Journalism (Part 1)

The concept of scissors-and-paste journalism is not new. Indeed, the practice of obtaining, selecting and faithfully reproducing news content (without attributing its original author) predates what we would now call the newspaper, reaching back to the years of the handwritten newsletter. That historians have not, until very recently, explored the specific nature and nuance of these reprinting practices is simple pragmatism. Whether they are attempting to uncover the dissemination pathway of a single article, or to understand the exchange practices of a particular newspaper title, the task is a daunting one.

In order to achieve one-hundred-percent confidence in any given dissemination map, a historian would need to have read every newspaper ever printed, along with the personal papers of every newspaper editor, compiler and printer that has ever lived, and, for good measure, they would need to develop robust methods for examining the conversations that had taken place in every coffee house, tavern and postal exchange across the breadth of the world and throughout the entirety of time.

This, of course, is probably too much to ask of any historian, even a very diligent one.

Although we may never achieve one-hundred-percent confidence, recent developments have made achieving a reasonable degree of certainty more likely. The ongoing digitisation of historical newspapers has made it possible to obtain regular access to a larger percentage of possible reprints. Of even greater assistance are the efforts of certain digitisation projects to provide users with direct access to the machine-readable transcriptions of these digitised images. These texts, obtained through optical character recognition (OCR) software (such as that employed by Chronicling America) or manual transcription and OCR correction (such as that championed by Trove), lie hidden behind most searchable databases, but only a select few providers have thus far made them accessible, indeed mine-able, by the general public. Yet, even when they remain hidden, and even if their quality remains highly variable, their existence has revolutionised research into dissemination pathways and provided the intrepid reprint hunter with two novel modes of inquiry.

The first is for the historian to select a set of articles for which there is good reason to believe a reprint exists, or for which he or she has already identified a number of reprints in the past. Having identified an appropriate text, the historian can then search for a selection of keyword phrases, or n-grams, in the relevant newspaper databases in order to obtain a reasonable number of hits.

This method of reprint analysis is hampered by many limitations. At best, the historian has a limited idea of where reprints, or indeed the original version, of any given text may appear. The proven commercial viability of newspaper digitisation, for genealogical and historical research, as well as the efforts of public or part-public projects, has led to an ever-growing number of online repositories. For any dissemination map to be considered robust, each of these must be searched with a consistent list of keyword strings, representing different portions of the article.
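As a rough illustration of what a 'consistent list of keyword strings' might look like in practice, the sketch below draws a few short phrases from evenly spaced points in an article, so that the same list can be entered into each database in turn. The function name `search_phrases`, the five-word phrase length and the choice of three phrases are assumptions made purely for illustration; they do not reflect the search syntax of any particular repository.

```python
import re


def search_phrases(text, n=5, count=3):
    """Return up to `count` n-word phrases spaced evenly through `text`.

    Hypothetical helper: phrase length and spacing are illustrative choices.
    """
    # Lower-case the text and keep only alphabetic words (and apostrophes).
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < n:
        # Text is shorter than one phrase: return it whole, if anything is left.
        return [" ".join(words)] if words else []
    last_start = len(words) - n
    if count == 1 or last_start == 0:
        starts = [0]
    else:
        # Evenly spaced start positions from the opening to the closing words.
        starts = sorted({round(i * last_start / (count - 1)) for i in range(count)})
    return [" ".join(words[s:s + n]) for s in starts]


article = "The quick brown fox jumps over the lazy dog near the old mill"
for phrase in search_phrases(article):
    print(phrase)
```

Taking phrases from the beginning, middle and end of a text is one way to guard against reprints that were abridged or given a new opening line, although it cannot compensate for a poor transcription in the database itself.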

More importantly, mechanical limitations, such as variances in search interfaces or the quality of machine-readable transcriptions, often obscure the true reach of a given text. Even if a legitimate version of an article does exist within the database, these variances mean that there is no guarantee the researcher will ever find it without manually examining each individual page.

Finally, even supposing the historian is able to identify all versions of a given text within all current newspaper databases, this still represents only a tiny percentage of all possible prints. As with any preservation project, the costs associated with digitisation have led to the subjective selection of popular, representative or historically important titles from an already reduced catalogue of surviving hard-copy newspapers. Likewise, even if a newspaper has been selected for preservation, multiple editions and non-surviving issues mean that true certainty will always remain elusive, even with manual examination.

Another method for determining reprints is to retrieve machine-readable transcriptions en masse and analyse them for duplicated phrases or word groupings, a method currently employed by the Viral Texts project. This methodology has significant advantages over manual search-and-inspection research. First, the historian no longer needs to make an initial identification of an article for which there are likely reprints; instead, all articles can be compared with all other articles, highlighting new and perhaps wholly unexpected ‘viral texts’. Second, by using a computer processor, rather than the eyes and mind of a single historian, the time spent in research is vastly reduced, perhaps transforming a lifetime of work into a few dozen hours.
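To make the idea of comparing all articles with all other articles concrete, here is a minimal sketch that breaks each text into overlapping five-word sequences and scores every pair by the proportion of sequences they share (a Jaccard score). This is a deliberately naive stand-in and should not be read as the Viral Texts project's actual method; the names `shingles` and `candidate_reprints`, the five-word window and the 0.2 threshold are all assumptions chosen for illustration.

```python
import re
from itertools import combinations


def shingles(text, n=5):
    """Return the set of overlapping n-word sequences found in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def candidate_reprints(articles, n=5, threshold=0.2):
    """Score every pair of articles by shared n-word sequences.

    `articles` maps an identifier to its transcribed text. Returns a list of
    (id_a, id_b, score) tuples at or above `threshold`, highest score first.
    """
    sets = {key: shingles(text, n) for key, text in articles.items()}
    pairs = []
    for a, b in combinations(sorted(sets), 2):
        union = sets[a] | sets[b]
        if not union:
            continue  # both texts were too short to yield any sequences
        score = len(sets[a] & sets[b]) / len(union)
        if score >= threshold:
            pairs.append((a, b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


sample = {
    "a": "A terrible fire broke out last night in the mill at Leeds",
    "b": "We learn that a terrible fire broke out last night in the mill",
    "c": "The price of corn rose sharply at market this week in York",
}
for id_a, id_b, score in candidate_reprints(sample):
    print(id_a, id_b, round(score, 3))
```

Comparing every pair in this way grows quadratically with the size of the corpus, so a sketch like this is only practical for small collections; work at scale would need some form of indexing to narrow the candidates first.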

There are, of course, also disadvantages. Although seemingly more efficient than using a database’s proprietary search interface, this method requires full access to the raw OCR data, something provided by only a minority of databases. It also requires a highly specialised procedure for cleaning that data to a level at which no reprints will be excluded, a procedure that must, moreover, be refined to accommodate a range of dialects, typefaces and discourses. Finally, and worryingly, complete reliance on computer matching means that significant OCR errors, those that cannot be overcome through pre-designed replacement protocols, will forever obscure some reprints.
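A pre-designed replacement protocol of the kind described above might, in its simplest form, look something like the following sketch. The replacement table is a small assumed example (the long s and two common ligatures) rather than a tested rule set, and `clean_ocr` is a hypothetical name; a real procedure would need to be tuned to the typefaces and period of the newspapers concerned.

```python
import re

# Assumed, illustrative substitutions: archaic letterforms and ligatures.
REPLACEMENTS = {
    "ſ": "s",   # long s
    "ﬁ": "fi",  # fi ligature
    "ﬂ": "fl",  # fl ligature
}


def clean_ocr(text):
    """Normalise a raw transcription so that like can be compared with like."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # Rejoin words split by a hyphen at a line break. Note that this also
    # closes up genuine hyphenated compounds that happen to break there.
    text = re.sub(r"-\s*\n\s*", "", text)
    # Lower-case, then replace anything other than letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


print(clean_ocr("The Congreſs met yeſter-\nday."))
# An error outside the table survives cleaning and still blocks an exact match.
print(clean_ocr("tbe mill") == clean_ocr("the mill"))
```

The second call shows the limitation in miniature: a misreading such as 'tbe' for 'the' is not covered by the table, so the two strings remain different after cleaning.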

The irretrievability of a certain percentage of reprints is not, of course, a primary concern of the Viral Texts project, whose aim is to examine which ‘qualities—both textual and thematic—helped particular news stories, short fiction, and poetry “go viral” in nineteenth-century newspapers and magazines.’ Those of us who are primarily concerned with the path these texts took, and the practical mechanisms associated with their transmission, are seemingly left with the unsatisfying conclusion that no true map of the dissemination networks can ever be devised.

Yet, all hope is not lost.

In my next two posts, I will set forth what I believe can be the foundation of robust, high-confidence dissemination pathway mapping: a multimodal research methodology—combining the advantages of manual and digital analysis—and the development of specific digital tools for determining directionality in historical newspaper reprints.