Exploring Historical Sources: abstracts of presentations

To what extent can corpora illuminate the past? Can the close readings traditionally associated with the study of history gain from the tools and methods of corpus analysis? In this talk I will discuss work I have undertaken with historian Dr. Helen Baker on an issue in social history - prostitution in seventeenth century England. Social history in particular is an area where the corpus might contribute - while the documentary sources and analyses associated with major historical events and figures are typically many and well analysed, the documents associated with the everyday, the unexceptional, are more sparse. In the case of marginalised or criminalised groups the documentary evidence outside of court proceedings is widely scattered and typically indirect. Prostitutes (we deal only with female prostitutes in this talk) are a good example of such a marginalised group - indeed in such a case the marginalisation is enhanced by class and gender as well as criminality. I begin by considering what social historians have claimed about prostitution in this period. I then move to look at what the corpus shows us, using the latest version of the EEBO corpus available at Lancaster University - 1.5 billion words of lemmatised, POS tagged and spelling regularised written texts from the 15th, 16th, 17th and 18th centuries. Using corpus techniques to explore the texts, I show what a corpus may show the historian about prostitution in the period - and what historians can offer to corpus linguists who are approaching texts from this period.

This paper makes innovative use of critical theory to explore the media representation of the suffrage movement. The British women's suffrage movement was a complex, diverse campaign that emerged in the mid-nineteenth century. The suffrage movement was not a unified one; it was composed of various groups with differing backgrounds, ideologies and aims. Historians working with suffragist-produced texts have noted different terminology used to describe different factions of the movement. Less attention has been paid to how the suffrage movement was perceived by those outside the movement, and particularly how it was represented in the press.

Focusing on The Times, I examine how suffrage campaigners' differing ideologies were conflated in the newspaper, particularly in connection with their support of or opposition to militant direct action. I use a range of methodological approaches drawn from corpus linguistics and discourse analysis including collocational analysis, examination of consistent significant collocates, critical discourse analysis and Deleuze and Guattari's (1987) argument that polyvocal, heterogeneous entities are simplified and erased by those in power. Historians' research indicates that two terms were used to describe suffrage campaigners, each with different profiles. The term 'suffragist' tended to be used to describe constitutionalists who campaigned by lobbying Parliament. The term 'suffragette' was originally a pejorative and was used to describe campaigners who, variously, saw the vote as an end unto itself, were members of a militant organisation and/or were prepared to engage in direct action (Holton 1986).

I identified and examined direct action terms in The Times, obtained through calculating Mutual Information with a score of 3 or higher; they are used to describe different activities, encounters with different groups of people and at different points in the escalating direct action campaign. I found that these terms were associated with suffragist rather than suffragette; this runs counter to how suffrage campaigners tended to self-identify and how they are described in the historical research. I explored six direct action terms in detail: disturbance*, outrage*, violence, crime*, incident? and disorder. Through a combination of collocational analysis and detailed understanding of the historical context, I established different semantic profiles for the terms and showed how, due to principles of newsworthiness (Galtung and Ruge 1965), direct action was heavily reported while constitutionalist tactics such as lobbying MPs received less press attention. The effect of this reporting is first to conflate suffrage campaigners into a homogeneous mass and secondly to position all suffrage campaigners as proponents or supporters of militant direct action.
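
To make the Mutual Information threshold concrete, here is a minimal sketch in Python of how collocates of a node word can be scored and filtered at MI >= 3. The abstract does not say which MI variant, window size or frequency cut-off was used, so the window-adjusted formula and the parameters `window` and `min_count` below are assumptions for illustration only.

```python
import math
from collections import Counter

def mi_collocates(tokens, node, window=4, threshold=3.0, min_count=2):
    """Return {collocate: MI} for words found within +/- `window` tokens of `node`.

    MI is computed as log2(observed / expected), where the expected count
    assumes the words are distributed independently (an assumed variant).
    """
    n = len(tokens)
    freq = Counter(tokens)
    cooccurrence = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start, end = max(0, i - window), min(n, i + window + 1)
        for j in range(start, end):
            if j != i:
                cooccurrence[tokens[j]] += 1
    scores = {}
    for word, observed in cooccurrence.items():
        if observed < min_count:
            continue  # very rare pairs give unstable MI scores
        expected = freq[node] * freq[word] * 2 * window / n
        mi = math.log2(observed / expected)
        if mi >= threshold:
            scores[word] = mi
    return scores
```

Called as `mi_collocates(tokens, "suffragist")` on a tokenised newspaper corpus, this would return only those collocates whose association with the node word reaches the threshold; high-frequency function words are filtered out because their expected co-occurrence is already high.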

Through this combination of established historical approaches, discourse analysis and corpus linguistic methodologies, this investigation refines our understanding of the suffrage movement in its socio-historical context and offers an insight into how language used by those in power can create and solidify identities imposed on fluid, polyvocal groups, particularly groups that lack access to their own media representation.

The digital humanities project Welt der Kinder (“Children and their world”) started in May 2014 and is designed to serve as a template for similar projects in the future. By fostering close cooperation between historians, information scientists and computer scientists it aims to gain new insights into the period from 1850 to 1918, a time in which the accelerated production of knowledge was dominated by both globalization and nationalisation simultaneously. The project is a cooperative venture between two institutions of the Leibniz Association (the Georg Eckert Institute for International Textbook Research in Braunschweig (GEI) and the German Institute for International Educational Research (DIPF) located in Frankfurt and Berlin), four universities (Darmstadt, Hildesheim, Göttingen, and Zurich), and the Bavarian State Library (BSB) in Munich.

The research will provide access to German-language mass sources that originated between 1850 and 1918. This material reflected contemporary world interpretation patterns and elements of cultural memory yet, equally, helped form the same. Due to their sheer size, however, such sources cannot be penetrated by hermeneutic methods alone. In interdisciplinary and exploratory research, tools are therefore being developed that facilitate the analysis of large (digital) corpora. Diachronic and synchronic analyses of the (trans)formation of knowledge inventories are employed, as are meta analyses, to consolidate the potential provided by insights gained through digital methods. The development process implements user-centred methods in order to support the research goals of historians. The tools developed help to discover semantic structures and patterns in a variety of nineteenth century educational media. This enables us, as historians, to take an entirely new approach to the analysis of digitised material, which had previously been limited to full-text searches.

As its basic research resource, the project uses a continuously expanding corpus of several thousand books, which have been scanned and digitised using OCR technology capable of reading Gothic typefaces. The collection comprises more than 600,000 pages so far. Welt der Kinder has four main goals:

1. Historical research about representations and interpretations of the world in the given period, during which knowledge about the world could generally not be gathered through one’s own experiences or travels, nor through audiovisual media. That is why textbooks and other printed media formed the main information source for young adults other than oral traditions, which cannot be used as scientific sources.
2. Exploration of a specific media type (textbooks and juvenile literature) that subsequently influenced millions of citizens, but which has never been investigated in an integrated approach considering media type, circulation, and transformation of collective knowledge.
3. Combining established hermeneutic methodology with innovative methods and technologies for the exploration and scientific annotation of large amounts of textual data.
4. Fundamental research in computational linguistics; developing and adapting methods of semantic analysis and opinion mining for the challenges presented by the language of the nineteenth century and by this specific media type.

We shall present the resources and methodologies applied in the project as well as its progress so far. We will discuss challenges, experiences, and problems encountered in the early stages of the project.

In my presentation I will describe A.L.C.I.D.E. (Analysis of Language and Content In a Digital Environment), a system developed by the Digital Humanities group at Fondazione Bruno Kessler (Trento, Italy). A.L.C.I.D.E. is a web-based environment that uses human language technologies to help historians to extract, explore and analyze information within historical documents in real time.
Human Language Technologies have evolved to a point where they can provide the Humanities with relevant technologies such as: named entity recognition (e.g. identification of names of persons and locations within texts); extraction of semantic relations between entities (e.g. motion relations between persons and locations); temporal processing (i.e. identification of temporal expressions and events and extraction of relations between them); geographical information processing; key-concept extraction; distributional semantics analysis (i.e. quantification and categorization of semantic similarities between linguistic elements); and sentiment analysis (i.e. determining the attitude of a writer with respect to some topic, or identifying the general polarity, whether positive, negative or neutral, of a text or of a statement). One of the goals of A.L.C.I.D.E. is to exploit these technologies by adapting them to the historical domain, and to make the results of linguistic and semantic analysis more interesting and usable by Humanities researchers through effective visualization techniques. As a case study, we focused on the electoral debate between Nixon and Kennedy in the U.S. elections of 1960 (about 850 documents and 1.5 million words). Our web platform allows researchers to browse and analyze the content of a document (or collection of documents), easily extract the needed information, and visualize it in a convenient interface, with the support of charts, dynamic graphs, maps, clouds of key-concepts and timelines.

This platform is the first version of a more complex online environment we plan to develop, in which any added document will be automatically analyzed not only at the word level, but also at the semantic level. With a view to extending the functionality of A.L.C.I.D.E. with semantic features, we are now exploring how to compare, starting from their documents, the opinions of different authors to automatically evaluate whether they agree or not on a topic (e.g. whether the political stances of Nixon and Kennedy are similar or not on some topics). Another kind of analysis we are developing is the evaluation of the level of innovation of the ideas introduced by a historical document, as well as its impact on ideas in the following years. The goal pursued by A.L.C.I.D.E. is to offer a powerful yet easy-to-use set of functions that can support historians in their research and help them to deal with the increasing amount of digital data available.

This talk will give an account of some of the challenges and successes associated with the Cendari Project’s approach to leveraging Natural Language Processing and other Software Infrastructure for research in Historical Archives. Cendari takes a particular focus on collaboration between Information Experts, Technological Experts and Domain Experts. These different constituencies of users have significantly different perceptions of the available features, the potential value and perhaps even the merits of adopting NLP in Humanities Scholarship.

Cendari is developing a set of flexible infrastructural services designed to support historical inquiry. This includes, from a natural language processing perspective, tasks such as information extraction for entity recognition, entity-driven search and annotation and sharing of research results.

The project has manifested as different tools which are applied to user environments such as a Virtual Note-taking tool, and Archival Research Guides as well as support services for identity, access and provenance.

This talk will discuss results in four areas:
• The results of user trials and user-centric querying of humanists
• The results of data and archival interactions and the creation of an archival guide
• The results of NLP work to assist humanists
• The creation of infrastructural architecture to leverage services and test cases in different historical domains, particularly in the overall EU context

Of key importance in Cendari have been:
1) The End Users’ needs and requirements.
* non-technical users are often not aware of what is possible
* technologists are often not aware of what is useful
2) The available data
* Archive descriptions
* Full Text
* Metadata at different levels
3) The limitations and requirements of specific technologies
* Natural Language and dialect
* Assumptions about time or format (e.g. calendar)
* Normalisation, consistency and variation over time

The project has reached approximately the half-way mark, and in addition to discussing these concrete outcomes, it would be our objective to present opportunities for collaborative investigation, mutual application of technology and other knowledge sharing opportunities across the Nedimah network, and with the CLARIN project in particular.

The talk will include some details of specific results and findings, as well as overall results of how the process of joining different research communities has evolved.

The historiography of European integration can be divided into three phases. In the seventies, historians from a broad spectrum of backgrounds started to analyse the origins of European cooperation. Subsequently, from the second half of the eighties, this fragmented field began not only to integrate but also to interact across disciplinary boundaries with international relations theory, judicial history and political philosophy. In the first half of the nineties, the international relations perspective gained dominance, framing Europe as a political entity theoretically situated somewhere on a scale between a federation of states and a federal state.

The third phase took off with a devastating critique of the dominance of the state as actor in the history and theory of international relations. In his Social Theory of International Politics, Alexander Wendt develops a theory of the international system as a social construction. Following in his footsteps, a whole series of so-called constructivist studies focussed on the role of non-state actors, civil society and public opinion in international relations. Within the context of European integration historiography, constructivism got a firm boost from the 2005 rejection of the Treaty establishing a Constitution for Europe. The treaty was signed in October 2004 and quickly ratified by some of the member states, but the ratification process stranded only a summer later on its rejection by French and Dutch voters. The high turnout in both referenda in particular sparked academic interest in the public image of and popular support for European integration.
The mass digitization of books, newspapers and other historical materials and the introduction of new digital techniques promise new possibilities here. In a recent article, Hans-Jörg Tretz has named the media as an important but mostly overlooked player in integration history. We presume that the most important networks of professionals (politicians, journalists, scientists and others) will be reflected in the media coverage of European integration. To test this hypothesis, we look at articles on the first steps in the integration process in the late forties and early fifties in the Netherlands. As one of the founding members of the EU, the Netherlands proves an interesting case because of its diverse media landscape, in which different socio-political communities had their own ideological media outlet that coexisted with more neutral, general newspapers.

In the field of European Union studies, computational techniques to map out public discourse are still at a very early stage. A recent book on the role of national self-images in the perception of European integration in England, Germany and the Netherlands by Sven Leif Ragnar de Roode uses a wide selection of more than a thousand editorials, but selection and analysis were done entirely by hand. Other studies that convincingly prove the investigatory value of newspapers also use traditional techniques to select and analyse the source material. By using the large repository of the Dutch Royal Library to extract social networks and their main topics of conversation, we will develop a method to further the history of European integration in a digital fashion.

Using text encoding processes, the aim of the project is to explore text-based methods and analysis to enrich digital data for exploration of the linguistic differences in the armaments policies of the Western European Union (WEU). The underlying research is driven by the need to explore the nature of the British and French positions on major security and defence questions in the WEU. The selected source material focuses on particular types of institutional documents associated with armament production and standardization, such as meeting minutes and diplomatic notes or reports of the Assembly. The majority of the documents are available in French and English. Thus the corpus, comprising 70-80 documents, is bilingual.
The objective is to take selected research material from the Archives Nationales de Luxembourg, WEU collection, and use a process of data capture, enrichment and analysis to gain new insight into the WEU corpus. Firstly, optical imaging software was used to capture the selected source material in digital form. Then the digital archive documents were transformed using Optical Character Recognition (OCR) methods (ABBYY FineReader 11). Following the initial creation of text, the documents undergo manual post-processing, ensuring accuracy and validity of content and structure. The workflow of the project is outlined in Figure 1.

The next step involved enrichment of the documents, which began with the creation of template styles for the corpus; the documents were subsequently converted into XML-TEI P5 using the OxGarage conversion service. With the initial XML-TEI file, the process of enrichment using Oxygen began. Enrichment included the addition of metadata based on the TEI Header specification and Dublin Core standards, and the encoding of attributes based on a list of predefined needs, which were provided by CVCE historians and centred on investigating the individual/institutional discourse.

Following the production of the enriched XML-TEI, the final stage of the process involved corpus analysis using Textométrie TXM software to discern specific linguistic patterns for the different country representatives. The project provides proof of concept that such processes, techniques and outputs are a useful and viable addition to enrich collections and historical research developed as part of the research infrastructure of the CVCE (Centre Virtuel de la Connaissance sur l’Europe). Furthermore, the results of this project will inform the wider goals of the CVCE to provide an enriched dataset that can be further investigated with named entity recognition and social network analysis to yield further insights.

The talk will present our results on modernising historical (Slovene) words, which enables better full-text search in digital libraries and makes old texts easier for today’s speakers to understand. Modernisation of word tokens also allows annotation tools, such as PoS taggers and lemmatizers trained on contemporary texts, to be applied to historical texts. We present the resources used in this research (available at http://nl.ijs.si/imp/) consisting of a hand-annotated corpus, a lexicon of historical word-forms, and a large collection (digital library) of historical books. The resources are available for reading on the Web, with the texts also mounted on a powerful web-based concordancer; the results are also available for download under the CC-BY licence.
We then concentrate on the results obtained in our research on modernisation. We present the initial method, which used hand-written rules with a finite-state library for their application. Recent research concentrated on using character-based statistical machine translation (CSMT). CSMT has recently become a widely used method for word normalisation but, as our results show, the exact set-up of the system is very much dependent on the type of text being analysed and on background resources, in particular the lexicon of contemporary words, which can be used as a filter for the CSMT-generated hypotheses. We present the training and testing data stratified into several historical periods and two sets of experiments, in a supervised and in an unsupervised setting. We discuss the effect of using various CSMT settings and present the implementation of the system in the open source Moses toolkit. Like Moses, our models are also freely available.
Finally, we discuss the directions of further research. One problem with our current approach is the not infrequent mismatch between the tokenisation of historical and contemporary words, i.e. what used to be written as several words is now written as one, and vice-versa. The second problem is that our modernisation takes place immediately after tokenisation and is deterministic. However, how a particular word will be modernised often depends on its PoS, i.e. on its context. More powerful models are needed to take care of these cases, and the talk will present our initial ideas in this direction.
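
The idea of using a contemporary lexicon as a filter on CSMT-generated hypotheses can be sketched as follows. This is only an illustrative Python sketch, not the project's implementation: the function name `pick_modern_form`, the shape of the n-best list and the fallback to the top-ranked hypothesis are assumptions, and no Moses call is shown.

```python
def pick_modern_form(nbest, lexicon):
    """Choose a modernised form from an n-best list of hypotheses.

    nbest   -- list of (candidate, score) pairs, higher score meaning better
               (hypothetical output format of a character-level translator)
    lexicon -- set of attested contemporary word-forms
    """
    ranked = sorted(nbest, key=lambda pair: pair[1], reverse=True)
    for candidate, _score in ranked:
        if candidate in lexicon:
            return candidate  # best-scoring hypothesis that is a known word
    # Assumed fallback: no hypothesis is in the lexicon, keep the top one.
    return ranked[0][0] if ranked else None
```

With a toy n-best list such as `[("serce", -1.2), ("srce", -1.5)]` and a lexicon containing only `"srce"`, the filter would prefer the lower-ranked but attested form. The values are invented for illustration and are not taken from the project's data.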

The British Telecom Correspondence Corpus (BTCC) is a historical letter corpus which was constructed at Coventry University over a two year period from March 2012. The corpus contains a wide variety of business letters from the public archives of British Telecom, the world’s oldest communications company. The letters all relate to the area of telecommunications, and date from the mid-nineteenth century through to the latter part of the twentieth century. This is a crucial era in the development of business correspondence but is so far under-represented in available corpora. As British Telecom was part of the British government until 1984, all of the pre-privatisation material contained within the BT Archives is public record and BT is obliged to promote access to it. The BTCC offers a new way to explore this material and gain insights into the development of business correspondence over this period.

Contextual information is crucial for interpreting letter exchanges, and so a lot of time has been dedicated to manually extracting detailed metadata from the letters and encoding it in the corpus. The metadata has been encoded using the Text Encoding Initiative (TEI) standard for Correspondence. This standard was used so that the BTCC would be compatible with other historical correspondence corpora and (as the information is encoded using just plain text) accessible to current and future researchers in the area of historical business correspondence. Working with the TEI’s Correspondence Special Interest Group I have devised a schema that makes it possible to filter the letters by a variety of contextual and linguistic categories depending on the individual researcher’s aims (e.g. looking at particular decades, authors or companies).

One of the big challenges in approaching the analysis of the BT corpus was how to make meaningful linguistic comparisons across so many different years, authors and subject matters. To address this, the letters were categorised and marked up by pragmatic function. The function categories (1. Application, 2. Commissive, 3. Complaint, 4. Declination, 5. Directive, 6. Informative, 7. Notification, 8. Offer, 9. Query, 10. Thanking) were derived from a close examination of the letters and refined through four rounds of inter-rater reliability testing at Coventry University. These functions have been included in the TEI mark-up, making it possible to extract instances of the functions in the corpus and trace their development diachronically.
In this workshop I will present some general findings from the corpus as well as preliminary results from the functional analysis.

This presentation focuses on the matter of text type classification for the Deutsches Textarchiv corpora. It will give insights about efforts and results of classifying DTA texts according to text types and the existing facilities for text type based research on the DTA corpus. Furthermore, current work on a new text type classification for the DTA will be presented.

The goal of the project Deutsches Textarchiv (DTA, http://www.deutschestextarchiv.de) is to create the basis for a reference corpus for the development of the New High German language (~1600–1900). Evaluating the extent to which the DTA core corpus is balanced with regard to the language material available in different eras depends substantially on the distribution of text types within the corpus. Such considerations essentially determined the original bibliography of the DTA core corpus and are still taken into account when evaluating possible extensions to the DTA corpus. In addition, the distinction of different text types represented in the corpus plays an important role for the analysis of the corpus. For instance, it is possible to examine the development of certain concepts or phenomena of language change with regard to text types and to compare the results to the reference corpus.

Thus, text types were assigned to all DTA corpus texts. The underlying text type classification originating from the DWDS project was gradually extended in a data-driven manner according to the requirements of the DTA corpus texts. The resulting classification consists of two levels: three main categories (fiction, scientific texts, functional literature) and their corresponding sub-categories.

Thanks to this classification it is possible to group DTA corpus texts according to text types and thus facilitate targeted browsing within the corpus. More importantly, though, the DTA search engine DDC supports complex linguistic queries limited to certain text types.

Even though the text type classification of the DTA is in use effectively, it is still limited to some extent. Most importantly, due to the data-driven approach of developing it, it prohibits insights about historically relevant text types which are not represented within the DTA. Therefore, it is difficult and error-prone (with regard to completeness) to identify important text types missing within the DTA corpus and hence carry out targeted extensions to it.

Therefore, a new DTA text type classification is being created based on the existing classification plus various other well-known and established text type classifications. The latter allow for extensive additions to the existing set of historical text types. This set can then be used as an empirical basis for evaluating whether, or to what extent, extensions to the reference corpus are necessary, while being fairly definite in itself. In addition, it is planned that the new DTA text type classification should map to other widespread classifications, thereby making it possible to reuse text type categorizations already applied to certain texts in other contexts and vice versa.

The electronic corpus of the 17th and the 18th century Polish texts (up to 1772) is being created in the Institute of Polish Language, Polish Academy of Sciences, in cooperation with the Institute of Computer Science, Polish Academy of Sciences. It is meant to constitute a part of the National Corpus of Polish.

The antique books are now being transcribed and annotated with structural and textual tags. The rules of transliteration and annotation of baroque texts ensure the achievement of two basic objectives: the faithful mapping of the notation and structure of ancient texts as well as providing the user with convenient methods of finding information in the corpus. Here are some possibilities that the search engine is supposed to provide:
1. Finding phrases with a graphical form other than contemporary.
The user will be able to write the searched expression in a standard contemporary form and the search engine will also find the phrases which could occur in a modified form in the baroque texts. This applies to the words with dashed vowels (e.g. czás = czas ‘time’), abbreviations (e.g. atramẽt = atrament ‘ink’), the words written incorrectly that have been deciphered by the copier (e.g. dłngi = długi ‘long’).
2. Eliminating undesired results.
The user will be able to exclude fragments of a text written in a foreign language from searching, so that the search engine will not show foreign words identical in form to Polish words (e.g. lat. mali ‘bad’ vs. pol. mali ‘small’). There will also be a possibility to exclude the text in the running head, so that regularly recurring fragments will not distort the statistics.
3. Obtaining information about the context of the searched expression.
Information about the fragments that were omitted during copying with a specification of the omissions (e.g. [ILLUSTRATION]) will appear as a context of the searched expression. The expression found in a footnote or marginal note will be presented together with the sentence or paragraph to which it refers.
4. Obtaining information about the exact location of the searched expression in the text.
By linking each word with an appropriate page identifier the search engine will provide the users with the ability to accurately indicate the location of the searched expression in the text, which will facilitate the use of quotations from the corpus in scientific works and dictionaries.
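
The first and fourth search features above (matching historical spellings from a contemporary query, and reporting the page of each hit) can be illustrated with a small Python sketch. It is not the corpus's search engine: the folding rules, in particular expanding a tilde over a vowel to the vowel plus "n", and the names `fold`, `build_index` and `search` are assumptions made for the example.

```python
import unicodedata
from collections import defaultdict

# Letters with diacritics that belong to the contemporary Polish alphabet
# and must therefore survive folding.
POLISH = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")
TILDE = "\u0303"  # combining tilde, used here as an abbreviation mark

def fold(word):
    """Map a historical spelling to a contemporary-looking search key."""
    out = []
    for ch in unicodedata.normalize("NFC", word):
        if ch in POLISH:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
        if TILDE in decomposed:
            out.append("n")  # assumption: the tilde abbreviates a following "n"
    return "".join(out).lower()

def build_index(pages):
    """pages: mapping of page identifier -> list of tokens as transcribed."""
    index = defaultdict(list)
    for page_id, tokens in pages.items():
        for position, token in enumerate(tokens):
            index[fold(token)].append((page_id, position, token))
    return index

def search(index, query):
    """Return (page_id, position, original_form) for every matching token."""
    return index.get(fold(query), [])
```

A query for `czas` would then also return tokens transcribed as `czás`, each with its page identifier, while the original spelling stays available for display.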

In this digital cultural heritage project, we provide innovative access to heritage objects from heterogeneous online collections. We use historical events and event narratives as a context both for searching and browsing and for the presentation of individual objects and groups of objects. Semantics from existing collection vocabularies and linked data vocabularies are used to link objects and the events, people, locations and concepts that are depicted in or associated with those objects. An innovative interface allows for browsing this network of data in an intuitive fashion. The main focus in DIVE is to provide support to (1) digital humanities scholars and (2) general audiences in their online explorations.

Research has shown that many users seek more exploratory forms of browsing. Within the project, content from two cultural heritage institutions is enriched, linked and made available:
1. The Netherlands Institute for Sound and Vision archives Dutch broadcasting content, including television and radio content. Within the project, a subset of the NISV collection of news broadcasts was made available using the OAI-PMH protocol (http://www.beeldengeluid.nl/).
2. The Dutch National Library provides access to historical newspapers. These have been made public through a Web interface and API, Delpher (http://www.delpher.nl/).

The textual descriptions and descriptive metadata are enriched so that structured metadata in the form of events, places, persons etcetera are linked to the cultural heritage objects. For this, we employ an ensemble of enrichment methods. These include Natural Language Processing (NLP). Crowdsourcing techniques are also employed to obtain human-recognized entities and to refine the results from NLP.

The results from the different tools and from crowdsourcing are combined to arrive at high-quality extracted data. These are then consolidated as Linked Data using the Simple Event Model (SEM). This model allows for the representation of events, actors, locations and temporal descriptions. Links to external sources, including Wikipedia, DBpedia and Europeana, are also established. The data is accessed through a SPARQL interface, on top of which an innovative and intuitive event-centric browsing interface is developed. Figure 1 shows the current version of the interface, which is optimized for tablets and modern web browsers. The interface allows for browsing the linked data graph using visual representations of cultural heritage objects, persons, places, etc. When a user is inspecting a cultural heritage object, other objects that are related to it through shared events are also shown. In coming phases, data from the Dutch National Library will also be ingested, enriched and linked. Target groups for evaluating the interface are Digital Humanities scholars, professional users and members of the general public.
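
The event-centric lookup described above (showing other objects that are related to the inspected object through events) can be illustrated with a minimal Python sketch. It does not use the project's SPARQL endpoint or the actual SEM vocabulary; the identifiers in EVENT_OBJECTS and the function name related_objects are hypothetical and serve only to show the idea.

```python
from collections import defaultdict

# Hypothetical toy data: each event is linked to the objects that depict or mention it.
EVENT_OBJECTS = {
    "event:flood": {"obj:newsreel-12", "obj:newspaper-88"},
    "event:royal-visit": {"obj:newsreel-12", "obj:radio-5"},
    "event:election": {"obj:newspaper-90"},
}

def related_objects(obj, event_objects):
    """Map every other object to the set of events it shares with obj."""
    related = defaultdict(set)
    for event, objects in event_objects.items():
        if obj in objects:
            # Every other object attached to the same event is related to obj.
            for other in objects - {obj}:
                related[other].add(event)
    return dict(related)
```

Calling related_objects("obj:newsreel-12", EVENT_OBJECTS) returns the newspaper item and the radio item, each with the event that connects it to the newsreel. An object that shares no event with any other yields an empty result.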

This paper reports on recent work demonstrating that combining corpus linguistics and geographical information systems (GIS) can help further our understanding of nineteenth-century history. It presents an overview of work currently undertaken by the ERC-funded Spatial Humanities: texts, GIS, places project, whose aim is to develop and apply methods for dealing with textual data within a GIS environment.

One example of our work, the Lake District project, analyses the historical cultural landscape of the Lake District by exploring a corpus of 80 texts (guidebooks, tourist notes and letters totalling >1,500,000 words). This research has demonstrated that accounts can be surprisingly different: for example, the places mentioned and visited by Thomas Gray during his 15-day tour of the Lake District in 1769 show little overlap with those mentioned and visited by Samuel Taylor Coleridge during his 9-day walk through the Lake District in 1802 (Donaldson et al. 2013). Another finding is that the state of local roadway infrastructure is critical to explaining why certain Lake District sites have received more attention than others in historical accounts.

A concurrent investigation of patterns of health, diseases and mortality in England and Wales during the nineteenth and early twentieth century has used GIS to explore corpora based on databases of historical documents such as Histpop and Bopcris. In our corpus of the Reports of the General Register Office for England and Wales (1840-1880), we were able to successfully identify places mentioned alongside keywords including ‘cholera’, ‘diarrhoea’ and ‘dysentery’. Mapping these places revealed a bias towards major urban areas and ports such as London, Liverpool and Newcastle, but also a temporal division at 1866. Before 1866, the places most mentioned in relation to these diseases revolved around London, whereas Newcastle is the location which stands out after 1866 (Murrieta-Flores et al. in press); we ascribe this shift to the effect of a major investigation carried out in that year into the causes and transmission of cholera.
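
As an illustration of identifying places mentioned alongside keywords, the following Python sketch counts gazetteer place names that occur within a fixed token window of a disease keyword. It is only a sketch under stated assumptions, not the project's method: the three-entry GAZETTEER, the window size and the function name places_near_keywords are hypothetical, and multi-word place names are not handled.

```python
import re
from collections import Counter

KEYWORDS = {"cholera", "diarrhoea", "dysentery"}   # keywords named in the abstract
GAZETTEER = {"london", "liverpool", "newcastle"}   # hypothetical mini-gazetteer

def places_near_keywords(text, window=10):
    """Count gazetteer places found within `window` tokens of any keyword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in KEYWORDS:
            start = max(0, i - window)
            # Look at the tokens on both sides of the keyword.
            for neighbour in tokens[start:i + window + 1]:
                if neighbour in GAZETTEER:
                    counts[neighbour] += 1
    return counts
```

The resulting counts per place could then be joined to coordinates and mapped in a GIS. Splitting the corpus by year before counting would allow a comparison such as the one before and after 1866.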

Our most recent work analyses discourses surrounding places in the British Library’s digital collection of Victorian periodicals. The pilot stage of this project has looked at a single newspaper, The Era, from 1838-1900 (>3,000 issues, >377,000,000 words), to identify places mentioned and patterns in their representation. The Era includes a surprisingly broad coverage of cities in the UK, although these tend to appear in the context of news related to sports and the theatrical and musical scene (reflecting the known audience of this periodical) rather than typical domestic news articles (crime, politics). Turning our attention to the international stage in an examination of the concept of ‘war’, we see that the complex web of diplomatic relationships between European nations is repeatedly commented on, but that, interestingly, changes in the level of discussion of ‘war’ in relation to specific countries do not seem to drive changes in the level of attention accorded to these countries overall.

Besides providing general access to manuscripts, digital technology makes it possible to explore new tools for information retrieval, which is crucial for scholarly editions. Artificial intelligence work on natural language reveals interesting paths, although it demands considerable effort in domain knowledge representation.

As a contribution to this effort, the present paper addresses the structure and automatic population of a historical kinship ontology for relation extraction in Portuguese early modern sources, such as the early eighteenth-century printed and scribal news “gazettes” (1729-1742).

Historical kinship in the Iberian Peninsula under the Ancien Régime has a deep common background in political and church policies, which were implemented throughout the Portuguese and Spanish colonial empires. Given the importance of all kinds of kinship ties (including spiritual kinship) in political and social dynamics, it is no surprise that in most sources an individual is identified by his or her kin relations, often with other social or political positions added. Such occurrences are particularly common in scribal and printed news. Because it matters for many kinds of social and political studies in Early Modern History, (semi-automatic) kinship relation extraction will significantly improve network-based approaches to these subjects.

In this context, an ontology for historical kinship is essential. Although kinship is nowadays a basic example in information retrieval, the historical context does not fit the usual simple models. Besides the obvious terminological differences, the real obstacle lies in the concepts that represent early modern kinship. To achieve a conceptual model of this reality, the study is based on an analysis of kinship terminology and expressions from different early modern manuscripts as well as printed sources.

In the ongoing work, the 1,526 kinship expressions analysed so far enable the identification of seven types within two categories. Examination of the sources exposes different shades of implicit and explicit kinship formulas. Although specific terminology (father, uncle) and other vocabulary associated with kinship events (marriage, birth, baptism) are a focus, other forms also express kin ties. Royal family members as well as holders of noble titles are named according to hierarchy and kin relation to one another (prince = heir of the throne; “infante” = king's brother; countess = wife of count). Moreover, the focal point is that the ontology has to incorporate all of these terms into larger and more complex concepts that cover the legal and social practices supporting Portuguese early modern kinship. Some of these main concepts (“Casa” (house), family, kinship and law (regal and canon)) are being developed and we intend to display them in the overall ontology structure. At the same time we aim to present the ongoing efforts that enable the automatic ontology population.
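
The way a title can carry an implicit kin relation may be sketched in Python as follows. The three glosses come from the examples given above; everything else (the KinRelation type, the relation labels such as brother_of, and the function implied_relation) is a hypothetical illustration and not the structure of the ontology itself.

```python
from typing import NamedTuple, Optional

class KinRelation(NamedTuple):
    subject: str    # the person bearing the title
    relation: str   # kin relation implied by the title (illustrative label)
    relative: str   # role of the person or office the subject is related to

# Glosses from the abstract; the relation labels are assumptions for illustration.
TITLE_RULES = {
    "prince": ("heir_of", "throne"),
    "infante": ("brother_of", "king"),
    "countess": ("wife_of", "count"),
}

def implied_relation(name: str, title: str) -> Optional[KinRelation]:
    """Return the kin relation implied by a title, or None if the title implies none."""
    rule = TITLE_RULES.get(title.lower())
    if rule is None:
        return None
    relation, relative = rule
    return KinRelation(name, relation, relative)
```

A fuller model would have to attach such rules to the larger concepts mentioned above (house, family, law), since the same title can imply different relations in different legal and social contexts.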

In the era of large-scale stylometry – which is about to come – some basic but difficult methodological questions have not been answered yet. Certainly, the most important one is whether a given method, reasonably effective for a collection of, say, 100 texts, can be scaled up to assess thousands of texts without any significant side-effects. When one deals with historical corpora, however, this question becomes much more complex, since several additional factors have to be taken into consideration. Spelling variation, insufficiently trained NLP models, corpora a priori unbalanced – these are the obvious issues. However, one should also take into account less obvious yet equally important factors that make any stylometric investigations non-trivial, to say the least. To name but a few, these include editorial corrections, punctuation introduced by modern scholars, hidden plagiarism and/or text re-use problems, innumerable authorship issues, and many other sources of potential stylometric “noise”.

The complete collection of Patrologia Latina, recently made available in the form of a carefully prepared corpus with morphosyntactic annotation, gives us a great opportunity to test some of the above assumptions and possible drawbacks of the state-of-the-art stylometric methods. The aforementioned collection consists of 5,821 texts by over 700 authors; it covers a time span of about 1,000 years. Even if the texts represent a few genres, the collection is thematically very consistent: for obvious reasons, theology overwhelms other topics. At the same time, however, the Patrologia Latina is a pre-internet example of a big-yet-dirty text collection, published in the years 1844-1855; the goal was to publish all the available material in a relatively short period of time, with the assumption that particular volumes would be gradually replaced by carefully prepared critical editions.

Fig. 1 Performance of four methods of classification tested on the Most Frequent Words

Fig. 2 Performance of four methods of classification tested on bi-grams of frequent words including punctuation marks

A number of massive stylometric experiments conducted on this collection partly confirm the aforementioned theoretical assumptions, but at the same time several new issues are revealed. To give an example: performance of supervised methods of classification turned out to be relatively poor (cf. Fig. 1), but alternative style-markers sometimes showed unexpectedly high effectiveness. The best performance was achieved by very frequent word bi-grams, or combinations of two adjacent words including punctuation (Fig. 2).
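
To make the best-performing style-marker concrete, the following Python sketch extracts bi-grams of adjacent tokens in which punctuation marks count as tokens, and turns the most frequent ones into relative frequencies. It is a minimal illustration, not the pipeline used in the experiments: the tokenisation rule, the function names and the default of 100 features are assumptions.

```python
import re
from collections import Counter

def word_bigrams(text):
    """Count bi-grams of adjacent tokens; each punctuation mark is its own token."""
    # \w+ matches a word, [^\w\s] matches a single punctuation character.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return Counter(zip(tokens, tokens[1:]))

def relative_frequencies(text, top_n=100):
    """Relative frequencies of the top_n most frequent bi-grams in text."""
    counts = word_bigrams(text)
    total = sum(counts.values())
    return {bigram: n / total for bigram, n in counts.most_common(top_n)}
```

For the phrase "In principio erat Verbum, et Verbum erat apud Deum." the pair ("verbum", ",") is one of the ten bi-grams produced. Feature vectors built this way for each text could then be passed to any of the classifiers compared in Fig. 1 and Fig. 2.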

This and similar results deserve a detailed linguistic (and literary) interpretation. This study is aimed at explaining some of the unexpected results.

The Hebrew Bible is a compact series of books. With its 426,555 words it fits easily in your pocket, and with its 6 MB it fits many times in your smartphone. At the same time it is a body of texts shaped by people of different religious communities in varying geo-political circumstances over at least ten centuries. It is a complex cultural artefact, the object of intense research in religion, philology, history, and linguistics.

The many smaller research questions fall under the following headings:
1. origins: when were which parts written, and how have they been transmitted through time and space?
2. literary quality: which literary genres occur, and how can they be defined by observable characteristics?
3. linguistic variation: how can we account best for the variation of language use in the Bible as a whole, and how can we even detect it?
4. interpretation: how can we employ the knowledge resulting from 1, 2, 3 in grasping the significance of obscure passages in the text?
5. context and linking: on what basis can we link passages to each other and to known historical entities?

The questions are deep, and maybe we are asking too much from a resource of such a limited size. On the other hand, this situation is a positive challenge to represent the material in such ways that we have ready access to all the explicit and implicit information it contains.

A pioneering step was made in the 1970s by the Eep Talstra Centre for Bible and Computer (formerly known as WIVU), who turned the text into a database and added morphological information as features. Later syntactic information was added as well, and the work is still in progress. The result is no longer a small resource, because millions of feature values have been added. The text database, in SQLite3 format, is now 126 MB.

In order to be able to use the methods of data analysis as they have been developed in recent years, it is important that the biblical text database exists in the open, in an accessible format, with well documented features, ready to be taken up by people coming from computational disciplines.

This has been established by the CLARIN project SHEBANQ, in which the text database has been converted into the Linguistic Annotation Framework (LAF) format and archived at DANS (as Open Access). Moreover, a laboratory tool, LAF-Fabric, has been added, by which data analysts can easily extract the data they need for further processing, and a website has been produced where people can publish their own queries and explore other people's queries. These tools are Open Source, and published on GitHub (ETCBC, DANS).

In our presentation we will show some of the results that are being gathered and how this resource is a breeding ground where the above research questions can be tackled by a mix of computational philologists, philological computer scientists and many in between.