A JISC-funded project at the Institute of Historical Research

On Thursday 21 June we are going to hold an afternoon workshop for historians interested in using digital tools for research. It will be held in Senate House, London, and will run from 2pm to about 4.30.

The project team will discuss the work we’ve done on Histore to date. Then there will be talks on semantic markup and text mining, followed by a break-out session for group discussion of different techniques. Attendees are encouraged to bring digital project ideas to discuss during the break-out. There will also be an opportunity to discuss your projects with us one-to-one, if you’d like to.

This workshop is free but places are limited. If you’d like to come to the workshop, or have any questions about it, just drop me an email at jonathan.blaney@sas.ac.uk.

In the rapid analysis phase of the project we looked at what tools are available to historians for digital research in the five areas we are concerned with (visualisation, text mining, linked data, cloud computing and semantic data) and tried to decide which we thought would be most useful to historians as the focus of introductory training courses.

In the end we decided in favour of text mining and semantic data. Visualisation was a strong candidate but we felt that cloud computing was somewhat nebulous as the subject of a training course: if we could do it at all, it seemed to us, it would essentially be training people to use particular tools rather than general techniques – which isn’t what the project has undertaken to do.

As it happens I am currently working on a JISC-funded linked data project, Liparm, which will use linked metadata to create a union catalogue of UK parliamentary material. This will be a good proof of concept for linked data in a historical context and I have already learned a lot about linked data from working on the project. But the point of Histore is to address the lack of take-up of digital resources by historians, and linked data already assumes (doesn’t it?) that those digital resources have been, or are being, created by historians. Linked data looks like a next step, rather than the kind of initial impetus that Histore seeks to provide.

That was our conclusion, but readers might disagree. We’d be keen to hear your thoughts.

As one of the team members for the Histore project I thought I would do a little bit of digging into other attempts (largely from other disciplines) to describe, in the simplest way possible, some of the tools that we will be looking at. In my first entry I will look at text mining. Jonathan has already provided a brief definition: text mining is “The derivation of structured, meaningful data from a large body of unstructured data, using automated analytical methods”. But how are other people defining this particular tool?

When I carry out a basic Google search on the question ‘what is text mining?’, the first item that appears on my screen is a short article with that title, written by Marti Hearst in 2003. Professor Hearst works in the School of Information at UC Berkeley and researches various digital tools: search engines, social technology, computational linguistics (including text mining), information visualisation, and website usability. Her article ‘What is Text Mining?’ (17 October 2003) describes text mining as ‘the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources’.

Hearst emphasises that text mining is an aspect of data mining but differs in that it attempts to extract information from natural language text rather than from structured databases of facts. Thus, text mining attempts to dig out new knowledge from free-flowing text such as might be found in an article, monograph, or primary source material. Text mining is not a glorified search: search engines look for something that is already known and cannot easily separate the chaff (irrelevant data) from the corn (relevant data)! Hearst also believes that text mining differs from programmes designed for information extraction:

‘I distinguish between what I call “real” text mining, that discovers new pieces of knowledge, from approaches that find overall trends in textual data’. (Hearst, 2003)

A second, longer article by Hearst entitled ‘Untangling Text Data Mining’ (which appears online and in the Proceedings of ACL’99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland [1999]) looks at how corpus-based computational linguistics has engaged with text data mining. It concludes that, whilst the field is good at producing better text analysis algorithms, it fails to search for new facts and trends about the actual world. Although this is now quite an old study of text mining, the semi-automated approach that Hearst called for, as a way of getting better text mining results, is still underused, at the very least in the humanities sector.
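To make the contrast between searching and mining a little more concrete, here is a minimal sketch in Python. It is not Hearst’s method, only an illustration under simple assumptions: the sample text is invented, any capitalised word is treated as a name, and sentences are split on full stops. A search would ask “where does Cromwell appear?”; the sketch instead asks which names tend to appear together, without being told in advance which names to look for.

```python
import re
from collections import Counter
from itertools import combinations

# Invented sample text, standing in for a much larger unstructured corpus.
TEXT = (
    "Cromwell wrote to Fairfax about the army. "
    "Fairfax replied to Cromwell within the week. "
    "Parliament summoned Fairfax in the spring. "
    "Cromwell addressed Parliament that summer."
)


def split_sentences(text):
    """Very rough sentence splitter: break the text at full stops."""
    return [s.strip() for s in text.split(".") if s.strip()]


def capitalised_words(sentence):
    """Crude stand-in for name recognition: any capitalised word.

    Assumption for illustration only. On real text this would also catch
    ordinary words that happen to begin a sentence.
    """
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", sentence)))


def co_occurrences(text):
    """Count how often each pair of capitalised words shares a sentence."""
    pairs = Counter()
    for sentence in split_sentences(text):
        for pair in combinations(capitalised_words(sentence), 2):
            pairs[pair] += 1
    return pairs


if __name__ == "__main__":
    for (first, second), count in co_occurrences(TEXT).most_common():
        print(f"{first} and {second}: {count} shared sentence(s)")
```

Even on four sentences the result is a small piece of structured data (pairs of names with counts) that was not stated anywhere in the text itself. A real project would need proper name recognition and far more text before any pattern could be trusted.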

When drawing up the initial idea for the HISTORE project we broke the deliverables up into three main portions that seemed to make sense from the perspective of both compiling the resources in the first place and then presenting them in a useful way to historians.

These were:

A tool audit (by example of existing projects)

Case studies (one per tool)

Training modules (two tools demonstrated)

These will be made available through the IHR’s History Online and History SPOT platforms which are now our primary location for digital data, listings and online training materials. Much of this material will be produced in-house through our own extensive expertise in these areas; however there are various parts where we have planned (and have budgeted) for external help. The following is a brief breakdown of what we currently see these deliverables as containing.

Tools Audit

The tools audit will form a database of current relevant digital projects for historians using one or more of the tools selected for investigation for the HISTORE project. These will be organised by function, with a faceted browsing interface to allow filtering of tools along multiple dimensions. The tools audit will be made permanently available on History Online with direct links to the case studies and training modules on History SPOT.
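For readers unfamiliar with the term, faceted browsing simply means narrowing a list by several independent properties at once. The following is a minimal sketch in Python, not a description of how the audit is actually built: the tool names, the facet names (function, difficulty, cost) and their values are all invented for illustration.

```python
# Hypothetical audit records; names, facets and values are invented.
TOOLS = [
    {"name": "Tool A", "function": "text mining", "difficulty": "beginner", "cost": "free"},
    {"name": "Tool B", "function": "visualisation", "difficulty": "beginner", "cost": "free"},
    {"name": "Tool C", "function": "text mining", "difficulty": "advanced", "cost": "paid"},
    {"name": "Tool D", "function": "linked data", "difficulty": "advanced", "cost": "free"},
]


def filter_by_facets(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [
        record
        for record in records
        if all(record.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(records, facet):
    """Count records per value of one facet, as a browsing page might
    display beside each filter option."""
    counts = {}
    for record in records:
        counts[record[facet]] = counts.get(record[facet], 0) + 1
    return counts


if __name__ == "__main__":
    free_text_mining = filter_by_facets(TOOLS, function="text mining", cost="free")
    print([record["name"] for record in free_text_mining])
    print(facet_counts(TOOLS, "function"))
```

Each extra facet narrows the list further, and the counts tell the user how many tools remain under each option before they click on it.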

Case Studies

A representative tool from each of the main areas relevant to historical research will be included in a series of case studies describing what the tool can be used for, providing examples of actual use, and demonstrating how it can be combined with other tools/software. These case studies will be made available on History SPOT.

Training Modules

The audit will inform the choice of training areas. Two free online modules will be developed to train historians in the basic use of two digital tools. The modules will be multimedia in nature and provide a general understanding and awareness of the tools’ use. Again, these will be made available on History SPOT.

HISTORE stands for Historians’ Online Research Environments, which admittedly doesn’t tell you much about what the project involves, even though it does state our planned end result very well. Simply put, HISTORE is an attempt to identify and help demystify the online tools that are of most value for historical research.

Many historians will have heard of text mining or semantic data (for example) but will not have thought about how these tools could be used in their own research. Indeed many historians will not actually know what these terms really mean or what results they might produce. The technical jargon can be off-putting and as yet there is practically no discipline-based training available for undergraduates, postgraduates or established academics.

While there is a small group of enthusiasts adopting relevant tools as they have become available, many have confined themselves to searching a few trusted online collections, such as the Oxford Dictionary of National Biography or the Bibliography of British and Irish History. Surprisingly, early-career historians differ little from older colleagues, and even a freely available in-browser bibliographic tool like Zotero evokes only moderate name recognition and few declared users. In particular a question about the use of VREs (Virtual Research Environments) in stakeholder research carried out by the IHR produced very few respondents who said that they had ever used one (see The Impact and Embedding of an Established Resource: British History Online as a Case Study). A common response to questions about specific tools was ‘I don’t know what that is’, followed by expressions of interest when its functionality was explained.

All of this suggests that lack of awareness, not active resistance, is the main barrier to the embedding of research tools in the historical community. The HISTORE project hopes to help demystify several of these tools, raise awareness of what they can do and illustrate that these tools are not the preserve of IT specialists but quite learnable by academic historians.

We thought it might be a good idea, at the beginning of the project, to attempt to define what we’re talking about in the five digital research areas that will be covered: cloud computing, linked data, semantic data, text mining and visualisation.

These are the definitions that the team has come up with, but there is bound to be room for improvement. If you can suggest improvements we’d love to hear about them in the comments.

Cloud computing: storage and processing of data entirely on third-party servers, with no local working or copies.

Linked data: exposed data which can be read by machines along with other data in the same format.

Semantic data: data marked up, however lightly or heavily, in ways which reflect the semantic content of a text, rather than its structure.

Text mining: the derivation of meaningful data from a large body of unstructured data, using automated methods to reveal structure and associations.

Visualisation: the visual representation of data in an attempt to show otherwise hidden patterns and relationships.
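As a small illustration of the semantic data definition, here is a minimal sketch in Python. The marked-up sentence and its element names (person, place, date) are invented for the example and do not follow any particular standard. The point is only that, once a text is tagged according to what its words refer to, a program can pull out the people, places and dates without having to guess.

```python
import xml.etree.ElementTree as ET

# Invented example of light semantic markup. The element names are
# assumptions for illustration, not taken from a real encoding scheme.
SAMPLE = """<p>On <date when="1649-01-30">30 January 1649</date>
<person>Charles I</person> was executed at <place>Whitehall</place>,
after a trial urged on by <person>Oliver Cromwell</person>.</p>"""


def extract(xml_text, tag):
    """Return the text of every element with the given tag, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(tag)]


def normalised_dates(xml_text):
    """Return the machine-readable form stored in each date's 'when' attribute."""
    root = ET.fromstring(xml_text)
    return [element.get("when") for element in root.iter("date")]


if __name__ == "__main__":
    print("People:", extract(SAMPLE, "person"))
    print("Places:", extract(SAMPLE, "place"))
    print("Dates:", normalised_dates(SAMPLE))
```

Structural markup would only tell us that this is a paragraph. The semantic tags tell us who and where it is about, which is what makes the text usable as data.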

The central aim of Histore is to help encourage historians to make greater use of online tools in their research, by giving them information to help them choose the most effective tools for particular tasks and enabling them to request relevant additions to their own institutions’ Virtual Research Environments.

The IHR has been monitoring the use made of digital technology by historians since 2003. What we have found is that it is generally lack of awareness of research tools and their benefits that impedes their take-up in the profession, rather than outright hostility.

Histore will have two phases. In the first phase we will publish an audit of what digital tools are currently available, listing their main characteristics and assessing their difficulty level for new users. In the second phase we will select two areas which we feel present the most pressing need for research training and produce two free online research modules introducing these topics.