
Hyperlinks in your files? How to get them out using tikalinkextract

Related to my work exploring hyperlinks in documentary heritage – something I feel we’ll be taking care of for a long time – I created a hyperlink extraction tool called tikalinkextract.

Put simply – the tool will take your collection of files, extract the intellectual content using Apache Tika, and then analyse that content for anything ‘looking like’ a hyperlink. (And recently added, thanks to Andrew Berger on Twitter, ‘mailto:’ links.)

Tika’s list of supported formats grows with each new release, so building a tool around Tika’s content-extraction capabilities makes perfect sense. In theory, as the mechanics of tikalinkextract are refined over time, every format Tika can read becomes a format from which we can extract, and think about preserving, links to potentially evidential external records.

Architecture

Apache Tika Server: tikalinkextract connects to the Apache Tika server over TCP/IP to minimise the amount of embedding that needs to happen inside the code. This is a pattern I have found very attractive recently and will continue to work on.
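The client side of that pattern is simple. The documented Tika Server REST API exposes a /tika endpoint (default port 9998) that returns a plain-text rendition of whatever file you PUT to it. This is not tikalinkextract’s actual code, just a minimal Go sketch of the idea:

```go
// Minimal sketch: send a file to a running Apache Tika server and get
// its plain-text content back. Endpoint and headers follow the
// documented Tika Server REST API; error handling is kept brief.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// extractText PUTs the file at path to the Tika server's /tika endpoint
// and returns the extracted plain text.
func extractText(tikaURL, path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	req, err := http.NewRequest(http.MethodPut, tikaURL+"/tika", f)
	if err != nil {
		return "", err
	}
	// Ask Tika for the plain-text rendition of the file's content.
	req.Header.Set("Accept", "text/plain")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: extract <file> (requires a Tika server on localhost:9998)")
		return
	}
	text, err := extractText("http://localhost:9998", os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

Keeping the extraction behind an HTTP call means the Go code carries no Tika (Java) dependency at all; any Tika upgrade is just a server restart.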

httpreserve linkscanner: A golang package I created to tokenize strings and look for hyperlinks. This is the engine of the tool – if we continue to improve the hyperlink spotting capabilities of this, it can be deployed in multiple other applications.
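The heart of that engine can be sketched in a few lines of Go. This is not the real linkscanner code, just an illustration of the tokenize-and-match idea, with a hypothetical prefix list:

```go
// Sketch of the linkscanner idea: split extracted text into tokens and
// keep anything "looking like" a hyperlink or mailto: address.
// The prefix list here is illustrative, not linkscanner's actual rules.
package main

import (
	"fmt"
	"strings"
)

// prefixes that mark a token as "looking like" a link.
var linkPrefixes = []string{"http://", "https://", "www.", "mailto:"}

func looksLikeLink(token string) bool {
	t := strings.ToLower(token)
	for _, p := range linkPrefixes {
		if strings.HasPrefix(t, p) {
			return true
		}
	}
	return false
}

// FindLinks tokenizes text on whitespace and returns the tokens that
// look like hyperlinks, trimming common trailing punctuation.
func FindLinks(text string) []string {
	var links []string
	for _, tok := range strings.Fields(text) {
		tok = strings.TrimRight(tok, ".,;:)('\"")
		if looksLikeLink(tok) {
			links = append(links, tok)
		}
	}
	return links
}

func main() {
	fmt.Println(FindLinks("See https://example.org/report, or email mailto:me@example.org."))
}
```

Because the scanner works on plain strings, it is agnostic about where the text came from: anything Tika can render to text can be scanned.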

httpreserve tikalinkextract: The front end to the two other main components: it sends requests to the Tika server and then aggregates the results from linkscanner. tikalinkextract also takes care of walking directories for you.

Output

The default mode of the tool is to output a CSV (comma-separated values) list that tells users the name of the record in which a hyperlink was found, and the hyperlink itself:
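An illustrative row or two (hypothetical filenames; the exact column layout may differ from release to release):

```
"minutes-2017-03.docx", "http://example.org/consultation"
"annual-report.pdf", "mailto:records@example.org"
```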

Links are unique to the record, but not to the collection. As such, to better support a potential web-archiving workflow in government, a seed-mode was created. This outputs a unique set of URLs per collection of files.
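The seed-mode idea amounts to a de-duplication pass over the per-record results. A minimal sketch, not the tool’s actual code:

```go
// Sketch of seed-mode: collapse per-record (file -> links) results into
// one unique, sorted list of URLs for the whole collection, suitable as
// a seed list for a web-archiving crawler.
package main

import (
	"fmt"
	"sort"
)

// SeedList returns the unique URLs found across all records, sorted.
func SeedList(pairs map[string][]string) []string {
	seen := map[string]bool{}
	for _, links := range pairs {
		for _, l := range links {
			seen[l] = true
		}
	}
	seeds := make([]string, 0, len(seen))
	for l := range seen {
		seeds = append(seeds, l)
	}
	sort.Strings(seeds)
	return seeds
}

func main() {
	pairs := map[string][]string{
		"minutes.docx": {"http://example.org/a", "http://example.org/b"},
		"report.pdf":   {"http://example.org/a"}, // duplicate across records
	}
	fmt.Println(SeedList(pairs)) // duplicates collapse to a single entry
}
```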

How to make this work for you?

Wrap your files in a top-level directory and run tikalinkextract (instructions below are for Linux):

There are a lot of options there. I won’t go into them all (RTM!). For me, the important options to highlight are:

--no-directories

--directory-prefix=output

Wherever you run this command, it will still download all of the files associated with the web archive as individual files, as well as wrapping them inside a WARC file. To make sure they can be easily cleaned up afterwards, and don’t pollute whatever directory they are in, I have opted for them to be stored in a directory called ‘output’.

-i

This argument takes a file containing the list of links – our seed list – all of which get fetched and wrapped into the corresponding WARC.

When the command completes, a WARC will exist that you can inspect with tools like DROID and Siegfried to confirm it contains all the files associated with representations of the websites listed in your seed list.
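The options discussed above match standard GNU Wget flags, so, assuming Wget is the downloader in play, the whole step might look something like the following sketch (seeds.txt is a hypothetical name for the seed list produced by tikalinkextract):

```
# Illustrative only; check your downloader's manual for exact flags.
#   --no-directories        don't recreate remote directory trees on disk
#   --directory-prefix      keep the plain downloads in ./output
#   --warc-file             also wrap every response in collection.warc.gz
#   -i                      read the URL list from a file
wget --no-directories --directory-prefix=output \
     --warc-file=collection -i seeds.txt
```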

What to do with the WARC?

Well, it should be no secret that I’m still exploring web archiving. I need to look at tools that let me dig deeper into the structure of WARCs and what is stored inside them.

My hope is that a government archive can store a record of the hyperlinks associated with a record, somewhere adjacent to the collection, and inform users about how to access this and what it means to them. This work, however, runs a little deeper than that.

- Demonstrate the value of a government permalink service: build an evidence base that shows the number of hyperlinks being used in records and the number of those links being lost. Collecting these links at point-of-transfer may (will) already be too late.

- Preserve links to external records: any hyperlink that has somehow informed a decision made in a public office.

- Preserve links to content-management-system records: as more content management systems (CMS) become web-enabled, or web-based, the technology used in web-archiving becomes more relevant to records and information managers. If a CMS link is used in a record, how do we maintain that connection? What happens when the link becomes a 404?

Comments on any or all aspects of this blog would be appreciated. Knowing what else is out there is good; so is knowing how this work can be improved.

Further Reading

While developing this blog post, I came across a number of useful bits and pieces.
