The University of Minnesota Archives manages the web archiving program for the Twin Cities campus. We use Archive-It to capture the bulk of our online content, but as we have discovered, managing subsets of our web content and bringing it into our collections has its unique challenges and requires creative approaches. We increasingly face requests to providea permanent, accessible home for files that would otherwise be difficult to locate in a large archived website. Some content, like newsletters, is created in HTML and is not well-suited for upload into the institutional repository (IR) we use to handle most of our digital content. Our success in managing web content that is created for the web (as opposed to uploaded and linked PDF files, for example) has been mixed.

In 2016, a department informed us that one of their web domains was going to be cleared of its current content and redirected. Since that website contained six years of University Relations press releases, available solely in HTML format, we were pretty keen on retrieving that content before it disappeared from the live web.

The department also wanted these releases saved, so they downloaded the contents of the website for us, converted each release into a PDF, and emailed them to us before that content was removed. Although we did have crawls of the press releases through Archive-It, we intended to use our institutional repository, the University Digital Conservancy(UDC), to preserve and provide access to the PDF files derived from the website.

So, when faced with the 2,920 files included in the transfer, labeled in no particularly helpful way, in non-chronological order, and with extraneous files included, I rolled up my sleeves and got to work. With the application of some helpful programs and a little more spreadsheet data entry than I would like to admit to, I ended up with some 2,000 articles renamed in chronological order. I grouped and combined the files by year, which was in keeping with the way we have previously provided access to press releases available in the UDC.

All that was left was to OCR and upload, right?

And everything screeched to a halt. Because of the way the files had been downloaded and converted, every page of every file contained renderable text from the original stylesheet hidden within an additional layer that prevented OCR’ing with our available tools, and we were unable to invest more time to find an acceptable solution.

Attempted extracted text

Thus, these news releases now sit in the UDC, six 1000 page documents that cannot be full-text searched but are, mercifully, in chronological order. The irony of having our born-digital materials return to the same limitations that plagued our analogue press releases, prior to the adoption of the UDC, has not been lost on us.

But this failure shines a light on the sometimes murky boundaries between archiving the web and managing web content in our archive. I have a website sitting on my desk, burned to a CD. The site is gone from the live web, and Archive-It never crawled it. We have a complete download of an intranet site sitting on our network drive–again, Archive-It never crawled that site. We handle increasing amounts of web content that never made it into Archive-It. But, using our IR to handle these documents is imperfect, too, and can require significant hands-on work when the content has to be stripped out of its original context (the website), and manipulated to meet the preservation requirements of that IR (file format, OCR).

Cross-pollination between our IR and our web archive is inevitable when they are both capturing the born-digital content of the University of Minnesota. Assisting departments with archiving their websites and web-based materials usually involves using a combination of the two, but raises questions of scalability. But even in our failure to bring those press releases all the way to the finish line, we were able to get pretty close using the tools we had available to us and were able to make the files available, and frankly, that’s an almost-success I can live with.

And, while we were running around with those press releases, another department posted a web-based multimedia annual report only to ask later whether it could be uploaded to the IR, with their previous annual reports. Onward!

____

Valerie Collins is a Digital Repositories & Records Archivist at the University of Minnesota Archives.