Harvest time

FIVE FEDERAL, educational and private archiving organizations have partnered to crawl the Web, gathering data from sites in the .gov domain to create an end-of-term snapshot for posterity.

The ambitious Web harvest is an effort to preserve millions of pages of government Web sites that are in danger of disappearing, or at least changing, when a new administration comes into office Jan. 20.

'No matter who wins [the presidential election], we expect there will be changes in the policies governing Web sites,' said Abbie Grotke, coordinator of digital media projects for the Library of Congress' Office of Strategic Initiatives.

Much of what exists online now is liable to disappear regardless of policy, as sites are updated on a regular basis, sometimes daily.

LOC has been doing monthly crawls of congressional Web sites since late 2003. However, 'there is a bit of a gap in who is responsible for archiving the executive and judicial branches at the end of the term,' said Kris Carpenter, Web group director at the Internet Archive.

To fill that gap, LOC, the Internet Archive, the California Digital Library and the University of North Texas, with some help from the Government Printing Office, have taken on the job.

'Nobody told us we needed to do it, but we realized there was nobody else tasked to do this,' Grotke said. 'We all thought it was important to do.'

'It would be a tragedy if we didn't attempt to preserve this,' Carpenter said.

All of the organizations are members of the International Internet Preservation Consortium and frequently cooperate on similar projects.

The project is a big one. The Internet Archive estimates that it will gather some 125 million pages from around 5,000 sites. Estimates of the total volume of data to be collected range from 10 to 20 terabytes.

Each organization is contributing according to its expertise. LOC will focus on development of its archives for the project; the Internet Archive began a comprehensive baseline crawl of the .gov domain in August and will do a second crawl before inauguration day; the University of North Texas and the California Digital Library will focus on prioritizing sites that need more frequent attention and doing more in-depth crawls; and the GPO federal depository library program is offering advice on curating the collection.

Now you see it ...

The online world presents a paradox. We often are warned that once something appears online it never really disappears and that incautious statements can come back to haunt us years after they are made. But at the same time, online data is ephemeral, constantly changing and moving even if we cannot be sure that it has ever been expunged. This makes finding and documenting it for future reference difficult.

Because they are dynamic, 'all Web sites are at risk,' Carpenter said.

The Internet Archive was established in 1996 to make digital material permanently available. The collection is not intended to be comprehensive, Carpenter said. 'It doesn't include everything. We're really focused on digital heritage,' that is, how society manifests itself online.

The collection now consists of a petabyte of compressed data and is expected to expand by multiple petabytes a year. To access the data, the Internet Archive has developed the Wayback Machine, an open-source online tool used by entering the URL and date of an archived site. Once in a site, hyperlinks can be followed.
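
The article does not describe the mechanics, but the URL-and-date access pattern can be illustrated with the Wayback Machine's public availability endpoint. The endpoint and JSON fields below reflect the Internet Archive's published API as generally understood; treat those details, and the sample query, as assumptions rather than something taken from this article.

    import json
    import urllib.parse
    import urllib.request
    from typing import Optional

    # Public availability endpoint; returns the snapshot closest to a date.
    WAYBACK_API = "https://archive.org/wayback/available"

    def closest_snapshot(url: str, timestamp: str) -> Optional[str]:
        """Return the archived copy of `url` nearest to `timestamp` (YYYYMMDD), if any."""
        query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
        with urllib.request.urlopen(f"{WAYBACK_API}?{query}") as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest", {})
        return closest.get("url") if closest.get("available") else None

    # Example: how whitehouse.gov looked around inauguration day 2009.
    print(closest_snapshot("whitehouse.gov", "20090120"))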

The University of North Texas has set a similar, if more limited, mission.

'We've been involved in capturing public government Web sites since 1997, when we started the CyberCemetery,' said Cathy Hartman, assistant dean of digital and information technologies. 'We are a federal depository library, and we looked at this as part of the role that the department should be filling.'

The CyberCemetery originally archived Web sites from agencies that were shutting down after hitting the end of life, or at least the end of their funding. The project began by going out and looking for these sites, and as word of its mission spread, agencies began contacting the cemetery to contribute sites for preservation. 'Now we are harvesting agencies that are not dying,' Hartman said.

The CyberCemetery's collection is not officially sanctioned, but the university and GPO have developed guidelines specifically for collecting and preserving digital material.

The online environment has evolved rapidly, Hartman said. 'When Clinton came to office [in 1993], there was almost nothing available on government Web sites.' Today, many government services are offered online and millions of pages of official information are maintained online.

The university did a fairly extensive harvest of .gov sites four years ago, but this year's project is more extensive. One of the first challenges it faced was coming up with a list of Web sites to crawl and harvest. Compiling that list is an ongoing task, but once the duplicates and inappropriate sites are removed, it probably will be about 4,200 sites, said Mark Phillips, head of the University of North Texas' digital projects unit.
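
The mechanical part of that clean-up, collapsing nominations that differ only in capitalization, fragments or trailing slashes, can be sketched in a few lines of Python. The normalization rules here are illustrative assumptions; deciding which sites were inappropriate or out of scope remained a human judgment in the real project.

    from urllib.parse import urlparse

    def normalize(url: str) -> str:
        """Canonical form: lower-cased host, fragment dropped, no trailing slash."""
        parsed = urlparse(url.strip())
        host = parsed.netloc.lower()
        path = parsed.path.rstrip("/") or "/"
        return f"{parsed.scheme}://{host}{path}"

    def dedupe(nominations):
        seen, unique = set(), []
        for url in nominations:
            norm = normalize(url)
            if norm not in seen:
                seen.add(norm)
                unique.append(norm)
        return unique

    print(dedupe(["http://WWW.LOC.GOV/", "http://www.loc.gov", "http://www.gpo.gov/"]))
    # ['http://www.loc.gov/', 'http://www.gpo.gov/']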

'One of the problems we had is that there are multiple organizations involved, and each of us had a different list of the .gov domain,' Phillips said. 'But nobody had a comprehensive list.'

The California Digital Library, for instance, had a broad list that included many state government sites that are outside the scope of this project. So the university developed the URL Nomination Tool (UNT) to help create an authoritative list. It is a Web-based tool developed with the Django open-source Web framework with a MySQL open-source database on the back end.

'It's a way to add a little bit of metadata' about the URLs and Web sites being considered, Phillips said. 'The biggest challenge in building the tool was the data model.' Creating a user interface was easy, but it took time to come up with a common data model that would accommodate the various types of information included in URL lists from different sources.
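
The article does not publish that data model, but a minimal Django sketch of what a nomination record might look like is shown below. The field names are purely illustrative assumptions, not the actual schema of the URL Nomination Tool.

    from django.db import models

    class Project(models.Model):
        """A harvest effort, such as the 2008 end-of-term crawl."""
        name = models.CharField(max_length=200)

    class Nomination(models.Model):
        """A single URL put forward by one of the partner institutions."""
        project = models.ForeignKey(Project, on_delete=models.CASCADE)
        url = models.URLField(max_length=500)
        nominating_institution = models.CharField(max_length=200)
        agency = models.CharField(max_length=200, blank=True)
        branch = models.CharField(max_length=20, blank=True)   # executive, legislative or judicial
        in_scope = models.BooleanField(default=True)            # filters out state sites, etc.
        crawl_priority = models.IntegerField(default=0)         # flags sites needing frequent capture
        nominated_on = models.DateTimeField(auto_now_add=True)

Because each partner's list carried different metadata, a single nomination table with mostly optional fields is one simple way to express the common model the team needed.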

The UNT was created specifically for the end-of-term harvest because of the size of the project and the number of collaborators involved. But creating a list of target sites is 'a process that happens every time you do a big crawl,' Phillips said. 'This is a tool that we will be able to use in other projects. We're hoping to release it as an open-source tool for the Web harvest community.'

Cooperation not guaranteed

Crawling the Web and gathering pages from the sites are not technically difficult. The Internet Archive is using its Heritrix open-source crawler designed for massive Web-scale crawls. It starts with the list of seed addresses from the URL Nominator and visits each of them, following links within each domain and crawling them as well. The job is complicated by the fact that many sites are dynamic, with content for a single page hosted on different servers. The activity of the crawler is monitored to avoid getting caught in a loop of links or other mishaps.
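
Heritrix's own configuration is far more involved, but the seed-and-scope pattern described above can be sketched as a toy crawler. The code below illustrates the general technique under those assumptions; it is not the Internet Archive's implementation, and the two seed URLs are examples only.

    import re
    from collections import deque
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    def crawl(seeds, max_pages=100):
        visited = set()
        queue = deque(seeds)
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue  # loop avoidance: never fetch the same URL twice
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue  # unreachable or non-HTML content; skip it
            for href in re.findall(r'href=["\'](.*?)["\']', html):
                link = urljoin(url, href)
                # scope rule: only follow links that stay on the same domain
                if urlparse(link).netloc == urlparse(url).netloc:
                    queue.append(link)
        return visited

    # Example: a tiny crawl seeded with two federal sites.
    pages = crawl(["https://www.loc.gov/", "https://www.gpo.gov/"], max_pages=20)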

The cooperation of Web site owners is helpful when doing a crawl to harvest data, and some provide site maps for crawlers to help them on their way.
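
A site map in the standard sitemaps.org XML format lists every page the owner wants crawled, so a harvester can read it directly instead of discovering pages link by link. The snippet below sketches that step; the sitemap location shown in the comment is hypothetical.

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def sitemap_urls(sitemap_url):
        """Return the page URLs declared in a sitemap.xml file."""
        with urllib.request.urlopen(sitemap_url) as resp:
            tree = ET.parse(resp)
        return [loc.text for loc in tree.iter(f"{SITEMAP_NS}loc")]

    # Hypothetical example: feed the declared URLs straight into the crawl queue.
    # seeds = sitemap_urls("https://www.usa.gov/sitemap.xml")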

But 'we don't always get cooperation,' Carpenter said. Whitehouse.gov is an example of an uncooperative site. Although the Internet Archive generally respects the privacy of site owners who do not want to participate in its harvests, those sites that the Archive considers public, such as the White House site, get included whether they like it or not.

Once the baseline material has been gathered, archivists and researchers at the University of North Texas and California Digital Library will identify sites likely to change frequently before January, and those sites will be revisited with more frequent harvests.

Nobody expects the results to be comprehensive. 'We can't at this stage afford to capture everything as it changes, in real time,' Carpenter said. 'But we do invest heavily to create a representative snapshot at any given time.'

Once the harvest is complete, each partner in the project will get a complete copy of the material, although the Internet Archive is expected to be the main point of access for the collection, which should be available online by March or April. The 10 to 20 terabytes of the .gov end-of-term collection will be just a small part of the petabyte of data the Internet Archive already is making available, but it will be a huge addition for the other institutions.

'I think we're going to end up with more data than we're equipped to deal with,' the University of North Texas' Phillips said.

Even the Internet Archive faces limits on how its large collections can be used. Its current access technology, the Wayback Machine, works more like a browser than a search engine, requiring a URL to find material. So although the end-of-term collection will be included in the archive's general collection, it probably will also be broken out as a separate special collection that could be searched.

But even as a separate collection, the end-of-term harvest might be approaching the upper limit of what is feasible to search with current search engines. Searching a group of pages that have changed over time is different from using Google to do a search of the live Web, which covers only current pages.

'We've been fairly happy with tools that have scaled to the hundreds of millions of documents,' Carpenter said. But once you get to a billion documents, the quality of search results drops off quickly.

Mining such massive collections is the next big step in Internet preservation, Hartman said. She said the university has not yet decided whether it will use tools to pull together subject-specific content in a search or to break the collection up into smaller sets by subject matter.

'We may have to have separate collections,' she said. 'Experiment and research will tell for sure.'

That could be the next project for this team. 'We are hoping to find some funding for research in that area,' Hartman said. 'There's a strong interest among our partners in that.'