Backing Up the World Wide Web

The average lifespan of a web page is 100 days. In an era of thousands of quickly changing websites, blog posts, and tweets, how can we archive the web and all other digital content? Digital librarian Brewster Kahle and historian Abby Smith Rumsey discuss what it takes to save old websites—and the entire Internet—and what society might lose if we don’t.

The lobby inside the Internet Archive in San Francisco, California. Photo by Alexa Lim

Segment Transcript

IRA FLATOW: This is Science Friday. I’m Ira Flatow. If you drive to the northern tip of San Francisco, you’re going to pass rows, and rows, and rows of painted Victorian homes. And as you get closer to Presidio Park, you’re going to pass a corner building that, well, it just seems out of place. Like it might have been lifted from the Acropolis in Greece.

And while the architecture may seem out of place, the building, though, fits right into the spirit of Silicon Valley. It’s the headquarters of the Internet Archive, an organization that has a small goal of preserving and archiving the entire internet– web pages, photos, blogs, videos, everything. Brewster Kahle, digital librarian and founder of the Internet Archive, gave producer Alexa Lim a tour.

BREWSTER KAHLE: The Internet Archive is in San Francisco in a beautiful building, and it was a Christian Science church. The reason why we bought this building is because it matched our logo, which was sort of the pillars of Alexandria. And the Internet Archive is part of the vision of the internet to build the Library of Alexandria version two.

So this is a scanning center of the Internet Archive. Here, we’re digitizing 8 and 16 millimeter film. There’s book scanning going on over there. So we hit the “record” button on the World Wide Web in 1996, and take a snapshot of every website and every web page on every website. So we started by collecting the World Wide Web, but there’s other media types that also aren’t being collected, like old video games– the old Apple II, Commodore, Atari. There are these communities that have gone and put these online towards keeping older digital materials alive.

We’re now in a great room of the Internet Archive. The old church part– we actually still have the pews. These blinking machines are actually servers of the Internet Archive, themselves. Every time a light blinks on this, it’s somebody either uploading something or downloading something from the Internet Archive. We get between 2 and 3 million users every day. So people want old stuff.

Technologically, it’s completely doable to actually go and collect these materials. The question is, how do you do it, and who’s supposed to do it? What we want is the wackiness and the wildness of all the people participating in the big conversation that is the internet.

IRA FLATOW: That was wacky and wild Brewster Kahle, himself– digital librarian and founder of the Internet Archive in San Francisco. And you can see photos from that tour on our website at sciencefriday.com/internetarchive. So is it possible to preserve the entire internet, and who should be in charge of doing it? Brewster Kahle is here to talk about it along with Abby Smith Rumsey, a historian and author of the forthcoming book, When We Are No More: How Digital Memory is Shaping Our Future. She’s also based in San Francisco. Welcome to both of you.

ABBY SMITH RUMSEY: Thank you.

BREWSTER KAHLE: Great to be here.

IRA FLATOW: Brewster, it’s good to have you. We were looking at our archives and found the week when you joined us back in 1993. And you were on an historic show: that edition of Science Friday was the first national radio show to be broadcast live on the internet.

BREWSTER KAHLE: Yes, we even talked about climate change at the time.

IRA FLATOW: We did.

BREWSTER KAHLE: We did if you listen to the recording.

IRA FLATOW: It’s amazing. And that show was digitized. We sent it out to Xerox PARC, and they digitized it out there and sent it over the internet, where probably both people could listen to it on their PCs at that time.

So when did the idea strike you? I know you’re an internet pioneer. When did it strike you that “Hey, we need to record and archive everything on the internet.”

BREWSTER KAHLE: Well, actually, it’s the whole reason that I got involved in trying to get the internet publishing system going in the 1980s. But by the time the early ’90s came around, I’d built a company to help publishing happen on the open internet. And the reason to do that was that you can actually build a library only if it’s open. Only if there’s a free exchange of information on an open system like the internet could we build the library. So 1996 is when we started the Internet Archive to try to archive the whole web and then the whole internet.

IRA FLATOW: Wow. So you created a web crawler called “The Wayback Machine” to take a snapshot of every website. I love that name, because we all know where it comes from. How does that actually work? How do you archive it?

BREWSTER KAHLE: We started out by going and making a robot that basically started with the Yahoo directory. And it would just take all of the web pages, and it would basically be a dumb robot that clicked every link on every page and recorded what it saw. Then if it found new links, it added them to the list. And it would just go, and go, and go, and go, and then we’d start it again after two months.

But we’ve gotten a little bit more mature now. Actually, there are now 1,000 librarians in 400 institutions that are building subject collections. And the persnickety librarians are wonderful, because they might want to make sure that we have everything exactly right. So we started with robots, but now it’s a combination of robots plus librarians.
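
The "dumb robot" Kahle describes is essentially a breadth-first crawl: start from a seed list, record each page, and queue any links not seen before. The following is only a minimal sketch of that idea, not the Internet Archive's actual crawler. The `crawl` function, the `fetch` callable, and the toy in-memory "web" of example URLs are all hypothetical names introduced for illustration, so the sketch runs without any network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=1000):
    """Breadth-first crawl: visit each queued URL once and queue new links.

    `fetch` is a hypothetical callable returning a page's HTML as a string,
    or None when the page cannot be retrieved.
    """
    queue = deque(seeds)
    seen = set(seeds)
    snapshots = {}
    while queue and len(snapshots) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        snapshots[url] = html  # record what the robot saw
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:  # only new links join the list
                seen.add(link)
                queue.append(link)
    return snapshots


# A tiny in-memory "web" stands in for real HTTP requests (assumption).
TOY_WEB = {
    "http://directory.example/": '<a href="/a">A</a> <a href="http://blog.example/">B</a>',
    "http://directory.example/a": '<a href="/">home</a>',
    "http://blog.example/": "<p>No links here.</p>",
}

snapshots = crawl(["http://directory.example/"], TOY_WEB.get)
print(sorted(snapshots))
```

Keeping a `seen` set is what stops the robot from looping forever between pages that link to each other; restarting "after two months" would amount to calling `crawl` again with a fresh set.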

IRA FLATOW: I want to tell our listeners, if maybe they have an important website that they’d like to get to, or if you have a question about how we can archive the internet– our number is 844-724-8255. You can also tweet us @scifri, S-C-I-F-R-I. Abby, when people are taking pictures, they usually delete the bad ones. As an archivist, are we throwing out potentially good information?

Back in the day, I used to talk to other archivists about paper photography– when we used to do that– and they lamented the idea that people no longer keep their bad photos. They just delete them off of their cellphones or cameras. And sometimes the bad photos contain useful information.

ABBY SMITH RUMSEY: Well, I agree that sometimes bad photos contain useful information, even if it’s not very flattering. But you must know a very different set of people than I do, because I don’t know anybody who deletes anything from their cellphone. I think one of the issues that we face is the scale of possible duplication. It makes editing not worth it, because we think we can save everything. So this becomes a problem for deciding what to highlight, or how to search for the things that are truly significant as opposed to just an accidentally photographed redundant picture.

IRA FLATOW: Are you saying that you don’t want all of that stuff that people are saving?

ABBY SMITH RUMSEY: No, in the sense that not everything is equally important. There are two ways of thinking about it, and this is what makes the digital so different from the analog– things that are in hard copy. In the digital, it’s very easy to be able to record and then to send something out on the web. You don’t really have to think about it, so you don’t have any filters in place.

So things of very disparate quality and also intention, something that could be intended to be just short or something that could want to last for a very long time– they live in the same space. And for archivists, and both robots and the librarians, such as what Brewster has, it’s difficult for them to wade through so much material to find the things that are truly significant. But I will say that we’re used to thinking of selecting materials and putting them into print or recording, because they have some kind of long-term, or at least, near-term value. That’s not true of a lot of this stuff on the net which is– or the web, I should say– which is like a conversation or a bulletin board.

But I also think we need to change the way we think about the value of content per se, because most of the content on the web now won’t be read by human beings. It’ll be read by machines and at a scale which human beings have a hard time imagining. Machines can see different patterns in large data sets than humans can. So it could be that we actually do want to save all those bad photos, because they will have some significance to a machine that runs some algorithm that’s invented in 50 years.

IRA FLATOW: Brewster, are we going to be able to read all this stuff that we save in the formats that we’re used to? You know, JPEG, whatever, MOV files, text– we have to keep up with that, don’t we?

BREWSTER KAHLE: We do, we do. Basically, not only do people keep changing what they’re interested in, there’s also whether the file formats are even viewable or not. We’ve had to go and take the movies that have been uploaded to the archive– archive.org, where you can just hit the “upload” button and upload movies– we’ve had to transcode those six times over the last 15 years, just to make it so that people can now see them on their iPhones or whatever it is. And we have to just keep on it. So the format conversion is really a big issue.

Flash is now already becoming, basically, obsolete. And a lot of web pages use it. So we, in the archival community– we used to just be able to kind of sit around, wait for people to die. And then they’d just give us the stuff. No more.

We have to be out there all the time, not only gathering the stuff but keeping it in formats and keeping it relevant.

IRA FLATOW: So are you looking for people to go to your website and say, “Here, just take my stuff.” “Take my photos, my–”

BREWSTER KAHLE: Absolutely.

IRA FLATOW: Yeah?

BREWSTER KAHLE: Just go to archive.org and hit the “upload” button.

IRA FLATOW: And what about all the tweets? Can you archive all the tweets? I mean, the billions of tweets that are going on?

BREWSTER KAHLE: Twitter and Facebook– the web has shifted. It used to be a very open world where, basically, you invited crawlers because that’s how you got searched and found. But now we’re starting to get more and more closed areas of the web. Whether it’s Facebook, which doesn’t allow crawling at all, or at least doesn’t allow us to crawl. There’s also issues with Twitter, where they license that material.

So we get the things that are currently referenced, but trying to get a comprehensive collection is hard. They’ve given a copy to the Library of Congress, which actually is above and beyond what most commercial companies do. But it’s still difficult to get to for researchers.

ABBY SMITH RUMSEY: Yeah, Ira, you have to understand these are privately owned information assets, really. People don’t think of things that they put on Facebook or Twitter as belonging to somebody else, because they come from us. But in fact, they don’t have ultimate control over this, and Brewster’s right. We need more organizations that control information like Twitter, Apple, Google to actually develop partnerships with public institutions like the Internet Archive or the Library of Congress, so that they can be archived and made available to the public for the long term.

Right now, most of these firms can do whatever they want, ultimately, with the data that they have. Even though it may not be what they generate themselves, there’s an agreement, which many users don’t quite understand, between the user or the uploader and what the control is they have over that data long term.

ALAN: A few years ago, I found the limitation of the Wayback Machine. I always loved it and really support the idea of it. But my favorite blog was something I would check in on every month or two and kind of read a bunch of entries all at once. And I went back to do that at one point, and it had totally disappeared. And I was wondering what happened to the blogger.

So I went on the Wayback Machine and it was like, no archives. Apparently, if people want to just disappear and take all their content away, they can click to not have it archived. And then it’s really gone, and I don’t understand why that’s an option. Because it kind of seems to defeat the whole idea that your guest is expressing here.

IRA FLATOW: OK, let me ask Brewster about that.

BREWSTER KAHLE: Your– Hi, I’m sorry we let you down. There’s been some compromises. The question is, how do you deal with some of the privacy aspects or even copyright aspects of the web where some of the materials are not meant for the ages? And what we’ve done is made it so that people can, retroactively, take things out of the Wayback Machine.

And so they could be actually in the Wayback Machine, and then they put a robot exclusion. Or they write to us and say, “Look, that was the blog from when I was married, and I don’t want it up anymore,” or whatever.

And sometimes this is creating holes where there shouldn’t be any. And it sounds like that’s what you got caught in. But it’s been a compromise to try to do a “bend, not break” approach towards how archival should we force other people to be, especially a younger generation that doesn’t really understand that some of the things they’re saying online are going to be there in 20 years?
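
The "robot exclusion" Kahle mentions refers to a site's robots.txt file, which tells crawlers which paths they should not fetch. As a rough illustration of how a crawler might consult such a file, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `may_archive` helper, and the agent name `example_archiver` are made up for the example; how the Internet Archive applies exclusions retroactively to pages it has already collected is not described in this conversation and is not shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is excluded from the whole
# site, while every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: example_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt, user_agent, url):
    """Return True if the robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = may_archive(ROBOTS_TXT, "example_archiver", "http://blog.example/post/1")
allowed = may_archive(ROBOTS_TXT, "other_bot", "http://blog.example/post/1")
private = may_archive(ROBOTS_TXT, "other_bot", "http://blog.example/private/diary")
print(blocked, allowed, private)
```

A site owner who adds the first rule is asking one specific robot to stay away entirely, which is the kind of signal that can leave the hole the caller ran into.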

IRA FLATOW: Here’s a tweet in from “ipoliticalglobal” who asks, “How big is the archive, in items and bytes?”

BREWSTER KAHLE: The Internet Archive collects web pages at about 1 billion pages per week. So the web collection is about 450 billion web objects, pages, or GIFs and JPEGs. And so that’s the database size, which is just fricking huge.

The Library of Congress’s number of books is 28 million. We collect that in about, oh, I don’t know, six hours in terms of the number of items. The total byte collection is 26 petabytes. It goes mega, giga, tera; peta is the next thing after tera, so 26 petabytes. And it’s replicated in San Francisco, as well as in Richmond, and a partial copy in Alexandria, Egypt– really– and Amsterdam.

The idea is to try to, well, not repeat the Library of Alexandria problem this time around.
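
The figures Kahle quotes can be checked with some quick arithmetic. The sketch below assumes decimal (SI) prefixes, where each step from mega to giga to tera to peta is a factor of 1,000, and it treats one collected web object as one "item" for the comparison with the Library of Congress book count. Both are assumptions made for the illustration, not details stated in the conversation.

```python
# Figures quoted in the conversation.
PAGES_PER_WEEK = 1_000_000_000      # web objects collected per week
LOC_BOOKS = 28_000_000              # books in the Library of Congress
TOTAL_PETABYTES = 26                # total size of the collection

HOURS_PER_WEEK = 7 * 24

# Collection rate per hour, and hours needed to gather 28 million items.
pages_per_hour = PAGES_PER_WEEK / HOURS_PER_WEEK
hours_to_match = LOC_BOOKS / pages_per_hour

# Decimal prefixes (assumed): tera = 10**12 bytes, peta = 10**15 bytes.
TERABYTE = 10**12
PETABYTE = 10**15
total_bytes = TOTAL_PETABYTES * PETABYTE
total_terabytes = total_bytes // TERABYTE

print(f"{hours_to_match:.1f} hours to collect {LOC_BOOKS:,} items")
print(f"{TOTAL_PETABYTES} PB = {total_terabytes:,} TB")
```

At the quoted rate the match takes a little under five hours, the same order of magnitude as the "about six hours" Kahle estimates off the cuff.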

IRA FLATOW: I’m Ira Flatow. This is Science Friday from PRI, Public Radio International, talking with Brewster Kahle and Abby Smith Rumsey about archiving the internet. Abby, you worked with the Library of Congress. How did they tackle web preservation? How did this differ from the traditional approach?

ABBY SMITH RUMSEY: Well, it’s been a challenge, and not just for the Library of Congress but every library and archive that exists for paper and other things just because the scale is so huge. The libraries and archives are used, as Brewster said, to receiving things, essentially, after the fact, when something has already been published in which case a book has been through many, many filters. There’s been a substantial allocation of resources, money and time, in order to get something published.

And it’s almost inevitably deposited for copyright, and a company will actually deposit a copy of the book, a hard copy or a film, to the Copyright Office in the Library of Congress so that it can be preserved. And they actually deposit it so that in case there’s a court case, they actually have some evidence. But it means that the Library of Congress has been able to build this massive– I mean, literally, unparalleled– collection of the creativity of the American people. Things that they have for copyright, and that’s no longer operative in the digital realm.

The copyright law, for the digital age, is so far behind what it needs to be for the analog world. It’s not even comparable. And it’s very difficult to understand how to reform copyright in the digital age in a way which accommodates and promotes the private practice of private creation of content but also to make an accommodation for the long-term, unassailable, public interest in the access to these cultural assets over time.

And so there really needs to be a kind of handoff between the private creators who are owners of content and something like the Library of Congress. And until that’s in place, then most libraries and archives, they often will partner with– I was going to say an institution like Brewster’s, but, in fact, Brewster is unique– they will get somebody to do some web crawling for them of the segment of the web that they specifically collect. So a special collection that does labor history, for example, that used to collect ephemeral pamphlets will now actually use something provided by the Internet Archive called “Archive-It,” where they can actually direct a crawl to gather the kinds of materials that are the digital equivalents of handbills, and flyers, and propaganda, that kind of thing.

So that’s what institutions are doing, and they’re trying to do it as best they can. But, again, I’d say that the scale is something that’s very difficult for people to understand. And the institutions that are used to doing this are largely in the public domain and inadequately funded to do that work.

IRA FLATOW: Yeah, this is what we really call “big data.” We’re going to take a break, and come back and talk more, and get more of your questions. Our phone number– 844-724-8255. You can also listen online. Go to our website. It’s sciencefriday.com if you would like to listen from your desktop, your portable device. Talking with Brewster Kahle, Abby Smith Rumsey– we’ll be back after this break. Stay with us.

This is Science Friday. I’m Ira Flatow. We’re talking this hour about archiving the internet and all that digital stuff we’re creating. My guests– Brewster Kahle, digital librarian, founder of the Internet Archive. Abby Smith Rumsey, an historian and author of the forthcoming book, When We Are No More: How Digital Memory is Shaping Our Future. Our number– 844-724-8255. Let’s see if we can go to the phones. Let’s go to Derek in New York. Hi, Derek.

DEREK: Greetings and salutations.

IRA FLATOW: Thank you.

DEREK: Thanks for having me on. The issue I want to bring to your attention is related to the presidential candidate, Governor Chris Christie of New Jersey. A couple months back, there was an indictment. And I think that the first trial has, perhaps, started maybe on what they call “bridgegate,” where the George Washington Bridge connecting New Jersey to New York, the most heavily trafficked bridge on the planet, was closed under the guise of a traffic study as political retribution against a Democratic mayor for not endorsing the governor in his reelection campaign.

The indictments have boarded around the testimony of Christie’s staff and Christie. And the holes in the investigation are missing, destroyed emails and tweets between his staff, and his staff and Christie. WNYC, the New York affiliate–

IRA FLATOW: You gotta get to a question, because I’m running out of time here.

DEREK: Can your archive find, identify, and produce the missing emails and tweets that were destroyed by his staff?

IRA FLATOW: All right, let’s ask that. Brewster, can you do something like that?

BREWSTER KAHLE: Sorry, not going to be there for you. In general, I think what you’re– it’s actually probably not tweets. They’re probably SMS messages, that they were texting each other on their phones. And the Internet Archive has really started to do basically published works of humankind, things that were available to anybody without any restriction. So if it was on their website, or if it were tweeted, we’d have a very likely– or YouTubes– things like that, we have a good chance of having it.

But if it’s private communications, in general, that’s not the realm of libraries. Now, you can have archives, but they have to be sort of permissioned in. And we’re starting to work with archives, but it’s not going to help you on that one. I’m sorry.

IRA FLATOW: I imagine that the data must be growing exponentially. And how do you– are you going to run out of space? I imagine with all the servers there, you don’t even have a heating bill, right? Because just the heat from the servers must take care of you in the winter time.

BREWSTER KAHLE: Absolutely, we heat the building based on the servers. Yeah, we’re buying petabytes at a time. But luckily, the digital guys are doing really well towards continuing to have the disk drives get bigger, and bigger, and bigger. If that stops– which actually, there’s some indication that the hard drives are flattening out and actually may not get bigger– we’re going to be a little bit in trouble.

Fortunately, it only costs us a couple million dollars a year to just buy the new servers that are required. So you say, “Well, that’s a lot of money.” But not by “big boy” standards, so we could basically keep up.

ABBY SMITH RUMSEY: So Ira, this is a very interesting question, because there is a limit to how much we can put on physical formats. And the amount of information which is growing, particularly in the scientific realm, is really phenomenal. It’s even hard to guesstimate.

But there have been some very interesting successful proofs of concept recently about writing human memory to DNA material. This has been done in Switzerland, and in various university campuses, and Microsoft in the United States. And so the hope is that we’d be able to keep far more information if we can inscribe it on to DNA. I know that sounds almost sci-fi-ish, and there will always be moral and ethical issues involved in dealing with anything having to do with inscription on, after all, what is the code for life. But people are very aware of how dangerous all of this explosion of information is in terms of storage space.

IRA FLATOW: Well, I want to thank both of you for taking time to be with us today. Brewster Kahle is digital librarian and founder of the Internet Archive. Abby Smith Rumsey, an historian, author of the forthcoming book When We Are No More: How Digital Memory is Shaping Our Future. Have a great holiday.