April 13, 2007

Archival Quality

I can clearly recall one particular afternoon from my childhood when I went over to a friend’s house to play. I walked in the back door, as I ordinarily would, and encountered his mother standing at the kitchen table with the day’s newspaper, and beside it, a scrapbook. When I asked what she was doing, she said, “Today is an important one in history: the Berlin Wall has come down.” She folded the front page of the New York Times in half, and delicately placed it between the scrapbook’s pages. I shrugged and ran into the next room with larger things (like Transformers) on my mind, but that small gesture has always stuck in my head.

The act of archiving is certainly not a rarity, but the motivation behind it is something profound. People save things for many different reasons, only some of which may be personal bookkeeping — perhaps as a way to capture an event for themselves, bottling that moment in time to create context and perspective for their own lives and the ways in which they perceived the world.

I got to thinking about all this a little while ago, specifically in the context of what I (and most of you, I presume) do on a daily basis: create websites. Looking back on even the brief history the web has had, how much of it have we actually accounted for? Not just content-wise, but visually as well. We are good at saving where things are located (the URL), and pretty good at saving what was said (the content), but very poor at saving what things look like (the design). Things become even bleaker when we consider saving what things look like over time, or at a specific time.

Cause for Alarm?

At SXSW this year I saw a great panel headed up by the NYPL’s Carrie Bickner called Preserving our Digital Legacy and the Individual Collector, a continuation of her panel last year, Digital Preservation and Blogs (podcast). The panelists discussed in detail what goes into the process of discovering, documenting, and preserving collections for libraries, but also the extreme troubles of digital preservation: dealing with physical and digital archival methods, and with format restrictions. Will technology 30 years from now be able to even parse what we are creating today? Carrie’s and her fellow speakers’ words really struck a chord with me; so much of what we’ve already done in our industry has slipped away. Entire websites, companies, thoughts, and ideas have all fallen into the void, and will potentially never be seen again.

It’s simple when we think of clearly important events like wars ending, celebrity suicides, and space travel: these events have stacks of associated information and visuals saved throughout the world. But what of the more mundane, or of things important to a select few? Many of the world’s greatest authors and artists were recognized as such only after they had died. Afterwards, their work, and, sometimes more importantly, their private effects like letters and diaries, became monumentally important devices for understanding both the individual and the world through their eyes. So too, much of the work we’ve done has left the web forever, residing only on our hard drives and in dusty stacks of ZIP disks.

Of course, there are projects out there like the Internet Archive’s Wayback Machine, which is wonky even on its best days, often searching for images and files from previous designs long since deleted from servers. The same goes for Ma.gnolia’s feature to “View Saved Copy”, which saves a version of a page you bookmark that references CSS and images from your server. If you change your document structure or markup, remove those images or CSS, or otherwise… ya know… redesign, that cached version of your site will likely break. Others might take matters into their own hands, like Mr. Inman, who programmed an AppleScript to automatically upload a screenshot of his site to Flickr every day.

All of these are great ideas, but far from perfect, especially when it comes down to sustainability and public consumption. It’s terribly frustrating to bookmark a beautifully designed website, only to go back later, find a different design, and realize that I never took the time to take a screenshot. Social sites like del.icio.us and Newsvine have flourished based on their strong communities built around sharing information and interests. We’ve figured out how to share stories and locations, the text of the web, but what about the rest?

Quite the Quandary

Calling this a gargantuan effort is probably a severe understatement. Problems like storage, browsers, platforms, and color profiles would surely pepper the discussion, and those are just the hurdles involved in taking a static visual of a website. Would the leap to preservation of code or a living, semi-functioning website be far behind?

I most certainly don’t have any answers to this, but it’s been troubling me for some time. Even discussing the problems will get us closer to thinking about what to do next. Maybe some people don’t care, or imagine the web to be a more fluid medium where content is the most important thing happening. Well, I care about design over time, both from historical and sociological standpoints. We can look back at books that are hundreds of years old and still read them and feel what they were like, but the web is already slipping away from us. I want to know what a site used to look like 10 years ago, and I want to be able to do the same in another 10 years from now. Preserve our digital legacy!

Commentary (23):

Indeed! This is the reason that I now keep all of the images, JavaScript, CSS and HTML in a WordPress template file. New design? New template file. This also makes it easy to choose your, ahem, “skin”.

With this method he preserves content, design and the functionality of links—but it doesn’t solve the problem of archiving whole websites since it’s of course not possible to save more than one webpage at once.

Julian Schrader: That’s a nice method, and there are many more out there. I guess the biggest problem with methods like that is that they are so dependent on the individual. If we’ve learned one thing from community sites like del.icio.us or Wikipedia, it’s that a large group of people can really get a lot done in terms of cataloguing. I could save lots of screenshots on my own computer, but that doesn’t benefit others online. But through things like tagging and sharing, we can put content like that into contextual relevance. Of course, this is still thinking in terms of mere screenshots; full sites are a different beast entirely.

I’d like to see del.icio.us add a full-page screenshot (via Flickr) of the link when it’s added to their site. It would be a huge load of data, but since they’re both Yahoo companies, hopefully that won’t be too much of an issue.

Of course, there would have to be some fancy back-end checking to make sure they’re not saving thousands of copies of the same screenshot (1 for each person saving a link), but I’m sure it’s doable.

It’s not any sort of perfect solution to what you’re mentioning here, but it’s something.

Joshua Lane: It’s funny you should mention that, because that very idea is what this post used to be about (as it sat in a draft for a few weeks). I’m hesitant to spend so much time investing in something like del.icio.us because I want to save the way a site looks too, not just the location, and most definitely not a wee thumbnail. It would certainly be a start.

How timely. I was at an exhibition of modernist graphic design recently, and was hit by the realization that there was almost no chance that the well-crafted (and in some cases revolutionary) design I love so much on today’s web would ever be archived with such care, if at all.

The conventions established during the high modernist period had such an impact on how we design. Is it really that much of a stretch to think that current web design could have a similarly historic impact?

Poignant post; it’s one of those things that’s been bothering me for a while, especially with the evolution towards separation of style and content. Most of the stuff I made 5-10 years ago still renders okay, despite the mess of inline styles and tables. More recent stuff is a bit of a crapshoot.

The lack of permanence in web design is one of the reasons that I chose to move across the pond and study book design this year. Although, books will still show some wear with age. I was originally looking at the problem for my dissertation, but the scope is a bit too large for a one-year program.

I’ve been using a Firefox extension called Pearl Crescent Page Saver for some time now. It is not perfect, nor does it solve some issues you raised (it just saves a sort of ‘screenshot’ of the page), but for now it has served me well. Still, it is ridiculous that browsers such as Firefox still can’t save a local copy of a website that works properly (images, CSS, HTML, etc).

Another thing that bothers me a lot on this subject (yes I am an archiving maniac, yes I have over 1 TB of data, etc) is related to a concept called Provenance, which, in this context, means being able to tell where something came from.

So you’d have, say, embedded into a JPEG (maybe through EXIF or Spotlight comments), the address of the website it came from. Or maybe somewhere in a PDF file, the e-mail address of the person who sent it to you (this would probably have to be implemented by the e-mail service provider). And so on.

How many times do you “find” something on your computer that you wish you knew where you got, but you can’t remember and can’t find anything else about (mostly because it is called “clip.mov” or something cryptic like that)? Of course you can follow some scheme to keep all your filenames organized, but it is just stupid that you have to do work that the computer could do automatically (and probably better than you).

I am still eagerly searching/waiting for something (a Firefox plugin?) that implements this.

I ran into this problem with a site I made a few years ago. Granted, I was looking for an old photo from my band to prove that I once had orange and red hair, but I’m sure plenty of relevant data is lost in the same way. The only way this image could have been archived is if I had a script that took a screenshot of every dynamic page on the site on a regular basis.

The other issue with archiving is determining what’s worth archiving. Usually the archiving process would only start after something has become noteworthy. The growth up to that point would likely be lost.

One contributing factor, I feel, is that as designers and Web developers we look back at something we did in the past and just hate it. You look at the code or design and think “What the holy heck was I doing?” So you pull the old stuff down, and there goes your reference point.

I had a similar discussion with Shaun Inman over drinks and absurdly loud music at SXSW. One of the primary things that concerns me about archival is that a simple screencap or static representation is simply not enough, and not only because the web is an interactive medium.

My major concern is that of context. If I go and read a piece at NYTimes.com that was written in 1997, I am reading the article within the context of the 2007 version of the site.

What have I lost? How did the layout/purpose/design of the 1997 NYTimes.com influence the content of the article? I have no way of knowing. It’s interesting that, as web developers, we simply do not accept the possibility of a 404 to an old article. We will go to the trouble of rewriting .htaccess files to ensure that our locations are not lost.

But we throw away design and context on a DAILY basis. Even the slightest change from one day to the next represents a new model of perception and thinking around what you are trying to communicate. Inevitably, we think that we are doing “better,” and while that may be true, that doesn’t mean it should come at the cost of completely destroying what came before.

Because the web is digital, and storage is trending towards being basically free, there is seemingly no justifiable reason to simply throw context away, or even to just screencap a design for the sake of reference. We can KEEP context, and we should start thinking about how.

This seems like a very old problem for the design community. How much of what past designers have made over the last century, much less since Gutenberg, has actually been archived? What percentage of any designer’s portfolio is never saved?

Using screenshots (photographs) to preserve the web is just as limited as using captions (text) to archive a photograph. No matter how thoroughly you work to document it, describing every last detail, something will be lost in translation.

My worst experience was realizing too late that in the early days of CD-R the standards for burning information were not settled. Most of my undergraduate work now exists in faded print mock-ups because the discs are unreadable by modern CD players even on the same OS.

I am constantly pushing back the fear that the few printed copies I have of my work will not survive the decade. And I know that my ‘backups’ to hard drives and discs will inevitably fail or be rendered unreadable by software ‘progress’.

Sean Madden: I couldn’t agree more. Context for art and historical works is equally important. Consider too, the context of specific browsers and the way they render websites. It seems trivial now, but even factors like software versions play a big part in weaving a picture of the past. In the SXSW panel I mentioned, the panelists (most of whom are librarians of some sort) went into some depth about altering the original piece as little as possible, even to the point that in 10 years’ time, an emulator of a long dead web browser wouldn’t be the same as seeing a website under its original conditions. This entire medium has shaken up the way in which we preserve things.

Rian: You are right, this has always been a problem in the design community, but the web as a medium has thrown a large monkey wrench in the works of an already bad problem. For printed items, especially things like books, many times there are stacks of copies spread all around the world. This state of distribution is, in a way, a method of preservation itself, in the same way that if we had a database of a collection of websites, we would back it up to different servers in different locations as well.

Screenshots definitely aren’t the way, you’re right. They would merely be a temporary patch for the problem. They wouldn’t solve it, and definitely wouldn’t stand the tests of time either.

I’ve got almost every version of my site that I’ve ever done on a backup hard drive, and now that I’ve got the space to host them all I’ve contemplated creating a sort of online “museum” of my design and where it’s come in the past decade.

If nothing else, this post has reminded me of watching Hasselhoff dancing and singing on top of the Berlin Wall…

There was a rather good lecture series on “digital future” put on by the Library of Congress two years ago. I think it might still be available for free over at audible.com (that’s where I got it). The main gist covered copyright and digital publishing, but there were several parts that delved into archiving everything. One of the lectures included Brewster Kahle discussing the Internet Archive. Well worth the listen in regard to this topic.

I do think that screenshots are better than, or at least should be in conjunction with, entire websites saved with images and stylesheets.

The web technologies change, and the website you saved in its entirety just won’t look the same in any browser 10 or 20 years from now. Along with the saved website you’d also have to archive a browser and an OS. Perhaps even a computer? With a collection of spare parts of course.

Screenshots, preferably printed in a high resolution on quality glossy paper and preserved in an airtight dark place, will keep much longer ;-)

gb: Ooo, thanks for the lead on the Library of Congress Series on the Digital Future at Audible. I just downloaded the tracks and will start listening to them shortly (they are about 12 hours long in total). For those interested:

The full collection contains the free David Weinberger lecture, as well as lectures from 7 other speakers like Neil Gershenfeld, Brewster Kahle, and Lawrence Lessig, all for the low, low price of $10.

I think the problem stems from technology itself. Computers aren’t supposed to be like paper or vinyl. We use them to display many kinds of data in many ways, which is considered an asset, in fact.

We can print on a sheet of paper only once, though, and what is printed will remain there until I rip the paper up. Unlike paper, the data we store on our computers is prone to alterations in so many different ways, and we owe that to rewritable discs, flash memory, and so on…

Re. knowing where things came from, Safari will embed the originating URL in the metadata for any file which is downloaded via its download manager (get info in the Finder, then expand the ‘more info’ pane, if necessary). It’s not perfect (I don’t think it happens for dragged-and-dropped images) but it is a start.

I think you’ve raised an important point, Jason, with the question, “will technology 30 years from now be able to even parse what we are creating today?”
You can purchase the biggest hard drive in history and put screenshots of your designs in there every day, but if future computers/technology can’t even read/parse that format in years to come, is it worth it?
This is something I have considered with my own work in the past, not only in design but in everything that I have stored on a piece of technology somewhere - will I be able to show future generations my work (regardless of what that is) if the technology of the future is too advanced to read/parse/etc that work?

Anyway, this is the first time I’ve commented on your site, Jason, but I just wanted to say, I come on here all the time and have a good read. You’ve always got something interesting to say. Thanks!