Digital Preservation & Corporate Owned Platforms

When it comes to social media, the vast majority of our personal thoughts and records are only preserved by some company's far-away servers. For a fairly long time now, we have collectively tied our identities to spaces owned and operated by third parties.

Dr. Tamar Carroll of RIT’s Digital Humanities and Social Science program brought up how Facebook embodies this trend.

“Facebook tells you, ‘Here’s your memory from seven years ago, or ten years ago,’” she said. “So I think there’s a way in which people just assume it will always be there. And that’s a dangerous assumption. Individuals need to be thinking about backing up their own.”

Carroll expressed how these platforms — and parts of our lives preserved on them — don’t have the guaranteed longevity of public services.

Why the Concern?

Despite that degree of uncertainty, society has still adopted such sites en masse. The very fact that so many of these platforms are free almost makes it too easy to treat sites like YouTube, Facebook and Twitter as public resources.

Students in Carroll’s Oral History (HIST-324) classes will use SoundCloud, for instance, to post excerpts of interviews they’ve done.

“SoundCloud was an attractive place to put them because it is so widely used and it’s a way to get content seen and heard,” Carroll said. “The other thing that is nice about SoundCloud is when their servers are hosting it you can just put links on your own website and so it can speed up — make for a faster viewing experience.”

Thankfully, Twitter elected to make vine.co a time capsule, where creators and fans could access old videos. The content may not be guaranteed indefinitely, but it’s more than one might expect from these corporations.

“Things like a company ...,” began Becky Simmons, RIT’s Archivist. “It’s not really an archive. It may not be there in 50 years.”

“No they don’t. For companies, for them it’s about making money and moving forward,” Simmons said. “They don’t always respect their own history or see the importance.”

Simmons also pointed out how these sites are just vehicles for other people’s content anyway.

"It's not their content, it's not even about them," she noted. Simmons posited that these companies likely perceive their history to be reflected more in meeting minutes, internal documents and company milestones than in the thoughts or creations of users.

It's not hard to see how companies could divorce themselves from any moral obligation to preserve their users' content. There is no legal responsibility either, leaving both creators and fans of platforms like SoundCloud to wonder what would happen should one ever shutter unceremoniously.

Preserving by the Petabyte

While there are certainly socially conscious enterprises, at the moment non-profits and the government seem the most committed to digital preservation. Carroll pointed to public institutions like the Library of Congress and the Digital Public Library of America as examples. Organizations like the Internet Archive have also been making efforts toward preservation.

Started in 1996, the Internet Archive has stored over 30 petabytes (1 petabyte = 1,000,000 gigabytes) of data according to the organization itself. Their archive apparently contains:

279 billion web pages

11 million books and texts

4 million audio recordings (including 160,000 live concerts)

3 million videos (including 1 million Television News programs)

1 million images

100,000 software programs

“They try to archive everything and so that’s a partial backup,” explained Carroll. She recalled that a student of hers used the Internet Archive’s Wayback Machine for a project on the transgender community in Rochester. The research involved finding a particular online chat room as it existed in 1996. Although the site was no longer active, the archived copy gave the student a sense of the chat room that had been one woman’s introduction to others within the transgender community.

“So it’s an important source for historians to look at content as it was — as it appeared — on the internet at that time,” said Carroll. “That said ... I wouldn’t rely on it completely.”

Despite collecting so much information, the Wayback Machine does not contain everything that has ever been published on the internet. The Internet Archive can only take so many snapshots, and not every site is captured every day. Quite often it can only provide an idea of what a website might have been like in the past.
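For readers who want to check which snapshot exists for a given page, the lookup can be sketched in a few lines of Python. This is a minimal sketch, not the Internet Archive's own tooling: it assumes the public Wayback availability endpoint at archive.org/wayback/available and the JSON layout that endpoint has returned in the past (an "archived_snapshots" object holding a "closest" entry). Both could change, and the function names here are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the Wayback availability lookup; verify before relying on it.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url, timestamp=None):
    """Build the lookup URL. timestamp is an optional YYYYMMDD-style string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return (snapshot_url, timestamp) from a parsed response, or None.

    None means no capture was reported, which is exactly the gap
    described above: not every site is archived for every date.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def lookup(page_url, timestamp=None):
    """Query the endpoint over the network and return the closest snapshot."""
    with urlopen(build_query(page_url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Calling lookup("example.com", "19960101") asks for the capture nearest to January 1996. The capture that comes back may be from a very different date, so the returned timestamp is worth checking before citing the page as it appeared "at that time."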

Parsing Info, Proprietary Tech and Future-Proofing

While the Wayback Machine’s records may not be 100 percent exhaustive, there is still a lot of content to parse through. Storing so much data is one challenge; being able to access what’s been archived poses several others.

This was reminiscent of a point Simmons made about what it would be like to try to find a specific Facebook post years in the future. “Searching is an art,” she said. “If you have a lot of stuff to look through, you got to be able to — well it’s like walking into RIT Archives and saying, ‘Do you have anything on the history of RIT?’ Well, yes we have a couple things.”

Both she and Carroll also conveyed how proprietary media formats can be a hindrance. Some companies rely on technology unique to them and then go out of business, leaving their formats unsupported (e.g. many forgotten video game consoles).

“Open source software is one way to try to avoid that, because it can be shared and modified widely,” noted Carroll.

There’s also the fact that file types will get replaced in time, and the need to ensure what's preserved maintains its quality or fidelity. JPEG images, for example, use lossy compression, so they lose a little image quality every time they’re edited and re-saved. File types that are considered good for future-proofing are:

Tagged Image File Format (TIFF) for images.

WAV for audio recordings.

According to Simmons, the jury is still out on the ideal format for video.
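The difference between a lossy and a lossless format can be checked directly. The sketch below is an illustration, not an archival workflow: it uses Python's standard wave module to write uncompressed 16-bit audio to an in-memory WAV file, read it back, and repeat, confirming that the samples stay bit-for-bit identical across saves. The test tone and the function names are made up for the example.

```python
import io
import math
import struct
import wave


def make_tone(n_frames=1000, rate=44100, freq=440.0):
    """Return a short sine tone as raw 16-bit mono PCM bytes (illustrative data)."""
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n_frames)
    ]
    return struct.pack("<%dh" % n_frames, *samples)


def wav_round_trip(pcm, rate=44100):
    """Save PCM bytes as a WAV file in memory, then read the samples back."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)   # mono
        writer.setsampwidth(2)   # 2 bytes per sample = 16-bit
        writer.setframerate(rate)
        writer.writeframes(pcm)
    buf.seek(0)
    with wave.open(buf, "rb") as reader:
        return reader.readframes(reader.getnframes())


def survives_generations(pcm, generations=10):
    """Re-save the audio repeatedly; True if the result is still identical."""
    current = pcm
    for _ in range(generations):
        current = wav_round_trip(current)
    return current == pcm
```

Running survives_generations(make_tone()) re-saves the tone ten times and compares the result with the original bytes. A lossy format put through the same loop would generally not pass this comparison, which is the reason uncompressed formats are favored for preservation copies.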

Capturing the Identity of RIT

As an institution of higher learning, RIT itself is concerned with the preservation and accessibility of its own digital media. Simmons’ responsibilities include acquiring materials from across the school, preserving them and making them accessible.

One might be wondering what would really be lost if some content, like a tweet from University News, wasn't preserved. Simmons explained why, in the case of her work, there’s value in saving it.

“Well because they say something about the history of RIT, what was important on a particular day,” she noted. “Sort of like a social history of the university.”

In that vein, Simmons' goals include capturing more of student life and better archiving the feel of RIT, not just documenting the administration’s point of view. She specifically named student-generated video as one form she’d love to archive more of.

Currently, RIT employs "Archive-It" (an application built by the aforementioned Internet Archive) to crawl RIT's entire website, a few social media channels and the pages of certain student organizations for cataloging.

The fact that sizeable portions of our identities now exist only in digital form shows in the materials Simmons receives.

“When people donate their photographs, they don’t give me a [physical] photograph, they give me a digital file,” she said. “Who keeps paper?”

However, a part of her wonders whether, because of how easy it is to "toss things away" in a digital context, the proportion of what’s saved for posterity is diminishing. “I always tell people, ‘Okay stick it in a file cabinet, to throw it out takes an effort,’” she stated. “But how easy is it to hit delete?”