I’m Going To Need A Bigger Cup Of Coffee

The sources available to historians jump exponentially for the post-1945 era. The rise of typewriters, copy machines, computers, and printers created a blizzard of paper that shows no sign of ending. Add to that all the electronic files, email, and the like, not to mention oral history recordings, and historians studying the years after World War II might be forgiven for having a thousand-yard stare and powerful bifocals. Google (which I am using as a generic word for search and indexing of all types; there goes the trademark) has helped some, but has its own problems.

Now comes the flood of video. The Air Force, the linked article notes, collects 6 petabytes (which is technical language for “Holy sh#$%$#%, that’s a lot of data”) of high-definition video per day. Such video could be remarkably useful for military historians (want to watch a combat engagement in real time?) but wading through it will be the work of generations.

12 comments

Any idea how much of the video is actually saved? At one point, I wanted to chase changing gender and economic patterns by looking at motor vehicle registrations over time. The idea was that with the rise of wage work, vehicle registrations by Navajo men would have increased while registrations by Navajo women held steady or decreased. It was a great idea with only one problem: the registrations were destroyed by the state after a year.

I tried to do it via truck sales, but old man Gurley of Gurley Ford in Gallup, NM wasn’t letting any East Coast white boy into his files since the lawyers from DNA raked him over the coals for unfair business practices back in the 70s.

@WD: It’s a really good question. I don’t know, I’m afraid. 6 petabytes a day * 365 works out to ~2 exabytes a year, which is quite a bit (math not guaranteed). That’s so much I’m reduced to British-style understatement.

That’s a lot of storage. Like, a million or so DVD-ROMs (each at a few gigabytes) per day. And while I’m sure the Pentagon could afford that much, where are they going to put them all?
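For anyone who wants to check the back-of-the-envelope figures in the two comments above, here is a minimal sketch. It assumes decimal units (1 petabyte = 10^15 bytes) and a 4.7 GB single-layer DVD; neither assumption comes from the linked article, and the variable names are purely illustrative.

```python
# Rough check of the storage arithmetic in the comments above.
# Assumptions (not from the article): decimal units and a 4.7 GB single-layer DVD.
GB = 10**9
PB = 10**15
EB = 10**18

daily_bytes = 6 * PB                    # 6 petabytes of video per day
yearly_exabytes = daily_bytes * 365 / EB

DVD_CAPACITY_BYTES = 4.7 * GB           # assumed disc capacity
dvds_per_day = daily_bytes / DVD_CAPACITY_BYTES

print(f"{yearly_exabytes:.2f} exabytes per year")
print(f"{dvds_per_day:,.0f} DVDs per day")
```

Under those assumptions the yearly total comes out a little over 2 exabytes and the daily disc count a little over a million, so both estimates hold up as rough figures.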

wading through it will be the work of generations.

Safe to say that video search engines will get a lot better in twenty years’ time. “Find me every bit of video footage of platoon-level or larger actions in Zabul province from 2008 to 2012 in which IEDs were not involved” should be doable – as long as the video’s in searchable format.

I think that measuring the video in bytes is somewhat misleading, aside from thinking about the current inconvenience of storing or working with it. Unless you’re trying to do super-detailed analysis of every book on a bookshelf in the background or of the precise cloud pattern in the sky on the day a video was shot, most of that data is uninteresting. A better measure is how long the footage takes to analyze, and the actual runtime of the video is probably a reasonable proxy for that. It can take an awful long time to read a megabyte of text, after all.

By “searchable format” I mean “accessible”; i.e. it’s not on a big stack of DVD-ROMs in a crate in a climate-controlled bunker in Dayton, OH, but is actually in a place where your computer can get at it and run it.

By “searchable format” I mean “accessible”; i.e. it’s not on a big stack of DVD-ROMs in a crate in a climate-controlled bunker in Dayton, OH, but is actually in a place where your computer can get at it and run it

Doesn’t really change my answer; a lot of it is going to be in forms like that, or, worse, in formats that have gone away.

I responded to a version of the post at the link that PM posted, above.

A longer response:

1. Just like paper sources now, a lot of the new data is not going to be in a form amenable to the kind of software and quantitative analysis posited.

2. Mass quantities of the statistical analysis held up as a model are, in fact, complete crap, undermined by unreliable data, terribly unskeptical analysis, the mistaking of correlation for causation, and the elevation of weak and insignificant statistical relationships to things deeply meaningful. A current example of this is Steven Pinker’s book on the decline of violence, which takes ancient sources about numbers of deaths at their word and spins a quite impressive fantasy out of it.

3. But, in any case, so what? The distinction made (between historians as “qualies” and social scientists as “quants”) is largely non-existent on a scholarly level. Historians have long used statistics and statistical analysis in their work. It’s one more tool in the quiver. The historians of American voting behavior in the 19th century would be quite surprised to find out that they were resistant to statistics. Is history behind other disciplines (like political science) in adopting stats? Sure, but that’s because historians are (at least partly) worried about both source problems (what was the GDP of ancient Greece? Heck, what was the GDP of the Confederacy?) and an allied sense that the data that does exist is incomplete and skewed, especially for pre-1945.

In fact, the distinction between “qualies” and “quanties” is exactly the kind of categorization that hinders analysis, rather than helps it. Both approaches are tools and both are often required on the same topic. Avoiding one or the other simply reduces the number of analytical methods available.

(UPDATE: note also that two things are being conflated here: the ability of software to wade through raw data and put it in usable form, and then the analysis of that data. The latter relies on the former. I’m skeptical of the ability of computers to handle the former effectively and will be until I get my thoroughly OCR’d texts of pre-20th century sources and my perfectly accurate transcripts of oral history recordings.)

The question of how to access video captured by technology that is not searchable, no longer standard, or no longer even extant is a good one. There’s a team at the NASA Ames Research Center (Moffett Field) that has spent years re-assembling high-definition imagery from Apollo from the originals (magnetic tapes or film, I think). They had to start by reverse engineering some of the technology, because it had been replaced 30 years ago and the original recordings were never converted to whatever the replacement technology was (videotapes, I guess), much less to digital imagery.

I should note that the irony of this is that I cite Silbey’s work extensively in my dissertation, which is not all that high-tech.

I’ve revised and extended my remarks elsewhere but I want to underscore that my fear is precisely that data mining will supplant theoretically informed research. I disagree on some of the points about how amenable much of the data being generated will be to quantitative analysis—Google has proven that given sufficient resources and access it is possible to digitize, essentially, everything—but that’s not the crux of my argument.