Menu

Category Archives: Visualization

Tucked away in a presentation on the HathiTrust Digital Library are some fascinating visualizations of libraries by John Wilkin, the Executive Director of HathiTrust and an Associate University Librarian at the University of Michigan. Although I’ve been following the progress of HathiTrust closely, I missed these charts, and I want to highlight them as a novel method for revealing a libraryfingerprint or signature using shared metadata.

With access to the catalogs of HathiTrust member libraries, Wilkin ran some comparisons of book holdings. His ingenious idea was not only to count how many libraries held each particular work, but to create a visualization of each member library based on how widely each book in its collection is held by other libraries.

In Wilkin’s graphs for each library, the X axis is the number of libraries containing a book (including the library the visualization represents), and the Y axis is the number of books. That is, it contains columns of books from 1 (the member library is the only one with a particular book) to 41 (every library in HathiTrust has a physical copy of a book). Let’s look at an example:

Reading the chart from left to right, the University of Illinois at Urbana-Champaign library has a small number of books that it alone holds (~1,000), around 25,000 that only one other library has (the “2” column), 36,000 that two other libraries have, etc.

What’s fascinating is that the overall curvature of a graph tells us a great deal about a particular library.

There are three basic types of libraries we can speak of using this visualization technique. First, there are left-leaning libraries, which have a high number of books that do not exist in many other libraries. These libraries have spent considerable effort and resources acquiring rare volumes. For example, Harvard, which has hundreds of thousands of books that only a handful of other libraries also have:

On the other side, there are right-leaning libraries, which consist mostly of books that are nearly universally held by other libraries. These libraries generally carry only the most circulated volumes, books that are expected to be found in any academic research library. For instance, Lafayette College:

Finally, there are rounded libraries, which don’t have many popular books or many rare books, but mostly works that an average number of similar libraries have. These libraries roughly echo their cohort (in this case, large university research libraries in the United States). They could be called—my apologies—well-rounded in their collecting, likely acquiring many scholarly monographs while still remaining selective rather than comprehensive. For instance, Northwestern University:

Of course, the library curve is often highly correlated with the host institution’s age, since older universities are more likely to have rare old books or unusual (e.g., local or regional) books. This correlation is apparent in this sequence of graphs of the University of California schools, from oldest to newest:

Beyond the three basic types, there are interesting anomalies as well. The University of Virginia is, unsurprisingly, a left-leaning library, but not quite as a left-leaning as I would have expected:

Cornell is also left-leaning, but also clearly has a large, idiosyncratic collection containing works that no other library has—note the spike at position “1”:

Moreover, one could imagine using Wilkin Graphs (I’m going to go ahead and name it that to give John full credit) to analyze the relative composition of other kinds of libraries. For instance, LibraryThing has a project called Legacy Libraries, containing the records of personal libraries of famous historical figures such as Thomas Jefferson. A researcher could create Wilkin Graphs for Jefferson and other American founders (in relation to each other), or among intellectuals from the Enlightenment.

Clarification: John Wilkin comments below that the reason for the spike in position 1 in the Cornell Wilkin Profile is that Cornell had a digitization program that added many unique materials to HathiTrust. This made me realize, with some help from Stanford Library’s Chris Bourg and Penn State’s Mike Furlough that the numbers here are only for the shared HathiTrust collection (although that collection is very large—millions of items). Nevertheless, the general profile shapes should hold for more comprehensive datasets, although likely with occasional left and right shifts for certain libraries depending on additional unique book collections that have not been digitized. (That may explain the University of Virginia Wilkin Profile.) Note also that Google influenced the numbers here, since many of the scanned books come from the Google Books (née Google Library) project, introducing some selection bias which is only now being corrected—or worsened?—by individual institutional digitization initiatives, like Cornell’s.

One of my favorite Woody Allen quips from his tragically short period as a stand-up comic is the punch line to his hyperbolic story about taking a speed-reading course and then digesting all of War and Peace in twenty minutes. The audience begins to giggle at the silliness of reading Tolstoy’s massive tome in a brief sitting. Allen then kills them with his summary of the book: “It’s about Russia.” The joke came to mind recently as I read the self-congratulatory blog post by IBM’s Many Eyes visualization project, applauding their first month on the web. (And I’m feeling a little embarrassed by my post on the one-year anniversary of this blog.) The Many Eyes researchers point to successes such as this groundbreaking visualization of the New Testament:

News flash: Jesus is a big deal in the New Testament. Even exploring the “network” of figures who are “mentioned together” (ostensibly the point of this visualization) doesn’t provide the kind of insight that even a first-year student in theology could provide over coffee. I have been slow to appreciate the power of textual visualization—in large part because I’ve seen far too many visualizations like this one, that merely use computational methods to reveal the obvious in fancy ways.

I’ve been doing some research on visualizations of texts recently for my next book (on digital scholarship), and trying to get over this aversion to visualizations. But when I see visualizations like this one, the lesson is clear: Make sure your visualizations expose something new, hidden, non-obvious.

I gave a talk a couple of days ago at the annual meeting of the Society for American Archivists (to a great audience—many thanks to those who were there and asked such terrific questions) in which I showed how researchers in the future will be able to intelligently search, data mine, and map digital collections. As an example, I presented some preliminary work I’ve done on our September 11 Digital Archive combining text analysis with geocoding to produce overlays on Google Earth that show what people were thinking or doing on 9/11 in different parts of the United States. I promised a follow-up article in this space for those who wanted to learn how I was able to do this. The method provides an overarching view of patterns in a large collection (in the case of the September 11 Digital Archive, tens of thousands of stories), which can then be prospected further to answer research questions. Let’s start with the end product: two maps (a wide view and a detail) of those who were watching CNN on 9/11 (based on a text analysis of our stories database, and colored blue) and those who prayed on 9/11 (colored red).

By panning and zooming, you can see some interesting patterns. Some of these patterns may be obvious to us, but a future researcher with little knowledge of our present could find out easily (without reading thousands of stories) that prayer was more common in rural areas of the U.S. in our time, and that there was especially a dichotomy between the very religious suburbs (or really, the exurbs) of cities like Dallas and the mostly urban CNN-watchers. (I’ll present more surprising data in this space as we approach the fifth anniversary of 9/11.)

OK, here’s how to replicate this. First, a caveat. Since I have direct access to the September 11 Digital Archive database, as well as the ability to run server-to-server data exchanges with Google and Yahoo (through their API programs), I was able to put together a method that may not be possible for some of you without some programming skills and direct access to similar databases. For those in this blog’s audience who do have that capacity, here’s the quick, geeky version: using regular expressions, form an SQL query into the database you are researching to find matching documents; select geographical information (either from the metadata, or, if you are dealing with raw documents, pull identifying data from the main text by matching, say, 5-digit numbers for zip codes); put these matches into an array, and then iterate through the array to send each location to either Yahoo’s or Google’s geocoding service via their maps API; take the latitude and longitude from the result set from Yahoo or Google and add these to your array; iterate again through the array to create a KML (Keynote Markup Language) file by wrapping each field with the appropriate KML tag.

For everyone else, here’s the simplest method I could find for reproducing the maps I created. We’re going to use a web-based front end for Yahoo’s geocoding API, Phillip Holmstrand’s very good free service, and then modify the results a bit to make them a little more appropriate for scholarly research.

First of all, you need to put together a spreadsheet in Excel (or Access or any other spreadsheet program; you can also just create a basic text document with columns and tabs between fields so it looks like a spreadsheet). Hopefully you will not be doing this manually; if you can get a tab-delimited text export from the collection you wish to research, that would be ideal. One or more columns should identify the location of the matching document. Make separate columns for street address, city, state/province, and zip codes (if you only have one or a few of these, that’s totally fine). If you have a distinct URL for each document (e.g., a letter or photograph), put that in another column; same for other information such as a caption or description and the title of the document (again, if any). You don’t need these non-location columns; the only reason to include them is if you wish to click on a dot on Google Earth and bring up the corresponding document in your web browser (for closer reading or viewing).

Be sure to title each column, i.e., use text in the topmost cell with specific titles for the columns, with no spaces. I recommend “street_address,” “city,” “state,” zip_code,” “title,” “description,” and “url” (again, you may only have one or more of these; for the CNN example I used only the zip codes). Once you’re done with the spreadsheet, save it as a tab-delimited text file by using that option in Excel (or Access or whatever) under the menu item “Save as…”

Now open that new file in a text editor like Notepad on the PC or Textedit on the Mac (or BBEdit or anything else other than a word processor, since Word, e.g., will reformat the text). Make sure that it still looks roughly like a spreadsheet, with the title of the columns at the top and each column separated by some space. Use “Select all” from the “Edit” menu and then “Copy.”

Now open your web browser and go to Phillip Holmstrand’s geocoding website and go through the steps. “Step #1” should have “tab delimited” selected. Paste your columned text into the big box in “Step #2” (you will need to highlight the example text that’s already there and delete it before pasting so that you don’t mingle your data with the example). Click “Validate Source” in “Step #3.” If you’ve done everything right thus far, you will get a green message saying “validated.”

In “Step #4” you will need to match up the titles of your columns with the fields that Yahoo accepts, such as address, zip code, and URL. Phillip’s site is very smart and so will try to do this automatically for you, but you may need to be sure that it has done the matching correctly (if you use the column titles I suggest, it should work perfectly). Remember, you don’t need to select each one of these parameters if you don’t have a column for every one. Just leave them blank.

Click “Run Geocoder” in “Step #5” and watch as the latitudes and longitudes appear in the box in “Step #6.” Wait until the process is totally done. Phillip’s site will then map the first 100 points on a built-in Yahoo map, but we are going to take our data with us and modify it a bit. Select “Download to Google Earth (KML) File” at the bottom of “Step #6.” Remember where you save the file. The default name for that file will be “BatchGeocode.kml”. Feel free to change the name, but be sure to keep “.kml” at the end.

While Phillip’s site takes care of a lot of steps for you, if you try right away to open the KML file in Google Earth you will notice that all of the points are blazing white. This is fine for some uses (show me where the closest Starbucks is right now!), but scholarly research requires the ability to compare different KML files (e.g., between CNN viewers and those who prayed). So we need to implement different colors for distinct datasets.

Open your KML file in a text editor like Notepad or Textedit. Don’t worry if you don’t know XML or HTML (if you do know these languages, you will feel a bit more comfortable). Right near the top of the document, there will be a section that looks like this:

To color the dots that this file produces on Google Earth, we need to add a set of “color tags” between <IconStyle> and <scale>. Using your text editor, insert “<color></color>” at that point. Now you should have a section that looks like this:

We’re almost done, but unfortunately things get a little more technical. Google uses what’s called an ABRG value for defining colors in Google Earth files. ABRG stands for “alpha, blue, green, red.” In other words, you will have to tell the program how much blue, green, and red you want in the color, plus the alpha value, which determines how opaque or transparent the dot is. Alas, each of these four parts must be expressed in a two-digit hexidecimal format ranging from “00” (no amount) to “ff” (full amount). Combining each of these two-digit values gives you the necessary full string of eight characters. (I know, I know—why not just <color>red</color>? Don’t ask.) Anyhow, a fully opaque red dot would be <color>ff00ff00</color>, since that value has full (“ff”) opacity and full (“ff”) red value (opacity being the first and second places of the eight characters and red being the fifth and sixth places of the eight characters). Welcome to the joyous world of ABRG.

Let me save you some time. I like to use 50% opacity so I can see through dots. That helps give a sense of mass when dots are close to or on top of each other, as is often the case in cities. (You can also vary the size of the dots, but let’s wait for another day on that one.) So: semi-transparent red is “7f00ff00”; semi-transparent blue is “7fff0000”; semi-transparent green is “7f0000ff”; semi-transparent yellow is “7f00ffff”. (No, green and red don’t make yellow, but they do in this case. Don’t ask.) So for blue dots that you can see through, as in the CNN example, the final code should have “7fff0000″ inserted between <color> and </color>, resulting in:

When you’ve inserted your color choice, save the KML document in your text editor and run the Google Earth application. From within that application, choose “Open…” from the “File” menu and select the KML file you just edited. Google Earth will load the data and you will see colored dots on your map. To compare two datasets, as I did with prayer and CNN viewership, simply open more than one KML file. You can toggle each set of dots on and off by clicking the checkboxes next to their filenames in the middle section of the panel on the left. Zoom and pan, add other datasets (such as population statistics), add a third or fourth KML file. Forget about all the tech stuff and begin your research.

[For those who just want to try out using a KML file for research in Google Earth, here are a few from the September 11 Digital Archive. Right-click (or control-click on a Mac) to save the files to your computer, then open them within Google Earth, which you can download from here. These are files mapping the locations of: those who watched CNN; those who watched Fox News (far fewer than CNN since Fox News was just getting off the ground, but already showing a much more rural audience compared to CNN); and those who prayed on 9/11.]