Saving Climate Data (Part 2)

A few days ago, many scientists, librarians, archivists, computer geeks and environmental activists started to make backups of US government environmental databases. We’re trying to beat the January 20th deadline just in case.

Backing up data is always a good thing, so there’s no point in arguing about politics or the likelihood that these backups are needed. The present situation is just a nice reason to hurry up and do some things we should have been doing anyway.

You can nominate a dataset for rescue, claim a dataset to rescue, let everyone know about a data rescue event, or help in some other way (which you must specify). There’s also other useful information on this page, which was set up by Nick Santos.

However, the people organizing the project are currently overwhelmed with offers of help! People worldwide are taking action in a decentralized way. So, everything is a bit chaotic, and nobody has an overall view of what’s going on.

I can’t overstate this: if you think that ‘they’ have a plan and ‘they’ know what’s going on, you’re wrong. ‘They’ is us. Act accordingly.

Here’s a list of news articles, a list of ‘data rescue events’ where people get together with lots of computers and do stuff, and a bit about archives and archivists.

• DataRescuePenn Data Harvesting, Friday–Saturday 13–14 January 2017. Location: not determined yet, probably somewhere at the University of Pennsylvania, Philadelphia.

I hear there will also be events in New York City and Los Angeles, but I don’t know details. If you do, please tell me!

Archives and archivists

Today I helped catalyze a phone conversation between Bethany Wiggin, who heads the PPEHLab, and Nancy Beaumont, head of the Society of American Archivists. Digital archivists have a lot of expertise in saving information, so their skills are crucial here. Big wads of disorganized data are not very useful.

In this conversation I learned that some people are already in contact with the Internet Archive. This archive always tries to save US government websites and databases at the end of each presidential term. Their efforts are not limited to environmental data, and they save not only webpages but entire databases, e.g. data in ftp sites. You can nominate sites to be saved here:

19 Responses to Saving Climate Data (Part 2)

This blog page is to help organize the Azimuth Environmental Data Backup Project, or Azimuth Backup Project for short. This is part of the larger but decentralized, frantic and somewhat disorganized project explained here:

Just a general piece of advice: Backups are only useful if they are accessible themselves, and tested/verified. So a site that lists “what has already been backed up” might be useful for getting started, but once it’s “full”, what it needs to list for each thing that “has been backed up” is “how can a third party verify that it’s been backed up correctly, and get/use the data themselves”?

For example, if we were talking about a single file (a gross oversimplification), what is its url, length, format, and a hash of its content? Then anyone can verify they can access the backup, by doing so. If there are two backups, anyone can compare them for equality.

For a directory of many files, as long as some general info is given, anyone can verify accessibility of randomly chosen files within it, and compare the same file from two purported backups.
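The single-file check described above can be sketched in a few lines of Python. This is just an illustration of the idea, not part of any official tooling; the function names are my own:

```python
import hashlib

def verify_file(path, expected_length, expected_sha256):
    """Check a local copy of a backed-up file against its published record
    (URL aside, the record here is just a length and a SHA-256 hash)."""
    h = hashlib.sha256()
    length = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            length += len(chunk)
            h.update(chunk)
    return length == expected_length and h.hexdigest() == expected_sha256

def backups_agree(path_a, path_b):
    """Two purported backups of the same file should hash identically."""
    def digest(p):
        h = hashlib.sha256()
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()
    return digest(path_a) == digest(path_b)
```

For a directory, one would publish such a record per file and spot-check random entries, as suggested above.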

For a “database” it’s more complicated, but therefore this verification is all the more needed. (A backup by an amateur is especially likely to be incomplete or incorrect without them realizing that.)

Obviously the experts already know all this. I’m pointing it out just so you can easily notice whether a site that “lists which backups have been done” is missing this crucial info or not.

Digital archivists and others are getting on board, so I hope the issues you raise will be dealt with correctly.

By the way, the old list of ecological databases to be backed up is now topped with a huge message saying that less information is available there than before. I no longer see any information about whether each database has been backed up (and how many times, etc.). I don’t know where that information can be found.

You sound like the economist who, seeing a $20 bill lying in the gutter, didn’t pick it up: if it were real, someone would already have done so.

What you’re surprised hasn’t happened already is exactly what’s happening now. Many people are, as we speak, “just” downloading everything and letting others know they’ve done it. But “just” is not quite the right word when you’re trying to back up at least 90 databases, ranging from hundreds of gigabytes to hundreds of terabytes in size.

One big job is just finding all the databases. There is no perfect list of all US federal government environmental databases. The best list I know is the one that people are throwing together now (click the tiny word “Datasets” in blue on top).

This seems like a natural fit for truly decentralized technologies such as IPFS and blockchains, which provide a framework for censorship-resistant, tamper-proof applications. I am a developer in this field, so if you want to know more, AMA.

This is a copy of my comment to the Saving Climate Data (Part 1) post, just in case someone is reading here and didn’t see it there.

While the blockchain as a technology can be used to provide an audit log for the data, I would still consider it a technology, not the technology, until it has developed and matured enough not to require the complete chain all the way back to the genesis block.

A sufficient level of trust does not require a mathematical proof of authenticity all the way back to the inception of any particular system.

Having reasonable guarantees distributed widely enough in a decentralized fashion is enough. Future derivatives of blockchain technology and the like should rely on sufficiently many verifiable sources, instead of requiring an absolute chain to an absolute origin and everyone keeping a perfect record of all history up to the current point in time.

A mature system will allow for imperfect record and still work. Note that this can already be achieved to a reasonable level with message digests, digital signatures and timestamps alone, when widely deployed. Having a blockchain doesn’t hurt, but shouldn’t be considered a necessary part of the publication process. We can do 1-3 in any case, with and/or without 4.
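As a rough sketch of that point, here is what a minimal “publication record” might look like using only message digests and timestamps from the Python standard library. A real deployment would add a digital signature (e.g. via a cryptography library), which is omitted here; the dataset name and record format are made up for illustration:

```python
import hashlib
import time

def publish_record(name, payload):
    """A minimal publication record: a digest plus a timestamp.
    No chain back to a genesis block is needed to make this checkable."""
    return {
        "dataset": name,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "timestamp": time.time(),  # when this snapshot was recorded
    }

def verify_record(record, payload):
    """Anyone holding the payload can check it against the published record."""
    return hashlib.sha256(payload).hexdigest() == record["sha256"]
```

If many independent parties publish such records, cross-checking them gives the kind of widely distributed guarantee described above, without any single absolute origin.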

which looks like it overlaps the former to some degree, but also offers datasets that are not mentioned there. I did not find a corresponding FTP site, although there may be some lower down in the hierarchy of datasets.

Do we trust the original spreadsheet? Or try to go after these? The nice thing about FTP access is there is no question of needing to understand the Web site structure in order to replicate files. Data can just be downloaded.
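To illustrate why FTP access makes replication so easy: a recursive mirror needs no knowledge of the site’s structure at all. A sketch using Python’s standard ftplib (the host and paths in the usage comment are placeholders; the directory-detection trick is a common ftplib idiom, since FTP listings vary between servers):

```python
import os
from ftplib import FTP, error_perm

def local_path(root, remote_path):
    """Map a remote FTP path onto a location under the local mirror root."""
    return os.path.join(root, remote_path.lstrip("/"))

def mirror(ftp, remote_dir, root):
    """Recursively download remote_dir into root. Each entry is probed with
    cwd(): it succeeds for directories and raises error_perm for files."""
    for name in ftp.nlst(remote_dir):
        try:
            ftp.cwd(name)           # succeeds only for directories
            mirror(ftp, name, root)
        except error_perm:          # not a directory, so fetch the file
            dest = local_path(root, name)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with open(dest, "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)

# Usage (placeholder host):
# ftp = FTP("ftp.example.gov"); ftp.login()
# mirror(ftp, "/pub/data", "backup")
```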

The original spreadsheet was randomly thrown together in haste by a bunch of people who aren’t much wiser than you and me. It’s probably missing a bunch of good sites, and it also includes some sites that probably shouldn’t be there, since they aren’t .gov. When Sakari asked the person running the University of Idaho Gridded Surface Meteorological Data site whether he could back that site up, that person replied:

I am not as worried about my datasets as they are on servers hosted by US universities, rather than the federal gov’t.

So, we should use our initiative in identifying and backing up .gov databases that contain important climate/ecological information and aren’t on the original spreadsheet!

In short: go for it! It sounds like you found some good stuff. You can add it to the original spreadsheet, but also just start downloading.

Also, please log the sites you’ve already backed up, so we can know what we’ve done! You can do it here:

Sorry, I hadn’t figured out how to see the entries on azimuth-inventory-wiki. I’m still not sure I understand the best way to see a list of all databases that we have backed up. Right now the only thing I know is to go to

“First, we need to identify vulnerable programs and then seed their URLs to the webcrawler of the End of Term project, which will make copies of those webpages. Second, we are researching and evaluating the many data repositories that the EPA has online: some of this data we know will be backed up and protected by laws, some data will be archivable at the Internet Archive through their webcrawler, and yet other sources of data will need to be identified as in need of saving at a library. Libraries, such as at the University of Pennsylvania, are arranging to become repositories of this kind of vulnerable data not easily preserved. We will be passing on what we build and research to our colleagues in other cities so that they can pick up where we have left off.”

I have noted it elsewhere, but there’s a nice discussion of what a Trump administration could and could not do. Providing additional motivation and urgency to the Backup Project, they cite the case of a conservative Canadian PM who literally threw out many government reports from Environment Canada.

How To Write Math Here:

You need the word 'latex' right after the first dollar sign, and it needs a space after it. Double dollar signs don't work, and other limitations apply, some described here. You can't preview comments here, but I'm happy to fix errors.
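For example, typing the following line in a comment (the formula itself is just an illustration) produces a rendered equation:

```latex
$latex e^{i \pi} + 1 = 0$
```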