Dear Internet, we need better image archives

Dear Internet,

You know what should be really easy to find online? Good quality, Public Domain vintage illustrations. You know, things like this:

I found this on Flickr, where someone claims full copyright on it. That’s copyfraud, but understandable because Flickr’s default license is full copyright (all the more reason to ignore copyright notices!). But copyfraud is not the main problem. The main problem is that images like this are painfully difficult to find online, especially at high resolutions (and this image is only available at medium resolution – up to 604 pixels high, which is barely usable for most purposes but higher than much of what you find online).

The images are out there – and with zillions of antique books being scanned, their vintage illustrations are being scanned right along with them. But the images are buried in the text, and often the scan quality is poor. Images should be scanned at high quality, and tagged for searchability.

Are archives ignoring the value of images?

Take the American Memory archive of the Library of Congress. Lots and lots of historical documents here, but no way for me to find an image of, say, a horse.

Most book-scanning projects focus on texts, not illustrations. Many interesting and useful illustrations are buried within these scans, uncatalogued and inaccessible. Scan quality is set for text, not illustrations, so even if one can find a choice illustration buried within, its quality is usually too low to use.

Archive.org is great (I love you, archive.org!) but does not have an image archive. Still images are not among their “Media Types” (which consist of Moving Images, Texts, Audio, Software, and Education). So I went spelunking through their texts, starting with “American Libraries,” and searched for something easy: “horse.” Surely I could find a nice usable etching of a horse in there somewhere. I eventually found “The Harness Horse” by Sir Walter Gilbey, from 1898.

Nice illustrations! Can I use them? Unfortunately, no. The book is downloadable as PDF and various e-publication formats, but when I try to extract the illustrations, I get a mess:

Copied and pasted from Adobe Acrobat. WTF?

The same image, inverted. Doesn't work.

"Save Image as..." from Acrobat. This worked, except where it didn't: part of the image is simply missing.

Clearly something is messed up here. Was it just that page? Alas, no:

This sad image from another page has the same problem.

The scans have some flaws that PDFs and Photoshop can’t cope with:

Screen grab of zoomed-in view from Acrobat. What looks like a blur in the PDF renders the image unusable when extracted.

These images are not usable, which is a pity because they are very nice illustrations. And they seem to be among the higher quality scans, which again isn’t saying much.

Let me add that it’s great these books are being scanned at all! That’s definitely better than losing them entirely. But as an artist, it saddens me that we’re neglecting this wealth of visual art. I’d like to see our rich visual history properly archived. Our bias favoring text over pictures is especially ironic considering how much more efficiently information is communicated to humans through images; “A picture is worth a thousand words,” or more. That’s why I’m a cartoonist, after all.

I was able to extract one clean image from the book, on page 48:

Unfortunately I can’t use this illustration for my purposes, but maybe someone else can. I’ve already gone through the trouble of finding it in a text, extracting it, and rotating it. If only there were some image archive I could upload it to at high resolution, so someone else could use it. I could tag it, to make it easier to find. I could include all kinds of useful metadata, like what book it was from and when it was published; but even if that was too bothersome, I could at least include tags like “horse,” “rider” and “engraving.” Wouldn’t it be nice if such an archive existed? Wikimedia Commons is close, although I dread uploading things there after having all my open-licensed comics deleted by an overzealous editor. But maybe they’re our best hope.

Continuing my searches on archive.org, I found this ostensibly Public Domain, vintage horse book with line illustrations. Unfortunately this is controlled by Google Books. It’s “free” to read online in Google’s reader, which doesn’t allow any image export. It also doesn’t allow me to zoom in.

All those illustrations, trapped at low resolution, unusable (even if they were tagged/catalogued, which they aren’t). This is our “Public Domain.” Who exactly is benefitting from having these 18th Century illustrations inaccessible to today’s artists?

Then there’s Dover Books. I loved Dover books growing up – they introduced me to the idea of the Public Domain. Dover reproduces vintage illustrations in books for artists and designers. Their paper books were reasonably priced, and you could use the illustrations for anything, without restriction. Browsing was free, so I would flip through the pages in the book store, and if it had what I needed, I’d buy it.

Dover is still selling books, but the prices are now relatively high, few are carried in bookstores, and they prohibit browsing online. You have to shell out $15 to find out if what you need is in the book, and how could you know? They seem to be clinging to an outdated copyright model, and rather than selling things of added value, they are simply blocking access to existing Public Domain works, in order to collect a toll.

What else has kept a good public archive of Public Domain images from existing? Some artists and archivists do make high quality scans of vintage illustrations – and keep them to themselves. I guess we could call this “image hoarding.” I assume the reasoning is, “I went through all the trouble to scan it, why should I share? Others can pay me if they want a copy.” Also there’s the “finders, keepers” reasoning: “anyone else is free to find the same illustration in another antique book, but I found this one, so it’s mine.” And so these images remain inaccessible, not part of any public archive.

Wikimedia Commons is the best public image archive I know of right now. A bit of searching led me to their “Engravings of Horses” category, which yielded some nice images. Unfortunately, many of these are not available at sufficiently high resolutions.

The maximum size of this image is 800 × 608 pixels, which limits its use. Limited image sizes and limited selection have been the biggest obstacles to my relying more on Wikimedia Commons; but it can get better. Maybe it will. It would be nice if something became the public vintage image archive I and so many other artists need.

34 comments to Dear Internet, we need better image archives

Great post Nina, I couldn’t agree more! I’m currently producing a documentary and I want to make use of a lot of vintage woodcuts, engravings and illustrations. High res scans of them are all readily available through commercial services like Corbis and Getty but the cost for usage in a feature length documentary is around $400-600.

I suppose they are entitled to charge for the fact that they have high quality scans (though different tiers of “licensing” for public domain work seem questionable), but I would love a central, high quality archive of work like that.

I know they got several woodcut and copperplate style horses. Actually: someone has been really fond of copperplate style b/w drawings of animals. So you can find tons of animals there.

But even better: They got a wonderful Brontosaur! Oh, you can ride that! I know that I would want to!
I always asked my Mum for one but she said they would take 200 light years plus shipping from Andromeda. Later I decided to order a teleportation device instead. But their internet shop insists on the 200 years delivery time. Bastards! I hope they include the batteries.

Also: what’s the big deal with that horse? If it really has to be THAT horse, open it in the free version of PDFXViewer and select “Export as image” and then “current page”. 2375×1450 px should be enough, don’t you think?

Okay – now you’ve done it! You tricked me into trying it myself, so I guess I may as well just upload the page for you and send you the link.
Don’t you do that again, Nina!

Aside from resolution issues, PDFs are a scourge on the digital world. They’re a proprietary format you pretty much cannot edit except by hacks. Plus, everyone seems to think they’re the lingua franca of document publishing on the net.

Someday we’re going to lose a LOT of cultural information to this silliness.

One of my favourite online image archives is the National Library of Medicine’s Images from the History of Medicine. Yes, you can search for “horse,” but also do all sorts of nifty things. There’s very good metadata accompanying each image, and images generally come in several formats and resolutions. Many images are public domain, given the subject matter.

Other sites could definitely take away some lessons from this amazing resource. No idea what kind of budget they’re working with.

@John Yes, that has already been taken care of. There is a stable standard for a PDF format that is not going to change with new Acrobat versions and is specially made for archives. It is PDF/A. I only wish archives would really use it. One of the reasons for PDF/A was that it should be possible to extract the contents of a PDF document, including images, easily.
I don’t think we will lose anything with PDF. Okay: the format is a mess. If you ever tried decoding PDF in binary (I did) you will know what I mean. But it’s a well and openly documented mess.

What really bothers me is something else: since PDF got “copy protection” features now, more and more people use them out of ignorance and more and more documents are encrypted. Especially documents like protocols – even political protocols that should not be encrypted at all.
Thus making it even more difficult to use them 20 years down the road. I don’t think these people will remember their passwords.
Yes, you can still read the document and you can crack the encryption and extract the content anyway. But we shouldn’t need to do it. People should be educated enough to understand that you must not use encryption and copy protection unless you have a REAL good reason to – esp. not because you feel like you are “in the mood” for restricting content. Because these actions have consequences: it’s like putting a book in a glass box and throwing the key away.

@N_Bloch Ah! That brought up an idea. These archives all use similar software that has a fixed installation directory and naming. One of them seems very popular.

Go to Google and search for: luna servlet.

This will bring up tons of links to free archives with historical pictures.

@Tom Thanks for the references to openclipart (with dedication to the public domain) and to Luna (with high quality pictures, but under fair use / for educational use terms).
However, as Nina Paley points out, putting restrictive terms on pictures that are in the public domain is copyfraud. There was the same issue with e-rara.ch, a digital collection of ancient books from Swiss university libraries. E-rara.ch’s terms of use originally put its content under a http://creativecommons.org/licenses/by-nc-sa/2.5/ch/deed.fr license. After the Swiss Creative Commons team pointed out that you cannot put things that are in the public domain under a CC license, they removed it from the Terms of Use. But they argued that:

What sometimes irritates us librarians is the fact that commercial vendors can freely/unashamedly draw from rich digital offers whose creation is financed by public monies. Our terms of use aim at creating a small moral barrier against that, nothing more. (1)

And this issue deserves to be addressed, even though not via copyright law.

In my opinion, and as far as I know, if you photo-copy (e.g. by scanning) existing art (= “reproduction work”) it is not subject to copyright and you may not claim any restrictions. (At least in Germany. I don’t have any info on Austria and Switzerland, even though I expect it to be the same.)

There might be a chance to claim copyright by calling it a “collection”, which is something different. But this may not apply to an archive because the “collection” must be arranged in such a way that it is a new piece of art in itself. So a collage of photos of famous paintings is protected art – while a pile of postcards is not. I do believe that an archive is the latter.

However: you may still sell the container or the (web-)service that makes the work available. So I guess it is okay if they ask for a download fee.

I too believe that they may not restrict the use of the copy – including free distribution, BUT!
I’m no lawyer and thus am not allowed to give any advice. Also, I know little about the laws in Switzerland. My personal opinion and best guess as an informed layman is that these restrictions are actually void, BUT I must suggest getting info on the subject from a professional. International copyright law can get very complicated.

You can easily download the still images from which Internet Archive PDFs were derived. There should be a link on the left side to “All Files,” right with the links to the various versions. You will see a menu, and what you want is the .zip of all the .jp2 files. It’s usually a large download, but you will then have each page in much better resolution and quality.
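For anyone who wants to script this, here is a minimal Python sketch that pulls only selected pages out of an already downloaded .jp2 zip instead of unpacking the whole archive. The function name `extract_pages` is made up for illustration, and the sketch assumes (this is not confirmed anywhere in the thread) that each file name ends in a zero-padded index such as `_0048.jp2`. Check the names inside your own zip first; the index may count scanned leaves rather than printed page numbers.

```python
import re
import zipfile
from pathlib import Path

# Assumption: page files end in an underscore, a zero-padded index, and ".jp2".
PAGE_PATTERN = re.compile(r"_(\d+)\.jp2$", re.IGNORECASE)


def extract_pages(zip_path, wanted, out_dir):
    """Extract only the .jp2 files whose trailing index is in `wanted`."""
    wanted = set(wanted)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            match = PAGE_PATTERN.search(name)
            if match and int(match.group(1)) in wanted:
                # Flatten any folder structure inside the zip.
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target.name)
    return sorted(saved)
```

With a hypothetical download named `book_jp2.zip`, calling `extract_pages("book_jp2.zip", [48], "pages")` would copy just the file indexed 48 into a `pages` folder.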

Finding quality public domain imagery is a real problem, although easier than it was 5 years ago. Project Gutenberg has a lot of imagery, but it too is hidden away and difficult to find. Of course the US government is an example source for imagery: http://www.usa.gov/Topics/Graphics.shtml

Below that, there is “All Files: HTTP”. When I clicked that, I got a list of all kinds of things – and one was indeed .jp2 zip! Now that I know what it is, I can use it. But it’s very hidden! And we still don’t have an image archive, although poring through .jp2 files and cleaning up and tagging images found therein could be a way to contribute to one.

“5. Commercial-grade Stock Images
You can get high-resolution images for print or other use of these pictures from alamy.com; for academic use or for other pictures on the Web site that are not listed under “stock images”, contact me directly using the Comment link on the Web page for the item you want.”

By “commercial grade” they mean high res, and they’re trying to control access to who gets to use high res images that should belong to the Public.

thanks for writing this article, was a good reading. because with this article others get an idea what should be done or what is really needed when digitizing stuff.
but asking “the internet” for something does not work. ask humans including yourself to do it!
years ago we were interested in some specific music, we created our own fan portal back then, started an online radio with that music etc. because there was nothing else there back then.

now i’m involved in a project that is related to your wish: the publicdomainproject.org digitizes old recordings which are public domain. we do it on our platform and not on archive.org because of the lack of a sense for quality. as you wrote in your article a lot of stuff is hidden because it is not tagged or the used format/resolution/… makes it not really usable.

there are a few projects who do it very well and spend a lot of time for the works they provide, gutenberg.org does not only scan books, they make it available as TEXT. searching, convert to other formats etc. is simple for them.

for sheetmusic there is a big market for copy-fraud publicdomain works. the imslp.org project has the goal to digitize them and make them available not only as pdf but as editable sheets and midi files. they also try to catch a lot of other information around a work, who wrote it, different versions all that kind of stuff.

for the publicdomainproject.org we try to follow a similar approach. anything we digitize will be available as high-quality lossless flac files (no mp3’s like archive.org), everything will be tagged, searchable and linked together with scans of the covers, biographies (links to other projects) and to the sheetmusic if available from imslp.org.

so, coming back to your question, why not ask for more? why not ask for public domain illustrations which are digitized and available as vector graphic (svg) files?

if no existing platform fits your needs try to start one! try to get help, try to get people involved, if the idea gets momentum your needs are also the needs of others. if not, well, don’t hesitate to spend your time on something else or start to help another project. that’s the internet, humans spending time on something they love or need.

Next I downloaded the PDF and opened it in Adobe Acrobat Reader v.10.1.1. I went to the page, rotated it 90 degrees and then used Jing again to capture a still image which you can see here: http://screencast.com/t/G6amW0xG6

Next I futzed around using Adobe Reader to make the picture as big as I could on my screen. Then I used Jing to capture this larger image and saved it to my local disk. I opened it in Gimp and found that the pixel size is 1350×934. So I zoomed into the picture where you had seen much blurring and took another snap with Jing. I think mine looks much sharper than yours. Here’s the link to it. http://screencast.com/t/harjCYlxgea1

Of course you are limited to your screen resolution when capturing from your screen. Even 1920×1080 isn’t really high enough for good print design of any size.
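To put rough numbers on that: printed size is just pixel dimensions divided by print resolution. The small Python sketch below assumes 300 dpi, a common rule of thumb for print that is not taken from this thread, and the helper name is made up for illustration.

```python
def print_size_inches(width_px, height_px, dpi=300):
    """Return (width, height) in inches when printed at the given dpi."""
    return (width_px / dpi, height_px / dpi)


# A full 1920x1080 screen capture.
screen = print_size_inches(1920, 1080)
# The 1350x934 capture described above.
capture = print_size_inches(1350, 934)

print(f"1920x1080: {screen[0]:.1f} x {screen[1]:.1f} in")
print(f"1350x934:  {capture[0]:.1f} x {capture[1]:.1f} in")
```

At that assumed resolution, even a full-screen 1920×1080 capture prints at only about 6.4 by 3.6 inches.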

I happily do what I can to clean up and share these wood engravings and other potentially useful illustrations—as well as readable (and as much as possible, searchable) texts to provide some sense of history.

As you can see, I have a “backlog” of books, most of which are filled with lovely public-domain wood engravings. I scan them all at a minimum of 700dpi grayscale (to leave room for the downsampling that happens when we clean them up and straighten them and such). If I went faster, the quality would suffer and you’d get junk like your Google robots-and-underpaid-Filipino-scanned images.

But as it happens, I’m going to die sometime in the next 50 years or so.

So we have a problem, all of us.

What I’m saying is: there is no archive, and for real reasons there never can be.

Even if there ever is a place Brewster Kahle calls an “archive”, it will turn out to be no more useful to search for anything than it is to rummage around in my basement looking for a picture of a horsie, or in Google Books for handwritten documents, or in the V&A Library’s online collection for an Indian pot with nice flowers on it, a certain shape, just so.

No archive that is not actually just a collection of piles. No matter what advances we have in AI (which I do in “real life”, by the way).

What there can be, though, is an effort, a sensibility, and if we’re all lucky perhaps a kind of expectation. A lot like the expectation we have (while the cultural capital lasts) that every little town will have a Historical Society and a Garden Club and a Genealogical Archive.

In other words: we will be the only archive you will ever get. And happily! I said that, right? It would be a pleasure to make something you’d find useful.

Now. Inasmuch as we’re crazy enough to love serving in this capacity: what would you like to do culturally to make it easier for us to serve?

BTW: another problem is photographs. Some zoos allow you to take pictures of animals only for private use. I don’t know … haven’t ever seen a big cat complain about its privacy, really.
Same for some public parks, gardens and train stations. (Yep! Train stations, airports, including the trains and all the airplanes.)
So basically any private property that is not visible from the street can be restricted.
Not to mention that theoretically you are not allowed to export pictures of bridges, power plants, important industry, power lines or railway tracks for reasons of national security. Stuff that dates back to the cold war.
Fortunately I haven’t heard of a case where somebody was actually sued or asked to delete pictures.

There was a prominent case in Germany where a company tried to restrict picture taking in the public park of Sanssouci Palace. It is public property and entrance to the park is free. However, its upkeep has been handed to a private foundation, which didn’t like photos of its “property”. The case was eventually taken to court by photographers in 2010 and the company lost.
While you are now allowed to take commercial pictures in the public park, they haven’t stopped telling you otherwise. Their website only has a new paragraph stating that you must ask for permission EXCEPT where picture taking is allowed by law.

Maybe the law should be explicit that there is no copyright on animals, plants and historic buildings where they are publicly shown. Animals and plants are not art. They should not be allowed to be subject to copyright restrictions up until 70 years after the death of their creator!

“Not to mention that theoretically you are not allowed to export pictures of bridges, power plants, important industry, power lines or railway tracks for reasons of national security. ”

Is it world war 2 again? darn, I had no idea the western ‘democratic’ world had gotten that strict. I had no problems like that in Communist China for crying out loud.
I’m glad I don’t live in the U.S anymore

Just read your post and wanted to quickly say that I extract images from Internet Archive’s reader very easily all the time: just zoom into the page you want to maximum capacity, right-click on the zoomed-in page and select “Save image as…”, and you normally have a high resolution JPEG to do whatever you like with!

I totally agree. There is a huge need for this. High-res, pd-images in open formats.
There are various archives around the net which we’ve seen in the comments to this thread, but they almost all focus on a small niche, or use weird formats or low resolutions.
I like nuess0r’s suggestion to create an archive like this, but it would require quite some resources. The images will take up a lot of space, so large disk space is needed, and we would need some highly skilled person(s) to set up and maintain a solid, fast and useful infrastructure etc.

“I totally agree. There is a huge need for this. High-res, pd-images in open formats.
There are various archives around the net which we’ve seen in the comments to this thread, but they almost all focus on a small niche, or use weird formats or low resolutions.
I like nuess0r’s suggestion to create an archive like this, but it would require quite some resources. The images will take up a lot of space, so large disk space is needed, and we would need some highly skilled person(s) to set up and maintain a solid, fast and useful infrastructure etc.”

The problem is this kind of thing really is a niche need. Still, I do have some spare time and server space. I’ll talk to the PPI and see if we can do something.

I add my voice to yonemoto’s. Please use libre software like GIMP. You can view PDF files with GIMP and modify them.

You can’t speak out for free culture, advocate free licences and use closed proprietary formats and software (like M$ or Mac). We can’t very well use an EPS file while PNG or even XCF (GIMP’s work file format) is accessible to everyone (GIMP being free as in freedom and as in free beer).

For your convenience, here are the Image files extracted from the .EPUB of The Harness Horse. .EPUB files are pretty well set up, and are just .zip files in disguise. Most decent un-archiving programs will let you browse them and pull out the images. hhorse_images.zip
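Because an .EPUB really is a zip archive, the same extraction can be scripted. Here is a minimal Python sketch using only the standard library; the function name and the list of image extensions are assumptions made for illustration, so adjust them to whatever your particular file contains.

```python
import zipfile
from pathlib import Path

# Assumption: these extensions cover the images in a typical EPUB.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".svg"}


def extract_epub_images(epub_path, out_dir):
    """Copy every image stored inside an EPUB (a zip archive) into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if Path(name).suffix.lower() in IMAGE_SUFFIXES:
                # Drop the internal folder path and keep only the file name.
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target.name)
    return sorted(saved)
```

Note that images embedded in an EPUB are often downsized for e-readers, so they may not match the resolution of the original page scans.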

Bottom line, unfortunately, if there is money to be made there will be more and more restrictions on “pictures, Prints, etc…”. Soon images of any kind will be scarce if not impossible to find, for free.

A site with potential to become a good place for old images is http://www.wikipaintings.org/
The current content is somewhat sparse and mixed in quality, but some images are of a reasonably good quality.

It is a relatively new non-profit project which “aims to create high-quality, most complete and well-structured online repository of fine art”.

1. Use the PDF on archive.org as your reference to locate the images.

2. Go to the archive.org page where you downloaded the PDF and look for the HTTPS link in the left margin. Once you have clicked it, look for the JP2 ZIP.

3. DOWNLOAD THIS FILE. This is where the images are at the highest quality level. PDFs often get converted at print quality. What this does is vectorize certain areas of the art, and that’s where those odd blotches are coming from.

4. At the end of the day you need an image. The only unusual thing is that the images are in JPEG 2000 format, which can be opened by Photoshop, most Adobe products and the FULL version of Adobe Acrobat.

5. JPEG 2000 has a reason and a history behind it that are lengthy and boring to explain. Just look for the HTTPS/torrent area and you will find between 5 and 12 different zip files to download, based on your needs. I suggest you right-click on the files and do a “Save as”; otherwise you will be burdened by the files being opened in your browser, which can be cumbersome with those enormous file sizes that are sometimes up to 1 GB for an average-size historical book.
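Since these archives can be very large, anyone scripting the download will want to stream the file to disk rather than hold it in memory. A minimal Python sketch using only the standard library; the function name is made up, and the address in the usage line is a placeholder, not a real archive.org URL.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_disk(url, dest, chunk_size=1024 * 1024):
    """Stream `url` to `dest` in chunks so a large zip never sits in memory."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(response, out, length=chunk_size)
    return dest.stat().st_size
```

Usage would look like `download_to_disk("https://example.org/path/to/book_jp2.zip", "book_jp2.zip")`, with the placeholder replaced by the link copied from the book's file listing.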