A blog about the travels of Alicia and Markus in their quest to find ...

Friday, October 16, 2015

Hacking the Internet Archive - URL's

I've been playing around with the Internet Archive again lately. I though it would be fun to share some of the things I have discovered along the way and so I intend to write up a series of posts on "Hacking the Internet Archive".

By hacking, I mean using various archive.org tools, api's, tips and techniques to do some cool things with your media. While many of these are documented, many are not and so I'll use these blog posts to share what I find.

To start out, I decided to poke around inside the HTML source code for a details page, In this case a simple image upload.

Using the browser's 'Inspect element' or 'View source' commands, we see something like this (image is example from Chrome desktop browser):

One thing to notice, is that archive.org is now using jQuery, Bootstrap and other interesting frameworks. Fun stuff! We'll come back to that in a later post.

When the detail page is for a book (mediatype: texts), there are lots of interesting things to say. I'll devote an entire post or two to this topic.

An easy thing to do in most browsers is search for the detail's identifier, preceded by a slash to find interesting url's related to the archive.org item. Here are some that I found so far:

Some of these url's, such as '/edit' and '/manage' assume that you are already signed in. They are used to edit the metadata or manage the files using the IA admin pages. More on these features later (boy! are we gunna have a lot of follow-up posts).

The item home page is located at '/details', followed by the unique id for your archive.org upload item. By default, archive.org shows single images in the 'theater' area of the page if it can find one.

If there are multiple images, archive.org inserts a bootstrap slide carousel. If the item is a 'book' (texts), archive.org inserts a book reader. Video and audio also have players associated with their details page.

BTW, here is the full list of archive.org media types as show in the media type select pull down on the archive.org advanced search page (/advancedsearch.php):

Notice that you can get a thumbnail, if it exists, using '/services/img'. Here's an example: I simply asked blogger to insert an image from a url, then typed in 'http://archive.org/services/AliciasShoes' and presto!

I wonder what other services there are?

You can access a specific file by using '/download'. This is useful for image or video 'src' attributes, 'json' data and more. Here's an example:

I have not made use of the '/compress' feature yet, but it looks handy. More to come on that one for sure.

I'm not sure what '/embed' does at this time. I think it is for embedding an iframe into another webpage, but it seems to be broken as of this writing. :(

That's the iframe above, that big white space. The main div inside the embed appears to be empty on this and other detail pages I have tried today.

Ok, that's it for the first post. Just a simple introduction to some important features. In future posts, we'll look at some cool tools for working with the Internet Archive and some ways to use them to create some interesting web pages.

1 Comments:

The primary type of media contained in the item. While an item can contain files of diverse mediatypes the value in this field defines the appearance and functionality of the item's detail page on Internet Archive. In particular, the mediatype of an item defines what sort of online viewer is available for the files contained in the item.

The mediatype metadata field recognizes a limited set of values:

audio: The majority of audio items should receive this mediatype value. Items for the Live Music Archive should instead use the etree value.collection: Denotes the item as a collection to which other collections and items can belong.data: This is the default value for mediatype. Items with a mediatype of data will be available in Internet Archive but you will not be able to browse to them. In addition there will be no online reader/player for the files.etree: Items which contain files for the Live Music Archive should have a mediatype value of etree. The Live Music Archive has very specific upload requirements. Please consult the documentation for the Live Music Archive prior to creating items for it.image: Items which predominantly consist of image files should receive a mediatype value of image. Currently these items will not available for browsing or online viewing in Internet Archive but they will require no additional changes when this mediatype receives additional support in the Archive.movies: All videos (television, features, shorts, etc.) should receive a mediatype value of movies. These items will be displayed with an online video player.software: Items with a mediatype of software are accessible to browse via Internet Archive's software collection. There is no online viewer for software but all files are available for download.web: The web mediatype value is reserved for items which contain web archive WARC files.If the mediatype value you set is not in the list above it will be saved but ignored by the system. The item will be treated as though it has a mediatype value of data.