Archive for the 'Metadata' Category

Calibre is undisputedly the number one when it comes to e-book management. It’s HUGE. It’s got a plethora of functions.

And it’s got quirks: design decisions which may not suit your workflow. Certainly a lot of them don’t suit mine.

Calibre’s own space. Every document imported into the library ends up copied into some private directory of calibre, and named according to some /Author/Title/Title scheme. The way I cope with this is to import into calibre and save to disk again.

Metadata on the filesystem. Metadata is stored not within the file but in a database, and apparently also in an OPF file next to the book. Luckily, calibre tries to put metadata into the file when saving to disk. So the solution here is the same as above.

Name like Yoda, A. When writing files, calibre names them in library sort order, with the article moved to the end. To fix this, there’s a parameter in “Preferences” -> “Tweaks” -> “Control Formatting of Title and Series when used in Templates”, called save_template_title_series_sorting, which needs to be set to strictly_alphabetic.

No such Character. There’s a set of characters Calibre does not want in file names. They are the same on all platforms. While it’s not wise to use asterisks and such on unix filesystems, because they would wreak havoc on shell processing, they would still work. The only character really not allowed is the “/”. But Calibre also replaces various ballast inherited from Windows, including desirable critters like “:” and “+”. The way to fix this is to edit /usr/lib/calibre/calibre/__init__.py and remove them from _filename_sanitize_unicode.
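To illustrate which characters are really forbidden, here is a minimal Python sketch of a sanitizer that replaces only what a Unix filesystem cannot store. The function name sanitize_unix is hypothetical and this is not calibre's code, just an illustration of the idea.

```python
def sanitize_unix(name: str, substitute: str = "_") -> str:
    """Replace only the characters a Unix filename cannot contain."""
    # "/" separates path components and NUL terminates strings in the
    # kernel API; everything else, including ":" and "+", is legal.
    forbidden = {"/", "\0"}
    cleaned = "".join(substitute if ch in forbidden else ch for ch in name)
    # "." and ".." are reserved directory entries, so never return them.
    if cleaned in {"", ".", ".."}:
        return substitute
    return cleaned
```

With this, a title such as "C++: A/B" keeps its colon and plus signs and only loses the slash.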

Publishing by the Month. Before the advent of e-books, publishing dates were by definition expressed in years. Copyright law also uses the year only. To get rid of the ridiculous month in the publishing date, go to “Preferences” -> “Tweaks” -> “Control how dates are displayed” and set gui_pubdate_display_format to yyyy.
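Calibre's tweaks are plain Python assignments, so the two settings named so far would look roughly like this in the tweak editor. The values are the ones given in the text; the comments are mine.

```python
# Sort titles and series as written instead of moving the article to the end.
save_template_title_series_sorting = 'strictly_alphabetic'

# Show the publishing date as a bare year.
gui_pubdate_display_format = 'yyyy'
```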

Not unique. As librarians know, in the absence of an ISBN, books are identified by author, title, publishing year and publisher. Now when saving PDF files, Calibre puts in neither an ISBN, nor the publishing year, nor the publisher. Apparently, this is a problem of podofo, which does not know these. Speaking of which:

podofail. Sometimes podofo also fails to write some tags. It’s not quite clear when this happens: none of my PDF files are encrypted, and exiftool can write metadata to them without problems.

Over time, I’ve written a slew of scripts to read and set metadata. These are:

epub-meta (c) — A very fast EPUB metadata-viewer based on ebook-tools’ libepub from Ely Levy.

epub-rename (perl) — A script to rename epub-files according to the EPUB’s metadata. Needs epub-meta and ebook-meta (from calibre).

exif-rename (perl) — A script to rename files according to their EXIF-tags. Tested with PDF, DJVU, M4V, OGM, MKV and MP3

exif-meta (perl) — A script to set EXIF/XMP-metatags according to the filename.

I like my metadata for files within the file. The reason is simple: a filesystem is only temporary storage for a file, and things like filenames or paths only make sense within the filesystem itself. If you move the file, the new filesystem might not support your particular metadata.

Starting with the path. For instance, /movies/Pirate/ won’t exist on other people’s machines, and it actually can’t even exist on stupid windows filesystems. So the fact that the file residing within this path is probably a pirate movie would get lost. And of course, not every filesystem supports all characters or encodes them the same way, and thus the movie “Pippi Långstrump på de sju haven” might end up with a totally garbled title on a filesystem.

Since I work on the Unix shell and on the web a lot, spaces in filenames tend to get garbled (“%20”) or interfere with commandline processing. So my filenames do not have spaces or umlauts in them, they are instead BiCapitalized. In fact, I’ve written a program bicapitalize to do just that.
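As an illustration of the naming scheme, a minimal Python sketch of BiCapitalizing a name might look like this. It is not the bicapitalize program itself; the function name and the choice to simply strip diacritics (so “å” becomes “a”, and “ü” becomes “u” rather than “ue”) are assumptions.

```python
import unicodedata

def bicapitalize(name: str) -> str:
    """Drop diacritics and spaces, capitalizing the start of every word."""
    # Decompose accented letters (å -> a + combining ring) and drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Upper-case only the first letter of each word; leave the rest untouched
    # so that acronyms and file extensions survive.
    return "".join(word[:1].upper() + word[1:] for word in plain.split())
```

Applied to “Pippi Långstrump på de sju haven”, this yields PippiLangstrumpPaDeSjuHaven.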

Enter Matroshka

When it comes to metadata, the one container format that can just about contain everything is Matroshka. MP4 would be a possibility, but it’s rather constricted in its use of subtitles, codecs and audio tracks or even cover images. Also, matroshka looks much less “designed by committee” than MP4 does, and is generally better supported by open source software. Not quite enough, as we’ll see...

This only changes the container, it won’t recode anything. It usually works with avi, mp4, mpeg, flv, ogm and more, but not with wmv.

You’ll notice the program respacefilter, which I’ve written to display BiCapitalized filenames as strings containing spaces. And if you’ve got some experience with the unix shell, you’ll also notice the above commandline will fail for files containing spaces. That’s exactly the reason why spaces in filenames are bad.
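The idea behind displaying BiCapitalized names with spaces can be sketched in a few lines of Python. This is not respacefilter itself; the function names and the two splitting rules are assumptions for illustration.

```python
import re
from pathlib import Path

def respace(name: str) -> str:
    """Turn a BiCapitalized string back into words separated by spaces."""
    # Insert a space wherever a lower-case letter or digit meets a capital.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Also split an acronym from a following capitalized word.
    return re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", spaced)

def title_from_filename(path: str) -> str:
    """Derive a probably meaningful title from a BiCapitalized filename."""
    return respace(Path(path).stem)
```

For example, title_from_filename("/movies/TheThirdMan.mkv") gives “The Third Man”.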

The above command also sets the “Title” tag to something probably meaningful, and the language of the first audio track to english. You can change the latter later on with: mkvpropedit --edit track:a1 --set language=ger target.mkv

If the title is screwed, you could set it with: mkvpropedit --edit info --set title="This Movie" target.mkv

Of course, if you already do have Matroshka files, and their title tags are not set or wrong, you might not want to set all titles by hand. I’ve also written a script called titlemkv to fix this. It can also fix some drawn out all-caps acronyms. Apart from the mkvtools, this needs mediainfo (install on Debian/Ubuntu with apt-get install mediainfo).

All the above can also be done, one file at a time, with the graphical interface mmg (of course: apt-get install mkvtoolnix-gui).

By now, you should have all your movie files in Matroshka containers. If not, because things like wmv files or files containing ancient codecs can’t just be re-containered, there’s HandBrake (as usual, apt-get install handbrake-gtk).

Matroshka Metadata

Apart from the title and the languages of audio tracks and subtitles, Matroshka files do not contain any metadata directly. Instead, the metadata lives in an XML file, which is muxed into the container. This obviously makes the whole process rather tedious. You don’t want to do it by hand.

Also, it turns out, most applications do not read any metadata from the containers AT ALL. mediainfo of course can do it. So can avinfo, surprisingly. vlc can display most of them in a special window. mpv will display the Title as the window title. But the ones really needing metadata, the media center applications, CAN’T. Neither MythTV, nor xbmc. Instead, both of these rely on filenames, and put the metadata into their database, with the added option of using some accompanying file with the movie which gets interpreted as well.

To add insult to injury, given one of these accompanying files with correct data, xbmc will display it, but when trying to fill in the blanks, it will happily try to look it up — by interpreting the filename again, wrongly. At least MediaElch can do this right (and that’s why it gets linked).

So the questions are a) how do we get these “accompanying files” (assuming they’re really needed for getting metadata from the web), b) how do we get better metadata into them, and c) how do we put this metadata into the files themselves.

For this, titlemkv can produce a rudimentary .nfo file for xbmc when given the -n switch. It will contain the title, and the year if it is already set in the mkv. Going from this, MediaElch, or any other scraper that isn’t broken, can now fill in the blanks and produce .nfo files which contain a lot of information, like directors, actors, summaries and so on.
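A rudimentary .nfo of this kind is just a small XML document. The following Python sketch shows roughly what such a file contains; it is not titlemkv's output format verbatim, and the function name minimal_nfo is hypothetical.

```python
import xml.etree.ElementTree as ET

def minimal_nfo(title, year=None):
    """Build a bare-bones movie .nfo holding only a title and optional year."""
    movie = ET.Element("movie")
    ET.SubElement(movie, "title").text = title
    if year is not None:
        ET.SubElement(movie, "year").text = str(year)
    # Serialize to a string; writing it next to the movie is left to the caller.
    return ET.tostring(movie, encoding="unicode")
```

A scraper then only has to disambiguate by title and year instead of guessing from the filename.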

The last piece is my nfo2xml script, which will walk over a directory and produce a mkv-compatible XML file out of every .nfo file it finds. The XML can then be muxed into the mkv container, thus: for i in *.mkv; do mkvpropedit "$i" --tags "all:$(basename "$i" .mkv).xml"; done
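To give an idea of what such a conversion involves, here is a minimal Python sketch that maps a few .nfo elements onto Matroska tag names. It is not nfo2xml; the mapping table and the function name are assumptions, and I have not checked whether mkvpropedit additionally wants an XML declaration or DOCTYPE in front of the result.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from .nfo element names to Matroska tag names.
NFO_TO_MKV = {
    "title": "TITLE",
    "year": "DATE_RELEASED",
    "director": "DIRECTOR",
    "plot": "SUMMARY",
    "genre": "GENRE",
}

def nfo_to_tags(nfo_xml):
    """Convert the text of a movie .nfo into a Matroska <Tags> document."""
    movie = ET.fromstring(nfo_xml)
    tags = ET.Element("Tags")
    tag = ET.SubElement(tags, "Tag")
    targets = ET.SubElement(tag, "Targets")
    # 50 is the Matroska target level for a whole movie or episode.
    ET.SubElement(targets, "TargetTypeValue").text = "50"

    def add_simple(name, value):
        simple = ET.SubElement(tag, "Simple")
        ET.SubElement(simple, "Name").text = name
        ET.SubElement(simple, "String").text = value

    for child in movie:
        if child.tag in NFO_TO_MKV and child.text and child.text.strip():
            add_simple(NFO_TO_MKV[child.tag], child.text.strip())
        elif child.tag == "actor":
            # Actors are nested in the .nfo: <actor><name>...</name></actor>
            actor_name = child.findtext("name")
            if actor_name:
                add_simple("ACTOR", actor_name.strip())
    return ET.tostring(tags, encoding="unicode")
```

Elements the table does not know about are silently skipped, which keeps the sketch short but loses information a real converter would want to keep.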

The Future

I’ll probably update titlemkv to generate complete .nfo files from mkv metadata (or split the functionality into another program). I also want to look at the question of how to incorporate cover images and such. First, I want all my files to contain useful metadata, and second, as long as this sorry state persists, I want to be able to generate whatever external metadata an application wants out of the incorporated metadata (which has its own merits: I would also be able to rename and sort my whole collection solely according to the metadata in the files themselves).

(Edit 1: I wrote a rather stupid shellscript mkvattachcover to convert and attach cover images. It expects them with the filenames provided by MediaElch.)

(Edit 2: For use with mediainfo --inform=file:///path/to/AVinfo.csv I put up a decent template, AVinfo.csv which will show Matroshka specific tags. No, I have no idea why they’re calling their templates .csv, they aren’t.)

But crucially, the media center applications and the file managers will need to support metadata incorporated into files; just as one expects with audio files where this is absolutely the case.

Metadata MUST reside within the same file. I do understand that certain programs do not want to incorporate code to change this metadata, but just about everything accessing these files must be able to READ it, including media players, scrapers and file managers.

(Edit 3: nautilus displays either cover.jpg or small_cover.jpg as icon. But that’s it, apparently it can’t read any other metadata.)

Posted in Computers, Metadata | Comments Off on Matroshka and the State of Movie Metadata

Google finds about 250’000 of these papers. It gets much worse if you only search for documents called “untitled1”. Not just the documents themselves have this meta-information, but all kinds of conversions, to html, and to pdf as well.

Sometimes, to make the whole thing even more ironic, the publisher has added his own information — but neither the title, nor the author.

Yes, metadata is kind of a pet issue for me, and I’ve even written about How to Enter EPUB Metadata, apart from also having written software to fix metadata in PDF and EPUB files (epub-meta/epub-rename and exif-meta/exif-rename. The latter works for PDF; the name comes from exiftool, although technically the PDF metadata is XMP).

But still, if your paper is worth anything, it is worth being found, and this also means it is worth providing with accurate meta-information.

Librarians work with an ISBN if there is one. If no ISBN can be found (because the work was published before 1969, or because no ISBN was ever registered), they need the following to correctly identify a work:

Author

Title

Publishing Year

Publisher

So you should take care that at least the first three of those are correctly filled in. If you’re doing a paper or book in the course of your work or study and publish it on the internet, consider entering the university or company as publisher.

Posted in Computers, EBooks, Metadata | Comments Off on Your name is “Windows User” and your scientific Paper is called “Microsoft Word – Untitled1”

It has commandline switches to select which metadata should be displayed. And it does so very fast: 3ms on my system, as opposed to the 670ms that ebook-meta from Calibre needs. However, it can only display the metadata, not change it. For changing metadata, you still need ebook-meta.
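Reading the basic metadata out of an EPUB takes little more than unzipping and parsing one XML file. The following Python sketch shows the principle; it is not epub-meta (which is C and built on libepub), and the function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def read_epub_metadata(epub):
    """Return title, authors and language of an EPUB (path or file object)."""
    with zipfile.ZipFile(epub) as archive:
        # META-INF/container.xml points at the OPF package file.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        opf = ET.fromstring(archive.read(rootfile.get("full-path")))
    # The Dublin Core elements live in the <metadata> block of the OPF.
    return {
        "title": opf.findtext(".//dc:title", namespaces=NS),
        "authors": [e.text for e in opf.iterfind(".//dc:creator", NS)],
        "language": opf.findtext(".//dc:language", namespaces=NS),
    }
```

Only three fields are read here; the other Dublin Core elements could be picked up the same way.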

Having the ability to display metadata fast made it possible to rename EPUB files. Initially, I had the idea to do that in C too, but working with strings is actually quite tedious in C, so I decided on perl. So there’s now also a program called epub-rename, which renames EPUB files according to their metadata in the format Author - Series SeriesIndex - Title. Moreover, it also has, through ebook-meta, the ability to fix certain issues in metadata tags: namely, swap inverted Title/Author tags, fix Author tags which are in the wrong(!) Last, First format, and some more.
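The naming scheme and the “Last, First” fix can be sketched in Python as follows. This is an illustration only, not the perl code of epub-rename; both function names are hypothetical.

```python
def flip_author(author):
    """Turn "Last, First" into "First Last"; leave other forms alone."""
    if author.count(",") == 1:
        last, first = (part.strip() for part in author.split(","))
        if first and last:
            return f"{first} {last}"
    return author

def epub_filename(author, title, series=None, index=None):
    """Build a name in the format Author - Series SeriesIndex - Title."""
    parts = [flip_author(author)]
    if series:
        # The "g" format drops a trailing ".0" from whole series numbers.
        parts.append(series if index is None else f"{series} {index:g}")
    parts.append(title)
    return " - ".join(parts) + ".epub"
```

Books without a series simply come out as Author - Title.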

If you have a library of e-books from different sources (e.g. Baen, Gutenberg, Archive.org, Google Books) you will notice a bewildering plethora of different styles of annotating EPUB files, sometimes blatantly wrong and in violation of the EPUB standard itself.

So this is a Howto on how to enter these metadata correctly. I’ll mostly cover the program “ebook-meta” (part of Calibre) which is available on about every platform.

Encoding

EPUB uses UTF-8, and UTF-8 only. Still, if you avoid things like left and right quotes and backquotes, you’ll make sure your tags don’t get messed up. Ideally, only use the plain single quote “'”.

Vocabulary

Try to be consistent in the vocabulary for tags (genres, categories). Sadly, no vocabularies are specified by the standards right now.

Tags

Title: This will contain the title as it’s read. Don’t put in the author (yes, seen that). Don’t anticipate sorting by naming it “Title, The”; this is the task of the library program. Don’t enter Series and Series Index.

Title sort: You don’t need to enter that; at least ebook-meta usually sets this correctly.

Author(s): Enter the author as named. Don’t enter the title here (also seen...), and don’t enter things like series or title after the author’s name. Don’t anticipate sorting by naming it “Name, First Name”. Enter it in the form “First Middle Last”. If the author’s name is usually used with initials, use these. Don’t enter “John Ronald Reuel Tolkien”, but “J. R. R. Tolkien”. After an initial, enter a dot and a space. If there are several authors, enter all of them; when using “ebook-meta”, separate them with “&”.
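The rules for initials and multiple authors are mechanical enough to sketch in Python. The function names are hypothetical, and the heuristic (any single-letter word is an initial) is an assumption that will not suit every name.

```python
def normalize_author(name):
    """Write initials as "J. R. R." with a dot and a space after each."""
    # Make sure every dot is followed by a space, then split into words.
    tokens = name.replace(".", ". ").split()
    fixed = []
    for token in tokens:
        # A single letter, with or without its dot, is treated as an initial.
        if len(token.rstrip(".")) == 1 and token[0].isalpha():
            fixed.append(token[0] + ".")
        else:
            fixed.append(token)
    return " ".join(fixed)

def join_authors(names):
    """Join several authors with the "&" separator expected by ebook-meta."""
    return " & ".join(normalize_author(name) for name in names)
```

Both “J.R.R. Tolkien” and “J R R Tolkien” come out as “J. R. R. Tolkien”.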

Author sort: You don’t need to enter that; at least ebook-meta usually sets this correctly.

Publisher: This is the original publisher. If you’re preparing an out-of-copyright e-book, don’t enter yourself. Also, don’t anticipate sorting but enter it as given.

Languages: At least one language must be set; you can set several if the book is multilingual. The language code is the 2-letter ISO code. Apparently ebook-meta ignores localized ones such as “en-gb”.

Published: This is the original publishing date. Not the date you’re preparing the e-book!

Rights: Enter the year and copyright holder, if applicable, and a license if necessary. Like this: “Copyright 1954 by J. R. R. Tolkien” or “Copyright 2012 by Peter Keel, License CC-By-2.5” or “Public Domain” if the work is not protected by copyright anymore.

Identifiers: Here go ISBN or ISSN (for magazines) or UUID. You can put in as many as you like. “ebook-meta” only allows setting the ISBN and a BookID specifically.

Comments: This is actually the “Description”-tag, and it’s supposed to hold the blurb which would otherwise go onto the flap or the back of a physical book. And it should not contain HTML-tags. Also, don’t make this too long.

Series: This is a Calibre-specific tag, however it’s honored in many e-book-readers, so you really want to use this. Enter the series as spelled. Don’t take sorting into account. Don’t enter any series number.

Series Index: Also Calibre-specific, but goes with support for the “Series”-tag. Enter a number here, corresponding to the number in the series.

Tags: This one is really the “Subject”-tag. It contains as many tags as you wish on what the book is about. Enter the genre here as well. Enter tags separated by comma. Do NOT enter a blurb here.

Category: This is probably the “Type” tag, but support seems to be rather limited. In any case, the genre does NOT go into that, but rather things like “textbook” or “novel”.
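Putting the tags above together, a call to ebook-meta can be assembled like this. The Python wrapper is a hypothetical sketch: the helper name and the keyword names are mine, only the ebook-meta switches themselves come from calibre.

```python
# Keyword name (an assumption of this sketch) -> ebook-meta switch.
OPTIONS = {
    "title": "--title",
    "authors": "--authors",
    "publisher": "--publisher",
    "language": "--language",
    "date": "--date",
    "isbn": "--isbn",
    "comments": "--comments",
    "series": "--series",
    "index": "--index",
    "tags": "--tags",
    "category": "--category",
}

def ebook_meta_args(path, **fields):
    """Build the argument list for an ebook-meta call without running it."""
    args = ["ebook-meta", path]
    for key, value in fields.items():
        if key not in OPTIONS:
            raise KeyError(f"unknown field: {key}")
        if isinstance(value, (list, tuple)):
            # Authors are joined with "&", everything else with commas.
            separator = " & " if key == "authors" else ", "
            value = separator.join(value)
        args += [OPTIONS[key], str(value)]
    return args
```

The resulting list can be handed to subprocess.run(args, check=True); passing the arguments as a list also sidesteps shell quoting of titles that contain spaces.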