Requesting open-licensed, open-format recordings of the voices of Wikipedia subjects for Wikimedia Commons

The Idea

A little while ago, my friend and fellow Wikipedia editor Andrew Gray (he’s the Wikipedian in Residence at the British Library!) mentioned to me that Wikipedia could do with more sound files. We discussed recordings of music, industrial and everyday sounds (what does a printing press sound like? Or a Volkswagen Beetle? What do different kinds of breakfast cereal sound like when milk is added?), as well as people’s voices, so that we have a record of what they sound like.

Beethoven’s Trumpet (With Ear) by John Baldessari, at the Saatchi Gallery. Photo by Jim Linwood, on Flickr, CC-BY

In the spirit of Wikipedia, all such recordings would be open-licensed, to allow others to use them, freely. They can then be uploaded to Wikimedia Commons (the media repository for Wikipedia and its related projects) in an open format, namely Ogg Vorbis (that’s like mp3, but without patent encumbrances).

So I’m working on a new initiative to provide short (under ten-second) open-licensed audio clips of examples of the speaking voices of notable people (i.e. people who have Wikipedia articles about them).

What To Do

As a pilot, I’m asking some of my (cough) celebrity friends to kindly record the following, or a variation of their choice, with no background noise:

Hello, my name is [name]. I was born in [place] and I have been [job or position] since [year]

Once they’ve done that, they can convert the file to Ogg Vorbis using this free tool and then upload it to Wikimedia Commons under an open licence with no “non-commercial (NC)” or “no derivatives (ND)” restrictions (e.g. CC-By or CC-By-SA), and add the category “Voice intro project”.

If that’s too much fuss, they can e-mail it, or its URL, to me (andy@pigsonthewing.org.uk), using common file formats like mp3 or .wav, stating that it’s under one of those licences, and CC the mail to: permissions-en@wikimedia.org to formally record the open licence. Then I or other Wikipedia editors will make the conversion.

Alternatively, perhaps, they can point to a suitable, open-licensed, example of their speaking voice, which is already online.
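
For anyone comfortable with a command line, the conversion step described above could also be scripted. The following is only a rough sketch, not the free tool mentioned earlier: it assumes the separate ffmpeg program is installed and was built with the libvorbis encoder, and the function names and the file name "voice_intro.mp3" are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_ogg_command(source: str, quality: int = 5) -> list[str]:
    """Build (but do not run) an ffmpeg command that re-encodes `source` as Ogg Vorbis.

    Assumption: ffmpeg is available and includes the libvorbis encoder.
    """
    # The output file sits next to the input, with an .ogg extension.
    target = Path(source).with_suffix(".ogg")
    return [
        "ffmpeg",
        "-i", str(source),    # input recording (mp3, wav, ...)
        "-vn",                # drop any video stream, keep audio only
        "-c:a", "libvorbis",  # encode the audio as Vorbis
        "-q:a", str(quality), # variable-bitrate quality setting
        str(target),
    ]


def convert_to_ogg(source: str) -> Path:
    """Run the conversion and return the path of the new .ogg file."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on this computer")
    command = build_ogg_command(source)
    subprocess.run(command, check=True)  # raises if ffmpeg reports an error
    return Path(command[-1])


if __name__ == "__main__":
    # Hypothetical file name; only prints the command, does not run ffmpeg.
    print(" ".join(build_ogg_command("voice_intro.mp3")))
```

The resulting .ogg file would still need to be uploaded to Wikimedia Commons by hand, with the licence and category chosen as described above.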

Anyone Can Help

If you’re not the subject of a Wikipedia article, you can still help, by recording and uploading to Wikimedia Commons audio files, as described above, of machinery or everyday activities and occurrences.

Updates

A couple of Wikipedia article subjects have asked why they would do this. In short, so that there is a public — and freely reusable — record of what they sound like, for current and future generations. And so that we know how they pronounce their names.

I’ve been asked about multi-lingual recordings. The best thing would be separate files, one in each language, please.

If you have a microphone on your computer (doesn’t work on iPhone/iPad), it’s possible to record directly into the Vocaroo website, and just email or tweet me a link. But you still need to agree to an open licence!

This is a brilliant idea; I hope it’s widely adopted. I also like Yuvi’s idea very much – a similar-ish sort of thing is the new Radiolab app, which allows you to record a tiny clip of you reading a bit of the end-credits (the name of the podcast’s sponsor). I’m itching to try it but always get stagefright whenever I press record, haha.

Our thoughts have recently been turning to similar matters. We’ve been working on a project for BBC World Service to take 70,000 English-language programmes and somehow make them available on the web. The big problem is that whilst we have high quality audio files we have no descriptive data about them. So nothing about the subject matter discussed or who’s in them and in some cases not even when they were broadcast.
To fix this we’ve put the audio through a speech to text system which gives us a (very) rough transcript. We’ve then entity extracted the text against DBpedia / Wikipedia concepts to make some navigation by “tags”. Because the speech to text step is noisy some of the tags extracted are not accurate but we’re working with the World Service Global Minds panel (a community of World Service listeners) who are helping us to correct them.
Recent work has been around voice recognition from the same audio files. We’re able to recognise voice patterns but not give a name / identity to the person speaking. Again the Global Minds panel are helping us to put names to these voices.
Obviously it would be better if we could associate these names with Wikipedia / DBpedia concepts to surface programmes about *and* featuring person X.
One suggestion was to compare the audio to speech recordings on Wikimedia. If we found a match we could associate the voice in the World Service archive with its Wikimedia identifier and from there to Wikipedia and from there to DBpedia.
To do that we’d need longer (duration) and higher quality samples than suggested here and we’ve mentioned cultural bodies (BBC, BL etc) opening up speech snippets from their archives. By releasing small nuggets of their archives they’d be putting just enough in place to make the further contextualisation of their (and other) archives possible which feels like a good trade. As ever there are probably rights issues…
@bilt – would anything roughly like this fall under your job description? 🙂

Interesting ideas, Michael, and lots to think about. The project sounds ripe for crowd sourcing; and that could be facilitated by open-licensing some of your recordings. They could then be uploaded to Wikimedia Commons, tagged or categorised there, and you could then reimport the metadata — I have in mind a similar initiative with old photographs from the US National Archives, which worked this way.

Of course, the BBC also has its own collection of voice recordings from named people, against which you could match — and those, or samples from them, would also be useful under an open licence.

The World Service archive would indeed be a candidate for a complete crowd-sourced approach (with the usual caveats around rights) but the research goals of the project I’m working on are about finding a sweet spot between first-pass machine processing and community correction. Have just posted more over here.

Matching against our own voice archive would only really be useful if we could get from that match to wiki/dbpedia identifiers so would only work if we added them to wikimedia under an open licence.

I’m not the best person to ask about what’s in the archive but if you’ve a genuine interest in Horace, Arnold or Thomas I know the perfect person to ask 🙂

When I edited the page for Terry George’s Whole Lotta Sole I discovered there was such a thing as a ‘films made by Terry George’ template thing that I could add. When I then amended the template itself to add in WLS and another of his films I was pleased that it all updated itself nicely.

Similarly because I keep my eyes on that page I am aware when someone has inserted a category or something like that, even if nothing much changes on the page. So people are aware of quite subtle changes. If I write [[text]] around something it redlinks until that page is created at which point it automatically sorts itself out, the link is already in place.

At the moment when people look at pages of notable folk there is no indication that anything’s missing in terms of the lack of voice recording.

Could there be a template for the voice recording that (a) will automatically pick up the formatted recording that’s added to the Wikimedia page and (b) when the template is added to a page won’t show up on the final page (until the sound file is installed) but will make page-watchers go ‘hang on a minute, what’s this clever notion then’?

All the better if the template can also link to the voice recordings page with the link to your blog on it for instructions.

Then we can add these templates to notable people, it won’t affect the page (so hopefully no-one will mind it sort of sitting there, waiting to be activated) and it raises awareness of the project.

It’s not a limit! That’s how long it takes to recite the sample script, which is designed to not be onerous, and to be long enough for the listener to get an impression of what the subject sounds like. But if people want to say more, they can.

The script is nice and simple; sufficiently so to allow someone to repeat until they’re happy with it. They might as well have recorded the video too. And, I’ve a possible solution to the conversion fiddling from the Wikinews Paralympics project.

Andy, as the founder of MyWikiBiz (and with Wikipedia link “Gregory Kohs” redirecting to MyWikiBiz article on Wikipedia), would I be welcomed to add a voice recording identifying myself? Are users banned on English Wikipedia (but active and welcome on other WMF projects) permitted to participate? Is a general rule being applied for “corporate” topics having, for example, a company founder identify the company by voice?

Hi Gregory, the files are uploaded to Wikimedia Commons, not en.Wikipedia, so that’s the first hurdle dealt with. As for adding one to an article, I see it as no different to adding a picture. Do note, though, the request for non-controversial content, and please keep your comments neutral, in line with the suggested script and existing examples.

Thanks for the info, Andy. I’m a welcome user on Commons, and I’m actually quite capable of making an audio introduction about myself without foaming at the mouth 😉, so I’ll give this a try in a little while. Obviously, adding the clip to the English Wikipedia would have to be done by someone who’s willing to carry the burden of “proxying for a banned user” accusations that will surely fly… but there is no shortage of drama mongers on Wikipedia who would probably love to test this out.