IMDb Data – Now easily available to contributors

Today (20 Dec 2018) we are pleased to announce that the IMDb datasets are easier to access and are now available directly from imdb.com. Using the new interface, contributors can bulk-access subsets of IMDb title and name data for personal and non-commercial use. Each dataset file is in a gzipped, tab-separated-values (TSV) format.
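For anyone who wants to try reading such a file, here is a minimal Python sketch of parsing a gzipped TSV. The column names and the `\N` placeholder in the sample are assumptions for illustration, and the sample row is made up rather than taken from a real dataset.

```python
import csv
import gzip
import io


def read_tsv_gz(source):
    """Yield one dict per data row from a gzipped, tab-separated file.

    `source` may be a file path or a binary file object.
    """
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary data
        # (an assumption about how the files are written).
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


# Made-up sample data, compressed in memory so the sketch is self-contained.
sample = gzip.compress(b"tconst\tdirectors\twriters\ntt0000001\tnm0000001\t\\N\n")
rows = list(read_tsv_gz(io.BytesIO(sample)))
```

Reading row by row keeps memory use low, which may matter if the real files are large.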

The extended datasets are only visible to folks who have made a large number of contributions recently. Those below that threshold only see the basic datasets. Folks who have not made any contributions don't get to see either.

Chris, please define 'large number' and 'recently'. Apparently making the top 250 list several times doesn't qualify me: is that because my number of contributions is still too low, or because I've taken a break for a while?

And why aren't the requirements mentioned either in the announcement or on the page itself (there's not even any hint that anything else might be available)?

Chris, can you please answer the question of how many a "large number of contributions" is? And can you also please explain why this information would only be made available to them? For a long time it was available to everyone. What was the problem with that?

What are the requirements for accessing the mysterious 'extended datasets'? Exactly how many contributions are there in 'a large number'? Exactly how many days/months/years ago counts as 'recently'?

But I suspect that whatever your answer, I'll be asking you to reconsider your policy on this. For my own case, I have contributed over 45,000 items to IMDb over the last 5 or 6 years (according to Col's end-of-year reports). I realise that's nowhere near as much as some, but I still consider it to be a very large number, and it was enough to get me into the top 250 contributors list three times in recent years. So I'm disappointed that you seem to be demanding more, before I can access data that wouldn't even exist without contributors like me. I'm sure I'm not the only person in this situation. For all the flaws in the old ftp-based system, at least the data was there and freely available. Why are you now so reluctant to share?

Also, why are you being so secretive about this data? I still have no idea what might be included in these 'extended datasets' and in fact I wouldn't even know that such a thing existed at all if I hadn't asked here. Neither the announcement in this thread nor the datasets page itself makes any mention whatsoever of what might be available or what mystical incantation is required to unlock it. Why is that?

There may have been some esoteric widespread incidents of individuals or organizations who have not contributed much, if anything at all, to IMDb yet built up their own websites by siphoning off IMDb. This is going to happen to every popular website regardless of the measures to slow it down, but sometimes the attempt is worth the drawback.

If restrictions are necessary, then so be it, but the criteria for access need to be designed carefully so that significant contributors are not left out. But without knowing the current criteria or IMDb's reasons, we're left guessing...

To newcomers, I very often claim that IMDb is a collaborative site. But there is also the dark, blind, opaque side of IMDb: information that never makes it out, concealed roadmaps, unanswered questions, which turn out to be extremely irritating (I leave the mysterious rating computation aside: this I can understand, as it could be manipulated). So I am afraid that the answer will never come. (And thus I hope to tease and lead the bear to come out of his den!! :D)

Jeorj Euler: "... There may have been some esoteric widespread incidents of individuals or organizations who have not contributed much, if anything at all, to IMDb yet built up their own websites by siphoning off IMDb."
- - -

IMDb could add, or already has added, special names that are not real and would not be found any other place but IMDb. IMDb could then do a Giggle Search for that name and see who is publishing IMDb data as their own?

Ed Jones (XLIX): That curious title is an IMDb Test title, I believe. Think I remember Col explaining this once. It was made-up names (nm) sequence numbers that had bogus actors on it. Running across one of these titles or names is rare.

Another two weeks go by and still only silence from IMDb. Disappointing. I had seen some signs recently that IMDb was getting better at communication, but clearly there's still a long way to go. If you're actively refusing to answer my questions then at least have the decency to say that, instead of leaving me waiting and wondering.

If anyone's interested, in the meantime my recent contributions seem to have triggered the magic formula and unlocked the extended datasets for me. Looks very promising at first glance, so I should probably be thanking you for that. But, sadly, the discussion in this thread now leaves me wondering if I'll be able to find enough time to look at the data in detail before my recent contributions are no longer recent enough and I lose access again.

Great! Thanks a lot for the insight. Yes, I am aware of the duplication of crews, which already occurs with the .principals, both available in the initial set of 7 files. And I also know for certain that IMDb will not disclose what triggers the access to the extended set... Let it be!

Please try to give the IMDb company a chance to make good on its vows before passing too harsh a judgment. I suppose there could be a big question as to what constitutes a "chance", but improvements can take time to be planned, specified, tested, double-checked and implemented, and that probably includes the very techniques of feedback and interaction with the IMDb Community. One problem is that there are a multitude of big plans, and while we may not all be happy with the way those plans are prioritized, the circumstances of resource management, following trends and making the site more attractive to more people, or easier to use by more people, have to be taken into consideration.

I was not making any judgement. Any of my activity around IMDb might be considered as a hobby, so no reason to take it too seriously.
Moreover, I have worked in IT for more than 25 years and I experience every week what prioritization and project management mean!

Jeorj, do you really think it's unreasonable for me to hope for a reply from IMDb Staff in less than the month I've been waiting so far? While I'm grateful for the input here from you and Vincent, I don't understand why these datasets seem to be surrounded in mystery and I'd like an official response. Even if that reply is "we can't/won't answer your questions", I want to know that IMDb aren't ignoring me. Are my expectations too high?

I don't know. You're more than welcome to keep inquiring about it every month. If you're feeling disappointed about this or that, then it may be because your expectations are way too high. After several fiascos of the past few years, I've long since lowered mine.

There is hope that things will get better, but as we can see from the most recent dozen or so announcements, the company seems heavily focused on the portability and entertainment aspects of the site at this time.

The Extended Datasets are available to those who have 1000+ approved contributions in the last 360 days, otherwise the Basic Datasets are available with just one approved contribution in the last 60 days.
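To make the rule above concrete, here is a rough Python sketch of it. The function name, the return labels, and the treatment of boundary days (inclusive on both ends) are assumptions for illustration; how IMDb actually counts contributions or evaluates the windows is not stated in this thread.

```python
from datetime import date, timedelta


def dataset_tier(approved_dates, today):
    """Return "extended", "basic" or "none" for a list of approval dates.

    Hypothetical helper: window boundaries are treated as inclusive,
    which is an assumption and not a documented IMDb rule.
    """
    extended_cutoff = today - timedelta(days=360)
    basic_cutoff = today - timedelta(days=60)

    in_last_360 = sum(1 for d in approved_dates if extended_cutoff <= d <= today)
    in_last_60 = sum(1 for d in approved_dates if basic_cutoff <= d <= today)

    if in_last_360 >= 1000:  # "1000+ approved contributions in the last 360 days"
        return "extended"
    if in_last_60 >= 1:  # "one approved contribution in the last 60 days"
        return "basic"
    return "none"
```

Under this reading, access depends on a rolling window, so a contributor can drop from one tier to another simply by being inactive for a while.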

Ok, it's nice to finally see that more data is available (to some people). I see no way to qualify, as I have a wide range of interests and I'm not from the industry, so getting 1000 updates in 360 days will never be possible (I only made maybe 5 updates in the last 15 years and pointed out several issues with broken/incomplete exports of the LIST files via the normal support pages for ~19 years).

The Java Movie Database (JMDB) application...

What should qualify me to access the data is the fact that I'm the author of the Java Movie Database (JMDB), which has been available for 19 years (the first versions only to a limited set of beta testers). --> http://www.jmdb.de/

Contact details can be found on the above website.

There is no other free application I know of that has been processing the (public) IMDb data for that long. The only thing coming close is the IMDbPY project. While the application was originally created by two persons, I'm the only one left (and have been for 15 years).

The JMDB application can import/process the old LIST files (which are still the basis for the search inside the application) plus the TSV files (not yet used for searching inside the application).

The reason that I didn't update the code to use the TSV content inside the application, besides adding support to import the data, is basically that the available content is so limited that it is actually useless. Sorry, but there is no other (nicer) way to sum it up when it comes to the TSV file format. This is what people already complained about in the original thread (https://getsatisfaction.com/imdb/topics/imdb-data-now-available-in-amazon-s3).

So I would really like to see what the "complete" stuff actually looks like and if it's actually for the following reasons:

Extend the TSV data import that JMDB supports to the full data-set (not only limited to the "free" stuff available now - which is useless)

Making the new TSV content searchable (if it's actually worth the effort - not for details limited to 10 entries (title.principals) per movie and TOP-4 movies/titles per name (name.basics) while the old LIST format offered the full list of people involved in a title)

Maybe offering some functionality to the professional "contributors" to use the application to send updates to the IMDb website (API available?)

As this has been under the radar here other use of the IMDb data in the past...

I also want to share that the IMDb data has been and still is (at least the frozen LIST files) used together with JMDB:

The imported IMDb data has been used to create covers or other internal notes on titles that have been recorded from TV

Finally some technical issues with the format...

There are also some technical issues, starting with the fact that inside each of the *.tsv.gz archives the name of the file is "data.tsv". Normally the name of the file found inside the archive should be equal to the name of the compressed file minus the ".gz" compressor extension (Example: "title.crew.tsv.gz" --> "title.crew.tsv"). It has been broken since the beginning. If you extract all the files you have downloaded into the current/same directory, you end up with only one file: the one that was extracted from the "last" archive processed.
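One possible workaround, sketched in Python: derive the output name from the archive's own filename instead of relying on the name stored inside it. The helper name `extract_by_archive_name` is made up for this example; Python's `gzip` module does not use the stored name when reading, so each archive lands in its own file.

```python
import gzip
import shutil
from pathlib import Path


def extract_by_archive_name(archive_path, out_dir):
    """Decompress e.g. "title.crew.tsv.gz" to "<out_dir>/title.crew.tsv".

    Hypothetical helper: the target name comes from the archive's filename,
    not from the name recorded inside the gzip stream.
    """
    archive_path = Path(archive_path)
    if archive_path.suffix != ".gz":
        raise ValueError(f"not a .gz archive: {archive_path}")

    # Path.stem drops only the last suffix: "title.crew.tsv.gz" -> "title.crew.tsv"
    target = Path(out_dir) / archive_path.stem
    with gzip.open(archive_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the data instead of loading it all
    return target
```

Calling it once per downloaded archive with the same `out_dir` should leave one distinct .tsv file per archive rather than a single overwritten "data.tsv".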

When the data is imported with JMDB from the compressed gzip files and you try to add foreign key constraints to the filled tables, it will not work in some cases: there are references in, e.g., title.crew.tsv (person references) that have no matching entry in the name.basics.tsv file, so the relational database tells you that your data is basically incomplete. There is more of this.
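The kind of check that exposes this can be sketched in Python without a database. This is not how JMDB does it. The column names (`tconst`, `directors`, `writers`), the comma-separated person ids, and the `\N` placeholder for "no value" are assumptions about the file layout, and the sample rows are made up.

```python
def find_orphan_name_refs(crew_rows, known_nconsts, columns=("directors", "writers")):
    """Map each person id missing from `known_nconsts` to the titles citing it.

    Assumed layout: each crew row is a dict with a "tconst" key, and the
    person columns hold comma-separated ids or "\\N" for "no value".
    """
    orphans = {}
    for row in crew_rows:
        for column in columns:
            value = row.get(column, "\\N")
            if value in ("", "\\N"):
                continue  # nothing referenced in this column
            for nconst in value.split(","):
                if nconst not in known_nconsts:
                    orphans.setdefault(nconst, set()).add(row["tconst"])
    return orphans


# Made-up sample: nm0000002 is referenced by a title but is not a known name.
crew = [
    {"tconst": "tt0000001", "directors": "nm0000001", "writers": "\\N"},
    {"tconst": "tt0000002", "directors": "nm0000001,nm0000002", "writers": "nm0000002"},
]
names = {"nm0000001"}
missing = find_orphan_name_refs(crew, names)
```

Any id reported here is one that would make a foreign key constraint from the crew table to the names table fail.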

I am an academic who has used his incredibly useful JMDB tools to manipulate and organize the IMDB LIST files as part of my research. His efforts have made using those IMDB flat files much easier for people like myself whose programming skills are sub-par, and as such I think advances the mission and visibility of IMDB.

I am pretty sure the target audience of the contributor files would include individuals like myself who are interested in having comprehensive files (in my case for statistical analysis). I imagine industry folks and those more interested in following certain titles would use either the regular webpage or (in the case of industry folks) using IMDB Pro.

So my hope is that these additional files are made available to volunteers like Juergen who provide the tools which many of us use to access the raw files. I think his software is a meaningful contribution to IMDB, and it would make sense, both from the perspective of advancing IMDB and as a reward for pro-social behavior, to expand the definition of "contributor" to include tool-making.

I also agree with Juergen. What are these secret data? I remind you that the IMDB database was created over the years with the support of millions of people around the world; without them (and even a small contribution of mine) IMDB would never have been born. It also seems absurd and insane that to get access to the extra data you have to enter 1000 contributions a month, but even if I enter one item a day I will never succeed. Go back, and release the database to everyone; it is a public asset.