Netflix Dataset

Nowadays, the existence of streaming media is needed by people. It is because people can watch their favourite TV show or movies anytime and anywhere. They just need a device that can connect to the internet and also pay to the provider. Netflix is one of the companies that provide streaming media. This company was founded in Scotts Valley, California by Reed Hastings and Marc Randolph in 1997. At the beginning, this company just focused on streaming media and also video on demand online and DVD by mail. But then, Netflix enlarged the service into film and TV production and also distribution.

The company that is headquartered in Los Gatos, California, had 109.25 million subscribers that spread around the world. This number includes 52.77 million subscribers in the United States. Netflix produce new content, secure the rights for additional content and diversify through 190 countries. It leads the company to rack up billions in debt: $21.9 billion as of September, 2017, up from $16.8 billion from the similar time the prior year. For your information, beside it has a headquarter in Los Gatos California, this company also has the other offices including in the Netherlands, Japan, Korea, India and Brazil.

Talk about Netflix, have you heard about Netflix Prize? It was an open competition for the best collaborative filtering algorithm. It is intended to forecast user ratings for movie based on foregoing ratings without any other information about the users or movie. This competition was held by Netflix and open to anyone who is neither connected with Netflix nor a resident of certain blocked countries. In 2009, exactly in September, the team of BellKor’s Pragmatic Chaos got the grand prize of US $1,000,000 that bested Netflix’s own algorithm for forecasting ratings by 10.06%. Netflix served a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies. Every training rating is quadruplet f the form <user, movie, date of grade, grade>. The user and also movie fields are integer Ids, while grades are from 1 to 5 (integral) stars. The qualifying data set includes 2,817,131 triplets of the form <user, movie, date of grade>, with grades recognized only to the jury. A participating team’s algorithm has to predict grades on the whole qualifying set. But they are only informed of the score for half of the data, the quiz set of 1,408,342 ratings. Another half is the set of test of 1,408,789 and performance on this is used by the jury to set potential prize winners. We can conclude that the data that are used in the Netflix Prize looks like this:

1. Training set (99,072,112 ratings not containing the probe set, 100,480,507 containing the probe set.
2. Probe set (1,408,395 ratings)
3. Qualifying set (2,817,131) that consist of:
4. The set of test (1,408,789 ratings), it is used to decide winners.
5. The set of quiz (1,408,342 ratings), it is used to calculate leaderboard scores.

The set of training is such that the average user rated more than 200 movies and the average movies was rated by more than 5000 users. But there is variance in the data, several movies in the training set have as few as 3 ratings, while one user rated more than 17,000 movies.