Yahoo opens largest database to the public

Machine learning is a topic I’ve touched upon a lot recently, and whilst there are efforts underway to develop machines that are more efficient learners, at the moment it still requires huge databases and a lot of computational grunt to train the algorithms successfully.

Alas, access to that kind of data is something that is often lacking, with large-scale datasets typically the preserve of machine learning academics or scientists at huge companies.

A project from Yahoo Labs aims to make such data more widely available. The Webscope database has been available to Yahoo researchers for some time, but they have recently opened it up to the public.

Data for the masses

The database currently provides in the region of 13.5TB of anonymized user-news item interaction data from around 20 million users over a three-month period.

The data consists of anonymized user interactions with news content on a range of Yahoo properties, and it has been released with the aim of promoting independent research in machine learning and recommendation systems.

Yahoo Labs also hopes the release will open up this kind of research to a wider range of participants and level the playing field between industrial and academic research.

The dataset gives researchers both demographic information about users and a record of their interactions with content. Each interaction is timestamped and contains information on the device used to access the content.
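To give a rough feel for what a single interaction might look like once loaded, here is a minimal Python sketch. The real file format and schema aren't described here, so every field name below (`user_id`, `item_id`, `device`, `age_group`) and the sample values are my own assumptions, purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Interaction:
    # Hypothetical record layout; the actual Webscope schema may differ.
    user_id: str         # anonymized user identifier
    item_id: str         # anonymized news item identifier
    timestamp: datetime  # when the interaction happened
    device: str          # device used to access the content
    age_group: str       # one example of a demographic attribute


def interactions_per_device(events):
    """Count how many interactions came from each device type."""
    return Counter(event.device for event in events)


# Made-up sample records, not taken from the real dataset.
sample = [
    Interaction("u1", "n1", datetime(2020, 1, 1, 9, 0, tzinfo=timezone.utc), "mobile", "25-34"),
    Interaction("u1", "n2", datetime(2020, 1, 1, 9, 5, tzinfo=timezone.utc), "mobile", "25-34"),
    Interaction("u2", "n1", datetime(2020, 1, 2, 18, 30, tzinfo=timezone.utc), "desktop", "35-44"),
]

device_counts = interactions_per_device(sample)
print(device_counts)
```

A simple count like this is the sort of first sanity check you might run before moving on to behavior modeling or recommendation work.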

The data has already been given a thorough working over by the Personalization Science team at Yahoo, and they’re confident that the public will have similar fun in areas such as behavior modeling, machine learning, recommendation services and content modeling.

“We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, ‘real-world’ dataset,” they say.

Admittedly, the range of research possible with this data is relatively limited by the subject matter it covers, but it’s nonetheless a tasty amount of data for scientists to play around with, and hopefully it will prove valuable to the community.