Exploring Big Data with the OpenDNS and Pandora research teams

Last Thursday OpenDNS hosted 150 guests from the SF Data Mining meetup group for a night of insightful talks on Big Data. After filling up on plenty of beer and pizza, attendees made their way upstairs to enjoy presentations from the OpenDNS security research team and Pandora’s data mining team. Although the meetup group is local to San Francisco, Big Data has begun to dominate research strategies across entire industries, all over the world. So, we wanted to share the presentations here for everyone to enjoy. Below you’ll find summaries of each presentation and a video capturing the evening’s talks.

Video of the presentations can be found here:

Big Data for Security at OpenDNS

OpenDNS currently answers 50 billion DNS queries a day on behalf of nearly 3 percent of the world’s Internet users. The seemingly impossible mission of the OpenDNS security research team is to slice and dice these queries in order to discover new threats and new malicious domains as fast as possible.

OpenDNS researchers use statistical analysis, visualization and machine learning to gain insight, and their work is complemented by the OpenDNS community-based tagging system, which now also includes malware domain tagging. Since a single false positive can have a catastrophic impact on both Umbrella and OpenDNS customers, the team works on numerous algorithms to achieve a very low error rate and eliminate false positives.

The team detailed three such algorithms in their presentation:

Secure Rank – A variation on Google’s PageRank algorithm that leverages prior knowledge of both malicious and benign domains to assign an initial rank. This algorithm works amazingly well and has quickly discovered thousands of new, highly suspicious domain names for the OpenDNS research team.

Personalized PageRank – Focuses on what happens just before and just after an infection, and at a specific geographical location. The focus here is on finding compromised websites that will eventually host or control malware.
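
Both rank-based approaches above start from PageRank and bias it with prior knowledge. The talk summary does not spell out how OpenDNS does this, so the following is only a minimal sketch of generic personalized PageRank, not the team's actual algorithm. The graph, the domain and host names, and the choice to seed only known-malicious nodes are all assumptions made for illustration.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with teleportation restricted to seeds.

    graph: dict mapping node -> list of nodes it links to.
    seeds: dict mapping node -> prior weight (must sum to more than zero).
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    total = sum(seeds.values())
    teleport = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)  # start the walk from the prior

    for _ in range(iterations):
        # With probability (1 - damping) the walk jumps back to a seed.
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = graph.get(n, [])
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    new_rank[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
        rank = new_rank
    return rank


# Hypothetical graph: domains linked to the hosts that serve them.
graph = {
    "bad.example": ["shared-host"],
    "shared-host": ["bad.example", "unknown.example"],
    "unknown.example": ["shared-host"],
    "good.example": ["cdn-host"],
    "cdn-host": ["good.example", "other.example"],
    "other.example": ["cdn-host"],
}
seeds = {"bad.example": 1.0}  # assumed prior: one known-malicious domain
scores = personalized_pagerank(graph, seeds)
```

In this toy graph, unknown.example shares a host with the seeded domain and so receives a positive score, while other.example is unreachable from the seed and stays at zero. A real system would presumably combine malicious and benign priors, as the description of Secure Rank suggests.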

Fast Flux – There’s a specific category of domains known as Fast Flux that take advantage of the fact that the set of IP addresses returned for a domain name is only valid for a limited period of time, over which the domain owner has full control. A botnet operator can leverage this feature to very quickly switch to a different set of hosts in order to serve a malicious payload. This makes it more difficult to block a criminal operation, as any compromised machine can become a host for redirecting Web browsers or serving spam, phishing and malware at any time. In a recent Fast Flux detection experiment, predictions were 99% accurate with zero false positives for 600 true negatives.
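
To make the Fast Flux description concrete, here is a minimal heuristic sketch. It is not the detector the team presented; the feature names, the thresholds and the sample addresses are all hypothetical. It looks for the pattern described above: answers that are valid only briefly (a low TTL) combined with many distinct IP addresses spread across unrelated networks.

```python
from ipaddress import ip_network


def flux_features(observations):
    """Summarise DNS answers seen for one domain.

    observations: non-empty list of (ttl_seconds, [ip strings]) tuples.
    """
    ips = set()
    ttls = []
    for ttl, answers in observations:
        ttls.append(ttl)
        ips.update(answers)
    # Group addresses by /16 as a rough stand-in for "different networks".
    prefixes = {ip_network(f"{ip}/16", strict=False) for ip in ips}
    return {
        "distinct_ips": len(ips),
        "distinct_prefixes": len(prefixes),
        "min_ttl": min(ttls),
    }


def looks_like_fast_flux(features, max_ttl=300, min_ips=5, min_prefixes=3):
    """Flag a domain using arbitrary, illustrative thresholds."""
    return (
        features["min_ttl"] <= max_ttl
        and features["distinct_ips"] >= min_ips
        and features["distinct_prefixes"] >= min_prefixes
    )


# Hypothetical observations using documentation-only address ranges.
rotating = [
    (60, ["192.0.2.1", "198.51.100.3"]),
    (60, ["203.0.113.5", "192.0.2.7"]),
    (60, ["198.51.100.9", "203.0.113.8"]),
]
stable = [
    (3600, ["192.0.2.10"]),
    (3600, ["192.0.2.10"]),
]

rotating_flagged = looks_like_fast_flux(flux_features(rotating))
stable_flagged = looks_like_fast_flux(flux_features(stable))
```

Legitimate content delivery networks can also return short-lived answers with many addresses, so fixed thresholds like these would produce false positives on their own. That is presumably one reason the team relies on trained models and measures false positives so carefully.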

The team also shared details on the numerous tools, programming languages and libraries they use for processing the wealth of data at their fingertips. The research team uses Hadoop, HBase, Pig, Hive, Java, Ruby, Python and R on a daily basis, and is experimenting with Kafka, Storm, GraphLab and GraphChi. You can learn more about the OpenDNS security research team by following @ThinkUmbrella on Twitter.

Data Mining at Pandora

The Pandora research team shared some very impressive numbers: 175 million registered users, 3.8 billion stations and 1.27 billion listening hours per month, and counting. But what makes for good data mining? At Pandora, it’s all about discovery: learning interesting or unexpected facts about your own area of study or system. For example, a few years back the Pandora team discovered a station named “Christmas Radio” that they had never created. The system had adapted on its own to play holiday music.

For data mining, Pandora uses Hive, R, Matlab, D3.js and the Google Charts API. For data storage, they recently shifted from Postgres to Hadoop, observing that until you have a huge amount of data, Postgres works just fine.

It has been twelve years since Pandora created the Music Genome Project, which describes each song as a sequence of characteristics, or “genes”. This exciting and unique project is still being worked on today by 25 musicians. Sounds like there’s lots more discovery to come.