Internet Data Sets

The Most Comprehensive, Internet-Scale Data Sets Available

Petabytes of Internet Data at your Fingertips

At RiskIQ, data is in our DNA. Since our inception, we have been gathering petabytes of passive DNS and WHOIS data, and through our crawling of the entire internet, have amassed data sets that include SSL certificates, newly observed domains, web and analytics trackers, mobile apps, and the components that make up the web pages we see every day.

These data sets can be used by security professionals and threat analysts to connect the dots between threat infrastructure and understand the attack vectors and patterns used by attackers.

White Paper: Using Internet Data Sets to Understand Digital Threats

Passive DNS

The Domain Name System (DNS) is a lot like a phonebook for the internet. It tells your browser the IP address of the server that hosts the website associated with a domain name. Passive DNS collection involves gathering the domain request and IP response from DNS providers across the internet as resolutions happen. This can provide insight into when resolutions change and what they change to.

RiskIQ collects 1,000 gigabytes of passive DNS data daily

Threat actors need to establish infrastructure to conduct their attacks, and one of these infrastructure elements is often DNS. For example, a piece of malware may include a hardcoded domain name that is seemingly legitimate. To execute an attack, a threat actor may change that domain’s DNS record to resolve to a malicious IP address that delivers a payload or encrypts data with ransomware. RiskIQ also includes sources of active DNS resolution when a specific domain or IP is queried.
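To make the idea of a resolution change concrete, here is a minimal sketch of how an analyst might flag one in passive DNS observations. The record layout (domain, IP, observation date) and the sample values are assumptions for illustration, not RiskIQ's schema or API.

```python
from collections import defaultdict
from datetime import date

# Hypothetical passive DNS observations: (domain, resolved IP, date observed).
OBSERVATIONS = [
    ("example.com", "192.0.2.10", date(2020, 1, 1)),
    ("example.com", "192.0.2.10", date(2020, 1, 2)),
    ("example.com", "198.51.100.7", date(2020, 1, 3)),
    ("example.org", "203.0.113.5", date(2020, 1, 1)),
]


def resolution_changes(observations):
    """Return (domain, old_ip, new_ip, date) for each change in resolution."""
    by_domain = defaultdict(list)
    for domain, ip, seen in observations:
        by_domain[domain].append((seen, ip))

    changes = []
    for domain, events in by_domain.items():
        events.sort()  # chronological order
        # Compare each observation with the one that follows it.
        for (_, old_ip), (seen, new_ip) in zip(events, events[1:]):
            if new_ip != old_ip:
                changes.append((domain, old_ip, new_ip, seen))
    return changes


changes = resolution_changes(OBSERVATIONS)
```

Here `changes` holds a single entry for example.com, recording the old IP, the new IP, and the date the new resolution was first seen.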

WHOIS Records

Thousands of times a day, domains are bought and transferred between individuals, and domain registrants must provide information about themselves when registering. This information gets stored in a WHOIS record associated with the domain.

WHOIS is a protocol that lets anyone query for ownership information about a domain, IP address, or subnet. RiskIQ has a vast database of WHOIS data, which is available to query for registrant information. WHOIS records provide information that includes the name, email address, street address, and phone number of the individual who registered the domain.

Attackers need to establish infrastructure to originate their attack as well as set up servers to communicate with their malware. Often, attackers register multiple domains at the beginning of an attack campaign for use during all phases of their operations.

WHOIS data can provide an organization with insight into who is behind an attack campaign. Using domain registration information, an organization can unmask an attacker’s infrastructure by linking a suspicious domain to other domains registered using the same or similar information.
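The pivot described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields (`domain`, `email`) and the sample data are hypothetical, and matching on a normalized email address is just one of several ways to link registrations.

```python
from collections import defaultdict

# Hypothetical WHOIS records; the field names are assumptions for illustration.
WHOIS_RECORDS = [
    {"domain": "bad-login.example", "email": "Actor@Mail.example"},
    {"domain": "bad-update.example", "email": "actor@mail.example "},
    {"domain": "shop.example", "email": "owner@shop.example"},
]


def related_domains(records, suspicious_domain):
    """Find other domains registered with the same email as the suspicious one."""
    by_email = defaultdict(set)
    email_of = {}
    for record in records:
        email = record["email"].strip().lower()  # normalize before comparing
        by_email[email].add(record["domain"])
        email_of[record["domain"]] = email

    email = email_of.get(suspicious_domain)
    if email is None:
        return set()  # no WHOIS record for this domain
    return by_email[email] - {suspicious_domain}


related = related_domains(WHOIS_RECORDS, "bad-login.example")
```

The same approach extends to other registrant fields named above, such as phone number or street address.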

URL Intelligence

Through RiskIQ’s internet crawls, we have records of billions of pages and the associated HTTP requests. Through our vast database of URL intelligence, combined with other industry-leading blacklists and URL data feeds, threat analysts can query a specific URL for any information that RiskIQ has on that URL. If RiskIQ does not have any information, the URL will be crawled and evaluated for malicious code, iframes, redirects, or drive-by-downloads that could lead to compromise.

SSL Certificates

Securing user transactions and interactions on the internet is an essential part of everyday life. SSL certificates are files that digitally bind a cryptographic key to a set of user-provided details and assist in providing this security. Beyond securing your data, certificates are a way for analysts to connect disparate malicious network infrastructure. SSL certificates can provide context by showing whether a domain or IP is legitimate based on its certificate, distinguishing self-signed certificates from those issued by a third-party certificate authority, and identifying IP clusters and additional certificates based on shared certificates.

RiskIQ has collected more than 30,000,000 SSL certificates since 2013

SSL certificates are typically used by malicious actors in a few different ways. Some are self-signed, so they have no real credibility, and are associated with a website or web server performing a malicious function that RiskIQ has seen in the wild. Some SSL certificates are used to encrypt command and control communications for a piece of malware, so the data isn’t visible. And sometimes information in the certificate, such as its subject alternative names, can be used to surface connections between certificates.
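As a rough sketch of how shared certificates can cluster IP addresses, the example below groups hosts by certificate fingerprint. The scan results are placeholder byte strings standing in for DER-encoded certificates, and the SHA-1 fingerprint is used only as a common way to identify a certificate; none of this is RiskIQ's actual pipeline.

```python
import hashlib
from collections import defaultdict

# Hypothetical scan results: (IP address, DER-encoded certificate bytes).
# The byte strings are placeholders, not real certificates.
SCAN_RESULTS = [
    ("192.0.2.1", b"certificate-A"),
    ("192.0.2.2", b"certificate-A"),
    ("198.51.100.9", b"certificate-B"),
]


def cluster_by_certificate(scan_results):
    """Map each certificate's SHA-1 fingerprint to the IPs that served it."""
    clusters = defaultdict(set)
    for ip, der_bytes in scan_results:
        fingerprint = hashlib.sha1(der_bytes).hexdigest()
        clusters[fingerprint].add(ip)
    return clusters


clusters = cluster_by_certificate(SCAN_RESULTS)
# Keep only certificates seen on more than one IP: these are the pivots.
shared = [sorted(ips) for ips in clusters.values() if len(ips) > 1]
```

In this sample, `shared` contains one cluster: the two IPs that presented the same certificate.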

Browser Cookies

As RiskIQ virtual users crawl the internet, they capture everything that happens under the hood when the virtual users visit a website. This includes capturing any cookies that might be dropped by the site to track user behavior or note the status of the user’s machine. Cookies are yet another source of information that can tie pieces of infrastructure together across attack campaigns, or connect seemingly unrelated assets together. RiskIQ correlates cookie source name and data with infrastructure hosting the cookies to allow analysts to pivot and find other sites with related cookies.

Threat actors often use cookies to track users who have been delivered a malicious payload so as not to try to infect a user again. Threat hunters who are investigating a cookie as a possible indicator of compromise can search the RiskIQ internet database for that cookie.

Mobile Apps

RiskIQ crawls and scans more than 150 mobile app stores (yes, there are more than just the Apple App Store and Google Play store) daily, taking inventory of the apps, versions, and code that exist in each of the stores. With RiskIQ’s knowledge of nearly 20,000,000 apps, organizations can ensure that their official mobile apps have not been compromised and are hosted only in stores authorized for distribution.

RiskIQ has a database of nearly 20,000,000 mobile apps

Threat actors and hackers often will download the application binary (the app’s code), make small changes that infect users with malware, spyware, or viruses, and then re-post the app to an unmonitored app store where an unsuspecting user might download it, thinking it’s official and legitimate.

Monitoring for these occurrences of rogue and unofficial apps in RiskIQ’s Mobile App data set safeguards your customers and their mobile devices from attacks.
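The two checks described above (has the binary been altered, and is the store authorized) might look something like the following sketch. The store names, binaries, and labels are hypothetical, and comparing a single SHA-256 hash is a simplification: legitimate copies of an app can differ between stores, so a real check would likely compare more than one attribute.

```python
import hashlib

# Hypothetical inputs: the hash of the official build and the stores
# the organization has authorized for distribution.
OFFICIAL_SHA256 = hashlib.sha256(b"official app build 1.0").hexdigest()
AUTHORIZED_STORES = {"Apple App Store", "Google Play"}

# Hypothetical store listings: (store name, binary bytes found there).
LISTINGS = [
    ("Google Play", b"official app build 1.0"),
    ("Third-Party Store", b"official app build 1.0"),
    ("Another Store", b"official app build 1.0 + injected code"),
]


def classify_listing(store, binary):
    """Label a listing as ok, hosted in an unauthorized store, or modified."""
    if hashlib.sha256(binary).hexdigest() != OFFICIAL_SHA256:
        return "modified binary"
    if store not in AUTHORIZED_STORES:
        return "unauthorized store"
    return "ok"


findings = [(store, classify_listing(store, binary)) for store, binary in LISTINGS]
```

A modified binary is reported first because it is the more serious finding; an unmodified copy in an unauthorized store is flagged separately.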

Newly Observed Domains

Newly observed domains are simply domains that RiskIQ has never seen before through our collection of petabytes of DNS data. Threat actors often create new infrastructure quickly, engage in their attack campaigns, and then duck and run in a matter of hours or even minutes. From a strict security perspective, there are few reasons that someone would need to visit a domain that has just come online, unless the person created it for their own use, or, more likely, they were sent a URL via a malicious campaign.

Organizations can use the RiskIQ Newly Observed Domains data set to systematically block new domains (via a web filter or firewall) for a set amount of time after they’re first observed. This can act as an added layer of protection against quick attack campaigns.
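A time-boxed block like this can be expressed as a simple age check. The sketch below assumes a 24-hour hold period and a hypothetical lookup table of first-seen timestamps; treating a never-observed domain as new is a design choice made here for illustration, not something the data set prescribes.

```python
from datetime import datetime, timedelta

BLOCK_WINDOW = timedelta(hours=24)  # assumed hold period

# Hypothetical first-seen timestamps, as a newly observed domains feed might supply.
FIRST_SEEN = {
    "fresh.example": datetime(2020, 1, 1, 12, 0),
    "old.example": datetime(2019, 6, 1, 0, 0),
}


def should_block(domain, now, first_seen=FIRST_SEEN, window=BLOCK_WINDOW):
    """Block a domain while it is younger than the hold period."""
    seen = first_seen.get(domain)
    if seen is None:
        return True  # never observed before: treat it as brand new
    return now - seen < window


blocked_now = should_block("fresh.example", datetime(2020, 1, 1, 18, 0))
blocked_later = should_block("fresh.example", datetime(2020, 1, 2, 12, 0))
```

The domain is blocked six hours after it first appears and released once the full window has elapsed.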

Host Pairs

Host pairs are unique relationships between pages that are observed by RiskIQ when we crawl a web page. Each pair has a direction of child or parent and a cause that outlines the relationship connection. These values provide insight into redirection sequences, dependent requests or specific actions within a web page when it loads.

The connection could range from a top-level redirect (HTTP 302) to something more complex like an iframe or script source reference. What makes this data set powerful is the ability to understand relationships between hosts based on details from visiting the actual page. Host pairs rely on knowing website content, so they are likely to surface relationships that other sources like passive DNS and SSL certificates do not.
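One way to picture host pairs is as a directed graph of (parent, child, cause) edges that can be walked from a starting host. The sketch below is a minimal illustration: the hosts and the cause labels (`script.src`, `redirect`, and so on) are made up for the example and are not RiskIQ's actual values.

```python
from collections import defaultdict, deque

# Hypothetical host pairs: (parent host, child host, cause of the relationship).
HOST_PAIRS = [
    ("news.example", "ads.example", "script.src"),
    ("ads.example", "redirector.example", "redirect"),
    ("redirector.example", "exploit.example", "iframe.src"),
    ("news.example", "cdn.example", "img.src"),
]


def downstream_hosts(pairs, start):
    """Breadth-first walk from a host through its child relationships."""
    children = defaultdict(list)
    for parent, child, cause in pairs:
        children[parent].append((child, cause))

    seen = {start}
    queue = deque([start])
    chain = []
    while queue:
        host = queue.popleft()
        for child, cause in children.get(host, []):
            if child not in seen:  # avoid revisiting hosts in cyclic chains
                seen.add(child)
                chain.append((host, child, cause))
                queue.append(child)
    return chain


chain = downstream_hosts(HOST_PAIRS, "news.example")
```

Starting from news.example, the walk reaches exploit.example by way of the script reference and the redirect, a path that DNS or certificate data alone would not reveal.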

See an Example of How Host Pairs Uncovered Malicious Infrastructure

Website Metadata and Trackers

RiskIQ gathers the full DOM during the loading process of pages that we crawl. We extract details such as website trackers, analytics codes, social network accounts, and other unique details. These values can provide insights into additional infrastructure that typically goes unnoticed by static data sets. RiskIQ’s tracker data includes IDs from providers like Google, Yandex, Mixpanel, New Relic, Clicky, and more.

For example, when a website’s HTML is scraped and reposted for something like a phishing campaign, malicious actors often don’t bother to change things like the associated Google Analytics ID, tracking pixels, cookies, or social network connections. Being able to search official tracking codes can surface pages where the threat actor has forgotten to change this information, leading to security teams finding and shutting down a malicious campaign.

Also, like most digital organizations, some hacking organizations utilize tools like Google Analytics to measure the success of their malicious campaigns. We can find other instances across the internet where we’ve seen the same malicious actor’s analytics tracker and uncover additional campaigns associated with them.
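As an illustration of pivoting on a tracker, the sketch below pulls Google Analytics IDs out of page HTML and indexes hosts by ID. The pages and IDs are invented, and the regular expression only approximates the classic `UA-` property ID format; other trackers would need their own patterns.

```python
import re
from collections import defaultdict

# Approximate pattern for classic Google Analytics property IDs (an assumption).
GA_ID = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")

# Hypothetical crawl results: host -> page HTML.
PAGES = {
    "bank.example": "<script>ga('create', 'UA-1234567-1', 'auto');</script>",
    "bank-login.phish.example": "<script>ga('create', 'UA-1234567-1', 'auto');</script>",
    "blog.example": "<script>ga('create', 'UA-7654321-2', 'auto');</script>",
}


def hosts_by_tracker(pages):
    """Index hosts by every analytics ID found in their HTML."""
    index = defaultdict(set)
    for host, html in pages.items():
        for tracker_id in GA_ID.findall(html):
            index[tracker_id].add(host)
    return index


index = hosts_by_tracker(PAGES)
# Hosts other than the official site that reuse the official tracking ID.
lookalikes = index["UA-1234567-1"] - {"bank.example"}
```

Here `lookalikes` surfaces the copied page that still carries the official site's analytics ID.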

Derived Data Sets

Along with the terabytes of collected data from across the internet, RiskIQ also extracts and analyzes the data to create new data sets that aid in discovering, understanding, and mitigating digital threats.

These include the RiskIQ Accomplice List, which covers websites and URLs that are not malicious themselves but that link or redirect to URLs that host malware, phishing, or scams. RiskIQ also maintains our own Phish List, Scam List, and Zero-Day List, which include never-before-seen URLs and pages that host phishing, scam, or malware content that we find while crawling the internet with our virtual users. These proprietary lists are generated using RiskIQ intelligence and correlation models to provide insight that is not available from point solutions or other threat intelligence feeds.

OSINT - Open Source Intelligence

Open source intelligence is community-gathered information that is available from public sources. RiskIQ gathers this information from media and press sites, web-based community sites, and public data available via search engines, and consolidates it into our platform for easy access and correlation with existing data.

Blacklist Feed

RiskIQ ingests blacklist feeds from internet service providers, phishing solutions, fraud prevention, and other internet security organizations to consolidate and further enrich our own, proprietary blacklists. Items on blacklists have been confirmed to be hosting some sort of malicious activity, such as phishing pages, malware, scams, or fraudulent goods.

Malware Feed

Malware feeds are provided by a number of internet security organizations and internet service providers, and they detail URLs and sites that host malicious files. Sometimes the sites included in these feeds are not something that a typical user would stumble onto while browsing the internet, but rather websites that are part of malicious redirection chains in compromised websites or advertisements.

Phishing Feed

Phishing feeds are provided by various email providers, internet security organizations, and internet service providers that have confirmed URLs to be hosting sites meant to steal user credentials. While RiskIQ gathers and confirms phishing pages in our crawls of the internet, additional phishing feeds ensure greater coverage for customers. RiskIQ also provides our confirmed phish URLs to Google Safe Browsing and Microsoft SmartScreen to help block phish from impacting other internet users.