You can download the dataset and run lookups and statistics that wouldn’t be possible using the Censys Web UI or REST API.

The repository is not limited to data from Censys. Other organizations and researchers can upload the results of their own research.

It is the latter point that this post will focus on. I provide technical details and practical use cases (mainly for reconnaissance).

Rapid7 runs its own Internet-wide research called Project Sonar. An often forgotten fact is that data from Project Sonar is not included in Censys Web UI results.

DNS

Let’s first discuss how DNS is scanned at the Internet-wide level. If you are not interested in such technical details, feel free to skip directly to DNS usecases.

Censys runs a port scan on port 53, which detects DNS servers and resolvers around the globe. Censys also does some form of “health check”: it tries to resolve the A record for a specific domain name (c.afekv.com) to see whether the DNS resolver returns the correct value. This method can reveal rogue public DNS resolvers. This is an example of a result that Censys provides for 8.8.8.8. If you are interested in researching open DNS resolvers, I recommend checking public-dns.info.

There is, however, another side of DNS scanning. While Censys takes a scanning approach that operates on the transport layer (L4), Project Sonar scans DNS at the application layer (L7). In other words, Project Sonar provides a dataset which contains domain names and their corresponding resource record values gathered at some point in time. This creates entirely new visibility into DNS from a global perspective. We can cluster domain names which resolve to the same IP address, look for DGAs, see the most popular mail server providers, and much more.

Project Sonar starts by collecting a large number of possible domain names (even non-existing ones). This is an essential step because we cannot ask an IP address for all the domain names pointing to it. DNS just doesn’t work this way. Sources of these domain names include:

Dumps of TLD zone files (.com, .net, ccTLDs, …)

Domains found in responses from HTTP study

CN and SubjectAltName fields in X.509 certificates from SSL study

While the documentation doesn’t provide a full list of sources, there is a good chance that third-party threat intel sources are used for gathering additional domain names. Note that this list includes not only second-level domain names but also subdomains and higher-level domain names (FQDNs). On the other hand, the list is by no means complete. There is no way we can find every possible FQDN on the Internet.

Once the list of all potential domain names is gathered, the resolution part starts. The resolution takes a domain name and queries its DNS server to get record values such as a, mx, soa, …

Now, here it gets a little complex. To decrease network traffic, Project Sonar uses the ANY “meta-query” type to get all possible record values at once. ANY tells the DNS server: “Give me every record you have for this domain name”. Unfortunately, ANY queries also increase the load on DNS servers, and providers such as Cloudflare have stopped responding to them. As an alternative to ANY, each record type can be queried one by one (e.g., first A, then AAAA, then MX, ...). Project Sonar is slowly moving towards this kind of resolution, as explained further below.

After the resolution phase, the final dataset is produced. It is a GZ-compressed file of ~23GB. Its structure is pretty straightforward:

Notice type:hinfo. This is usually an indication of ANY query refusal. Although hinfo tells us that the domain refused to respond to the ANY query, we at least know that the domain name is active.
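To make the record structure and the hinfo signal concrete, here is a minimal Python sketch (Python rather than jq, purely for illustration). It assumes every line of the dataset is one JSON object with the name, type and value fields discussed in this post; the helper names and the fdns.json.gz default path are only illustrative.

```python
import gzip
import json
from collections import Counter
from typing import Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSON line, skipping blank or unparsable lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(rec, dict):
            yield rec


def summarize(lines: Iterable[str]):
    """Count record types and collect names that answered with hinfo."""
    type_counts = Counter()
    any_refused = set()
    for rec in read_records(lines):
        rtype = rec.get("type")
        type_counts[rtype] += 1
        # Assumption from the text: hinfo usually means the ANY query was refused.
        if rtype == "hinfo":
            any_refused.add(rec.get("name"))
    return type_counts, any_refused


def open_dataset(path: str = "fdns.json.gz"):
    """Open the compressed dataset as a stream of text lines."""
    return gzip.open(path, "rt", encoding="utf-8", errors="replace")
```

With the dataset in place, something like `with open_dataset() as f: counts, refused = summarize(f)` would stream the whole file once without decompressing it to disk.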

Rapid7 realized this problem and, beginning in November 2017, started producing separate A and AAAA datasets. What is the difference? Resolution uses an A/AAAA query directly instead of an ANY query. This way, domains which refused to respond to ANY now provide a valid record value. These datasets, therefore, provide only records with type a or aaaa respectively, whereas the regular datasets provide combined records with ns, mx, soa, ...

The Forward DNS dataset only includes domain names which were ACTIVE during their resolution.

Similar to its forward DNS counterpart, the reverse DNS study provides PTR records collected for the IPv4 address space. PTR records can potentially reveal some hidden subdomains. These subdomains are fed back to the master domain list which is used for the Forward DNS study.

DNS usecases

After the technical details, here are some practical examples that I use quite often. Snippets assume that you have jq installed and available in PATH. The Forward DNS dataset is expected to be saved in the current working directory and named fdns.json.gz. Since the dataset is pretty huge and compressed, expect these snippets to take several minutes to finish.

1. Reverse DNS lookup
We can retrieve a list of domain names which point to some IP address / netrange. Note, however, that we are using the Forward DNS dataset, since PTR records usually reveal far fewer of the domain names associated with an IP address. This is often used to verify whether a particular IP address is hosting multiple virtual hosts.

Note that we don't need to parse every JSON record to get results. Since an IPv4 address appears as a value only in A records, it is sufficient to grep the raw text and post-process the results.
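The same grep-then-post-process idea can be sketched in Python. This is only an illustration under the assumption that each line is a JSON object with name, type and value fields; `names_for_ip` is a hypothetical helper, not part of any tool mentioned here.

```python
import json
from typing import Iterable, List


def names_for_ip(lines: Iterable[str], ip: str) -> List[str]:
    """Return the sorted domain names whose A record value equals `ip`."""
    # Cheap raw-text prefilter: the quotes keep 1.2.3.4 from matching 11.2.3.45.
    needle = f'"{ip}"'
    names = set()
    for line in lines:
        if needle not in line:
            continue
        # Only the few matching lines are actually parsed as JSON.
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if rec.get("type") == "a" and rec.get("value") == ip:
            names.add(rec.get("name"))
    return sorted(n for n in names if n)
```

The lines can come from `gzip.open("fdns.json.gz", "rt")`. More than one name in the result suggests that the address serves several virtual hosts.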

2. Subdomain enumeration
Subdomain enumeration is one of the essential steps performed during the information gathering / reconnaissance phase. It can reveal high-value and forgotten web applications and services that an organization is exposing publicly on the Internet. You can read about subdomain enumeration in another post of mine. Among the more recent tools, I recommend checking amass.

Nevertheless, the Forward DNS dataset can also be used for subdomain enumeration (with very good results):

Note that we now need to parse the JSON and test the name field directly. Why? A CNAME record may contain the searched domain name in its value field instead of its name field, so a plain text match would also return records whose name belongs to a different domain. This might seem counterintuitive, since we are looking for all possible subdomains and a subdomain in the value field is one of them. Notice, however, that we extract only the name value from each record in the second line. Imagine this example:
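A small Python sketch of such a case, assuming JSON lines with name, type and value fields. The domain names are made up and `subdomains` is a hypothetical helper: a text match on ".example.com" hits the CNAME record too, but its name belongs to another domain, so only the name field decides.

```python
import json
from typing import Iterable, List


def subdomains(lines: Iterable[str], domain: str) -> List[str]:
    """Return sorted names that are proper subdomains of `domain`."""
    suffix = "." + domain.lower().rstrip(".")
    found = set()
    for line in lines:
        # Raw-text prefilter, as with the reverse lookup.
        if suffix not in line.lower():
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        name = str(rec.get("name", "")).lower().rstrip(".")
        # Test the name field only; a match in the value field is not enough.
        if name.endswith(suffix):
            found.add(name)
    return sorted(found)


records = [
    '{"name":"shop.example.com","type":"a","value":"192.0.2.10"}',
    # Matches ".example.com" as text, but the name is in another domain.
    '{"name":"www.other.test","type":"cname","value":"cdn.example.com"}',
]
result = subdomains(records, "example.com")
```

Here `result` contains only shop.example.com. Extracting name from the CNAME record would have returned www.other.test, which is not a subdomain of example.com.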

3. List all IP addresses for domain
Since the cloud is getting more and more prevalent, organizations are slowly moving from dedicated netranges to shared pools of IP addresses owned by a cloud provider. How do we find which cloud providers and IP addresses an organization is using? You guessed it.

In order to find every possible IP address that some subdomain of example.com points to, we can run:
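As an illustration in Python (not the jq one-liner), assuming JSON lines with name, type and value fields; `subdomain_ips` is a hypothetical helper name:

```python
import ipaddress
import json
from typing import Iterable, List


def subdomain_ips(lines: Iterable[str], domain: str) -> List[str]:
    """Return unique IPv4 addresses that subdomains of `domain` resolve to."""
    suffix = "." + domain.lower().rstrip(".")
    ips = set()
    for line in lines:
        if suffix not in line.lower():
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if rec.get("type") != "a":
            continue
        name = str(rec.get("name", "")).lower().rstrip(".")
        # Proper subdomains only; the bare domain does not end with the suffix.
        if not name.endswith(suffix):
            continue
        try:
            ips.add(ipaddress.IPv4Address(rec.get("value")))
        except ValueError:
            continue  # skip values that are not valid IPv4 addresses
    return [str(ip) for ip in sorted(ips)]
```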

For the sake of simplicity, the snippet only retrieves the IP addresses of subdomains and not of example.com itself. You can then use whois against the Team Cymru servers to determine the name of the ASN to which an IP address belongs:

The list of domain names can also be used to detect malicious domains. One of the latest ideas I am playing around with is using phishing_catcher on the Forward DNS dataset.
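A hedged Python sketch of the same lookup over a raw socket. It assumes the Team Cymru bulk whois interface at whois.cymru.com on port 43, a query wrapped in begin/verbose/end, and a pipe-separated verbose reply whose first, second and last columns are the AS number, the IP address and the AS name. Check the service documentation before relying on that layout; the function names are illustrative.

```python
import socket
from typing import Dict, Iterable, List


def build_bulk_query(ips: Iterable[str]) -> str:
    """Build a bulk whois query (assumed format: begin / verbose / IPs / end)."""
    return "begin\nverbose\n" + "\n".join(ips) + "\nend\n"


def parse_bulk_response(text: str) -> List[Dict[str, str]]:
    """Parse pipe-separated reply lines; header and odd lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|", 6)]
        # Assumed verbose layout: AS | IP | prefix | CC | registry | date | AS name
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        rows.append({"asn": fields[0], "ip": fields[1], "as_name": fields[6]})
    return rows


def lookup(ips: Iterable[str], host: str = "whois.cymru.com",
           port: int = 43, timeout: float = 30.0) -> List[Dict[str, str]]:
    """Send one bulk query and return the parsed rows (needs network access)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_bulk_query(ips).encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return parse_bulk_response(b"".join(chunks).decode("utf-8", errors="replace"))
```

Sending all addresses in one bulk query keeps this to a single connection, which is kinder to the service than one lookup per IP.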

SSL

Unlike the DNS dataset, the SSL study provides files in CSV format (without a header). The SSL study contains full raw certificates as they were observed during the SSL handshake. Keep in mind that this is a linear scan of the IPv4 address space, and the dataset includes nowhere near all SSL certificates on the Internet. The reason is Server Name Indication (SNI). SNI was introduced to allow multiple separate certificates (not just additional subjectAltName entries) to be present on a single machine. SNI requires specifying the domain name upfront, so the server knows which certificate to respond with. Since the SSL study doesn't provide a domain name in the SSL handshake, not all certificates are observed during this scan.

The SSL dataset consists of four components. Certificates are hashed using SHA-1, and each of the four components links the hash to different information:

Names
This file has two columns: the first is the SHA-1 hash and the second is the Common Name or one of the SubjectAltName entries. It can be used to cluster domain names that share the same certificate without parsing each certificate first. These domains are then fed back to the master file for the Forward DNS study.
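The clustering can be sketched in a few lines of Python, assuming a header-less CSV with the SHA-1 hash in the first column and a name in the second, as described above. `cluster_names` is a hypothetical helper.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def cluster_names(rows: Iterable[str]) -> Dict[str, List[str]]:
    """Group names by certificate hash, keeping hashes shared by 2+ names."""
    clusters = defaultdict(set)
    for row in csv.reader(rows):
        if len(row) < 2:
            continue  # skip malformed lines
        sha1, name = row[0].strip(), row[1].strip()
        clusters[sha1].add(name)
    return {h: sorted(n) for h, n in clusters.items() if len(n) > 1}
```

Any iterable of text lines works as input, for example an open file object. Keeping every cluster in memory is acceptable for a sketch, but a full Names file may need a disk-backed approach.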

Certificates
This file has two columns: the first is the SHA-1 hash and the second is the Base64-encoded X.509v3 certificate in PEM. Note that the Base64 string is not standardized, and most parsers won't work with it directly. I have created a simple parser for this purpose.
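One plausible way to make such a value digestible for standard tooling is sketched below. It assumes the column holds the certificate as a single Base64 string without the BEGIN/END armor or 64-character line wrapping; this is not the parser mentioned above, and `to_pem` is a hypothetical name.

```python
import base64


def to_pem(b64_cert: str) -> str:
    """Re-wrap a bare Base64 certificate string as a regular PEM block."""
    compact = "".join(b64_cert.split())
    # Restore padding that may have been stripped (assumption).
    compact += "=" * (-len(compact) % 4)
    # Decode and re-encode so that invalid input fails early.
    der = base64.b64decode(compact, validate=True)
    body = base64.b64encode(der).decode("ascii")
    lines = [body[i:i + 64] for i in range(0, len(body), 64)]
    return (
        "-----BEGIN CERTIFICATE-----\n"
        + "\n".join(lines)
        + "\n-----END CERTIFICATE-----\n"
    )
```

The resulting text can then be handed to any tool that accepts PEM certificates.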

Endpoints
This file has three columns: the first is an IP address, the second is a port, and the third is the SHA-1 hash of the certificate that was observed on that IP address. As you can see, this is very similar to the Hosts file.

In some SSL datasets, I noticed that even though the SHA-1 of a certificate is present in the Hosts file, it is missing from the Certificates file.
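Because both files share the SHA-1 hash, Endpoints can be joined with Names to see which domain names a given IP address and port presented a certificate for. A minimal Python sketch, assuming the header-less column layouts described above; `names_by_endpoint` is a hypothetical helper.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def names_by_endpoint(
    endpoint_rows: Iterable[str], name_rows: Iterable[str]
) -> Dict[Tuple[str, str], List[str]]:
    """Map (ip, port) to the names found in certificates seen on that endpoint."""
    # Index the Names file by hash: sha1 -> set of names.
    names = defaultdict(set)
    for row in csv.reader(name_rows):
        if len(row) >= 2:
            names[row[0].strip()].add(row[1].strip())

    result = defaultdict(set)
    for row in csv.reader(endpoint_rows):
        if len(row) < 3:
            continue  # skip malformed lines
        ip, port, sha1 = (col.strip() for col in row[:3])
        # Hashes without a Names entry simply contribute nothing.
        result[(ip, port)] |= names.get(sha1, set())
    return {key: sorted(vals) for key, vals in result.items()}
```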

Project Sonar provides other studies such as:

HTTP: HTTP responses for the full IPv4 address space. Note, however, that this scan is based on IP addresses. VHosts are not taken into account, so the final dataset is heavily limited by that. Links in the returned HTML are used to feed the master list for the Forward DNS study.