What’s in a Web Crawl?

May 10, 2016, Brandon Dixon

Since releasing our host attribute dataset (pairs, components, trackers), we’ve gotten a lot of great feedback from our community. Users are reporting faster investigation times, more substantial connections and new research leads they wouldn’t have found otherwise. While these datasets are great, they are only a fraction of the data RiskIQ stores on a daily basis. What makes RiskIQ’s web crawling technology powerful is that it’s not just a simulation; it’s a fully instrumented browser. To understand what this means, we thought it would be useful to go behind the scenes and into a sample crawl response from the RiskIQ toolset.

Crawling Process

Understanding a crawl is fairly straightforward. Much as you digest data from the pages you browse online, RiskIQ’s web crawlers do largely the same, only faster, automated and built to store the entire chain of events. When crawlers process web pages, they take note of links, images, dependent content and other details to construct a sequence of events and relationships. Crawls are powered by an extensive set of configuration parameters that could dictate an exact URL starting point or something more complex like a search engine query.

Once most crawls have a starting point, they perform the initial crawl, take note of all the links within the page, and then crawl those follow-on pages, repeating the same process. To avoid crawling forever, most crawls have a depth limit that stops after 25 or so links outside the initial starting point. RiskIQ’s configuration allows different parameters to be set that dictate the operation of the crawl.
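The follow-the-links loop described above can be sketched as a breadth-first traversal with a depth limit. This is only an illustration, not RiskIQ’s implementation: the crawl function, the fetch_links callback and the in-memory SITE map are hypothetical stand-ins for a real browser fetch, and the URLs are made up.

```python
from collections import deque

def crawl(start_url, fetch_links, max_depth=25):
    """Breadth-first crawl that stops following links beyond max_depth.

    fetch_links is a hypothetical callback: given a URL, it returns the
    list of URLs linked from that page.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []  # (url, depth) in the order pages were crawled
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # depth limit reached: record the page, do not expand it
        for link in fetch_links(url):
            if link not in seen:  # never crawl the same URL twice
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages (made-up URLs).
SITE = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://d.example/"],
    "http://d.example/": ["http://a.example/"],
}

result = crawl("http://a.example/", lambda u: SITE.get(u, []), max_depth=1)
```

With max_depth=1 the crawl visits the starting page and its direct links but never reaches d.example, which sits two links away.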

Knowing that malicious actors are always trying to avoid detection, RiskIQ has invested a great deal of time and effort into its proxy infrastructure. This includes a combination of standard servers and mobile cell providers being used as egress points deployed all over the world. Having the ability to simulate, say, a mobile phone in the region where it’s being targeted means the RiskIQ crawlers have a higher likelihood of observing the full exploitation chain.

Collected Data

RiskIQ provides a web front-end for the raw data collected from a crawl and slices it up into the following sections. It’s worth noting that most, if not all, of the data presented in the interface is searchable via database queries, some of which power the PassiveTotal host attribute datasets.

Messages

If you ever view a web page in Google Chrome and fire up the developer tools, you might notice a few messages in the console. Sometimes these are leftover debugging statements from the site author and other times they are errors that were encountered when rendering the page. These messages are stored with the crawl details and presented in the raw form. While not always helpful, it’s possible that unique messages could be buried inside of the messages pane that may show signs of a common author or malicious tactic.

Dependent Requests

Most modern web pages are not created from just one local source. Instead, they are constructed from many different remote resources that get assembled to form a cohesive user experience. Requests for images, stylesheets, JavaScript files and other resources are logged under dependent requests and noted accordingly. Metadata from the request itself is associated with each entry and could include things like the HTTP response code, content length, cookie count, content header and load data.

Dependent requests are a great place to start when investigating a suspicious website. Seeing all the requests made to create the page and how it was loaded can sometimes reveal things like malvertising through injected script tags or iframes. Additionally, being able to see the resolving IP address means we can use that as another reference point for our investigation. Passive DNS data may reveal additional host names that have already been actioned by the team.

Cookies

Just like a standard browsing experience, RiskIQ web crawlers understand and support the concept of cookies. These are stored with the requests and include things like the original names and values, supported domains and whether or not the cookie has been flagged as secure. Often associated with legitimate web behavior, cookies are also used by malicious actors to keep track of infected victims or store data to be used later. Using the unique names or values of a cookie allows an analyst to begin making correlations that would otherwise be lost in datasets like passive DNS or web scraping.
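As a rough sketch of that kind of correlation, the snippet below groups hostnames by the cookie names observed on them and keeps only names shared by more than one host. The function name, the input shape and the sample cookie names are assumptions made for illustration; they do not reflect how RiskIQ stores or queries cookie data.

```python
from collections import defaultdict

def hosts_by_cookie_name(observations):
    """Group hostnames by cookie name.

    observations is an iterable of (hostname, cookie_name) pairs; this
    input shape is hypothetical.
    """
    index = defaultdict(set)
    for hostname, cookie_name in observations:
        index[cookie_name].add(hostname)
    # Keep only cookie names seen on more than one host, since those are
    # the ones that link otherwise unrelated infrastructure.
    return {name: sorted(hosts) for name, hosts in index.items() if len(hosts) > 1}

# Made-up observations for illustration.
observed = [
    ("a.example", "xk_victim_id"),
    ("b.example", "xk_victim_id"),
    ("b.example", "session"),
    ("c.example", "lang"),
]

shared = hosts_by_cookie_name(observed)
```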

Links

To thoroughly crawl a web page, RiskIQ crawlers need to understand the HTML DOM to extract additional links. Each A element shows the original href and text properties while also preserving the XPath location of where it was found within the DOM. While a bit abstract, having the XPath location allows an analyst to begin thinking about the structure of the web page in a different way. Longer path locations imply deeper nesting of link elements, while shorter locations could be top-level references.
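To make the XPath idea concrete, here is a minimal sketch that pulls the href, text and an XPath-style location for each A element using only Python’s standard library. It is an assumption-laden illustration rather than RiskIQ’s parser: the LinkExtractor class is hypothetical, and it assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are counted but not pushed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class LinkExtractor(HTMLParser):
    """Collect href, text and an XPath-style location for every <a> element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # steps such as "div[1]" for currently open elements
        self.counts = [{}]  # per-level sibling counters, keyed by tag name
        self.links = []
        self._current = None  # the link whose text is being collected

    def handle_starttag(self, tag, attrs):
        level = self.counts[-1]
        level[tag] = level.get(tag, 0) + 1
        if tag in VOID:
            return
        self.stack.append(f"{tag}[{level[tag]}]")
        self.counts.append({})
        if tag == "a":
            self._current = {
                "href": dict(attrs).get("href"),
                "text": "",
                "xpath": "/" + "/".join(self.stack),
            }
            self.links.append(self._current)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1].split("[")[0] == tag:
            self.stack.pop()
            self.counts.pop()
            if tag == "a":
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

page = ('<html><body><div><p>x</p><p><a href="/login">Sign in</a></p></div>'
        '<a href="http://example.com/">Home</a></body></html>')
parser = LinkExtractor()
parser.feed(page)
links = parser.links
# Nesting depth is simply the number of steps in the path.
depths = [link["xpath"].count("/") for link in links]
```

The first link sits five levels deep inside a paragraph, while the second hangs directly off the body at three levels, which is the kind of structural difference the path length exposes.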

Headers

Headers make it possible for the modern web to work properly. They dictate the rules of engagement and describe what the client is requesting and how the server responds. RiskIQ keeps both ends of the headers and preserves not only the keys and values but also the order in which they were observed. This process captures both standard and custom headers, which creates the opportunity for unique fingerprinting of specific servers or services.
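One simple way to picture order-based fingerprinting is to hash the sequence of header names while ignoring their values. The sketch below is an assumption for illustration only; the function name and the choice of a truncated SHA-256 digest are ours, not a description of RiskIQ’s method.

```python
import hashlib

def header_order_fingerprint(headers):
    """Hash the order of header names, ignoring their values.

    headers is a list of (name, value) tuples in the order observed.
    """
    names = [name.lower() for name, _ in headers]
    digest = hashlib.sha256(",".join(names).encode("utf-8")).hexdigest()
    return digest[:16]  # shortened for readability

# Made-up responses: same names and order, different values...
resp_a = [("Server", "nginx"), ("Content-Type", "text/html"), ("X-Custom", "1")]
resp_b = [("Server", "nginx"), ("Content-Type", "text/css"), ("X-Custom", "2")]
# ...and the same names in a different order.
resp_c = [("Content-Type", "text/html"), ("Server", "nginx"), ("X-Custom", "1")]

fp_a = header_order_fingerprint(resp_a)
fp_b = header_order_fingerprint(resp_b)
fp_c = header_order_fingerprint(resp_c)
```

Here fp_a and fp_b match because only the values differ, while fp_c differs because the server emitted the same headers in another order.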

Response and DOM

Arguably, what makes web crawling technology useful is its ability to preserve what a page looked like at a certain point in time. RiskIQ not only keeps the full HTML content from the crawled page but also saves any dependent file that was used in the loading process. This means things like stylesheets, images, JavaScript and other details are kept as well. Having all the contents local means that RiskIQ can recall or recreate the web page as it was when it was crawled. When it comes to malicious campaigns, it’s not uncommon for actors to keep their infrastructure up for only a short period to avoid attracting attention. From an analyst’s perspective, being able not only to see, but interact with a page that may no longer exist is invaluable in understanding the nature of the attack.

DOM Changes

Today’s web experience is largely built on dynamic content. Web pages are fluid and may change hundreds of times after their initial load. In fact, some web pages are simply shells that only become populated after a user has requested the page. If the RiskIQ web crawlers only downloaded the initial pages, many of them would appear blank or lack any substantial content. Because the crawlers act as a full browser, they can observe changes made to the structure of the page and log them in a running list. This log becomes extremely powerful when trying to determine what happened inside the browser after the page loaded. What might start off as a benign shell of a page may eventually become a platform for exploit delivery or the start of a long redirection sequence to a maze of subsequent pages.

Causes

If you think of a web crawl as a tree structure, you may have an initial trunk with multiple branches where each additional branch could have its own set of branches. The cause section of RiskIQ content subscribes to this idea of a tree and shows all of the web requests in the order in which they were called. Viewing the relationships to multiple web pages this way allows us to gain an understanding of the role of each page. The above example shows a long-running chain of redirection sequences starting off with bit.ly, then Google and eventually ending at Dropbox.

Data from the cause tree is what we use to create the host pair dataset inside of PassiveTotal. By seeing the full chain of events, we can be flexible in our detection methods. For example, if we had the final page of an exploit kit, we could use the cause chain to walk back up to the initial page that caused it. Along the way, this could reveal more malicious infrastructure or maybe even point to something like a compromised site.
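Walking back up a cause chain can be sketched with a simple child-to-parent map. Everything here is hypothetical: the parent_of structure, the walk_to_root helper and the URLs are made up for illustration and are not RiskIQ’s data model.

```python
def walk_to_root(url, parent_of):
    """Follow child -> parent links from a page back to the crawl's start.

    parent_of maps each requested URL to the URL that caused it.
    """
    chain = [url]
    seen = {url}
    while url in parent_of:
        url = parent_of[url]
        if url in seen:  # guard against a malformed, cyclic map
            break
        seen.add(url)
        chain.append(url)
    return chain

# Made-up cause data: a shortener redirects to a landing page, which loads
# a script that finally pulls in the exploit page.
parent_of = {
    "http://exploit.example/kit": "http://cdn.example/loader.js",
    "http://cdn.example/loader.js": "http://landing.example/",
    "http://landing.example/": "http://short.example/abc",
}

chain = walk_to_root("http://exploit.example/kit", parent_of)
```

The returned list runs from the final page up to the initial URL, so each intermediate entry is a candidate for further investigation.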

Crawls for the Masses

As you can see from the above screenshots and explanations, RiskIQ has a lot of data that can be used for analysis purposes. Our goal at PassiveTotal is to introduce this data in a format that is easy to use and understand. So far, we have released host pairs, trackers and web components, all of which are derived from web crawling.

Host Pairs

In the “cause” section of the crawl, it was easy to see how all the links related to each other and formed a tree structure. By keeping this structure, it’s possible for us to derive a set of host pairs based on the redirection sequences. For example, one of the bitly links in the screenshot redirects to an Amazon hosted page using an HTTP 302. This would create the host pair of the parent being the bitly link and child being the Amazon link.
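A minimal sketch of that derivation, assuming each edge of the cause tree is available as a (parent URL, child URL, cause) tuple. The tuple layout, the cause labels and the choice to skip same-host edges are our assumptions for illustration, and the URLs are made up.

```python
from urllib.parse import urlsplit

def host_pairs(cause_edges):
    """Turn (parent_url, child_url, cause) edges into parent/child host pairs."""
    pairs = []
    for parent_url, child_url, cause in cause_edges:
        parent = urlsplit(parent_url).hostname
        child = urlsplit(child_url).hostname
        # Assumption: an edge within a single host is not reported as a pair.
        if parent and child and parent != child:
            pairs.append({"parent": parent, "child": child, "cause": cause})
    return pairs

# Made-up edges: one cross-host redirect, one same-host image load.
edges = [
    ("http://short.example/abc", "https://shop.example/item?id=1", "HTTP 302"),
    ("https://shop.example/item?id=1", "https://shop.example/logo.png", "img"),
]

pairs = host_pairs(edges)
```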

Trackers

Since RiskIQ stores all of the page content from the web crawls, it’s easy to go back in time to extract content from the DOM. Trackers are generated from both inline processing during the crawl and post-processing of data based on extraction patterns. Using regular expressions or code-defined logic, we can extract codes from the web page content and pull the timestamp of the occurrence and the hostname it was associated with.
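As an illustration of the regular expression approach, the sketch below pulls classic Google Analytics property IDs (the UA-XXXXXXXX-Y form) out of stored page content and attaches a hostname and timestamp. The pattern, the record fields and the type label are assumptions chosen for the example and do not represent RiskIQ’s extraction rules.

```python
import re
from datetime import datetime, timezone

# Assumed pattern for classic Google Analytics IDs such as UA-12345678-1.
GA_ID = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")

def extract_trackers(hostname, html, seen_at):
    """Return one record per distinct tracking code found in the page."""
    codes = sorted(set(GA_ID.findall(html)))
    return [
        {
            "hostname": hostname,
            "type": "google_analytics_id",  # hypothetical label
            "value": code,
            "seen": seen_at.isoformat(),
        }
        for code in codes
    ]

# Made-up page content and crawl time.
html = "<script>ga('create', 'UA-12345678-1', 'auto');</script>"
when = datetime(2016, 5, 10, tzinfo=timezone.utc)
records = extract_trackers("www.example.com", html, when)
```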

Web Components

Creating the web components dataset is a combination of header and DOM content parsing similar to that of the trackers dataset. Regular expressions and code-defined logic aid in extracting details about the server infrastructure and other details like which web libraries were associated with a page.
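The same idea can be sketched for components: one pattern reads a name and version from the Server header, another looks for versioned library file names in the DOM. The patterns, the two library names and the tuple layout are assumptions for illustration, not the rules behind the actual dataset.

```python
import re

# Assumed patterns: "nginx/1.9.12" style Server values and
# "jquery-1.12.0.min.js" style script file names.
SERVER_RE = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z0-9_-]*)(?:/(?P<version>\d+(?:\.\d+)*))?"
)
SCRIPT_RE = re.compile(
    r"(?P<name>jquery|bootstrap)[-.](?P<version>\d+(?:\.\d+)+)", re.IGNORECASE
)

def components_from_headers(headers):
    """Extract (category, name, version) from a list of (name, value) headers."""
    found = []
    for name, value in headers:
        if name.lower() == "server":
            match = SERVER_RE.match(value)
            if match:
                found.append(("server", match.group("name"), match.group("version")))
    return found

def components_from_dom(html):
    """Extract (category, name, version) for known libraries referenced in HTML."""
    return [
        ("library", m.group("name").lower(), m.group("version"))
        for m in SCRIPT_RE.finditer(html)
    ]

# Made-up response data.
headers = [("Content-Type", "text/html"), ("Server", "Apache/2.4.18 (Ubuntu)")]
html = '<script src="/js/jquery-1.12.0.min.js"></script>'
components = components_from_headers(headers) + components_from_dom(html)
```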

Going Forward

We see the current host attributes dataset as just the beginning of what’s possible. In the coming months, we are going to start exploring how we can offer more data from the web crawls to our users in our web interface, API, and third-party integrations. Additionally, we are thinking of ways to expose our crawling capability directly to our community of users. In our next post, we will explore a crimeware campaign used to infect users with malicious Chrome Extensions through an extensive web process.