Introduction and Data
We discussed what ML opt-out protocols are in our previous blog post. We explored a few different ways to opt out, as well as some associated emerging protocols, and how these fit into Common Crawl’s processes. Our mission is to make web data accessible to everyone, and to do so in an ethical, responsible fashion, so it is important for us to try to discern the wishes of those who own that data. We decided to investigate the prevalence of some of these protocols by taking a deeper look at our WARC files and measuring what proportion of domains use each opt-out protocol.
First, we need data in which to look for usage of the various opt-out protocols. We need a dataset large enough to be representative, but which can also be processed with a reasonable amount of resources. This dataset should also contain high-quality data and avoid possible biases, for instance where too many subdomains use the same template and therefore give a distorted picture of the use of a particular protocol.
Fortunately, as part of the pre-processing for our web crawls, we run a pre-crawl to gather candidate URLs for fetching. As a starting point, this takes a list of the top hosts and domain names from our latest Web Graph. From there we do a few iterations of crawling with Apache Nutch™ and harvest URLs, some of which will be part of the next crawl.
The data that we used for these experiments is two iterations of the output of our internal pre-crawl seed generation, which takes the familiar form of a list of WARC files. The first iteration is the pre-crawl seed WARC files for October (Week 40 of 2023, ~134.0 TiB) and the second is for December (Week 50 of 2023, ~1008 GiB). We will refer to them as seed-crawl/CC-MAIN-2023-40 and seed-crawl/CC-MAIN-2023-50 respectively. Using two iterations is important because it gives us more insight into how the prevalence of each opt-out protocol may have changed over time.
Our hope is that these experiments will provide valuable insights into how to acknowledge opt-out protocols, increase the visibility of emerging initiatives, and act as a stepping-stone for your own projects.
Let’s highlight some experiments that we have done on emerging opt-out protocols. To read more about these protocols, you can also take a look at our previous blog post.
Methodology
In order to perform our experiments, we decided to use Apache Spark™ (specifically the Python API, PySpark) to process the data. Spark is a distributed computing framework typically used to process large amounts of data efficiently. Our project cc-pyspark is based on PySpark and allows us to create separate PySpark jobs for specific tasks. We then compare the results of our PySpark jobs to the results produced by a StormCrawler-based topology (which tells us how often different meta tags, headers, and annotations appear) run on the same dataset. The StormCrawler-based topology is simply used to confirm that the figures obtained from Spark are correct.
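To give an idea of what these jobs look like, here is a rough sketch of a header-counting job in the style of cc-pyspark. It follows the CCSparkJob conventions from that repository, but the class and field names here are illustrative rather than the exact code we ran.

```python
# Illustrative sketch only: modeled on the example jobs in cc-pyspark,
# not the exact job used for the experiments described in this post.
from sparkcc import CCSparkJob


class HeaderCountJob(CCSparkJob):
    """Count how often each HTTP response header name appears in WARC records."""

    name = "HeaderCount"

    def process_record(self, record):
        # Only HTTP responses carry the headers we are interested in
        if record.rec_type != 'response' or record.http_headers is None:
            return
        for header_name, _value in record.http_headers.headers:
            # Emit (header_name, 1) pairs; the counts are aggregated downstream
            yield header_name.lower(), 1


if __name__ == '__main__':
    job = HeaderCountJob()
    job.run()
```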
Experiments
Robots Exclusion Protocol
A commonly used opt-out method is the robots.txt part of the Robots Exclusion Protocol. We discuss this in more detail in our previous post about opt-out protocols, but the overall idea is that one is able to specify permissions for different User Agents (which refer to specific crawlers, such as Google’s “Googlebot” or our own “CCBot”). We decided to measure the use of this mechanism to opt out of data mining for ML purposes. Some of these User Agents include Google-Extended (used for training Bard/Gemini) and GPTBot (used for OpenAI products). We used two different methodologies to learn more about the prevalence of these different User Agents within our crawl data.
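As an illustration (not taken from any particular site), a robots.txt file opting out of these ML crawlers could contain entries like the following:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /
```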
We used the code in the cc-pyspark repository to process our data. First, we wrote a CCSparkJob which iterates over robots.txt files, looks for any line starting with “User-agent:”, and extracts the token that follows (the specified User Agent, such as “CCBot”, or the wildcard “*”). This allows us to count each time a specific User Agent is listed in a robots.txt file. It is important to note that listing an agent does not necessarily mean it is barred from crawling, because in addition to “disallow” rules one can also specify “allow” rules; for our purposes, however, we assume that a listing implies a “disallow” statement, as that is the typical usage, and typical crawler behavior assumes allowance unless otherwise specified. We saved the frequency of each token that we found in our datasets. Let’s look at some stats from the PySpark job results over the 20,038,781 robots.txt records that we processed in the seed-crawl/CC-MAIN-2023-40 dataset:
As expected, we have mentions of GPTBot and a much smaller number of mentions of the anthropic-ai crawler. Surprisingly, there is no mention of “Google-Extended”. Now let’s look at the results of the Spark job over the 30,181,129 robots.txt records in our seed-crawl/CC-MAIN-2023-50 dataset:
We can see similar (slightly higher) proportions of mentions of GPTBot, anthropic-ai, and CCBot, and suddenly many mentions of Google-Extended. This can likely be attributed to the fact that Google-Extended was announced at the end of September, only shortly before the CC-MAIN-2023-40 dataset was created, leaving little time for site owners to react before that crawl; by the time of the later crawl, many more people had seen the announcement.
HTTP Headers
As discussed in our previous blog post, another commonly used opt-out method is to use HTTP headers. We also experimented with finding metrics about this opt-out method in a similar way, using PySpark.
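As a purely illustrative example, a server reserving its text and data mining rights via the TDM Reservation Protocol’s HTTP headers might return a response along these lines (the policy URL is a placeholder):

```
HTTP/1.1 200 OK
Content-Type: text/html
tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json
```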
Similarly to how we processed the robots.txt stats, we used code in the cc-pyspark repository to process the data. We first wrote a CCSparkJob which iterates over WARC files from our pre-crawl seed datasets. This Spark job looks at the HTTP headers in each WARC record and counts the frequency of each header. From running this job over the 19,689,733 records in seed-crawl/CC-MAIN-2023-40, we can investigate statistics about some of the HTTP headers used for ML opt-out, such as those in the TDM Reservation Protocol:
We can see that a very small (but non-zero) portion of the records from the 2023-40 dataset is using these headers. Now, let’s take a look at the results for the 31,499,359 records in the seed-crawl/CC-MAIN-2023-50 dataset:
The newer dataset shows a proportionally higher adoption rate of the TDM Reservation Protocol, an increase that we expected.
HTML Metadata
As discussed in our previous blog post, a third opt-out method is via meta tags in a website’s HTML. As with the other experiments, we used PySpark and StormCrawler for the two stages of the experiment.
As with the other experiments, we used cc-pyspark by writing a CCSparkJob to iterate over the WARC files. While doing so, it keeps count of the (name, content) pairs for the HTML meta tags that it encounters. For example, one instance of a tag like the following (an illustrative example):
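```html
<meta name="tdm-reservation" content="1">
```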
would result in the job counting the pair:
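```
("tdm-reservation", "1")
```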
The meta tags that we are interested in are those with the names “tdm-reservation” and “tdm-policy”. Here are the results from going through the 19,689,733 records in seed-crawl/CC-MAIN-2023-40:
We can see that there is a small (but non-zero) number of TDM Reservation Protocol implementations in this dataset. Now, let’s take a look at the results for the 31,499,359 records in the seed-crawl/CC-MAIN-2023-50 dataset:
Again, a small but non-zero number of occurrences were found.
Opting Out via Additional Files
As discussed in our previous blog post, another opt-out method is to add additional files to your web server. To investigate the prevalence of this method we used the StormCrawler project, because of technical limitations with the Common Crawl crawler (robots.txt is currently the only additional file that we fetch). For this experiment we analyzed the file ./.well-known/tdmrep.json from the TDM Reservation Protocol. We found that there were 263 hits out of the total 9,201,298 records scanned (roughly 0.003%).
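For context, a minimal ./.well-known/tdmrep.json reserving TDM rights for an entire site might look roughly like the following; this is a sketch based on our reading of the TDM Reservation Protocol, and the policy URL is a placeholder:

```json
[
  {
    "location": "/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]
```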
Concluding Thoughts & Acknowledgements
In conclusion, it is not yet clear which of these emerging protocols will be widely adopted. As we saw from comparing the results of seed-crawl/CC-MAIN-2023-40 and seed-crawl/CC-MAIN-2023-50, adoption of these protocols is starting to increase. Each of these protocols is valuable, and at the very least they will pave the way for how content usage can be controlled. In the meantime, the Robots Exclusion Protocol is the tried and tested approach, which we at CCF follow diligently, and we encourage everyone to use it.
This work would not have been possible without the work from OpenWebSearch to create the commoncrawl-parser, which we have used extensively for our experiments.
Apache Nutch™ and Apache Spark™ and their respective logos are trademarks of the Apache Software Foundation.