December 5, 2025

A Sampling of 2025 Research Referencing Common Crawl

Greg Lindahl
Greg is the Chief Technology Officer at the Common Crawl Foundation.

As another year here at Common Crawl comes to a close, we present a dozen papers (selected from the thousands published in 2025) that demonstrate the range of topics and areas of study for which Common Crawl’s datasets and statistics are used and referenced. For more papers citing Common Crawl’s data, which has been regularly collected since 2008, see Research Papers and cc-citations, our curated BibTeX database.

Classification of Worldwide News Articles by Perceived Quality, 2018-2024

“This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. 3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024.”

https://arxiv.org/abs/2511.16416
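
The paper benchmarks classical machine-learning classifiers against deep-learning models on news text. As a rough illustration of the classical side only, here is a minimal scikit-learn sketch; the TF-IDF features, logistic-regression model, and toy labels are our own assumptions, not the paper's exact setup.

```python
# Minimal sketch of a perceived-quality news classifier.
# Assumptions: TF-IDF features + logistic regression; the paper's models differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: article text paired with a 0/1 perceived-quality label.
articles = ["Carefully sourced report on local elections ...",
            "SHOCKING trick doctors don't want you to know ..."]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(articles, labels)
print(clf.predict(["New article to score ..."]))
```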

Combating Health Misinformation With Fusion-Based Credible Retrieval Techniques

“This study aims to combat health misinformation by enhancing the retrieval of credible health information using effective fusion-based techniques. … The datasets for these events are based on the CommonCrawl News dataset”

https://journals.sagepub.com/doi/10.1177/14604582251388860
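
The abstract highlights fusion-based retrieval. One widely used way to fuse several ranked lists is reciprocal rank fusion (RRF); the sketch below shows the general idea and is not necessarily the specific fusion technique the paper evaluates.

```python
# Reciprocal rank fusion (RRF): combine several ranked lists of document IDs.
# This is a generic fusion baseline, not necessarily the paper's method.
from collections import defaultdict

def rrf(rankings, k=60):
    """rankings: list of lists of doc IDs, each ordered best-first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: two retrievers disagree on ordering; RRF merges their rankings.
print(rrf([["d1", "d2", "d3"], ["d3", "d1", "d4"]]))
```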

Geospatiality: The Effect of Topics on the Presence of Geolocation in English Text Data

“This study investigates the relationship between texts’ thematic categories and their likelihood of containing usable geolocation information by quantifying and modelling this relationship across seven diverse English text datasets of different types, including web forums, microblogs, news, and magazines.” The study uses Common Crawl’s Distribution of Languages statistics.

https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2460051#abstract

High-Fidelity Simultaneous Speech-To-Speech Translation

Introduces Hibiki, “a decoder-only model for simultaneous speech translation.” The “training dataset is made of filtered web pages from Common Crawl, as well as curated sources such as Wikipedia, StackExchange or scientific articles and it contains 12.5% of multilingual documents.”

https://arxiv.org/abs/2502.03382

Optimising Web Accessibility Evaluation: Population Sourcing Methods for Web Accessibility Evaluation

“We present a tool-supported framework, OPTIMAL-EM, that runs parallel to the Website Accessibility Conformance Evaluation Methodology (WCAG-EM). We aim to optimise web accessibility evaluation through the targeted use of automated tools and human evaluation to audit a more representative set of pages.” The study uses four page-population sourcing approaches, including Common Crawl.

https://www.sciencedirect.com/science/article/pii/S1071581925000291?via%3Dihub

Paraphrase Detection for Urdu Language Text Using Fine-Tune BiLSTM Framework

“This research proposes a novel bidirectional long short-term memory (BiLSTM) framework to address Urdu paraphrase detection’s intricacies.” … The study “incorporates the GloVe approach for embedding words with 50 dimensions. [It uses] Common Crawl pre-trained vectors trained on a large amount of web-based text (42 billion tokens, 1.9 million words, 50 d vectors).”

https://www.nature.com/articles/s41598-025-93260-6
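
The GloVe vectors pre-trained on Common Crawl text ship as a plain text file with one word followed by its floats per line. A minimal loader might look like the sketch below; the filename is a placeholder for whichever GloVe release the study actually used.

```python
# Minimal sketch: load GloVe vectors (pre-trained on Common Crawl text) from the
# standard "word float float ..." text format. The path below is a placeholder.
import numpy as np

def load_glove(path):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

vectors = load_glove("glove.common-crawl.txt")  # hypothetical local file
print(len(vectors), next(iter(vectors.values())).shape)
```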

Reinforced Disentangled HTML Representation Learning with Hard-Sample Mining for Phishing Webpage Detection

“This study introduces a reinforced Triplet Network to optimize disentangled representation learning tailored for phishing detection…The datasets used in this study include benign data from Common Crawl and phishing data from Phishtank and Mendeley Data.”

https://www.mdpi.com/2079-9292/14/6/1080
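
The paper's “reinforced Triplet Network” is beyond a short snippet, but the core triplet objective is easy to illustrate: pull an anchor embedding toward a same-class example and push it away from a different-class one. The toy encoder and random features below are placeholders, not the paper's architecture.

```python
# Minimal triplet-loss sketch: anchor HTML embedding pulled toward a "positive"
# page of the same class and pushed away from a "negative" page.
# The tiny encoder and random features are placeholders, not the paper's model.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 64))
criterion = nn.TripletMarginLoss(margin=1.0)

# Hypothetical pre-extracted features for anchor/positive/negative pages.
anchor, positive, negative = (torch.randn(8, 768) for _ in range(3))
loss = criterion(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
print(float(loss))
```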

Semantic Annotation Model and Method Based on Internet Open Dataset

“[T]his paper deeply studies the semantic annotation model and method based on internet open datasets, aiming to improve annotation efficiency and accuracy and promote data resource sharing and utilization. This paper selects Common Crawl dataset to provide sufficient training samples; methods such as removing stop words and deduplication are used to preprocess data to improve data quality; a keyword extraction model based on heuristic rules and text context is constructed.”

https://www.igi-global.com/gateway/article/370966
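
The two preprocessing steps the abstract names, stop-word removal and deduplication, can be sketched in a few lines; the stop-word list and exact-match dedup key here are simplifications of what a production pipeline would use.

```python
# Sketch of the two preprocessing steps named in the abstract:
# stop-word removal and deduplication. The stop-word list is illustrative.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def preprocess(docs):
    seen, cleaned = set(), []
    for doc in docs:
        tokens = [t for t in doc.lower().split() if t not in STOP_WORDS]
        key = " ".join(tokens)
        if key and key not in seen:      # exact-match deduplication
            seen.add(key)
            cleaned.append(key)
    return cleaned

print(preprocess(["The cat sat", "the CAT sat", "A dog barked"]))
```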

Scalable Private Partition Selection via Adaptive Weighting

Proposes “an algorithm for this problem, MaxAdaptiveDegree (MAD), which adaptively reroutes weight from items with weight far above the threshold needed for privacy to items with smaller weight, thereby increasing the probability that less frequent items are output.” The paper’s experiments use Common Crawl datasets.

https://arxiv.org/abs/2502.08878
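
For context, MAD builds on a simpler baseline for differentially private partition selection: each user spreads unit weight across its items, noise is added, and only items whose noisy weight clears a threshold are released. The sketch below shows that baseline (with illustrative, uncalibrated noise and threshold), not MAD's adaptive rerouting.

```python
# Sketch of the basic weight-and-threshold baseline for differentially private
# partition selection (the starting point that MAD improves on).
# Noise scale and threshold are illustrative, not privacy-calibrated values.
import random
from collections import defaultdict

def basic_partition_selection(users, sigma=1.0, threshold=3.0):
    weights = defaultdict(float)
    for items in users:                        # items: the set held by one user
        for item in items:
            weights[item] += 1.0 / len(items)  # uniform per-user weighting
    return {item for item, w in weights.items()
            if w + random.gauss(0.0, sigma) > threshold}

users = [{"common", "crawl"}, {"common"}, {"common", "web"}, {"common"}]
print(basic_partition_selection(users))
```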

SocialQuotes: Learning Contextual Roles of Social Media Quotes on the Web

Introduces SocialQuotes, “a new data set built from the Common Crawl of over 32 million social quotes, 8.3k of them with crowdsourced quote annotations.”

https://ojs.aaai.org/index.php/ICWSM/article/view/35882

Temporally Extending Existing Web Archive Collections for Longitudinal Analysis

This paper introduces “a methodology to extend existing web archive collections temporally to enable longitudinal analysis, including a dataset extended with this methodology [to identify] reasons URL candidates could be missing from the … [Environmental Governance and Data Initiative] EDGI dataset, and crawled the past web of 2008 in order to identify these missing pages.” Includes Common Crawl, in addition to Internet Archive and the End of Term Archive.

https://arxiv.org/abs/2505.24091
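
One building block for this kind of temporal extension is querying an archive's CDX index for captures from a target year. The sketch below asks the Internet Archive's CDX API (one of the archives used alongside Common Crawl and the End of Term Archive) for a handful of 2008 captures of a URL; the example domain is ours, not the paper's.

```python
# Sketch: look up 2008-era captures of a URL in the Internet Archive CDX index.
# The example domain and the limit of five results are illustrative choices.
import json
import urllib.parse
import urllib.request

def captures_in_2008(url):
    api = ("http://web.archive.org/cdx/search/cdx"
           f"?url={urllib.parse.quote(url)}&from=2008&to=2008&output=json&limit=5")
    with urllib.request.urlopen(api) as resp:
        rows = json.load(resp)
    return rows[1:]  # the first row is the field-name header

print(captures_in_2008("epa.gov"))
```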

Web2Wiki: Characterizing Wikipedia Linking Across the Web

Presents “the first large-scale analysis of how Wikipedia is referenced across the Web. Using a dataset from Common Crawl [it identifies] over 90 million Wikipedia links spanning 1.68% of Web domains and examine their distribution, context, and function.”

https://arxiv.org/abs/2505.15837
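
A back-of-the-envelope version of this kind of measurement is easy to run yourself: stream one Common Crawl WARC file and count hrefs pointing at Wikipedia. The filename below is a placeholder, and a real study would process many files and parse the HTML properly rather than relying on a regex.

```python
# Sketch: count links to Wikipedia in the HTML responses of one Common Crawl
# WARC file. The local filename is a placeholder for a real crawl segment.
import re
from warcio.archiveiterator import ArchiveIterator

WIKI_LINK = re.compile(rb'href="https?://[a-z]+\.wikipedia\.org/[^"]*"')

count = 0
with open("CC-MAIN-example.warc.gz", "rb") as stream:   # hypothetical file
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            count += len(WIKI_LINK.findall(record.content_stream().read()))
print(count)
```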

With over 10,000 research papers referencing Common Crawl’s dataset, we are constantly surprised by the myriad ways in which our dataset is useful for academic research. We look forward to being surprised again in 2026!

