February 11, 2026

CC-Citations: A Visualization of Research Papers Referencing Common Crawl

We are proud to release an interactive visualization of thousands of research papers using or citing Common Crawl data.
Malte Ostendorff
Malte is a Senior Research Engineer at Common Crawl, based in Berlin, Germany. He holds a Ph.D. in computer science from the University of Göttingen.

Thousands of research papers mention, use, or cite Common Crawl data, which makes it difficult to get a meaningful overview from traditional academic search engines like Google Scholar. To make exploration easier, we built an interactive visualization, available as a Space on Hugging Face.

The visualization is a map-like web application that displays more than 10,000 research papers as markers. Users can explore the papers visually by dragging and zooming, and a search bar lets them find papers by title.

The position of each paper on the map is determined by its semantic similarity to the other papers. Specifically, we compute SciNCL paper embeddings from titles and abstracts and project them onto the map with UMAP dimensionality reduction. Paper topics are distinguished by color. For topic detection, we use LDA in combination with Anthropic’s Claude to come up with human-readable topics.
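
The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code behind the visualization: the model ID `malteos/scincl`, the title/abstract separator format, the UMAP settings, the number of topics, and all function names are assumptions chosen for the example. The step in which Claude turns LDA topics into human-readable names is not shown.

```python
from typing import Optional


def format_paper(title: str, abstract: Optional[str], sep_token: str) -> str:
    """Join title and abstract with the tokenizer's separator token."""
    return title + sep_token + (abstract or "")


def embed_papers(papers, model_name="malteos/scincl", batch_size=16):
    """Return one embedding per paper (assumed model ID, see lead-in)."""
    # Heavy dependencies are imported lazily so format_paper works without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    chunks = []
    for start in range(0, len(papers), batch_size):
        batch = papers[start:start + batch_size]
        texts = [
            format_paper(p["title"], p.get("abstract"), tokenizer.sep_token)
            for p in batch
        ]
        inputs = tokenizer(texts, padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            output = model(**inputs)
        # The final hidden state of the first ([CLS]) token is the embedding.
        chunks.append(output.last_hidden_state[:, 0, :])
    return torch.cat(chunks).numpy()


def project_to_2d(embeddings, random_state=42):
    """Reduce embeddings to 2D map coordinates with UMAP (assumed settings)."""
    import umap

    reducer = umap.UMAP(n_components=2, metric="cosine",
                        random_state=random_state)
    return reducer.fit_transform(embeddings)


def assign_topics(texts, n_topics=10):
    """Assign each text its most likely LDA topic (n_topics is an assumption)."""
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    counts = CountVectorizer(stop_words="english").fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts).argmax(axis=1)
```

Given a list of dictionaries with `title` and `abstract` keys, `project_to_2d(embed_papers(papers))` would yield one (x, y) position per paper, and `assign_topics` would yield a topic index that could be mapped to a marker color.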

Clusters of Research Papers

The visualization provides a clear overview of the research areas that directly or indirectly use Common Crawl data. Some topic clusters dominate, but many others are clearly visible too. A few examples are listed below:

Security & Attack Detection

The topic of security and attack detection appears in several areas and clusters in the visualization (displayed in red).  Below are a few examples of papers from this topic:

Research papers about security & attack detection in several clusters.

Machine Translation

One isolated topic cluster in the top-right corner is about machine translation research and related topics (displayed in green).  The cluster contains papers like:

A cluster of research papers about machine translation.

Ethics & Governance

At the very center of the map, there is a cluster about ethics and governance.  The cluster contains papers such as Multidimensional tie strength and economic development (Aiello et al., 2022).

A cluster of research papers about ethics and governance.

Other Topics

All research papers that cannot be assigned to a specific topic cluster are labeled “Other” (displayed in grey). For example, Tracking and Identifying International Propaganda and Influence Networks Online (Hanley, 2025) and Determining How Citations Are Used in Citation Contexts (Färber and Sampath, 2019) both fall into this group.

Research papers with other topics are scattered throughout the map.

There are many more interesting research papers hidden here.  Please feel free to go and explore the interactive visualization on your own.

Erratum: 

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
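
As a small illustration of the rule above, the following sketch returns the truncation threshold that applies to a given crawl. It assumes crawl IDs of the form `CC-MAIN-YYYY-WW` and that CC-MAIN-2025-13 is the first crawl with the 5 MiB limit; the function and constant names are hypothetical.

```python
import re

MIB = 1024 * 1024
# Assumed first crawl with the larger limit (the March 2025 crawl).
FIRST_5_MIB_CRAWL = (2025, 13)

_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit in bytes for a crawl ID like CC-MAIN-2025-13."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        # IDs in other formats are rejected rather than guessed at.
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element: year first, then week.
    return 5 * MIB if (year, week) >= FIRST_5_MIB_CRAWL else 1 * MIB
```

For example, `truncation_limit_bytes("CC-MAIN-2024-51")` returns 1048576, while `truncation_limit_bytes("CC-MAIN-2025-13")` returns 5242880.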

For more details, see our truncation analysis notebook.