June 29, 2026

13th Web-as-Corpus Workshop @ EMNLP 2026

The WaC-13 workshop invites research submissions on web data, corpus building, and linguistic analysis.

Laurie is a Principal Research Engineer at the Common Crawl Foundation.

Felső-Víziváros and Matthias Church seen from the Danube (Felső-Víziváros és Mátyás-templom látványa a Dunáról) · Thaler Tamas, Wikimedia Commons, CC BY-SA 3.0

On 29th October 2026, the Web-as-Corpus Workshop will return for its 13th edition, co-located with EMNLP 2026 in Budapest. The organising committee includes engineers from the Common Crawl Foundation alongside researchers from the Jožef Stefan Institute, the University of Oslo, and the University of Turku.

Research on the web as a corpus falls into two main strands: using the web as core data infrastructure for modern natural language processing, including Large Language Models (LLMs), and studying it as an object of societal and linguistic analysis in its own right. In both cases, questions about the curation, analysis, and responsible use of web-derived data have become increasingly critical. For those building systems, the "more is better" paradigm is under pressure from machine-generated content, data toxicity, limited metadata, and sparse data for many languages and domains. For those studying the web, issues around quality, representativeness, and ethical implications of data use are increasingly consequential. Both strands depend on understanding web data well enough to use it responsibly.

The WaC-13 workshop aims to connect researchers from multiple disciplines who share an interest in the web as a corpus. We invite submissions on methods, resources, and applications related to web corpora, with special emphasis on multilingual data and less-resourced languages.

Topics of interest include (but are not limited to):

Creation and evaluation of high-quality datasets for foundation models (e.g., data collection, filtering, enrichment, language identification)
Use of web data in empirical linguistic research
Analysis of web-scale corpora for quality, representativeness, and societal insights
Ethical and legal aspects of collecting, sharing, and using web data

There are two ways to submit your research: directly by 7 August 2026, or by committing a pre-reviewed paper via ACL Rolling Review by 1 September 2026 (both deadlines are AoE, Anywhere on Earth). Full details are available on the workshop website: https://wackyworkshop.org.

By bringing together researchers from NLP, linguistics, and the social sciences, WaC-13 aims to advance best practices for one of the field’s most influential data sources. If you work on any part of how web data is built, filtered, or studied, we’d love to read your submission.

This release was authored by:

Laurie Burchell

