News Crawl
News is a text genre that is often discussed on our user and developer mailing list.
Yet our monthly crawl and release schedule is not well adapted to this type of content, which is driven by developing and current events. Decoupling news from the main dataset into a smaller sub-dataset makes it feasible to publish the WARC files shortly after they are written.
![Header image](https://cdn.prod.website-files.com/6479b8d98bf5dcb4a69c4f31/64c819bc9dcce24965da34bf_newscrawl-midjourney.png)