@tblock let me know how it is going. By the way, are you sure you can include the scraped text in the CSV files? It might be better to include just the links and the code to fetch the articles from the websites; otherwise your repo should carry a license restricting use to non-commercial research only.

@piotr.czapla I’ll keep you updated. I’m finishing my thesis about it at the moment.

Regarding the licensing, please check out https://github.com/tblock/10kGNAD for more detail on the dataset. I didn’t scrape the news articles; they are extracted from the One Million Posts Corpus. I detail the license in the project README and on the project page. But thanks for the heads-up!