Code

Papers

2022

2021

  • Corpulyzer: A Novel Framework for Building Low Resource Language Corpora — Bilal Tahir, Muhammad Amir Mehmood – University of Engineering and Technology, Lahore, Pakistan
  • A COVID-19 news coverage mood map of Europe — Frankie Robertson, Jarkko Lagus, Kaisla Kajava – University of Jyväskylä, Finland; University of Helsinki, Finland
  • Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets — Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Javier Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi – Google Research; Masakhane NLP; Turkic Interlingua; Haverford College; RobotsMali; Intel Labs; University of Zambia; Google; AIMS-AMMI; Inria; University of Zurich; Stanford University; Kwame Nkrumah University of Science and Technology; Sorbonne Université; Niger-Volta LTI; University of Waterloo; University of Electronic Science and Technology of China; University of Notre Dame; Bayero University Kano; University of South Florida; Hugging Face; Jacobs University Bremen; University of Moratuwa; EleutherAI; Obafemi Awolowo University; University of Ibadan; Instadeep; University of Maryland; Defence Space Administration Abuja
  • Documenting the English Colossal Clean Crawled Corpus — Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Matt Gardner – Paul G. Allen School of Computer Science & Engineering, University of Washington, USA; Allen Institute for Artificial Intelligence, USA
  • mT5: A massively multilingual pre-trained text-to-text transformer — Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel – Google Research
  • Detecting Phishing Sites — An Overview — P. Kalaharsha, B. M. Mehtre – Institute for Development and Research in Banking Technology (IDRBT), Hyderabad, India; School of Computer Science and Information Sciences (SCIS), University of Hyderabad, Hyderabad, India
  • What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus — Alexandra Sasha Luccioni, Joseph D. Viviano – Université de Montréal, Canada; Mila Québec AI Institute, Canada
  • HTLM: Hyper-Text Pre-Training and Prompting of Language Models — Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer – Facebook AI; University of Washington, USA
  • Naming unrelated words predicts creativity — Jay A. Olson, Johnny Nahas, Denis Chmoulevitch, Simon J. Cropper, Margaret E. Webb – Department of Psychology, Harvard University, Cambridge, MA, USA; Department of Psychology, McGill University, Montreal, QC, Canada; Melbourne School of Psychological Sciences, University of Melbourne, Australia
  • Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus — Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot – Inria, Paris, France; Sorbonne Université, Paris, France
  • The Pile: An 800GB Dataset of Diverse Text for Language Modeling — Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy – EleutherAI
  • The Danish Gigaword Corpus — Leon Derczynski, Manuel R. Ciosici, Rebekah Baglini, Morten H. Christiansen, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn Årup Nielsen, Jens Madsen, Malte Lau Petersen, Jonathan Hvithamar Rystrøm, Daniel Varab – ITU Copenhagen, Denmark; Aarhus University, Denmark; Danish Language Council, Denmark; TV2 Regionerne, Denmark; Karnov Group, Denmark; USC Information Sciences Institute, USA; Alexandra Institute, Denmark; University of Copenhagen, Denmark; Technical University of Denmark; Novo Nordisk, Denmark
  • CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl — Maik Fröbe, Janek Bevendorff, Lukas Gienapp, Michael Völske, Benno Stein, Martin Potthast, Matthias Hagen – Martin-Luther-Universität Halle-Wittenberg, Germany; Bauhaus-Universität Weimar, Germany; Leipzig University, Germany
  • Practical Wavelet Tree Construction — Patrick Dinklage, Jonas Ellert, Johannes Fischer, Florian Kurpicz, Marvin Löbel – TU Dortmund University, Germany
  • Multimodal datasets: misogyny, pornography, and malignant stereotypes — Abeba Birhane, Vinay Uday Prabhu, Emmanuel Kahembwe – University College Dublin, Ireland; Lero, Dublin, Ireland; University of Edinburgh, UK
  • LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs — Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, Aran Komatsuzaki – LAION.ai; Gentec Data, Romania; Technical University of Munich, Germany; Juelich Supercomputing Center, Germany; Georgia Institute of Technology, USA; EleutherAI
  • Voted In, Standing Out: Public Response to Immigrants’ Political Accession — Guy Grossman, Stephanie Zonszein – University of Pennsylvania, USA
  • No News is Good News: A Critique of the One Billion Word Benchmark — Helen Ngo, João G. M. Araújo, Jeffrey Hui, Nicholas Frosst – Cohere, Toronto, Canada
  • Documenting large webtext corpora: a case study on the Colossal Clean Crawled Corpus — Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Matt Gardner – Paul G. Allen School of Computer Science & Engineering, University of Washington, USA; Allen Institute for Artificial Intelligence, USA
  • An Empirical Exploration in Quality Filtering of Text Data — Leo Gao – EleutherAI
  • Event Coreference Data (Almost) for Free: Mining Hyperlinks from Online News — Michael Bugert, Iryna Gurevych – Ubiquitous Knowledge Processing Lab (UKP), Technical University of Darmstadt, Germany
  • The Web Is Your Oyster – Knowledge-Intensive NLP against a Very Large Web Corpus — Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, Sebastian Riedel – Facebook AI Research; University College London, United Kingdom; University of Mannheim, Germany; ENS, PSL University, France; Inria, France; University of Washington, United States
  • On the Impact of Publicly Available News and Information Transfer to Financial Markets — Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm – ETH Zurich, Switzerland; New York University, New York, USA
  • MassiveSumm: a very large-scale, very multilingual, news summarisation dataset — Daniel Varab, Natalie Schluter – IT University of Copenhagen, Denmark

2020

2019

2018

2017

2016

2015

2014

2013

2012