Code

Papers

2021

  • Corpulyzer: A Novel Framework for Building Low Resource Language Corpora — Bilal Tahir, Muhammad Amir Mehmood – University of Engineering and Technology, Lahore, Pakistan
  • A COVID-19 news coverage mood map of Europe — Frankie Robertson, Jarkko Lagus, Kaisla Kajava – University of Jyväskylä, Finland; University of Helsinki, Finland
  • Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets — Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Javier Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi – Google Research; Masakhane NLP; Turkic Interlingua; Haverford College; RobotsMali; Intel Labs; University of Zambia; Google; AIMS-AMMI; Inria; University of Zurich; Stanford University; Kwame Nkrumah University of Science and Technology; Sorbonne Université; Niger-Volta LTI; University of WaterlooqUniversity of Electronic Science and Technology of China; University of Notre Dame; Bayero University Kano; University of South Florida; Hugging Face; Jacobs University Bremen; University of Moratuwa; EleutherAI; Obafemi Awolowo University; University of Ibadan; Instadeep; University of Maryland; Defence Space Administration Abuja
  • Documenting the English Colossal Clean Crawled Corpus — Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Matt Gardner – Paul G. Allen School of Computer Science & Engineering, University of Washington, USA; Allen Institute for Artificial Intelligence, USA
  • mT5: A massively multilingual pre-trained text-to-text transformer — Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel – Google Research
  • Detecting Phishing Sites — An Overview — P. Kalaharsha, B. M. Mehtre – Institute for Development and Research in Banking Technology (IDRBT), Hyderabad, Indiab; School of Computer Science and Information Sciences (SCIS), University of Hyderabad, Hyderabad, India
  • What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus — Alexandra Sasha Luccioni, Joseph D. Viviano – Université de Montréal, Canada; Mila Québec AI Institute, Canada
  • HTLM: Hyper-Text Pre-Training and Prompting of Language Models — Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer – Facebook AI; University of Washington, USA
  • Naming unrelated words predicts creativity — Jay A. Olson, Johnny Nahas, Denis Chmoulevitch, Simon J. Cropper, Margaret E. Webb – Department of Psychology, Harvard University, Cambridge, MA, USA; Department of Psychology, McGill University, Montreal, QC, Canada; Melbourne School of Psychological Sciences, University of Melbourne, Australia
  • Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus — Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot – Inria, Paris, France; Sorbonne Université, Paris, France
  • The Pile: An 800GB Dataset of Diverse Text for Language Modeling — Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy – EleutherAI
  • The Danish Gigaword Corpus — Leon Derczynski, Manuel R. Ciosici, Rebekah Baglini, Morten H. Christiansen, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn Årup Nielsen, Jens Madsen, Malte Lau Petersen, Jonathan Hvithamar Rystrøm, Daniel Varab – ITU Copenhagen, Denmark; Aarhus University, Denmark; Danish Language Council, Denmark; TV2 Regionerne, Denmark; Karnov Group, Denmark; USC Information Sciences Institute, USA; Alexandra Institute, Denmark; University of Copenhagen, Denmark; Technical University of Denmark; Novo Nordisk, Denmark
  • CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl — Maik Fröbe, Janek Bevendorff, Lukas Gienapp, Michael Völske, Benno Stein, Martin Potthast, Matthias Hagen – Martin-Luther-Universität Halle-Wittenberg, Germany; Bauhaus-Universität Weimar, Germany; Leipzig University, Germany
  • Practical Wavelet Tree Construction — Patrick Dinklage, Jonas Ellert, Johannes Fischer, Florian Kurpicz, Marvin Löbel – TU Dortmund University, Germany

2020

2019

2018

2017

2016

2015

2014

2013

2012