Code
- Ransacking your password reset tokens by Lukas Euler
- Read Common Crawl Parquet Metadata with Python by Edward Ross
- C4 Dataset Script by Jianbin Chang
- hqurlfind3r – A passive reconnaissance tool for known URLs discovery by Hueristiq
- Visual Search by
- Common Crawl On Laptop – Extracting Subset Of Data by Chillar Anand
- Alexandria Search by alexandria.org
- Searching the web for < $1000 / month by Adrien Guillo
- Simple Search Engine by Hannes Rabo, Julius Recep Colliander Celik
- web-search-engine by Alexander Gao
- One click to download all the web pages you may want by Jader Dias
- Querying TB sized External Tables with Snowflake by Venkat Sekar
- Link Archive by Philip Waritschlager
- PWA Store – The largest collection of publicly accessible Progressive Web Apps* by Petr Gajdosik
- NewsFetch by Manoj Bharadwaj
- All Around The World: The Common Crawl Dataset – Attack Surface Research by Aliz Hammond
- Seldonite – A News Article Collection and Processing Library by McGill Network Dynamics Lab
- EMR Tutorial by haydenhw
- sigurls by Alex Munene
- Parse Petabytes of data from CommonCrawl in seconds by Stanislas Girard
- commoncrawl – a Node.js client for the commoncrawl.org index by
- Extracting Data from Common Crawl Dataset by Athul Jayson
- getallurls (gau) by Corben Leo
- CommonCrawl Host-IP Mapper by Mingwei Zhang
- MrURL by Sachin Verma
- tantivy_warc_indexer by Andreas Hauser
- pace-commoncrawl-scanner by Citizen Foundation
- WARC parser CPP by seo-explorer.io
- andresriancho/cc-lambda: Search the common crawl using lambda functions by Andres Riancho
- Analyzing Performance and Cost of Large-Scale Data Processing with AWS Lambda by Chris Madden, Aaron Bawcom (Candid Partners)
- Searching 100 Billion Webpages Pages With Capture Index by Edward Ross
- Extracting Text, Metadata and Data from Common Crawl by Edward Ross
- Measuring Internet Links: Accessing the Common Crawl Dataset Using EMR and Pyspark in AWS by Basil Latif
- Extracting Job Ads from Common Crawl by Edward Ross
- Common Crawl Index Athena by Edward Ross
- Search the html across 25 billion websites for passive reconnaissance using common crawl by Ryan Elkins
- Common Crawl News 20200110212037-00310 – A single Web ARChive (WARC) file from Common Crawl News by Gabriel Altay
- LinkRun – A pipeline to analyze popularity of domains across the web by Sergey Shnitkind
- comcrawl – A python utility for downloading Common Crawl data by Michael Harms
- warcannon – High speed/Low cost CommonCrawl RegExp in Node.js by Brad Woodward
- Webxtrakt – building domain zone files by webxtract
- super-Django-CC by Jinxu
- I Got Urls – WaybackURLS + OtxURLS + CommonCrawl by xyele
- cc_net – Tools to download and cleanup Common Crawl data by Facebook Research
- S3 Throughput: Scans vs Indexes by Colin Dellow
- Analyzing crime reported in the U.S. using data derived from Common Crawl, New York Times API and Twitter data by Sai Saket Regulapati
- Hello, WARC: Common Crawl code samples by Colin Dellow
- commoncrawl_downloader by Leo Gao
- goCommonCrawl – Extraction of Web Archive data using Common Crawl index API by karust
- CitizensFoundation/ac-keyword-scanner by Róbert Viðar Bjarnason
- SportsDataAnalysis by Yash Chandra
- Categorizing World Wide Web by Jay Pavagadhi
- CCrawlDNS – CommonCrawl data set subdomain extracter by Laurent Gaffié
- How to Retrieve Archived Pages of Specific Domain Using CommonCrawl Index by Liyan Xu
- Index fun by Philippe Suter
- … a free version of Helium Scraper that scrapes data from the Common Crawl database. by Juan Soldi
- mcn-source-ct – Scripts for downloading and extracting .no domains from the data of the commoncrawl.org project. by Anders Einar Hilden
- cc.py – Extracting URLs of a specific target based on the results of “commoncrawl.org” by SI9INT
- CommonCrawlScalaTools by Jeff Harwell
- Source real estate prices from the Common Crawl by Colin Dellow
- Extracting text from HTML in Python: a very fast approach by Artem Golubin
- Defining Data Science Using the Common Crawl Web Corpus by Paavo Pohndorff
- Large-scale Graph Mining with Spark by Win Suen
- Paskto – Passive Web Scanner by
- Parsing Common Crawl in 2 plain scripts in python by Alexander Veysov
- The prevalence of Web advertising by commecica.com
- Of using Common Crawl to play Family Feud by Paul Masurel
- Common Crawl Scala Example by Soner Altin
- Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl by Janek Bevendorff, Martin Potthast (Bauhaus-Universität Weimar)
- A toolkit for CDX indices such as Common Crawl and the Internet Archive’s Wayback Machine by Greg Lindahl
- Using Python and Common-Crawl to find products from Amazon.com by David Cedar
- Analyzing “Wait-Delay” Settings in Common Crawl robots.txt Data with R by hrbrmstr
- Clustering communities on web crawl data by Oluwaseyi Talabi, M. Rafay Aleem, Prashanth Rao, Nandita Dwivedi
- Virtual patent marking crawler by David Portabella
- Analyzing 4 Billions of Tags with R and Spark by Javier Luraschi
- newsplease/examples/commoncrawl.py – download WARC files from commoncrawl.org’s news crawl by Felix Hamborg
- cc-pyspark: process Common Crawl data with Python and Spark by Common Crawl
- KeywordAnalysis: Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends by CI-Research
- Go Crawl by Chris Cates
- go-warc: golang library to work with WARC files by Wolfgang Meyers
- sparkwarc: Load WARC Files into Apache Spark by Javier Luraschi
- Analysing Petabytes of Websites by Mark Litwintschik
- CommonCrawlJob – Extract data from common crawl using elastic map reduce by Sang Han (Qadium)
- Exploring the Common Crawl with Python by Derek Morgan
- Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS by Jouke-Thiemo Waleson
- Mining Common Crawl with PHP by Paulius Rimavičius
- Crate.IO: How to import from custom data sources with a plugin by Claus Matzinger
- Index 1,600,000,000 Keys with Automata and Rust by Andrew Gallant
- Как погрепать интернет / How to grep the web by Aleksandr Kukushkin
- How Many Websites Provide RSS / Web Syndication Feeds by Victor Felder (eXascale Infolab)
- Analyzing the Common Crawl using Map-Reduce by Stefan Koch
- Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch – AWS Big Data Blog by Hernan Vivani
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ by Ilya Kreymer
- Analyze Common Crawl index – http://index.commoncrawl.org/ by Tom Morris
- Common Crawl Document Download by Dominik Stadler
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. by Ross Fairbanks
- Common Crawl WARC/WET/WAT examples and processing code for Java + Hadoop by Stephen Merity
- Java and Clojure examples for processing Common Crawl WARC files by Mark Watson
- Common web archive utility code by the IIPC
- A distributed system for mining Common Crawl using SQS, AWS-EC2 and S3 by Akshay Bhat
- Twelve steps to running your Ruby code across five billion web pages by Pete Warden
- Link Reverse by Nada Amin
- Is Money the Root of All Evil by Joyita Raksit
- Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts by Chris Han
- Bill Tracker – Online Sentiment Towards Congressional Bills by Albert Wavering
- Common Crawl URL Index by Jason Ronallo
- Web Data Commons – RDFa, Microdata, and Microformat Data Sets by University of Mannheim
Papers
2023
-
A Systematic Literature Review on Phishing Website Detection Techniques
— Asadullah Safi, Satwinder Singh – Nangarhar University, Afghanistan; Central University of Punjab, Bathinda, Punjab, India
-
A Golden Age: Conspiracy Theories’ Relationship with Misinformation Outlets, News Media, and the Wider Internet
— Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric – Stanford University, USA
-
Generative Language Models and Automated Influence Operations: Emerging Threats and Potential Mitigations
— Josh A. Goldstein, Girish Sastry, Micah Musser, Renee DiResta, Matthew Gentzel, Katerina Sedova – Georgetown University’s Center for Security and Emerging Technology, USA; OpenAI; Stanford Internet Observatory, USA
-
Building Programmable Commons
— Petros Terzis – University College London, United Kingdom
-
Large Web Archive Collection Infrastructure and Services
— Xinyue Wang – Virginia Tech, USA
-
WDC Products: A Multi-Dimensional Entity Matching Benchmark
— Ralph Peeters, Reng Chiz Der, Christian Bizer – University of Mannheim, Germany
-
Reclaiming the Digital Commons: A Public Data Trust for Training Data
— Alan Chan, Herbie Bradley, Nitarshan Rajkumar – University of Cambridge, United Kingdom; Mila, Université de Montréal, Canada; EleutherAI
-
LLaMA: Open and Efficient Foundation Language Models
— Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample – Meta AI
-
Systems and Algorithms for Dynamic Graph Processing
— Khaled Ammar – University of Waterloo, Ontario, Canada
-
Poisoning Web-Scale Training Datasets is Practical
— Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr – Google; ETH Zurich, Switzerland; NVIDIA; Robust Intelligence
-
Generative AI and the Digital Commons
— Saffron Huang, Divya Siddarth – Collective Intelligence Project (cip.org)
2022
-
A Holistic Assessment of the Carbon Footprint of Noor, a Very Large Arabic Language Model
— Imad LAKIM, Ebtesam Almazrouei, Ibrahim Abu Alhaol, Merouane Debbah, Julien Launay – TII, Abu Dhabi, United Arab Emirates; LightOn, Paris, France
-
The hitchhiker’s guide: Method handbook for quantification of online linguistic data in a country-specific context. Official research report, Linguistic Explorations of Societies (Work Package 1)
— Jonas Andersson Schwarz – Göteborgs Universitet, Sweden
-
JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus
— Makoto Morishita, Katsuki Chousa, Jun Suzuki, Masaaki Nagata – NTT Communication Science Laboratories, NTT Corporation, Japan
-
Does Corpus Quality Really Matter for Low-Resource Languages?
— Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa – Meta AI; HiTZ Center - Ixa, University of the Basque Country UPV/EHU
-
Datasheet for the Pile
— Stella Biderman, Kieran Bicheno, Leo Gao – EleutherAI
-
A Warm Start and a Clean Crawled Corpus — A Recipe for Good Language Models
— Vésteinn Snæbjarnarson, Haukur Barri Símonarson, Pétur Orri Ragnarsson, Svanhvít Lilja Ingólfsdóttir, Haukur Páll Jónsson, Vilhjálmur Þorsteinsson, Hafsteinn Einarsson – Miðeind ehf., Iceland; University of Iceland, Iceland
-
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
— Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi – Google Research; Masakhane NLP; Turkic Interlingua; Haverford College; RobotsMali; Intel Labs; University of Zambia; Google; AIMS-AMMI; Inria; University of Zurich; Stanford University; Kwame Nkrumah University of Science and Technology; Sorbonne Université; Niger-Volta LTI; University of Waterloo; University of Electronic Science and Technology of China; University of Notre Dame; Bayero University Kano; University of South Florida; Hugging Face; Jacobs University Bremen; University of Moratuwa; EleutherAI; Obafemi Awolowo University; University of Ibadan; Instadeep; University of Maryland; Defence Space Administration Abuja
-
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
— Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot – Inria, France; Sorbonne Université, France
-
OPT: Open Pre-trained Transformer Language Models
— Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer – Meta AI
-
A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication
— Alexandra Sasha Luccioni, Frances Corry, Hamsini Sridharan, Mike Ananny, Jason Schultz, Kate Crawford – Hugging Face; University of Southern California, USA; New York University, USA; Microsoft Research, USA
-
ClueWeb22: 10 Billion Web Documents with Rich Information
— Arnold Overwijk, Chenyan Xiong, Jamie Callan – Microsoft; Carnegie Mellon University
-
esCorpius: A Massive Spanish Crawling Corpus
— Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas – LHF Labs; Universidad Autónoma de Madrid, Spain; University of Edinburgh, United Kingdom; Universidad de Granada, Spain
-
Domain Parking: Largely Present, Rarely Considered!
— Johannes Zirngibl, Steffen Deusch, Patrick Sattler, Juliane Aulbach, Georg Carle, Mattijs Jonker – Technical University of Munich, Germany; University of Twente, The Netherlands
-
Homepage2Vec: Language-Agnostic Website Embedding and Classification
— Sylvain Lugeon, Tiziano Piccardi, Robert West – EPFL, Switzerland
-
HTML Violations and Where to Find Them: A Longitudinal Analysis of Specification Violations in HTML
— Florian Hantke, Ben Stock – CISPA Helmholtz Center for Information Security, Germany
-
Rethinking Data Governance: A Labor-Oriented Approach
— Hanlin Li, Nicholas Vincent – Northwestern University, USA
-
Deepfake Text Detection: Limitations and Opportunities
— Jiameng Pu, Zain Sarwar, Sifat Muhammad Abdullah, Abdullah Rehman, Yoonjin Kim, Parantapa Bhattacharya, Mobin Javed, Bimal Viswanath – Virginia Tech, USA; University of Chicago, USA; LUMS, Pakistan; University of Virginia, USA
-
Equivocal URLs: Understanding the Fragmented Space of URL Parser Implementations
— Joshua Reynolds, Adam Bates, Michael Bailey – New Mexico State University, USA; University of Illinois at Urbana-Champaign, USA; Georgia Institute of Technology, USA
-
The Norwegian Colossal Corpus: A Text Corpus for Training Large Norwegian Language Models
— Per E Kummervold, Freddy Wetjen, Javier de la Rosa – National Library of Norway (NLN), Norway
-
A Holistic Approach to Undesired Content Detection in the Real World
— Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, Lilian Weng – OpenAI
-
Dataset of intercity relationships between 293 Chinese cities extracted and classified on the basis of toponym co-occurrences on Common Crawl
— Wang Tongjing, Zhao Yin, Ziyu Bao, Evert Meijers – Utrecht University, The Netherlands; Delft University of Technology, The Netherlands
-
LAION-5B: An open large-scale dataset for training next generation image-text models
— Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, Jenia Jitsev – LAION; UC Berkeley, USA; Gentec Data; TU Darmstadt, Germany; Hessian.AI; University of Washington, Seattle, USA; Technical University of Munich, Germany; Stability AI; EleutherAI; Juelich Supercomputing Center (JSC), Germany; Research Center Juelich (FZJ), Germany
-
C-OSINT: COVID-19 Open Source artificial INTelligence framework
— L. Ranaldi, A. Nourbakhsh, F. Fallucchi, F. M. Zanzotto – Guglielmo Marconi University, Roma, Italy; University of Rome Tor Vergata, Roma, Italy
-
Fine-grained Czech News Article Dataset: An Interdisciplinary Approach to Trustworthiness Analysis
— Matyáš Boháček, Michal Bravanský, Filip Trhlík, Václav Moravec – Charles University, Prague, Czech Republic; Gymnasium of Johannes Kepler, Prague, Czech Republic; University College London, United Kingdom
-
A Hybrid Phishing Detection System Using Deep Learning-based URL and Content Analysis
— Mehmet Korkmaz, Emre Koçyiğit, Özgür Şahingöz, Banu Diri – Yildiz Technical University, Istanbul, Turkey; Biruni University, Istanbul, Turkey
-
The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability
— Mehtab Khan, Alex Hanna – Yale Law School, USA; Distributed AI Research Institute
-
Comparative Analysis of Machine Learning Classifiers for Phishing Detection
— Mohd Faizal Ab Razak, Mohd Izham Jaya, Ferda Ernawan, Ahmad Firdaus, Fajar Agung Nugroho – Universitas Dian Nuswantoro, Semarang, Indonesia
-
No Language Left Behind: Scaling Human-Centered Machine Translation
— NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang – Meta AI; UC Berkeley, USA; Johns Hopkins University, USA
-
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
— Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro – Microsoft; NVIDIA
-
Do CoNLL-2003 Named Entity Taggers Still Work Well in 2023?
— Shuheng Liu, Alan Ritter – Georgia Institute of Technology
-
Analyzing the Web: Are Top Websites Lists a Good Choice for Research?
— Tom Alby, Robert Jäschke – Humboldt-Universität zu Berlin, Berlin, Germany
-
Moving the End of Term Web Archive to the Cloud to Encourage Research Use and Reuse
— Mark Edward Phillips, Sawood Alam – University of North Texas, USA; Internet Archive, USA
-
Coyo-700m: Image-text pair dataset
— Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, Saehoon Kim – Kakao Brain, South Korea
-
LaMDA: Language Models for Dialog Applications
— Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, Quoc Le – Google
2021
-
Corpulyzer: A Novel Framework for Building Low Resource Language Corpora
— Bilal Tahir, Muhammad Amir Mehmood – University of Engineering and Technology, Lahore, Pakistan
-
A COVID-19 news coverage mood map of Europe
— Frankie Robertson, Jarkko Lagus, Kaisla Kajava – University of Jyväskylä, Finland; University of Helsinki, Finland
-
Documenting the English Colossal Clean Crawled Corpus
— Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Matt Gardner – Paul G. Allen School of Computer Science & Engineering, University of Washington, USA; Allen Institute for Artificial Intelligence, USA
-
mT5: A massively multilingual pre-trained text-to-text transformer
— Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel – Google Research
-
Detecting Phishing Sites — An Overview
— P. Kalaharsha, B. M. Mehtre – Institute for Development and Research in Banking Technology (IDRBT), Hyderabad, India; School of Computer Science and Information Sciences (SCIS), University of Hyderabad, Hyderabad, India
-
What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus
— Alexandra Sasha Luccioni, Joseph D. Viviano – Université de Montréal, Canada; Mila Québec AI Institute, Canada
-
HTLM: Hyper-Text Pre-Training and Prompting of Language Models
— Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer – Facebook AI; University of Washington, USA
-
Naming unrelated words predicts creativity
— Jay A. Olson, Johnny Nahas, Denis Chmoulevitch, Simon J. Cropper, Margaret E. Webb – Department of Psychology, Harvard University, Cambridge, MA, USA; Department of Psychology, McGill University, Montreal, QC, Canada; Melbourne School of Psychological Sciences, University of Melbourne, Australia
-
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
— Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot – Inria, Paris, France; Sorbonne Université, Paris, France
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
— Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy – EleutherAI
-
The Danish Gigaword Corpus
— Leon Derczynski, Manuel R. Ciosici, Rebekah Baglini, Morten H. Christiansen, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn Årup Nielsen, Jens Madsen, Malte Lau Petersen, Jonathan Hvithamar Rystrøm, Daniel Varab – ITU Copenhagen, Denmark; Aarhus University, Denmark; Danish Language Council, Denmark; TV2 Regionerne, Denmark; Karnov Group, Denmark; USC Information Sciences Institute, USA; Alexandra Institute, Denmark; University of Copenhagen, Denmark; Technical University of Denmark; Novo Nordisk, Denmark
-
CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl
— Maik Fröbe, Janek Bevendorff, Lukas Gienapp, Michael Völske, Benno Stein, Martin Potthast, Matthias Hagen – Martin-Luther-Universität Halle-Wittenberg, Germany; Bauhaus-Universität Weimar, Germany; Leipzig University, Germany
-
Practical Wavelet Tree Construction
— Patrick Dinklage, Jonas Ellert, Johannes Fischer, Florian Kurpicz, Marvin Löbel – TU Dortmund University, Germany
-
Multimodal datasets: misogyny, pornography, and malignant stereotypes
— Abeba Birhane, Vinay Uday Prabhu, Emmanuel Kahembwe – University College Dublin, Ireland; Lero, Dublin, Ireland; University of Edinburgh, UK
-
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
— Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, Aran Komatsuzaki – LAION.ai; Gentec Data, Romania; Technical University of Munich, Germany; Juelich Supercomputing Center, Germany; Georgia Institute of Technology, USA; EleutherAI
-
Voted In, Standing Out: Public Response to Immigrants’ Political Accession
— Guy Grossman, Stephanie Zonszein – University of Pennsylvania, USA
-
No News is Good News: A Critique of the One Billion Word Benchmark
— Helen Ngo, João G. M. Araújo, Jeffrey Hui, Nicholas Frosst – Cohere, Toronto, Canada
-
Documenting large webtext corpora: a case study on the Colossal Clean Crawled Corpus
— Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Matt Gardner – Paul G. Allen School of Computer Science & Engineering, University of Washington, USA; Allen Institute for Artificial Intelligence, USA
-
An Empirical Exploration in Quality Filtering of Text Data
— Leo Gao – EleutherAI
-
Event Coreference Data (Almost) for Free: Mining Hyperlinks from Online News
— Michael Bugert, Iryna Gurevych – Ubiquitous Knowledge Processing Lab (UKP), Technical University of Darmstadt, Germany
-
The Web Is Your Oyster – Knowledge-Intensive NLP against a Very Large Web Corpus
— Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, Sebastian Riedel – Facebook AI Research; University College London, United Kingdom; University of Mannheim, Germany; ENS, PSL University, France; Inria, France; University of Washington, United States
-
MassiveSumm: a very large-scale, very multilingual, news summarisation dataset
— Daniel Varab, Natalie Schluter – IT University of Copenhagen, Denmark
-
From web graphs to prioritizing web crawls
— Sebastian Nagel – Common Crawl
-
MuRIL: Multilingual Representations for Indian Languages
— Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, Partha Talukdar – Google; Indian Institute of Technology, Patna, India; Indian Institute of Technology, Bombay, India; Delhi Technological University, India
2020
-
CC-News-En: A large English news corpus
— Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, Alistair Moffat – The University of Melbourne, Melbourne, Australia; RMIT University, Melbourne, Australia; Amazon Alexa, Manhattan Beach, CA, USA
-
CCAligned: A Massive collection of cross-lingual web-document pairs
— Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn – Facebook AI; Johns Hopkins University
-
Language models are few-shot learners
— Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei – Johns Hopkins University; OpenAI
-
On the impact of publicly available news and information transfer to financial markets
— Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm – ETH Zurich, Switzerland; New York University, New York, USA
-
Can I take your subdomain? Exploring related-domain attacks in the modern web
— Marco Squarcina, Mauro Tempesta, Lorenzo Veronese, Stefano Calzavara, Matteo Maffei – TU Wien, Austria; Università Ca’ Foscari Venezia, Italy
-
Exploring the limits of transfer learning with a unified text-to-text transformer
— Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu – Google, Mountain View, CA, USA
-
Getting structured data from the internet
— Jay M. Patel – Specrom Analytics, Ahmedabad, India
-
Mapping languages: The Corpus of Global Language Use
— Jonathan Dunn – University of Canterbury, Christchurch, New Zealand
-
CLUECorpus2020: A large-scale Chinese corpus for pre-training language model
— Liang Xu, Xuanwei Zhang, Qianqian Dong – CLUE Organization
-
Exploring the Dominance of the English Language on the Websites of EU Countries
— Andreas Giannakoulopoulos, Minas Pergantis, Nikos Konstantinou, Aristeidis Lamprogeorgos, Laida Limniati, Iraklis Varlamis – Ionian University, Corfu, Greece; Harokopio University of Athens, Athens, Greece
-
Privacy at scale: Introducing the PrivaSeer corpus of web privacy policies
— Mukund Srinath, Shomir Wilson, C Lee Giles – Pennsylvania State University, PA, USA
-
A longitudinal analysis of job skills for entry-level data analysts
— Tianxi Dong, Jason Triche – Trinity University, San Antonio, TX, USA; University of Montana, MT, USA
-
Extracting training data from large language models
— Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel – Google; Stanford University; UC Berkeley; Northeastern University; OpenAI; Harvard University; Apple
-
Going Back in Time to Find What Existed on the Web and How much has been Preserved: How much of Palestinian Web has been Archived?
— Thaer Sammar, Hadi Khalilia – Palestine Technical University, Tulkarm, West Bank
-
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
— Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith – Paul G. Allen School of Computer Science & Engineering, University of Washington, USA; Allen Institute for Artificial Intelligence, Seattle, USA
-
The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle
— Xinyue Wang, Zhiwu Xie – Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
-
Identifying Sensitive URLs at Web-Scale
— Srdjan Matic, Costas Iordanou, Georgios Smaragdakis, Nikolaos Laoutaris – TU Berlin, Germany; Cyprus University of Technology, Cyprus; IMDEA Networks Institute
-
Experiments using a Distributed Web Crawler to Process and Index Web Archives
— Sebastian Nagel – Common Crawl
2019
-
The SAGE Handbook of Web History
— Nils Brügger, Ian Milligan – Aarhus University, Denmark; University of Waterloo, Canada
-
Fast Dictionary-Based Compression for Inverted Indexes
— Giulio Ermanno Pibiri, Matthias Petri, Alistair Moffat – University of Melbourne, Australia; University of Pisa, Italy; ISTI-CNR, Pisa, Italy
-
CCNet: Extracting high quality monolingual datasets from web crawl data
— Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave – Facebook AI
-
Language models are unsupervised multitask learners
— Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever – OpenAI, San Francisco, California, United States
-
Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures
— Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary – Inria, Paris, France; Sorbonne University, Paris, France
-
Multi-Label Branchenklassifikation von Web-Texten
— Dominik Mottl – Hochschule Darmstadt, Germany
-
Accessing WARC files via SQL
— Sebastian Nagel – Common Crawl, USA
-
Defending against neural fake news
— Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, Yejin Choi – University of Washington, USA; Allen Institute for Artificial Intelligence, USA
-
RoBERTa: A Robustly Optimized BERT Pretraining Approach
— Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov – Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA; Facebook AI
-
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB
— Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin – Facebook AI
-
Real or Fake? Learning to Discriminate Machine from Human Generated Text
— Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, Arthur Szlam – Facebook AI Research; Harvard University, USA
-
Unsupervised Cross-lingual Representation Learning at Scale
— Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov – Facebook AI
-
XLNet: Generalized Autoregressive Pretraining for Language Understanding
— Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le – Carnegie Mellon University; Google AI Brain Team
2018
-
A Deep Learning Approach for Extracting Attributes of ABAC Policies
— Manar Alohaly, Hassan Takabi, Eduardo Blanco – University of North Texas, USA
-
Preserved Structure Across Vector Space Representations
— Andrei Amatuni, Estelle He, Elika Bergelson – Duke University
-
Distributed Evaluation of Subgraph Queries Using Worst-case Optimal Low-Memory Dataflows
— Khaled Ammar, Frank McSherry, Semih Salihoglu, Manas Joglekar – University of Waterloo, Canada; ETH Zürich, Switzerland; Google, Inc.
-
Distributed evaluation of subgraph queries using worst-case optimal low-memory dataflows
— Khaled Ammar, Frank McSherry, Semih Salihoglu, Manas Joglekar – University of Waterloo, Canada; ETH Zürich, Switzerland; Google, Inc.
-
Large scale distributed neural network training through online distillation
— Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, Geoffrey E. Hinton – Google; Google Brain; Google DeepMind
-
Large-Scale Analysis of Style Injection by Relative Path Overwrite
— Sajjad Arshad, Seyed Ali Mirheidari, Tobias Lauinger, Bruno Crispo, Engin Kirda, William Robertson – Northeastern University, Boston, MA, USA; University of Trento, Trento, Italy
-
Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation
— Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, Eneko Agirre – University of the Basque Country, Spain
-
Big Data Integration for Product Specifications
— Luciano Barbosa, Valter Crescenzi, Xin Luna Dong, Paolo Merialdo, Federico Piai, Disheng Qiu, Yanyan Shen, Divesh Srivastava – Universidade Federal de Pernambuco, Brazil; Roma Tre University, Italy; Amazon; Wanderio; Shanghai Jiao Tong University; AT&T Labs – Research
-
Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper)
— Luciano Barbosa, Valter Crescenzi, Xin Luna Dong, Paolo Merialdo, Federico Piai, Disheng Qiu, Yanyan Shen, Divesh Srivastava – Universidade Federal de Pernambuco, Brazil; Roma Tre University, Italy; Amazon; Wanderio; Shanghai Jiao Tong University; AT&T Labs – Research
-
Follow The Money: Online Piracy and Self-Regulation in the Advertising Industry
— Michail Batikas, Jörg Claussen, Christian Peukert – LMU Munich, Germany; UCP – Católica Lisbon School of Business and Economics, Lisboa, Portugal
-
Beagle: Automated Extraction and Interpretation of Visualizations from the Web
— Leilani Battle, Peitong Duan, Zachery Miranda, Dana Mukusheva, Remco Chang, Michael Stonebraker – University of Washington, Seattle, WA, USA; Massachusetts Institute of Technology, Cambridge, MA, USA; Tufts University, Medford, MA, USA
-
Data Science with Vadalog: Bridging Machine Learning and Reasoning
— Luigi Bellomarini, Ruslan R Fayzrakhmanov, Georg Gottlob, Andrey Kravchenko, Eleonora Laurenza, Yavor Nenov, Stephane Reissfelder, Emanuel Sallinger, Evgeny Sherkhonov, Lianlong Wu – University of Oxford, United Kingdom; Banca d’Italia, Italy; TU Wien, Austria
-
Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl
— Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast – Bauhaus-Universität Weimar, Germany; Leipzig University, Germany
-
D1.2: Report on Improving Translation with Monolingual Data
— Fabienne Braune, Alex Fraser, Barry Haddow – University of Edinburgh
-
UWB at SemEval-2018 Task 10: Capturing Discriminative Attributes from Word Distributions
— Tomáš Brychcín, Tomáš Hercig, Josef Steinberger, Michal Konkol – University of West Bohemia, Czech Republic
-
Ten years of webtables
— Michael Cafarella, Alon Halevy, Hongrae Lee, Jayant Madhavan, Cong Yu, Daisy Zhe Wang, Eugene Wu – Google Inc.; University of Michigan, USA; Megagon Labs; University of Florida, USA; Columbia University, USA
-
Zewen at SemEval-2018 Task 1: An Ensemble Model for Affect Prediction in Tweets
— Zewen Chi, Heyan Huang, Jiangui Chen, Hao Wu, Ran Wei – Beijing Institute of Technology, China
-
Are Automatic Metrics Robust and Reliable in Specific Machine Translation Tasks?
— Mara Chinea-Rios, Alvaro Peris, Francisco Casacuberta – Universitat d'Alacant, Spain
-
A multilayer convolutional encoder-decoder neural network for grammatical error correction
— Shamil Chollampatt, Hwee Tou Ng – NUS Graduate School for Integrative Sciences and Engineering; Department of Computer Science, National University of Singapore
-
Pangloss: Fast Entity Linking in Noisy Text Environments
— Michael Conover, Matthew Hayes, Scott Blackburn, Pete Skomoroch, Sam Shah – Workday, Inc., San Francisco, CA, USA
-
Investigating open data portals automatically: a methodology and some illustrations
— Andreiwid Sheffer Correa, Pär-Ola Zander, Flavio Soares Correa da Silva – University of Sao Paulo, Sao Paulo, Brazil; Aalborg University, Aalborg, Denmark
-
An Approach to Geotag a Web Sized Corpus of Documents with Addresses in Randstad, Netherlands
— Alexander Czech – TU Wien, Austria
-
Unsupervised Domain Adaptation by Adversarial Learning for Robust Speech Recognition
— Pavel Denisov, Ngoc Thang Vu, Marc Ferras Font – University of Stuttgart, Germany
-
Absolute Orientation for Word Embedding Alignment
— Sunipa Dev, Safia Hassan, Jeff M Phillips – University of Utah
-
Understanding Back-Translation at Scale
— Sergey Edunov, Myle Ott, Michael Auli, David Grangier – Facebook AI Research, USA; Google Brain, Mountain View, CA, USA
-
A Geo-Tagging Framework for Address Extraction from Web Pages
— Julia Efremova, Ian Endres, Isaac Vidas, Ofer Melnik – HERE Technologies, Amsterdam, The Netherlands
-
Belittling the Source: Trustworthiness Indicators to Obfuscate Fake News on the Web
— Diego Esteves, Aniketh Janardhan Reddy, Piyush Chawla, Jens Lehmann – University of Bonn, Germany; University of Ohio, USA; Carnegie Mellon University, Pittsburgh, USA
-
MIsA: Multilingual IsA Extraction from Corpora
— Stefano Faralli, Els Lefever, Simone Paolo Ponzetto – University of Mannheim, Germany; Ghent University, Belgium
-
Analysis of the Web Graph Aggregated by Host and Pay-Level Domain
— Agostino Funel – ENEA, Italy
-
Not just about size - A Study on the Role of Distributed Word Representations in the Analysis of Scientific Publications
— Andres Garcia, Jose Manuel Gomez-Perez – expertsystem.com, Madrid, Spain
-
TabVec: Table Vectors for Classification of Web Tables
— Majid Ghasemi-Gol, Pedro Szekely – University of Southern California; Information Science Institute
-
Discovering Implicit Knowledge with Unary Relations
— Michael Glass, Alfio Gliozzo – IBM Research AI
-
A Dataset for Web-Scale Knowledge Base Population
— Michael Glass, Alfio Gliozzo – Knowledge Induction and Reasoning Group, IBM Research AI, New York, USA
-
“I think it might help if we multiply, and not add”: Detecting Indirectness in Conversation
— Pranav Goel, Yoichi Matsuyama, Michael Madaio, Justine Cassell – Indian Institute of Technology (BHU), India; Carnegie Mellon University
-
Legal Deposit Web Archives and the Digital Humanities: A Universe of Lost Opportunity?
— Paul Gooding, Melissa Terras, Linda Berube – University of East Anglia, United Kingdom; University of Edinburgh, United Kingdom
-
Semantic projection: recovering human knowledge of multiple, distinct object features from word embeddings
— Gabriel Grand, Idan Asher Blank, Francisco Pereira, Evelina Fedorenko – Harvard University; Massachusetts Institute of Technology; Siemens Healthineers; Massachusetts General Hospital; Harvard Medical School
-
Learning word vectors for 157 languages
— Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, Tomas Mikolov – Facebook AI Research; École polytechnique fédérale de Lausanne EPFL, Switzerland
-
Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation
— Roman Grundkiewicz, Marcin Junczys-Dowmunt – University of Edinburgh; Microsoft
-
Leveraging Meta-Embeddings for Bilingual Lexicon Extraction from Specialized Comparable Corpora
— Amir Hazem, Emmanuel Morin – Université de Nantes, France
-
ClaiRE at SemEval-2018 Task 7: Classification of Relations using Embeddings
— Lena Hettinger, Alexander Dallmann, Albin Zehe, Thomas Niebler, Andreas Hotho – University of Würzburg, Germany
-
ClaiRE at SemEval-2018 Task 7 - Extended Version
— Lena Hettinger, Alexander Dallmann, Albin Zehe, Thomas Niebler, Andreas Hotho – University of Würzburg, Germany
-
Large Margin Neural Language Model
— Jiaji Huang, Yi Li, Wei Ping, Liang Huang – Baidu Research, Sunnyvale, CA, USA; School of EECS, Oregon State University, Corvallis, OR, USA
-
Közös crawlnak is egy korpusz a vége-Korpuszépítés a CommonCrawl .hu domainjából
— Balázs Indig – MTA-PPKE Magyar Nyelvtechnológiai Kutatócsoport, Hungary
-
Cuttlefish: A Lightweight Primitive for Adaptive Query Processing
— Tomer Kaftan, Magdalena Balazinska, Alvin Cheung, Johannes Gehrke – University of Washington; Microsoft
-
Webarchiválás és a történeti kutatások / Web Archiving and Historical Research
— Kokas Károly, Drótos László – Országos Széchényi Könyvtár, Hungary; SZTE Klebelsberg Könyvtár, Hungary
-
DL Team at SemEval-2018 Task 1: Tweet Affect Detection using Sentiment Lexicons and Embeddings
— Dmitry Kravchenko, Lidia Pivovarova – Ben-Gurion University of the Negev, Israel; University of Helsinki, Finland
-
Discriminator at SemEval-2018 Task 10: Minimally Supervised Discrimination
— Artur Kulmizev, Mostafa Abdou, Vinit Ravishankar, Malvina Nissim – University of Groningen, The Netherlands; Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic
-
Using toponym co-occurrences to measure relationships between places: review, application and evaluation
— Evert Meijers, Antoine Peris – Delft University of Technology, The Netherlands
-
TCS Research at SemEval-2018 Task 1: Learning Robust Representations using Multi-Attention Architecture
— Hardik Meisheri, Lipika Dey – TCS Research, New Delhi, India
-
A Quantitative Analysis of the Use of Microdata for Semantic Annotations on Educational Resources
— Rosa Navarrete, Sergio Luján Mora – Universidad de Alicante, Spain
-
eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing
— Matteo Negri, Marco Turchi, Rajen Chatterjee, Nicola Bertoldi – Fondazione Bruno Kessler, Trento, Italy; University of Trento, Italy
-
Emergency vocabulary
— Dávid Márk Nemeskey, András Kornai – HAS Institute of Computer Science, Budapest, Hungary
-
EmbNum: Semantic labeling for numerical values with deep metric learning
— Phuc Nguyen, Khai Nguyen, Ryutaro Ichise, Hideaki Takeda – SOKENDAI (The Graduate University for Advanced Studies), Shonan Village, Hayama, Kanagawa, Japan; National Institute of Informatics, Tokyo, Japan
-
Bi-Directional Neural Machine Translation with Synthetic Parallel Data
— Xing Niu, Michael Denkowski, Marine Carpuat – University of Maryland; Amazon.com, Inc.
-
SDC: structured data collection by yourself
— Takuya Ohshima, Motomichi Toyama – Keio University, Yokohama, Kanagawa, Japan
-
Evaluation of sentence embeddings in downstream and linguistic probing tasks
— Christian S. Perone, Roberto Silveira, Thomas S. Paula – Universitat Politècnica de Catalunya, Barcelona, Spain
-
Compact inverted index storage using general-purpose compression libraries
— Matthias Petri, Alistair Moffat – University of Melbourne, Australia
-
Card-660: Cambridge Rare Word Dataset - a Reliable Benchmark for Infrequent Word Representation Models
— Mohammad Taher Pilehvar, Dimitri Kartsaklis, Victor Prokhorov, Nigel Collier – University of Cambridge, United Kingdom
-
Analyzing conversations to automatically identify action items
— Roy Raanani, Russell Levy, Micha Yochanan Breakstone, Dominik Facher – Affectlayer Inc
-
Analyzing conversations to automatically identify customer pain points
— Roy Raanani, Russell Levy, Micha Yochanan Breakstone, Dominik Facher – Affectlayer Inc
-
Analyzing conversations to automatically identify product features that resonate with customers
— Roy Raanani, Russell Levy, Micha Yochanan Breakstone, Dominik Facher – Affectlayer Inc
-
Lessons from Natural Language Inference in the Clinical Domain
— Alexey Romanov, Chaitanya Shivade – University of Massachusetts Lowell, USA; IBM Almaden Research Center, San Jose, CA, USA
-
MEADE: Towards a Malicious Email Attachment Detection Engine
— Ethan M. Rudd, Richard Harang, Joshua Saxe – Sophos Group PLC, VA, USA
-
CUNI team: CLEF eHealth Consumer Health Search Task 2018
— Shadi Saleh, Pavel Pecina – Charles University, Czech Republic
-
BomJi at SemEval-2018 Task 10: Combining Vector-, Pattern- and Graph-based Information to Identify Discriminative Attributes
— Enrico Santus, Chris Biemann, Emmanuele Chersoni – Massachusetts Institute of Technology, USA; Universität Hamburg, Germany; Aix-Marseille University, France
-
Learning Word Embeddings for Data Sparse and Sentiment Rich Data Sets
— Prathusha Kameswara Sarma – University of Wisconsin-Madison
-
Domain Adapted Word Embeddings for Improved Sentiment Classification
— Prathusha K Sarma, Yingyu Liang, William A Sethares – University of Wisconsin-Madison
-
Simple Algorithms For Sentiment Analysis On Sentiment Rich, Data Poor Domains
— Prathusha K Sarma, William Sethares – University of Wisconsin-Madison
-
On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl
— Sebastian Schelter, Jérôme Kunegis – Technical University Berlin, Germany; University of Namur, Belgium
-
Filtering and Mining Parallel Data in a Joint Multilingual Space
— Holger Schwenk – Facebook AI Research
-
Evidence of semantic processing difficulty in naturalistic reading
— Cory Shain, Richard Futrell, Marten van Schijndel, Edward Gibson, William Schuler – Ohio State University; MIT; Johns Hopkins University
-
A Machine Learning Approach to Correlate Emotional Intelligence and Happiness Based on Twitter Data
— Sistla Sai Shravani, Niraj Kumar Jha, Rajlaksmi Guha – IIT Kharagpur, India
-
Intent Generation for Goal-Oriented Dialogue Systems based on Schema.org Annotations
— Umutcan Şimşek, Dieter Fensel – University of Innsbruck, Austria
-
Using open data to predict market movements
— Ravinder Singh, Marina Levina, Nelson Jiao, Asha Saini – DELL EMC
-
ArgumenText: Searching for Arguments in Heterogeneous Sources
— Christian Stab, Johannes Daxenberger, Chris Stahlhut, Tristan Miller, Benjamin Schiller, Christopher Tauchmann, Steffen Eger, Iryna Gurevych – Ubiquitous Knowledge Processing Lab, Department of Computer Science, Technische Universität Darmstadt, Germany
-
Ontology Augmentation Through Matching with Web Tables
— Oliver Lehmberg, Oktie Hassanzadeh – University of Mannheim, Germany; IBM Research, Yorktown Heights, New York, USA
-
Multimodal Language Analysis with Recurrent Multistage Fusion: Supplementary Material
— Paul Pu Liang, Ziyin Liu, Amir Zadeh, Louis-Philippe Morency – Carnegie Mellon University
-
Searching Arguments in German with ArgumenText
— Chris Stahlhut – Ubiquitous Knowledge Processing Lab, TU Darmstadt, Germany
-
Shortcutting Label Propagation for Distributed Connected Components
— Stergios Stergiou, Dipen Rughwani, Kostas Tsioutsiouliklis – Yahoo Research, Sunnyvale, CA, USA; Google & Yahoo Research, Mountain View, CA, USA
-
Emotion Detection and Classification in a Multigenre Corpus with Joint Multi-Task Deep Learning
— Shabnam Tafreshi, Mona Diab – George Washington University
-
Can Common Crawl reliably track persistent identifier (PID) use over time?
— Henry S. Thompson, Jian Tong – University of Edinburgh, United Kingdom
-
Debunking Fake News One Feature at a Time
— Melanie Tosik, Antonio Mallia, Kedar Gangopadhyay – New York University
-
Inducing Grammars with and for Neural Machine Translation
— Ke Tran, Yonatan Bisk – University of Amsterdam; University of Washington
-
A Simple Method for Commonsense Reasoning
— Trieu H Trinh, Quoc V Le – Google Brain
-
AmritaNLP at SemEval-2018 Task 10: Capturing discriminative attributes using convolution neural network over global vector representation
— Vivek Vinayan, Kumar M Anand, K P Soman – Amrita School of Engineering, India
-
Identifying Semantic Divergences in Parallel Text without Annotations
— Yogarshi Vyas, Xing Niu, Marine Carpuat – Department of Computer Science, University of Maryland
-
Code-Switched Named Entity Recognition with Embedding Attention
— Changhan Wang, Kyunghyun Cho, Douwe Kiela – Facebook AI Research; New York University
-
Detection of mergeable Wikipedia articles based on overlapping topics
— Renzhi Wang, Mizuho Iwaihara – Graduate School of Information, Production and Systems, Waseda University, Japan
-
ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations
— John Wieting, Kevin Gimpel – Carnegie Mellon University, Pittsburgh, PA, USA; Toyota Technological Institute at Chicago, IL, USA
-
Heuristics-based Query Reordering for Federated Queries in SPARQL 1.1 and SPARQL-LD
— Thanos Yannakis, Pavlos Fafalios, Yannis Tzitzikas – University of Crete, Greece; Leibniz University of Hannover, Germany
-
GraphIt – A High-Performance DSL for Graph Analytics
— Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, Saman P. Amarasinghe – MIT CSAIL; Adobe Research
-
SLIND: Identifying Stable Links in Online Social Networks
— Ji Zhang, Leonard Tan, Xiaohui Tao, Xiaoyao Zheng, Yonglong Luo, Jerry Chun-Wei Lin – University of Southern Queensland, Australia; Anhui Normal University, Wuhu, China; Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
-
What can we learn from Semantic Tagging?
— Mostafa Abdou, Artur Kulmizev, Vinit Ravishankar, Lasha Abzianidze, Johan Bos – University of Groningen, The Netherlands; University of Copenhagen, Denmark; University of Oslo, Norway
-
Learning emotion-enriched word representations
— Ameeta Agrawal, Aijun An, Manos Papagelis – York University, Toronto, Canada
-
Wikipedia text reuse: within and without
— Milad Alshomary, Michael Völske, Tristan Licht, Henning Wachsmuth, Benno Stein, Matthias Hagen, Martin Potthast – Paderborn University, Germany; Bauhaus-Universität Weimar, Germany; Martin-Luther-Universität Halle-Wittenberg, Germany; Leipzig University, Germany
-
Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations
— Mikel Artetxe, Gorka Labaka, Eneko Agirre – University of the Basque Country, Spain
-
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
— Mikel Artetxe, Gorka Labaka, Eneko Agirre – University of the Basque Country, Spain
-
Margin-based parallel corpus mining with multilingual sentence embeddings
— Mikel Artetxe, Holger Schwenk – University of the Basque Country, Spain; Facebook AI Research
-
Towards two-dimensional sequence to sequence model in neural machine translation
— Parnia Bahar, Christopher Brix, Hermann Ney – RWTH Aachen University, Germany
-
Entity-oriented search
— Krisztian Balog – University of Stavanger, Norway
-
Machine Translation Human Evaluation: an investigation of evaluation based on Post-Editing and its relation with Direct Assessment
— Luisa Bentivogli, Mauro Cettolo, Marcello Federico, Christian Federmann – FBK, Trento, Italy; Amazon AI, East Palo Alto, CA, USA; Microsoft Cloud+AI, Redmond, WA, USA
-
BUbiNG: Massive crawling for the masses
— Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna – Università degli Studi di Milano, Italy
-
Studying the Difference Between Natural and Programming Language Corpora
— Casey Casalnuovo, Kenji Sagae, Prem Devanbu – University of California, Davis, USA
-
Leveraging Unpaired Out-of-Domain Data for Image Captioning
— Xinghan Chen, Mingxing Zhang, Zheng Wang, Lin Zuo, Bo Li, Yang Yang – University of Electronic Science and Technology of China (UESTC), Chengdu, PR China
-
User-Centric Ontology Population
— Kenneth Clarkson, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, Joseph Terdiman, Steve Welch – IBM Research Almaden, San Jose, USA
-
Bringing Order to Neural Word Embeddings with Embeddings Augmented by Random Permutations (EARP)
— Trevor Cohen, Dominic Widdows – University of Washington, Seattle, USA; Grab, Inc., Seattle, WA, USA
-
SentEval: An evaluation toolkit for universal sentence representations
— Alexis Conneau, Douwe Kiela – Facebook Artificial Intelligence Research
-
XNLI: Evaluating Cross-lingual Sentence Representations
— Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R Bowman, Holger Schwenk, Veselin Stoyanov – Facebook AI Research, USA; New York University, USA
-
Research Frontiers in Information Retrieval: Report from the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018)
— J Shane Culpepper, Fernando Diaz, Mark D. Smucker – ACM
-
Zero-Shot Object Detection by Hybrid Region Embedding
— Berkan Demirel, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis – HAVELSAN Inc., Ankara, Turkey; Middle East Technical University, Ankara, Turkey; Hacettepe University, Ankara, Turkey
-
Interactions and influence of world painters from the reduced Google matrix of Wikipedia networks
— Samer El Zant, Katia Jaffrès-Runser, Klaus M. Frahm, Dima L. Shepelyansky – Université de Toulouse, France
-
M1.2 – Corpora for the Machine Translation Engines
— Cristina Espana-Bonet, Juliane Stiller, Sophie Henning – Universität des Saarlandes, Germany; Humboldt-Universität zu Berlin, Germany
-
Browserless web data extraction: challenges and opportunities
— Ruslan R. Fayzrakhmanov, Emanuel Sallinger, Ben Spencer, Tim Furche, Georg Gottlob – University of Oxford, Oxford, United Kingdom
-
Word embeddings quantify 100 years of gender and ethnic stereotypes
— Nikhil Garg, Londa Schiebinger, Dan Jurafsky, James Zou – Stanford University, USA; Chan Zuckerberg Biohub, San Francisco, CA, USA
-
Inducing implicit relations from text using distantly supervised deep nets
— Michael Glass, Alfio Gliozzo, Oktie Hassanzadeh, Nandana Mihindukulasooriya, Gaetano Rossiello – IBM Research AI, New York, USA; Universidad Politécnica de Madrid, Spain; University of Bari, Italy
-
Combining Shallow and Deep Learning for Aggressive Text Detection
— Viktor Golem, Mladen Karan, Jan Šnajder – University of Zagreb, Croatia
-
Pooling Word Vector Representations Across Models
— Rajendra Banjade, Nabin Maharjan, Dipesh Gautam, Frank Andrasik, Arthur C. Graesser, Vasile Rus – University of Memphis, USA
-
Training a Neural Network in a Low-Resource Setting on Automatically Annotated Noisy Data
— Michael A. Hedderich, Dietrich Klakow – Saarland University, Saarbrücken, Germany
-
Adversarial example generation with syntactically controlled paraphrase networks
— Mohit Iyyer, John Wieting, Kevin Gimpel, Luke Zettlemoyer – Allen Institute of Artificial Intelligence, Seattle, United States; UMass Amherst, United States; Carnegie Mellon University, Pittsburgh, PA, USA; Toyota Technological Institute at Chicago, IL, USA; University of Washington, Seattle, WA, USA
-
Loss in translation: Learning bilingual word mapping with a retrieval criterion
— Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, Edouard Grave – Facebook AI Research
-
Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task
— Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, Kenneth Heafield – University of Edinburgh, United Kingdom; Microsoft
-
Measuring the evolution of a scientific field through citation frames
— David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, Dan Jurafsky – University of Michigan, USA; Stanford University, USA
-
Determination of content score
— Alexander Kagoshima, Kai Londenberg, Fang Xu – Searchmetrics GmbH
-
Methods and systems for query segmentation
— Ajinkya Gorakhnath Kale, Thrivikrama Taula, Amit Srivastava, Sanjika Hewavitharana – eBay Inc.
-
A domain is only as good as its buddies: detecting stealthy malicious domains via graph inference
— Issa M. Khalil, Bei Guan, Mohamed Nabeel, Ting Yu – Qatar Computing Research Institute, Doha, Qatar
-
Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation
— Huda Khayrallah, Brian Thompson, Kevin Duh, Philipp Koehn – Johns Hopkins University, USA
-
Dynamic meta-embeddings for improved sentence representations
— Douwe Kiela, Changhan Wang, Kyunghyun Cho – Facebook AI Research, USA; New York University, USA; CIFAR Global Scholar, Canada
-
Reproducible Web Corpora: Interactive Archiving with Automatic Quality Assessment
— Johannes Kiesel, Florian Kneist, Milad Alshomary, Benno Stein, Matthias Hagen, Martin Potthast – Paderborn University, Germany; Bauhaus-Universität Weimar, Germany; Martin-Luther-Universität Halle-Wittenberg, Germany; Leipzig University, Germany; Ulm University, Germany
-
Textbook Question Answering with Knowledge Graph Understanding and Unsupervised Open-set Text Comprehension
— Daesik Kim, Seonhoon Kim, Nojun Kwak – Seoul National University, South Korea; V.DO Inc., South Korea; Naver Corporation, South Korea
-
Mixture of Expert/Imitator Networks: Scalable Semi-supervised Learning Framework
— Shun Kiyono, Jun Suzuki, Kentaro Inui – Tohoku University, Japan; Center for Advanced Intelligence Project, Japan
-
Context and Copying in Neural Machine Translation
— Rebecca Knowles, Philipp Koehn – Johns Hopkins University, USA
-
Abstractive Summarization Using Attentive Neural Techniques
— Jacob Krantz, Jugal Kalita – Gonzaga University, USA; University of Colorado, USA
-
Multilingual word embeddings and their utility in cross-lingual learning
— Artur Kulmizev – University of Groningen, The Netherlands
-
Inferring hidden causal relations between pathway members using reduced Google matrix of directed biological networks
— José Lages, Dima L. Shepelyansky, Andrei Zinovyev – Université de Franche-Comté, Besançon, France
-
Youtube av 50k: an annotated corpus for comments in autonomous vehicles
— Tao Li, Lei Lin, Minsoo Choi, Kaiming Fu, Siyuan Gong, Jian Wang – Purdue University, Indiana, USA
-
Cloud repository as a malicious service: challenge, identification and implication
— Xiaojing Liao, Sumayah Alrwais, Kan Yuan, Luyi Xing, XiaoFeng Wang, Shuang Hao, Raheem Beyah – Indiana University Bloomington, USA; King Saud University, Saudi Arabia; University of Texas at Dallas, USA; Georgia Institute of Technology, USA
-
The USTC-NEL Speech Translation system at IWSLT 2018
— Dan Liu, Junhua Liu, Wu Guo, Shifu Xiong, Zhiqiang Ma, Rui Song, Chongliang Wu, Quan Liu – University of Science and Technology of China, China; IFLYTEK Co. LTD.
-
Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos
— Bingbin Liu, Serena Yeung, Edward Chou, De-An Huang, Li Fei-Fei, Juan Carlos Niebles – Stanford University, USA; Google Cloud AI, Mountain View, USA
-
Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the Parallel Corpus Filtering task
— Chi-kiu Lo, Michel Simard, Darlene Stewart, Samuel Larkin, Cyril Goutte, Patrick Littell – National Research Council, Canada
-
CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web
— Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar – amazon.com
-
Learning to Rank Query Graphs for Complex Question Answering over Knowledge Graphs
— Gaurav Maheshwari, Priyansh Trivedi, Denis Lukovnikov, Nilesh Chakraborty, Asja Fischer, Jens Lehmann – University of Bonn, Germany; Ruhr University, Bochum, Germany
-
Information extraction meets the Semantic Web: A survey
— Jose L. Martinez-Rodriguez, Aidan Hogan, Ivan Lopez-Arevalo – Cinvestav Tamaulipas, Ciudad Victoria, Mexico; University of Chile, Chile
-
The natural language decathlon: Multitask learning as question answering
— Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Richard Socher – Salesforce Research
-
Natural language processing using context-specific word vectors
— Bryan McCann, Caiming Xiong, Richard Socher – Salesforce.com, Inc.
-
Natural language processing using a neural network
— Bryan McCann, Caiming Xiong, Richard Socher – Salesforce.com, Inc.
-
Can a suit of armor conduct electricity? a new dataset for open book question answering
— Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal – Allen Institute for Artificial Intelligence, Seattle, USA; Heidelberg University, Germany
-
Efficient and Robust Question Answering from Minimal Context over Documents
— Sewon Min, Victor Zhong, Richard Socher, Caiming Xiong – Seoul National University, South Korea; Salesforce Research
-
Detecting signs of dementia using word vector representations
— Bahman Mirheidari, Daniel Blackburn, Traci Walker, Annalena Venneri, Markus Reuber, Heidi Christensen – University of Sheffield, United Kingdom; Royal Hallamshire Hospital, United Kingdom
-
Index compression using byte-aligned ANS coding and two-dimensional contexts
— Alistair Moffat, Matthias Petri – University of Melbourne, Australia
-
Merging search results generated by multiple query variants using data fusion
— Nkwebi Motlogelwa, Edwin Thuma, Tebo Leburu-Dingalo – University of Botswana, Botswana
-
Automatically Categorizing Software Technologies
— Mathieu Nassif, Christoph Treude, Martin Robillard – McGill University School of Computer Science, Montreal, Quebec, Canada
-
Analyzing uncertainty in neural machine translation
— Myle Ott, Michael Auli, David Grangier, Marc'Aurelio Ranzato – Facebook AI Research, USA
-
Dank Learning: Generating Memes Using Deep Neural Networks
— Abel L. Peirson, E. Meltem Tolunay – Stanford University, USA
-
Style Transfer Through Back-Translation
— Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, Alan W Black – Carnegie Mellon University, Pittsburgh, PA, USA
-
Analyzing conversations to automatically identify product feature requests
— Roy Raanani, Russell Levy, Micha Yochanan Breakstone, Dominik Facher – Affectlayer Inc
-
Automatic generation of playlists from conversations
— Roy Raanani, Russell Levy, Micha Yochanan Breakstone – Affectlayer Inc
-
Coordinating voice calls between representatives and customers to influence an outcome of the call
— Roy Raanani, Russell Levy, Micha Yochanan Breakstone – Affectlayer Inc
-
Modeling voice calls to improve an outcome of a call between a representative and a customer
— Roy Raanani, Russell Levy, Micha Yochanan Breakstone – Affectlayer Inc
-
Automatic pattern recognition in conversations
— Roy Raanani, Russell Levy, Dominik Facher, Micha Yochanan Breakstone – Affectlayer Inc
-
Analyzing conversations to automatically identify deals at risk
— Roy Raanani, Russell Levy, Dominik Facher, Micha Yochanan Breakstone – Affectlayer Inc
-
Global normalized reader systems and methods
— Jonathan Raiman, John Miller – Baidu USA LLC
-
Weaver: Deep Co-Encoding of Questions and Documents for Machine Reading
— Martin Raison, Pierre-Emmanuel Mazaré, Rajarshi Das, Antoine Bordes – Facebook AI Research, Paris, France; University of Massachusetts, Amherst, USA
-
A machine learning approach for product matching and categorization
— Petar Ristoski, Petar Petrovski, Peter Mika, Heiko Paulheim – University of Mannheim, Germany; Yahoo Labs, London, United Kingdom
-
Action Classification via Concepts and Attributes
— Amir Rosenfeld, Shimon Ullman – Weizmann Institute of Science, Rehovot, Israel
-
The RWTH Aachen University filtering system for the WMT 2018 parallel corpus filtering task
— Nick Rossenbach, Jan Rosendahl, Yunsu Kim, Miguel Graça, Aman Gokrani, Hermann Ney – RWTH Aachen University, Germany
-
Using Word Embeddings for Information Retrieval: How Collection and Term Normalization Choices Affect Performance
— Dwaipayan Roy, Debasis Ganguly, Sumit Bhatia, Srikanta Bedathur, Mandar Mitra – Indian Statistical Institute, Kolkata, India; IBM Research, Dublin, Ireland; IBM Research, Delhi, India; Indian Institute of Technology, Delhi, India
-
On the Design and Tuning of Machine Learning Models for Language Toxicity Classification in Online Platforms
— Maciej Rybinski, William Miller, Javier Del Ser, Miren Nekane Bilbao, José F. Aldana-Montes – University of Málaga, Spain; Anami Precision, San Sebastián, Spain; TECNALIA, Bizkaia, Spain; Basque Center for Applied Mathematics (BCAM), Bizkaia, Spain; University of the Basque Country (UPV/EHU), Bilbao, Spain
-
A dataset and reranking method for multimodal MT of user-generated image captions
— Shigehiko Schamoni, Julian Hitschler, Stefan Riezler – Heidelberg University, Germany
-
The RWTH Aachen University supervised machine translation systems for WMT 2018
— Julian Schamper, Jan Rosendahl, Parnia Bahar, Yunsu Kim, Arne Nix, Hermann Ney – RWTH Aachen University, Germany
-
WBI at CLEF eHealth 2018 Task 1: Language-independent ICD-10 coding using multi-lingual embeddings and recurrent neural networks
— Jurica Ševa, Mario Sänger, Ulf Leser – Humboldt-Universität zu Berlin, Germany
-
Out-of-distribution detection using multiple semantic label representations
— Gabi Shalev, Yossi Adi, Joseph Keshet – Bar-Ilan University, Israel
-
A multimodal assessment framework for integrating student writing and drawing in elementary science learning
— Peter Andrew Miller Smith, Samuel Leeman-Munk, Angi Shelton, Bradford W Mott, Eric Wiebe, James Lester – North Carolina State University, Raleigh, NC, USA; SAS Institute Inc., Cary, NC, USA
-
The Knowledge and Language Gap in Medical Information Seeking
— Luca Soldaini – Georgetown University, USA
-
Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks
— Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, Daniel Gildea – University of Rochester, Rochester, NY, USA; IBM T.J. Watson Research Center, Yorktown Heights, NY, USA; School of Engineering, Westlake University, China
-
A social robot system for modeling children’s word pronunciation: socially interactive agents track
— Samuel Spaulding, Huili Chen, Safinah Ali, Michael Kulinski, Cynthia Breazeal – Massachusetts Institute of Technology, Cambridge, MA, USA
-
The University of Cambridge’s Machine Translation Systems for WMT18
— Felix Stahlberg, Adria de Gispert, Bill Byrne – University of Cambridge, United Kingdom; SDL Research, Cambridge, United Kingdom
-
Overview of the CLEF ehealth evaluation lab 2018
— Hanna Suominen, Liadh Kelly, Lorraine Goeuriot, Aurélie Névéol, Lionel Ramadier, Aude Robert, Evangelos Kanoulas, Rene Spijker, Leif Azzopardi, Dan Li, others – University of Turku, Turku, Finland; The Australian National University (ANU), Australia; Commonwealth Scientific and Industrial Research Organisation (CSIRO), University of Canberra, Canberra, Australia; Maynooth University, Maynooth, Ireland; Univ. Grenoble Alpes, CNRS, Grenoble, France; Université Paris-Saclay, Orsay, France; INSERM, France; University of Amsterdam, Amsterdam, Netherlands; Cochrane Netherlands and UMC Utrecht; Julius Center for Health Sciences and Primary Care, Utrecht, Netherlands; University of Strathclyde, Glasgow, UK; Queensland University of Technology, Brisbane, Australia; Vienna University of Technology, Vienna, Austria; Qatar Computing Research Institute, Doha, Qatar
-
Inferring missing categorical information in noisy and sparse web markup
— Nicolas Tempelmeier, Elena Demidova, Stefan Dietze – Leibniz Universität Hannover, Germany
-
Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation
— Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson, Philipp Koehn – Johns Hopkins University, USA; University of Notre Dame, USA; Air Force Research Laboratory, USA
-
Facilitating mapping of control policies to regulatory documents
— Swapna Buccapatnam Tirumala, Ashish Jagmohan, Elham Khabiri, Ta-Hsin Li, Matthew Daniel Riemer, Vadim Sheinin, Aditya Vempaty – International Business Machines Corp.
-
Searching for the X-Factor: Exploring Corpus Subjectivity for Word Embeddings
— Maksim Tkachenko, Chong Cher Chia, Hady Lauw – Singapore Management University, Singapore
-
Creation and optimization of resource contents
— Marcus Tober, Daniela Neumann – Searchmetrics GmbH
-
Watset: local-global graph clustering with applications in sense and frame induction
— Dmitry Ustalov, Alexander Panchenko, Chris Biemann, Simone Paolo Ponzetto – University of Mannheim, Germany; University of Hamburg, Germany; Skolkovo Institute of Science and Technology, Moskva, Russia
-
Unsupervised sense-aware hypernymy extraction
— Dmitry Ustalov, Alexander Panchenko, Chris Biemann, Simone Paolo Ponzetto – University of Mannheim, Germany; University of Hamburg, Germany
-
Unsupervised semantic frame induction using triclustering
— Dmitry Ustalov, Alexander Panchenko, Andrei Kutuzov, Chris Biemann, Simone Paolo Ponzetto – University of Mannheim, Germany; University of Hamburg, Germany; University of Oslo, Norway
-
Artificial intelligence, economics, and industrial organization
— Hal Varian – National Bureau of Economic Research, Cambridge, MA, USA
-
Neural Machine Translation with Decoding History Enhanced Attention
— Mingxuan Wang, Jun Xie, Zhixing Tan, Jinsong Su, Deyi Xiong, Chao Bian – Mobile Internet Group, Tencent Technology Co., Ltd; Xiamen University, China; Soochow University, China
-
Systems and methods for improved user interface
— Zhuxiaona Wei, Thuan Nguyen, Iat Chan, Kenny M Liou, Helin Wang, Houchang Lu – Baidu USA LLC
-
Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition
— Genta Indra Winata, Chien-Sheng Wu, Andrea Madotto, Pascale Fung – Hong Kong University of Science and Technology, Hong Kong
-
Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction
— Ziang Xie, Guillaume Genthial, Stanley Xie, Andrew Ng, Dan Jurafsky – Stanford University, USA
-
Multi-channel encoder for neural machine translation
— Hao Xiong, Zhongjun He, Xiaoguang Hu, Hua Wu – Baidu Inc., China
-
Preferred Answer Selection in Stack Overflow: Better Text Representations… and Metadata, Metadata, Metadata
— Steven Xu, Andrew Bennett, Doris Hoogeveen, Jey Han Lau, Timothy Baldwin – University of Melbourne, Australia
-
Improving personalized consumer health search: notebook for ehealth at clef 2018
— Hua Yang, Teresa Gonçalves – University of Évora, Portugal; ZhongYuan University of Technology, Zhengzhou, China
-
Ranking Documents by Answer-Passage Quality
— Evi Yulianti, Ruey-Cheng Chen, Falk Scholer, W Bruce Croft, Mark Sanderson – RMIT University, Melbourne, Australia; SEEK Ltd., Melbourne, Australia
-
Miracl at clef 2018: Consumer health search task
— Siwar Zayani, Nesrine Ksentini, Mohamed Tmar, Faiez Gargouri – University of Sfax, Tunisia
-
Fully convolutional speech recognition
— Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert – Facebook A.I. Research, Paris, France; Facebook A.I. Research, New York & Menlo Park, USA; CoML, ENS/CNRS/EHESS/INRIA/PSL Research University, Paris, France
-
Swag: A large-scale adversarial dataset for grounded commonsense inference
— Rowan Zellers, Yonatan Bisk, Roy Schwartz, Yejin Choi – University of Washington, USA
-
Comparing Theories of Speaker Choice Using a Model of Classifier Production in Mandarin Chinese
— Meilin Zhan, Roger Levy – Massachusetts Institute of Technology, USA
-
Two-Step Multi-factor Attention Neural Network for Answer Selection
— Pengqing Zhang, Yuexian Hou, Zhan Su, Yi Su – Tianjin University, China
-
On Link Stability Detection for Online Social Networks
— Ji Zhang, Xiaohui Tao, Leonard Tan, Jerry Chun-Wei Lin, Hongzhou Li, Liang Chang – University of Southern Queensland, Toowoomba, Australia; Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China; Guilin University of Electronic Technology, Guilin, China
-
Neural Machine Translation with Deep Attention
— Biao Zhang, Deyi Xiong, Jinsong Su – Xiamen University, China; Soochow University, China
-
Simplifying Neural Machine Translation with Addition-Subtraction Twin-Gated Recurrent Networks
— Biao Zhang, Deyi Xiong, Jinsong Su, Qian Lin, Huiji Zhang – Xiamen University, China; Soochow University, China; Xiamen Meiya Pico Information Co., Ltd., Xiamen, China
2017
-
Accurate and Efficient General-purpose Boilerplate Detection for Crawled Web Corpora
— Roland Schäfer – Freie Universität Berlin, Germany
-
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
— Daniel Zeman, Martin Popel, Milan Straka, Jan Hajic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinkova, Jan Hajic jr., Jaroslava Hlavacova, Václava Kettnerová, Zdenka Uresova, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria dePaiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonca, Tatiana Lando, Rattima Nitisaroj, Josie Li – Charles University, Czech Republic; Uppsala University, Sweden; University of Turku, Finland; University of Cambridge; Google; Bauhaus-Universität Weimar, Germany; UiT The Arctic University of Norway; University of the Basque Country, Spain; Istanbul Technical University, Turkey; Stanford University; New York University Abu Dhabi; City University of Hong Kong; Ohio State University, USA; University of Turin, Italy; University of Pisa, Italy; IBM Research; Nuance Communications; INRIA – Paris 7, France; University of Tübingen, Germany; DFKI, Germany; text & form, Germany
-
AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP
— Abu Bakr Soliman, Kareem Eissa, Samhaa El-Beltagy – Nile University, Egypt
-
Common Crawl Mining
— Tommy Dean, Ali Pasha, Brian Clarke, Casey J. Butenhoff – Virginia Polytechnic Institute and State University, USA; Eastman Chemical Company, USA
-
Representativeness of latent dirichlet allocation topics estimated from data samples with application to common crawl
— Yuheng Du, Alexander Herzog, Andre Luckow, Ramu Nerella, Christopher Gropp, Amy Apon – Clemson University, USA
-
ATOL: A Framework for Automated Analysis and Categorization of the Darkweb Ecosystem
— Shalini Ghosh, Phillip Porras, Vinod Yegneswaran, Ken Nitz, Ariyam Das – CSL, SRI International, Menlo Park
-
CoNLL 2017 Shared Task – Automatically Annotated Raw Texts and Word Embeddings
— Filip Ginter, Jan Hajič, Juhani Luotolahti, Milan Straka, Daniel Zeman – Charles University, Czech Republic; University of Turku, Finland
-
Extracting Parallel Paragraphs from Common Crawl
— Jakub Kúdela, Irena Holubová, Ondřej Bojar – Charles University, Czech Republic
-
Understanding Regional Context of World Wide Web using Common Crawl Corpus
— Amir Mehmood, Hafiz Muhammad Shafiq, Abdul Waheed – UET, Lahore, Pakistan
-
Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl
— Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann – University of Hamburg, Germany; University of Mannheim, Germany
-
Open bibliometrics and undiscovered public knowledge
— David Stuart – University of Wolverhampton, Wolverhampton, UK
-
Towards semantic query segmentation
— Ajinkya Kale, Thrivikrama Taula, Sanjika Hewavitharana, Amit Srivastava – eBay Inc.
-
Common crawled web corpora: constructing corpora from large amounts of web data
— Kjetil Bugge Kristoffersen – University of Oslo, Norway
2016
2015
-
Alleviation of Disk I/O Contention in Virtualized Settings for Data-Intensive Computing
— Matthew Malensek, Sangmi Lee Pallickara, Shrideep Pallickara – Colorado State University
-
FUSE: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets
— Titus Barik, Kevin Lubick, Justin Smith, John Slankas, Emerson Murphy-Hill – ABB Corporate Research; North Carolina State University
-
Splitting Compounds by Semantic Analogy
— Joachim Daiber, Lautaro Quiroz, Roger Wechsler, Stella Frank – University of Amsterdam
-
Identifying Web Tables – Supporting a Neglected Type of Content on the Web
— Mikhail Galkin, Dmitry Mouromtsev, Sören Auer – ITMO University, St. Petersburg, Russia; University of Bonn, Germany
-
Principled Sampling for Anomaly Detection
— Brendan Juba – Washington University in St. Louis
-
Extracting Usage Patterns of Ontologies on the Web: a Case Study on GoodRelations Vocabulary in RDFa
— Ewa Kowalczuk, Jedrzej Potoniec, Agnieszka Ławrynowicz – Institute of Computing Science, Poznan University of Technology, Poland
-
Making Sense of Performance in Data Analytics Frameworks
— Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun – UC Berkeley; ICSI; VMware; Seoul National University
-
Azmat: Sentence Similarity using Associative Matrices
— Evan Jaffe, Lifeng Jin, David King, Marten van Schijndel – Dept. of Linguistics, Ohio State University
-
Text Segmentation based on Semantic Word Embeddings
— Alexander A Alemi, Paul Ginsparg – Dept. of Physics, Cornell University; Dept. of Physics and Information Science, Cornell University
-
The Graph Structure in the Web – Analyzed on Different Aggregation Levels
— Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer – University of Mannheim, Germany; Università degli Studi di Milano, Italy
-
Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce
— Alex Stolz, Martin Hepp – Universitaet der Bundeswehr Munich, Germany
-
Top-k Entity Augmentation Using Consistent Set Covering
— Julian Eberius, Maik Thiele, Katrin Braunschweig, Wolfgang Lehner – Technische Universität Dresden, Germany
2014
-
Improving In-Domain Data Selection For Small In-Domain Sets
— Mohammed Mediani, Joshua Winebarger, Alexander Waibel – Karlsruhe Institute of Technology, Germany
-
A Tunable Language Model for Statistical Machine Translation
— Junfei Guo, Juan Liu, Qi Han, Andreas Maletti – School of Computer, Wuhan University, China; Institute for Natural Language Processing, University of Stuttgart, Germany; Institute for Visualization and Interactive Systems, University of Stuttgart, Germany; Institute of Computer Science, University of Leipzig, Germany
-
Deep Speech: Scaling up end-to-end speech recognition
— Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng – Baidu Research, Silicon Valley AI Lab
-
Dynamic Topic Adaptation for Phrase-based MT
— Eva Hasler, Philipp Koehn, Barry Haddow, Phil Blunsom – University of Edinburgh; University of Oxford
-
Bloom filter-based Routing in NDN
— Michele Tortelli – Politecnico di Bari
-
Fast Training of word2vec Representations Using N-gram Corpora
— Filip Ginter, Jenna Kanerva – University of Turku
-
Development and Utility of Automatic Language Processing Technologies
— Eva Hasler, Barry Haddow, Philipp Koehn – University of Edinburgh
-
Learning Regular Expressions for the Extraction of Product Attributes from E-commerce Microdata
— Petar Petrovski, Volha Bryl, Christian Bizer – Research Group Data and Web Science, University of Mannheim, Germany
-
The Web Data Commons Microdata, RDFa and Microformat Dataset Series
— Robert Meusel, Petar Petrovski, Christian Bizer – Research Group Data and Web Science, University of Mannheim, Germany
-
Focused Crawling for Structured Data
— Robert Meusel, Peter Mika, Roi Blanco – University of Mannheim; Yahoo Labs, Barcelona
-
Document-level Re-ranking with Soft Lexical and Semantic Features for Statistical Machine Translation
— Chenchen Ding, Masao Utiyama, Eiichiro Sumita – National Institute of Information and Communications Technology, Japan
-
Collecting Conceptualized Relations from Terabytes of Web Texts for Understanding Unknown Terms
— Masumi Shirakawa, Kotaro Nakayama, Eiji Aramaki, Takahiro Hara, Shojiro Nishio – Osaka University
-
Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish
— Jenna Kanerva, Juhani Luotolahti, Veronika Laippala, Filip Ginter
-
Number frequency on the web
— Willem Robert van Hage, Thomas Ploeger, Jesper Hoeksema – SynerScope B.V.; VU University Amsterdam
-
N-gram Counts and Language Models from the Common Crawl
— Christian Buck, Kenneth Heafield, Bas van Ooyen – University of Edinburgh; Stanford University; Owlin BV
-
Anaphora Models and Reordering for Phrase-Based SMT
— Christian Hardmeier, Sara Stymne, Jörg Tiedemann, Aaron Smith, Joakim Nivre – Department of Linguistics and Philology, Uppsala University
-
Machine Translation and Monolingual Postediting: The AFRL WMT-14 System
— Lane O.B. Schwartz, Timothy Anderson, Jeremy Gwinnup, Katherine M. Young – Air Force Research Laboratory; SRA International; N-Space Analysis LLC
-
Latent Domain Translation Models in Mix-of-Domains Haystack
— Hoang Cuong, Khalil Sima’an – Institute for Logic, Language and Computation, University of Amsterdam
-
Weaving the Web(VTT) of Data
— Thomas Steiner, Hannes Mühleisen, Ruben Verborgh, Pierre-Antoine Champin, Benoît Encelle, Yannick Prié – Université de Lyon; Database Architectures Group; Multimedia Lab, Ghent University – iMinds; Université de Nantes
-
TripleProv: Efficient Processing of Lineage Queries in a Native RDF Store
— Marcin Wylot, Philippe Cudré-Mauroux, Paul Groth – eXascale Infolab, University of Fribourg; VU University Amsterdam
-
Graph Structure in the Web — Revisited
— Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer – Data and Web Science Group, University of Mannheim; Laboratory for Web Algorithmics, Università degli Studi di Milano
-
Web-scale Content Reuse Detection
— Calvin Ardi, John Heidemann – USC/Information Sciences Institute
-
Correlation Clustering in MapReduce
— Flavio Chierchetti, Nilesh Dalvi, Ravi Kumar
-
Neural Networks Leverage Corpus-wide Information for Part-of-speech Tagging
— Yuta Tsuboi – IBM Research
-
Translation project adaptation for MT-enhanced computer assisted translation
— Mauro Cettolo, Nicola Bertoldi, Marcello Federico, Holger Schwenk, Loïc Barrault, Christophe Servan – Fondazione Bruno Kessler; University of Le Mans; Xerox Research Centre Europe
-
Efficient Wordgraph Pruning for Interactive Translation Prediction
— Germán Sanchis-Trilles, Daniel Ortiz-Martínez, Francisco Casacuberta – PRHLT Centre, Universidad Politécnica de Valencia
-
Exploratory Analysis of a Terabyte Scale Web Corpus
— Vasilis Kolias, Ioannis Anagnostopoulos, Eleftherios Kayafas – National Technical University of Athens; University of Thessaly
-
Building a Free General-Domain Paraphrase Database for Japanese
— Masahiro Mizukami, Graham Neubig, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura – Nara Institute of Science and Technology
-
GloVe: Global vectors for word representation
— Jeffrey Pennington, Richard Socher, Christopher D. Manning – Stanford University, California, USA
2013
-
Edinburgh SLT and MT System Description for the IWSLT 2013
— Alexandra Birch, Nadir Durrani, Philipp Koehn – School of Informatics, University of Edinburgh
-
Dirt Cheap Web-Scale Parallel Text from the Common Crawl
— Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, Adam Lopez – Johns Hopkins University; University of Edinburgh; University of Zurich; University of Pennsylvania
-
Tunable Distortion Limits and Corpus Cleaning for SMT
— Sara Stymne, Christian Hardmeier, Jörg Tiedemann, Joakim Nivre – Department of Linguistics and Philology, Uppsala University
-
The KIT Translation Systems for IWSLT 2013
— Thanh-Le Ha, Teresa Herrmann, Jan Niehues, Mohammed Mediani, Eunah Cho, Yuqi Zhang, Isabel Slawik, Alex Waibel – Institute for Anthropomatics
-
Traitor: Associating Concepts using the World Wide Web
— Wanno Drijfhout, Oliver Jundt, Lesley Wevers, Djoerd Hiemstra – University of Twente
-
Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis
— Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, Johanna Völker – Data and Web Science Group, University of Mannheim; Database Architectures Group, Centrum Wiskunde & Informatica, Netherlands
2012