Simple Utilities to Bootstrap Corpora And Terms from the Web

Quick start

Introduction

Despite certain obvious drawbacks (e.g. lack of control, sampling and
documentation), there is no doubt that the World Wide Web is a
mine of language data of unprecedented richness and ease of access.

It is also the only viable source of "disposable" corpora
built ad hoc for a specific purpose (e.g. a translation or
interpreting task, the compilation of a terminological database,
domain-specific machine learning tasks). These corpora are essential
resources for language professionals who routinely work with
specialized languages, often in areas where neologisms and new terms
are introduced at a fast pace and where standard reference corpora
have to be complemented by easy-to-construct, focused, up-to-date text
collections.

While it is possible to construct a web-based corpus through manual
queries and downloads, this process is extremely time-consuming. The
time investment is particularly unjustified if the final result is
meant to be a single-use corpus.

What BootCaT does

The BootCaT front-end is a graphical interface for the BootCaT toolkit
(Baroni and Bernardini 2004). It automates the process of finding
reference texts on the web and collating them in a single corpus.

The pipeline allows varying levels of control. In the first step, users
provide a list of single- or multi-word terms to be used as seeds for text
collection. These are then combined into "tuples" of varying length and sent
as queries to a search engine, which returns a list of potentially relevant
URLs. At this point the user has the option of inspecting the URLs and
trimming them; the actual web pages are then retrieved, converted to plain
text and saved in "txt" format. The corpus can thus be interrogated using
most concordancers.
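
The tuple-building step described above can be illustrated with a short
sketch. This is not BootCaT's own code: the function names (build_tuples,
to_query), their parameters and the sample seed terms are hypothetical, and
quoting multi-word terms is only an assumption about how a search engine
might expect phrases to be written.

```python
import itertools
import random


def build_tuples(seeds, tuple_length=3, n_tuples=10, rng_seed=0):
    """Return up to n_tuples distinct combinations of seed terms."""
    # Deduplicate while preserving the order in which seeds were given.
    unique_seeds = list(dict.fromkeys(seeds))
    # All unordered combinations of the requested length.
    combos = list(itertools.combinations(unique_seeds, tuple_length))
    # Sample without replacement so that no tuple is repeated.
    rng = random.Random(rng_seed)
    return rng.sample(combos, min(n_tuples, len(combos)))


def to_query(terms):
    """Join terms into one query string, quoting multi-word terms."""
    # Assumption: phrases are wrapped in double quotes for the search engine.
    return " ".join(f'"{t}"' if " " in t else t for t in terms)


# Hypothetical seed list mixing single- and multi-word terms.
seeds = ["corpus", "concordancer", "term extraction", "web crawling", "tokenizer"]
queries = [to_query(t) for t in build_tuples(seeds, tuple_length=3, n_tuples=5)]
for q in queries:
    print(q)
```

Each printed line is one query that could be sent to a search engine. Longer
tuples should make queries more specific, at the cost of fewer matching
pages; shorter ones do the opposite.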

Using BootCaT one can build a relatively large quick-and-dirty corpus
(typically of about 80 texts, with default parameters and no manual quality
checks) in less than half an hour. This flexible approach makes BootCaT a
very useful tool for translators and translation students; it has been used
in the translation and terminology classroom to build small DIY corpora of
varying size and specialization. As of June 2017, the software has been
downloaded and installed by over 2800 individual users from 74 countries.