Gromoteur is a tool for linguists that gives easy access to textual corpora.
It lets you fetch pages from the Web or from local files, process them, analyze them, and output the results.

Here is a simple diagram of Gromoteur's text handling chains:

A short description of the different paths

Gromoteur can download whole websites, following your rules, or start from a search engine query and fetch the results. Since version 2, it can use your login to access restricted websites (log in with Firefox or Chrome).

It converts the files to pure Unicode text and puts them into your database.
It can handle HTML and PDF files.

Gromoteur can import a whole folder of text or PDF files from your hard drive, or import tab-separated tables that you have exported from your spreadsheet.
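To illustrate the table format, here is a minimal sketch that reads a tab-separated export with Python's standard csv module. It only shows what such a file looks like to a program; it is not Gromoteur's own import code, and the column names "url" and "text" as well as the helper read_tsv are assumptions made for the example.

```python
import csv
import io


def read_tsv(stream):
    """Return the rows of a tab-separated table as a list of dicts.

    The first line of the stream is taken as the header row.
    """
    reader = csv.DictReader(stream, delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical two-column export; real exports may use other columns.
sample = io.StringIO(
    "url\ttext\n"
    "http://example.org/a\tFirst page text\n"
    "http://example.org/b\tSecond page text\n"
)
rows = read_tsv(sample)
```

For a real file, the StringIO object would be replaced by a file opened in text mode with the encoding chosen at export time.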

Gromoteur allows you to look through your data, sort it, filter it, and apply simple tools such as lemmatizers, taggers, and word segmentation for different languages. It includes a graphical selection tool that lets you select specific parts of web pages, for example the central part, thus excluding the repetitive links and ads around the content.

Gromoteur allows you to export the data either into separate files or into a single file. Specific words can be highlighted, and Gromoteur can even output a concordancer view of the data.
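For readers unfamiliar with concordancers, the following sketch shows the general idea of a keyword-in-context view: every occurrence of a word is listed with a fixed amount of text on either side, aligned on the keyword. The function concordance and its width parameter are hypothetical and only illustrate the output format, not how Gromoteur builds its view.

```python
import re


def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that all keywords line up
        # in the same column.
        lines.append(left.rjust(width) + " [" + match.group(0) + "] " + right)
    return lines


text = "Mostly harmless. It was a harmless planet."
lines = concordance(text, "harmless", width=10)
```

Printing the lines one below the other gives the familiar concordance table with the keyword in a central column.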

Gromoteur includes the Nexico tool, a simplified version of Lexico3. It can compute the specific terms of any selection of pages and it can compute textual co-occurrences based on a fast implementation of the cumulative hypergeometric distribution. Tables and images can be exported.

Collocation graph around "harmless" from the Hitchhiker's Guide to the Galaxy, made with the Gromoteur. Arrows point to words that appear surprisingly often in the 5-grams around the source word. For example, "one" appears often in the 5-grams around "harmless", but "harmless" does not appear very often in the 5-grams around "one".
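The cumulative hypergeometric distribution mentioned above answers the question "how likely is it to see a word at least this often in a selection purely by chance?". The sketch below is a plain, unoptimized version using only the standard library; it is not Nexico's fast implementation, and the function and parameter names are assumptions chosen for the example.

```python
import math


def log_binomial(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_upper_tail(k, N, K, n):
    """Return P(X >= k) for a hypergeometric variable X.

    N: number of tokens in the whole corpus
    K: occurrences of the word in the whole corpus
    n: number of tokens in the selection (e.g. the 5-grams around a word)
    k: occurrences of the word in the selection
    """
    log_total = log_binomial(N, n)
    # Only counts i with 0 <= i <= K and 0 <= n - i <= N - K are possible.
    lowest = max(k, 0, n - (N - K))
    highest = min(K, n)
    probability = 0.0
    for i in range(lowest, highest + 1):
        log_term = log_binomial(K, i) + log_binomial(N - K, n - i) - log_total
        probability += math.exp(log_term)
    return min(probability, 1.0)


# Toy numbers: the word occurs 5 times in a 10-token corpus, and all 5
# occurrences fall inside a 5-token selection.
p = hypergeom_upper_tail(5, 10, 5, 5)
```

A very small probability means the word is far more frequent in the selection than chance would suggest, which is the sense in which a term is "specific". With SciPy installed, the same tail should be available as scipy.stats.hypergeom.sf(k - 1, N, K, n).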

Source code

You can also click on the number of the latest version and then on "download tarball".

You'll need:

Python 2 (2.7 or later), plus Qt and PyQt (4.4 or later). You can use easy_install (from setuptools) or pip to install the following modules that Gromoteur relies on:

numpy

scipy

pymmseg

python-magic

requesocks

slate

pdfminer

pattern

gensim

Running "python gromoteur.py" should then start the system.
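If the program does not start, a missing module is a likely cause. The sketch below checks which modules from a list cannot be imported. It is not part of Gromoteur; the helper missing_modules is hypothetical, and the import names in REQUIRED are assumptions, since the name used in an import statement can differ from the name used for installation.

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for the modules listed above: python-magic is
# imported as "magic", and pymmseg presumably as "mmseg".
REQUIRED = ["numpy", "scipy", "mmseg", "magic", "requesocks",
            "slate", "pdfminer", "pattern", "gensim"]
```

Calling missing_modules(REQUIRED) with the same Python interpreter that runs Gromoteur returns the modules that still need to be installed.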

FAQ

What can Gromoteur do that I can't do with a simple websearch?

Well, you can live without the Gromoteur, but it's just faster to have it.

You can, for example, look for a word and get all the sentences that contain this word.
You can do this by hand, which may take years, or you can use Gromoteur, which takes a few minutes or hours.

If you look for all the sentences containing two words, Bing allows you to look for these words, giving
you all the pages containing these words, but these words can be anywhere on the page. Gromoteur can
check these pages and only keep the pages where these two words occur in some way that you can express in a regular expression.

Gromoteur can do this while checking the language of each page, keeping only the pages in the right language. Later, you can use it to put the results nicely into
a concordancer table for further analysis.
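As an example of such a constraint, the sketch below keeps only the pages in which two words occur within the same sentence, in either order. The pattern, the example pages, and the helper keep_pages are all hypothetical; in Gromoteur itself the regular expression is entered in the configuration rather than in code.

```python
import re

# Hypothetical constraint: "harmless" and "planet" in the same sentence,
# in either order. A sentence is approximated here as a stretch of text
# that contains no ".", "!" or "?".
PATTERN = re.compile(
    r"\bharmless\b[^.!?]*\bplanet\b|\bplanet\b[^.!?]*\bharmless\b",
    re.IGNORECASE,
)


def keep_pages(pages, pattern=PATTERN):
    """Return only the (url, text) pairs whose text matches the pattern."""
    return [(url, text) for url, text in pages if pattern.search(text)]


pages = [
    ("http://example.org/1", "A mostly harmless little planet."),
    ("http://example.org/2", "Harmless. Far away there is a planet."),
]
kept = keep_pages(pages)
```

Both pages contain both words, so a plain web search would return both, but only the first page has them in the same sentence and survives the filter.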

My result page always remains empty. What to do?

This may be a bug, but more likely you have set restrictions that are a bit too strict. Check the following:

Can you access the start pages in your browser without using any proxies or VPN? You can configure the VPN on the last configuration page (expert mode).

Are you doing a Bing powered search? If yes, try the "try Bing" button. Do you get a few results?

Do you have anything in the "constraints", the "levels", or "restrict to pages containing"? If yes,
erase it all.

Does your start URL match the URL restriction? If it doesn't, Gromoteur closes down the page collection immediately (because it tries out the first URL, finds that it doesn't match, and has no other URLs to continue with).

Set the maxima to a few dozen.

For the time being, relax most constraints, then put them back into place one by one.

Run again.

Do you still have an empty file? Mail me your configuration file (please find it in the subfolder lib/spiderConfigurations where Gromoteur is installed).

How to add a new language?

For the moment, this is not possible from the interface. Get the source package, add a little language
corpus to the language folder and run "python ngram.py". That should do it.

Or mail me the language example file and I'll add it for everyone in the next version.

Where does this dumb name "Gromoteur" come from and how do you pronounce it?

Many years ago, at the beginning of this millennium, I was working with someone on
possible borders in German compound nouns. We needed real world examples and I programmed a little
script that walked around German web pages keeping the 100 longest words it stumbled upon.

This script was still in Java, and not a line of code is left in the recent Gromoteur,
but the idea remains that this can be used to find words in specific forms, for example really long words.

The French expression gros mot means "cuss word" in English, and literally it translates as "fat word".
And a moteur (de recherche) is a search engine in French. So this is a search engine for
fat words and loads of text, a Gromoteur in a word...

And it just sounds great: It is pronounced as gro [gʁo] and moteur [mɔtœʁ], thus, with vowel harmony, as [gʁomotœʁ].

During my work at the Academy of Sciences in Peking, the tool also received the nice translation 胖摩托, Pàngmótuō in pinyin. To get the tones right, stamp your foot while saying pàng, think "really?" while saying mó, and sing the last syllable tuō.

Can you add the great feature X to the Gromoteur?

There are quite a few things to improve in this software.

Planned future developments include:

Making Bing search work through a proxy (it is not working yet)

Collocation computation for segments other than the current n-grams or complete pages, for example sentences

Finish the implementation of selections in Gromoteur's main window and link those selections to Nexico!