following it up with this plain script [5] (the script is published as a gist; it is kept functional and concise for extensibility):

python3 process_wikipedia.py

This will produce a plain .csv file with your corpus.

Obviously:

The URL http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2 can be tweaked to obtain the dump for your language; for details see [4];

For wikiextractor params simply see its man page (I suspect that even its official docs are out of date);
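Putting the pieces together, the whole pipeline boils down to roughly three commands. This is only a sketch: substitute your language's dump URL and output directory, and double-check the wikiextractor flags against the version you have installed:

wget http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2
python3 WikiExtractor.py -o extracted ruwiki-latest-pages-articles.xml.bz2
python3 process_wikipedia.py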

The post-processing script turns Wikipedia files into a table like this:

idx | article_uuid | sentence | cleaned sentence | cleaned sentence length
0 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Жан I де Шатильон (граф де Пентьевр)Жан I де Ш… | жан i де шатильон граф де пентьевр жан i де ша… | 38
1 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Находился под охраной Роберта де Вера, графа О… | находился охраной роберта де вера графа оксфор… | 18
2 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Однако этому воспротивился Генри де Громон, гр… | однако этому воспротивился генри де громон гра… | 14
3 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Король предложил ему в жёны другую важную особ… | король предложил жёны другую важную особу фили… | 48
4 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Жан был освобождён и вернулся во Францию в 138… | жан освобождён вернулся францию году свадьба м… | 52

article_uuid is pseudo-unique, and sentence order is supposed to be preserved within each article.
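Assuming the resulting file is named wiki_corpus.csv (a hypothetical name; use whatever process_wikipedia.py actually wrote) and the columns match the table above, the corpus can be loaded and regrouped by article with pandas:

import pandas as pd

df = pd.read_csv('wiki_corpus.csv')  # hypothetical file name

# sentences of a single article, in their original order
article = df[df['article_uuid'] == '74fb822b-54bb-4bfb-95ef-4eac9465c7d7']
print(article['cleaned sentence'].tolist())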

Motivation

Arguably, the state of current ML instruments enables practitioners [8] to build and deliver scalable NLP pipelines within days. The problem arises only if you do not have a trustworthy public dataset / pre-trained embeddings / language model. This article aims to make this a bit easier by illustrating that preparing a Wikipedia corpus (the most common corpus for word vector training in NLP) can be done in just a couple of hours. So, if you spend 2 days building a model, why spend much more time engineering some crawler to get the data?

High level script walk-through

The wikiExtractor tool saves Wikipedia articles in plain text, separated into <doc> blocks. This can be easily leveraged using the following logic (a rough code sketch follows the list):

Obtain the list of all output files;

Split the files into articles;

Remove any remaining HTML tags and special characters;

Use nltk.sent_tokenize to split into sentences;

To avoid code bulk, keep things simple by assigning a uuid to each article;
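Roughly, that logic looks like this. It is only a sketch with my own names, not the actual gist [5]; it assumes wikiextractor wrote its output under extracted/ and that the nltk punkt model is downloaded:

import glob
import re
import uuid
import nltk  # requires nltk.download('punkt')


def parse_extracted_files(path='extracted'):
    """Yield (article_uuid, sentence) pairs from wikiextractor output files."""
    for fname in glob.glob(path + '/**/wiki_*', recursive=True):
        with open(fname, encoding='utf-8') as f:
            text = f.read()
        # each article sits inside a <doc ...> ... </doc> block
        for article in re.findall(r'<doc.*?>(.*?)</doc>', text, flags=re.DOTALL):
            article_uuid = str(uuid.uuid4())
            # strip any leftover tags
            article = re.sub(r'<[^>]+>', ' ', article)
            for sentence in nltk.sent_tokenize(article):
                yield article_uuid, sentence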

For text pre-processing I just used the following (change it to fit your needs; a minimal sketch follows the list):

Removing non-alphanumeric characters;

Removing stop words;
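A minimal version of that cleaning step might look like this (the stop-word list and the regex are my assumptions, adjust them to your language and needs; requires nltk.download('stopwords')):

import re
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('russian'))


def clean_sentence(sentence):
    # lowercase and keep only alphanumeric characters
    tokens = re.sub(r'[^\w\s]', ' ', sentence.lower()).split()
    # drop stop words
    return ' '.join(t for t in tokens if t not in STOP_WORDS)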

I have the dataset, what do I do next?

Basic use cases

Usually people use one of the following instruments for the most common applied NLP task, embedding words: