Natural language processing on computational cluster

The aim of the course is to introduce methods required in natural language
processing (processing huge data sets in a distributed environment and performing
machine learning) and to show how to execute them effectively on the ÚFAL
computational Linux cluster. The course covers the ÚFAL network and cluster
architecture, SGE (Sun/Oracle/Son of Grid Engine), related Linux tools, and best
practices.

The course follows the outline in the ÚFAL wiki:
Introduction to ÚFAL
(you will need an ÚFAL wiki account to access that site; each ÚFAL PhD student
is entitled to get a wiki account).

The whole course is taught during the first few weeks of the semester. If you
plan to attend the course, please contact any of the Guarantors listed below.

GPUs

GPU jobs are scheduled as SGE jobs, but in the special gpu-ms.q queue. You need
to specify how many GPUs you want and of what kind, using:

-l gpu=3: ask for 3 GPUs on a single machine

-l gpu=1,gpu_ram=8G: ask for one GPU with at least 8 GB of RAM

-l gpu=1,gpu_cc_min6.1=1: ask for a GPU with CUDA capability at least 6.1

During execution, CUDA_VISIBLE_DEVICES is set to the allocated GPUs.
Note that qrsh jobs by default do not read the environment variables created
by SGE, so you need to use qrsh -l ... -pty yes bash.
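For illustration, a job can verify which GPUs it was allocated by inspecting this
variable, e.g., in Python (a minimal sketch, not part of the course materials):

import os

# SGE sets CUDA_VISIBLE_DEVICES to the ids of the allocated GPUs, e.g. "0,3".
gpus = os.environ.get("CUDA_VISIBLE_DEVICES", "")
print("Allocated GPUs:", gpus.split(",") if gpus else "none")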

Then, you need a framework which can use the GPU, and you also need to set the
paths correctly. To use, for example, CUDA 9.0 and cuDNN 7.0 (which is a good
default as of Nov 2018), use

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cuda/9.0/lib64:/opt/cuda/9.0/cudnn/7.0/lib64

Spark

Spark is a framework for distributed computations.
Natively it works in Python, Scala and Java.

Apart from embarrassingly parallel computations, the Spark framework is suitable
for in-memory and/or iterative computations, which makes it usable even for
machine learning and complex data processing. Spark can run either locally using
one thread, locally using multiple threads, or in a distributed fashion.
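As a quick illustration of the programming model, here is a minimal PySpark word
count (a sketch only; the input and output paths are placeholders, and the master
is determined by how the job is launched):

from pyspark import SparkContext

sc = SparkContext()  # the master (local/distributed) comes from the launch command
counts = (sc.textFile("input.txt")             # placeholder input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("output")                # placeholder output directory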

Initialization

You need to add the following to your .profile (or other suitable place):

export PATH="/net/projects/spark/bin:/net/projects/spark/sge:$PATH"

Running

An interactive ipython shell can be started using

PYSPARK_DRIVER_PYTHON=ipython pyspark

Such a shell will use the current cluster if one is available, or start a local
cluster with as many threads as there are cores.
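Inside the shell, the SparkContext is available as sc, so a quick sanity check
might be (an illustrative one-liner):

sc.parallelize(range(1000)).map(lambda x: x * x).sum()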

To create a distributed cluster using SGE, you can run one of the following
commands:

Assignments

The data files are in UTF-8 and contain one article per line. The article name
is separated from the article content by a \t character.

unique_words

Implement a distributed SGE job to create a list of unique words used in the
articles. Convert the article texts to lowercase so that case is ignored.

Because the article data is not tokenized, use the provided
/net/data/npfl118/wiki/{cs,en}/tokenizer, which reads untokenized UTF-8 text from
standard input and produces tokenized UTF-8 text on standard output. It
preserves line breaks and separates tokens on each line by exactly one space.
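One possible shape of processing a single shard, as a hedged Python sketch (the
shard path is a placeholder, the English tokenizer is chosen arbitrarily, and
merging the per-shard outputs is left to the SGE job):

import subprocess

TOKENIZER = "/net/data/npfl118/wiki/en/tokenizer"

# Drop the article names, keep the lowercased contents.
with open("shard.txt", encoding="utf-8") as shard:   # placeholder shard path
    contents = "".join(line.split("\t", 1)[1].lower() for line in shard)

# The tokenizer keeps line breaks and separates tokens by single spaces.
tokenized = subprocess.run([TOKENIZER], input=contents,
                           capture_output=True, text=True).stdout

unique_words = set()
for line in tokenized.splitlines():
    unique_words.update(word for word in line.split(" ") if word)

for word in sorted(unique_words):
    print(word)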

inverted_index

In a distributed way, compute an inverted index – for every lemma from the
articles, compute an ascending sequence of (article id, ascending positions of
occurrences as word indices) pairs. In order to do so, number the articles using
consecutive integers and also produce a list of articles representing this
mapping (the article on line i is the article with id i; you can use the example
articles.sh).

The output should be a file with the list of articles ordered by article id,
and a file with one lemma per line in this format:

Both the article_ids and the occurrence indices should be in ascending order.

To generate the lemmas, use the provided
/net/data/npfl118/wiki/{cs,en}/lemmatizer, which again reads untokenized UTF-8
text and outputs space-separated lemmas, preserving line breaks.

gpu_determinant

Install the CPU and GPU versions of TensorFlow in respective virtual environments;
the CPU version is distributed as the tensorflow pip package and the GPU version
as tensorflow-gpu.

Also, make sure you have added CUDA and cuDNN to your paths using
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cuda/9.0/lib64:/opt/cuda/9.0/cudnn/7.0/lib64.

Then, use /net/data/npfl118/assignments/gpu_determinant.py to measure how long
it takes to compute the determinant of a matrix, both with the CPU and the GPU
version. The given script measures the required time for all given matrix
dimensions.

For the CPU version, use dimensions up to 5000 with step 100.

For the GPU version, use dimensions up to 20000 with step 1000.

Finally, estimate the speedup of using GPU instead of CPU for this task.
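The core of such a measurement might look roughly like this TensorFlow 1.x sketch
(the dimension is illustrative and the provided script may differ):

import time
import numpy as np
import tensorflow as tf

dim = 1000  # illustrative matrix dimension
matrix = tf.placeholder(tf.float32, [dim, dim])
determinant = tf.matrix_determinant(matrix)

with tf.Session() as session:
    data = np.random.rand(dim, dim).astype(np.float32)
    start = time.time()
    session.run(determinant, {matrix: data})
    print("dim", dim, "took", time.time() - start, "seconds")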

spark_lemmas

Template: /net/data/npfl118/assignments/spark_lemmas.py

Using the provided /net/data/npfl118/wiki/{cs,en}/lemmatizer, generate a list of
the 100 most frequent lemmas in the Czech and English wikis on standard output.

To utilize the lemmatizer, use rdd.pipe. However, you need to use Python 3
by setting PYSPARK_PYTHON=python3.
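A hedged sketch of the core computation (not the template itself; the input path
is a placeholder and the English lemmatizer is chosen arbitrarily):

from pyspark import SparkContext

sc = SparkContext()
counts = (sc.textFile("input.txt")                        # placeholder path
            .map(lambda line: line.split("\t", 1)[1])     # drop article names
            .pipe("/net/data/npfl118/wiki/en/lemmatizer")
            .flatMap(lambda line: line.split(" "))
            .map(lambda lemma: (lemma, 1))
            .reduceByKey(lambda a, b: a + b))
for lemma, count in counts.takeOrdered(100, key=lambda pair: -pair[1]):
    print(lemma, count)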

spark_anagrams

Template: /net/data/npfl118/assignments/spark_anagrams.py

Two words are anagrams if one is a character permutation of the other
(ignoring case).

For a given wiki language, find all anagram classes that contain at least A
words (a parameter of the script). Output each anagram class (unique words with
the same character permutation) on a separate line.

Use the /net/data/npfl118/wiki/{cs,en}/tokenizer to tokenize the input,
again using rdd.pipe and PYSPARK_PYTHON=python3.
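One possible sketch of the pipeline (placeholders as before; the class-size
threshold A is given a sample value):

from pyspark import SparkContext

sc = SparkContext()
min_words = 5  # sample value of the script parameter A

classes = (sc.textFile("input.txt")                       # placeholder path
             .map(lambda line: line.split("\t", 1)[1])    # drop article names
             .pipe("/net/data/npfl118/wiki/en/tokenizer")
             .flatMap(lambda line: line.split(" "))
             .map(lambda word: word.lower())
             .distinct()
             .map(lambda word: ("".join(sorted(word)), word))
             .groupByKey().mapValues(list)
             .filter(lambda pair: len(pair[1]) >= min_words))
for _, words in classes.collect():
    print(" ".join(words))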

spark_inverted_index

Template: /net/data/npfl118/assignments/spark_inverted_index.py

Compute the inverted index in the format described in the inverted_index
assignment.
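A rough sketch of one approach (placeholder paths; it assumes the lemmatizer
preserves one article per line, so zipWithIndex can serve as the article
numbering):

from pyspark import SparkContext

sc = SparkContext()
lemmatized = (sc.textFile("input.txt")                    # placeholder path
                .map(lambda line: line.split("\t", 1)[1]) # drop article names
                .pipe("/net/data/npfl118/wiki/en/lemmatizer")
                .zipWithIndex())                          # (lemma_line, article_id)

index = (lemmatized
         .flatMap(lambda pair: [((lemma, pair[1]), position)
                                for position, lemma in enumerate(pair[0].split(" "))])
         .groupByKey().mapValues(sorted)                  # ascending positions
         .map(lambda pair: (pair[0][0], (pair[0][1], pair[1])))
         .groupByKey().mapValues(sorted))                 # ascending (article_id, positions)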

Materials

Introduction to ÚFAL
at ÚFAL wiki (you will need an ÚFAL wiki account to access that site; each
ÚFAL PhD student is entitled to get a wiki account).