The aim of this two-part blog post is to show some useful, yet not very
complicated, features of the Pandas Python library that are not found in most
(numeric-oriented) tutorials.

We will illustrate these techniques with Geonames data: extract useful data
from the Geonames dump, transform it, and load it into another file. There is
no numeric computation involved here, nor statistics. We will show that Pandas
can be used in a wide range of cases beyond numerical analysis.

While the first part was an introduction to Geonames data, this second part
contains the real work with Pandas. Please read part 1 if you are not
familiar with Geonames data.

The goal is to read Geonames data, from allCountries.txt and
alternateNames.txt files, combine them, and produce a new CSV file with the
following columns:

gid, the Geonames id,
frname, French name when available,
gname, Geonames main name,
fclass, feature class,
fcode, feature code,
parent_gid, Geonames id of the parent location,
lat, latitude,
lng, longitude.

Also, we don't want to use too much RAM during the process: 1 GB at most.

You may think that producing this new CSV file is a low-level objective, not
very interesting. But it can be a step in a larger process. For example, one
can build a local Geonames database that can then be used by tools like
Elasticsearch or Solr to provide easy searching and auto-completion in French.

Another interesting feature is provided by the parent_gid column. One can use
this column to build a tree of locations, or a SKOS thesaurus of concepts.

There is only one place with such properties: Geonames id 3,013,767, namely
Département de la Haute-Garonne. Thus we must find a way to derive the
Geonames id from the feature_code, country_code and adminX_code
columns. Pandas will make this easy for us.

Let's get to work. Of course, we must first import the Pandas library to
make it available. We also import the csv module because we will need it to
perform basic operations on CSV files.
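A minimal setup (we also import NumPy, whose types we will use for the dtype
parameters later):

import csv

import numpy as np
import pandas as pd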

And indeed, that's a big file. To save memory we won't load the whole file.
Recall from the previous part that the alternateNames.txt file provides the
following columns, in order.

alternateNameId : the id of this alternate name, int
geonameid : geonameId referring to id in table 'geoname', int
isolanguage : iso 639 language code 2- or 3-characters; (...)
alternate name : alternate name or name variant, varchar(400)
isPreferredName : '1', if this alternate name is an official/preferred name
isShortName : '1', if this is a short name like 'California' for 'State of California'
isColloquial : '1', if this alternate name is a colloquial or slang term
isHistoric : '1', if this alternate name is historic and was used in the past

For our purpose we are only interested in the columns geonameid (so that we
can find the corresponding place in the allCountries.txt file), isolanguage
(so that we can keep only French names), alternate name (of course), and
isPreferredName (because we want to keep preferred names when possible).

Another way to save memory is to filter the file before loading it. Indeed,
it is better practice to load a smaller dataset (filter before loading) than
to load a big one and filter it afterwards. It's important to keep this in
mind when working with large datasets. So in our case, it is cleaner (but
slower) to prepare a smaller CSV file beforehand, keeping only French names.
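A sketch of this pre-filtering with the csv module (the output file name
frenchNames.txt is our choice; isolanguage is the third column of
alternateNames.txt):

with open('alternateNames.txt') as inf, \
        open('frenchNames.txt', 'w') as outf:
    reader = csv.reader(inf, delimiter='\t', quoting=csv.QUOTE_NONE)
    writer = csv.writer(outf, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if row[2] == 'fr':  # keep only French names
            writer.writerow(row)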

The read_csv function is quite complex and it takes some time to use it
correctly. In our case, the first few parameters are self-explanatory:
encoding for the file encoding, sep for the CSV separator, quoting
for the quoting protocol (here there is none), and header for lines to be
considered as label lines (here there are none).

usecols tells pandas to load only the specified columns. Beware that
indices start at 0, so column 1 is the second column in the file (geonameid
in this case).

index_col says to use one of the columns as a row index (instead of
creating a new index from scratch). Note that the number for index_col is
relative to usecols. In other words, 0 means the first column of
usecols, not the first column of the file.

names gives labels for columns (instead of using integers from 0). Hence we
can extract the last column with frnames['pref'] (instead of
frnames[3]). Please note that this parameter is not compatible with
header=0, for example (in that case, the first line is used to label
columns).

The dtype parameter is interesting. It allows you to specify one type per
column. When possible, prefer NumPy types to save memory (e.g.
np.uint32, np.bool_).

Since we want boolean values for the pref column, we can tell Pandas to
convert '1' strings to True and empty strings ('') to False.
That is the point of the true_values and false_values parameters.

But, by default, Pandas detects empty strings and assigns them np.nan
(NaN means Not a Number). To prevent this behavior, setting na_filter
to False leaves empty strings as empty strings. Thus, empty strings in the
pref column will be converted to False (thanks to the false_values
parameter and to the 'bool_' data type).

Finally, skipinitialspace tells Pandas to left-strip strings to remove
leading spaces. Without it, a comma-separated line like a, b would give
values 'a' and ' b' (note the leading space before b).

And lastly, error_bad_lines set to False makes Pandas ignore lines it cannot
parse (e.g. wrong number of columns) instead of raising an exception, which
is the default behavior.
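Putting all these parameters together, the loading call might look like the
following sketch (frenchNames.txt comes from the pre-filtering step above;
the column labels are our choice):

frnames = pd.read_csv(
    'frenchNames.txt',
    encoding='utf-8',
    sep='\t',
    quoting=csv.QUOTE_NONE,
    header=None,
    usecols=[1, 2, 3, 4],   # geonameid, isolanguage, alternate name, pref
    index_col=0,            # index rows by geonameid
    names=['gid', 'lang', 'frname', 'pref'],
    dtype={'lang': str, 'frname': str, 'pref': np.bool_},
    true_values=['1'],
    false_values=[''],
    na_filter=False,
    skipinitialspace=True,
    error_bad_lines=False,
)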

There are many more parameters to this function, which is much more powerful
than the simple reader object from the csv module. Please refer to the
documentation at
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html for
a full list of options. For example, the header parameter can accept a list
of integers, like [1, 3, 4], saying that the lines at positions 1, 3 and 4 are
label lines.

There is one last thing to do with the frnames dataframe: make its row
indices unique. Indeed, we can see from the previous example that there are
two entries for the Haute-Garonne department. We need to keep only one and,
when possible, the preferred form (when the pref column is True).

This is not always possible: we can see at the beginning of the dataframe (use
the head method) that there is no preferred French name for Rwanda.

In such a case, we will take one of the two lines at random. Maybe a clever
rule to decide which one to keep would be useful (like involving other
columns), but for the purpose of this tutorial it does not matter which one is
kept.

Back to our problem: how do we make indices unique? Pandas provides the
duplicated method on Index objects. This method returns a boolean NumPy
1d-array (a vector), the size of which is the number of entries. So, since our
dataframe has 55,684 entries, the length of the returned vector is 55,684.
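In our case:

dups_idx = frnames.index.duplicated()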

The meaning is simple: when you encounter True, it means that the index at
this position is a duplicate of a previously encountered index. For example, we
can see that the third value in dups_idx is True. And indeed, the third
line of frnames has an index (49,518) which is a duplicate of the second
line.

So duplicated is meant to mark duplicated indices as True and to keep
only the first one (there is an optional parameter to change this: read the
docs). How do we make sure that the first entry is the preferred one? By
sorting the dataframe, of course! We can sort a table by a column (or by a
list of columns) using the sort_values method. We give ascending=False
because True > False (that's a philosophical question!), and inplace=True to
sort the dataframe in place, so that we do not create a copy (still thinking
about memory usage).

We end up with 49,047 French names. Notice the ~ in the filter expression?
That's because we want to keep the first entry for each duplicated index, and
the first entry is marked False in the vector returned by duplicated.
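Putting it together, the whole deduplication step might look like this sketch
(sorting must happen before the duplicate detection):

frnames.sort_values(by='pref', ascending=False, inplace=True)
frnames = frnames[~frnames.index.duplicated()]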

First we need the administrative part of the allCountries.txt file, that
is, all places with the A feature class.

Of course, we could load the whole file into Pandas and then filter to keep
only A-class entries, but now you know that this is memory intensive (and
this file is much bigger than alternateNames.txt). So we'll be more clever
and first prepare a smaller file.
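Using the same csv-based approach as before (the output name admin.txt is our
choice; the feature class is the seventh column of allCountries.txt):

with open('allCountries.txt') as inf, open('admin.txt', 'w') as outf:
    reader = csv.reader(inf, delimiter='\t', quoting=csv.QUOTE_NONE)
    writer = csv.writer(outf, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if row[6] == 'A':  # keep only administrative places
            writer.writerow(row)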

Now we load this file into pandas. Remembering our goal, the resulting
dataframe will only be used to compute gid from the code columns. So all
columns we need are geonameid, feature_code, country_code,
admin1_code, admin2_code, admin3_code, and admin4_code.

What about data types? All these columns are strings, except for geonameid
which is an integer. Since it would be painful to type <colname>: str for
each key of the dtype parameter dictionary, let's use the dict.fromkeys
constructor instead.
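A sketch of the whole load (column positions follow the Geonames
documentation; the labels fcode and code0 to code4 are the ones used in the
rest of this post):

dtype = dict.fromkeys(['fcode', 'code0', 'code1', 'code2',
                       'code3', 'code4'], str)
dtype['gid'] = np.uint32
admgids = pd.read_csv(
    'admin.txt',
    encoding='utf-8',
    sep='\t',
    quoting=csv.QUOTE_NONE,
    header=None,
    usecols=[0, 7, 8, 10, 11, 12, 13],
    names=['gid', 'fcode', 'code0', 'code1',
           'code2', 'code3', 'code4'],
    dtype=dtype,
    na_values=[''],
    keep_default_na=False,
    skipinitialspace=True,
)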

We recognize most parameters in this instruction. Notice that we didn't use
index_col=0: gid is now a normal column, not an index, and Pandas will
automatically generate a row index with integers starting at 0.

Two new parameters are na_values and keep_default_na. The first one
gives Pandas additional strings to be considered as NaN (Not a Number). The
astute reader will say that empty strings ('') are already considered by
Pandas as NaN, and they would be right.

But here comes the second parameter which, if set to False, tells Pandas to
forget about its default list of strings recognized as NaN. That default
list contains a bunch of strings like 'N/A' or '#NA' or, and this is
interesting, simply 'NA'. But 'NA' is used in allCountries.txt as the
country code for Namibia. If we kept the default list, this whole country
would be ignored. So the combination of these two parameters tells Pandas to
consider empty strings, and only empty strings, as NaN.

The replace method has a lot of different signatures; refer to the Pandas
documentation for a comprehensive description. Here, with the dictionary, we
are saying to look only in column fcode and, in this column, to replace
strings matching the regular expression with the given value. Since we are
using regular expressions, the regex parameter must be set to True.
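For example, one way to collapse all the PCL<X> variants into a single PCL
value (the exact pattern is our assumption):

# Replace PCLI, PCLD, PCLIX, ... by plain PCL in the fcode column
admgids.replace({'fcode': {r'PCL[A-Z]{0,2}': 'PCL'}},
                regex=True, inplace=True)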

Remember the goal: we want to be able to get the gid from the other
columns. Well, dear reader, you'll be happy to know that Pandas allows an
index to be composite, that is, composed of multiple columns: what Pandas
calls a MultiIndex.

To put it simply, a multi-index is useful when you have hierarchical indices.
Consider for example the following table.

lvl1    lvl2    N    S
A       AA      11   1x1
A       AB      12   1x2
B       BA      21   2x1
B       BB      22   2x2
B       BC      33   3x3

If we were to load such a table in a Pandas dataframe df (exercise: do it),
we would be able to use it as follows.
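For instance (building the dataframe by hand, with the values from the table
above):

df = pd.DataFrame(
    {'N': [11, 12, 21, 22, 33],
     'S': ['1x1', '1x2', '2x1', '2x2', '3x3']},
    index=pd.MultiIndex.from_tuples(
        [('A', 'AA'), ('A', 'AB'), ('B', 'BA'), ('B', 'BB'), ('B', 'BC')],
        names=['lvl1', 'lvl2']))
df.loc[('A', 'AB'), 'N']   # 12
df.loc[('B', 'BA')]        # the N and S values of row (B, BA)
df.loc['A']                # sub-dataframe with lvl1 == 'A'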

So, basically, we can query the multi-index using tuples (and we can omit
tuples if column indexing is not involved). But most importantly, we can
query a multi-index partially: df.loc['A'] returns a sub-dataframe with one
level of index gone.

Back to our subject, we clearly have hierarchical information in our code
columns: the country code, then the admin1 code, then the admin2 code, and so
on. Moreover, we can put the feature code at the top. But how do we do that?

It couldn't be simpler. The main issue is to find the correct method:
set_index.
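Something along these lines (sorting the index afterwards makes partial
lookups work efficiently):

admgids.set_index(['fcode', 'code0', 'code1', 'code2', 'code3', 'code4'],
                  inplace=True)
admgids.sort_index(inplace=True)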

Time has finally come to load the main file: allCountries.txt. On the one
hand, we will use the frnames dataframe to get the French name for each
entry and populate the frname column; on the other hand, we will use the
admgids dataframe to compute the parent gid for each line.

On my computer, loading the whole allCountries.txt at once takes 5.5 GB of
memory, clearly too much! And in this case, there is no trick to reduce the
size of the file first: we want all the data.

Pandas can help us with the chunksize parameter to the read_csv
function. It allows us to read the file chunk by chunk (read_csv then
returns an iterator). The idea is to first create an empty CSV file for our
final data, then read each chunk, perform data manipulation on it (that is,
add the frname and parent_gid columns), and append the result to the file.

So we load the data into Pandas the usual way. The only difference is that we
add a new parameter, chunksize, with value 1,000,000. You can choose a
smaller or larger number of rows depending on your memory limit.
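The skeleton of the chunked load might look like this sketch (column
positions per the Geonames documentation; the per-chunk work is what the rest
of this section describes):

reader = pd.read_csv(
    'allCountries.txt', encoding='utf-8', sep='\t',
    quoting=csv.QUOTE_NONE, header=None, index_col=0,
    usecols=[0, 1, 4, 5, 6, 7, 8, 10, 11, 12, 13],
    names=['gid', 'gname', 'lat', 'lng', 'fclass', 'fcode',
           'code0', 'code1', 'code2', 'code3', 'code4'],
    na_values=[''], keep_default_na=False, skipinitialspace=True,
    chunksize=1000000)
for chunk in reader:
    ...  # add frname and parent_gid, reorder, append to the output file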

merge expects first the two dataframes to be joined. The how parameter
tells which type of JOIN to perform (it can be left, right,
inner, ...). Here we want to keep all lines of chunk, which is the first
parameter, so it is 'left' (if chunk were the second parameter, it would
have been 'right').

left_index=True and right_index=True tell Pandas that the pivot column
for each table is its index. Indeed, in our case the gid index is used
in both tables to compute the merge. If in one table, for example the right
one, the pivot column is not the index, one can set right_index=False and
add the parameter right_on='<column_name>' (the same parameter exists for
the left table).

Additionally, if there are name clashes (same column name in both tables),
one can use the suffixes parameter, for example suffixes=('_first',
'_second').
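In our case, the merge might look like this sketch (both chunk and frnames
are indexed by gid):

# Left join: keep every row of chunk, add frname when available
chunk = pd.merge(chunk, frnames[['frname']], how='left',
                 left_index=True, right_index=True)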

The parent_gid column is trickier. We'll delegate the computation of the gid
from administrative codes to a separate function. But first, let's define two
reverse dictionaries linking the fcode column to a level number.
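Reconstructed from how they are used in the functions below:

level_fcode = {0: 'PCL', 1: 'ADM1', 2: 'ADM2', 3: 'ADM3', 4: 'ADM4'}
fcode_level = {v: k for k, v in level_fcode.items()}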

In this function, we first look at the row's fcode to get the row's
administrative level. If the fcode is ADM<X>, we get the level directly.
If it is PCL<X>, we get level 0 from PCL. Otherwise we set it to level 5,
to say that it is below level 4. The parent's level is then the found level
minus one. And if it is -1, we know that we were on a country and there is
no parent.

Then we compute all available administrative codes, removing codes with NaN
values from the end.

With the level and the codes, we can search for the parent's gid using the
previous function.

Now, how do we use this function? No need for a for loop: Pandas gives us
the apply method.
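Applied row by row over the chunk:

# axis=1 means parent_geonameid receives one row at a time
parent_gids = chunk.apply(parent_geonameid, axis=1)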

None values have been converted to NaN, and thus the integer values have been
converted to floats (you cannot have NaN within an integer column), which is
not what we want. As a compromise, we are going to convert the column to
str and suppress the decimal part.
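One way to do this, as a sketch:

# 3013767.0 -> '3013767'; NaN -> empty string
parent_gids = parent_gids.map(
    lambda gid: '' if pd.isnull(gid) else str(int(gid)))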

We also add a label to the column. That's the name of the column in our
future dataframe.

parent_gids = parent_gids.rename('parent_gid')

And we can now append this new column to our chunk dataframe.

chunk = pd.concat([chunk, parent_gids], axis=1)

We're almost there. Before we can save the chunk to a CSV file, we must
reorganize its columns into the expected order. For now, the frname and
parent_gid columns have been appended at the end of the dataframe.
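Reordering and appending might look like this (gid is the index and is
written first by to_csv; header lines, if any, would be written once before
the loop):

chunk = chunk[['frname', 'gname', 'fclass', 'fcode',
               'parent_gid', 'lat', 'lng']]
chunk.to_csv('final.txt', mode='a', encoding='utf-8', sep='\t',
             header=False, quoting=csv.QUOTE_NONE)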

Currently, creating the new CSV file from Geonames takes hours, and this is
not acceptable. There are multiple ways to make things go faster. One of the
most significant changes is to cache the results of the parent_geonameid
function. Indeed, many places in Geonames have the same parent; computing the
parent gid once and caching it sounds like a good idea.

If you are using Python 3, you can simply use the @functools.lru_cache
decorator on the parent_geonameid function. But let us try to define our
own custom cache.
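The cache itself is just a dictionary, initialized before processing the
chunks:

# Maps (level, code0, ..., codeN) tuples to parent gids
gid_cache = {}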

The only difference from the previous version is the use of a gid_cache
dictionary. Keys for this dictionary are tuples (<level>, <code0>,
[<code1>, ..., <code4>]) (stored in the code_tuple variable), and the
corresponding value is the parent gid for this combination of level and
codes. The returned parent_gid is first looked up in this dictionary, in
case of a previously cached result; otherwise it is computed with the
geonameid_from_codes function like before, and the result is cached.

We have defined a function computing a Geoname id from administrative codes.

def geonameid_from_codes(level, **codes):
    """Return the Geoname id of the *administrative* place with the
    given information.

    Return ``None`` if there is no match.

    ``level`` is an integer from 0 to 4. 0 means we are looking for a
    political entity (``PCL<X>`` feature code), 1 for a first-level
    administrative division (``ADM1``), and so on until fourth-level
    (``ADM4``).

    Then user must provide at least one of the ``code0``, ...,
    ``code4`` keyword parameters, depending on ``level``.

    Examples::

        >>> geonameid_from_codes(level=0, code0='RE')
        935317
        >>> geonameid_from_codes(level=3, code0='RE', code1='RE',
        ...                      code2='974', code3='9742')
        935213
        >>> geonameid_from_codes(level=0, code0='AB')  # None
        >>>
    """
    try:
        idx = tuple(codes['code{0}'.format(i)] for i in range(level + 1))
    except KeyError:
        raise ValueError('Not enough codeX parameters for level {0}'.format(level))
    idx = (level_fcode[level],) + idx
    try:
        return admgids.loc[idx, 'gid'].values[0]
    except (KeyError, TypeError):
        return None

We have defined a function computing the parent's gid of a Pandas row.

# string_types comes from the six module (on Python 3 only, str would do)
def parent_geonameid(row):
    """Return the Geoname id of the parent of the given Pandas row.

    Return ``None`` if we can't find the parent's gid.
    """
    # Get the parent's administrative level (PCL or ADM1, ..., ADM4)
    level = fcode_level.get(row.fcode)
    if (level is None and isinstance(row.fcode, string_types)
            and len(row.fcode) >= 3):
        level = fcode_level.get(row.fcode[:3], 5)
    if level is None:
        level = 5
    level -= 1
    if level < 0:  # We were on a country, no parent
        return None
    # Compute available codes
    l = list(range(5))
    while l and pd.isnull(row['code{0}'.format(l[-1])]):
        l.pop()  # Remove NaN values backwards from code4
    codes = {}
    code_tuple = [level]
    for i in l:
        if i > level:
            break
        code_label = 'code{0}'.format(i)
        code = row[code_label]
        codes[code_label] = code
        code_tuple.append(code)
    code_tuple = tuple(code_tuple)
    try:
        parent_gid = (gid_cache.get(code_tuple)
                      or geonameid_from_codes(level, **codes))
    except ValueError:
        parent_gid = None
    # Put the value in the cache if not already there, to speed up future lookups
    if code_tuple not in gid_cache:
        gid_cache[code_tuple] = parent_gid
    return parent_gid

And finally, we have loaded the file allCountries.txt into Pandas using
chunks of 1,000,000 rows to save memory. For each chunk, we have merged it
with the frnames table to add the frname column, and we have applied the
parent_geonameid function to add the parent_gid column. We then
reordered the columns and appended the chunk to the final CSV file.

This final part is the longest, because the parent_geonameid function takes
some time on each chunk to compute all the parent gids. But at the end of the
process we'll proudly see a final.txt file with data the way we want it,
and without using too much memory... High five!

Regarding Geonames, to be honest, we've only scratched the surface of its
complexity. There's so much more to be done.

If you look at the file we've just produced, you'll see plenty of empty values
in the parent_gid column. Maybe our method to get the Geonames id of the
parent needs to be improved. Maybe all those orphan places should be moved
inside their countries.

Another problem lies within the Geonames data itself. France has overseas
territories, for example Reunion Island, Geonames id 935,317. This place has
feature code PCLD, which means "dependent political entity". And indeed,
Reunion Island is not a country and should not appear at the top level of the
tree, at the same level as France. So some work should be done here to have
Reunion Island linked to France in some way, maybe using the hitherto ignored
cc2 column (for "alternate country codes").

Yet another improvement, an easier one, is to add another parent level
for continents. For this, one can use the countryInfo.txt file, downloadable
from the same page.

Regarding speed this time, there is also room for improvement. First, the
code itself might be better designed to avoid some tests and for loops.
Another possibility is to use multiprocessing, since each chunk of
allCountries.txt is independent. Processes can put their finished chunks on a
queue that a writer process reads to write the data to the output file.
Another way to go is Cython (see
http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html).

The aim of this two-part blog post is to show some useful, yet not very
complicated, features of the Pandas Python library that are not found in most
(numeric-oriented) tutorials.

We will illustrate these techniques with Geonames data: extract useful data
from the Geonames dump, transform it, and load it into another file. There is
no numeric computation involved here, nor statistics. We will show that Pandas
can be used in a wide range of cases beyond numerical analysis.

This first part is an introduction to Geonames data. The real work with Pandas
will be shown in the second part. You can skip this part and go directly to
part 2 if you are already familiar with Geonames.

Geonames data can be downloaded from http://download.geonames.org/export/dump/.
The main file to download is allCountries.zip. Once extracted,
you'll get a CSV file named allCountries.txt which contains nearly all
Geonames data. In this file, values are separated by tabs.

What is the meaning of each column? There is no header line at the top of the
file... Go back to the web page from which you downloaded the file. Below the
download links, you'll find some documentation. In particular, consider the
following excerpt.

So we see that entry 2,972,315 represents a place named Toulouse, whose
latitude is 43.60 and longitude is 1.44. The place is located in France (FR
country code), its population is estimated at 433,055, its elevation is 150 m,
and its timezone is the same as Paris's.

This is entry number 11,071,623, and we have latitude, longitude, population,
elevation and timezone as usual. The place's class is A and its code is
ADM1. Information about these codes can be found on the same page as before.

A : country, state, region, ...
ADM1 : first-order administrative division

This confirms that Occitanie is a first-order administrative division of
France.

Now look at the adminX_code part of the Occitanie line (X=1, 2, 3, 4).

76

Only column admin1_code is filled with value 76. Compare this with the same
part from the Toulouse line.

76 31 313 31555

Here all of the columns admin1_code, admin2_code, admin3_code and
admin4_code are given a value, respectively 76, 31, 313, and 31555.

Furthermore, we see that column admin1_code matches for Occitanie and
Toulouse (value 76). This lets us deduce that Toulouse is actually a city
located in Occitanie.

So, following the same logic, we can infer that Toulouse is also located in a
second-order administrative division of France (feature code ADM2) with
admin1_code of 76 and admin2_code equals to 31. Let's look for such a
line in allCountries.txt.

That's a surprise. That's because in France, an arrondissement is the
smallest administrative division above a city. So for Geonames, the city of
Toulouse is both an administrative place (feature class A with feature code
ADM4) and a populated place (feature class P with feature code
PPLA), so there are two different entries with the same name (beware that
this may be different for other countries).

And that's it! We have found the whole hierarchy of administrative divisions
above the city of Toulouse, thanks to the adminX_code columns: Occitanie,
Département de la Haute-Garonne, Arrondissement de Toulouse, Toulouse
(ADM4), and Toulouse (PPL).

That's it, really? What about the top-most administrative level, the country?
This may not be intuitive: feature codes for countries start with PCL (for
political entity).

$ grep -P '\tPCL' allCountries.txt | grep -P '\tFR\t'

Among the results, there's only one PCLI, which means independent
political entity.

In the previous example, "Republic of France" is the main name for Geonames.
But this is not the name in French ("République française"), and even the
name in French is not the most commonly used name, which is "France".

In the same way, "Département de la Haute-Garonne" is not the most commonly
used name for the department, which is just "Haute-Garonne".

The fourth column in allCountries.txt provides a comma-separated list of
alternate names for a place, in other languages and in other forms. But this is
not very useful because we can't decide which form in the list is in which
language.

For this, the Geonames project provides another file to download:
alternateNames.zip. Go back to the download page, download it, and extract
it. You'll get another tab-separated CSV file named
alternateNames.txt.

The table 'alternate names' :
-----------------------------
alternateNameId : the id of this alternate name, int
geonameid : geonameId referring to id in table 'geoname', int
isolanguage : iso 639 language code 2- or 3-characters; (...)
alternate name : alternate name or name variant, varchar(400)
isPreferredName : '1', if this alternate name is an official/preferred name
isShortName : '1', if this is a short name like 'California' for 'State of California'
isColloquial : '1', if this alternate name is a colloquial or slang term
isHistoric : '1', if this alternate name is historic and was used in the past

We can see that "Haute-Garonne" is a short version of "Département de la
Haute-Garonne" and is the preferred form in French. As an exercise, the reader
can confirm in the same way that "France" is the preferred shorter form for
"Republic of France" in French.

And that's it for our introductory journey into Geonames. You are now familiar
enough with this data to begin working with it using Pandas in Python. In fact,
what we have done until now, namely working with grep commands, is not very
useful... See you in part 2!

Recently, I've faced the problem of importing the European Union thesaurus, Eurovoc, into CubicWeb using the SKOS cube. Eurovoc doesn't follow the SKOS data model and I'll show here how I managed to adapt Eurovoc to fit into SKOS.

This article is in two parts:

this first part, where I introduce what a thesaurus is and what SKOS is,

the second part, where I show how to adapt Eurovoc to the SKOS data model.

A common need in our digital lives is to attach keywords to documents,
web pages, pictures, and so on, so that search is easier. For example,
you may want to add two keywords:

lily,

lilium

in a picture's metadata about this flower. If you have a large
collection of flower pictures, this will make your life easier when you
want to search for a particular species later on.

In this example, keywords are free: you can choose whatever keyword you
want, very general or very specific. For example you may just use the
keyword:

flower

if you don't care about species. You are also free to use lowercase or
uppercase letters, and to make typos...

On the other side, sometimes you have to select keywords from a list.
Such a constrained list is called a controlled vocabulary. For
instance, a very simple controlled vocabulary with only two keywords is
the one about a person's gender:

male (or man),

female (or woman).

But there are more complex examples: think about how a
library organizes books by themes: there are very general themes (eg.
Science), then more and more specific ones (eg.
Computer science -> Software -> Operating systems). There may also
be synonyms (eg. Computing for Computer science) or referrals
(eg. there may be a "see also" link between keywords Algebra and
Geometry). Such a controlled vocabulary where keywords are organized
in a tree structure, and with relations like synonym and referral, is
called a thesaurus.

For the sake of simplicity, in the following we will call thesaurus
any controlled vocabulary, even a simple one with two keywords like
male/female.

SKOS, from the World Wide
Web Consortium (W3C), is an ontology for the semantic web describing
thesauri. To make it simple, it is a common data model for thesauri that
can be used on the web. If you have a thesaurus and publish it on the
web using SKOS, then anyone can understand how your thesaurus is
organized.

SKOS is very versatile. You can use it to produce very simple thesauri
(like male/female) and very complex ones, with a tree of keywords,
even in multiple languages.

To cope with this complexity, the SKOS data model splits each keyword into
two entities: a concept and its labels. For example, the concept
of a male person has multiple labels: male and man in
English, homme and masculin in French. The concept of a lily
flower also has multiple labels: lily in English, lilium in
Latin, lys in French.

Among all the labels for a given concept, some can be preferred, while
others are alternative. There may be only one preferred label per
language. In the person's gender example, man may be the preferred
label in English and male an alternative one, while in French
homme would be the preferred label and masculin an alternative one.
In the flower example, lily (resp. lys) is the preferred label
in English (resp. French), and lilium is an alternative label in
Latin (there is no preferred label in Latin).

And of course, in SKOS, it is possible to say that a concept is broader than another one (just like topic Science is broader than topic Computer science).

So to summarize: in SKOS, a thesaurus is a tree of concepts, and each
concept has one or more labels, preferred or alternative. A thesaurus
is also called a concept scheme in SKOS.

Also, please note that the SKOS data model is slightly more complicated than
what we've shown here, but this will be sufficient for our purpose.

This is the second part of an article where I show how to import the Eurovoc thesaurus from the
European Union into an application using a plain SKOS data model. I've recently faced the problem of importing Eurovoc into CubicWeb using the SKOS cube, and the solution I chose is discussed here.

Eurovoc is the main thesaurus covering European Union business domains.
It is published and maintained by the EU commission. It is quite complex
and big, structured as a tree of keywords.

You can see Eurovoc keywords and browse the tree from the Eurovoc
homepage using the link Browse the
subject-oriented version.

For example, when publishing statistics about education in the EU, you
can tag the published data with the broadest keyword,
Education and communications. Or you can be more precise and use the
following narrower keywords, in increasing order of precision:
Education, Education policy, Education statistics.

The EU commission uses SKOS to publish its Eurovoc thesaurus, so it
should be straightforward to import Eurovoc into our own application.
But things are not that simple...

For some reason, Eurovoc uses a hierarchy of concept schemes. For
example, Education and communications is a sub-concept scheme of
Eurovoc (it is called a domain), and Education is a sub-concept
scheme of Education and communications (it is called a
micro-thesaurus). Education policy is (a label of) the first
concept in this hierarchy.

But with SKOS this is not possible: a concept scheme cannot be contained
into another concept scheme.

The ZIP archive contains only one XML file named eurovoc_skos.rdf.
Put it somewhere where you can find it easily.

To read this file easily, we will use the RDFLib Python library. This
library makes it really convenient to work with RDF data. It has only one
drawback: it is very slow. Reading the whole Eurovoc thesaurus with it
takes a very long time. Making the process faster is the first thing to
consider for later improvements.

Reading the Eurovoc thesaurus is as simple as creating an empty RDF
Graph and parsing the file. As said above, this takes a long long time
(from half an hour to two hours).
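In code (the graph variable name eurovoc is our choice):

from rdflib import Graph

eurovoc = Graph()
eurovoc.parse('eurovoc_skos.rdf', format='xml')  # this is the slow part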

We also define two helper functions, sketched below:

the first one, uriref(), will allow us to build RDFLib URIRef
objects from simple prefixed URIs like skos:prefLabel or
dcterms:title,

the second one, capitalized_eurovoc_domains(), is used to convert
Eurovoc domain names, which are all uppercase (e.g.
32 EDUCATION ET COMMUNICATION), to a string where only the first
letter is uppercase (e.g. 32 Education et communication).
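A sketch of what these helpers might look like (the prefix table and the
exact capitalization rule are our assumptions):

from rdflib import URIRef
from rdflib.namespace import DCTERMS, SKOS

# Assumed prefix table; extend it with the namespaces you actually use
PREFIXES = {
    'dcterms': str(DCTERMS),
    'skos': str(SKOS),
}

def uriref(prefixed):
    """Build a URIRef from a prefixed name like 'skos:prefLabel'."""
    prefix, _, name = prefixed.partition(':')
    return URIRef(PREFIXES[prefix] + name)

def capitalized_eurovoc_domains(domain):
    """Turn '32 EDUCATION ET COMMUNICATION' into
    '32 Education et communication' (assumed rule)."""
    number, _, words = domain.partition(' ')
    return u'{0} {1}'.format(number, words.capitalize())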

Now we dump this new graph to disk. We choose the Turtle format, as it is far
more readable than RDF/XML for humans, and slightly faster to parse for
machines. This file will contain plain SKOS data that can be directly
imported into any application able to read SKOS.
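Assuming the converted data lives in a graph called new_graph (the name is
hypothetical, as is the output file name):

new_graph.serialize('eurovoc.ttl', format='turtle')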