I spent more time understanding the dataset and exploring the geonames.org website. It is convenient for searching points and locations, and if you sign up for a user account you can add or update the data yourself. A good opportunity to help enlarge the dataset for the benefit of all.

So in the meantime I have added the following fields to my processing chain with Logstash, Elasticsearch and Kibana:

Administrative Division 1st to 4th Order

Continent

Continent Code

Here is an example taken from Kibana:

I wanted an automated solution, so I wrote a shell script that only expects the path and name of the geonames input file; the rest is done automatically. It performs the following steps (a sketch follows the list):

remove double quotes from the file. It looks as if there are some unbalanced quotes in the file, so I remove them altogether (using sed)

create two lookup CSV files for the Administrative Division 3rd and 4th order. The first two are already available as files, but these two have to be derived from the data file itself (using awk)

look up the continent code and name from the country code of each row of data

finally, send the data to Elasticsearch
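A condensed sketch of these steps. The file names, the CSV layout and the awk derivation are my assumptions, not the original script; the column positions follow the geonames readme:

```bash
#!/bin/bash
# Sketch only: file names and derivation details are placeholders.
INPUT="$1"
CLEANED="/tmp/geonames_clean.txt"

# 1. Strip all double quotes; the file contains unbalanced ones.
sed 's/"//g' "$INPUT" > "$CLEANED"

# 2. Derive lookup CSVs for Administrative Division 3rd and 4th order.
#    Rows with feature code ADM3/ADM4 (column 8) carry the division names;
#    the key combines the country code and admin codes (columns 9, 11-14).
awk -F'\t' '$8 == "ADM3" { print $9"."$11"."$12"."$13","$2 }' "$CLEANED" > /tmp/admin3_lookup.csv
awk -F'\t' '$8 == "ADM4" { print $9"."$11"."$12"."$13"."$14","$2 }' "$CLEANED" > /tmp/admin4_lookup.csv

# 3. Hand everything to Logstash; the continent lookup and the
#    Elasticsearch output happen inside the pipeline.
export GEONAMES_FILE="$CLEANED"
export ADMIN3_LOOKUP="/tmp/admin3_lookup.csv"
export ADMIN4_LOOKUP="/tmp/admin4_lookup.csv"
logstash -f geonames.conf
```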

With this being automated, I can quickly reload the data into Elasticsearch and make it available for searching and visualization in Kibana. And since I am using the geonameid, which uniquely identifies each row in the dataset, I use Elasticsearch's capability to upsert (update or insert) the data.
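In the Logstash elasticsearch output, the upsert behaviour boils down to three settings. A minimal sketch, assuming a local Elasticsearch and the index name used later in this series:

```
output {
  elasticsearch {
    hosts         => ["http://localhost:9200"]
    index         => "geonames_01"
    document_id   => "%{geonameid}"  # stable key: reloading updates rather than duplicates
    action        => "update"
    doc_as_upsert => true            # insert the document if the id does not exist yet
  }
}
```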

The script uses environment variables that are passed to Logstash, so no paths or file names are hardcoded in the Logstash pipeline file.
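Logstash expands ${VAR} references in the pipeline file at startup, so the file input can stay generic. The variable name GEONAMES_FILE is my placeholder, exported by the script sketched above:

```
input {
  file {
    path           => "${GEONAMES_FILE}"  # resolved from the environment at startup
    start_position => "beginning"
    sincedb_path   => "/dev/null"         # forget read positions so reloads re-read the file
  }
}
```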

In Part 1 we saw how to use Logstash to read a CSV file and prepare the data for sending it to Elasticsearch. The data is in JSON format, which is what Elasticsearch expects.

Now we will send JSON-formatted data and see how to deal with the schema. If you send data to an Elasticsearch index, the first record that arrives is used to determine the schema: Elasticsearch does a dynamic mapping. But this might not always work 100%. The alternative is to define a schema/mapping manually. We will have a look at both.

The geonames database I use has about 17 million rows. That takes a while to ingest, so I created a separate file which only contains 10,000 records. I will use this with Logstash for this example. Also, I am running this on Linux, so if you use Windows things might work a little differently in terms of commands, paths, etc.
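Creating such a sample file is a one-liner (file names are placeholders):

```bash
# keep the first 10,000 rows of the full geonames dump
head -n 10000 allCountries.txt > allCountries_10k.txt
```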

Before we continue, make sure you have installed Elasticsearch and Kibana. Kibana is the frontend for creating visualizations and dashboards. Download the relevant packages for your operating system from the Elastic website, install them and then run both.

I named my Logstash file geonames_1.yml and placed it in the Logstash config folder. The complete code is listed at the bottom of part 1 of this post. Adjust the path as appropriate for your system. I also changed the file to use the index "geonames_01" in Elasticsearch. To run it with Logstash do this:
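A typical invocation, assuming you start it from the Logstash installation directory:

```bash
# path to bin/logstash may differ depending on your installation
bin/logstash -f config/geonames_1.yml
```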

Logstash takes a while to start up, and if everything runs well, the data is sent to Elasticsearch. Start your web browser and go to http://localhost:5601 if you have a local installation of Kibana. Once started, on the lower left side click on the "Management" link, then under "Elasticsearch" click on "Index Management". You will get a list of available indexes. The index we just created is also there:

You can see that 10,000 documents have been created in the index. Click on the name of the index and on the right side you get a popup with a summary. Click on the tab labeled "Mapping". It will show you a JSON representation of the dynamic mapping that Elasticsearch created when we imported the data.
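For reference, the same information is available through the REST API; from the Kibana Dev Tools console (used further down) the request is:

```
GET geonames_01/_mapping
```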

If you read the first part, then you'll remember that we made some type conversions. The CSV data all comes in as strings, and we converted e.g. "elevation" and "population" to number fields. All fields have been properly sent to Elasticsearch in their appropriate types. Even the "modification_date" was detected as a date field.

In Logstash we converted the latitude and longitude position to float values, and that is how they appear in the mappings shown above. To make use of the geo-indexing features of Elasticsearch, the position field needs to be defined as type "geo_point" in Elasticsearch.
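For reference, the conversions in part 1 used Logstash's mutate filter. A sketch along those lines, assuming the position object is assembled from the latitude and longitude columns (field names are illustrative, not the exact listing):

```
filter {
  mutate {
    convert => {
      "elevation"       => "integer"
      "population"      => "integer"
      "[position][lat]" => "float"
      "[position][lon]" => "float"
    }
  }
}
```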

Once the index mapping has been defined for a field, it cannot be changed. So I will go ahead and delete the "geonames_01" index, and we will manually add a different schema which properly maps the position field.

The popup where you saw the definition of the mapping for our index has a button labeled "Manage". Click on it and then select "Delete Index" to delete it.
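If you prefer the API over the UI, the same delete can be issued from the Dev Tools console:

```
DELETE geonames_01
```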

Click on the "Dev Tools" link on the left side of the browser window. Per default you will be on the "Console" tab. In the left part of the console paste the code from below:

The code tells Elasticsearch to create a new index with the given mapping. I have only changed the "position" field as discussed above to the following value:
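```
"position": { "type": "geo_point" }
```

With this in place, a document whose position value looks like { "lat": 48.86, "lon": 2.35 } is indexed as a geo point.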

Elasticsearch will understand that a geo_point field carries a latitude and a longitude value and will index it appropriately.

Still in the "Console" tab, click on the green icon to execute the code.

Note: if you go back to the management page you won't find the index there yet. As long as an index contains no documents, it is not shown.

So let's go back to Logstash and run the config file once again.

After Logstash has started, it takes a short time until you find the data again in Kibana as discussed above. If you go to the management view and look at the mapping defined for the index, you will see that the position with latitude and longitude is now properly mapped to the "geo_point" type.

That's all for the second part. The third part will concentrate on creating visualizations and a dashboard in Kibana on top of the data we imported.

I have been learning Elasticsearch over the last few weeks. Actually, the whole Elastic Stack: Elasticsearch, Logstash and Kibana.

The initial learning curve was gentle: install Elasticsearch and Kibana and simply add some data using the Kibana dev console. An easy start. I then used Apache NiFi to read a file and send it to the Elasticsearch server. Also very easy. I did not touch Logstash at that time.

With the little data I had - a few thousand records - I was learning how to use Kibana for creating visualizations and dashboards. While doing that, I ran into two questions: how to handle dates and how to handle geo locations correctly? I went back and forth between the documentation and searching the Internet. It takes some time to understand how Elastic works with dates and times, and how one can (or should) use them in Kibana. How does one work with time zones and time in general? Or geo locations?

So after some time I found out the following things:

think thoroughly about the Elasticsearch schema (data types) and transform your data accordingly. Spend a lot of time on this: the better the schema, the better the analytics in Kibana.

divide your data into slices. Analytics and queries can be done over multiple indexes, and on the other hand individual slices can be deleted easily. A typical and natural division is e.g. by date.

don't create complex (or a lot of) visualizations or dashboards on incomplete indexes. When you delete the index - e.g. because you want to roll out a new version - all the visualizations and dashboards built on it become useless.

if you like visual tools, use Apache NiFi to feed Elasticsearch. Getting started is easy, and reading e.g. from files or databases is done quickly.

A major plus point of Elasticsearch is that it inserts or updates data based on a key. A no-brainer once you have a key defined for your data.

What excited me about Elasticsearch and Kibana was having a system that lets you visualize the data as it happens. So I also use data from Kafka. The data is stored in Kafka as events, as they happen - in real time. Again I use NiFi to read from Kafka and push the data to Elasticsearch.

Apart from this, working with Kibana is real fun, and it lets you create great dashboards.

OK, this was the first intro from my side. I will publish more detailed projects here in the coming weeks.