Data, data and more data
http://van-maanen.com
Tom's playground

Write an AVRO file
http://van-maanen.com/index.php/2018/05/23/write-an-avro-file/
Wed, 23 May 2018 10:51:28 +0000

Below, I provide some Python code to write an AVRO file. An AVRO file consists of a schema and a set of records. The records are written in binary format. The schema is as follows:

The script can be run on the command line as C:\ProgramData\Anaconda3\python.exe C:\Users\tmaanen\.spyder-py3\TomHdfs.py C:\Users\tmaanen\.spyder-py3\a.avro

Copying content of an Oracle table into an avro file
http://van-maanen.com/index.php/2018/05/21/copying-content-of-an-oracle-table-into-an-avro-file/
Mon, 21 May 2018 20:14:26 +0000

Below, you will find a listing on how to copy the content of an Oracle table into an avro file. The trick is quite straightforward: the table is read via a cursor, and each record is appended to an avro file. The schema is:

Sending Avro file via HTTP
http://van-maanen.com/index.php/2018/05/15/sending-avro-file-via-http/
Tue, 15 May 2018 21:02:13 +0000

It is possible to send an AVRO file via HTTP. The idea is that one sets up a server process; once the server process runs, a client call is made.
I found a neat scheme that shows how such a process works.
On the server side, a socket must be set up. During the set-up, one must indicate which type of network is used. The most common is the IPv4 system: four numbers that indicate a position in a network. This is indicated with “AF_INET”. Once the socket is set up, it must be bound to an (internal) IP address; this binding also includes the port. After that, the server can be instructed to start listening on that port.
The system will be instructed to receive AVRO files. Therefore, the avro modules are imported.

One is then able to see how the server reacts to the client call. We use Python3 to work this out. The server process runs as:
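The listings are not preserved in this copy. A self-contained sketch of both sides (with an ephemeral port, and a stand-in byte string where one would send the bytes of the AVRO file) could be:

```python
import socket
import threading

# Server side: set up an IPv4 (AF_INET) socket, bind it to an
# address and port, and start listening.
HOST = "127.0.0.1"
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind((HOST, 0))             # port 0: let the OS pick a free port
PORT = srv.getsockname()[1]
srv.listen(1)

received = []

def serve_once():
    # Accept one connection and read the content until the client closes.
    conn, _addr = srv.accept()
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:
            break
        chunks.append(data)
    conn.close()
    received.append(b"".join(chunks))

t = threading.Thread(target=serve_once)
t.start()

# Client side: connect and send the raw bytes. A stand-in payload is
# used here instead of open("a.avro", "rb").read().
payload = b"Obj\x01 avro file bytes"
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect((HOST, PORT))
cli.sendall(payload)
cli.close()

t.join()
srv.close()
```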

Show content of an AVRO file with Python
http://van-maanen.com/index.php/2018/05/12/show-content-of-an-avro-file-with-php/
Sat, 12 May 2018 20:14:56 +0000

This note describes how we can show the content of an AVRO file with Python.
We use Python3 as the tool here, from an Anaconda installation. I checked whether this installation already contained an AVRO package, but this wasn’t the case. Therefore, AVRO was downloaded (as avro-python3-1.8.2.tar.gz) and the following command was issued: C:\ProgramData\Anaconda3\python.exe -m pip install C:\ProgramData\Anaconda3\avro-python3-1.8.2.tar.gz.
Once AVRO was installed, the content of an AVRO file can be shown with:

What is inside an AVRO file?
http://van-maanen.com/index.php/2018/05/10/what-is-inside-an-avro-file/
Thu, 10 May 2018 21:03:42 +0000

In the Hadoop world, one may encounter AVRO files quite often. AVRO files can be used in more or less the same way as text files: they can be read, stored and written out. However, AVRO files are binary files, which makes them difficult to read. Help is at hand, though: I downloaded avro-tools-1.8.2.jar from http://apache.cs.uu.nl/avro/avro-1.8.2/java/. See: avro-tools-1.8.2.jar.
This opens the possibility to investigate what is inside an AVRO file. Let us assume we have an AVRO file data.avro. The next command provides an overview of the possibilities:

java -jar avro-tools-1.8.2.jar

The avro file can be written to a JSON file with:

java -jar avro-tools-1.8.2.jar tojson data.avro

or the schema is provided with:

java -jar avro-tools-1.8.2.jar getschema data.avro

mounting drives
http://van-maanen.com/index.php/2018/05/06/mounting-drives/
Sun, 06 May 2018 20:17:03 +0000

Today, I played a bit with mounting external drives. Boy, that was more difficult than I thought.
The situation was as follows. I had a Linux desktop and I wanted to use network resources: a NAS, a samba drive and a few Windows shares.
At the end of the day, I only got the samba drive working. This was a directory on a Raspberry Pi, made available through samba. The terminal commands are:
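The exact commands are not preserved in this copy. A sketch with hypothetical host, share and mount-point names (and assuming the cifs-utils package for SMB client support) would be:

```shell
# Hypothetical names: host "raspberrypi", share "pishare", user "pi".
sudo apt-get install cifs-utils
sudo mkdir -p /mnt/pi
sudo mount -t cifs //raspberrypi/pishare /mnt/pi -o username=pi
```

These commands need root and an actual samba server on the network, so they are a sketch rather than something to run verbatim.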

That is all I got: no access to the Windows shares, no access to my NAS. Hm, Windows is easier in this respect.

Sqoop
http://van-maanen.com/index.php/2018/05/01/sqoop-2/
Tue, 01 May 2018 21:08:07 +0000

Sqoop is used in a Hadoop environment to load data from a relational database onto the Hadoop platform. I have installed an Oracle 12.2 database and tried to load data from that database into Hadoop. To do so, one needs to install a Java JDBC driver to be able to access the database. For Oracle 12, two drivers are available: ojdbc6.jar and ojdbc7.jar. Both seem to work fine. I established this by putting both files in /opt/cloudera/parcels/CDH/lib/sqoop/lib, which is the directory a Cloudera environment uses for sqoop's database drivers. Subsequently, I verified the correct working with:
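The verification command is not preserved here. A typical check, with a hypothetical connect string and credentials, is to let sqoop list the tables it can see through the driver:

```shell
# Hypothetical host, service name and credentials.
sqoop list-tables \
  --connect jdbc:oracle:thin:@//oraclehost:1521/orclpdb \
  --username scott \
  --password tiger
```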

Installing Cloudera Manager
http://van-maanen.com/index.php/2018/04/25/installing-cloudera-manager/
Wed, 25 Apr 2018 09:44:45 +0000

Cloudera Manager is a great tool to install a Big Data platform. I worked with this tool; let me share some findings. I realise that only a limited range of platforms can be used to install the Cloudera Manager on. In a first attempt, I used a newer version of Ubuntu, but this was not supported. Therefore, I started with Ubuntu 16.04.
During one installation attempt, I was given a warning that Ubuntu uses 127.0.1.1 as the host address and that this will lead to issues. In a subsequent step, I indeed encountered the promised issues. Therefore, in another attempt, I modified the /etc/hosts file:

127.0.0.1 ubuntu
127.0.0.1 localhost

I also like to work with a fixed network address. This can be attained by including these lines in /etc/network/interfaces:
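The lines themselves are not preserved in this copy; a sketch of such a static configuration, with hypothetical addresses, is:

```
auto ens33
iface ens33 inet static
    address 192.168.1.100
    netmask 255.255.255.0
    gateway 192.168.1.1
    dns-nameservers 192.168.1.1
```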

In these lines, ens33 is the network interface. Previous versions of Ubuntu used eth0. The name of the network interface can be found with the “ifconfig” command.

During the installation, use is made of an ssh connection, in combination with a sudo command. However, during the installation I discovered that the sudo command must work without a password. So we have a series of steps to take. SSH can be installed with:

sudo apt-get install openssh-server

We then enable connecting over ssh as root. This can be done in the file /etc/ssh/sshd_config, which contains a clause PermitRootLogin. This must be set to “yes”:

PermitRootLogin yes

To allow sudo without providing a password, we have to adjust the /etc/sudoers file, granting the permission to the user tom:
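The line itself is not preserved in this copy; a sketch of such an entry (to be edited via visudo) is:

```
tom ALL=(ALL) NOPASSWD: ALL
```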

I noted that the web browser didn’t run. I got it running via sudo firefox -ProfileManager. I also had to adjust the permissions on ~/.mozilla/firefox/ to make the files readable.

The installation was then straightforward, but the above things have to be settled first.

the Oracle Tuning Advisor
http://van-maanen.com/index.php/2018/04/03/the-oracle-tuning-advisor/
Tue, 03 Apr 2018 19:30:42 +0000

Oracle now includes a tuning advisor. Its purpose is to advise on a SQL statement; for me, it provides an answer on whether to add an index or not. I was taught that one should always start with tables that have no indices. Only when it is demonstrated that indices are used should one create an index. The idea is that the maintenance of indices costs time, which must be offset against the gains from an index. If no gains are expected, it is useless to create such an index. Of course, we could also use the explain plan for that; it reveals whether an index is used or not.
First, such a SQL statement should be included in a structure where it can be analysed. Such a structure can be created with (in SQL*Plus):
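The statement is not preserved in this copy. A sketch, with a hypothetical SQL text and user but with the task name that appears further on, is:

```sql
DECLARE
  l_task_name VARCHAR2(128);
BEGIN
  l_task_name := DBMS_SQLTUNE.create_tuning_task(
    sql_text    => 'SELECT * FROM emp WHERE ename = :name',
    user_name   => 'SCOTT',
    scope       => DBMS_SQLTUNE.scope_comprehensive,
    time_limit  => 60,
    task_name   => 'STA_SPECIFIC_EMP_TASK',
    description => 'Tuning task for an emp query');
END;
/
```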

The existence of this structure can be seen in Enterprise Manager under the heading Performance > SQL Tuning Advisor > Manual. There, we see the entry that we just created.
Subsequently, we may execute this entry with:
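The statement is not preserved in this copy; with the task name used elsewhere in this post, it would be:

```sql
EXEC DBMS_SQLTUNE.execute_tuning_task(task_name => 'STA_SPECIFIC_EMP_TASK');
```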

Clicking on it shows a suggestion: it states that an index could be added to improve performance.
One could also receive such information with:

SET LONG 10000;
SET PAGESIZE 1000
SET LINESIZE 200
SELECT DBMS_SQLTUNE.report_tuning_task('STA_SPECIFIC_EMP_TASK') AS recommendations FROM dual;
SET PAGESIZE 24

If that is done (an index on the filter column) and the entry is executed again, the suggestion disappears.
We may finally remove the entry with:

exec dbms_sqltune.drop_tuning_task('STA_SPECIFIC_EMP_TASK');

SCD-2 and the Oracle Merge statement
http://van-maanen.com/index.php/2018/02/21/scd-2-and-the-oracle-merge-statement/
Wed, 21 Feb 2018 11:10:28 +0000

Yesterday I found a very nice script that allows processing records in a target table with an SCD-2 mechanism. The script is very elegant: it contains several logical steps that are knitted together in one SQL statement.

The mechanism of SCD-2 is as follows. First, a comparison is made between a source table and a target table. One may encounter two situations: a record that is completely new, and a record that is in reality an update of a previous situation.

When a new situation is provided, a new record must be inserted into the target table. When an update is provided, the existing record in the target table must be updated with an end date, indicating that the record describes history. Moreover, in case of an update, a new record must also be inserted.

Let us look at an example. We start with:

Id stands for the identification of a situation that can be updated. Naam (Dutch for “name”) is a field that can be updated. Sk is a surrogate key that acts as a primary key. Startdate is the moment of the insert; enddate is the point in time until which the description is valid.

Let us assume we have new data that look like this:

We may then derive that the final version will look like:

We see tom and ine being unchanged. The stella record is finished with an end date (21-02-2018). Paula and tessa are inserted with a startdate, and with an end date that is set at 31-12-99.

The logic of this procedure is that in a first step a comparison is made between the target table and the source table. If no match is found between the target table and the source table (based on id), a record is inserted.
If a match is found but the descriptions differ, two actions are taken: the current record is closed by setting its end date to the system date, and a new record is inserted.
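The original MERGE listing is not preserved in this copy. The two logical steps can be sketched in Python with sqlite3 (Oracle's MERGE knits them into one statement); the names and dates follow the example above, and "stella-new" is a hypothetical changed description for stella:

```python
import sqlite3

OPEN_END = "31-12-99"   # end date marking the current version
TODAY = "21-02-2018"    # the system date in the example

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE target (
    sk INTEGER PRIMARY KEY AUTOINCREMENT,
    id INTEGER, naam TEXT, startdate TEXT, enddate TEXT)""")
cur.execute("CREATE TABLE source (id INTEGER, naam TEXT)")

# Starting situation in the target table.
cur.executemany(
    "INSERT INTO target (id, naam, startdate, enddate) VALUES (?, ?, ?, ?)",
    [(1, "tom", "01-01-2018", OPEN_END),
     (2, "ine", "01-01-2018", OPEN_END),
     (3, "stella", "01-01-2018", OPEN_END)])

# New data: tom and ine unchanged, stella updated, paula and tessa new.
cur.executemany("INSERT INTO source VALUES (?, ?)",
                [(1, "tom"), (2, "ine"), (3, "stella-new"),
                 (4, "paula"), (5, "tessa")])

# Step 1: close the current record of every id whose description changed.
cur.execute("""UPDATE target SET enddate = ?
               WHERE enddate = ? AND EXISTS (
                   SELECT 1 FROM source s
                   WHERE s.id = target.id AND s.naam <> target.naam)""",
            (TODAY, OPEN_END))

# Step 2: insert a record for every id that now has no current version
# (brand-new ids as well as the ones just closed in step 1).
cur.execute("""INSERT INTO target (id, naam, startdate, enddate)
               SELECT s.id, s.naam, ?, ? FROM source s
               WHERE NOT EXISTS (
                   SELECT 1 FROM target t
                   WHERE t.id = s.id AND t.enddate = ?)""",
            (TODAY, OPEN_END, OPEN_END))
con.commit()
```

After these two steps, tom and ine are untouched, stella is closed with end date 21-02-2018, and stella-new, paula and tessa are the current records alongside tom and ine.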