Big Data Analysis

As described in the section Publication of context information as Open Data,
the Cygnus software can store selected data published in the Context Broker in
HDFS-based storage. This provides a long-term historical database of context
information that can be used for later analysis, for instance by implementing
MapReduce algorithms or by running queries over big data through Hive.

For this purpose, the Cygnus component is configured to gather data from the
Context Broker and store it in HDFS. The configuration in this case should
include the Cosmos Namenode endpoints, the service port, the user’s credentials,
the Cosmos API used (webhdfs, infinity, httpfs), the type of attribute and the
endpoint of the Hive server.

Once the context data has been stored, it is possible to use the Big Data GE to
process it, either with a MapReduce application or with Hive. It is of course
also possible to process other big datasets in the Big Data GE, either by
themselves or in combination with context information.

A typical example would be to analyse massive information gathered from sensors
in a city over a long period of time. All the data would have been gathered
through the Context Broker and Cygnus and stored in the Big Data GE. To analyse
the data, a few steps should be followed (the examples in this whitepaper are
based on a global and shared instance of the Cosmos Big Data GE):

Browse to the Cosmos portal (http://cosmos.lab.fiware.org/cosmos-gui/). Use an
already registered FI-LAB user to create a Cosmos account. The details of your
account will be given once you have registered, typically:

Cosmos username: if your FI-LAB username is <my_user>@mailprovider.com,
your Cosmos username will be <my_user>. This gives you a Unix-like
account in the head node of the global instance, with
/home/<my_user>/ as your user space.

Cosmos HDFS space: apart from your Unix-like user space in the head node,
you will have an HDFS space distributed across the entire cluster, located at
/user/<my_user>/.

Now you should be ready to log in to the head node of the global instance of
Cosmos in FI-LAB, simply using your FI-LAB credentials:

[remote-vm]$ export COSMOS_USER=<my_user>  # not strictly necessary; it just lets the example commands be copied and pasted
[remote-vm]$ ssh $COSMOS_USER@cosmos.lab.fi-ware.org

Once logged in, you can access your HDFS space by using the Hadoop file
system commands:
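
A minimal sketch of such commands, assuming the COSMOS_USER variable exported
earlier. The folder name new_folder is only illustrative. When no hadoop client
is found, the commands are printed instead of executed, so the snippet can also
be tried outside the head node:

```shell
# Assumption: COSMOS_USER holds your Cosmos username; "my_user" is a placeholder.
COSMOS_USER="${COSMOS_USER:-my_user}"
HDFS_HOME="/user/$COSMOS_USER"

# Use the real client on the head node; elsewhere just print the commands.
if command -v hadoop >/dev/null 2>&1; then HADOOP="hadoop"; else HADOOP="echo hadoop"; fi

$HADOOP fs -ls "$HDFS_HOME"                  # list your HDFS space
$HADOOP fs -mkdir "$HDFS_HOME/new_folder"    # create a directory (illustrative name)
$HADOOP fs -ls "$HDFS_HOME"                  # the new directory should now be listed
```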

Apart from using the stored context data, you can upload your own data to your
HDFS space using the Hadoop file system commands. This can only be done after
logging into the head node, and it allows uploading local Unix files placed in
the head node:
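
As a sketch of such an upload, assuming the COSMOS_USER variable exported
earlier: the file name, its content and the HDFS folder are illustrative, and
the commands are only printed when no hadoop client is available:

```shell
# Assumptions: placeholder user name, illustrative folder and file names.
COSMOS_USER="${COSMOS_USER:-my_user}"
HDFS_DIR="/user/$COSMOS_USER/input/unstructured"
LOCAL_FILE="unstructured_data.txt"

# Use the real client on the head node; elsewhere just print the commands.
if command -v hadoop >/dev/null 2>&1; then HADOOP="hadoop"; else HADOOP="echo hadoop"; fi

# Create a small local file in the Unix file system of the head node.
echo "some free text to be analysed later" > "$LOCAL_FILE"

$HADOOP fs -mkdir "$HDFS_DIR"                # Hadoop 2.x and later need -p to create missing parents
$HADOOP fs -put "$LOCAL_FILE" "$HDFS_DIR/"   # copy the local file into HDFS
$HADOOP fs -cat "$HDFS_DIR/$LOCAL_FILE"      # print it back to check the upload
```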

However, the WebHDFS/HttpFS RESTful API allows you to upload files
that exist outside the global instance of Cosmos in FI-LAB. The following example
uses HttpFS instead of WebHDFS (the TCP/14000 port instead of TCP/50070),
with curl as the HTTP client (your applications should implement their own
HTTP client):
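
A hedged sketch of such an upload follows. It assumes that HttpFS is reachable
on the host named elsewhere in this text (cosmos.lab.fi-ware.org), that the
user.name parameter is accepted for authentication (your instance may require
other credentials), and that data.txt exists locally. The helpers httpfs_url
and run_curl, the DRY_RUN switch and the folder names are introduced here only
for illustration; by default the requests are printed, not sent:

```shell
# Assumptions: placeholder user, host taken from the text, HttpFS on TCP/14000.
COSMOS_USER="${COSMOS_USER:-my_user}"
COSMOS_HOST="${COSMOS_HOST:-cosmos.lab.fi-ware.org}"
HTTPFS_PORT=14000
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to really send the requests

# Build the REST URL for a path relative to your HDFS space and an operation.
httpfs_url() {
    echo "http://$COSMOS_HOST:$HTTPFS_PORT/webhdfs/v1/user/$COSMOS_USER/$1?op=$2&user.name=$COSMOS_USER"
}

# Print the curl command in dry-run mode, execute it otherwise.
run_curl() {
    if [ "$DRY_RUN" = "1" ]; then echo "curl $*"; else curl "$@"; fi
}

# Create the target directory.
run_curl -i -X PUT "$(httpfs_url input/remote MKDIRS)"

# Step 1: request the file creation; the answer is a temporary redirection
# whose Location header gives the URL for the second step.
run_curl -i -X PUT "$(httpfs_url input/remote/data.txt CREATE)"

# Step 2: send the content to that URL. With HttpFS it is expected to be the
# same URL plus data=true; in real use, read it from the Location header.
run_curl -i -X PUT -T data.txt -H "Content-Type: application/octet-stream" \
    "$(httpfs_url input/remote/data.txt CREATE)&data=true"
```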

Data uploading is a two-step operation, as stated in the WebHDFS
specification. The first invocation of the API talks directly to the head node,
specifying the creation of the new file and its name. The head node then
sends a temporary redirection response that identifies the data node, among all
those in the cluster, where the data has to be stored; this data node is the
endpoint of the second step. The HttpFS gateway implements the same API, but
its internal behaviour differs: the redirection points to the head node itself.

If the data you have uploaded to your HDFS space is a CSV-like file, i.e. a
structured file containing lines of data fields separated by a common character,
then you can use Hive to query the data:
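
One way to do this is to declare an external Hive table over the HDFS folder
that holds the file. The sketch below assumes comma-separated fields; the
column names and the folder are hypothetical and must be adapted to your own
data. The statement is only printed when no Hive client is available:

```shell
# Assumptions: placeholder user, illustrative folder, hypothetical columns.
COSMOS_USER="${COSMOS_USER:-my_user}"
TABLE="${COSMOS_USER}_star_wars"
DATA_DIR="/user/$COSMOS_USER/input/structured"

# External table: Hive reads the files in place, it does not move or copy them.
HIVEQL="create external table $TABLE (name string, planet string, profession string, age int) row format delimited fields terminated by ',' location '$DATA_DIR/'"

if command -v hive >/dev/null 2>&1; then
    hive -e "$HIVEQL"    # run the statement through the Hive CLI
else
    echo "$HIVEQL"       # no Hive client here: just show the statement
fi
```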

These Hive tables can also be queried locally by using the Hive CLI:

[head-node]$ hive
hive> select * from <my_user>_star_wars; -- or any other SQL-like statement, properly called HiveQL

Or remotely, by developing a Hive client (typically using JDBC, but there are
other options for non-Java programming languages) that connects to
cosmos.lab.fi-ware.org:10000.

Several pre-loaded MapReduce examples can be found in every Hadoop distribution.
You can list them by logging into the head node via ssh and invoking Hadoop:

[head-node]$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar

For instance, you can run the word count example (this is also known as the
"hello world" of Hadoop) by typing:
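
A minimal sketch of such a run, assuming the examples jar path shown above and
the COSMOS_USER variable exported earlier. The input and output folders are
illustrative, and the commands are only printed when no hadoop client is
available:

```shell
# Assumptions: placeholder user, illustrative input and output folders.
COSMOS_USER="${COSMOS_USER:-my_user}"
EXAMPLES_JAR="/usr/lib/hadoop-0.20/hadoop-examples.jar"
INPUT="/user/$COSMOS_USER/input/unstructured"
OUTPUT="/user/$COSMOS_USER/output"    # must not exist before the job runs

# Use the real client on the head node; elsewhere just print the commands.
if command -v hadoop >/dev/null 2>&1; then HADOOP="hadoop"; else HADOOP="echo hadoop"; fi

$HADOOP jar "$EXAMPLES_JAR" wordcount "$INPUT" "$OUTPUT"   # run the word count job
$HADOOP fs -ls "$OUTPUT"                                   # list the result files
$HADOOP fs -cat "$OUTPUT/part-*"                           # print the word counts
```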