Designing Big Data
Datarella’s Big Data process is designed to extract the maximum value from the data. Designing data-driven applications starts with developing the data engineering infrastructure – data generation, storage, and pre-processing. The second step is building the models from the data and feeding the resulting predictions into the application. All steps are developed in an agile way to keep up with changing conditions and requirements.

First step: Data engineering
Data science starts at the point when the data is generated. Understanding the technical specifications and restrictions at the level of the sensory probes is the first step to understanding data. As little information as possible should be lost before the data can enter meaningful analyses. Signal processing and data aggregation at the source can dramatically limit the ways in which the data can be used afterwards.

Data preprocessing has to be considered at every step of the signal chain, until the data finally arrives at the backend to be stored. Storage starts with uploading or streaming the data into a bucket, a more or less unstructured repository that can keep it, no matter if the following processes are delayed or stopped. From the bucket, the data may be extracted, transformed into a table-like format, and loaded into a data warehouse – the so-called ETL process. Alternatively, it may be processed in a streaming analytics service in real time, right when the data packages arrive.
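The ETL flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Datarella's actual pipeline: the raw JSON blobs stand in for objects landing in a storage bucket, and an in-memory SQLite table stands in for the data warehouse.

```python
import json
import sqlite3

# Hypothetical raw records as they might land in a storage bucket:
# unstructured JSON blobs, one per upload.
bucket = [
    '{"device": "probe-1", "ts": 1700000000, "temp": 21.4}',
    '{"device": "probe-2", "ts": 1700000005, "temp": 19.8}',
]

# Extract: pull the raw objects out of the bucket.
records = [json.loads(blob) for blob in bucket]

# Transform: flatten each record into a table-like row.
rows = [(r["device"], r["ts"], r["temp"]) for r in records]

# Load: insert the rows into a warehouse table (SQLite stands in here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (device TEXT, ts INTEGER, temp REAL)")
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0])  # → 2
```

In a real deployment the bucket would be an object store and the warehouse a dedicated analytics database, but the extract–transform–load structure stays the same.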

Second step: Predictive Analytics
From the data – stored and real time – mathematical models are formed to make the information in the data accessible and understandable. These models typically consist of two kinds of functions: non-linear regressions predict the value of a target variable from the values of its assumed influences, while pattern recognition algorithms reduce the high dimensionality of the data to basic structures. The models are implemented as services in applications.
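Both kinds of functions can be illustrated with a small sketch. The data below is invented for illustration: a quadratic fit plays the role of a non-linear regression, and a principal-component projection via SVD plays the role of a dimensionality-reducing pattern recognition step.

```python
import numpy as np

# Hypothetical readings: predict a target variable (e.g. energy use)
# from an assumed influence (e.g. outside temperature) with a quadratic
# fit, a simple form of non-linear regression.
temperature = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0])
energy_use  = np.array([9.8, 6.1, 4.2, 4.0, 5.9, 10.2])

coeffs = np.polyfit(temperature, energy_use, deg=2)  # fit a parabola
predict = np.poly1d(coeffs)

# Pattern recognition side: project the two correlated measurements onto
# their first principal component, reducing the dimensionality to one.
X = np.column_stack([temperature, energy_use])
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
component = X_centered @ Vt[0]  # 1-D structure behind the 2-D points
```

After fitting, `predict(x)` estimates energy use at any temperature, and `component` contains the one-dimensional coordinates that capture most of the variation in the two measurements.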

With machine learning, the model parameters can be updated in real time with each new data point arriving. The quality of the predictions is continuously monitored; variations of the models are constantly tested and benchmarked to guarantee optimal parameter selection. For all incoming measurements, a multitude of results is calculated by variations of the model parameters.
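An online parameter update of this kind can be sketched as stochastic gradient descent on a single weight. The model form, data stream, and learning rate below are illustrative assumptions, not the system's actual configuration.

```python
# Minimal sketch of an online update: a single weight w in the model
# y ≈ w * x is adjusted with each arriving data point.
def sgd_step(w, x, y, lr=0.05):
    error = w * x - y          # residual of the current prediction
    return w - lr * error * x  # move w against the squared-error gradient

w = 0.0
stream = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (1.5, 3.0), (2.5, 5.1)]
for x, y in stream:
    w = sgd_step(w, x, y)      # the model improves as each point arrives
```

After these five points, `w` has moved from 0 toward the true slope of roughly 2; with a longer stream it converges further. No retraining pass over historic data is needed, which is what makes real-time updates feasible.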

The set of parameters with the best fit, i.e. the smallest difference between its estimated result and the real measurement, is used as the prediction, until a better fitting set of parameters is found. This method of model fitting mimics biological evolution: the sets of parameters are equivalent to mutations, and only the best fitting models survive. As in biology, this leads to fast adaptation to changing conditions and avoids getting stuck with sub-optimal predictions.
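The mutate-and-select loop described above can be sketched as follows. Everything here is illustrative: a single parameter `w`, Gaussian "mutations" around the current best, and survival of the candidate with the smallest prediction error on each incoming measurement.

```python
import random

random.seed(0)

def fitness(w, x, y):
    # Prediction error of parameter set w on one measurement (x, y).
    return abs(w * x - y)

best_w = 1.0                                            # initial parameter set
stream = [(1.0, 3.0), (2.0, 6.1), (3.0, 8.9), (1.5, 4.4)]  # true slope ≈ 3

for x, y in stream:
    # Mutate: propose variations around the current best parameter set.
    candidates = [best_w] + [best_w + random.gauss(0, 0.5) for _ in range(20)]
    # Select: only the best fitting variant survives this measurement.
    best_w = min(candidates, key=lambda w: fitness(w, x, y))
```

Because the current best is always among the candidates, the fit can only stay equal or improve on each measurement, while the random mutations let the parameters track a drifting signal instead of locking into a stale optimum.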

Agile analytics with machine learning and continuous model updates provides the best input for data-driven applications.