Analytics Infrastructure: Choosing the Right Platform

In a previous post, I introduced the viewpoint that selecting the right infrastructure to perform analytical tasks is key to an organisation's ability to deliver business results. Let us explore this thought further with reference to a specific client analytics challenge.

An analytical problem can be broken down into a number of steps:

Ingest – Integration – Analysis – Interpretation

And yes, I did try to find a synonym for 'analyse' starting with an 'i' so that we could have “four 'i's” to complement the “three (or four, or five, depending upon your source) 'v's” of big data (Volume, Variety, Velocity, Veracity and so on).

Data needs to be brought into the analytics environment from the originating sources. It then needs to be integrated with other information, potentially to cleanse or enhance it, so that the analysis phase delivers more meaningful results. Finally, the analysis has to be interpreted in the context of the business.

There is both a technical focus and a business focus at play here, and the balance between them changes as the work progresses through these steps. The front end, ingest and integration, is more technically focused. Moving or manipulating any volume of data takes a finite amount of time. Physics limits what any system can do. Contrast this with interpretation, which has very little to do with technology and is almost entirely business focused: what does the organisation want to do with the results of the analysis?

This particular client is struggling with the first two steps. The volumes of business data that need to be ingested are growing rapidly, whilst the time available to do this has remained constant. Similarly, the integration with other (also rapidly growing) client data sources is taking longer and longer.

Whereas in the past at the end of each working day they had a clear view of their business position, nowadays they are constantly trying to catch up and are unable to clearly understand how the business is performing.

Classical approaches such as using disk, tape or network to move data simply take too long when it comes to large volumes. The latest networking technologies may help in some situations, but this client's current server platforms are unable to support them. They have been forced to use alternative approaches that rely on parallel copies to work around this, but have now hit a ceiling on the number of parallel network connections they can support.

Put simply, they cannot physically ingest the volume of data in the time allowed.
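To make the constraint concrete, here is a rough back-of-the-envelope sketch of the ingest window. The function name, the parameters and the efficiency factor are illustrative assumptions of mine, not figures from this client; the point is only that time scales with volume and is capped by the number of streams the platform can sustain.

```python
def hours_to_ingest(volume_tb, link_gbit_per_s, parallel_streams, efficiency=0.7):
    """Estimate the wall-clock hours needed to move `volume_tb` terabytes.

    All parameters are hypothetical inputs for illustration:
    - link_gbit_per_s: nominal speed of one network stream, in Gbit/s
    - parallel_streams: number of concurrent copies the platform can sustain
    - efficiency: assumed fraction of nominal bandwidth actually achieved
    """
    total_bytes = volume_tb * 1e12
    bytes_per_second = (link_gbit_per_s * 1e9 / 8) * parallel_streams * efficiency
    return total_bytes / bytes_per_second / 3600


# Hypothetical example: does a nightly load fit into an 8 hour window?
window_hours = 8
needed = hours_to_ingest(volume_tb=50, link_gbit_per_s=1, parallel_streams=8)
print(f"needed {needed:.1f} h, window {window_hours} h, fits: {needed <= window_hours}")
```

Once the number of parallel streams cannot be raised any further, the only remaining levers in this simple model are a faster link or a smaller volume.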

It is not just in the ingest phase that they are struggling. Integration is taking much longer than before. One way of processing a large volume of data is to split it up and perform the data integration operations in parallel against these subsets.

For instance, divide the transaction records by client name and process names starting with different letters on different systems. This initially looks to be a great fit for an infrastructure consisting of many small systems, such as the client has today.
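As a minimal sketch of that first split, assuming each transaction record is a dictionary with a non-empty `client` field (the field name and the letter-to-system mapping are my own illustrative choices, not the client's actual scheme):

```python
from collections import defaultdict


def partition_by_initial(records, num_systems):
    """Assign each record to a system based on the first letter of the client name.

    Assumes every record has a non-empty "client" string. Letters are mapped
    to systems round-robin: A -> 0, B -> 1, ... wrapping at num_systems.
    """
    partitions = defaultdict(list)
    for record in records:
        initial = record["client"][0].upper()
        system = (ord(initial) - ord("A")) % num_systems
        partitions[system].append(record)
    return dict(partitions)


# Hypothetical sample data
transactions = [
    {"client": "Acme", "region": "EU"},
    {"client": "Bolt", "region": "EU"},
    {"client": "Crane", "region": "US"},
]
for system, recs in sorted(partition_by_initial(transactions, 2).items()):
    print(system, [r["client"] for r in recs])
```

Each system can then work through its own subset without needing to talk to the others, which is what makes the approach attractive at first sight.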

However, as part of this processing, the whole dataset needs to be brought together again and then split up again, this time by geographic location. If, after this join and split, a given client's records are to be processed on a different system, then that data needs to move: it is first written to disk and then re-read by the other system. With multiple join-split operations needing to be performed, the overall processing time is now ten times longer than before.
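The cost of the re-split can be illustrated by counting how many records end up on a different system once the key changes from client name to location. This is only a sketch under my own assumptions: the record fields, the region list and both mapping functions are hypothetical stand-ins for whatever scheme is really in use.

```python
def system_for_name(client, num_systems):
    """First split: map the client's initial letter to a system (assumed scheme)."""
    return (ord(client[0].upper()) - ord("A")) % num_systems


def system_for_region(region, regions, num_systems):
    """Second split: map a region to a system by its position in `regions` (assumed scheme)."""
    return regions.index(region) % num_systems


def records_to_move(records, regions, num_systems):
    """Count records whose owning system changes between the two splits."""
    moved = 0
    for record in records:
        before = system_for_name(record["client"], num_systems)
        after = system_for_region(record["region"], regions, num_systems)
        if before != after:
            moved += 1
    return moved


# Hypothetical sample data
regions = ["EU", "US", "APAC"]
transactions = [
    {"client": "Acme", "region": "EU"},
    {"client": "Bolt", "region": "EU"},
    {"client": "Crane", "region": "US"},
    {"client": "Delta", "region": "APAC"},
]
print(records_to_move(transactions, regions, num_systems=2), "of", len(transactions), "records move")
```

Every record counted here incurs a disk write on one system and a disk read on another, and that cost is paid again for each join-split operation in the job.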

As the volumes of data have grown, the time involved in moving data has made the solution unsustainable. More time is spent moving data than actually performing useful data integration work. This is a common problem in infrastructures struggling with rapidly growing data, and it leads to highly inefficient environments.

Consider now the case where this manipulation of data can occur in memory. With no need to move data to disk, a single large system with a large quantity of memory allows these challenges to be overcome.
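A small sketch of what "in memory" buys, again with hypothetical field names and data: when all records live in one address space, regrouping by a new key simply builds a new index over the same objects, with nothing written out or re-read.

```python
from collections import defaultdict


def regroup(records, key):
    """Group records by `key`, holding references to the same in-memory objects."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# Hypothetical sample data
transactions = [
    {"client": "Acme", "region": "EU"},
    {"client": "Bolt", "region": "EU"},
    {"client": "Crane", "region": "US"},
]

by_client = regroup(transactions, "client")  # first split
by_region = regroup(transactions, "region")  # re-split, no data copied or moved

# Both groupings point at the very same record object.
print(by_client["Acme"][0] is by_region["EU"][0])
```

This toy version ignores everything that makes a real large-memory system hard to build, but it shows why the join-split step stops being dominated by data movement.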

A data challenge that initially seems to be an excellent fit for many small systems is in fact better solved with a single large system. By better understanding the characteristics of the challenge the client is facing, a more suitable solution to their problems can be found.
