Of late, “Big Data” has become one of the hottest topics in the worlds of business and computing.

I recently gave a presentation at an internal conference about Big Data, the technologies around it, and how today’s Database Administrators are affected by it. Here’s a background/outline of the talk, followed by a slimmed-down version of the presentation itself.

Background/Outline

Customer feedback is vital for any business; without it there can be no improvement in the products and services it offers. So, over time, companies have tried several approaches to gather feedback, such as in-person surveys and random phone calls to customers.

Today, connected devices like mobile phones, tablet PCs, social networks with hundreds of millions of users, and a wide variety of sensors are generating enormous amounts of data (on the order of hundreds of terabytes to a few petabytes), and at exceptional speeds (a single jet engine on a Boeing 737 generates as much as 240TB of data in a cross-country flight).

Businesses now call these large volumes of data “Big Data” and are learning to leverage it as a source of customer feedback. It has therefore become crucial to the success of any business.

Traditional database software is proving inefficient at capturing and storing this avalanche of data. With a little out-of-the-box thinking, the open source community has come up with several systems that can acquire, manage, and process such large volumes. Organizations must choose wisely among them.

Big Data is also creating several new roles, and early birds can make a fortune. :)

In this talk, I discussed the various technologies available, their working principles, and the opportunities today’s DBAs can explore in this new world of Big Data.

“Big Data” is not new to any of us. We see it every day, every moment. We contribute to Big Data every minute and keep making it bigger.

Yes, right now, before you landed on this post, you likely clicked a couple of links that recorded your IP address, geographic location, the website you were browsing, and several other details on some server.

And your smartphone might have been conversing with its manufacturer about the current OS version, a recent crash report, or updates for installed applications. Multiply that by hundreds of millions of users like you and me, and that is a vast amount of data generated every hour.

A Boeing 737 generates 240TB of data in a single cross-country flight – the velocity of data generation is very high.

So what happens to all the information being generated at this rate? It can easily measure from a few hundred gigabytes to a few terabytes, or even petabytes.

So, Big Data technologies are all about answering the following two questions:

How and where do you capture all this data?

How do you organize and make meaningful business decisions based on this data?

Capturing Big Data

The data being fed in by sources such as vast numbers of surveillance cameras, microphones, a wide variety of sensors, mobile phones, Internet click streams, tweets, and Facebook messages is, in fact, unstructured. The velocity of this data demands very high write performance from the datastore – so much so that the ACID guarantees of an RDBMS can themselves become a performance bottleneck. Poor write performance means an inability to capture the data as it arrives.

Keeping the store available at all times requires the data to be distributed across multiple servers, which in turn brings in the problems of replication and consistency.

Simple key-value stores have evolved in recent times that work without conforming to the ACID principles. These stores accept an application-defined “key” and some “value” and persist the record according to a preset configuration. They offer a variety of SLAs for replication, consistency, and speed of access. These are also called NoSQL databases.

As per the CAP theorem, in a distributed environment it is impossible to guarantee all three of Consistency, Availability, and Partition tolerance at once; you have to sacrifice one of them. NoSQL databases are built to operate in distributed environments (although they can run on a lone host). They are optimized for very high write performance by conforming to the BASE properties instead of the ACID properties that RDBMSs follow.

BASE – Basically Available, Soft state, Eventual consistency

Data from the application need not be normalized into multiple tables (as with an RDBMS), so an object is written or read in one shot, into a single key-value table. All the data for an object resides in one place and is co-located on disk, making reads sequential – which means very high disk throughput.
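For example, here is what that one-shot pattern looks like, with a hypothetical “order” object and a plain Python dict standing in for a real store’s put/get API (the names and fields are illustrative, not from any particular product):

```python
import json

store = {}  # stands in for a key-value store's put/get API

# The entire order, including nested customer and item data, is one value --
# no normalization into orders/items/customers tables.
order = {
    "order_id": "ORD-1001",
    "customer": {"name": "Asha", "city": "Hyderabad"},
    "items": [
        {"sku": "A12", "qty": 2, "price": 150.0},
        {"sku": "B07", "qty": 1, "price": 499.0},
    ],
}

# Write: the whole object is persisted under a single key in one shot.
store["order:ORD-1001"] = json.dumps(order)

# Read: one lookup returns everything; no joins required.
fetched = json.loads(store["order:ORD-1001"])
print(fetched["customer"]["name"])  # Asha
```

The trade-off, of course, is that queries other than “fetch by key” become the application’s problem.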

Key-value stores are classified into four types based on the kind of value they store:

Simple key-value stores (Amazon Dynamo)

Graph databases (FlockDB)

Column families (Cassandra)

Document databases (MongoDB)

Making sense of Big Data

MapReduce is a programming model for processing large data sets.

The idea is to send the code to where the data resides: with data sets this large, moving them around is expensive and time-consuming.

Map:

In this step, the problem is divided into sub-problems that are assigned to worker nodes (typically the nodes where the data resides).

Reduce:

The results are gathered from the worker nodes and merged to produce the final output.
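The model is easy to sketch in plain Python – a toy word count, with illustrative function names (this is the shape of the idea, not Hadoop’s actual API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: each worker turns its chunk of data into (key, value) pairs.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: pairs are grouped by key and their values are merged.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data big ideas", "data moves to code", "code moves to data"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
result = reduce_phase(mapped)
print(result["data"])  # 3
```

In a real cluster, each document would live on a different node, the map step would run there locally, and only the small intermediate (key, value) pairs would travel over the network to the reducers.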

There was a discussion on Techgig with IRCTC’s chief about website bandwidth issues, scaling, and related matters. I am not entirely satisfied with his response to one particular question.

The question and his response:

So, is it a deliberate strategy by IRCTC to not increase website capacity to allow more transactions per minute?

No. We have increased our bandwidth 10 times, to allow more people to come and log in. But traffic keeps on peaking, especially during a particular time in the morning. Suppose we were to increase capacity to unlimited ticket bookings a minute; then most tickets would be booked in the first hour. Of the 8 lakh people who log on simultaneously between 8 am and 8.10 am, only 50,000 manage to get tickets. There are about 7.5 lakh people who go dissatisfied each day. If we increase our capacity to handle 15 lakh concurrent connections, then about 14.5 lakh customers will go dissatisfied. The solution is to increase the capacity and number of trains. My aim is to double the 50,000 bookings to over a lakh successful bookings in the first hour, which will reduce the number of dissatisfied customers.

I’d say:

Why not at least scale it up enough that customers don’t have to wait for hours only to discover the tickets are all sold out? 8 lakh customers, 7.5 lakh of them dissatisfied. At least a few thousand of those customers were waiting just for pages to load at multiple stages of the booking, with wait times ranging from a few minutes to more than half an hour – that is thousands of man-hours wasted in a single hour.

There are a few very simple bottlenecks that can be fixed without much effort:

Home page

The IRCTC homepage weighs approximately 25KB (with all the images, stylesheets, and scripts).

He said there were 8 lakh logins between 8 AM and 8.10 AM, which means that many users got past the login page – i.e., 800,000 × 25KB ≈ 20GB of data transfer in a matter of ten minutes. Needless to say, the number of users who had loaded the page but were still in the process of authenticating could be much higher than those 8 lakhs.
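The back-of-the-envelope arithmetic checks out:

```python
logins = 800_000     # reported logins between 8:00 and 8:10 AM
page_size_kb = 25    # approximate homepage weight

# KB -> GB using decimal units (1 GB = 1,000,000 KB)
total_gb = logins * page_size_kb / 1_000_000
print(total_gb)  # 20.0
```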

Many browsers will have the images cached, so call it at least 5 to 10 gigs of data transfer in just those first 10 minutes.

Users are only interested in the login box, so why not load that part first (using pagelets or a similar technique), or serve just the login box, as in a mobile app, during the 8 AM – 10 AM window?

Search for trains

Station codes – the Ajax lookup makes users wait far too long for station codes. Until its performance is tuned, the codes could instead be hosted as a static file that users download once and search manually with Ctrl+F; that would be several times faster.

Calendar – this is worse. The widget doesn’t load during the peak (I’ve experienced it myself hundreds of times), and you cannot enter the date manually.

Captcha

Why do those captcha images take so long to load?

I accept that there will always be a good number of dissatisfied users. But if the issues above are addressed, IRCTC can save its customers hundreds or thousands of man-hours by not making them wait on unnecessary advertisements and sluggish widgets/Ajax components.

So, yesterday, I was talking with a friend of mine who is looking for a simple in-memory key-value store. She ran into memcached and a couple of others, but I thought I could quickly scribble one up, and gave it a shot:
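A minimal version might look something like this – a toy, thread-safe dict with optional expiry, nowhere near memcached’s feature set (class and method names are my own, not memcached’s API):

```python
import threading
import time

class KVStore:
    """A tiny thread-safe in-memory key-value store with optional TTLs."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value, ttl=None):
        # ttl is an optional lifetime in seconds; None means "never expires".
        expires = time.time() + ttl if ttl is not None else None
        with self._lock:
            self._data[key] = (value, expires)

    def get(self, key, default=None):
        with self._lock:
            item = self._data.get(key)
            if item is None:
                return default
            value, expires = item
            if expires is not None and time.time() > expires:
                del self._data[key]  # lazily evict expired entries on read
                return default
            return value

    def delete(self, key):
        with self._lock:
            return self._data.pop(key, None)

store = KVStore()
store.set("session:42", {"user": "alice"})
print(store.get("session:42"))  # {'user': 'alice'}
```

It obviously does no networking, eviction under memory pressure, or persistence – which is exactly why memcached and friends exist.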

Trends in search patterns for technologies & businesses in the decade ended 2010

These graphs from Google Trends may not reflect the exact number of people interested in each of these technologies over time, but I believe they are broadly accurate. What follows are my personal opinions on how industry trends have evolved and where they are headed.

The picture clearly shows that A.I. ruled for about three years in a row, though the volume of search queries in this space has been slowly falling. We have seen many A.I. products come to light: chat bots, spam filters, and fraud-detection systems have all become popular over time by incorporating Natural Language Processing and Machine Learning. Despite over a decade of study and R&D, A.I. still has a whole lot of sweeping changes to bring to the industry. Knowledge representation and its application can get a lot better than they are today!

For A.I., it has already been a long journey compared to other technologies. Google Trends data is only available from 2004, but A.I. as such actually dates back to 1956 (Ref: Wikipedia). To my knowledge, Artificial Intelligence has been in university curricula since the late 90s, possibly even earlier. Research and development have been ongoing ever since, and it appears the search trend had its highest peak in early 2004.

When it comes to “Virtualization” and “Cloud computing”, there is a steep rise in search volume from mid-2004 and 2007 respectively. Cloud computing has revolutionized the market – so much so that Larry Ellison, CEO and founder of Oracle Corporation, once called cloud computing an insane concept (Ref: 1, 2, 3, …), yet now offers the Oracle 11g database on Amazon Web Services while his team presents on the cloud at various conferences.

Cloud computing dramatically drives down costs and improves overall server utilization. This one factor helped the technology create shock waves in business computing, and it is why the spike is so steep.

The NoSQL movement, MapReduce, and the CAP theorem started gaining popularity in mid-2008, as the graph above shows. These technologies have had a great impact on how businesses classify and handle the data they work with: they scale to vast sizes and offer high availability.

Now that we have observed the technology trends, let us look at the Businesses and/or business areas that have had the edge over these years.

Yahoo, Google, and Wikipedia rank above the software giants Oracle and Microsoft. It can be inferred that open-source and free-to-use services gained the edge.

It almost goes without saying that open-source and free-to-use services are the way to go for startups, but it is worth noting that the technology and the scope of business you choose make all the difference! Know the trend before you hit the ground.

a. There is a lot of hype all around because cloud computing is a revolution in the making (http://bit.ly/g2z3su);

3. Is it something fictional? Is it really possible?

a. It is not fictional. If you believe you can create more than a few virtual machines on a single physical computer, cloud computing is very much possible. In fact, it is proven and current; you cannot really question its practicality (http://aws.amazon.com, http://cloudpower.in, …);

4. How can a Virtual Machine (VM) be scaled to a capacity more than what the underlying host has?

a. There is not one but a whole fleet of physical servers that make up the cloud. Usually there is a master host that controls the behavior of the rest. It places VMs whose workload peaks and valleys are uncorrelated onto the same physical host, and does so transparently.

5. I don’t think a VM on the cloud is as efficient as a dedicated physical server for my application. Shouldn’t I opt for a dedicated server instead?

a. Latencies may creep in, but there is always a trade-off: evaluate the TCO (Total Cost of Ownership) for your business. Dedicated physical servers **may** do well, but if I were you, I would weigh the maintenance costs – datacenter cooling, power, building lease, and other miscellaneous upkeep – and overall server utilization against the efficiency of the application. I would not want my servers sitting idle during off-peak hours when I pay the same out of pocket either way. I want my business to be **cost-effective**.