Using Big Data to solve real world problems

Big Data - large data sets on the order of petabytes - is a buzzword you must have heard every time you read about the next big internet company, social network or e-commerce site. Amazon, Google, Facebook, Netflix and many other large internet corporations have proved time and again that Big Data and analysis of these data sets can provide the critical business intelligence that takes you to the next level and makes you the absolute leader. Great!

Big Data (Photo credit: Kevin Krejci)

You might be tempted to think that Big Data is relevant only to internet websites, but in reality it is used, and can be used, in a variety of ways that have nothing to do with a website. Banks, for example, have huge data sets of positions, prices and other market parameters, and use this data to measure their risk and to plan trading strategies. It's not as exciting to talk about as social networks, but it is a good use case.

The other day I read that Big Data can revolutionize the way we provide healthcare to millions of people - great! That is a more real-world problem where I think it will help us a lot. Take this example of using mobile phones to fight drug counterfeiting in Nigeria: http://www.bbc.co.uk/news/world-africa-20976277. It is not about large sets of user information; instead there is a data set of the codes from all the legitimate drug packets, and users have a simple text-message interface to check whether the packet they are holding is legitimate. That can be considered a good use case of Big Data.

This post is an attempt to understand how we can use Big Data in real life scenarios.

A pre-Web 2.0 example of Big Data in real-world solutions

About 10 years ago, long before the NoSQL databases that are so popular now, and I think just around the time Google released the MapReduce and BigTable papers, I visited the National Remote Sensing Agency (NRSA) offices in India as part of a college academic tour. The main aim was to understand how the agency acquired satellite images, how it processed them and how it extracted information from them.

I learnt that the Indian remote sensing satellites, about 4 or 5 of them, would orbit around the poles and each day photograph different parts of India and of the other countries they passed over while visible to the ground station. All these images would be stored in raw format, and the next day they would be processed, separated into grids by location, tagged and cataloged in a large database.

And there was a large amount of data, collected over a period of years - much like the Google Maps satellite imagery, if you want a Web example. The analysts at the agency could look through all this data and see how the terrain, vegetation, soil composition, water table and so on had changed in any given region over time. In one example I saw, of Hyderabad, India, the amount of greenery had decreased all across the city and its surroundings over the preceding 10 years, and along with it the water table - all shown as a single image with colored regions highlighting the change.

While that was a measurement of changes in nature caused by human activity such as construction, there was another example where the images showed the salinity of the ground - I don't remember what kind of imaging it was, probably infrared. It turns out the Indian government was conducting a desalination project in the region, and to track the progress being made on the ground it turned to NRSA for proof, in the form of satellite images, of the changes in salinity. That would have been solid evidence in the face of any contractors not doing the work on the ground.

And all this was being done before we ever started hearing the term Big Data.

Sometimes all that is needed to solve a problem is to think about how easily you can check the validity of a fact. It's like using MD5 hashes to ensure that the file you downloaded is correct, or - taking it to a different scale - how Google decided to use atomic clocks and GPS to stamp every data operation with a time and an error bound, and let the database nodes work out how to sync the data within that error window instead of trying to synchronize the updates.
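The MD5 check mentioned above takes only a few lines of Python; the file contents and "published" checksum here are made-up stand-ins, not real artifacts:

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of some bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

# The publisher computes the hash once and publishes it alongside the file.
published = md5_hex(b"the original file contents")

# After downloading, you recompute the hash locally and compare.
downloaded_ok = md5_hex(b"the original file contents") == published

# A single changed byte produces a completely different digest.
corrupted = md5_hex(b"the original file contentz") == published
```

The whole validation is one comparison against a fact you can check cheaply - which is the point.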

The example from Nigeria about checking for legitimate drugs is such an example. There is a large amount of malaria medicine out in the market, and if you had to somehow check every packet out there and validate it in person, you would be looking at a huge task. You know there is a large amount of static identification information on the packets that the counterfeiters are able to copy - you cannot prevent them from copying that. So you add an additional random piece of information, generated by a small process that you own and guard. As long as that process is not compromised, no one else can generate the same information as you. And you can build a huge database of these codes on your end as you ship out the medicines.

So you now have the data, and you have the items on the market. As you amass this huge amount of information with every batch that ships out, how do you use it? How do you get value from it? Because this is such a critical issue, where every individual who buys the medicine is affected, any value derived should be available directly to the person who bought the medicine. In most developing countries, Nigeria included, almost everyone owns a cell phone - the cheap basic Nokia ones if not iPhones - and all cell phones can send text messages. East Africa already has the beautiful example of M-Pesa mobile money transfers, so use the same channel and let the person who purchased the medicine check the code against your database. And problem almost solved!
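A minimal sketch of this flow, assuming Python, with an in-memory set standing in for the guarded server-side database. The code format and the one-time-use rule are my own assumptions for illustration, not details from the BBC article:

```python
import secrets

# Stand-in for the guarded database of issued codes; in a real deployment
# this would live on a server the counterfeiters cannot reach.
issued_codes = set()

def issue_code() -> str:
    """Generate an unpredictable code to print on one drug packet."""
    code = secrets.token_hex(6)  # 12 hex chars: hard to guess, easy to text
    issued_codes.add(code)
    return code

def verify_sms(code: str) -> str:
    """Reply to a texted code: a genuine packet verifies exactly once."""
    if code in issued_codes:
        issued_codes.remove(code)  # one-time use defeats replayed copies
        return "OK: genuine"
    return "WARNING: unknown or already-used code"
```

The counterfeiters can copy everything printed on a genuine packet, but they cannot mint new codes that exist in your database - the secrecy of the generation process, not the packaging, carries the guarantee.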

It's like a master stroke: all this data plus a simple interface can make so much of a difference.

In the real world, the problem or the solution need not be too fancy

When we talk about Big Data solving problems for Google or Amazon, the value of the solution is measured in monetary or growth terms: it needs to help the bottom line, give an edge over a competitor and so on. In the real world, it has to make lives better. And this can be done in very simple ways.

Take for example the Unique Identity project in India, the Aadhar project. Its aim is to fingerprint, iris-scan and generate identification numbers for over a billion people. Compared to other countries that issue national ID cards to everyone, or even to passports in India, this is a staggeringly huge project. Even Facebook does not have that many users - although I must admit Facebook collects a large amount of data, which it needs to build your social graph and so on. From the Aadhar project's point of view, however, the social graph or sending your updates to everyone isn't the goal. The goal is to identify a person uniquely - and that only needs a few parameters to be captured correctly.

When we define a problem this clearly, we know exactly what data to collect, and if there are tools to collect it easily - like mobile fingerprint scanners - then the interface to both collect and validate the data can be built into one small application that can be deployed easily.

There is no need for a fancy website, log scraping, MapReduce processing or anything else that involves a buzzword - all we need is a large team of volunteers or field agents who can go out into the world, take a reading and upload it in real time or as an end-of-day batch to a database. Over a period of time the database will grow, and you can do magic with the data in it without ever using anything fancy. Of course you will have to build a rock-solid database, but that is a problem that has been around for a long time and there are many tried and tested solutions for it.

Data security in the real world

What is the worst that can happen if, out of all the data Facebook stores about you, someone steals a small part? You will lose some privacy, no doubt, but assuming there was no bank password or other critical information in it, the personal damage that can be done to you is limited. At the very least, nothing will happen to you physically, and nothing immediately.

But in the case of real-world projects using large data sets with detailed information about people - information that can be used to access a person's bank accounts or falsify official records - the impact on people's lives will be huge. Or, in the case of the example from Nigeria, if someone ever managed to get into the database and insert validation codes for fake drugs, it could lead to loss of life.

Securing real-world Big Data databases will be critical.

Re-usability, reducing duplication and out-of-sync data

When you are online, if your user profile on Twitter does not match your user profile on Facebook, not much is lost. But with personal data collected for large real-world projects, if the details held by the banks don't match those in, say, the benefits system, you could lose your benefits - it would matter a lot in your real life.

Big Data sets always have a tendency to drift out of sync over a period of time. A simple example: say you move from online updates to batch updates of data. Someone updates his information in two such systems - one applies the update online while the other parks the data for a batch run. If any validation or transaction is attempted while the systems are out of sync, there will be an error. Now imagine that error happening while you are being authenticated at the bank, at an immigration checkpoint or at a security checkpoint.
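The out-of-sync window described above can be simulated in a few lines; the two systems, the person and the field are all hypothetical:

```python
# Two systems hold a copy of a person's address: system A applies updates
# online, system B parks them in a queue for an end-of-day batch run.
system_a = {"alice": "old address"}
system_b = {"alice": "old address"}
batch_queue = []

def update_address(name, address):
    system_a[name] = address             # online: visible immediately
    batch_queue.append((name, address))  # batch: parked until end of day

def run_batch():
    while batch_queue:
        name, address = batch_queue.pop(0)
        system_b[name] = address

update_address("alice", "new address")

# Between the online update and the batch run, the systems disagree -
# any cross-check performed in this window fails.
out_of_sync = system_a["alice"] != system_b["alice"]

run_batch()
in_sync = system_a["alice"] == system_b["alice"]
```

The error is not in either system; it lives entirely in the window between the two update policies.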

Keeping large data sets in sync is not trivial, especially when the data sets are not owned by the same application. So if we apply Big Data principles in the real world, we should strive for a situation where everyone uses a common database - this means that at any point in time, everyone sees the same snapshot of the information. It also means every application can be confident about the data and its format, and can concentrate on how to use the data without having to validate it again.

Conclusion

We have only touched on a few aspects of how we can use Big Data to solve real-world problems. This is not something that came after Facebook - it has been around since long before - but the advantage now is that we have more advanced technologies with which to implement these projects better.