Sunday, February 24, 2013

Connecting the Dots

If you look up at the sky on a moonless night, far away from any city lights, you will see many thousands of individual stars. An asterism is a group of those stars that can be connected together in our minds to form a stick figure. Constellations are ancient asterisms that gained popular names like Virgo or Ursa Major. Other asterisms that just make up a portion of a constellation have also been given popular names like "The Big Dipper" or "Orion's Belt". People who stargaze, whether they find some of these popular asterisms or form their own, are looking for "patterns" among the thousands of stars.

Searching for patterns is also common when we deal with all the data that exists as individual files or database records on our hard drives, flash memory cards, or DVDs. Sometimes these patterns are already established for us. A popular software package may consist of a dozen separate executable files along with their configuration files and documentation. They are often copied into one or two folders or directories during an installation process to keep them together. Sometimes installation programs copy them into common folders like /usr/bin, where they get mixed in with other programs and it is no longer easy to figure out which files belong to which programs.

But even files that seem to be completely independent of other kinds of data (e.g. a photo or a song) can often be grouped together with other files to form ad hoc groups (e.g. a photo album or a music album). We are constantly trying to make connections between different data points to form new and interesting patterns. Facebook and other social media sites provide mechanisms to form some of these patterns. A user posts messages, pictures, documents, videos, and other personal information in order to tell a story about their life, their interests, and their friends. It is the connections between lots of individual pieces of data that can lead to new interactions and help us make decisions.

The current trend in "Big Data" and various forms of analytics is all about finding patterns in large amounts of data to drive business decisions. Analyze a million customer orders to look for patterns of shopping behaviour when it is cold outside in order to figure out what items to put on sale when the next big storm hits. Analyze emails sent by everyone over 65 years old in Florida to figure out what political messages will most likely sway the most voters.

The trick to establishing meaningful patterns among millions or billions of individual data points lies in the ability to quickly analyze each point and determine if it has a significant connection to another point. The system used to store the information is critical to being able to quickly check lots of data points for a certain condition in order to sift the wheat from the chaff. The system must not only be able to match things like strings or numbers, but must also provide some kind of context in order to make more meaningful connections.

For example, if someone wanted to analyze a group of messages to gain intelligence about military hardware, the word "Tank" would be a meaningful keyword to search for. However, such a "brute force" search might also turn up every message that deals with water tanks, gas tanks, and the R&B artist named Tank. The search would be much more meaningful if it were conducted using "Vehicle=Tank" instead.
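To make the difference concrete, here is a minimal sketch in Python. It is not the Didget Management System's actual interface; the message list, the tag names, and the two search functions are all made up for illustration. It only shows why a key/value tag such as "Vehicle=Tank" narrows results in a way a bare keyword cannot.

```python
# Hypothetical sample data: each message has free text plus a set of
# (key, value) tags. None of this reflects a real Didget API.
messages = [
    {"text": "The tank advanced across the field.", "tags": {("Vehicle", "Tank")}},
    {"text": "The water tank on the roof is leaking.", "tags": {("Container", "Tank")}},
    {"text": "Fill the gas tank before the trip.", "tags": {("Container", "Tank")}},
    {"text": "New single from Tank out today.", "tags": {("Artist", "Tank")}},
]


def keyword_search(items, word):
    """Brute-force search: match the word anywhere in the text."""
    word = word.lower()
    return [m for m in items if word in m["text"].lower()]


def tag_search(items, key, value):
    """Contextual search: match only items carrying the exact key=value tag."""
    return [m for m in items if (key, value) in m["tags"]]


brute = keyword_search(messages, "tank")
contextual = tag_search(messages, "Vehicle", "Tank")

print(len(brute), "messages match the keyword")
print(len(contextual), "message matches Vehicle=Tank")
```

The keyword search returns every message that mentions a tank of any kind, while the tag search returns only the one about a vehicle.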

The Didget Management System was designed to not only manage large numbers of data points, but to also aid in making connections between points in order to find new patterns. By attaching many searchable tags to any given piece of data and by providing context for every single tag, the system makes it easy to find all the data that share a common attribute. It can also rank various connections between any two points based on the number of attributes they share in order to give hints about more relevant connections.
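The ranking idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the item names, the tags, and the scoring rule (a plain count of shared key/value tags) are hypothetical and are not taken from the actual system.

```python
from itertools import combinations

# Hypothetical items, each with a set of (key, value) tags.
items = {
    "photo_1": {("Place", "Hawaii"), ("Person", "Mary"), ("Year", "2013")},
    "message_7": {("Place", "Hawaii"), ("Person", "Mary"), ("Topic", "Vacation")},
    "doc_3": {("Person", "Mary"), ("Topic", "Taxes")},
    "song_9": {("Artist", "Tank")},
}


def rank_connections(tagged_items):
    """Score every pair of items by how many tags they share.

    Returns (shared_count, item_a, item_b) tuples, strongest first.
    Pairs with nothing in common are left out.
    """
    scored = []
    for a, b in combinations(sorted(tagged_items), 2):
        common = tagged_items[a] & tagged_items[b]
        if common:
            scored.append((len(common), a, b))
    # Highest count first; ties broken alphabetically for stable output.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored


ranked = rank_connections(items)
for count, a, b in ranked:
    print(f"{a} <-> {b}: {count} shared tag(s)")
```

Comparing every pair like this is fine for a handful of items but grows quadratically, so a real system working on millions of items would need an index rather than a pairwise loop.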

Big Data Analytics is all about finding hidden patterns and unknown correlations in large amounts of data. This means that specialized queries must be conducted against all that data to try to find meaningful patterns. When the data is created and stored, the nature of such queries is largely unknown. In other words, the data must be stored in a way that supports as wide a variety of potential queries as possible.

The speed at which a query can execute is a major factor in finding that "needle in a haystack". If a big data set consists of 10 billion data points and every query takes several hours to complete, then it becomes very hard to conduct lots of different types of queries, looking for a pattern. If, on the other hand, such a query can execute in a minute or less, then it becomes practical to conduct a wide variety of queries hoping that a meaningful pattern just "pops out in front of you".

Several other "big data" projects like Hadoop, MapReduce, HBase, Cassandra, and MongoDB have been structured to be spread across a cluster of nodes so that the processing of data can occur in parallel. This can greatly reduce the time necessary to perform a query. Such systems can be very complex to set up and administer, however. Our system has been designed to greatly simplify such configurations.

But finding patterns should not be exclusive to large companies with big data sets. Individual users could greatly benefit from finding meaningful patterns among a few million pieces of information. If I got a message from Mary about her vacation in Hawaii, it would be helpful if there was an "about" button next to her name that, when pushed, would bring up a list of every message, photo, and document that she had sent me or that was about her. Likewise, it would be helpful if the message itself had hyperlinks in it that, when clicked, would bring up my own photos of Hawaii or information about scuba diving or whale watching. These links could be generated automatically by the system based on tags already present on other Didgets.
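One way such an "about" button and automatic links might work is with a simple index from each tag to the items that carry it. The Python below is a rough sketch under that assumption; the item names, tags, and helper functions are hypothetical and do not describe how Didgets are actually implemented.

```python
from collections import defaultdict

# Hypothetical personal items, each with a set of (key, value) tags.
items = {
    "message_from_mary": {
        ("Person", "Mary"),
        ("Place", "Hawaii"),
        ("Activity", "Scuba diving"),
    },
    "my_hawaii_photo": {("Place", "Hawaii")},
    "mary_birthday_photo": {("Person", "Mary")},
    "scuba_guide": {("Activity", "Scuba diving")},
    "tax_form": {("Topic", "Taxes")},
}


def build_index(tagged_items):
    """Map each tag to the set of item names that carry it."""
    index = defaultdict(set)
    for item_id, tags in tagged_items.items():
        for tag in tags:
            index[tag].add(item_id)
    return index


def related_links(item_id, tagged_items, index):
    """For each tag on an item, list the other items sharing that tag."""
    links = {}
    for tag in tagged_items[item_id]:
        others = sorted(index[tag] - {item_id})
        if others:
            links[tag] = others
    return links


index = build_index(items)

# The "about" button: everything tagged with Person=Mary.
about_mary = sorted(index[("Person", "Mary")])

# Automatic links for the message, one group per tag.
links = related_links("message_from_mary", items, index)

print(about_mary)
for (key, value), targets in sorted(links.items()):
    print(f"{key}={value}: {targets}")
```

Because the index is built once and each lookup only touches the items that share a tag, the links can be produced without scanning every item each time a message is opened.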