Relational databases like structured data: tables of columns and rows in a defined schema, so everyone knows what to expect in every place. Unstructured data, like free text or raw numeric values, is a different story. Hadoop fills a need by using key-value pairs to impose some structure where there is none. To get the two worlds of structured and unstructured data to work together, there is usually a bridge of some sort. The relational database may allow an import of key-value data so it can be incorporated into the relational schema, although usually with some work. This handshaking between relational and key-value databases limps along and is workable, but under heavy loads the networking and performance cost of moving massive quantities of data around can be taxing. Is there a better way?

To make some sense out of unstructured data, some sort of framework needs to be overlaid on the raw data to turn it into something more like information. This is why Hadoop and similar tools are iterative: you're hunting for logic in randomness. You keep looking and trying different things till something looks like a pattern.

Between unstructured and structured data lies the in-between land of semi-structured data. This refers to data that has the beginnings of structure. It doesn't have the formal discipline of the rows and columns of structured data; it is usually schema-less or self-describing, in that it has tags or something similar that provide a starting point for structure. Examples of semi-structured data include emails, XML, and similar entities that are grouped together. Pieces can be missing, or the size and type of attributes may be inconsistent, so it represents an imperfect structure, but not an entirely random one. Hence the in-between land of semi-structured data.

Hadapt takes advantage of this semi-structured data with a data exchange tool to create a structured data construct. They use JSON as the exchange file format. This is rather clever, since JSON is a JavaScript-derived format that is fairly well known. JSON is very good at file exchanges, which helps solve the semi-structured-to-structured problem, and by starting with semi-structured data they get a head start on structure. JSON is particularly well suited to key-value pairs and ordered arrays.

The semi-structured data is parsed as JSON, which creates an array of data that can then be manipulated with SQL commands to complete the cycle. Once a structure is in place, SQL is quite comfortable with further manipulation; after all, it is the structured query language. The most sophisticated tools live in the relational world, hence most efforts to make sense of unstructured or semi-structured data add more structure to allow more analysis and reporting. After a few steps you can indeed get order out of chaos.
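A minimal sketch of this parse-then-query cycle in Python (this is not Hadapt's actual pipeline; the table and field names are illustrative): parse semi-structured JSON records where fields may be missing, load them into a SQL table, then let ordinary SQL take over.

```python
import json
import sqlite3

# Semi-structured input: fields may be missing or inconsistent per record.
raw_records = [
    '{"user": "alice", "event": "login", "duration": 12}',
    '{"user": "bob", "event": "login"}',            # duration missing
    '{"user": "alice", "event": "logout", "duration": 3}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, event TEXT, duration INTEGER)")

for line in raw_records:
    rec = json.loads(line)  # the JSON parsing step
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (rec.get("user"), rec.get("event"), rec.get("duration")),  # missing fields become NULL
    )

# Once structured, SQL completes the cycle.
rows = conn.execute(
    "SELECT user, COUNT(*) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```

Note how the missing `duration` simply becomes a NULL: the imperfect structure of the input survives the trip into the relational world without breaking it.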

Machine learning is widely perceived as having gotten its start with chess. When the skills of the program exceeded the skills of the programmer, the logic went, you had created machine learning: the machine now had capabilities the programmer didn't. Of course, this is something of a fiction. Massive calculation capability alone doesn't really mean profound learning is occurring. It might reveal learning, however.

In the meantime, machine learning has done some really interesting things: driving a car, recognizing faces, identifying spam mail, and even letting robots vacuum the house.

The core problem for machine learning is structure. Machine learning does well in structured environments, but not so well in unstructured ones. Driving a car is a great example: it went poorly until the problems the program couldn't identify, such as subtleties in the driving surface or the significance of certain objects, were defined in a way the program could deal with. The underlying problem is structuring a problem so that the program can analyze and resolve it.

Consider the way people learn versus the way machines learn. A child starts learning in an unstructured way. It takes a long time for a child to learn to speak or walk, and each child learns at their own rate based on their environment and unique genetic makeup. Only once they have an unstructured basis for learning do we add a structured learning environment: school. With machines we provide a structured environment first, and then hope they can learn the subtleties of a complex world.

There is an excellent paper by Pedro Domingos of the University of Washington looking at the growth areas of machine learning. One observation is that when humans make a new discovery, they can create language to describe the new concept. These concepts are also comparable in the human mind: people can see the analogy between situations and apply techniques from one area to another. An example of this kind of human learning is physicists building mathematical models for the financial industry to drive high-speed computer trading.

Structured machine learning models are making progress in solving some interesting problems, like those mentioned previously. The approaches mostly add more layers of complexity to the way problems are analyzed and resolved. Indeed, the future of machine learning lies not in the volume of data, but in the complexity of the issues to be studied. The world is a complex place, and human understanding of it is a combination of literal and intuitive approaches. The intuitive is the ability to reach across domains of knowledge, extend understanding to new areas, and make qualitative judgments.

Because machines are literal by nature, programming has likewise been very literal. Machine learning models are getting far more sophisticated in terms of complexity. Still, the challenge of using a structured tool (machine learning) to tackle an unstructured world may be a problem that will never be entirely resolved till we restructure the machine.

Perhaps we create a machine with a base level of "instinct," enough to let it perceive the world around it, and then let it learn. If we created a machine that spent a few years observing the world and built its own basis for understanding the universe, what kind of intelligence would result? Would we be able to control it? Would it provide any useful service to us?

The Hadoop Summit for 2013 has just concluded in San Jose. There were a few themes that seemed to recur throughout the two-day summit with over 2,500 people. The overall story is the continued progress to take Hadoop out of the experimental and fringe case environment, and move it into the enterprise with all the appropriate controls, function and management. A related objective is to have 50% of the world’s data on Hadoop in five years.

The latest Hadoop release, 2.0, is known as YARN (Yet Another Resource Negotiator). To be a little more precise, Hadoop itself is still below version one, at release 0.23, but MapReduce is now version 2.0, or MRv2. The new MRv2 release addresses some of MapReduce's long-known problems, such as security and scheduling limitations. Hadoop's JobTracker resource-management and job-scheduling functions have been re-engineered to provide more control, with a global resource manager and a per-application "application master." The new YARN APIs are backwards compatible with the previous version after a recompile. Good news, of course. You can get more details of the new Hadoop release at the Apache site, hadoop.apache.org.

The other themes in the Hadoop Summit included in-memory computing, DataLakes, 80% rule, and the role of open source in a commercial product.

Hadoop traditionally runs batch jobs, but enterprise applications demand interactive capability, and Hadoop is moving in that direction. It doesn't stop there, either: the step beyond interactive is stream processing with in-memory computing. In-memory computing is becoming more popular as the cost of memory plummets and people increasingly look for "real-time" response from MapReduce-related products like Hadoop. The leading player in in-memory computing is SAP's HANA, but there are several alternatives. In-memory processing provides blazing speed, but at a higher cost than a traditional paging database that moves data in and out of rotating disk drives. Performance can be enhanced with flash memory, but that may still not be enough. In-memory typically has the best performance, and several vendors at the conference, such as Qubole, Kognitio (which pre-dates Hadoop by quite a bit), and DataTorrent, were touting the benefits of their in-memory solutions. They provide a great performance boost, if that's what your application needs.

DataLakes came up in the kickoff as a place to put your data till you figure out what to do with it. I immediately thought of data warehouses, but this is different. In a data warehouse you will usually need to create a schema and scrub the data before it goes in the warehouse so you can process it more efficiently. The idea of a DataLake is to put the data in, and figure out the schema as you do the processing. A number of people I spoke with are still scratching their heads about the details of how this might work, but the concept has some merit.

The 80% rule, the Pareto Principle, says that 80% of the results come from 20% of the work, be it customers, products or whatever. With regard to Big Data, this is how I view many of the application-specific products. Given the shortage of data scientists, creating products and platforms for people with more general skills delivers 80% of the benefit of Big Data with only 20% of the skills required. I spoke with the folks at Talend, and that is clearly their approach: they have a few application areas with specific solutions aimed at user-analyst skills to address the fat part of the market.

Finally, there remains tension between open source and proprietary products. There are other examples of open source as a mainstream product, with Linux the poster child for the movement, but most open source projects are less mainstream. Commercial companies need to differentiate their products to justify their existence. The push for Hadoop to be the next real success story for open source is pretty exciting. Multiple participants I spoke with saw open source as the best way to innovate: it provides a far wider pool of talent, and has enough rigor to provide a base that other vendors can leverage for proprietary applications. The excitement at Hadoop Summit about moving this platform into the enterprise is audacious, and the promise of open source software seems to be coming true. Sometimes dreams do come true.

In any job, it helps to use the right tool. In the Big Data universe there can be many different kinds of data: structured data in tables; text from email, tweets, Facebook, or other sources; log data from servers; sensor data from scientific equipment. To get answers out of this variety of data, there is a variety of tools.

As always with Big Data, it helps to have the end in mind before you start. This will guide you to the sources of data you need for your desired result, and it will also indicate the proper tool. Consider a continuum with a relational database management system (RDBMS) on one end and a Hadoop/MapReduce engine on the other. RDBMS architectures, like Oracle, have ACID (Atomicity, Consistency, Isolation, Durability), a set of properties that assure database transactions are processed reliably. This is why, for critical data that must be correct and where cost is secondary, the RDBMS is the standard. For example, you want to know what amount should be on the payroll check; it has to be right. On the other end are the MapReduce solutions. Their primary concern is not coherency, like the RDBMS, but parallel processing of massive amounts of data in a cost-effective manner. Fewer assurances are required for this data because of the result desired; this is often the case when looking for trends or trying to find some correlation between events. MapReduce might be the right tool to see if your customer is about to leave you for another vendor.
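The payroll example maps directly onto ACID transactions. A minimal sketch with SQLite (any RDBMS behaves similarly; the account names and amounts are made up): either both sides of a transfer commit, or the whole transaction rolls back and the balances stay correct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("payroll", 1000), ("alice", 0)])
conn.commit()

def pay(conn, amount):
    """Move amount from payroll to alice atomically."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'payroll'",
                (amount,))
            remaining = conn.execute(
                "SELECT balance FROM accounts WHERE name = 'payroll'").fetchone()[0]
            if remaining < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'alice'",
                (amount,))
    except ValueError:
        pass  # rollback already happened; both balances are untouched

pay(conn, 600)   # succeeds: payroll 400, alice 600
pay(conn, 600)   # fails and rolls back: balances unchanged
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 600), ('payroll', 400)]
```

The second, failed payment never produces a half-applied state. That all-or-nothing guarantee is exactly what the payroll check needs, and exactly what MapReduce-style systems trade away for throughput.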

The NoSQL world sits somewhere in between. While the RDBMS offers consistent coherency, the NoSQL world works on eventual consistency. A two-stage commit with logs is one way to get things sorted out eventually, but at any given point in time a user might get data that hasn't been updated yet. This can be adequate for jobs that need faster turnaround than MapReduce but don't justify the expensive infrastructure of a full RDBMS. MapReduce is a batch job, meaning the processing has a definite start and stop to produce results; if MapReduce can't deliver adequate latency, NoSQL provides continuous processing instead of batch processing for lower latency. Another advantage of NoSQL, similar to MapReduce, is scalability. NoSQL provides horizontal scaling up to thousands of nodes: jobs are chopped up, as in MapReduce, and spread among a large number of servers for processing. It might be just the ticket for a Facebook update.
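The horizontal-scaling trick in both NoSQL and MapReduce comes down to the same move: hash the key, pick a node. A toy sketch of key routing (the node names and keys are invented for illustration; real systems use consistent hashing so nodes can join and leave gracefully):

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]

def node_for(key: str) -> str:
    """Route a key to a node by hashing it; the same key always hits the same node."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# One tiny dict per node stands in for a shard of the cluster.
stores = {n: {} for n in NODES}

# A Facebook-style status update only touches the shard that owns the key.
key = "user:42:status"
stores[node_for(key)][key] = "at Hadoop Summit"

# Reads route the same way, so they find the data without asking every node.
print(stores[node_for(key)][key])  # at Hadoop Summit
```

Because each key lands on exactly one node, adding servers spreads the work rather than duplicating it, which is how these systems scale to thousands of nodes.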

One of the downsides of a NoSQL database is the potential for deadlock. A deadlock occurs when two processes are each waiting for the other to finish before proceeding; hence the stare-down. This can happen when the processes update records in a different sequence and the conflict leaves both in a permanent wait state. There are tools to minimize the impact of this potential, and the workarounds might result in someone seeing outdated data. But again, if that is acceptable for the desired result, then NoSQL could be a good fit. Eventually things get sorted out, if properly designed.
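The classic recipe for the deadlock described above is two workers grabbing the same two locks in opposite order. The standard workaround is a global lock ordering: everyone acquires locks in the same sequence, so nobody can hold one lock while waiting on the other. A minimal sketch with Python threads (the "records" here are just two locks; names are illustrative):

```python
import threading

lock_a = threading.Lock()  # guards record A
lock_b = threading.Lock()  # guards record B

results = []

def update_both(tag):
    # Both workers acquire the locks in the SAME global order (A before B).
    # If one worker took B first, each could end up holding one lock
    # while waiting forever for the other: a deadlock.
    with lock_a:
        with lock_b:
            results.append(tag)

t1 = threading.Thread(target=update_both, args=("t1",))
t2 = threading.Thread(target=update_both, args=("t2",))
t1.start(); t2.start()
t1.join(); t2.join()

print(sorted(results))  # ['t1', 't2'] -- both finish, no deadlock
```

Distributed stores can't always impose an ordering across clients, which is why they lean on timeouts, retries, and eventual consistency instead; those are the workarounds that can briefly expose outdated data.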

As you see, understanding the job at hand, the desired result, and what kind of issues are acceptable will determine if RDBMS, NoSQL or a MapReduce solution will fit. NoSQL options are growing all the time, which might indicate that this middle ground is finding more suitable jobs.

With increasing connectedness of devices and people, the data just keeps coming. What to do with all that data is becoming an increasing problem, or opportunity if you have the right mindset. In general there are three things that can be done with this flood of data:

Any combination of the above might make sense depending on the intent of the project, the amount and kinds of data, and of course, your budget. I find it interesting that the traditional RDBMS still has legs: continually falling memory prices have made in-memory processing a "not crazy" alternative. Of course it comes back to what you want to do with what kind and amount of data. For instance, a relational database for satellite data may not make sense, even if you could do it.

Here's where the file system can become very interesting. It might be ironic that unstructured data must be organized before it can be analyzed, but I think of it as farming: you cultivate what you have to get what you want. Ideally, the file system will provide a structure for the analysis that will follow. There doesn't seem to be a shortage of file systems out there, but because the flood of unstructured data is relatively recent, there might be even better file systems on the way.

There are a number of file systems available: local, remote, shared, distributed, parallel, high-performance computing, network, object, archiving and security-oriented being some examples, and their structures can be very different. For the flood of unstructured data, parallel file systems seem to offer a way to organize this data for analytics. In many cases the individual record is of little value; the value in most unstructured data streams is in the aggregate. Users are commonly looking for trends or anomalies within a massive amount of data.

For an application with massive amounts of new data, traditionally structured file systems for static data (like data warehouses) might not be able to grow as needed, since the warehouse typically takes a point-in-time view. Traditional unstructured static data like medical imaging might be appropriate depending on the application, but most analytics can't do much with images. Dynamic data has its own challenges: unstructured dynamic data like CAD drawings or MS Office files (text, etc.) may lend itself to a different file structure than dynamic structured data like CRM and ERP systems, where you are looking for a specific answer from the data.

Dealing with massive amounts of new data may call for a parallel, rather than linear, approach to keep up with the traffic. Parallel file systems started life in the scientific high-performance computing (HPC) world. IBM created a parallel file system in the 1990s called GPFS, but it was proprietary. The Network File System (NFS) brought a distributed file system to the masses, making it easier to share files within a shared namespace; Sun created NFS and made it available to everyone, and it was widely adopted and enhanced. NFS has some I/O bandwidth issues, which companies like Panasas and the open-source Lustre project have tried to address. I/O bandwidth remains the primary reason to consider a parallel file system. If you have a flood of data, it's probably still the best way to deal with it.

I expect to see more parallel and object file systems providing better tools than what is available today to manage the massive data flooding into our data centers. Sampling will be used less and less, since the cost of storage continues to fall and some of the most interesting data points are outliers. "Long tail" analysis, finding situations where the rules seem to change when events become extreme, can be very valuable. It may require analyzing all the data, since sampling may not catch "long tail" events that occur infrequently.
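A quick numeric illustration of why sampling struggles with the long tail (the dataset and thresholds are made up for illustration): with only a handful of extreme events among a hundred thousand records, a 1% sample will usually contain none of them, while a full scan always finds all three.

```python
import random

random.seed(7)  # fixed seed so the illustration is repeatable

N = 100_000
data = [random.gauss(0, 1) for _ in range(N)]  # "normal" records

# Plant three extreme long-tail events at arbitrary positions.
for i in (10, 5_000, 90_000):
    data[i] = 50.0

# Full scan: every outlier beyond the threshold is found.
full_outliers = [x for x in data if abs(x) > 10]

# A 1% sample: each planted outlier has only a 1% chance of being drawn.
sample = random.sample(data, 1_000)
sample_outliers = [x for x in sample if abs(x) > 10]

print(f"full scan found {len(full_outliers)}, sample found {len(sample_outliers)}")
```

The full scan is guaranteed to report all three events; the sample almost always reports zero. Falling storage costs make the full scan affordable, which is the point.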

In summary, managing the flood of data is a question of identifying what you want to get from the data. That combined with the nature of the data will guide you to an appropriate file system. In most cases a parallel file system will be the solution, but you have to know your application. The good news is as our sophistication grows, we will have more options to fine tune the systems we build to analyze the data we have to get the results we want.

Oren Etzioni of the University of Washington gave a talk at Adobe in March with a rundown on the current state of the art in IE. We'll get to that in a minute, but what is IE? Information Extraction is the science of making sense of unstructured human text. The challenge is that human language can be imprecise. Structured data is so named because of the systematic categorization of the data into tables in a way that optimizes its analysis; unstructured data, as in human speech, lends itself to neither tables nor structure. In analyzing human language, it is common to employ natural language processing to build a system that derives useful information from it. This may not be possible in politics, but perhaps in business it could work.

Why is this useful? Today's technology allows us to ask "What is the best Mexican restaurant in San Jose?" and get a ranking by star ratings that users have input. IE allows us to ask "Where can I get the best margarita in San Jose?" and get a ranking by comments about margaritas. To get a ranking based on attributes that weren't defined in advance, queries require a more advanced understanding of what is being said in reviews, not just the star ranking.

How do you analyze unstructured text? The key to answering attribute-based questions is context. Information extraction is machine learning: algorithms attempt to determine what is relevant, and scalability combined with good algorithms is the key to generating useful results. The IE model identifies a tuple and the probability of a relationship. An example might be trying to find out who invented the light bulb and getting a result such as invented(Edison, light bulb), 0.99, indicating a strong link between Edison and inventing the light bulb.

If one is looking for examples of some attribute, they often occur in context with other terms, which we might consider clues. One can then use those clues to find more instances of the attribute. This is how we pick apart context in more detail.
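A toy version of this clue-driven extraction (real IE systems learn their patterns rather than hard-coding them, and the corpus here is invented): use the textual pattern "X invented the Y" as the clue, pull out (inventor, invention) tuples, and score each tuple by how often it recurs, echoing the invented(Edison, light bulb), 0.99 style of result.

```python
import re
from collections import Counter

corpus = [
    "Edison invented the light bulb.",
    "Reports agree that Edison invented the light bulb.",
    "Swan invented the incandescent lamp.",
]

# The "clue": a '<Name> invented the <thing>' pattern.
pattern = re.compile(r"(\w+) invented the ([\w ]+?)\.")

counts = Counter()
for sentence in corpus:
    for inventor, invention in pattern.findall(sentence):
        counts[(inventor, invention)] += 1

# Score each tuple by its share of all extractions (a crude confidence).
total = sum(counts.values())
for (inventor, invention), n in counts.most_common():
    print(f"invented({inventor}, {invention})  confidence={n / total:.2f}")
# invented(Edison, light bulb)  confidence=0.67
# invented(Swan, incandescent lamp)  confidence=0.33
```

The repetition across sources is what turns raw matches into confidence: the more independently a tuple recurs, the stronger the claimed relationship.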

The challenge is extracting the information we're looking for and ignoring the rest. One of the more interesting applications is a shopping tool, Decide.com, that checks different websites for rumors about new product introductions. It goes even further, estimating when a new product might arrive based on the company's previous history, what will happen to pricing, and what kind of features are being talked about. It compiles the rumors from multiple sources into a summary, saving time, and the results can be displayed on a mobile device.

Dr. Etzioni's pet project is Open IE. His premise is that word relations have canonical structure, and by looking at this structure you can extract the relationships for analysis. You don't generally pre-identify the concepts; you want to be able to find interesting stuff. His extractors find these relationships. You can play with his model on the web at openie.cs.washington.edu or download the Open IE extractors without license fees.

There are some early IE efforts out there today: Google Knowledge Graph and Facebook Graph Search are a couple, and Oren is part of a startup, Decide.com, that is also in the space. All of them are at relatively early stages; as the algorithms improve, the usefulness of the results will improve. People want an answer, not a bunch of results to sort through. This becomes more important as we increasingly look for results on our mobile devices, which forces a more succinct response, like talking to a person and getting an answer. Oren did mention that Siri, which can respond to a query with an answer, is very limited. He wants to use all available documents, tweets, reviews, posts, blogs and everything he can get his hands on to formulate an answer.

Check out his extractor on the website mentioned above for more information, and a free test drive.

In answer to the question, IE is almost ready for prime time. There are promising signs for this technology, but I wouldn’t bet my house on it yet.

FAST13 USENIX Conference on File and Storage Technologies February 12–15, 2013 in San Jose, CA

If you’re not familiar with the geekfest called USENIX and their file and storage technology conference, it is a very scholarly affair. Papers are submitted on a variety of file and storage topics, and the top picks present their findings to the audience. The major component and system vendors are there along with a wide variety of academic and national labs.

Let’s review a paper about using SSDs in high performance computing where there are a large number of nodes. See the reference at the end for details regarding the paper.*

The issue is how to manage two jobs on one data set. The example in the paper is a two-step process in the high-end computing world. Complex simulations are being done on supercomputers then the results are moved to separate systems where the data is subject to analytics. Analytics are typically done on smaller systems in a batch mode. The high-end computing (HEC) systems that do the simulations are extremely expensive, and keeping them fully utilized is important. This creates a variety of issues in the data center that include the movement of data between the supercomputer and the storage farm, analytic performance and the power required for these operations. The approach proposed is called “Active Flash”.

The floating-point operations performed on the HEC systems are designed for the simulation, not the typical analytic workload, which results in the data being moved to another platform for analytics. The growth in the data (now moving to exabytes) is projected to increase costs so much that just moving the data will cost about as much as the analytic processing itself. In addition, extrapolating current power costs to future systems indicates that power will become the primary design constraint on new systems: the authors expect compute demands to grow 1000X in the next decade while the available power envelope grows only 10X. Clearly, something must be done.

The authors have created a Flash Translation Layer (FTL) with data-analysis functions on the OpenSSD platform to prove their theories about an Active Flash configuration, reducing both the energy and performance problems of analytics in an HEC environment. Their 18,000-compute-node configuration produces 30TB of data each hour. On-the-fly data analytics are performed in the staging area, avoiding the performance and energy costs of data migration. By staging area, we are talking about the controllers in the SSDs.

High-performance computing (HPC) workloads tend to be bursty, alternating I/O-intensive and compute-intensive activity. It's common for a short I/O burst to be followed by a longer computational period; these loads are not evenly split, and I/O is usually less than 5% of overall activity. The nature of the workload creates an opportunity for the SSD controller to do some analytics, and as SSD controllers move to multi-core, that opportunity grows while the simulations are active.

The model to identify which nodes are suitable for SSDs combines capacity, performance, and write-endurance characteristics. The energy profile is similarly modeled to predict the energy cost and savings of different configurations. The authors' experimental models were tested in different configurations. The Active Flash version extends the traditional FTL layer with analytic functions, enabled with an out-of-band command. The result is elegant, and it outperforms both the offline-analytic and the dedicated-analytic-node approaches.

The details and formulas are in the referenced paper, and are beyond my humble blog. But for those thinking of SSDs for Big Data, it appears the next market is to enhance the SSD controller for an Active Flash approach to analytics.