Ninety percent or more of the data produced by enterprise-class businesses and by organizations in the public and not-for-profit sectors is entirely structured. But content published on web pages is unstructured text and, therefore, a difficult challenge. So should these consumers think differently about how they process this information? According to an article by Kevin Fogarty published on Computerworld on January 30, 2015, titled "Test shows big data text analysis inconsistent, inaccurate," they should.

The point of Fogarty’s article is to expose the inaccuracy of a key component of most “modern” text-analysis tools: a modeling technique called Latent Dirichlet allocation (LDA). Fogarty writes that LDA has recently been shown to be highly inaccurate, at least according to research by Luis Amaral, a physicist at Northwestern University.
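For context on the inconsistency claim: LDA infers topics from word co-occurrence, and because fitting typically starts from a random initialization, two runs over the same corpus can assign documents to different topics. A minimal sketch of that behavior using scikit-learn (assuming it is installed; the corpus and parameters here are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus: two finance docs, two biology docs.
docs = [
    "stocks bonds markets trading",
    "markets trading stocks equities",
    "genes protein cell biology",
    "cell biology protein dna",
]

X = CountVectorizer().fit_transform(docs)

# Two runs differing only in the random seed: LDA's stochastic
# initialization means the recovered topics can come out differently,
# which is the kind of run-to-run inconsistency Amaral's group measured.
for seed in (0, 1):
    lda = LatentDirichletAllocation(n_components=2, random_state=seed)
    doc_topics = lda.fit_transform(X)
    print(seed, doc_topics.argmax(axis=1))
```

On a corpus this small the topic labels themselves are arbitrary; the point is only that nothing in the algorithm guarantees the same partition twice without pinning the seed.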

Fogarty quotes Amaral admonishing ISVs that offer text analytics tools to enterprise consumers to come clean, before money changes hands, about just how useful (or useless) their tools may prove to be, rather than leaving the purchasing group ultimately dissatisfied.

But, at another level, are enterprise consumers already thinking differently about text data? Microsoft SharePoint and SharePoint Online both offer the Managed Metadata Service (MMS), with its term store and taxonomy support. OpenText and Microsoft’s own circle of “Managed Partners” (ISVs who work closely with Microsoft to fill in the blanks on high-value solutions for enterprise consumers) have already come to market with complete solutions for cultivating useful data from content published on SharePoint sites. These platforms are ubiquitous among enterprise consumers, with perhaps as much as 80% of Fortune 500 businesses supporting an instance of one or the other.

If the points Fogarty presents in his article prove to be true, then it should not be much of a stretch for stakeholders in a serious effort to mine high-value business intelligence (BI) from web sites and social media to settle on Microsoft’s solutions as the best available choice.

ISVs looking to challenge Microsoft in this space may want to think seriously about providing a cloud PaaS offering. After all, if these ISVs are already succeeding at this game, why shouldn’t consumers do better by simply “hitching a ride” on their platforms? As Fogarty points out, DIY isn’t cutting it. At least not yet.

A number of tech markets, including enterprise computing, cloud (SaaS, PaaS, IaaS), and IoT, have demonstrated a voracious appetite for data management and analysis. Anyone following data management technology may get lost in the notion of “big data”.

I say lost because an enormous amount of hype has built up around the theme of “big data.” But plenty of long-standing data management methods, such as relational database management systems (RDBMS) with a columnar architecture built to provide structure to data, work very well for ostensibly enormous amounts of information. Readers may want to consider efforts like the Port Authority of New York and New Jersey and the toll road system it manages. How many millions of vehicle transactions occur each month? In turn, how many billions of bits of data does the history of vehicle transactions through toll machines represent? Has this enormous amount of data proven to be unmanageable?

The answers to each of these questions support an argument for RDBMS and Structured Query Language (SQL) as a useful method of working with enormous amounts of data. These questions and answers echo across a very wide range of applications; for example, the purview of the U.S. National Weather Service, or the universe of drugs managed by the U.S. Food and Drug Administration.
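As a toy illustration of the point about SQL and transaction volumes, a plain relational query handles this kind of aggregation routinely. The sketch below uses Python's built-in sqlite3; the toll-transaction schema and rows are entirely invented for illustration, not drawn from any real Port Authority system:

```python
import sqlite3

# Hypothetical toll-transaction table; schema and data are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE toll_tx (
    plaza TEXT, vehicle_class INTEGER, amount_cents INTEGER, ts TEXT)""")
conn.executemany(
    "INSERT INTO toll_tx VALUES (?, ?, ?, ?)",
    [("GWB", 1, 1500, "2015-01-03"),
     ("GWB", 2, 3000, "2015-01-03"),
     ("Lincoln", 1, 1500, "2015-01-04")],
)

# Standard SQL aggregation: transaction counts and revenue per plaza.
# The same GROUP BY scales to millions of rows on a production RDBMS.
rows = conn.execute("""
    SELECT plaza, COUNT(*) AS tx_count, SUM(amount_cents) AS revenue_cents
    FROM toll_tx
    GROUP BY plaza
    ORDER BY plaza
""").fetchall()
print(rows)  # [('GWB', 2, 4500), ('Lincoln', 1, 1500)]
```

Nothing here is exotic; that is the argument: structured transaction histories, however large, sit squarely inside what RDBMS technology was built to do.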

So there is nothing inherently radical about the notion of “big data”, at least if the notion is correctly understood as merely the set of methods commonly in use to manage data. In fact, the notion of big data is NOT correctly understood as I’ve just presented it, and this is where, in my opinion, commentator hyperbole has clouded the question of just what is changing, in a truly radical way, about data management methods. The “big” piece of “big data” appears to have been meant to represent a scalable data management architecture, best typified by Apache Hadoop. Anyone reading the presentation on the Hadoop web site can’t help but understand the role of clusters of servers in Hadoop as a solution. Clusters of servers, in turn, provide a perfect rationale for the Apache project to provide the foundation for Hadoop.
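The cluster model Hadoop popularized is MapReduce: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. A single-process sketch of that flow, using the classic word-count example (Hadoop itself distributes these same phases across cluster nodes, which is where the scalability comes from):

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in a document.
    for word in doc.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group emitted values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "data clusters everywhere"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'clusters': 2, 'everywhere': 1}
```

Because each map call and each reduce group is independent, the work partitions naturally across many servers, which is the rationale for clusters noted above.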

A couple of points are worth noting about the Salesforce.com press release announcing Wave:

GE Capital is mentioned as already using Wave. Given GE’s own recent PR campaign around its data and analytics effort, one must wonder why the business finance component of the company opted not to use the home-grown solution ostensibly available to it.

The Wave announcement follows, by less than a month, IBM’s announcement of a freemium offer for Watson Analytics and Oracle’s announcement of its Analytics Cloud. Both of those offers are delivered via a cloud SaaS model. So it’s likely safe to say enterprise technology consumers have demonstrated a significant appetite for analytics. The decision by Salesforce.com, IBM, and Oracle to deliver their solutions via cloud SaaS offers speaks to the new enterprise computing topology (a heterogeneous computing environment) and the need to look to browsers as the ideal thin clients for users working with their data online.

An ample supply of structured and unstructured data is likely motivating these enterprise tech consumers to look for methods of producing the kind of dashboards and graphs each of these analytics offers is capable of producing. With data collection methods advancing, particularly for big data (unstructured data), this appetite doesn’t look likely to abate anytime soon.

ISVs with solutions already available, principally Microsoft with its suite of Power tools for Excel (Power BI, Power Pivot, etc.), may also be participating in this “feeding frenzy”. It will be interesting to see how each of the ISVs with offers for this market fares over the next few business quarters.

Of the 42 members of Hadoop’s Project Management Committee, 8 are directly affiliated with Cloudera®, and another with Intel®. Patrick Hunt, an engineer at Cloudera, appears to have played a key role in the development of a keyword search feature for Hadoop, which is not a trivial achievement for a platform like Hadoop that is designed around unstructured data. Intel has an investment in Cloudera. Therefore, Intel should benefit as more organizations choose to proceed with unstructured data, with Hadoop as its repository.

Some prominent online businesses, including Amazon, eBay, Facebook, Twitter, and Spotify, have made major commitments to Hadoop.

Readers may want to review “Who uses Hadoop?” to familiarize themselves with the size of an average Hadoop implementation. Of course, very large repositories of data like these require a lot of CPU resources for processing. As the leading manufacturer of server CPUs, Intel benefits from all of this need for computing power, regardless of whether an organization implementing Hadoop runs it on Apple OS X, Ubuntu, or another Linux flavor. The recommended hardware for each of these is Intel.

The tools Cloudera offers for managing Hadoop data repositories are designed to give enterprise businesses familiar features and procedures. Since most of these enterprise data centers are already full of Intel hardware, Cloudera can be seen, perhaps, as another means by which Intel maintains its position in those same installations.

What bearing does all of the above have on discussions about large data centers, the need for better power management, and the likelihood that hardware OEMs building solutions on the ARM architecture will capture substantial share? Given the importance of Hadoop to the leading cloud IaaS vendor, Amazon, as well as to Microsoft Azure, it doesn’t appear likely that server cores running the ARM architecture will become the standard in these environments any time soon.

Further, Intel is certainly not standing still; it is working very actively to produce more power-efficient hardware in very small form factors. One can argue that Microsoft’s Surface Pro 3, powered by an Intel Core i3, i5, or even i7, is a tangible example of how much progress the company has made toward satisfying consumer appetite for power-thrifty, extremely thin computing devices.