Week's Worth of Koby Quick Hits: June 18 - 22, 2012

Here are the quick-hit ponderings that I posted on the IBM Netezza Facebook page this past week. I went deeper on the themes of experience optimization, Hadoop, big data crowdsourcing, and DW appliance proofs of concept. And I opened up a fresh topic: big data's optimal deployment role. I also attended and spoke at the Hadoop Summit last week, which stimulated many thoughts you'll find reflected in this week's quick hits:

Subjective experience resists easy quantification, hence any attempts at crisp "optimization." Also, when you're trying to optimize the experience that someone else is having, you have no direct access to the quality of that experience. Hence, you have no hard-and-fast "measurements" upon which valid optimization depends.

Online experience optimization is a black art that fancies itself a branch of software engineering. When you start to decompose online experiences into their most granular elements – page designs, impressions, mouse-overs, clicks, clickthroughs, latencies, broken links, error messages, and the like – you risk succumbing to pointy-headed analysis. It's as if you're trying to measure the pleasure someone derives from a daydream by tracking the low-level synaptic firings of which that fantasy, ostensibly, is composed. What appears fiendishly complex at the physical level might correspond to the simplest, sweetest bliss a human can achieve.

Another complication is the fact that the quality of online experience is shaped largely in the expectation and the retrospect, not just in the passing moment. Memory often distorts experience in the direction, positive or negative, suggested by other factors (such as peer influence, advertiser suggestion, and cultural familiarity). Reliably measuring something as slippery as quality of experience can be tough, because the yardstick warps like pliable plastic over time.

What felt unnatural at first, like the Web felt to most people in its early days, is now second nature to most. If, like most of humanity, you've already incorporated this crazy, complex, evolving Internet experience into your comfort zone, you've largely tuned out the artifice. At the end of the day (or the online session), you simply remember whether it allowed you to get your personal tasks accomplished (check email, make travel reservation, listen to music, etc.) in the customary manner. If the artifice wasn't too far out of the norm, you forget it as soon as you get what you came for.

So if you achieve your desired outcome without muss or fuss, your experience is optimal – to you. Clearly, the low-level clickstreams and other events can indicate whether online experiences are achieving their outcomes. But none of those low-level events is, by itself, an experience metric.

Yes, you should build graph models of online sessions in order to gauge whether they are contributing to realization of outcomes in the typical customer "journey." But all the graph models and smart online design in your bag of tricks won't make the customer's day if the "analog" relationship falls flat.

Never underestimate the experience-boosting power of an order delivered on time and undamaged to the customer's physical doorstep.
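
For readers who want to picture the session graph models mentioned above, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sample sessions, the page names, and the choice of "checkout" as the outcome marker are all hypothetical, and real clickstream modeling would be far richer. It counts page-to-page transitions and, for each transition, the share of sessions containing it that went on to reach the outcome.

```python
from collections import Counter

# Hypothetical clickstream: each session is an ordered list of page names.
sessions = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "cart"],
    ["home", "search", "search", "product"],
    ["home", "product", "cart", "checkout"],
]
OUTCOME = "checkout"  # assumed marker of a realized outcome


def build_session_graph(sessions, outcome):
    """Count sessions per page-to-page edge, and how many of them reached the outcome."""
    edge_total = Counter()
    edge_converted = Counter()
    for path in sessions:
        converted = outcome in path
        # Count each distinct edge once per session so repeated clicks don't dominate.
        for edge in set(zip(path, path[1:])):
            edge_total[edge] += 1
            if converted:
                edge_converted[edge] += 1
    return edge_total, edge_converted


def edge_outcome_rates(edge_total, edge_converted):
    """Share of sessions traversing each edge that reached the outcome."""
    return {edge: edge_converted[edge] / n for edge, n in edge_total.items()}


edge_total, edge_converted = build_session_graph(sessions, OUTCOME)
rates = edge_outcome_rates(edge_total, edge_converted)

for edge, rate in sorted(rates.items()):
    print(f"{edge[0]} -> {edge[1]}: {rate:.2f} of {edge_total[edge]} sessions")
```

Note that the rates describe clicks, not feelings: they hint at whether journeys end in outcomes, which is the most such a model can claim.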

My chief takeaway from this past week's Hadoop Summit is that this technology is on the cusp of mass adoption. As opposed to last year's summit, this year's broadened the focus from Hadoop's sweet spots – content analytics, predictive modeling, and data science – into the more mainstream territory of business intelligence (BI) and data warehousing (DW).

Let me start by plugging my own presentation at the event. Representing IBM, I discussed the marriage of Hadoop and DW. We are seeing this as a key trend among our customers and in the industry at large.

For starters, companies are adopting Hadoop for strategic roles in their current DW architectures, such as extract-transform-load (ETL), data staging, and preprocessing of unstructured content. They are beginning to evaluate and adopt Hadoop for next-generation BI applications in interactive visualization and exploration of multistructured data sets. With those trends underway, I discussed the implications for evolution of data discovery, staging, and governance in the next-generation DW that has Hadoop in its core DNA.

It wasn't just IBM saying all of this: the growing adoption of Hadoop for BI and DW was a key theme of the event, discussed both in the keynotes (by tech visionary Geoffrey Moore and Sears CTO Philip Shelley) and in other track sessions (most noteworthy being Ovum analyst Tony Baer's discussion of unstructured data stewardship in Hadoop).

Many of the other vendors' announcements at this show further exemplified this trend. The Hadoop solution-provider ecosystem continues to expand and differentiate to offer the full range of tools that will, in the next several years, bring the vision of a full-fledged "Hadoop DW" to mature fruition.

Stay tuned to IBM for further deepening of our discussion along these lines.

June 20: Crowdsourcing big data creativity? The social side of data science

Most of us segment the big data development community into two broad camps: the "data scientists" (aka statistical analysts) and the "domain specialists" (aka subject-matter experts). Though it makes sense to do so – statistical analysts and business analysts are often different people with different degrees, skills, tools, and perspectives – the distinction will probably wither away as this decade rolls on.

Big data is a business initiative in most companies, so the domain specialists, who know the subject matter of the problems being addressed, will increasingly have the upper hand. To achieve their objectives, the domain specialists have to become full-fledged data scientists in their own right in order to manage these projects. The people who've been pure data scientists will need to become domain specialists in their own right in order to advance their careers.

In addition to bringing data science expertise into their own skills repertoire, big data domain specialists will need to engage more closely with domain-knowledgeable people inside and outside their own organizations. This is because the more complex the business problem – such as a complex materials-management or customer-handling issue – the less likely that a closed circle of domain experts has all the answers.

Social networking, both internally and externally, will be essential for funneling of all this practical human intelligence into the development, vetting, and validation of your big data models.

Most of the people from whom you crowdsource this domain knowledge won't qualify either as data scientists or subject-matter experts, in the strictest sense. That's a non-issue. Their inputs will help you break through the "fourth wall" of creativity that's necessary to address the most difficult big data business imperatives.

June 21: Proofs of concept as core appliance acquisition approach? TCO tough to demonstrate, but possible

Speeds and feeds are the obvious focus of most data warehousing (DW) appliance proofs of concept (POCs), and for obvious reasons. They're the sorts of metrics – fast loads, fast queries, and so forth – that IBM Netezza and other vendors engineer our appliances to maximize.

But DW appliance total cost of ownership (TCO) is far trickier to demonstrate in a POC context. Sure, we can sell you a DW appliance at a competitive price and the like, but that's almost the least of it. TCO is a life-cycle metric that depends only partly on how we engineer the appliances and their integrated tooling. We can't engineer your internal practices for DW deployment, optimization, and administration. But, in a POC-focused procurement cycle, we can show you how our appliances can make all of those functions far simpler, more automated, and productive. We can help you size your appliance to avoid overprovisioning capacity until you truly need it, and to make the most efficient use of the capacity you've acquired.

And we can provide you with considerable guidance, training, and courseware so that you can incorporate TCO-reducing best practices into your operations. In a POC, IBM Netezza can give you all the information, demonstrations, and guidance you will need to prove that we are the lowest lifecycle-TCO option for your needs now.

What's the best approach for doing big data? I hear this question all the time, and, quite frankly, I have no hard-and-fast answer for it. I hate to say "it depends," but, in fact, it does.

Quite often, people mean it in the "Hadoop vs. enterprise data warehouse (EDW)" sense of database platform. Clearly, these approaches address the same core big data use cases, with a growing range of applications, such as adaptive machine learning, sentiment monitoring, and social graph analysis, favoring Hadoop. And let's not forget the vast range of NoSQL, stream computing, graph databases, in-memory, and other innovative approaches for addressing various and sundry shades of the 3 Vs.

Increasingly, people mean it in the "software vs. appliance vs. cloud vs. virtualization" sense as well – i.e., the deployment platform or form factor for big data. This debate has been fiercest in the EDW arena, which is a more mature space that has a larger number of commercial form factors in all of these areas. One indicator of Hadoop's maturation is the growing availability of commercial deployment options in each of these form factors. Anybody who attended the Hadoop Summit last week could see that for themselves.

So what's the best deployment approach for big data? Complicating the answer further is the combinatorial explosion of hybrid approaches (e.g., a bit of HDFS, a bit of HBase, a bit of Cassandra, a bit of MongoDB, a bit of RDBMS EDW, a bit of in-memory columnar, a bit of appliance, a bit of software on commodity hardware, a smidgen of cloud, etc.).

The answer depends in part on the complexity of your business requirements and technical environment, including the legacy investments with which it must all interoperate. It also depends on your tolerance for risk, complexity, and headaches. Organizations are hybridizing any and all of these technologies, and vendors such as IBM continue to innovate. There are few established best practices for harmonizing all these various approaches with each other, much less with legacy databases and other infrastructure.

Finally, the answer depends on your willingness to be a first-mover in exploring new approaches for big data.

So what's optimal? You, the pioneers, must tell us. What lessons have you learned in the painstaking task of trying to integrate myriad big data approaches in your environment?

At the end of the week, I'm happy to have summer upon us. I will take a brief vacation around July 4. But there's so much to do, and so much is going on in the big data industry, that I'll be tempted to check work email while relaxing. Let's see whether I can restrain myself.