All posts by Tony Baer

For the few of you who’ve traveled the timeline, you’ll know that we’ve been blogging since well before the term entered the general vocabulary. Our very first post covered the passing of the torch from Bill Gates to Steve Ballmer at Microsoft. A couple of posts later, we celebrated the fact that the Y2K bug did not bring the downfall of Western Civilization, and barely a year and a half after that, we briefly posed the question of whether 9/11 would change conceptions about disaster recovery (from the IT perspective, that is).

What a long strange trip it’s been. With this post, we are shifting to a bigger stage, courtesy of friend and colleague Andrew Brust, who’s kept me honest more than a few times over the years (and typically over some choice Belgian brews) about the reality of databases outside the ivory tower of the analyst world. As of today, we’re now trying to achieve what we originally hoped to do in those early days of the century: scrounge up some excuse of original thought on a weekly basis. We are joining distinguished company, as ZDNet in its wisdom is expanding the Big on Data blog; along with Andrew, we’ll be teaming with George Anadiotis. With Big Data analytics beginning to penetrate the Early Majority, ZDNet’s plan to boost coverage is a visionary move, and we’re gratified to be part of the conversation.

As of this week, we’ll use this space to post links to our ZDNet posts, which should come a lot more frequently than you’ve seen from us for a while. And so, here’s the link to our debut post, on what MongoDB should be when it grows up. Thanks for taking the trip with us; we’re looking forward to having you in our new, scaled-up quarters.

Hadoop has come a long way in its first decade. Doug Cutting, who along with Mike Cafarella cofounded the Hadoop project as the offshoot of the Apache Nutch web crawler (search) project, has written an elegant post from someone who wasn’t just an eyewitness, but who created history.

Cutting’s main point is that the open source model has become the new default for much of commercial software development. With the Linux project having laid the groundwork, open source today has become the accepted, and in many cases, expected development and delivery model for infrastructure-based software. Open source works as long as you don’t go up toward the application level, where IP becomes much too specific to encourage a lowest common denominator approach to value creation. Hadoop has validated that model – the platform has become the preferred destination for big data analytics, and increasingly, the data lake.

For enterprises, the most important attribute of open source is not the access to source code (most IT organizations are not about to spare resources for that), but access to technology that should forestall vendor lock-in.

But reality is not quite so black and white. While open source projects clear the way for cross-vendor support, open source technology itself is no guarantee of portability. And when you peer beneath the surface, the question is whether open source has changed the dynamics of competition.

Compare notes with relational databases, which emerged during the zenith of the closed-source model for enterprise software. Forget for a moment what’s different; look at what’s similar. The most obvious is market consolidation: a decade ago, the enterprise database market was defined around Oracle, Microsoft, IBM, and for data warehousing, Teradata. For Hadoop, the landscape has winnowed down to four flavors: Amazon EMR, Cloudera, Hortonworks (and its ODPi partners), and MapR.

Dig down another level: both enterprise databases and Hadoop developed around core standards: the ANSI SQL query language for relational databases, and HDFS (storage) and MapReduce (compute) for Hadoop. But in both cases, those core pieces served as launching points for products that in the commercial market got differentiated.

For SQL relational databases, the Oracles, Sybases, Microsofts, and IBMs distinguished themselves with different storage engines, computational approaches (e.g., stored procedures), tuning, security, and administrative tools. In some cases, enterprise applications were written to run natively on them.

For Hadoop, the original foundational components – HDFS and MapReduce – are no longer sacrosanct. The operational definition of Hadoop is being defined around APIs to components that are becoming mix and match. There remain core components – HBase, Hive, Pig, and YARN – that are supported across the board (HDFS being the notable exception), although different vendors support different versions. But beyond that, there is a beehive (pun intended) of overlapping, competing open source and proprietary projects for platform management, security/access control, data governance, interactive SQL, and streaming; you can find a good blow-by-blow description here. Portability? That’s defined by APIs to storage and compute (remember them?).

So while you might be able to migrate from one Hadoop vendor’s implementation of Hive or Spark to another’s, that won’t be the case with those overlapping projects, open source or not. Welcome back, vendor stack.

The question for the Hadoop market, as it was for the relational database folks a decade ago, is whether to close the book.

The database market in 2006 at first glance appeared a done deal, as vendor consolidation had long passed the inflection point (that happened roughly a decade earlier). But on the margins, developers of simple web-based database applications did not require the sophistication or expense of enterprise relational data platforms. And with the emergence of Linux as a viable alternative (which had received significant backing from IBM several years before), the scene was set for the LAMP stack: Linux, the Apache web server, MySQL, and one of the “P” languages: Perl, Python, or PHP.

While web developers sparked demand for open source at the low end, Internet companies sparked similar needs at the high end, for handling torrents of flexibly-structured data for analytics and real-time processing. That begat Hadoop and the wave of NoSQL databases (e.g., MongoDB, Cassandra, Couchbase, and others) that set the stage for today’s increasingly multi-polar data platform landscape.

In all this, Hadoop’s future is increasingly tied with that of the data lake – the default place for ingest and storage of raw data, and resting place for actively archived aging data. The data lake is priming Hadoop for growth if platform providers can adequately address ease of implementation and usage issues.

Does demand for the data lake, and the need for platform providers that are enterprise-ready, necessarily narrow the Hadoop field for good? Packaging and integration of core (not state-of-the-art) components will likely commoditize to the point where we’ll see the emergence of low-cost upstarts in emerging world regions. But it’s the broader ecosystem for higher value-add, where Hadoop players will have to play and compete head-on, that will be the future of this market. At one end, demand for Spark-based analytics in the cloud without the “overhead” of Hadoop is already materializing, but that obscures the real picture: competition for security, performance optimization and cost management, information lifecycle management, data governance, and query “ownership” – all the things that Spark standalone lacks and that the Hadoop ecosystem is still building. As inheritors of the data lake, Hadoop vendors will want to own these functions. But what’s to keep these functions from being virtualized away from Hadoop? Or from other data platforms? Or other data platforms from owning them?

The tip of the iceberg will come with ownership of the query because, in most organizations, the data lake will be only one of many sources of the truth, and it will be more economical to push down processing of analytics to where the data resides, not the other way around. Hadoop players want to own this, but so do incumbent database providers as they extend their spheres of governance, not to mention the analytic and integration tool providers as well. And by the way, all that competition is a good thing, even if the path gets a bit messy in the bargain.

There is little question that open source as a model for commercial software development has become the new norm. The key concept here is development – open source opens the floodgates to development and visibility. The commercial model for supporting open source has become widely accepted among enterprise IT organizations. But as to the competitive dynamics of a commercial market, competing players will inevitably have to differentiate, and the result is likely to be unique blends of open source projects with or without proprietary content (in most cases, with). And so it shouldn’t be surprising that, when examining the competitive dynamics, the Hadoop market is following in many of the same footsteps as its relational brethren of a generation ago.

In April or May, we’ll see Spark 2.0. The direction addresses gap filling, performance enhancement, and refactoring to nip API sprawl in the bud.

Rewinding the tape, in 2015 the Spark project added new entry points beyond Resilient Distributed Datasets (RDDs). We saw DataFrames, a schema-based data API that borrowed from constructs familiar to Python and R developers. Besides opening Spark to SQL developers (who could write analytics to run against database-like tabular representations) and BI tools, the DataFrame API also leveled the playing field between Scala (the native language of Spark) and R, Python, Java, and Clojure via a common API. But DataFrames are not as fast or efficient as RDDs, so recently, Datasets were introduced to provide the best of both worlds: the efficiency of Spark data objects, with the ability to surface them as schema.

The Spark 2.0 release will consolidate the DataFrame and Dataset APIs into one; DataFrame becomes, in effect, the row-level construct of Dataset. Together, they will be positioned as the default interchange format and richer API of Spark, with more semantics than the low-level RDD.
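To make the distinction concrete, here is a minimal sketch in plain Python (deliberately not the PySpark API – the names and helper below are illustrative only) of why a schema-carrying row construct gives an engine more to work with than an opaque RDD-style object:

```python
# An "RDD" is just a collection of opaque records; the engine cannot see
# inside them, so every operation is a black-box function call.
rdd_like = [("alice", 34), ("bob", 29), ("carol", 41)]
ages = [rec[1] for rec in rdd_like]   # nothing checks that rec[1] means 'age'

# A "DataFrame" carries a schema, so operations can be expressed against
# named columns and validated before any data is touched.
schema = ("name", "age")
df_like = [dict(zip(schema, rec)) for rec in rdd_like]

def select(rows, *cols):
    # Column names are checked up front -- the kind of analysis a
    # schema-aware optimizer can exploit to plan a query.
    for c in cols:
        assert c in schema, f"unknown column: {c}"
    return [{c: row[c] for c in cols} for row in rows]

over_30 = [r for r in select(df_like, "name", "age") if r["age"] > 30]
print(over_30)   # [{'name': 'alice', 'age': 34}, {'name': 'carol', 'age': 41}]
```

Because the schema is known up front, a real engine can validate and optimize the query before execution – the leverage the DataFrame/Dataset API gives Spark’s optimizer over raw RDDs.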

If you want ease of use, go with Datasets, but if feeds and speeds are the goal, that’s where RDDs fit in. And that’s where the next enhancement comes in. Spark 2.0 adds the first tweaks to the recently released Tungsten (adding code generation), which aims to replace the 20-year-old JVM memory management model with a more efficient mechanism for managing CPU and memory. That’s a key strategy for juicing Spark performance, and maybe one that will make Dataset performance good enough. The backdrop to this is that with in-memory processing and faster networks (10 GbE is becoming commonplace), the CPU has become the bottleneck. By eliminating the overhead of JVM garbage collection, Tungsten hopes to even the score with storage and network performance.

The final highlight of Spark 2.0 is Structured Streaming, which will extend Spark SQL and DataFrames (which in turn are becoming part of Dataset) with a streaming API. That will allow streaming and interactive steps, which formerly had to be orchestrated with separate programs, to run as one. And it makes streaming analytics richer; instead of running basic filtering or count actions, you will be able to run more complex queries and transforms. The initial release in 2.0 will support ETL, but future releases will extend querying.
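The core idea can be sketched in a few lines of plain Python (a conceptual illustration, not the actual Spark API): treat the unbounded stream as a continuously growing table, so the same aggregate is updated by arriving data and can be queried interactively at any time.

```python
from collections import Counter

class StreamingCount:
    """Incrementally maintained 'SELECT key, COUNT(*) ... GROUP BY key'."""
    def __init__(self):
        self.result = Counter()

    def ingest(self, batch):
        # Each arriving micro-batch updates the result table in place --
        # no separate batch program needed to recompute from scratch.
        self.result.update(batch)

    def query(self):
        # The same aggregate can be inspected interactively at any time.
        return dict(self.result)

q = StreamingCount()
q.ingest(["page_view", "click", "page_view"])   # first batch arrives
q.ingest(["click", "page_view"])                # stream keeps growing
print(q.query())   # {'page_view': 3, 'click': 2}
```

The streaming and interactive steps share one piece of logic – the unification Structured Streaming is after.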

Beyond the 2.0 generation, Spark Streaming will finally get – catch this – streaming. Spark Streaming has been a misnomer, as it is really Spark microbatching. By contrast, rival open source streaming engines such as Storm and Flink give you the choice of streaming (processing exactly one event at a time) or microbatch. In the future, Spark Streaming will give you that choice as well, because sometimes you want pure streaming, where you need to resolve down to a single event, while other use cases will be better suited to microbatch, where you can do more complex processes such as data aggregations and joins. And one other thing: Spark Streaming has never been known for low latency; at best it can resolve batches of events in seconds rather than subseconds. When paired with Tungsten memory management, that should hopefully change.
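A stdlib Python sketch of the trade-off (illustrative only, not either engine’s API): per-event processing keeps latency to a single event, while micro-batching buffers events so batch-style operations such as aggregations become natural.

```python
events = [3, 1, 4, 1, 5, 9, 2, 6]

# Pure streaming: the handler fires once per event; latency is one event
# deep -- here, a simple threshold alert on each arriving value.
alerts = [e for e in events if e > 4]

# Micro-batching: buffer N events, then run a batch-style computation --
# here a per-batch sum, something awkward to express one event at a time.
def microbatch(stream, size):
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

batch_sums = [sum(b) for b in microbatch(events, 4)]
print(alerts, batch_sums)   # [5, 9, 6] [9, 22]
```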

Spark 2.0 walks a tightrope between adding functionality and consolidating APIs while trying not to break them. For now, it leaves open the question of all the housekeeping that will be necessary if running Spark standalone. If it’s in the cloud, the cloud service provider should offer the perimeter security, but for now more fine-grained access control will have to be implemented in the application or storage layers. There are some pieces – such as managing the lifecycle of Spark compute artifacts like RDDs or DataFrames – that may be the domain of third-party value-added tools. And if – as seems likely – Spark establishes itself as the successor to MapReduce for the bulk of complex Big Data analytics workloads, the challenge will be drawing the line between what innovations belong on the Apache side (while preventing fragmentation) and what sits better with third parties. We began that discussion in our last post. Later this year, we expect this discussion to hit the forefront.

This is the first of two pieces summarizing our takeaways from the recent Spark Summit East.

Given the 1000+ contributors to the Apache Spark project, it shouldn’t be surprising that development is pacing in dog years. Last year, Spark exploded as the emerging fact of life for bringing Fast Data velocity to Big Data, courtesy of a critical mass of commercial endorsements underscored by IBM’s bear hug in mid-year. The Spark practitioner community has been highly successful speaking to itself – Spark would not have become the most active and fastest-ramping Apache project were it not for grassroots interest that’s translated to action, and it would not have piqued IBM’s attention were it just a small clique of developers.

But with momentum on Spark and related projects (almost two hundred projects using Spark, at last count), it’s time to deal with the reality of taking Spark to the enterprise. For practitioners, we’ve been harping on the need to explain the benefits of Spark-based analytics in business terms. Those benefits can be summarized in two words: Smart Analytics. Machine learning can provide the assist for sifting through torrents of data and helping the business ask the right questions.

The corollary is that the Spark engine, and the management infrastructure for running it, has to become ready for prime time. It’s time to industrialize the running of Spark. That will grow even more critical, not just as data analysts and data scientists write programs, but as commercial software tools and applications embed Spark.

If you are implementing standalone – we’ve already weighed in on that – you’re going to have to reinvent all the measures associated with running a data processing platform, like security, workload management, and systems management.

But regardless of whether you run Spark standalone or under a data platform or cloud service with its own management and security infrastructure, what to do about the plumbing of running Spark operations on an ongoing basis, serving many masters? At Spark Summit East last week, a team from Bloomberg gave a glimpse of what organizations will encounter: building a registry of RDDs and DataFrames so that runtimes would not have to be recreated from scratch each time analysts want to tackle specific problems. The Bloomberg folks had to create this registry – which is also meant to store valuable lineage metadata on the provenance of the data or real-time stream – because there is nothing off-the-shelf to manage frequent RDD use yet. Our take is that within the year, you’ll see ISVs introducing solutions to manage your Spark compute artifacts – not just RDDs or DataFrames, but also Datasets and the new constructs for Structured Streaming feeds. And tools and applications that embed Spark will similarly have to manage these artifacts.
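We have no visibility into Bloomberg’s actual implementation, but the general shape of such a registry might look something like this sketch (all names below are invented for illustration): entries carry lineage links so provenance can be reconstructed rather than rebuilt from scratch.

```python
from dataclasses import dataclass, field

@dataclass
class ArtifactEntry:
    name: str
    kind: str                                      # e.g., "RDD", "DataFrame"
    source: str                                    # provenance of the data
    parents: list = field(default_factory=list)    # lineage chain

class ArtifactRegistry:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def lineage(self, name):
        # Walk parent links to reconstruct the full provenance chain.
        entry = self._entries[name]
        chain = [entry.name]
        for parent in entry.parents:
            chain.extend(self.lineage(parent))
        return chain

reg = ArtifactRegistry()
reg.register(ArtifactEntry("trades_raw", "RDD", "tick feed"))
reg.register(ArtifactEntry("trades_daily", "DataFrame", "derived",
                           parents=["trades_raw"]))
print(reg.lineage("trades_daily"))   # ['trades_daily', 'trades_raw']
```

An analyst can look up an existing artifact and its provenance instead of rebuilding it – the reuse the Bloomberg team described.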

Bloomberg’s RDD registry is just the tip of the iceberg. As you industrialize Spark, there will be issues relating managing which Spark workloads get priority, which ones get first dibs on memory, and optimizing workloads to fit into available memory. These are issues, not for the core Spark project, but for the ISV community to develop solutions.

As we noted a couple years back, data is getting bigger and fast data is getting faster because of the steadily declining cost of infrastructure. And nowhere has that been more apparent than with in-memory and Flash storage. For instance, when SAP HANA yanked the in-memory database from its formerly specialized niche, IBM, Oracle, and Teradata subsequently one-upped it with in-memory columnar add-ons to their core platforms. And in the NoSQL world, where Aerospike debuted with its Flash-based operational database, use of in-memory and Flash storage today is no longer unusual. And while in-memory processing is not the only advantage of the Spark compute engine, the Apache project would not have caught on at the wildfire pace it has were memory still cost-prohibitive.

But the not-so-subtle problem with all this is that silicon-based storage is not simply a faster version of disk. Disk is optimized for retrieval – especially in cases where you have different temperatures of data and want to tier, stripe, or shard it so that the most frequently accessed data is on the most accessible (typically outer) part of the spindle. Contrast that with Flash, where you want to minimize writes (which are inefficient) and rewrites/updates (which can shorten Flash lifespan), or with memory, where it’s all about lining up similar data types and functions so you can in effect operate it as a form of pipeline and the chip operates at an even pace.

That’s why, as we stated a couple years back, the architecture of storage impacts the architecture of the database and the application(s) running on it.

Now, compound the issue with CPU: there are storage and processing ramifications here. There is chip level cache that gives you an even faster form of storage on-board for highly volatile processes, and then the compute itself, where pipelining techniques can cram multiple actions into a single compute cycle. These factors were not that critical when systems were disk-based, but when you start to level the playing field between storage and CPU performance, the slightest perturbation can add serious speed bumps that could defeat the whole purpose of going to Flash or in-memory storage.

So the attention paid to Spark is a reflection of the importance given to speed in processing Big Data. When you can run what-if scenarios and trend analytics in seconds or minutes instead of hours (or days), machine learning becomes useful. But just as Spark, Flink, and the increasingly endless array of interactive SQL, streaming, and graph engines are emerging, each of them has to solve the problem of literally lining up its data for in-memory processing. This is a thankless task and one that offers zilch added value. Having interfaces for converting data for in-memory instantiation is, in effect, running in place.

That’s where a new project, Apache Arrow, comes in. It’s led by the CTO and co-founder of Dremio (who came from MapR), a stealth startup that will be doing something with Apache Drill; the team has taken a momentary detour to build a standard interface and protocol for marshaling data and, literally, lining it up in an easily consumable columnar format for processing. So all the 4-byte integers are processed together, all the join operations are processed together, and so on. It’s built to exploit the SIMD data parallelism engineered into Intel Xeon processors.

Significantly, this project has the potential of going viral in a similar, but quieter, way than Spark. It’s backed by a who’s who list of over 20 Apache committers from Dremio, MapR, Cloudera, Hortonworks, Salesforce, DataStax, Twitter, and AWS. Coming out of the gate, it will be supported by Spark, Storm, Drill, Impala, Pig, Phoenix, Hive, Cassandra, Pandas, Parquet, HBase, and Kudu. And the project, just coming out of stealth today, is not even bothering with incubation: the Apache community has already ratified it as a new top-level project.

Arrow will provide a standard for columnar in-memory processing and for interchange of data in an in-memory representation that can be shared by multiple engines (e.g., Spark and Impala) residing on the same node. That means each database or compute engine does not have to have its own dedicated slice of memory – in-memory columnar storage can now be a common pool, meaning that users can get more use out of the same in-memory footprint, reducing infrastructure costs. There will be implementations for C, C++, Python, and Java, with more languages to come.
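A stdlib Python sketch (not the Arrow API itself) of why the columnar layout matters: the same data in row form versus one contiguous, fixed-width buffer per column, which is the shape vectorized, SIMD-friendly engines can scan at an even pace.

```python
from array import array

rows = [("alice", 34), ("bob", 29), ("carol", 41)]   # row-oriented records

# Columnar: one contiguous, typed buffer per column. The ages end up
# packed as fixed-width signed ints (4 bytes each on common platforms),
# which is the layout vectorized processing wants.
names = [r[0] for r in rows]
ages = array("i", (r[1] for r in rows))

# A column scan (e.g., an aggregate) touches only the ages buffer,
# never the names -- the core win of columnar layouts.
print(sum(ages), len(ages))   # 104 3
```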

The project will succeed as long as it keeps its aspirations contained – open source projects get more widely adopted as long as they don’t usurp the unique IP of commercial products. And so, while the Arrow project may develop some sample functions (e.g., joins) for manipulating data within the column, we suggest that they not grow overly ambitious.

If you follow Big Data, you’d have to be living under a rock to have missed the Spark juggernaut. The extensive use of in-memory processing has helped machine learning go mainstream, because the speed of processing enables the system to quickly detect patterns and provide actionable intelligence. It’s surfaced in data prep/data curation tools, where the system helps you get an idea of what’s in your big data and how it fits together, and in a new breed of predictive analytics tools that are now, thanks to machine learning, starting to become prescriptive. Yup, Cloudera brought Spark to our attention a couple years back as the eventual successor to MapReduce, but it was the endorsement of IBM, backed by a commitment of 3,500 developers and a $300 million investment in tool and technology development, that plants the beachhead for Spark computing to pass from early adopter to enterprise. We believe that will mostly be through tools that embed Spark under the covers. It’s not game over for Spark; there persist issues of scalability and security, but there’s little question it’s here to stay.

We also saw continued overlap and convergence in the tectonic plates of databases. Hadoop became more SQL like, and if you didn’t think there were enough SQL-on-Hadoop frameworks, this year we got two more from MapR and Teradata. It underscored our belief that there will be as many flavors of SQL on Hadoop as there are in the enterprise database market.

And while we’re on the topic of overlap, there’s the unmistakable trend of NoSQL databases adding SQL faces: Couchbase’s N1QL, Cassandra/DataStax’s CQL, and most recently, the SQL extensions for MongoDB. It reflects the reality that, while NoSQL databases emerged to serve operational roles, there is a need to add some lightweight analytics on them – not to replace data warehouses or Hadoop, but to add some inline analytics as you are handling live customer sessions. Also pertinent to overlap is the morphing of MongoDB, which has been the poster child for the lightweight, developer-friendly database. Like Hadoop, MongoDB is no longer known for its storage engine, but for its developer tooling and APIs. With the 3.0 release, the storage engines became pluggable (the same path trod by MySQL a decade earlier). With the just-announced 3.2 version, the write-friendlier WiredTiger replaces the original MMAP as the default storage engine (meaning you can still use MMAP if you override the factory settings).

A year ago, we expected streaming, machine learning, and search to become the fastest growing Big Data analytic use cases; turns out that machine learning was the hands-down winner last year, but we’ve also seen quite an upsurge of interest in streaming thanks to a perfect storm-like convergence of IoT and mobile data use cases (which epitomize real time) with technology opportunity (open source has lowered barriers for developers, enterprises, and vendors alike, while commodity scale-out architecture provides the economical scaling to handle torrents of real-time data). Open source is not necessarily replacing proprietary technology; proprietary products offer the polish (e.g., ease of use, data integration, application management, and security) that are either lacking from open source products or require manual integration. But open source has injected new energy into a field that formerly was more of a complex solution looking for a problem.
So what’s up in 2016?
A lot… but three trends pop out at us.

1. Appliances and cloud drive the next wave of Hadoop adoption.
Hadoop has been too darn hard to implement. Even with the deployment and management tools offered with packaged commercial distributions, implementation remains developer-centric and best undertaken with teams experienced with DevOps-style continuous integration. The difficulty of implementation was not a show-stopper for early adopters (e.g., Internet firms who invent their own technology, digital media and adtech firms who thrive on advanced technology, and capital markets firms who compete on being bleeding edge), or early enterprise adopters (innovators from the Global 2000). But it will be for the next wave, who lack the depth or sophistication of IT skills/resources of the trailblazers.

The wake-up call came when we heard that Oracle’s Big Data Appliance, which barely registered on the map during its first couple of years of existence, encountered a significant upsurge in sales among the company’s European client base. Considered in conjunction with continued healthy growth in Amazon’s cloud adoption, it dawned on us that the next wave of Hadoop adoption will be driven by simpler paths: either via appliance or cloud. This is not to say that packaged Hadoop offerings won’t further automate deployment, but the cloud and appliances are the straightest paths to a more black-box experience.

2. Machine learning becomes a fact of life with analytics tools. And more narratives, fewer dashboards.
Already a checklist item with data preparation, we expect the same to happen with analytics tools this year. Until now, the skills threshold for taking advantage of machine learning has been steep. There are numerous techniques to choose from; first you identify whether you already know what type of outcome you’re looking for, then you choose among approaches such as linear regression models, decision trees, random forests, clustering, anomaly detection, and so on to solve your problem. It takes a statistical programmer to make that choice. Then you have to write the algorithm, or you can use tools that prepackage those algorithms for you, such as those from H2O or Skytree. The big nut to crack will be how to apply these algorithms and interpret them.
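As a concrete instance of the simplest technique named above – the case where you already know what outcome you’re predicting – here is a one-variable linear regression fit in closed form with stdlib Python (toy data invented for illustration):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (one predictor variable)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Toy data generated from a known relationship, y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(round(a, 6), round(b, 6))   # 2.0 1.0
```

Clustering or anomaly detection, by contrast, would be the tools of choice when you don’t know in advance what outcome you’re looking for.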

But we expect to see more of these models packaged under the hood. We’ve seen some cool tools this past year, like Adatao, that combine natural language query for business end users with an underlying development environment for R and Python programmers. We’re seeing tooling that puts all this more inside the black box, combining natural language querying with the ability to recognize signals in the data, guide the user on what to query, and automatically construct narratives or storyboards, as opposed to abstract dashboards. Machine learning plays a foundational role in generating such guided experiences. We’ve seen varying bits and pieces of these capabilities in offerings such as IBM Watson Analytics, Oracle Big Data Discovery, and Amazon QuickSight – and in the coming year, we expect to see more.

3. Data Lake enters the agenda
The Data Lake, the stuff of debate over the past few years, starts becoming reality with early enterprise adopters. The definition of the data lake is in the eye of the beholder – we view it as the governed repository that acts as the default ingest point and repository for raw data, and the resting place for aged data that is retained online for active archiving. It’s typically not the first use case for Hadoop, and shouldn’t be, because you shouldn’t build a repository until you know how to use the underlying platform and, for the data lake, how to work with big data. But as the early wave of enterprise adopters grows comfortable with Hadoop in production serving more than a single organization, planning for the data lake is a logical follow-on step. It’s not that we’ll see full adoption in 2016 – Rome wasn’t built in a day. But we’ll start seeing more scrutiny on data management, building on the rudimentary data lineage capabilities currently available with Hadoop platforms (e.g., Cloudera Navigator, Apache Atlas) and that are part of data wrangling tools. Data lake governance is a work in progress; there is much white space to be filled in with lifecycle management/data tiering, data retention, data protection, and cost/performance optimization.

There’s been lots of debate over whether the data scientist position is the sexiest job of the 21st century. Despite the unicorn hype, spending a day with data scientists at the Wrangle conference, an event staged by Cloudera, was a surprisingly earthy experience. It wasn’t an event chock full of algorithms; instead, it was about the trials and tribulations of making data science work in a business. The issues were surprisingly mundane. And by the way, the brains in the room spoke perfectly understandable English.

It starts with questions as elementary as finding the data, and enough of it, to learn something meaningful. Or defining your base assumptions; a data scientist with a financial payments processor found definitions of fraud were not as black and white as she (or anybody) would have expected. And assuming you’ve found those data sets and established some baseline truths, there are the usual growing pains of scaling infrastructure and analytics. What might compute well in a 10-node cluster might have issues when you scale to many times that. Significantly, the hiccups could be logical as well as physical: if your computations have any interdependencies, surprises can emerge as the threads multiply.

But let’s get down to brass tacks. Like why run a complex algorithm when a simple one will do? For instance, when a flyer tweets about bad service, it’s far more effective for the airline to simply respond to the tweet asking the customer to provide their booking number (via private message) rather than resort to elaborate graph analytics to establish the customer’s identity. And don’t just show data for the sake of it; there’s a good reason why Google Maps GPS simply shows colored lines to highlight best routes rather than dashboards at each intersection showing what percentage of drivers turned left or went straight. When formulating queries or hypotheses, look outside your peer group to see if they make sense through other peoples’ eyes.

Data scientists face many of the same issues as developers at large. One of the speakers admitted resorting to Python scripts rather than heavier-weight frameworks like Storm or Kafka; the question in retrospect is how well those scripts are documented for future reference. Another spoke of the pain of scaling up infrastructure not designed for sophisticated analytics; in this case, a system built with Ruby scripting (not exactly well suited for statistical programming) on a Mongo database (not well suited for analytics), and taking Band-Aid approaches (e.g., replicating the database nightly to a Hadoop cluster) before finally biting the bullet and rewriting the code to eliminate the need for wasteful data transfers. Another spoke of the difficulty of debugging machine learning algorithms that get too complex for their own good.

There are moral questions as well. Clare Corthell, who heads her own ML consulting firm, made an impassioned plea for data scientists to root out bias in their algorithms. Of course, the idea of any human viewing or querying data objectively is a literal impossibility; we’re all human, and we see things through our own mental lenses. In essence, that means factoring in human biases even in the most seemingly objective computational problems. For instance, the algorithms for online dating sites should correct for skews, such as Asian men tending to rate African American women more negatively than average; and loan approvals based on ‘objective’ metrics such as income, assets, and zip code can in effect perpetuate the same redlining practices that fair lending laws were supposed to prohibit.
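One simple way to start rooting out that kind of bias is to measure it. Here’s a minimal sketch of a disparate-impact check on approval decisions; the data, group labels, and the use of the common “four-fifths rule” threshold are all illustrative, not drawn from any real lender:

```python
# Hypothetical sketch: a disparate-impact check on approval decisions.
# Groups "A" and "B" and the sample data are invented for illustration.

def approval_rates(decisions):
    """Compute per-group approval rates from (group, approved) pairs."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}

def disparate_impact(rates, protected, reference):
    """Ratio of the protected group's approval rate to the reference group's.
    The common 'four-fifths rule' flags ratios below 0.8."""
    return rates[protected] / rates[reference]

decisions = [
    ("A", True), ("A", True), ("A", True), ("A", False),    # 75% approved
    ("B", True), ("B", False), ("B", False), ("B", False),  # 25% approved
]
rates = approval_rates(decisions)
ratio = disparate_impact(rates, protected="B", reference="A")
print(round(ratio, 2))  # 0.25 / 0.75 ≈ 0.33, well below the 0.8 threshold
```

A check like this doesn’t explain where the skew comes from, but it at least makes the bias visible enough to investigate.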

Data science may be a much-hyped profession; supply is far dwarfed by demand. We’ve long believed that there will always be a need for data scientists, but also that, for the large mass of enterprises, applications will start embedding data science. That’s already happening, thanks to machine learning providing a system assist to humans in BI tools and data prep/data wrangling tools. But at the end of the day, as much as they might be considered unicorns, data scientists face very familiar issues.

A year ago, Turing Award winner Dr. Michael Stonebraker made the point that, when you try managing more than a handful of data sets, manual approaches run out of gas and the machine must step in to help. He was referring to the task of cataloging data sets, in the context of capabilities performed by his latest startup, Tamr. If your typical data warehouse or data mart involves three or four data sources, it’s possible to get your head around the idiosyncrasies of each data set and how to integrate them for analytics.

But push that number to dozens, if not hundreds or thousands, of data sets, and any human brain is going to hit the wall — maybe literally. That’s where machine learning first made big data navigable, not just to data scientists, but to business users. Introduced by Paxata, and since followed by a long tail of startups and household names, data wrangling tools applied machine learning to help users prepare data through a new kind of iterative process. Since then, analytic tools such as IBM’s Watson Analytics have employed machine learning to help end users perform predictive analytics.

Walking the floor of last week’s Strata + Hadoop World in New York, we saw machine learning powering “emergent” approaches to building data warehouses. Infoworks monitors what data end users are targeting in their queries by taking a change data capture-like approach to monitoring logs; but instead of just tracking changes (which is useful for data lineage), it deduces the data model and builds OLAP cubes. Alation, another startup, uses a similar approach, crawling data sets to build catalogs with Google-like PageRanks showing which tables and queries are the most popular. That’s supplemented with a collaboration environment where people add context, and a natural language query capability that browses the catalog.
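The “Google-like PageRank” idea is straightforward to sketch: treat tables as nodes, treat joins observed in the query log as links, and let heavily referenced tables bubble up. The table names, join graph, and plain power-iteration implementation below are all invented for illustration; they don’t reflect how any particular vendor actually does it:

```python
# Illustrative sketch: ranking tables by popularity, PageRank-style,
# from a (made-up) graph of joins observed in query logs.

def pagerank(graph, damping=0.85, iters=50):
    """Plain power-iteration PageRank over a directed graph {node: [targets]}."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    links = {n: graph.get(n, []) for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in links.items():
            # Dangling nodes spread their rank evenly across the graph.
            share_targets = targets if targets else list(nodes)
            for t in share_targets:
                new[t] += damping * rank[n] / len(share_targets)
        rank = new
    return rank

# Edges point from each query's driving table to the tables it joins against.
join_graph = {
    "orders":    ["customers", "products"],
    "shipments": ["orders"],
    "returns":   ["orders", "customers"],
}
ranks = pagerank(join_graph)
# Heavily joined tables such as "orders" and "customers" bubble up the ranking.
```

The appeal of the approach is that it needs no manual curation: the query log itself votes on which data sets matter.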

Just as machine learning is transforming data transformation to help business users navigate their way through big data, it’s also starting to provide the intelligence that will help them become more effective with exploratory analytics. Over the past couple of years, interactive SQL was the most competitive battleground among Hadoop providers — enabling established BI tools to treat Hadoop as simply a larger data warehouse — but machine learning will become essential to making users productive with exploratory analytics on big data.

What makes machine learning possible within an interactive experience is the emerging Spark compute engine. Spark is turning Hadoop from a Big Data platform into a Fast Data one. By now, every commercial Hadoop distro includes a Spark implementation, although which Spark engines (e.g., SQL, Streaming, Machine Learning, and Graph) are included still varies by vendor. A few months back, IBM declared it would invest $300 million and dedicate 3500 developers to Spark machine learning development, followed by Cloudera’s announcement of a One Platform initiative to plug Spark’s gaps.

And so our curiosity was piqued by Netflix’s Strata session on running Spark at petabyte scale. Among Spark’s weaknesses is that it hasn’t consistently scaled beyond a thousand nodes, and it is not known for high concurrency. Netflix’s data warehouse currently tops out at 20 petabytes and serves roughly 350 users (we presume technically savvy data scientists and data engineers). Spark is still in its infancy at Netflix; while workloads are growing, they are not yet at a level that would merit a dedicated cluster (Netflix runs its computing in the Amazon cloud, on S3 storage). Much of the Spark workload is streaming, run under YARN. And that leads to a number of issues showing that at high scale and high concurrency, Spark is a work in progress.

A few of the issues Netflix is tackling to scale Spark include adding caching steps to accelerate the loading of large data sets. Related to that is reducing the latency of retrieving the large sets of metadata (“list calls”) often associated with large data sets; Netflix is working on an optimization that would apply to Amazon’s S3. Another scaling issue relates to file scanning (Spark normally scans all Hive tables when a query is first run); Netflix has designed a workaround to push down predicate processing so queries only scan the relevant tables.
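Netflix’s workaround is internal, but the general idea behind predicate pushdown is easy to sketch: instead of reading everything and filtering afterward, consult partition metadata up front and skip reads that can’t match. The Hive-style partition layout, data, and predicate below are all invented for illustration:

```python
# Toy illustration of predicate pushdown via partition pruning. A "table"
# partitioned by date, Hive-style; the layout and data are made up.
partitions = {
    "ds=2015-09-01": [{"ds": "2015-09-01", "views": 10}],
    "ds=2015-09-02": [{"ds": "2015-09-02", "views": 7}],
    "ds=2015-09-03": [{"ds": "2015-09-03", "views": 12}],
}

def scan_all(table, row_filter):
    """Naive plan: read every partition, then filter rows afterward."""
    rows = [r for part in table.values() for r in part]
    return [r for r in rows if row_filter(r)], len(table)  # (rows, partitions read)

def scan_pruned(table, wanted_ds, row_filter):
    """Pushed-down plan: consult partition names and read only the matches."""
    to_read = [name for name in table if name == "ds=" + wanted_ds]
    rows = [r for name in to_read for r in table[name]]
    return [r for r in rows if row_filter(r)], len(to_read)

keep = lambda r: r["ds"] == "2015-09-02" and r["views"] > 5
rows_a, reads_a = scan_all(partitions, keep)
rows_b, reads_b = scan_pruned(partitions, "2015-09-02", keep)
# Same answer either way, but the pruned plan reads 1 partition instead of 3.
```

At petabyte scale, the partitions skipped are the whole ballgame: the query result is identical, but the I/O avoided is enormous.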

For most business users, the issue of Spark scaling won’t be relevant, as their queries won’t routinely involve multiple petabytes of data. But for Spark to fulfill its promise of supplanting MapReduce for iterative, complex, data-intensive workloads, scale will prove an essential hurdle. We have little doubt that the sizable Spark community will rise to the task. But the future won’t necessarily be all Spark all the time. Keep your eye out for the Apex streaming project; it has drawn some key principals known for backing Storm.

That’s one of the headlines of a newly released Databricks survey that you should definitely check out. Because Spark only requires a JVM to run, there’s been plenty of debate on whether you really need to run it on Hadoop, or whether Spark will displace Hadoop altogether. Technically, the answer is no, you don’t need Hadoop: all Spark requires is a JVM installed on the cluster, or a lightweight cluster manager like Apache Mesos. It’s the familiar argument about why bother with the overhead of installing and running a general-purpose platform if you only have a single-purpose workload.

Actually, there are reasons, if security or data governance are necessary, but hold that thought.

According to the Databricks survey, which polled nearly 1500 respondents online over the summer, nearly half are running Spark standalone, with another 40% running it under YARN (Hadoop) and 11% on Mesos. That points to a strong push for dedicated deployment.

But let’s take a closer look at the numbers. About half the respondents are also running Spark on a public cloud. Admittedly, running in the cloud does not automatically equate with standalone deployment. But there’s more than coincidence in the numbers, given that popular cloud-based Spark services from Databricks, and more recently Amazon and Google, are (or will be) running in dedicated environments.

And consider what stage we’re at with Spark adoption. Commercial support is barely a couple of years old, and cloud PaaS offerings are newer still. The top sectors using Spark are the classic early adopters of Big Data analytics (and, ironically in this case, Hadoop): software, web, and mobile technology/solutions providers. So the question is whether the trend will continue as Spark adoption breaks into mainstream IT, and as Spark is embedded into commercial analytic tools and data management/data wrangling tools (as it already is).

This is not to say that running Spark standalone will become just a footnote to history. If you’re experimenting with new analytic workloads – like testing another clustering or machine learning algorithm – dedicated sandboxes are great places to run those proofs of concept. And there have long been good business and technology cases for running specific types of workloads on the right infrastructure: if you’re running a compute-intensive workload, for instance, you’ll probably want servers or clusters that are compute- rather than storage-heavy, and if you’re running real-time, operational analytics, you’ll want hardware that has heavily bulked up on memory.

Hardware providers like Teradata, Oracle, and IBM have long offered workload-optimized machines, while cloud providers like Amazon offer arrays of different compute and storage instances that clients can choose for deployment. There’s no reason why Spark should be any different – and that’s why there’s an expanding marketplace of PaaS providers that are offering Spark-optimized environments.

But if dedicated Spark deployment is to become the norm rather than the exception, Spark must reinvent the wheel when it comes to security, data protection, lifecycle workflows, data localization, and so on. The Spark open source community is busy addressing many of the same gaps currently challenging the Hadoop community (the Hadoop project just has a two-year head start). But let’s assume the Spark project dots all the i’s and crosses all the t’s to deliver the robustness expected of any enterprise data platform. As Spark workloads get productionized, will your organization really want to run them in yet another governance silo?

Note: There are plenty of nuggets in the Databricks survey beyond Hadoop. Recommendation systems, log processing, and business intelligence (an umbrella category) are the most popular uses. The practitioners are mostly data engineers and data scientists – suggesting that adoption is concentrated among those with new-generation skills. But while respondents view advanced analytics and real-time streaming as the most important Spark features, paradoxically, Spark SQL is the most used Spark component. While new bells and whistles are important, at the end of the day, accessibility from, and integration with, enterprise analytics trump all.

If it seems like we’ve been down this path before, well, maybe we have. June has been a month of juxtapositions, back and forth to the west coast for the Hadoop and Spark Summits. The mood from last week to this week has been quite a contrast: Spark Summit has the kind of canned heat that Hadoop conferences had a couple of years back. We won’t stretch the Dickens metaphor.

Yeah, it’s human nature to say out with the old and in with the new.

But let’s set something straight: Spark ain’t going to replace Hadoop, because we’re talking apples and oranges. Spark can run on Hadoop, and it can run on other data platforms. What it might replace is MapReduce, if Spark can overcome its own scaling hurdles. And it could fulfill IBM’s vision of the next analytic operating system if it addresses mundane – but very important – concerns around scaling, high concurrency, and bulletproof security. Spark originated at UC Berkeley’s AMPLab back in 2009, with the founders going on to form Databricks. With roughly 700 contributors, Spark has ballooned into the most active open source project in the Apache community, barely two years after becoming an Apache project.

Spark is best known as a sort of in-memory replacement for iterative computation frameworks like MapReduce; both employ massively parallel compute and shuffle interim results, the difference being that Spark caches those results in memory while MapReduce writes them to disk. But that’s just the tip of the iceberg. Spark offers a simpler programming model, better fault tolerance, and far more extensibility than MapReduce. Spark can run virtually any form of iterative computation, and it was designed to support specific extensions; among the most popular are machine learning, microbatch stream processing, graph computing, and even SQL.
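The memory-versus-disk difference can be sketched abstractly: an iterative job that re-reads its input every pass (MapReduce-style) versus one that loads it once and reuses the cached copy (Spark-style). The counters below are just a stand-in for real I/O; the dataset and iteration count are made up:

```python
# Toy illustration of why in-memory caching matters for iterative jobs.
disk_reads = 0

def load_from_disk():
    """Stand-in for an expensive on-disk read of the working set."""
    global disk_reads
    disk_reads += 1
    return list(range(1000))

def iterate_uncached(iterations):
    for _ in range(iterations):
        data = load_from_disk()   # re-read the input every pass, MapReduce-style
        total = sum(data)
    return total

def iterate_cached(iterations):
    data = load_from_disk()       # read once, keep the working set in memory
    for _ in range(iterations):
        total = sum(data)         # Spark-style: operate on the cached copy
    return total

iterate_uncached(10)
uncached_reads, disk_reads = disk_reads, 0
iterate_cached(10)
cached_reads = disk_reads
# uncached_reads == 10, cached_reads == 1
```

Multiply that one-read-versus-ten difference across a machine learning job that makes hundreds of passes over terabytes of data, and the appeal of the in-memory model is obvious.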

By contrast, Hadoop is a data platform, and just one of many that can run Spark, because Spark is platform-independent. You could also run Spark on Cassandra, another NoSQL data store, or a SQL database, but Hadoop is the most popular target right now.

And don’t forget Apache Mesos, another AMPLab creation, a cluster manager with which Spark was originally closely associated.

There’s little question about the excitement level over Spark. By now the headlines have poured out over IBM investing $300 million, committing 3500 developers, establishing a Spark open source development center a few BART stops from AMPLab in San Francisco, and aiming, directly and through partners, to educate 1 million professionals on Spark in the next few years (about 4 – 5x the number currently registered for IBM’s online Big Data University). IBM views Spark’s strength as machine learning, and wants to make machine learning a declarative programming experience that will follow in SQL’s footsteps with its new SystemML language (which it plans to open source).

That’s not to overshadow Databricks’ announcement that its Spark developer cloud, in preview over the past year, has now gone GA. The big challenge facing Databricks was making its cloud scalable and elastic enough to meet demand – and not become a victim of its own success. And there’s the growing number of vendors embedding Spark within their analytic tools, streaming products, and development tools. The Spark 1.4 release brings new manageability features and the capability to automatically renew Kerberos tokens for long-running processes like streaming. But there remain growing pains, like reducing the number of moving parts needed to make Spark a first-class citizen with Hadoop YARN.

By contrast, last week was about Hadoop becoming more manageable and more amenable to enterprise infrastructure, like shared storage as our colleague Merv Adrian pointed out. Not to mention enduring adolescent factional turf wars.

It’s easy to get excited by the idealism around the shiny new thing. While the sky seems the limit, the reality is that there’s lots of blocking and tackling ahead: engaging not only developers, but business stakeholders, through applications rather than development tools, and through success stories with tangible results. It’s a stage that the Hadoop community is only now starting to embrace.