Monthly Archives: December 2016

(Excerpt from original post on the Taneja Group News Blog)

It’s almost 2017 – where is your organization with regards to getting any value out of Big Data projects? Are you still dabbling with endless POC’s, or maybe haven’t even gotten a big data project approved yet?

(Excerpt from original post on the Taneja Group News Blog)

In the last few months I’ve been really bullish on Apache Spark as an big enabler of wider big data solution adoption. Recently we got the great opportunity to conduct some deep Spark market research (with Cloudera’s sponsorship) and were able to survey nearly seven thousand (6900+) highly qualified technical and managerial people working with big data from around the world.

Some highlights — First, across the broad range of industries, company sizes, and big data maturities, over one-half (54%) of respondents are already actively using Spark to solve a primary organizational use case. That’s an incredible adoption rate, and no doubt due to the many ways Spark makes big data analysis accessible to a much wider audience – not just Phd’s but anyone with a modicum of SQL and scripting skills.

When it comes to use cases, in addition to the expected Data Processing/Engineering/ETL use case (55%), we found high rates of forward-looking and analytically sophisticated use cases like Real-time Stream Processing (44%), Exploratory Data Science (33%) and Machine Learning (33%). And support for the more traditional customer intelligence (31%) and BI/DW (29%) use cases weren’t far behind. By adding those numbers up you can see that many organizations indicated that Spark was already being applied to more than one important type of use case at the same time – a good sign that Spark supports nuanced applications and offers some great efficiencies (sharing big data, converging analytical approaches).

Is Spark going to replace Hadoop and the Hadoop ecosystem of projects? A lot of folks run Spark on its own cluster, but we assess mostly only for performance and availability isolation. And that is likely just a matter of platform maturity – its likely future schedulers (and/or something like Pepperdata) will solve the multi-tenancy QoS issues with running Spark alongside and converged with any and all other kinds of data processing solutions (e.g. NoSQL, Flink, search…).

In practice already, converged analytics are the big trend with near half of current users (48%) said they used Spark with HBase and 41% again also with Kafka. Production big data solutions are actually pipelines of activities that span from data acquisition and ingest through full data processing and disposition. We believe that as Spark grows its organizational footprint out from initial data processing and ad-hoc data science into advanced operational (i.e. data center) production applications, that it truly blossoms when fully enabled by supporting other big data ecosystem technologies.

RT @TruthinIT: There's no cost of goods like a traditional NAS device where I've got disks I've got to pay for. And if I'm not using the data on those disks, I still got to pay for those disks. bit.ly/2BBX073@Nasuni@smworldbigdata

In 30 min I'm interviewing @Cohesity (and customer) on @TruthinIT about Mass Data Fragmentation. It's about having too many copies in about four or five different "dimensions", including cloud! Join us webcast (12.11.18) @ 1pmET (and there will be prizes) bit.ly/2PdqrQn