To say Streaming Analytics is popular is an understatement. Right now Streaming Engineering is a top skill Data Engineers must master. There are a lot of options and development stacks when it comes to analyzing data in a streaming architecture. Today I sat down with Lewis Kaneshiro (CEO & Co-founder) and Karthik Ramasamy (Co-founder) of Streamlio to get their thoughts on Streaming Analytics and Data Engineering careers.

Streamlio Open-Source Stack

Streamlio is a full-stack streaming solution that handles the messaging, processing, and stream storage in real-time applications. The Streamlio development stack is built primarily on Heron, Pulsar, and BookKeeper. Let’s discuss each of these open-source projects.

Heron

Heron is a real-time processing engine developed and incubated at Twitter. Currently Heron is going through the transition of moving into the Apache Software Foundation (learn more about this in the interview). Heron sits at the heart of real-time analytics by processing data before its time value expires.

Pulsar

Pulsar is an Apache-incubated project for distributed publish-subscribe messaging in real-time architectures. The origin of Pulsar is similar to that of many open-source big data projects in that it was first used in production at Yahoo.
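The core abstraction Pulsar provides is publish-subscribe: producers write messages to a named topic, and every subscription on that topic receives its own copy independently. Here is a toy in-memory sketch of the pattern (plain Python, not Pulsar's actual client API):

```python
from collections import defaultdict

class MiniPubSub:
    """Toy broker illustrating the publish-subscribe pattern Pulsar provides."""

    def __init__(self):
        # topic name -> list of subscriber queues
        self.subscribers = defaultdict(list)

    def subscribe(self, topic):
        """Register a new subscription and return its private message queue."""
        queue = []
        self.subscribers[topic].append(queue)
        return queue

    def publish(self, topic, message):
        """Deliver a copy of the message to every subscription on the topic."""
        for queue in self.subscribers[topic]:
            queue.append(message)

broker = MiniPubSub()
dashboard = broker.subscribe("clicks")   # real-time dashboard consumer
archive = broker.subscribe("clicks")     # archival consumer on the same topic
broker.publish("clicks", {"user": 1, "page": "/home"})
```

Both subscribers receive the event without knowing about each other, which is what lets one stream feed dashboards, search, and storage at the same time.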

BookKeeper

BookKeeper is the scalable, fault-tolerant, low-latency storage service used in many development stacks. BookKeeper is an Apache Software Foundation project and is popular in many open-source streaming architectures.

Interview Questions

Have we as a community accepted Hadoop related tools to be virtualized or containerized?

How do Data Engineers get started with Streamlio?

What are the biggest real-time Analytics use cases?

Is the Internet Of Things (IoT) the primary driver behind the explosion in Streaming Analytics?

What skills should new Data Engineers focus on to be amazing Data Engineers?

Breaking The World of Processing

Streaming and Real-Time analytics are pushing the boundaries of our analytic architecture patterns. In the big data community we now break analytics processing down into batch or streaming. If you glance at the most active projects, most of the excitement is on the streaming side (Apache Beam, Flink, & Spark).

What is causing the break in our architecture patterns?

A huge reason for the break in our existing architecture patterns is the concept of Bound vs. Unbound data. This concept is as fundamental as the Data Lake or Data Hub, and we were dealing with it long before Hadoop. Let’s break down both Bound and Unbound data.

Bound Data

Bound data is finite and unchanging data, where everything is known about the set of data. Typically Bound data has a known ending point and is relatively fixed. An easy example is last year’s sales numbers for the Tesla Model S. Since we are looking into the past we have a perfect timebox with a fixed number of results (number of sales).

Traditionally we have analyzed data as Bound data sets looking back into the past, using historic data sets to look for patterns or correlations that can be studied to improve future results. The timeline on these future results was measured in months or years.

For example, testing a marketing campaign for the Tesla Model S would take place over a quarter. At the end of the quarter, sales and marketing metrics are measured to deem the campaign a success or failure. Tweaks to the campaign are implemented for the next quarter and the waiting cycle continues. Why not tweak and measure the campaign from the outset?

Our architectures and systems were built to handle data in this fashion because we didn’t have the ability to analyze data in real-time. Now, with the lower cost of CPU and the explosion in open-source software for analyzing data, future results can be measured in days, hours, minutes, and seconds.

Unbound Data

Unbound data is unpredictable, infinite, and not always sequential. Data creation is a never-ending cycle, similar to Bill Murray in Groundhog Day: it just keeps going and going. For example, data generated on a web-scale enterprise network is Unbound. Network traffic messages and logs are constantly being generated, external traffic can scale up and generate more messages, remote systems with latency can report non-sequential logs, and so on. Trying to analyze all this data as Bound data is asking for pain and failure (trust me, I’ve been down this road).
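Processing Unbound data means computing on the stream as events arrive, rather than waiting for an ending that never comes. A rough Python sketch, using a generator to stand in for a never-ending log source (the event fields are made up for illustration):

```python
import itertools

def network_log_stream():
    """Pretend source of never-ending log events (Unbound data)."""
    for i in itertools.count():
        yield {"seq": i, "bytes": (i * 37) % 1500}

def running_average(stream):
    """Update the answer per event instead of waiting for 'all' the data."""
    total = 0
    count = 0
    for event in stream:
        total += event["bytes"]
        count += 1
        yield total / count

# We can only ever look at a finite prefix of an unbounded stream:
first_five = list(itertools.islice(running_average(network_log_stream()), 5))
```

Notice there is no point where the stream is "done"; every intermediate result is the best current answer, which is exactly the mindset shift Unbound data forces.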

Our world is built on processing Unbound data. Think of ourselves as machines and our brains as the processing engine. Yesterday I was walking across a parking lot with my 5-year-old daughter. How much Unbound data (stimuli) did I process and analyze?

Watching for cars in the parking lot and calculating where and when to walk

Ensuring I was holding my daughter’s hand and that she was still in step with me

Knowing the location of my car and the path to get to it

Puddles, pot holes, and pedestrians to navigate

Did all this data (stimuli) come in concise and finite fashion for me to analyze? Of course not!

All the data points were unpredictable and infinite. At any time during our walk to the car more stimuli could be introduced (cars, weather, people, etc.). In the real world all our data is Unbound and always has been.

How to Manage Bound vs. Unbound Data

What does this mean? It means we need better systems and architectures for analyzing Unbound data, but we also need to support those Bound data sets in the same system. Our systems, architectures, and software have been built to run Bound data sets, ever since the relational databases of the 1970s were built to hold collected data. The problem is that in the next 2-4 years we are going to have 20-30 billion connected devices, all sending data that we as consumers will demand instant feedback on!

On the processing side the community has shifted to true streaming analytics projects such as Apache Flink, Apache Beam, and Spark Streaming, to name a few. Flink is a project showing strong promise of consolidating our Lambda Architecture into a Kappa Architecture. By switching to a Kappa Architecture, developers and administrators can support one code base for both streaming and batch workloads. Not only does this help with the technical debt of managing two systems, but it eliminates the need for multiple writes of data blocks.
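The Kappa promise of "one code base" boils down to writing processing logic that does not care whether its input is a bounded data set or a live stream. A minimal sketch in plain Python (not Flink's actual API) of logic that serves both modes:

```python
def count_errors(events):
    """One piece of logic for batch and streaming: it only assumes an iterable,
    and emits an updated running result each time an error is seen."""
    errors = 0
    for event in events:
        if event["level"] == "ERROR":
            errors += 1
            yield errors

batch = [{"level": "INFO"}, {"level": "ERROR"}, {"level": "ERROR"}]

# Batch mode: drain the whole bounded set and keep the final answer.
final_count = list(count_errors(batch))[-1]

# Streaming mode: the very same function consumes a live iterator
# (stand-in for an endless source) and yields intermediate results.
live_source = iter(batch)
first_result = next(count_errors(live_source))
```

In a real Kappa deployment the iterable would be a replayable log (Kafka, Pulsar, etc.), but the point stands: one function, two workloads, no duplicated layers.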

Scale-out architectures have given us the ability to quickly expand to meet demand. Scale-out is not just Hadoop clusters that allow for web scale, but the ability to scale compute-intense workloads separately from storage-intense ones. Most Hadoop clusters are extremely CPU top-heavy because each time storage is needed, CPU is added as well.

Will your architecture support 10 TBs more? How about 4 PBs? Get ready for an explosion in Unbound data…

Should I Use Kappa Architecture For Real-Time Analytics?

Analytics architectures are challenging to design. If you follow the latest trends in Big Data, you’ll see a lot of different architecture patterns to choose from.

Architects have a fear of choosing the wrong pattern. It’s what keeps them up at night.

What architecture should be used for designing a real-time analytics application? Should I use the Kappa Architecture for real-time analytics? Watch this video and find out!

Video

Transcript

Hi, I’m Thomas Henson, with thomashenson.com. And today is another episode of Big Data, Big Questions. Today’s question is all about the Kappa architecture and real-time analytics. So, our question today came in from a user, and it’s going to be about how we can tackle the Kappa architecture, and is it a good fit for those real-time analytics, for sensor networks, how it all kind of works together. Find out more, right after this.

So, today’s question came in from Francisco. And it’s Francisco from Chile, and he says, “Best regards from Chile.” So, Francisco, thanks for your question, and thanks for watching. So, his question is, “Hi, I’m building a system for processing sensor network data in near real-time. All this time, I’ve been studying the Lambda architecture in order to achieve this. But now, I’ve run into the Kappa architecture, and I’m having trouble deciding between the two.” He says, what he wants to do is, he wants to analyze this near real-time data in real-time. So, as the data is coming from the sensors, he wants to obtain knowledge, and then push those insights out in some kind of a UI. So, some kind of charts and graphs, and he’s saying, do we have any suggestions about which one of these architectures we would recommend for him? Well, thanks again, Francisco, for your question. And so, yes, I have some thoughts about how we should set up that network. But let’s review, real quick, what we’ve talked about in previous videos of the Lambda architecture, and what the Kappa architecture is, and then how we’re going to implement those.

So, if you remember, the Lambda architecture – we have two different streams. And so, we have a batch-level stream and we have a real-time. So, as your data comes in, it might come in through something like a queueing system, such as Kafka, where we’re just using it to queue all the data as it comes in. And so, for that real-time, you will follow that real-time stream, and so, you might use Spark, or Flink, or some kind of real-time processing engine that’s going to do the analytics, and push that out to some of your dashboards for data just as it’s coming in, right? So, as soon as that data comes in, you want to analyze it as quick as you can – it’s what we call near real-time, right? But, you also have your batch layer. So, for your batch processing, for your storing of the data, right? Because, at some point, your queueing system, whether it’s Kafka or something, it’s going to get very, very large, and some of that data’s going to be old, and you don’t need to have it in an area where you can stream it out and analyze it all the time. So, you want to be able to tier, or you want to move that data off to, maybe, HDFS, or S3 object storage. And so, from there, you can use your distributed search, you can have it in HDFS, use Cassandra, or some other kind of… maybe it’s HBase, or some kind of NoSQL database that’s working on top of Hadoop. And then, you also can run your batch jobs there. So, you can run your MapReduce jobs there, whether it’s traditional MapReduce, or whether it’s Spark’s batch-level processing. But, you have two layers.

And so, that’s one of the challenges with the Lambda architecture – you have these two different layers, right? So, you’re supporting two levels of code, and for a lot of your processing, a lot of your data that’s coming in, maybe you’re just using the real-time there, but maybe the batch processing is used every month. But, you’re still having to support those two different levels of code. And so, that’s why we talk about the Kappa architecture, right? So, the Kappa architecture, it simplifies it. So, as your data comes in, you want to have your data in one queueing system, or one storage device – where your data comes in, you can do your analytics on it, so you can do your real-time processing, and push that data out to your dashboards, your web applications, or however you’re trying to consume that data. Then, also do your distributed search, as well. So, if you’re using ElasticSearch, or some other kind of distributed search, maybe it’s Solr, or some of the other ones, you can analyze that data and have it supporting that real-time search, as well. But, you might use Spark, and Flink, for your real-time analytics, but you also wanted to do your batch, too. So, you’re going to have some batch processing that’s going to be done. But, instead of creating a whole ‘nother tier, you want to be able to do that within that queueing system that you have. And so, whether you’re using Kafka, or whether you’re using Pravega, which is a new open-source product that was just released by Dell, you want to be able to have all that data in one spot, so that when you’re queueing that data, you know that it’s going to be there. But, you can also do your analytics on it. So, you can use it for your distributed search, you can use it for those streaming analytics jobs, but also, whenever you go back to do some of your batch, or some of your transactional processing, you know that it’s in that same location, too.
That way, there’s not as much redundancy, right? So, you’re not having to store data in multiple locations, taking up more room than you really need.

And so, this is what we call the Kappa architecture, and this is why it’s so popular right now: it simplifies that workstream. And so, when we start deciding between those two, back to Francisco’s question – Francisco, your application – it seems like it has a real need for real-time, right? So, there’s a lot of things that are going on there from the network, and a lot of traffic that’s coming in. And so, this is going to be where we break down a couple of different concepts. And so, we talked about bound and unbound. So, a bound dataset is data where we know how much data is going to come in, right? Or, we wait and do the processing on that data after it’s already come in. And so, when you think of bound data, think of sales orders, think of inventory numbers. And so, that’s largely what we would consider transactional data, so we know all the data as it’s coming in, and then we’re running the calculation then. But, what your data is, is unbound. And so, when we talk about unbound data, you don’t know how much data is coming in, right? And it’s infinite. It’s not going to end. So, with network traffic, you don’t know how long that’s going to be going on. So, the network traffic’s going to continue to come in, you don’t know… You might get one terabyte at one point, you might get ten terabytes, you might scale up all in one second. And then, as the data comes in, it might come in uneven, right? So, you might have some that’s timestamped a little bit earlier than other data that’s coming in, too. And so, that’s what we call unbound data.

And so, for unbound data, the Kappa architecture works really well. It also works really well for bound data, too. So, when we start to look at that, and looking at your project, my recommendation is to use the Kappa architecture – go ahead and use it because you’re using real-time data. But then, for those batch levels, and I’m sure that you’ll start having some processing and some pieces that you start doing that are batch – you can also consume that in the Kappa architecture, as well. And so, there are some things you can look into, so, you can choose streaming analytics, with Spark streaming, you can look at Flink, Beam – those are some of the applications you can use. But, you can also use distributed search, so you can use Solr, you can use ElasticSearch – all those are going to work well, whether you choose the Kappa architecture, or whether you choose the Lambda architecture. My recommendation is, go with the Kappa architecture.

Well, thanks guys, that’s another episode of Big Data, Big Questions. Make sure you subscribe, so that you never miss an episode. If you have any questions, have your question answered on Big Data, Big Questions – just go to the website, put your comments below, reach out to me on Twitter, however you want. Submit those questions, and have me answer those questions here on Big Data, Big Questions. Thanks again, guys.

I honestly think developing real-time analytics is one of the hardest feats for developers to take on!

I’ll admit I’m for sure biased, but that doesn’t make me wrong.

My first project in the Hadoop ecosystem was a real-time application, back when the Hadoop community still didn’t have real-time processing. I’ve always been honest in my posts and always will be. So let me not sugarcoat this… the project sucked and was deemed a failure!

My team didn’t understand the requirements for real-time and couldn’t meet them. The project was over budget and delayed. However, all was not lost: years have passed, I learned a lot, and the Hadoop community now has new frameworks to speed up processing in real-time. Before developing your real-time analytics project, please read these top 3 recommendations for real-time analytics! You will thank me…

What is Real-Time Analytics?

Real-time analytics is the ability to analyze data as soon as it is created; not only as soon as it’s created, but before the full data set has even arrived. Traditional batch architectures have all the data in place before processing, but real-time processing is done as the data is created.

To be picky, there is really no such thing as real-time analytics! What we have right now is near real-time analytics, which to humans is millisecond speed, or just faster than our competitors. For true real-time we would have to analyze the data at the same instant it occurs, and right now there is always some latency from networking, processing, etc. Let’s table this discussion until quantum computing becomes mainstream…
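One practical consequence: every "real-time" system should measure the gap between when an event occurred and when it was actually processed. A minimal sketch (the event shape is my own, for illustration):

```python
import time

def process_with_latency(event):
    """Handle an event and record how stale it was when we got to it."""
    processed_at = time.time()
    latency_ms = (processed_at - event["created_at"]) * 1000
    return {"value": event["value"], "latency_ms": latency_ms}

# An event created roughly 5 milliseconds ago:
event = {"value": 42, "created_at": time.time() - 0.005}
result = process_with_latency(event)
# Even in a healthy "real-time" pipeline, latency_ms is never zero.
```

Tracking this number per event is how you know whether you are actually meeting the time value of your data or quietly serving historical information.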

Take for example a GPS-enabled application for brick and mortar stores. Using the phone as a sensor, the application knows the customer’s proximity to the store. When the customer is close to the store, an offer is sent via the phone. Sounds simple, but imagine millions of sensors entering data into the system and trying to analyze the location information. Add to this example knowing store locations, store hours, local events, inventory levels, etc. Now many things could go wrong here. For example, the application could send an offer for a product not in stock, send an offer too late once the customer is out of range, or send an offer for a store that is closed.
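Those failure modes translate directly into guard conditions. Here is a toy sketch of the decision logic (one-dimensional positions stand in for real GPS math, and all field names are made up for the example):

```python
def should_send_offer(customer, store, now_hour):
    """Guard against the failure modes above: out-of-range customer,
    closed store, and out-of-stock product."""
    distance = abs(customer["position"] - store["position"])
    if distance > store["offer_radius"]:
        return False  # customer is already out of range
    if not (store["open_hour"] <= now_hour < store["close_hour"]):
        return False  # store is closed
    if store["inventory"].get(customer["wants"], 0) == 0:
        return False  # product not in stock
    return True

store = {"position": 100, "offer_radius": 5, "open_hour": 9,
         "close_hour": 21, "inventory": {"widget": 3}}
nearby = {"position": 103, "wants": "widget"}
far_away = {"position": 250, "wants": "widget"}
```

The hard part isn't these checks; it's evaluating them against millions of constantly moving sensors before the customer walks out of range, which is the real-time requirement.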

Still think building those real-time applications is easy? Let’s look into the future…

Tsunami of Real-Time Data

How much data are we talking about for the future of real-time analytics? Gartner predicts that by 2020 we will have 20.4 billion connected devices worldwide. The prediction roughly estimates a world population of 7 billion people with an average of 3 devices per person. Sounds like a lot; however, I think it’s a conservative prediction. How many devices do you have connected in your home? I have 25 in my home and I’m not considered on the bleeding edge. I’ve talked with quite a few people who have as many as 75 plus. So let’s say 1/4 of the population has 15 devices by 2020; that will total closer to 28-plus billion devices.
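As a back-of-envelope check on those numbers (my own arithmetic, using the article's assumptions), the quarter of the population with 15 devices already lands in the mid-20-billion range on its own, before counting anyone else's devices:

```python
population = 7_000_000_000

# Gartner-style estimate: roughly 3 devices per person.
gartner_style = population * 3          # 21 billion

# The alternative estimate: a quarter of the population at 15 devices each.
heavy_users = population // 4           # 1.75 billion people
alt_estimate = heavy_users * 15         # 26.25 billion from heavy users alone
```

Add even one device apiece for the remaining three quarters of the population and the total clears 30 billion, which is why the 20.4 billion figure looks conservative.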

Recommendations for Real-Time Analytics

Since we know why stream processing and real-time analytics are growing at such a rapid pace, let’s discuss recommendations for building those real-time applications.

1 – Timing is Everything

Know the time to value for the insights in the data. All data has a value assigned to it, and that value degrades over time. Picture our previous example of a retailer using location services to send offers via a mobile application. How valuable is a potential customer’s location? It’s really valuable, but only if the application can process the data quickly enough to send an incentive while the customer is near the physical location. Otherwise, the application is providing historical information.

After understanding the time value of the data, you can find the correct framework (Flink, Spark, Storm, etc.) to process the data. Most streaming data needs to be processed in real-time for specific insights, for example pulling that user location data while it is still fresh. Remember, not all data is processed the same way: batch vs. streaming.
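One simple way to reason about time value is to pick an explicit decay model. Exponential decay with a half-life is one assumed model (not the only one), but it makes the "historical information" cutoff concrete:

```python
def data_value(initial_value, age_seconds, half_life_seconds):
    """Exponential decay: the insight loses half its value every half-life."""
    return initial_value * 0.5 ** (age_seconds / half_life_seconds)

# A location ping worth 1.0 when fresh, with an assumed 60-second half-life:
fresh = data_value(1.0, 0, 60)      # full value at creation time
stale = data_value(1.0, 300, 60)    # five half-lives later, nearly worthless
```

Once you agree on a half-life per data source, "fast enough" stops being a feeling and becomes a number your framework choice has to hit.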

2 – Make Sure Applications will Scale

Make sure your real-time application can scale; not just scale with large influxes of data, but independently in processing and storage. In the future of IoT and streaming, data sources will be extremely unpredictable. One day you might ingest 2 TB of new data and the next, 2 PB. If you think I’m joking, check out my talk from the DataWorks Summit on the Future Architecture of Streaming Analytics. Build applications on the foundation of architectures, services, and components that can scale. Remember our friend Murphy and his law about how things can go wrong.
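One common technique for scaling processing horizontally under an unpredictable influx is stable hash partitioning: the same key always routes to the same worker, and adding workers spreads the load. A minimal sketch (plain Python, not any framework's partitioner):

```python
import hashlib

def partition_for(key, num_workers):
    """Stable hash partitioning: the same device always lands on the same
    worker, so per-device state stays local while load spreads evenly."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_workers

# Route a burst of events from 1000 hypothetical devices across 8 workers:
assignments = [partition_for(f"device-{i}", 8) for i in range(1000)]
```

The caveat in the recommendation still applies: this scales compute, but capacity has to be able to grow on its own axis too.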

Scaling isn’t all focused on just being able to ingest more data, but on scaling compute and capacity independently. Make sure your real-time application supports a data lake strategy. Isilon’s Data Lake Platform gives you the ability to separate compute and capacity when growing your Hadoop clusters. So when a new 10 TB data set comes in that isn’t really growing and will probably only be processed weekly or monthly, you can scale your capacity without having to add unneeded compute. Also, a data lake strategy gives you the ability to opt out of 3x replication, with its 200% storage overhead, for roughly 80% utilization on Isilon. Whether you use Isilon or not, make sure you have a data lake strategy that builds on the architecture of independent scaling!!

3 – Life Cycle Cost of Data

Since we know the value of data decreases over time, we need to assign a cost to that data. I know you probably just rolled your eyes when I mentioned the cost of data, but it’s important to understand that data is a product. Just like Amazon sells books at different prices, data’s value varies over time, and the cost we assign to it should vary too.

As big data developers we want to hold on to data forever and bring in as many new sources as possible. However, when our manager or CFO gets the bill for all the capacity you need, you will be sitting in endless meetings and writing up justification reports about why you are holding all this data. This means less time doing what we love: coding in our Hadoop cluster. Know the value of your data and plan accordingly!!
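Putting a number on that bill is straightforward. Here is a rough carrying-cost sketch; the $25/TB/month rate and 3x replication factor are assumptions for illustration, not quoted prices:

```python
def retention_cost(tb_stored, dollars_per_tb_month, months, replication=3):
    """Rough carrying cost of keeping a data set, counting replica copies."""
    return tb_stored * replication * dollars_per_tb_month * months

# Holding a 10 TB data set for 3 years at an assumed $25/TB/month
# under 3x replication:
cost = retention_cost(10, 25, 36)   # dollars
```

Run this against each data source's retention plan and compare it to the time value that source still delivers; that comparison is the justification report, written before the CFO asks for it.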

Wrap Up of Real-time Analytics

Finishing up our discussion, remember that real-time analytics is the processing of data as soon as the data is generated. By analyzing the data as it’s generated, decisions can be made quicker, which helps create better applications for our users. When building real-time applications, make sure you follow my 3 recommendations: understand the time value of the data, build on systems that scale independently, and assign value to the data. Successfully building real-time applications depends on these 3 core points.

Learning how to develop streaming architectures can be tricky and difficult. In Big Data the Kappa Architecture has become a powerful streaming architecture because of the growing need to analyze streaming data. For the past few years the Lambda Architecture has been king, but in the past year the Big Data community has seen a transformation to the Kappa Architecture.

What is the Kappa Architecture? How can you implement the Kappa Architecture in your environment? Watch this video and find out!

Transcript

(forgive any errors; the video was transcribed by a machine)

Hi folks, Thomas Henson here with thomashenson.com, and this is another episode of Big Data Big Questions. Today we’re going to tackle the Kappa architecture and explain how we can use it in Big Data and why it’s so popular right now. Find out more right after this.

[Music]

So, in a previous episode we talked about the Lambda architecture, and how the Lambda architecture is kind of the standard that we’ve seen in Big Data before we had Spark, and streaming, and Flink, and all those processing engines that work in Big Data to do streaming. You can find that video right here. Check it out, we’re in the same shirt, pretty cool. So, after you watch that video, now we need to talk about the Kappa architecture. And the reason we’re going to talk about Kappa is because it’s based on, and actually morphed from, what the Lambda architecture is. When we talk about the Lambda architecture, we talked about how we had a dualistic framework: you have your speed layer and your batch, or MapReduce, layer, which is more of a transactional layer, right? So you have two layers; you’re still moving your data into HDFS, you’re still putting your data into a queue. Well, with the Kappa architecture, what we’re trying to do, and where the industry is going, is not to have to support two different frameworks, right? Anytime you’re supporting two versions of code, or two different layers of code, it’s just more complicated. You need more developers, and it’s just more risk. Look at the 80/20 rule: probably 20% of bugs cause 80% of your problems. So why have to manage two different layers? What we’re starting to see is we’re moving all our data into one system, where we can interact with it through our APIs and pull data out, whether we’re running a Flink job or some kind of distributed search, maybe using Solr or ElasticSearch. We want to collapse all that down into one framework.

Okay, that sounds pretty simple, but it’s not always implemented like we think. So one of the big tips, and one thing I want you to pay attention to when you’re talking about the Kappa architecture: you’re saying, okay, I’m going to have this one layer here that’s going to interact, and I want to run all my jobs, whether I’m running through Spark or through Flink, and that’s how we’re going to process this data. What you want to make sure is that you’re not just using Kafka or some kind of message queue, still running your streaming jobs through your APIs from there, while you’re also still taking that data, moving it into HDFS, and running some processing there. Really, what we want to see with the Kappa architecture is our data landing in whatever our queuing system is (you can check out pravega.io, and there’s some information there about that architecture layer). Your source data comes in, and you want your data to exist in that kind of queuing system, but you don’t want your APIs writing directly to HDFS, because then you’re just writing to two different systems. So you want something to abstract away all that storage. Whether your data is more archival, sitting in HDFS or some kind of object-based storage, or it’s the streaming applications where you’re trying to pull that data off as fast as you can, you only want to interact with that one system. That’s what we mean when we talk about Kappa, and that’s what Kappa is really intended to be. So remember: you want to abstract away that storage layer behind your queuing system, where you’re only dealing with APIs, and you want to be pulling your Spark jobs, your Flink jobs, and your distributed search through one pipeline, not through two different pipelines where you’re breaking up your speed layer and your batch, or transactional, layer.

So, that’s the Kappa architecture explained. Make sure you subscribe to this video so you never miss an episode; you definitely want to keep up with what’s going on in Big Data. Any questions you have, submit those to Big Data Big Questions in the comments below, send me an email, or go to the Big Data Big Questions section on my blog. Thanks again, and I’ll see you next time.

What is Lambda Architecture?

Since Spark, Storm, and other stream processing engines entered the Hadoop ecosystem, the Lambda Architecture has been the de facto architecture for Big Data with a real-time processing requirement. In this episode of Big Data Big Questions I’ll explain what the Lambda Architecture is and how developers and administrators can implement it in their Big Data workflows.
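The two Lambda layers can be sketched in plain Python (a toy illustration, not any framework's API): a speed layer that updates results incrementally per event, and a batch layer that periodically recomputes from the full data set.

```python
def speed_layer(event, running_totals):
    """Incremental update per event: fast, always-current view."""
    key = event["key"]
    running_totals[key] = running_totals.get(key, 0) + event["value"]
    return running_totals

def batch_layer(all_events):
    """Periodic full recompute over everything stored: slow but authoritative."""
    totals = {}
    for event in all_events:
        totals[event["key"]] = totals.get(event["key"], 0) + event["value"]
    return totals

events = [{"key": "clicks", "value": 1}, {"key": "clicks", "value": 1}]

realtime_view = {}
for e in events:                 # events arrive one at a time
    speed_layer(e, realtime_view)

batch_view = batch_layer(events)  # later, recomputed from stored data
# A serving layer would merge the two views; here they agree exactly.
```

The maintenance burden the episode describes comes from the fact that these are two separate code paths computing the same answer, which is precisely what Kappa collapses into one.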

Transcript

(forgive any errors; the text was transcribed by a machine)

Hi folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today's question is: what is the lambda architecture, and how does it relate to our big data and Hadoop ecosystem? Find out right after this.

When we talk about the lambda architecture and how it's implemented, we have to go back and look at Hadoop 1.0 and 2.0, when we really didn't have a speed layer or Spark for streaming analytics. Back in the traditional days of Hadoop 1.0 and 2.0 we were using MapReduce for most of our processing. The way that worked is our data would come in, we would pull it into HDFS, and once our data was in HDFS we would run some kind of MapReduce job: we might use Pig or Hive, write our own custom job, or use some of the other frameworks in the ecosystem. That was all mostly transactional, right? All our data had to be in HDFS, so we had to have a complete view of our data to be able to process it.

Later on we started looking at it and seeing that, hey, we need to be able to pull data in and process it when the data is not really complete. That's less transactional: we may have incomplete parts of the data, or the data is continuing to be updated. That's where Spark and Flink and some of the other streaming analytics and stream processing engines came in. We wanted to be able to process that data as it came in, and do it a lot faster too. We took out the need to even put it into HDFS before we started processing it, because that write takes time. We wanted to move our data and process it before it even hit HDFS and disconnect that whole system. But we still needed batch processing, right? Some analytics we want in real time, but there are other insights, like monthly or quarterly reports, that are just better as transactional workloads, especially when we start talking about holding on to historical data and using Hadoop like a traditional enterprise data warehouse, but on a larger platform, with Hive, Presto, and some of the other SQL engines that work on top of it.

So the need came where we had two different systems for processing data, and we started adopting the lambda architecture. In the lambda architecture, as your data comes in it sits in a queue, maybe Kafka or some other kind of message queue. Any data that needs to be pulled out and processed as a stream, we take and process in what we call our speed layer, maybe using Spark or Flink to pull out insights and push them right out to our dashboards. For the data that's going to be used for batch or transactional processing, or just held for historical data, we have our MapReduce layer: our batch layer. So think of it as a two-pronged approach. You have your speed layer on top, pulling out insights as the data comes in, but the same data in the queue also goes into HDFS, where it's still there to run Hive on top of, hold for historical data, or run some MapReduce jobs and push results up to a dashboard. So that's what we mean when we say lambda architecture: it's just a two-layer system, a batch layer to do our MapReduce and batch jobs, and a speed layer to do our streaming analytics, whether it be through Spark, Flink, Apache Beam, or some of the other frameworks.

It's a really good pattern to know, and it's been in the industry for quite a long time, so if you're new to the Hadoop environment you definitely want to know it and be able to reference it. There are some other architectures that we'll talk about in future episodes, so make sure you subscribe so that you never miss an episode. Go right now and subscribe so that the next time we talk about an architecture you don't miss it, and I'll check back with you next time. Thanks, folks!
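The two-pronged flow described above can be sketched in a few lines of Python. This is a minimal toy model, not any real framework's API: `MessageQueue` stands in for Kafka, `SpeedLayer` for a stream processor like Spark or Flink, and `BatchLayer` for MapReduce jobs over HDFS. All class and method names are illustrative.

```python
from collections import deque

class MessageQueue:
    """Stands in for Kafka: raw events land here first."""
    def __init__(self):
        self.events = deque()
    def publish(self, event):
        self.events.append(event)

class SpeedLayer:
    """Processes each event as it arrives (think Spark Streaming or Flink)."""
    def __init__(self):
        self.running_total = 0
    def process(self, event):
        # Incremental view, available immediately for dashboards.
        self.running_total += event["value"]

class BatchLayer:
    """Recomputes over the complete data set (think MapReduce over HDFS)."""
    def __init__(self):
        self.hdfs = []          # stands in for durable HDFS storage
    def store(self, event):
        self.hdfs.append(event)
    def recompute(self):
        # Full historical view, recomputed from all stored data.
        return sum(e["value"] for e in self.hdfs)

# Wire the two prongs together: every event feeds both layers.
queue = MessageQueue()
speed, batch = SpeedLayer(), BatchLayer()
for v in [10, 20, 30]:
    queue.publish({"value": v})
while queue.events:
    event = queue.events.popleft()
    speed.process(event)   # real-time dashboard path
    batch.store(event)     # durable historical path

print(speed.running_total)   # 60, available as events arrive
print(batch.recompute())     # 60, recomputed from the full data set
```

The point of the pattern is that both layers see the same events: the speed layer trades completeness for latency, while the batch layer can always rebuild the authoritative answer from everything in storage.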

All Things Data

Just coming off an amazing week with a ton of information in the Hadoop Ecosystem. It's been 2 years since I've been to this conference. Some things have changed, like the name from Hadoop Summit to DataWorks Summit. Other things stayed the same, like breaking news and extremely great content.

I’ll try to sum up my thoughts from the sessions I attended and people I talked with.

First there was an insanely great session called The Future Architecture of Streaming Analytics, put on by a very handsome Hadoop Guru, Thomas Henson. It was a well-received session where I talked about how to architect streaming applications for the next 2-5 years, when we will see some 20 billion-plus connected devices worldwide.

Hortonworks & IBM Partnership

Next there was breaking news about the Hortonworks and IBM partnership. The huge part of the partnership is that IBM's BigInsights will merge with the Hortonworks Data Platform. Both IBM and Hortonworks are part of the Open Data Platform initiative.

What does this mean for the Big Data community? More consolidation of Hadoop distribution packages, but also more collaboration on the big data frameworks. This is good for the community because it allows us to focus on the open-source frameworks inside the big data community. Now, instead of having to work through the differences of BigInsights vs. HDP, development effort will be poured into Spark, Ambari, HDFS, etc.

Hadoop 3.0 Community Updates

The new updates coming with the next release of Hadoop 3.0 were great! There is a significant amount of change coming with the release, which is slated for GA on August 15, 2017. The big focus is the introduction of Erasure Coding for data striping, support for containers in YARN, and some minor changes. Look for an in-depth look at Hadoop 3.0 in a follow-up post.
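To see why Erasure Coding is a big deal, a quick back-of-the-envelope comparison helps: HDFS traditionally stores 3 full replicas of every block, while a Reed-Solomon policy like RS-6-3 (6 data blocks plus 3 parity blocks, the kind of scheme Hadoop 3.0's erasure coding supports) tolerates the loss of any 3 blocks at a fraction of the raw storage cost. The numbers below are just the standard arithmetic, not benchmarks.

```python
# Raw storage overhead: 3x replication vs Reed-Solomon RS(6,3) erasure coding.
data_blocks, parity_blocks = 6, 3
replication_factor = 3

replication_overhead = replication_factor                  # 3.0x raw storage
ec_overhead = (data_blocks + parity_blocks) / data_blocks  # 1.5x raw storage

# Both schemes survive the loss of up to parity_blocks (here 3) blocks,
# but erasure coding does it with half the raw capacity.
print(replication_overhead, ec_overhead)  # 3 1.5
```

The trade-off is that reconstructing a lost block under erasure coding requires reading several surviving blocks and doing parity math, so it costs more CPU and network than simply copying a replica; that is why striped erasure coding targets colder data first.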

Hive LLAP

If you haven’t looked deeply at Hive in the last year or so….you’ve really missed out. Hive is really starting to mature into an EDW on Hadoop!! I’m not sure how many different breakout sessions there were on Hive LLAP, but I know it was mentioned in most I attended.

The first Hive breakout session was hosted by Hortonworks Co-founder Alan Gates. He walked through the latest updates and future roadmap for Hive. The audience was also posed a question: what do we expect in a Data Warehouse?

Governance

High Performance

Management & Monitoring

Security

Replication & DR

Storage Capacity

Support for BI

We walked through where the Hive community stands in addressing these requirements. Hive LLAP was certainly the answer on the high performance front. More on that now….

Another breakout session focused on a shoot-out between the Hadoop SQL engines. Wow, this session was full and very interesting. Here is the list of SQL engines tested in the shoot-out:

MapReduce

Presto

Spark SQL

Hive LLAP

All the tests were run using the Hive benchmark tests on the same hardware. Hive LLAP was the clear winner, with MapReduce the huge loser (no surprise there). Spark SQL performed really well, but there were issues using the Thrift server which might have skewed its results. Kerberos was also not enabled during the testing.

Pig Latin Updates

Of course there were sessions on Pig Latin! Yahoo presented their results on converting all Pig jobs from MapReduce to Tez jobs. The keynote on Yahoo’s conversion rate from MapReduce jobs to Tez/Spark/etc. showed that Yahoo is still running a ton of Pig jobs. Moving to Tez has increased the speed and efficiency of the Pig jobs at Yahoo. Also, in the next few months Pig on Spark should be released.

Closing Thoughts

After missing the Hadoop Summit, now DataWorks Summit, last year, it was fun to be back. DataWorks Summit is still the premier event for Hadoop developers/admins to come and learn new features developed by the community. This year the theme seemed to be benchmark testing, with a mix of Streaming Analytics and Big Data EDW. It’s definitely an event I will try to make again next year to keep up with the Hadoop community.

Next week I will be heading to the DataWorks Summit in San Jose (formerly Hadoop Summit). The DataWorks Summit is one of the top conferences for the Hadoop Ecosystem. Last year was the first DataWorks Summit I’d missed in the past 3 years, but this year I’m back. I’m happy to announce that this year I have a breakout session.

My session will focus on the Future Architectures of Streaming Analytics. I will cover how these architectures will support the future of Streaming Analytics. In the past few years the Hadoop community has focused on processing data from streaming data sources with Storm, Spark, Flink, Beam, and other projects. Now, as we enter an era of massive streams of data, it’s time to focus on how we store and scale these systems. Gartner predicts that by 2020 we will reach 20.4 billion connected devices. Now more than ever we are going to need systems with auto-scaling and unlimited retention. Projects like Pravega are emerging to abstract away the storage layer in massive data analytics architectures. Stop by my session to learn about Pravega and architecture recommendations for Streaming Analytics.

Information on my session

The proliferation of connected devices and sensors is leading the Digital Transformation. By 2020 there will be over 20 billion connected devices. Data from these devices needs to be ingested at extreme speeds in order to be analyzed before it decays. The life cycle of the data is critical in determining what insights can be revealed and how quickly they can be acted upon.

In this session we will look at the past, present and future architecture trends of streaming analytics. Next we will look at how to turn all the data from these devices into actionable insights. We will also dive into recommendations for streaming architecture depending on the data streams and time factor of the data. Finally, we will discuss how to manage all the sensor data, understand the life cycle cost of the data, and how to scale capacity and capability easily with a modern infrastructure strategy.