http://traffic.libsyn.com/sedaily/kafka_streams_edited.mp3
Kafka Streams is a library for building streaming applications that transform input Kafka topics into output Kafka topics. In a time when there are numerous streaming frameworks already out there, why do we need yet another? To quote today’s guest, Jay Kreps, “the gap we see Kafka Streams filling is less the analytics-focused domain these frameworks focus on and more building core applications

http://traffic.libsyn.com/sedaily/Google_Cloud_Edited.mp3
Batch and stream processing systems have been evolving for the past decade. From MapReduce to Apache Storm to Dataflow, the best practices for large-volume data processing have become more sophisticated as the industry and open source communities have iterated on them. Dataflow and Apache Beam are projects that present a unified batch and stream processing system. A previous episode with Frances

http://traffic.libsyn.com/sedaily/Apache_Beam__Edited.mp3
Unbounded data streams create difficult challenges for our application architectures. The data never stops coming, and we are forced to assume that we will never know if or when we have seen all of our data. Some streaming systems give us the tools to deal partially with unbounded data streams, but we have to complement those streaming systems with batch processing, in a

http://traffic.libsyn.com/sedaily/uber_danny_edited.mp3
“Be aggressive in vision, but conservative in operation.” Uber is a transportation company with a high volume of temporal and spatial data, constantly being collected from the devices of its users. At any given time, the engineers and data scientists at Uber need to be able to query the system and understand what is going on with drivers and riders. The unique real-time engineering

http://traffic.libsyn.com/sedaily/Alluxio_Edited.mp3
“It’s not really about removing disk from the picture per se – it’s more like saying, ‘how do we leverage more and more resources from DRAM?’” Memory is king. The costs of memory and disk capacity are both decreasing every year, but only the throughput of memory is increasing exponentially. This trend is driving opportunity in the space of big data processing. Alluxio

http://traffic.libsyn.com/sedaily/Mic_Edited_2.mp3
“Millennials deeply care about software, in the sense where if something doesn’t work as it should, it’s forgotten immediately – if you build an app and there are bugs, you’re done.” Mic.com is a media company focused on news for millennials. Anthony Sessa is the VP of product at Mic.com, and he joins us to talk about the engineering of a modern news

http://traffic.libsyn.com/sedaily/Hadoop_2_Edited.mp3
“HDFS is going to be a cockroach – I don’t think it’s ever going away.” Hadoop was created in 2003. In the early years, Hadoop provided large-scale data processing with MapReduce, and distributed fault-tolerant storage with the Hadoop Distributed File System. Over the last decade, Hadoop has evolved rapidly, with the support of a big open-source community. Today’s guest is Mike Cafarella,

http://traffic.libsyn.com/sedaily/Airbnb_Edited.mp3
“One big transformation we’re seeing right now is the slow agonizing death of MapReduce.” When a company gets big enough, there is so much data to be processed that an entire data engineering team becomes responsible for managing this data and making it available to other teams. Airbnb is one such company. Max Beauchemin works on the data engineering team at Airbnb, where

From Nicolae Marasoiu’s answer via Quora: Kafka is a high-performance messaging system which provides an immutable, linearizable, sharded log of messages. Throughput and storage capacity scale linearly with nodes. Kafka can push astonishingly high volume through each node, often saturating disk, network, or both, while keeping CPU utilization low. You would use Kafka in scenarios of asynchronous communication and processing pipelines, predominantly in distributed systems, cloud & big data,

“The business of technology and the technology of technology are kind of converging if you ask me. And there is definitely a space for some publications that don’t have decades of technical debt in the software space.”

“You’ve got software engineers who are interested in machine learning, and think what they need to do is just bring in another module and then that will solve their problem. It’s particularly important for those people to understand that this is a different type of beast.”

“The more you’re comfortable with this idea that everything is going to fail, the more you realize that it’s a natural process of distributed systems, and it helps you write and architect better code.”

“Every vendor will advertise that their system is better – that’s nice, I understand you need to sell your thing, but what am I gaining as a user and what am I sacrificing as a user by choosing your product?”

http://traffic.libsyn.com/sedaily/Mesos_Edited_2.mp3
Apache Mesos is an open-source cluster manager that enables resource sharing in a fine-grained manner, improving cluster utilization. Michael Hausenblas is a developer and cloud advocate with Mesosphere, which builds the Datacenter Operating System (DCOS), a distributed OS that uses Apache Mesos as its kernel.
Questions
Can you give the historical context for cluster computing?
How are the distributed systems needs of different