Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You’ll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.

Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell
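The "parallel jobs in a few lines of code" promise rests on Spark's core idea: transformations on a distributed dataset are recorded lazily, and nothing runs until an action is called. As a rough illustration of that model only — `MiniRDD` and its methods are invented for this sketch, not Spark's actual API — here is a pure-Python toy:

```python
# A tiny, pure-Python sketch of Spark's lazy RDD model (no Spark needed):
# transformations (map, filter) only record work; an action (collect)
# triggers the actual computation over the data.

class MiniRDD:
    """Toy stand-in for an RDD: data plus a pipeline of deferred steps."""
    def __init__(self, data, steps=None):
        self.data = list(data)
        self.steps = steps or []          # deferred transformations

    def map(self, fn):
        # Lazy: remember the function, don't run it yet.
        return MiniRDD(self.data, self.steps + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self.data, self.steps + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded pipeline over the data.
        out = self.data
        for kind, fn in self.steps:
            out = [fn(x) for x in out] if kind == "map" \
                  else [x for x in out if fn(x)]
        return out

nums = MiniRDD(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())   # [0, 4, 16, 36, 64]
```

In real PySpark the equivalent pipeline would be `sc.parallelize(range(10)).filter(...).map(...).collect()`, with the work distributed across a cluster rather than run locally.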

Holden Karau

Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and Machine Learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.

Andy Konwinski

Most recently, Andy Konwinski co-founded Databricks. Before that he was a PhD student and then a postdoc in the AMPLab at UC Berkeley, focused on large-scale distributed computing and cluster scheduling. He co-created and is a committer on the Apache Mesos project. He also worked with systems engineers and researchers at Google on the design of Omega, their next-generation cluster scheduling system. More recently, he developed and led the AMP Camp Big Data Bootcamps and the first Spark Summit, and has been contributing to the Spark project.

Patrick Wendell

Patrick Wendell is an engineer at Databricks as well as a Spark Committer and PMC member. In the Spark project, Patrick has acted as release manager for several Spark releases, including Spark 1.0. Patrick also maintains several subsystems of Spark's core engine. Before helping start Databricks, Patrick obtained an M.S. in Computer Science at UC Berkeley. His research focused on low-latency scheduling for large-scale analytics workloads. He holds a B.S.E. in Computer Science from Princeton University.

Matei Zaharia

Matei Zaharia is the creator of Apache Spark and CTO at Databricks. He holds a PhD from UC Berkeley, where he started Spark as a research project. He now serves as its Vice President at Apache. Apart from Spark, he has made research and open source contributions to other projects in the cluster computing area, including Apache Hadoop (where he is a committer) and Apache Mesos (which he also helped start at Berkeley).

The animal on the cover of Learning Spark is a small-spotted catshark (Scyliorhinus canicula), one of the most abundant elasmobranchs in the Northeast Atlantic and Mediterranean Sea. It is a small, slender shark with a blunt head, elongated eyes, and a rounded snout. The dorsal surface is grayish-brown and patterned with many small dark and sometimes lighter spots. The texture of the skin is rough, similar to the coarseness of sandpaper.

This small shark feeds on marine invertebrates including mollusks, crustaceans, cephalopods, and polychaete worms. It also feeds on small bony fish, and occasionally larger fish. It is an oviparous species that deposits egg-cases in shallow coastal waters, protected by a horny capsule with long tendrils.

The small-spotted catshark is of only moderate commercial fisheries importance; however, it is utilized in public aquarium display tanks. Though commercial landings are made and large individuals are retained for human consumption, the species is often discarded, and studies show that post-discard survival rates are high.

Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to http://animals.oreilly.com.

The cover image is from Wood's Animate Creation. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.

Overall very nicely written; the examples are provided in three languages: Scala, Java, and Python. It covers a wide array of topics; perhaps they could have gone deeper in some areas, but it still gives you enough to get started.

If you are looking for real-world examples, the follow-on book, Adv Spark, is a good read.

This book also does not cover the architecture in great detail, and perhaps they could have done a better job organizing the topics, especially around the physical architecture. There are references to Executors and the Block Manager in chapters before these services are introduced.

I find this book a very good starting point for learning Spark. The authors' structured presentation of the framework I find very useful as preparation for more domain-specific, detailed examples.

Content in the book is obtainable from the Spark website itself. So far, no topic struck me as novel or dealt with in a detailed way. What I would have loved to see explained is the internals. For instance, I was keen on understanding how checkpointing works, what is checkpointed, and at what frequency. Missing. I was then looking for a concrete example of how the Kafka Direct API in Spark Streaming works, and specifically how to get information on the current topic being consumed by each executor. Missing; in fact, I found a blog post on Cloudera more useful than the discussion in this book. Then I was looking for HBase connectivity examples: missing. If you want to use Spark in a practical enterprise setting with ecosystem integration, be prepared to spend a lot more time searching the Internet. If you want to understand the internals of Spark, again, you will be doing more Google searches and forum perusals than going back to this book. It merely serves as an introduction, which is a big letdown given the author list.

Over the last few years, Big Data has gathered an incredible amount of momentum. All this buzz has led top companies, as well as fearless start-ups, to invest hours and cash in data solutions, some of which have emerged and established new standards. Being in the spotlight often resulted in these projects becoming open source. Among these is Spark, a cluster computing framework recently adopted by the Apache Foundation. Despite being a hot topic of 2015, the literature dedicated to the subject is still very limited. Among the few titles available, Learning Spark provides the curious reader with a decent overview of the major features of the framework.

Written by a group of enthusiasts and developers, including Matei, the original creator of the framework itself, Learning Spark targets data scientists and engineers. As expressly stated on the back cover, this book is neither a reference nor a cookbook. Its goal is to present a different, faster alternative to Hadoop's Map/Reduce paradigm and to the elephant made in Apache itself.

The reader is given a quick overview of the capabilities of the framework, such as the built-in libraries, Spark SQL, and the many different data sources it can interact with. While not all the main features are presented, those found within these almost three hundred pages come with plenty of well-explained examples.

The examples are, on the other hand, one of the many perplexities raised by this text: each is presented in Python, Java, and Scala. While it is great to see many different bindings in action, any averagely skilled Pythonista can easily understand what happens in Java, and vice versa. This is even more true in the case of Scala, another much-in-demand topic of recent years, inevitably related to Java and its ecosystem.

Another thumbs down for the complete absence of anything related to Spark's internal architecture. The car looks nice, but what about the engine? How does it work? Magic? Witchery?

Again, the examples presented are clear and well explained, but no real-world case is shown. Spark is meant to be executed on huge clusters with scary amounts of data. True, this is a quick overview of the product, but "hello world" per se does not make me want to learn more.

Overall, a good read for that early-morning hour of commute. It helps the curious reader pick up the basics of the framework. On the other hand, nothing of what is presented can't be found on the web pages of the Apache Software Foundation.

As usual, you can find more reviews on my personal blog: http://books.lostinmalloc.com Feel free to pass by and share your thoughts!

I decided to learn Spark from this book, but after a while I realized that it misses a comprehensive real-world example: some use case which starts with simple RDD transformations and continues to add more features like file operations and so on. The chapters are well organized, but I missed Python sample code in some places; the samples were just slices of a complete solution, which can be found on GitHub.

Good overview of Spark. For the size of the book, it would be difficult to stuff better content into it. I just expected more material about the inner workings of Spark. The Tuning and Debugging chapter is way too light. It is often difficult to debug what is going wrong in Spark; OK, we can follow jobs, stages, and tasks in the Web UI, but it is often not enough.