Tuesday, March 25, 2014

I am really enjoying Vagrant. It's one of those indispensable tools. However, today I wanted to set up a CentOS VM for my application and I couldn't remember the version name of the box that I was using in my other VMs. To find out, all you have to do is check the Vagrantfile of a previous VM. Here's an example:
vim ~/vagrant_boxes/kafka/Vagrantfile
You will be able to see the version inside the file, in the config.vm.box setting.
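If you have several old VMs lying around, opening each Vagrantfile by hand gets tedious. Here's a minimal Python sketch that pulls the box name out of a Vagrantfile's text. It assumes the box is set with a plain assignment like config.vm.box = "..." on a single line; the helper names and the sample box name "centos-example" are made up for illustration.

```python
import re
from pathlib import Path

# Matches uncommented lines such as:  config.vm.box = "some-box-name"
# (box_url and lines starting with '#' are deliberately not matched)
BOX_PATTERN = re.compile(
    r"""^\s*\w+\.vm\.box\s*=\s*["']([^"']+)["']""", re.MULTILINE
)


def find_box_names(vagrantfile_text):
    """Return every box name assigned in the given Vagrantfile text."""
    return BOX_PATTERN.findall(vagrantfile_text)


def box_names_in(path):
    """Read a Vagrantfile from disk and return the box names it declares."""
    return find_box_names(Path(path).expanduser().read_text())


# Hypothetical Vagrantfile contents, used only to demonstrate the helper.
SAMPLE = '''Vagrant.configure("2") do |config|
  # config.vm.box = "old-box"
  config.vm.box = "centos-example"
  config.vm.box_url = "http://example.invalid/centos-example.box"
end
'''

if __name__ == "__main__":
    print(find_box_names(SAMPLE))  # ['centos-example']
```

To check a real VM, call box_names_in("~/vagrant_boxes/kafka/Vagrantfile") instead of using the sample text.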

Friday, March 21, 2014

Marc Andreessen noticed that software is eating the world. I see the same thing with Big Data. Big Data is shaping the world around us. It has been used in presidential elections, weather reports, consumer analysis/sentiment, fraud checks, etc. The Strata conference is the epicenter of new technologies, use cases, and innovations related to Big Data. I've been meaning to go there for quite some time. Previously, I purchased the videos from O'Reilly because I couldn't make it. Thanks to my current company, 3C (they're pretty awesome), I was able to go along with five of my coworkers. It's the place where you can meet the experts and the main committers and ask them questions. If your eyes get dilated when you talk about Hadoop, or you get excited when you need to solve a problem that involves a huge amount of data, including the famous "three V's" (volume, velocity, and variety), then this conference is for you. This is a quick summary of my experience at the conference.

The conference revolved around four clusters:

How quickly can you get the data into your system (ingest)

How fast can you show the results

It's all about presentation (charts)

Big Data doesn't mean Hadoop

How Quickly Can You Get Data

The presentation that left me mesmerized was Spark! I can't wait to use it. It is a very compelling product and it's now backed by Cloudera. With Spark you can do the following:

Get a compute engine for Hadoop data - no need to reinvent the wheel

Speed up! An engine that can run up to 100x faster than MapReduce

Sophistication: get access to a library of sophisticated algorithms

A big community behind it; the most popular Big Data open source project (followed by Hadoop)

Learning from the big guys - Yahoo!, Conviva, and Cloudera are using it

Not to mention that it comes integrated with an analytics suite (Shark), large-scale graph processing (Bagel), and real-time analysis (Spark Streaming). This is nice because rather than learning Hive, Hadoop, Mahout, and Storm, I only have to learn one programming paradigm.

How Fast Can You Show The Results

Twitter explained how they monitor millions of time series (5,700+ tweets per second). The presentation was superb. I found out that the stack that they're using, named "Observability", is composed of Finagle, Cassandra, and a query language and execution engine based on Scala. Although it is still a work in progress, the stack is about three years old. I hope that they open-source the stack so I can get more context on how they monitor a large distributed system.

Another very interesting product was Google's BigQuery. This was one of those presentations that we (my team and I) stumbled upon by accident. The presentation showed how to use Google's toolkit (Freebase, Maps, and BigQuery) to do analytics.

It's All About Context, Results, or Charts

Another company that impressed me was Trifacta. With their tool you can clean data, see the model (graph), and repeat the process depending on whether or not you see patterns. The tool is aimed at data scientists, data wranglers, and data analysts. It's a great tool to mine data, but most importantly, you can clean the data and show the results with relative ease.

IPython: This rekindled my interest in Python. IPython notebooks are great for data scientists. You can get code, text, and graphics all in one page, so it's the perfect tool to show quick results. It's not that Python wasn't already a popular language for data scientists. The NumPy library provides a solid MATLAB-like matrix data structure, with efficient matrix and vector operations. The Python ecosystem also provides other great libraries like SciPy and Pandas.

Big Data != Hadoop

Two topics that opened my eyes were Mesos and YARN. Mesos, which Twitter uses to manage its clusters, is similar to YARN (Yet Another Resource Negotiator). Hadoop 2.0, or YARN, is becoming more of an environment and operating system, not just a MapReduce engine. With YARN, the JobTracker is gone. The ResourceManager takes over the resource-management part of the JobTracker's job. The ResourceManager (RM) is a scheduler - it allocates resources based on a pluggable scheduling algorithm. The RM does not track the progress of each application's tasks (a per-application ApplicationMaster does that), so it strictly limits itself to arbitrating available resources.
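To make the "pluggable scheduling algorithm" idea concrete, here's a toy Python sketch. It is not YARN's API, and the two policies are simple stand-ins rather than YARN's actual schedulers; every name in it (ToyResourceManager, Request, fifo, smallest_first) is hypothetical. The point is only that the resource manager arbitrates capacity while the ordering policy is swapped in from the outside.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Request:
    app: str         # application asking for resources
    containers: int  # number of containers it wants


# A policy receives the pending requests and returns them in serving order.
Policy = Callable[[List[Request]], List[Request]]


def fifo(pending: List[Request]) -> List[Request]:
    """Serve requests in arrival order."""
    return list(pending)


def smallest_first(pending: List[Request]) -> List[Request]:
    """Serve the smallest requests first so more applications get to run."""
    return sorted(pending, key=lambda r: r.containers)


class ToyResourceManager:
    """Arbitrates a fixed pool of containers; knows nothing about the jobs."""

    def __init__(self, capacity: int, policy: Policy):
        self.capacity = capacity
        self.policy = policy  # the pluggable part

    def allocate(self, pending: List[Request]) -> Dict[str, int]:
        free = self.capacity
        granted: Dict[str, int] = {}
        for req in self.policy(pending):
            if req.containers <= free:  # grant only what still fits
                granted[req.app] = req.containers
                free -= req.containers
        return granted


if __name__ == "__main__":
    pending = [Request("a", 6), Request("b", 3), Request("c", 4)]
    print(ToyResourceManager(8, fifo).allocate(pending))            # {'a': 6}
    print(ToyResourceManager(8, smallest_first).allocate(pending))  # {'b': 3, 'c': 4}
```

Same cluster, same requests, different outcome: swapping the policy changes who gets resources without touching the resource manager itself.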

One of our favorites (mine and two of my buddies') was Netflix Data Platform by Kurt Brown. It was a different and great presentation. Rather than focusing on the technology side, they explained how the culture is intertwined with their technology stack and decisions. For example, they talked about their reasons for using "the cloud". There are the obvious reasons: it's cheaper, it's much more flexible (room for growth, a better place to do tests/spikes), and having multiple data centers is definitely a plus. Also, Amazon and RackSpace have great services such as SQL, EMR, and S3. But the main reason is "focus". They are focused on getting movies and increasing their audience rather than on the "plumbing". They expressed their commitment to "open-source software" (OSS). They mentioned the great talent that they can get and how they can "manage their own destiny" by following these principles and using these tools.

Netflix explained their philosophy and how it's the "soul" of their decisions (technical and business). For example, they keep keyboards, mice, and other peripherals in vending machines (they are free), so that everyone knows to "act in Netflix's best interest". Furthermore, every decision or project needs to answer a basic question: "what value are you adding?". They apply the rule "accept that things will break". Because of this, they build safety nets around their systems. Again, it was a very nice and interesting presentation.

I really enjoyed the conference. I also just purchased the videos, which I highly recommend! During the next few months, I'm going to try to learn some of these tools and present them at the Miami JVM Meetup. Hopefully I'll get to see you there, or better yet, at Strata 2015. If you're going to either one of these events, let's meet up, share a beer...or two, and discuss Big Data. I promise that my eyes will get dilated.

Thursday, March 6, 2014

I was looking forward to this book because of the title. I was under the impression that I was going to find concrete examples of how Big Data has affected and disrupted some industries. Best of all, I thought that I was going to read about which industries will be impacted and how. The book showed some examples at the end, but in my opinion, it leaves out something very important: speed and sophistication.

I just came back from Strata 2014, which is why I was looking forward to this book, and when I heard Matei Zaharia's keynote, it was all I needed to know about the current disruption of big data. Nowadays, big data storage is becoming commoditized, so the best value added is speed (how quickly you can get the answer to your problem) and sophistication (running the best algorithms on the data). The book doesn't mention this, but it might be because of its age - things are moving super quickly in Big Data.

Some of the things that the book does well:

Introduces some history about the Big Data problem

How it affected some of the siloed technologies like RDBMS

How they solve the scalability issue

If you are a manager or someone who has no understanding of the world of Big Data, then I would recommend it. However, if you are a developer, data scientist, or data wrangler, then this book will be too basic. The one thing that I highly recommend, if you are interested in this subject, is to attend Strata (or at least purchase the videos).