A Short List of Accessible Big Data Training Options

As you’ve read on this site and many others, the database world is well into a transition from a relational focus to a focus on non-relational tools. While the relational approach underpins most organizations’ data management cycles, I’d venture to say that all have a big chunk of big data, NoSQL, unstructured data, and more in their five-year plans, and that chunk is what’s getting most of the executive “mind share”, to use the vernacular.

Some are well along the way in their big data learning adventure, but others haven’t started yet. One thing about this IT revolution is that there’s no shortage of highly accessible training options. But several people have complained to me about the sheer quantity of options, not to mention the sheer number of new words the novice needs to learn in order to figure out what the heck big data is.

So here’s a very short list of training options accessible to the IT professional who is a rank big data beginner, starting with a very brief classification of the tools that I hope provides some context. (Two quick caveats: First, I am currently working through some of the training options; all come highly recommended by colleagues who have. Second, there’s of course much more to learn; this is just a list to get started.)

0. Context

Here’s a very high-level breakdown of what we mean by “big data”. For the beginner, the non-relational data management world breaks down into three learning categories:

The Basics: Most non-relational database tools run on one Unix variant or another, often Linux, and involve the use of Java for app development.

The term Big Data often refers to Apache Hadoop, the open source “framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models”. Although new tools bring some real-time capabilities, Hadoop is associated with analytical rather than transactional systems.

NoSQL, or “not only SQL”, database systems encompass Hadoop and over 150 other alternatives to the relational approach. In this category I’ll look at two that support transactional systems: Cassandra and MongoDB.

1. The Basics

To work with these tools you’ll need to be able to get around the Linux command line, and therefore be able to work with one of the Linux “shells”, which enable the command line interface. The Bourne Again Shell, bash, is the most commonly used command line interface for Linux. Here’s a nice tutorial from Machtelt Garrels.

I’m a Java beginner, and I’m getting a lot out of Graham Mitchell’s Learn Java the Hard Way. Although the experienced SQL developer might find the first few chapters tiresome, and it is only an intro, the author’s approach is efficient and accessible, so you can steam through the early sections pretty easily. Those early sections are available for free; the full PDF, exercise files, and videos cost $34.

2. Hadoop

Perhaps the most widely used learning resource for Hadoop is Hadoop: The Definitive Guide by Tom White. The third edition is available as a free PDF download here, and the fourth is soon to be released and will be available at Amazon, etc.

Both Hortonworks and Cloudera offer free tutorials. Each also offers downloadable virtual machines for hands-on learning (Cloudera’s and Hortonworks’). I haven’t used the Hortonworks trial VM, but as a Mac user I had to purchase and install VMware Fusion to run the Cloudera download ($50 upgrade, $70 new). Windows and Linux users can download a free VMware player here.
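
If the phrase “simple programming models” in the Hadoop description feels abstract, here is a minimal sketch of the map, shuffle, and reduce steps behind the classic word count exercise. It is plain Python running on one machine, not Java and not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are my own illustrative choices, and a real cluster would spread each step across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Combine all the counts collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

The point of the model is that map_phase and reduce_phase each look at only one line or one word at a time, which is what lets a framework run them in parallel on separate machines.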

3. NoSQL

Cassandra is an open source NoSQL platform that supports many widely used web-based services, most notably Netflix. Datastax is a company that provides product add-ons and services that supplement Cassandra’s stability and robustness for the enterprise. Datastax offers lots of free online training and makes available a downloadable community edition of Cassandra.

MongoDB is another production-proven (see Metlife, Snagajob.com) open source database offering with great training options. You can download their production release or deploy to the cloud here, and access their training resources here. Most of their online courses run seven weeks, with new lesson videos, quizzes, and homework assignments released each week.

Big data is here right now, and while it won’t eliminate relational tools from the picture, it will certainly capture most of the attention as it offers businesses new possibilities to capitalize on untapped data resources. Hopefully this short list helps you get started on your big data journey.