What precision and recall are?

After the predictive model has been finished, the most important question is: How good is it? Does it predict well?

Evaluating the model is one of the most important tasks in the data science project, it indicates how good predictions are. Very often for classification problems we look at metrics called precision and recall, to define them in detail let’s quickly introduce confusion matrix first.

Confusion Matrix for binary classification is made of four simple ratios:

True Negative(TN): case was true negative and predicted negative

True Positive(TP): case was true positive and predicted positive

False Negative(FN): case was true positive but predicted negative

False Positive(FP): case was true negative but predicted positive

Understanding the confusion matrix, calculating precision and recall is easy.

Precision – is the ratio of correctly predicted positive observations to the total predicted positive observations, or what percent of positive predictions were correct?

Precision = TP/TP+FP

Recall – also called sensitivity, is the ratio of correctly predicted positive observations to all observations in actual class – yes, or what percent of the positive cases did you catch?

Recall = TP/TP+FN

There are also two more useful matrices coming from confusion matrix, Accuracy – correctly predicted observation to the total observations and F1 score the weighted average of Precision and Recall. Although intuitively it is not as easy to understand as accuracy, the F1 score is usually more useful than accuracy, especially if you have an uneven class distribution.

Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch.

If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with hacking skills you need to get started as a data scientist. Today’s messy glut of data holds answers to questions no one’s even thought to ask. This book provides you with the know-how to dig those answers out.

Statistical methods are a key part of of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what's important and what's not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

Why exploratory data analysis is a key preliminary step in data science
How random sampling can reduce bias and yield a higher quality dataset, even with big data
How the principles of experimental design yield definitive answers to questions
How to use regression to estimate outcomes and detect anomalies
Key classification techniques for predicting which categories a record belongs to
Statistical machine learning methods that “learn” from data
Unsupervised learning methods for extracting meaning from unlabeled data

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.

In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

The Data Science Handbook contains interviews with 25 of the world s best data scientists. We sat down with them, had in-depth conversations about their careers, personal stories, perspectives on data science and life advice. In The Data Science Handbook, you will find war stories from DJ Patil, US Chief Data Officer and one of the founders of the field. You ll learn industry veterans such as Kevin Novak and Riley Newman, who head the data science teams at Uber and Airbnb respectively. You ll also read about rising data scientists such as Clare Corthell, who crafted her own open source data science masters program. This book is perfect for aspiring or current data scientists to learn from the best. It s a reference book packed full of strategies, suggestions and recipes to launch and grow your own data science career.

Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.

You’ll learn the steps necessary to create a successful machine-learning application with Python and the scikit-learn library. Authors Andreas Müller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the math behind them. Familiarity with the NumPy and matplotlib libraries will help you get even more from this book.

With this book, you’ll learn:

Fundamental concepts and applications of machine learning
Advantages and shortcomings of widely used machine learning algorithms
How to represent data processed by machine learning, including which data aspects to focus on
Advanced methods for model evaluation and parameter tuning
The concept of pipelines for chaining models and encapsulating your workflow
Methods for working with text data, including text-specific processing techniques
Suggestions for improving your machine learning and data science skills

NEWS

Experts Predict When Artificial Intelligence Will Exceed Human Performance – Artificial intelligence is changing the world and doing it at breakneck speed. The promise is that intelligent machines will be able to do every task better and more cheaply than humans. Rightly or wrongly, one industry after another is falling under its spell, even though few have benefited significantly so far.

Applying deep learning to real-world problems – The rise of artificial intelligence in recent years is grounded in the success of deep learning. Three major drivers caused the breakthrough of (deep) neural networks: the availability of huge amounts of training data, powerful computational infrastructure, and advances in academia.

How AI Can Keep Accelerating After Moore’s Law – Google CEO Sundar Pichai was obviously excited when he spoke to developers about a blockbuster result from his machine-learning lab earlier this month. Researchers had figured out how to automate some of the work of crafting machine-learning software, something that could make it much easier to deploy the technology in new situations and industries.

Bayesian GAN – Generative adversarial networks (GANs) can implicitly learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood. We present a practical Bayesian formulation for unsupervised and semi-supervised learning with GANs. Within this framework, we use stochastic gradient Hamiltonian Monte Carlo to marginalize the weights of the generator and discriminator networks.

The $1700 great Deep Learning box. – After years of using a thin client in the form of increasingly thinner MacBooks, I had gotten used to it. So when I got into Deep Learning (DL), I went straight for the brand new at the time Amazon P2 cloud servers. No upfront cost, the ability to train many models simultaneously and the general coolness of having a machine learning model out there slowly teaching itself.

Exploring LSTMs – The first time I learned about LSTMs, my eyes glazed over. Not in a good, jelly donut kind of way. It turns out LSTMs are a fairly simple extension to neural networks, and they’re behind a lot of the amazing achievements deep learning has made in the past few years. So I’ll try to present them as intuitively as possible – in such a way that you could have discovered them yourself.

An Algorithm Summarizes Lengthy Text Surprisingly Well – ho has time to read every article they see shared on Twitter or Facebook, or every document that’s relevant to their job? As information overload grows ever worse, computers may become our only hope for handling a growing deluge of documents. And it may become routine to rely on a machine to analyze and paraphrase articles, research papers, and other text for you.

Divide and Conquer: How Microsoft researchers used AI to master Ms. Pac-Man – Microsoft researchers have created an artificial intelligence-based system that learned how to get the maximum score on the addictive 1980s video game Ms. Pac-Man, using a divide-and-conquer method that could have broad implications for teaching AI agents to do complex tasks that augment human capabilities.

Open Source Datasets – A large-scale, high-quality dataset of URL links to approximately 300,000 video clips that cover 400 human action classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400 video clips. Each clip is human annotated with a single action class and lasts around 10s.

One Model To Learn Them All – Deep learning yields great results across many fields, from speech recognition, image classification, to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task. Our model architecture incorporates building blocks from multiple domains.

THE POWER OF GTC – GTC is the largest and most important event of the year for GPU developers. GTC and the global GTC event series offer valuable training and a showcase of the most vital work in the computing industry today – including artificial intelligence and deep learning, healthcare, virtual reality, accelerated analytics, and self-driving cars.

Artificial intelligence can now predict suicide with remarkable accuracy – When someone commits suicide, their family and friends can be left with the heartbreaking and answerless question of what they could have done differently. Colin Walsh, data scientist at Vanderbilt University Medical Center, hopes his work in predicting suicide risk will give people the opportunity to ask “what can I do?” while there’s still a chance to intervene.

A developer’s guide to the Internet of Things (IoT) – The Internet of Things (IoT) is an area of rapid growth and opportunity. Technical innovations in networks, sensors and applications, coupled with the advent of ‘smart machines’ have resulted in a huge diversity of devices generating all kinds of structured and unstructured data that needs to be processed somewhere.

Neural Networks for Machine Learning – Learn about artificial neural networks and how they’re being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We’ll emphasize both the basic algorithms and the practical tricks needed to get them to work well.

If you have found above useful, please don’t forget to share with others on social media.

How would you validate-test a predictive model?

Why evaluate/test model at all?

Evaluating the performance of a model is one of the most important stages in predictive modeling, it indicates how successful model has been for the dataset. It enables to tune parameters and in the end test the tuned model against a fresh cut of data.

Below we will look at few most common validation metrics used for predictive modeling. The choice of metrics influences how you weight the importance of different characteristics in the results and your ultimate choice of which machine learning algorithm to choose. Before we move on to a variety of metrics lets get basics right.

Golden rules for validating-testing a model.

Rule #1

Never use same data for training and testing!!!

Rule #2

Look at Rule #1

What it means is, always leave cut of data that are not included in training to test your model against after fitting/tuning is finished.

In some cases, it may seem like ‘losing’ part of the training set, especially when data sample is not large enough.There are ways around it as well, one of most popular is ‘Cross Validation’.

Cross Validation

It is simple ‘trick’ that splits data into n equal parts. Then successively hold out each part and fit the model using the rest. This gives n estimates of model performance that can be combined into an overall measure. Although very ‘heavy’ from computing point of view, very efficient and widely used method to avoid overfitting and improve ‘out of sample’ performance of the model.

Two main types of predictive models.

Talking about a predictive modeling, there are two main types of problems to be solved:

Regression problems are those where you are trying to predict or explain one thing (dependent variable) using other things (independent variables) with continuous output eg exact price of a stock next day.

Classification problems try to determine group membership by deriving probabilities eg. will the stock price go up/down or will not change next day. Algorithms like SVM and KNN create a class output. Algorithms like Logistic Regression, Random Forest, Gradient Boosting, Adaboost etc. give probability outputs. Converting probability outputs to class output is just a matter of creating a threshold probability.

In regression problems, we do not have such inconsistencies in output. The output is always continuous in nature and requires no further treatment.

Techniques/metrics for model validation.

Having the dataset divided, and model fitted there is a question, what kind of quantifiable validation metrics to use. There are few very basic quick and dirty methods to check performance. One of them is value range – if model outputs are far outside of the response variable range, that would immediately indicate poor estimation or model inaccuracy.

Most often there is a need to use something more sophisticated and ‘scientific’.

Accuracy

Accuracy is a classification metric, it the number of correct predictions made as a ratio of all predictions. Probably it is the most common evaluation metric for classification problems.

Below is an example of calculating classification accuracy using Scikit-Learn.

Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question that this metric answer is of all passengers that labeled as survived, how many actually survived? High precision relates to the low false positive rate.

F1 score

F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution.

Confusion Matrix

It is a matrix of dimension N x N, where N is the number of classes being predicted. The confusion matrix is a presentation of the accuracy of a model with two or more classes. The table presents predictions on the x-axis and accuracy outcomes on the y-axis.

There are four possible options:

True positives (TP), which are the instances that are positives and are classified as positives.

False positives (FP), which are the instances that are negatives and are classified as positives.

False negatives (FN), which are the instances that are positives and are classified as negatives.

True negatives (TN), which are the instances that are negatives and are classified as negatives.

Below is an example code to compute confusion matrix, using ScikitLearn.

Gain and Lift Chart.

Many times the measure of the overall effectiveness of the model is not enough. It may be important to know if the model does increasingly better with more data. Is there any marginal improvement in the model’s predictive ability if for example, we consider 70% of the data versus only 50%?

The lift charts represent the actual lift for each percentage of the population, which is defined as the ratio between the percentage of positive instances found by using the model and without using it.

These types of charts are common in business analytics of Direct Marketing where the problem is to identify if a particular prospect was worth calling.

Kolmogorov-Smirnov Chart.

This non-parametric statistical test is used to compare two distributions, to assess how close they are to each other. In this context, one of the distributions is the theoretical distribution that the observations are supposed to follow (usually a continuous distribution with one or two parameters, such as Gaussian), while the other distribution is the actual, empirical, parameter-free, discrete distribution computed on the observations.

KS is maximum difference between % cumulative Goods and Bads distribution across score/probability bands. The gains table typically has % cumulative Goods (or Event) and % Cumulative Bads (Or Non-event) across 10/20 score bands. Using gains table, we can find the KS for the model which has been used for creating gains table.

KS is point estimate, meaning it is only one value and indicate the score/probability band where separate between Goods (or Event) and Bads (or Non-event) is maximum.

Area Under the ROC curve (AUC – ROC)

The ROC curve is almost independent of the response rate. The Receiver Operating Characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier. The curve is created by plotting the true positive rate (TP) against the false positive rate (FP) at various threshold settings. This is one of the popular metrics used in the industry. The biggest advantage of using ROC curve is that it is independent of the change in

The biggest advantage of ROC curve is that it is independent of the change in the proportion of responders.

An area under ROC Curve (or AUC) is a performance metric for binary classification problems. The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random.

Logarithmic Loss

Log Loss quantifies the accuracy of a classifier by penalizing false classifications. Minimizing the Log Loss is basically equivalent to maximizing the accuracy of the classifier.

In order to calculate Log Loss, the classifier must assign a probability to each class rather than simply give the most likely class. A perfect classifier would have a Log Loss of precisely zero. Less ideal classifiers have progressively larger values of Log Loss.

Mean Absolute Error

The mean absolute error (MAE) is a quantity used to measure how close forecasts or predictions are to the eventual outcomes.

The Mean Absolute Error (or MAE) is the sum of the absolute differences between predictions and actual values. The measure gives an idea of the magnitude of the error, but no idea of the direction.

Mean Squared Error

One of the most common measures used to quantify the performance of the model. It is an average of the squares of the difference between the actual observations and those predicted. The squaring of the errors tends to heavily weight statistical outliers, affecting the accuracy of the results.

Root Mean Squared Error (RMSE)

RMSE is the most popular evaluation metric used in regression problems. It follows an assumption that error is unbiased and follow a normal distribution.

RMSE metric is given by:

where N is Total Number of Observations.

R^2 Metric

The R^2 (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination. This is a value between 0 and 1 for no-fit and perfect fit respectively.

Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch.

If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with hacking skills you need to get started as a data scientist. Today’s messy glut of data holds answers to questions no one’s even thought to ask. This book provides you with the know-how to dig those answers out.

Statistical methods are a key part of of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what's important and what's not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

Why exploratory data analysis is a key preliminary step in data science
How random sampling can reduce bias and yield a higher quality dataset, even with big data
How the principles of experimental design yield definitive answers to questions
How to use regression to estimate outcomes and detect anomalies
Key classification techniques for predicting which categories a record belongs to
Statistical machine learning methods that “learn” from data
Unsupervised learning methods for extracting meaning from unlabeled data

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.

In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

The Data Science Handbook contains interviews with 25 of the world s best data scientists. We sat down with them, had in-depth conversations about their careers, personal stories, perspectives on data science and life advice. In The Data Science Handbook, you will find war stories from DJ Patil, US Chief Data Officer and one of the founders of the field. You ll learn industry veterans such as Kevin Novak and Riley Newman, who head the data science teams at Uber and Airbnb respectively. You ll also read about rising data scientists such as Clare Corthell, who crafted her own open source data science masters program. This book is perfect for aspiring or current data scientists to learn from the best. It s a reference book packed full of strategies, suggestions and recipes to launch and grow your own data science career.

Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.

You’ll learn the steps necessary to create a successful machine-learning application with Python and the scikit-learn library. Authors Andreas Müller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the math behind them. Familiarity with the NumPy and matplotlib libraries will help you get even more from this book.

With this book, you’ll learn:

Fundamental concepts and applications of machine learning
Advantages and shortcomings of widely used machine learning algorithms
How to represent data processed by machine learning, including which data aspects to focus on
Advanced methods for model evaluation and parameter tuning
The concept of pipelines for chaining models and encapsulating your workflow
Methods for working with text data, including text-specific processing techniques
Suggestions for improving your machine learning and data science skills

“TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) passed between them.”

To quote the TensorFlow website, TensorFlow is an “open source software library for numerical computation using data flow graphs”. Name TensorFlow derives from the operations which neural networks perform on multidimensional data arrays, often referred to as ‘tensors’. It is using data flow graphs and is capable of building and training variety of different machine learning algorithms including deep neural networks, at the same time, it is general enough to be applicable in a wide variety of other domains as well. Flexible architecture allows deploying computation to one or more CPUs or GPU in a desktop, server, or mobile device with a single API.

TensorFlow is Google Brain’s second generation machine learning system, released as open source software in 2015. TensorFlow is available on 64-bit Linux, macOS, and mobile computing platforms including Android and iOS. TensorFlow provides a Python API, as well as C++, Haskell, Java and Go APIs. Google’s machine learning framework became lately ‘hottest’ in data science world, it is particularly useful for building deep learning systems for predictive models involving natural language processing, audio, and images.

What is ‘Graph’ or ‘Data Flow Graph’? What is TensorFlow Session?

Trying to define what TensorFlow is, it is hard to avoid using word ‘graph’, or ‘data flow graph’, so what is that? The shortest definition would be, TensorFlow Graph is a description of computations. Deep learning (neural networks with many layers) uses mostly very simple mathematical operations – just many of them, on high dimensional data structures(tensors). Neural networks can have thousands or even millions of weights. Computing them, by interpreting every step (Python) would take forever.

That’s why we create a graph made up of defined tensors and mathematical operations and even initial values for variables. Only after we’ve created this ‘recipe’ we can pass it to what TensorFlow calls a session. To compute anything, a graph must be launched in a Session. The session runs the graph using very efficient and optimized code. Not only that, but many of the operations, such as matrix multiplication, are ones that can be parallelised by supported GPU (Graphics Processing Unit) and the session will do that for you. Also, TensorFlow is built to be able to distribute the processing across multiple machines and/or GPUs.

TensorFlow programs are usually divided into a construction phase, that assembles a graph, and an execution phase that uses a session to execute operations in the graph. To do machine learning in TensorFlow, you want to create tensors, adding operations (that output other tensors), and then executing the computation (running the computational graph). In particular, it’s important to realize that when you add an operation on tensors, it doesn’t execute immediately. TensorFlow waits for you to define all the operations you want to perform and then optimizes the computation graph, ‘deciding’ how to execute the computation, before generating the data. Because of this, tensors in TensorFlow are not so much holding the data as a placeholder for holding the data, waiting for the data to arrive when a computation is executed.

Prerequisites:

NEURAL NETWORKS – basics.

Before we move on to create our first model in TensorFlow, we’ll need to get the basics right, talk a bit about the structure of a simple neural network.

A simple neural network has some input units where the input goes. It also has hidden units, so-called because from a user’s perspective they’re hidden. And there are output units, from which we get the results. Off to the side are also bias units, which are there to help control the values emitted from the hidden and output units. Connecting all of these units are a bunch of weights, which are just numbers, each of which is associated with two units. The way we train neural network is to assign values to all those weights. That’s what training a neural network does, find suitable values for those weights. One step in “running” the neural network is to multiply the value of each weight by the value of its input unit, and then to store the result in the associated unit.

There is plenty of resources available online to get more background on the neural networks architectures, few examples below:

Let’s do it… TensorFlow first example code.

To keep things simple let’s start with ‘Halo World’ example.

importing TensorFlow

import tensorflow as tf

Declaring constants/variables, TensorFlow constants can be declared using the tf.constant function, and variables with the tf.Variable function. The first element in both is the value to be assigned the constant/variable when it is initialised. TensorFlow will infer the type of the constant/variable initialised value, but it can also be set explicitly using the optional dtype argument. It’s important to note that, as the Python code runs through these commands, the variables haven’t actually been declared as they would have been if you just had a standard Python declaration.

x = tf.constant(2.0)
y = tf.Variable(3.0)

Lets make our code compute something, simple multiplication.

z = y * x

Now comes the time when we would like to see the outcome, except nothing, has been computed yet… welcome to the TensorFlow. To make use of TensorFlow variables and perform calculations, Session must be created and all variables must be initialized. We can do it using the following statements.

Through a series of recent breakthroughs, deep learning has boosted the entire field of machine learning. Now, even programmers who know close to nothing about this technology can use simple, efficient tools to implement programs capable of learning from data. This practical book shows you how.

By using concrete examples, minimal theory, and two production-ready Python frameworks—scikit-learn and TensorFlow—author Aurélien Géron helps you gain an intuitive understanding of the concepts and tools for building intelligent systems. You’ll learn a range of techniques, starting with simple linear regression and progressing to deep neural networks. With exercises in each chapter to help you apply what you’ve learned, all you need is programming experience to get started.

This guide starts with the fundamentals of the TensorFlow library which includes variables, matrices, and various data sources. Moving ahead, you will get hands-on experience with Linear Regression techniques with TensorFlow. The next chapters cover important high-level concepts such as neural networks, CNN, RNN, and NLP.

Once you are familiar and comfortable with the TensorFlow ecosystem, the last chapter will show you how to take it to production.

What you will learn
Become familiar with the basics of the TensorFlow machine learning library
Get to know Linear Regression techniques with TensorFlow
Learn SVMs with hands-on recipes
Implement neural networks and improve predictions
Apply NLP and sentiment analysis to your data
Master CNN and RNN through practical recipes
Take TensorFlow into production

TensorFlow is currently the leading open-source software for deep learning, used by a rapidly growing number of practitioners working on computer vision, Natural Language Processing (NLP), speech recognition, and general predictive analytics. This book is an end-to-end guide to TensorFlow designed for data scientists, engineers, students and researchers.

With this book you will learn how to:

Get up and running with TensorFlow, rapidly and painlessly
Build and train popular deep learning models for computer vision and NLP
Apply your advanced understanding of the TensorFlow framework to build and adapt models for your specific needs
Train models at scale, and deploy TensorFlow in a production setting

TensorFlow, a popular library for machine learning, embraces the innovation and community-engagement of open source, but has the support, guidance, and stability of a large corporation. Because of its multitude of strengths, TensorFlow is appropriate for individuals and businesses ranging from startups to companies as large as, well, Google. TensorFlow is currently being used for natural language processing, artificial intelligence, computer vision, and predictive analytics. TensorFlow, open sourced to the public by Google in November 2015, was made to be flexible, efficient, extensible, and portable. Computers of any shape and size can run it, from smartphones all the way up to huge computing clusters. This book is for anyone who knows a little machine learning (or not) and who has heard about TensorFlow, but found the documentation too daunting to approach. It introduces the TensorFlow framework and the underlying machine learning concepts that are important to harness machine intelligence. After reading this book, you should have a deep understanding of the core TensorFlow API.

Being able to make near-real-time decisions is becoming increasingly crucial. To succeed, we need machine learning systems that can turn massive amounts of data into valuable insights. But when you're just starting out in the data science field, how do you get started creating machine learning applications? The answer is TensorFlow, a new open source machine learning library from Google. The TensorFlow library can take your high level designs and turn them into the low level mathematical operations required by machine learning algorithms.

Machine Learning with TensorFlow teaches readers about machine learning algorithms and how to implement solutions with TensorFlow. It starts with an overview of machine learning concepts and moves on to the essentials needed to begin using TensorFlow. Each chapter zooms into a prominent example of machine learning. Readers can cover them all to master the basics or skip around to cater to their needs. By the end of this book, readers will be able to solve classification, clustering, regression, and prediction problems in the real world.

AlphaGo’s next move – Chinese Go Grandmaster and world number one Ke Jie departed from his typical style of play and opened with a “3:3 point” strategy – a highly unusual approach aimed at quickly claiming corner territory at the start of the game.

Integrate Your Amazon Lex Bot with Any Messaging Service – Is your Amazon Lex chatbot ready to talk to the world? When it is, chances are that you’ll want it to be able to interact with as many users as possible. Amazon Lex offers built-in integration with Facebook, Slack and Twilio. But what if you want to connect to a messaging service that isn’t supported? Well, there’s an API for that–the Amazon Lex API.

How Our Company Learned to Make Better Predictions About Everything – In Silicon Valley, everyone makes bets. Founders bet years of their lives on finding product-market fit, investors bet billions on the future value of ambitious startups, and executives bet that their strategies will increase a company’s prospects. Here, predicting the future is not a theoretical superpower, it’s part of the job.

Are Pop Lyrics Getting More Repetitive? – In 1977, the great computer scientist Donald Knuth published a paper called The Complexity of Songs, which is basically one long joke about the repetitive lyrics of newfangled music (example quote: “the advent of modern drugs has led to demands for still less memory, and the ultimate improvement of Theorem 1 has consequently just been announced”).

Home advantages and wanderlust – When Burnley got beat 3-1 by Everton at Goodison Park on the 15th April, 33 games into their Premier League season, they’d gained only 4 points out of a possible 51 in their away fixtures. But during this time they’d also managed to accrue 32 points out of a possible 48 at Turf Moor; if the league table were based upon only home fixtures, they’d be in a highly impressive 6th place.

The Simple, Economic Value of Artificial Intelligence – How does this framing now apply to our emerging AI revolution?After decades of promise and hype, AI seems to have finally arrived, – driven by the explosive growth of big data,inexpensive computing power and storage, and advanced algorithms like machine learning that enable us to analyze and extract insights from all that data.

Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch.

If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with hacking skills you need to get started as a data scientist. Today’s messy glut of data holds answers to questions no one’s even thought to ask. This book provides you with the know-how to dig those answers out.

Statistical methods are a key part of of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what's important and what's not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

Why exploratory data analysis is a key preliminary step in data science
How random sampling can reduce bias and yield a higher quality dataset, even with big data
How the principles of experimental design yield definitive answers to questions
How to use regression to estimate outcomes and detect anomalies
Key classification techniques for predicting which categories a record belongs to
Statistical machine learning methods that “learn” from data
Unsupervised learning methods for extracting meaning from unlabeled data

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.

In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

The Data Science Handbook contains interviews with 25 of the world s best data scientists. We sat down with them, had in-depth conversations about their careers, personal stories, perspectives on data science and life advice. In The Data Science Handbook, you will find war stories from DJ Patil, US Chief Data Officer and one of the founders of the field. You ll learn industry veterans such as Kevin Novak and Riley Newman, who head the data science teams at Uber and Airbnb respectively. You ll also read about rising data scientists such as Clare Corthell, who crafted her own open source data science masters program. This book is perfect for aspiring or current data scientists to learn from the best. It s a reference book packed full of strategies, suggestions and recipes to launch and grow your own data science career.

Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.

You’ll learn the steps necessary to create a successful machine-learning application with Python and the scikit-learn library. Authors Andreas Müller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the math behind them. Familiarity with the NumPy and matplotlib libraries will help you get even more from this book.

With this book, you’ll learn:

Fundamental concepts and applications of machine learning
Advantages and shortcomings of widely used machine learning algorithms
How to represent data processed by machine learning, including which data aspects to focus on
Advanced methods for model evaluation and parameter tuning
The concept of pipelines for chaining models and encapsulating your workflow
Methods for working with text data, including text-specific processing techniques
Suggestions for improving your machine learning and data science skills

Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch.

If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with hacking skills you need to get started as a data scientist. Today’s messy glut of data holds answers to questions no one’s even thought to ask. This book provides you with the know-how to dig those answers out.

Statistical methods are a key part of of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what's important and what's not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

Why exploratory data analysis is a key preliminary step in data science
How random sampling can reduce bias and yield a higher quality dataset, even with big data
How the principles of experimental design yield definitive answers to questions
How to use regression to estimate outcomes and detect anomalies
Key classification techniques for predicting which categories a record belongs to
Statistical machine learning methods that “learn” from data
Unsupervised learning methods for extracting meaning from unlabeled data

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.

In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

The Data Science Handbook contains interviews with 25 of the world s best data scientists. We sat down with them, had in-depth conversations about their careers, personal stories, perspectives on data science and life advice. In The Data Science Handbook, you will find war stories from DJ Patil, US Chief Data Officer and one of the founders of the field. You ll learn industry veterans such as Kevin Novak and Riley Newman, who head the data science teams at Uber and Airbnb respectively. You ll also read about rising data scientists such as Clare Corthell, who crafted her own open source data science masters program. This book is perfect for aspiring or current data scientists to learn from the best. It s a reference book packed full of strategies, suggestions and recipes to launch and grow your own data science career.

Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.

You’ll learn the steps necessary to create a successful machine-learning application with Python and the scikit-learn library. Authors Andreas Müller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the math behind them. Familiarity with the NumPy and matplotlib libraries will help you get even more from this book.

With this book, you’ll learn:

Fundamental concepts and applications of machine learning
Advantages and shortcomings of widely used machine learning algorithms
How to represent data processed by machine learning, including which data aspects to focus on
Advanced methods for model evaluation and parameter tuning
The concept of pipelines for chaining models and encapsulating your workflow
Methods for working with text data, including text-specific processing techniques
Suggestions for improving your machine learning and data science skills

Hadoop YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored on a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run IN Hadoop. As its architectural center, YARN enhances a Hadoop compute cluster in the following ways: Multitenancy, Cluster utilization, Scalability and Compatibility. Multi-tenant data processing improves an enterprises’ return on Hadoop investments. YARNs dynamic allocation of cluster resources improves utilization over more static MapReduce rules. YARN’s resource manager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes. Existing MapReduce applications developed for Hadoop 1 can run YARN without any disruptions to the processes that already work.

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

Learn fundamental components such as MapReduce, HDFS, and YARN
Explore MapReduce in depth, including steps for developing applications with it
Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
Learn two data formats: Avro for data serialization and Parquet for nested data
Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
Learn the HBase distributed database and the ZooKeeper distributed configuration service

Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through architectural considerations necessary to tie those components together into a complete tailored application, based on your particular use case.
To reinforce those lessons, the second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether designing a new Hadoop application or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.
This book covers:
Factors to consider when using Hadoop to store and model data
Best practices for moving data in and out of the system
Data processing frameworks, including MapReduce, Spark, and Hive
Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
Giraph, GraphX, and other tools for large graph processing on Hadoop
Using workflow orchestration and scheduling tools such as Apache Oozie
Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
Architecture examples for clickstream analysis, fraud detection, and data warehousing

Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of deployment, operations, or software development usually associated with distributed computing, you’ll focus on particular analyses you can build, the data warehousing techniques that Hadoop provides, and higher order data workflows this framework can produce.

Data scientists and analysts will learn how to perform a wide range of techniques, from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data systems available to build and empower data products that can handle—and actually require—huge amounts of data.

Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).

Store large datasets with the Hadoop Distributed File System (HDFS)
Run distributed computations with MapReduce
Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
Discover common pitfalls and advanced features for writing real-world MapReduce programs
Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
Load data from relational databases into HDFS, using Sqoop
Perform large-scale data processing with the Pig query language
Analyze datasets with Hive, Hadoop’s data warehousing system
Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

With Hadoop 2.x and YARN, Hadoop moves beyond MapReduce to become practical for virtually any type of data processing. Hadoop 2.x and the Data Lake concept represent a radical shift away from conventional approaches to data usage and storage. Hadoop 2.x installations offer unmatched scalability and breakthrough extensibility that supports new and existing Big Data analytics processing methods and models.

Hadoop® 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers, and to navigate the powerful technologies that complement it.

This guide is ideal if you want to learn about Hadoop 2 without getting mired in technical details. Douglas Eadline will bring you up to speed quickly, whether you’re a user, admin, devops specialist, programmer, architect, analyst, or data scientist.

Hadoop Flume was created in the course of incubator Apache project to allow you to flow data from a source into your Hadoop environment. In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation (and in Flume, among other paradigms that use this term, the sink of one operation can be the source for the next downstream operation). A decorator is an operation on the stream that can transform the stream in some manner, which could be to compress or uncompress data, modify data by adding or removing pieces of information, and more. Flume allows you a number of different configurations and topologies, allowing you to choose the right setup for your application. Flume is a distributed system which runs across multiple machines. It can collect large volumes of data from many applications and systems. It includes mechanisms for load balancing and failover, and it can be extended and customized in many ways. Flume is a scalable, reliable, configurable and extensible system for management the movement of large volumes of data.

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

Learn fundamental components such as MapReduce, HDFS, and YARN
Explore MapReduce in depth, including steps for developing applications with it
Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
Learn two data formats: Avro for data serialization and Parquet for nested data
Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
Learn the HBase distributed database and the ZooKeeper distributed configuration service

Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through architectural considerations necessary to tie those components together into a complete tailored application, based on your particular use case.
To reinforce those lessons, the second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether designing a new Hadoop application or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.
This book covers:
Factors to consider when using Hadoop to store and model data
Best practices for moving data in and out of the system
Data processing frameworks, including MapReduce, Spark, and Hive
Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
Giraph, GraphX, and other tools for large graph processing on Hadoop
Using workflow orchestration and scheduling tools such as Apache Oozie
Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
Architecture examples for clickstream analysis, fraud detection, and data warehousing

Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of deployment, operations, or software development usually associated with distributed computing, you’ll focus on particular analyses you can build, the data warehousing techniques that Hadoop provides, and higher order data workflows this framework can produce.

Data scientists and analysts will learn how to perform a wide range of techniques, from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data systems available to build and empower data products that can handle—and actually require—huge amounts of data.

Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).

Store large datasets with the Hadoop Distributed File System (HDFS)
Run distributed computations with MapReduce
Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
Discover common pitfalls and advanced features for writing real-world MapReduce programs
Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
Load data from relational databases into HDFS, using Sqoop
Perform large-scale data processing with the Pig query language
Analyze datasets with Hive, Hadoop’s data warehousing system
Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

With Hadoop 2.x and YARN, Hadoop moves beyond MapReduce to become practical for virtually any type of data processing. Hadoop 2.x and the Data Lake concept represent a radical shift away from conventional approaches to data usage and storage. Hadoop 2.x installations offer unmatched scalability and breakthrough extensibility that supports new and existing Big Data analytics processing methods and models.

Hadoop® 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers, and to navigate the powerful technologies that complement it.

This guide is ideal if you want to learn about Hadoop 2 without getting mired in technical details. Douglas Eadline will bring you up to speed quickly, whether you’re a user, admin, devops specialist, programmer, architect, analyst, or data scientist.

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected as a distributed transaction log, making it highly valuable for enterprise infrastructures to process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library. The design is heavily influenced by transaction logs. Apache Kafka was originally developed by LinkedIn and was subsequently open sourced in early 2011. Graduation from the Apache Incubator occurred on 23 October 2012. Due to its widespread integration into enterprise-level infrastructures, monitoring Kafka performance at scale has become an increasingly important issue. Monitoring end-to-end performance requires tracking metrics from brokers, consumer, and producers, in addition to monitoring ZooKeeper, which is used by Kafka for coordination among consumers. There are currently several monitoring platforms to track Kafka performance, both open-source, like LinkedIn’s Burrow, as well as paid, like Datadog. In addition to these platforms, collecting Kafka data can also be performed using tools commonly bundled with Java, including JConsole.

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

Learn fundamental components such as MapReduce, HDFS, and YARN
Explore MapReduce in depth, including steps for developing applications with it
Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
Learn two data formats: Avro for data serialization and Parquet for nested data
Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
Learn the HBase distributed database and the ZooKeeper distributed configuration service

Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through architectural considerations necessary to tie those components together into a complete tailored application, based on your particular use case.
To reinforce those lessons, the second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether designing a new Hadoop application or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.
This book covers:
Factors to consider when using Hadoop to store and model data
Best practices for moving data in and out of the system
Data processing frameworks, including MapReduce, Spark, and Hive
Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
Giraph, GraphX, and other tools for large graph processing on Hadoop
Using workflow orchestration and scheduling tools such as Apache Oozie
Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
Architecture examples for clickstream analysis, fraud detection, and data warehousing

Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of deployment, operations, or software development usually associated with distributed computing, you’ll focus on particular analyses you can build, the data warehousing techniques that Hadoop provides, and higher order data workflows this framework can produce.

Data scientists and analysts will learn how to perform a wide range of techniques, from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data systems available to build and empower data products that can handle—and actually require—huge amounts of data.

Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).

Store large datasets with the Hadoop Distributed File System (HDFS)
Run distributed computations with MapReduce
Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
Discover common pitfalls and advanced features for writing real-world MapReduce programs
Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
Load data from relational databases into HDFS, using Sqoop
Perform large-scale data processing with the Pig query language
Analyze datasets with Hive, Hadoop’s data warehousing system
Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

With Hadoop 2.x and YARN, Hadoop moves beyond MapReduce to become practical for virtually any type of data processing. Hadoop 2.x and the Data Lake concept represent a radical shift away from conventional approaches to data usage and storage. Hadoop 2.x installations offer unmatched scalability and breakthrough extensibility that supports new and existing Big Data analytics processing methods and models.

Hadoop® 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers, and to navigate the powerful technologies that complement it.

This guide is ideal if you want to learn about Hadoop 2 without getting mired in technical details. Douglas Eadline will bring you up to speed quickly, whether you’re a user, admin, devops specialist, programmer, architect, analyst, or data scientist.