I recently came across a paper by Lisandro Kaunitz, Shenjun Zhong and Javier Kreiner titled Beating the bookies with their own numbers – and how the online sports betting market is rigged. The paper describes a method the authors used to make profits on football games. It also exposes the practices bookmakers employ to restrict successful betting strategies. There are some interesting techniques used in the paper, and some problems with the strategy that I have identified. We will dig into both.

The first finding is that 1 divided by the average closing odds for a football game, drawn from a collection of no fewer than 3 bookmakers, is a good proxy for the true probability of the game's outcome. This is important. In order to profit from betting we need to find a signal that gives us confidence our prediction (i.e. our ability to predict the outcome of the game) is better than 1/n, where n is the number of possible outcomes. In this case n is 3, since a game of football has 3 possible outcomes: home team wins, away team wins, or the game ends in a draw.

To confirm the consensus probability is a good proxy for game outcomes, the authors performed an analysis against 10 years of football games where each game's result and the bookmakers' closing odds were recorded. The analysis begins with expressing the average closing odds for each game outcome as a consensus probability between 0 and 1. This is calculated as follows:
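
The paper's exact notation has not survived in this copy of the article, so the following is a hedged reconstruction from the description above: the consensus probability for outcome i is the inverse of the mean closing odds quoted by the N bookmakers (N >= 3) for that outcome.

\bar{o}_i = \frac{1}{N}\sum_{j=1}^{N} o_{i,j}, \qquad \bar{p}_i = \frac{1}{\bar{o}_i}, \qquad i \in \{\text{home},\ \text{draw},\ \text{away}\}

Note that the three implied probabilities will sum to slightly more than 1 because of the bookmakers' built-in margin, so treat this as the uncorrected form described in the text rather than the paper's exact definition.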

They then took the consensus probabilities for each outcome and grouped them into bins of equal size 0.0125 between 0 and 1. For each bin they calculated the actual probability of each game outcome by taking the frequency of games that resulted in each possible outcome and dividing it by the total number of games in the bin. They then compared each bin's actual probability for each outcome to its consensus probability. To avoid sparse results each bin had to contain a minimum of 100 games.

The hope here is that the consensus probability for each bin and game outcome would agree with the actual probability of each bin's game outcomes. That is indeed what they found. There is a strong linear relationship between the consensus probability and the actual probability of game outcomes. Here is a reproduction of that linear relationship.

This chart is convincing. It suggests the consensus probability is a good predictor of the actual probability of game outcomes. However, let's not take it at face value. Let's put it to the test and see how we do predicting the outcome of each game when there is a clear winner. A clear winner is a game where one outcome has a higher consensus probability than the other outcomes.

I ran a simulation over the historic data and found that out of 479,388 games, 478,387 had a clear winner. From those games I was able to predict the outcome 52% of the time. Since there are 3 outcomes in a football game, the probability of predicting the winner with no information is 1/3, or 0.33. Our accuracy is 57% better than a random guess, which is good. But there is a catch. Betting accuracy does not translate into betting profits. Profits depend on the odds available to us, and those odds often return less than our initial bet size. In other words, we risk more than we can make. Let's walk through an example to understand why this is a problem.

If the odds returned by the 52% of our winning bets were on average 1.8:1, then for each winning $100 bet we would make $80. For the other 48% we lose $100 per bet. If we bet on 100 games we make $4,160 and lose $4,800, leaving us with a total loss of $640.
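
The same arithmetic expressed as an expected value makes the break-even point explicit. Writing the payout as decimal-style odds o (the quoted 1.8:1 behaves like o = 1.8, i.e. an $80 profit on a winning $100 bet) and the win probability as p, the expected profit per $1 staked is

E[\text{profit}] = p(o - 1) - (1 - p) = p\,o - 1

With p = 0.52 and o = 1.8 this is 0.52 × 1.8 − 1 = −0.064, a loss of $6.40 per $100 bet, or $640 over 100 bets, matching the figures above. To break even at 52% accuracy the average odds would need to be at least 1/0.52 ≈ 1.92.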

So where do we go from here? The consensus probability is a good predictor of the game's outcome. However, recall that it is derived from the bookmakers' odds, so the two move together: the more certain the market is about an outcome, the shorter the odds on offer. In other words, the more we are sure of the outcome the less we can profit.

This leads us to the second finding. To increase our ability to profit we need to find favourable conditions where the maximum odds offered by a bookmaker imply a probability lower than the consensus probability, i.e. the odds are more generous than the market consensus suggests they should be. More specifically, we want to find the outcome where the payoff, which is a combination of the consensus probability and the maximum odds, is the greatest. The formula taken from the paper is as follows:
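
The equation did not survive in this copy of the article, so the following is a reconstruction consistent with the surrounding description: the expected profit per unit staked on outcome i, using the maximum odds on offer across bookmakers.

\text{payoff}_i = \bar{p}_i \cdot o_i^{\max} - 1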

The authors add a margin term to control the minimum payoff. The revised formula is:
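
Again reconstructing from the description, the margin simply raises the bar that the expected payoff must clear before a bet is placed:

\text{payoff}_i = \bar{p}_i \cdot o_i^{\max} - 1 - \text{margin}, \qquad \text{bet only if } \text{payoff}_i > 0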

The margin controls the minimum spread that needs to exist between the average odds and the maximum odds. As the margin increases, the number of games we can bet on decreases, because large spreads are outliers in the distribution of odds and by definition are rare events.

The next step is to choose an appropriate margin parameter. To do that the authors ran a series of simulations with margins between 0.01 and 0.1. After analysing the results they settled on 0.05. This produced the most profit with the largest number of games to bet on.

They back-tested the strategy by placing $50 bets on each game that delivered a payoff greater than 0. The back-test produced a 44.4% accuracy and yielded an average return of 3.5% per bet! The strategy bet on 56,435 games over 10 years and returned an overall profit of $98,865. The results of the simulation have been reproduced below.

At first glance these results look compelling. However, there are some problems we need to explore.

The first problem I noticed is the accumulated losses at the beginning of the strategy. If we zoom into the first 250 games we see a streak of losses. Under the original betting conditions our bank balance would at one point have been negative $800. And that’s not the worst case scenario.

If we started betting 5000 games into the period we would have been down around $10,000 on $50 bets. That’s a very tough situation to be in. Would you have continued with the strategy? I’m confident I would have ceased betting.

The problem arises because the staking strategy (the amount we bet) assumes we can cover our losses. This is unrealistic and over the long run will lead to gambler's ruin! There are a number of staking strategies we can use to minimise this problem, however they won't help because there is a more serious problem we explore later.

The second problem is that the strategy assumes we live in a perfect world. Do we really believe we can bet on 56,435 games over ten years and not miss any? That is on average ~16 bets a day, and it assumes games are uniformly distributed, which we know is not true. There will be peak periods of activity that could introduce uncertainty in our capacity to place bets. For example, holiday periods, weekends and when we are sleeping, just to name a few. In fact the authors experienced this problem first hand. When they started live betting the strategy they found 30% of the games identified by the strategy could not be fulfilled. So they went back and simulated this uncertainty by randomly dropping 30% of the favourable games. In the paper they stated the strategy was still profitable but did not provide results. So let's simulate what this uncertainty would do to the results. We generated 1,000 sample runs placing $50 bets while randomly dropping 30% of the games the strategy identified.
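
A minimal sketch of that Monte Carlo procedure is shown below. The won and odds arrays are placeholders standing in for the per-game results of the historical back-test (the 44.4% hit rate comes from the paper; the flat decimal odds of 2.33 are purely illustrative, chosen so the average return lands near the reported 3.5%), so the numbers it prints will not exactly match the article's.

import numpy as np

rng = np.random.default_rng(42)
STAKE = 50          # flat $50 bets, as in the back-test
KEEP = 0.70         # assume we only manage to place ~70% of the bets the strategy finds
N_GAMES = 56_435    # favourable games found over the 10-year period
N_RUNS = 1_000      # number of sample runs

# Placeholder per-game data; in practice these come from the historical simulation.
won = rng.random(N_GAMES) < 0.444    # 44.4% of bets won
odds = np.full(N_GAMES, 2.33)        # illustrative decimal odds for every bet

final_profits = []
for _ in range(N_RUNS):
    placed = rng.random(N_GAMES) < KEEP    # randomly drop ~30% of favourable games
    profit = np.where(won[placed], STAKE * (odds[placed] - 1.0), -STAKE)
    final_profits.append(profit.sum())

final_profits = np.array(final_profits)
print(f"mean final profit: ${final_profits.mean():,.0f}")
print(f"std dev of final profit: ${final_profits.std():,.0f}")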

The result is we now have 39,504 games to bet on, the average return is still 3.5% and our profit is on average $69,391. We shrunk our profit by roughly 30%, which is to be expected since we dropped 30% of the games the strategy identified we should bet on. That's not the real problem here. The real problem is the wide band in the chart above, which represents how confident we should be in this result. The wider it is the less confident we should be. Remember, the world is not perfect. We cannot expect to close every available bet. Therefore we need to include uncertainty in the back-testing strategy. We do that by running many simulations (in this case 1,000) that randomly drop 30% of the games, representing the games we should have bet on but couldn't for reasons out of our control.

In this particular simulation the standard deviation of the final profit at the end of the period is $8,445. Since the final profit is close to normally distributed we can use the empirical rule to approximate the 95% confidence interval which is between $52,501 and $86,281. That means we can be 95% confident the final profit will be between these figures assuming all conditions stay the same.
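
For reference, the quoted interval is just the empirical rule applied to the simulated distribution:

\$69{,}391 \pm 2 \times \$8{,}445 = [\$52{,}501,\ \$86{,}281]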

Now let's put this into real life terms. Placing bets on 3,950 games every year may amount to a part time job for many people. If we could earn $50,000 per year working part time we would earn $500,000 after 10 years. To get to this figure using this strategy, and be 95% confident in the result, we need to increase the size of our bets to no less than $290 and up to $476 per game. The problem of course is that larger bet sizes lead to larger losses, particularly in this data set where there are long losing streaks and the strategy needs time to recover. How bad could these losing streaks be if we bet enough to earn our part time wage with 95% confidence? We would need to continue betting after incurring a loss of $11,089 within the first 100 games. That is over 1/5 of the wage we intend on making in the first year. Here is the simulation using $476 per bet.

The third problem, which is the most serious, relates to the very first finding in the paper. If you look closely at the first graph you might notice a discrepancy. The home and away win consensus probabilities range between 0 and 1, meaning the distribution of average home and away win odds offered by bookmakers covers a broad range (remember the consensus probability is derived from the average odds). The draw consensus probabilities (red dots) tell a very different story. They don't go past 0.4! We can see this more clearly with a CDF plot of consensus probabilities for each outcome.

Almost 100% of draw consensus probabilities are < 0.4 while home and away are distributed between 0 and 1. Skew between classes of data can be very problematic when it is not accounted for in models. In this particular case the skew has a significant impact on the strategy's predictive power for games that end in draws. If we recall, the criteria for betting are derived from the following formula:
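
Restating the (reconstructed) betting rule from earlier: for each game the strategy bets on the single outcome with the largest expected payoff, provided that payoff clears the margin.

i^{*} = \arg\max_{i} \left( \bar{p}_i \cdot o_i^{\max} - 1 \right), \qquad \text{bet only if } \bar{p}_{i^{*}} \cdot o_{i^{*}}^{\max} - 1 > \text{margin}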

The problem is that the spread between the average draw odds and the maximum draw odds has to be significantly higher than the spread for home or away outcomes in order for draw odds to deliver the highest payoff for a game. This is because the consensus probabilities for draws are almost always lower than the consensus probabilities for home and away outcomes. An analysis of the data demonstrates this. Only 633 of a total 479,388 games have a higher consensus probability for the draw outcome, which means the strategy will rarely pick a draw as the outcome. Yet we know empirically, and by looking at the data, that many games do end in draws (about a quarter of the games in this data set). We can show this by breaking down the prediction accuracy of the simulation we've recreated. The strategy does a terrible job of predicting games that end in draws.

Outcome     # of Games   % Predicted
Home Win    216,878      78.47%
Draw        119,793      1.36%
Away Win    142,717      43.28%

We can further demonstrate the impact this has on the overall strategy by running a simulation on the games that end in draws only.

Ouch! So what can we conclude from these findings? Can we leave our day job and become professional tipsters? I don’t think that would be wise.

While the consensus probability provides a good signal for home and away outcomes, it fails for draw outcomes. Perhaps the distribution of draw odds is simply reflecting betting patterns. This would imply people prefer to bet on home and away outcomes only, which results in very little movement in the distribution of draw odds.

I think a more plausible theory is that the imbalance in each outcome's distribution is manufactured intentionally by the bookmakers. They may choose to keep the movement in draw odds to a narrow range in order to limit the signal draw odds provide. How can they do this? By adjusting the home and away odds only as a means of balancing their overall risk. This would allow them to hide 1/3 of the signal that the act of betting generates. Either way, to improve on this strategy the imbalance in predictive power needs to be addressed to limit the risk of losing all of our capital in the event there is a streak of draws. This is important even if it means giving up some of our overall predictive power.

PS – I am not a gambler and this article does not constitute betting advice or condone gambling

How to run Google Cloud Datalab on your own Linux server

If you are into data analysis then you are probably already using Jupyter Notebooks. Some time ago Google developed their own flavour of Jupyter notebooks and released it as Google Cloud Datalab. If you are interested in running this variant it is very easy to get up and running. You will need a Linux server with docker installed.

The first time you run the docker run command (a sketch is shown below), docker will pull the Datalab image from Google, so it might take some time.

Datalab listens on port 8080 for connections. We need to tell the docker host to expose this port so we can connect to Datalab from our browser. You can either expose 8080 directly or map it to a different port. Personally, I think it's a good idea to map each instance to a different port (e.g. 8080 + N) just in case you find yourself running multiple Datalab containers. Keep in mind we have told the docker host to expose the Datalab port to everyone on the network by specifying 0.0.0.0. Datalab does not come with built-in authentication or encryption, so if your docker host is on an insecure network you may prefer to expose the Datalab port on 127.0.0.1 and then use an ssh tunnel to forward the exposed port <EXTERNAL_PORT> to the system running your desktop browser.

We also need to consider storage, as we want to preserve data if the Datalab container is stopped. Datalab is configured to read and write notebooks to /content in the running container. We can tell docker to map the user's home directory (i.e. $HOME) on the docker host to the running Datalab container's /content path. This way the notebooks are preserved when the container is stopped.
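
Putting the pieces together, here is a hedged sketch of the launch command, wrapped in Python's subprocess for convenience. The image name and tag (gcr.io/cloud-datalab/datalab:latest) and the external port of 8081 are assumptions, so check the current Datalab documentation for the exact image. The 0.0.0.0 bind matches the behaviour described above; swap in 127.0.0.1 if you plan to reach the instance through an ssh tunnel.

import os
import subprocess

EXTERNAL_PORT = 8081                             # assumed external port (8080 + 1)
IMAGE = "gcr.io/cloud-datalab/datalab:latest"    # assumed image name and tag

subprocess.run([
    "docker", "run", "-d", "--name", "datalab",
    # Bind the external port on all interfaces; use 127.0.0.1 instead on an
    # insecure network and reach it through an ssh tunnel.
    "-p", f"0.0.0.0:{EXTERNAL_PORT}:8080",
    # Persist notebooks by mapping the user's home directory to /content.
    "-v", f"{os.path.expanduser('~')}:/content",
    IMAGE,
], check=True)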

With the container up and running connect to the Datalab instance at http://<DOCKERHOST>:<EXTERNAL_PORT>/

You should see Google Cloud Datalab load in your browser.

At this point you can create a notebook and start analysing data. If you need to install or upgrade Python packages the easiest way is to invoke shell commands from within the notebook using the following syntax:

!<command> <arguments>

For example, to install Python Spark for Python 3 create a code cell in the notebook as below and run it:

!pip3 install --upgrade pyspark

Have fun!

Want to learn Artificial Intelligence, Machine Learning and Deep Learning?

Over the last 18 months I have spent my nights studying Artificial Intelligence, Machine Learning and Deep Learning. I thought it might be useful to share the resources I've used to gain my education in these fields. Below is a list of resources categorised by Artificial Intelligence, Machine Learning and Deep Learning disciplines. Some other topics are covered which will be necessary for certain disciplines.

Artificial Intelligence

Udacity have developed an Artificial Intelligence Nanodegree that includes lessons by Peter Norvig amongst others. The Nanodegree uses exercises out of the Artificial Intelligence: A Modern Approach textbook. Between full-time employment, travel and kids I found this course quite challenging. In my experience it requires you to dedicate 15-20 hours per week and it has fixed deadlines. If you decide to enrol ensure you have sufficient time to spare.

MIT have an excellent course on Artificial Intelligence available on iTunes. This course covers the major search algorithms amongst other planning problems.

Machine Learning

Introduction to Machine Learning by Sebastian Thrun and Co available at Udacity. This is a great free course if you are starting out in Machine Learning. It is a good place to gain a solid foundation before moving onto Deep Learning.

Jason Brownlee is a fellow Australian with excellent educational resources on Machine Learning. I’ve purchased his book Master Machine Learning Algorithms. Jason walks you through the major Machine Learning concepts and algorithms using Excel spreadsheets (yes that is not a typo!). It sounds primitive but it is actually a great way to unpack a Machine Learning algorithm and see the path it takes as it iteratively tries to learn the objective.

Andrew Ng's Stanford Machine Learning course is available on iTunes. If your mathematics is rusty you might find this one difficult to follow.

Deep Learning

The free Deep Learning Book available online. If you want all the theory behind Deep Learning this is a good resource.

Andrew Ng's most recent course on Deep Learning. You can take the course on Coursera or view the videos for free on YouTube. If you are interested in the mathematics behind Deep Learning this course is ideal. I particularly like the way Andrew builds on Logistic Regression as the building block of Neural Networks and Deep Neural Networks.

Udacity have a Deep Learning Nanodegree which I completed in 2017. This has received a few bad reviews however my experience was pretty good. A piece of advice is don’t start with this course. Use the other free resources to get acquainted with Deep Learning before you pursue the Nanodegree.

Many of the courses in the list require Python programming language experience. If you can program but don’t know Python check out the free course Programming Foundations with Python at Udacity.

Coding, Libraries and Frameworks

The Anaconda distribution of Python makes it very easy to get up and running with a nice package management system called conda. This can be used to install Machine Learning and Deep Learning packages.

Learn about Jupyter Notebooks. They provide a way to perform iterative programming with visualisations and interactivity mixed in.

If you are doing Machine Learning in Python scikit-learn is the de-facto standard, although certain algorithms can be slow as they have not been parallelised.

If you need a high performance Machine Learning framework check out h2o.ai. There are bindings for multiple languages and it is very fast (i.e. developed with multi-processing in mind). It is also great for out-of-core processing (i.e. where the dataset you are learning from cannot fit into memory).

TensorFlow, Keras, PyTorch and MXNet are used for Deep Learning, although in some cases they can be used for certain Machine Learning algorithms as well. The key to these frameworks is support for GPUs, which speeds up certain operations. In some cases these frameworks can also utilise multiple GPUs and even clusters of GPU-powered servers.

XGBoost is the de-facto library for tree-boosting algorithms and has various language bindings including Python and R. XGBoost is very popular in Kaggle competitions.

I like to learn by example and there is no better place for Machine Learning and Deep Learning examples than the Kaggle platform. If you want to test your skills you can always enter a Kaggle competition.

As you practice these disciplines you will need data to analyse. There are a number of public data repositories for this purpose. Have a look at the datasets available on Kaggle, UCI Machine Learning repository and AWS Public Datasets. There are plenty more but these should get you going.

Community

I often attend the Melbourne Data Science meet-up. Look for a similar community in your area and get involved in a Datathon (equivalent to a Hackathon but focused on analysis of a dataset).

arXiv contains research papers. Alternatively there is arXiv Sanity Preserver, an interface into arXiv that offers additional features like full-text search, similarity search and a recommendation system.

DeepMind is focused on AI research and owned by Google. They are best known for using Artificial Intelligence to teach a computer agent to play the game of Go. The artificial agent, named AlphaGo, beat Lee Sedol, one of the world's top Go players, in 2016. If you get a chance watch the movie. Their research is primarily focused on learning to learn through Reinforcement Learning.

ICML is the International Conference on Machine Learning. Beware it is primarily for researchers.

Data is the currency of the 21st century

Over the last 18 months my area of research has shifted from focusing on the infrastructure that supports data, to understanding data and the value one can derive from data. In my quest to unlock value from data I have researched many topics. From conventional business intelligence methods, to data visualisation, statistics, machine learning, artificial neural networks, probabilistic programming and cognitive computing. Suffice to say it is a lot to take on in a short amount of time (and my head still hurts). What I have come to appreciate is that much of the data we now generate contains invisible value that needs to be explored and unlocked. Let me explain.

From the beginning of time, the human race has collected data to support decision making. Data from surveys, data from forms, data from experiments and tests, data from the optical lens, data from analogue to digital converters. This data has been used to understand the universe, to find meaning and support objectivity. However, the amount of data and the diversity of data we can generate is limited by the methods available to capture, store, process and analyse the data.

Thanks to the digital revolution, we now have a deluge of diverse data available. We generate inordinate amounts of data in the form of speech, images and text from social media platforms, mobile devices and sensors. From this data companies can infer what you like, what you feel, what you believe in, what you are doing, where you are going, who you interact with, who you admire and what you might do next.

The amount of data we generate in each decade is orders of magnitude higher than in the previous decade. This data, along with advances in computational processing capacity and artificial intelligence, allows us to observe and recognise relationships in the universe that previously were invisible to any human being.

This revolution has led the major corporations into an artificial intelligence arms race. What may not be obvious to many is the main ingredient for this arms race is the data. Just like previous industrial revolutions which led to the mass accumulation of energy resources and minerals by a select group of families, the same can be said today with major organisations like Google, Facebook, Amazon, Baidu and Microsoft. These organisations had the foresight to recognise that data drives artificial intelligence and artificial intelligence creates an unreasonable advantage.

But what can AI possibly do with the data, you ask? With data streaming from sight (images), sound (microphone) and touch (screen) sensors one can apply AI to understand what you are doing, feeling, saying and seeing. AI can consider the connections and interactions between the data from multiple points of view (human senses) to build a greater understanding of the environment. Consider a little experiment. Next time you speak to someone, close your eyes and try to imagine how they are feeling. Are they happy, are they sad, are they calm, are they angry? The human brain can associate emotion from speech, the same way we associate objects through the visual cortex. Now, do the same but with your eyes open. Are you more confident in your answer?

With that said, let's reiterate the potential of data. Data allows us to reason about the world around us. It allows us to draw conclusions about a thought, an instinct, a situation, or a question, with a degree of certainty. Data supports our objectives, be it to make more money, solve a problem or prevent a situation. Data allows us to formulate a plan for the future or adjust our course. Data helps us prioritise, supports our curiosity, our desire to learn, advance and adapt to changing conditions. Data allows us to personalise interactions with our customers, our employees and with each other, in real-time, at immense scale.

I surmise that data rich organisations and governments will be able to do a lot of good with the data. They will be able to solve crimes, prevent the spread of diseases, anticipate social unrest and reduce our dependency on fossil fuels. They will also have the ability to predict economic conditions, domestic trends and worldwide events.

To remain relevant in this era, we all need to embrace and understand data. First and foremost, data needs to be considered in our products and interactions with customers, partners and each other. The right data needs to be generated, captured and stored where it can be explored, processed and analysed in order to infer meaning and support decisions. Once we are done with the data it needs to be preserved and protected like precious books that we refer to from time to time, to avoid mistakes in the past and strengthen our understanding of the present.

To summarise, there are a few thoughts I would like to leave you with. They are:

Never underestimate the reasoning power of data to support understanding

Consider whether you are capturing the right data and the data available at your disposal

Understanding what you might infer from the data will guide you in the types of data to capture

Consider whether you are using data to derive an edge

It doesn’t matter whether you are an individual, company, not for profit, government or educational institution. Those with the right data will have the main ingredient to advance. Those without data will fall further behind.

Data Protection for Hadoop Environments

One of the questions I often get asked is: do we need data protection for Hadoop environments?

It is an unusual question, because most of my customers don't ask whether we need data protection for Oracle, DB2, SAP, Teradata or SQL environments. I think it is safe to say the majority of these environments are always protected.

So I started thinking. What motivates people to question whether they need data protection for Hadoop?

I boiled it down to three themes:

Very few (if any) of the software backup vendors have solutions for Hadoop

Hadoop size and scalability is daunting when you consider most of us come from a background of providing data protection for monolithic environments that are in the tens of TB, rather than the hundreds or thousands of TB for a single system

Hadoop has some inbuilt data protection properties

So if we play it back, we can’t turn to our traditional backup vendors as they don’t have integrated Hadoop solutions. We are inundated by the size and scale of the problem. And so we are left with the hope that Hadoop’s inbuilt data protection properties will be good enough.

Before we dive into answering whether Hadoop's inbuilt data protection properties are good enough, I wanted to share some fundamental differences between Hadoop and traditional Enterprise systems. This will help us navigate how we should think about data protection in a Hadoop environment.

Distributed versus Monolithic

Hadoop is a distributed system that spans many servers. Not only is the computing distributed but for the standard Hadoop architecture, storage is distributed as well. Contrast this to a traditional Enterprise application that spans only a few servers and is usually reliant on centralised intelligent storage systems.

From a data protection perspective, which is going to be more challenging?

Protecting a centralised data repository that is addressable by a few servers is arguably simpler than supporting a large collection of servers each owning a portion of the data repository that is captive to the server.

When we are challenged to protect large volumes of data frequently, we are forced to resort to infrastructure-centric approaches to data protection. This is where we rely on the infrastructure to optimise the flow of data between the source of the data undergoing protection and the protection storage system. In many cases we have to resort to intelligent storage or, in the case of virtualisation, the hypervisor to provide these optimisations. However, in a standard Hadoop architecture where physical servers with direct attached storage are used, neither of these approaches is available.

To comply with the standard Hadoop architecture, the protection methodology needs to be software-driven and ideally should exploit the distributed properties of the Hadoop architecture so that data protection processing can scale with the environment.

Mixed Workloads

Hadoop systems typically support different applications and business processes, many of which will be unrelated. Contrast this to a traditional Enterprise application, which consists of a collection of related systems working together to deliver an application (e.g. Customer Relationship Management) that supports one or more business processes. For these Enterprise systems we apply consistent data protection policies across the systems that support the application. We could go one step further and treat all application upstream and downstream interfaces and dependencies with the same levels of protection.

Hadoop environments can support many Enterprise applications and business processes. Each application will vary in importance, much like production and non-production system classifications. The importance of Hadoop workloads is a characteristic we should exploit when devising a data protection strategy. If we fall into the trap of treating everything as equal, we get back to being overwhelmed by the size and scope of the problem ahead of us.

So, when thinking about data protection for Hadoop environments we should focus our efforts on the applications and corresponding data sets that matter. We may find that if we do this, the profile of the challenge ahead of us changes from being a 500TB protection problem to a 100TB protection problem using method A, a 200TB protection problem using method B, and 200TB that does not require additional levels of protection.

Hadoop Inbuilt Data Protection

There are two styles of data protection provided by Hadoop.

Onboard protection is built into the standard Hadoop File System (HDFS). The level of protection provided is confined to an individual Hadoop cluster. If the Hadoop cluster or file system suffers a major malfunction then these methods will likely be compromised.

Offboard protection is primarily concerned with creating copies of files from one Hadoop cluster to another Hadoop cluster or independent storage system.

Onboard Data Protection Features

In a typical Enterprise application we usually rely on the storage system to provide data redundancy both from an access layer (storage controllers) and persistence layer (storage media).

In Hadoop, redundancy at the persistence layer is provided by maintaining N-way local copies (3 by default) of each data block across data nodes. This is called the system-wide replication factor and is universally applied when files are created. The replication factor can also be defined explicitly when files are created and can be modified retrospectively. This flexibility enables redundancy to be aligned to the importance and value of data.
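
As an illustration, the per-file replication factor can be inspected and changed from the HDFS client. The sketch below shells out to the standard hdfs commands; the /files/important.csv path is just a placeholder.

import subprocess

path = "/files/important.csv"   # placeholder path

# Show the current replication factor (reported in the second column of the listing).
subprocess.run(["hdfs", "dfs", "-ls", path], check=True)

# Raise the replication factor of this file to 4; -w waits until the extra
# replicas have actually been created.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "4", path], check=True)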

Redundancy at the access layer is provided by the NameNode. The NameNode tracks the location and availability of data blocks and instructs Hadoop clients which data nodes should be used when reading and writing data blocks.

Hadoop provides proactive protection. Many filesystems assume data stays correct after it is written. This is a naive assumption that has been researched extensively. HDFS protects against data corruption by verifying the integrity of data blocks and repairing them from replica copies. This process occurs as a background task periodically and when Hadoop clients read data from HDFS.

HDFS supports snapshots of the entire file system or of directories. Nested snapshot directories are not allowed. HDFS snapshots are read-only logical copies of directories and files that preserve the state of persistent data. One unexpected behaviour of HDFS snapshots is that they are not atomic. In our testing, HDFS snapshots only preserve consistency on file close. What this means is that if a snapshot is taken of an open file, the resulting snapshot will represent the file data once the open file is closed. This behaviour is at odds with traditional storage and filesystem based snapshot implementations that preserve data at the time of the snapshot regardless of state.

To illustrate this point we created a snapshot (s1) of a closed file then we started appending to the same file and at the same time created another snapshot (s2). Then we compared the size of the original file with the two snapshot versions. The first snapshot did not grow and represented the file data before the append operation. The second snapshot grew after it was created and represented the file data after the append operation completed (i.e. on file close).
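
A hedged sketch of that test using the standard HDFS snapshot commands is below. The directory and file names are placeholders, and the append is driven from a local file purely to keep the write open long enough to take the second snapshot.

import subprocess

def hdfs(*args):
    # Run an hdfs CLI command and raise if it fails.
    subprocess.run(["hdfs", *args], check=True)

DIR = "/snaptest"                 # placeholder snapshottable directory
FILE = "/snaptest/data.log"       # placeholder file under test

hdfs("dfsadmin", "-allowSnapshot", DIR)      # enable snapshots on the directory
hdfs("dfs", "-createSnapshot", DIR, "s1")    # snapshot s1 of the closed file

# Start appending to the file and take a second snapshot while the append is in flight.
append = subprocess.Popen(["hdfs", "dfs", "-appendToFile", "bigfile.local", FILE])
hdfs("dfs", "-createSnapshot", DIR, "s2")
append.wait()

# Compare the file sizes reported by each snapshot with the live file.
hdfs("dfs", "-ls", f"{DIR}/.snapshot/s1", f"{DIR}/.snapshot/s2", DIR)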

Notice the second snapshot (s2) continues to grow as the file append runs in the background. The first snapshot (s1) does not change.

This demonstrates HDFS snapshots provide consistency only on file close. This may well be how the designers intended them to work. Coming from a storage background it is important to understand this behaviour particularly if HDFS snapshots form part of your data protection strategy.

Hadoop supports the concept of a recycle bin that is implemented by the HDFS Trash feature. When files are deleted using the HDFS client they are moved to a user-level trash folder where they persist for a predefined amount of time. Once the time elapses the file is deleted from the HDFS namespace and the referenced blocks are reclaimed as part of the space reclamation process. Files that have not expired can be moved back into a directory and in this regard can be used to undo a delete operation.

Below is an example of a file that was deleted and subsequently moved to trash.
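
A sketch of what that looks like from the HDFS client is below. The trash location pattern (/user/hadoop/.Trash/Current for a user called hadoop) and the file path are assumptions and may differ between Hadoop versions.

import subprocess

def hdfs(*args):
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Deleting a file moves it into the user's trash folder rather than removing it;
# the client reports the trash location (assumed here to be /user/hadoop/.Trash/Current).
hdfs("-rm", "/files/report.csv")

# Until trash expires, the delete can be undone by moving the file back.
hdfs("-mv", "/user/hadoop/.Trash/Current/files/report.csv", "/files/report.csv")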

Trash is implemented in the HDFS client only. If we are interfacing with HDFS using a programmatic API the function is not implemented.

Use of trash cannot be enforced. A user can bypass trash by specifying the -skipTrash argument to the HDFS client.

Offboard Data Protection Features

Hadoop supports copying files in and out of Hadoop clusters using Hadoop Distributed Copy (distcp). This technology is designed to copy large volumes of data either within the same Hadoop cluster, to another Hadoop cluster, or to Amazon S3 (or an S3 compliant object store), OpenStack Swift, FTP or NAS storage visible to all Hadoop data nodes. Support for Microsoft Azure has been included in the 2.7.0 release. Refer to HADOOP-9629.

A key benefit of distcp is it uses Hadoop’s parallel processing model (MapReduce) to carry out the work. This is important for large Hadoop clusters as a protection method that does not leverage the distributed nature of Hadoop is bound to hit scalability limits.

One downside of the current distcp implementation is that single file replication performance is bound to one map task and data node respectively.

Why is this a problem?

Each file requiring replication can only consume the networking resources available to the data node running the corresponding map task. For many small files this does not represent a problem, as many map tasks can be used to distribute the workload. However, for very large files, which are common in Hadoop environments, performance of individual file copies will be limited by the bandwidth available to individual data nodes. For example, if data nodes are connected via 1 GbE and the average file size in the cluster is 10TB, then the minimum amount of time required to copy one average file is around 22 hours.
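
The 22-hour figure is simply the file size divided by a single node's line rate, ignoring protocol overhead:

\frac{10\ \text{TB}}{1\ \text{Gb/s}} = \frac{10 \times 10^{12} \times 8\ \text{bits}}{10^{9}\ \text{bits/s}} = 80{,}000\ \text{s} \approx 22\ \text{hours}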

To address this situation distcp will need to evolve to support distributed block-level copy, which to my knowledge is not currently available. Another option is to compress the files transparently prior to transmission, however that is not available either (see HADOOP-8065).

Hadoop Distributed Copy to EMC ViPR (S3 interface)

We previously mentioned it is possible to copy files from Hadoop to S3 compliant object stores. This section will describe how to use distributed copy with EMC ViPR object store which is S3 compliant to the extent required.

S3 support is provided by the JetS3t API (s3 and s3n drivers) or the AWS S3 SDK (s3a driver). The JetS3t API allows the S3 endpoint to be configured to a target other than Amazon Web Services (AWS). To do this we need to set some properties that will get passed to the JetS3t API.

Create a file called jets3t.properties in your hadoop configuration directory. In my case it is /usr/local/hadoop/etc/hadoop.

Here we set the s3 endpoint to our ViPR data store (if you have multiple data stores point this at your load balancer) and set the ports for HTTP and HTTPS access that ViPR listens on. We disabled the use of DNS to resolve bucket names since we did not set up wildcard DNS for S3 bucket naming. We also disabled HTTPS as it requires real certificates rather than self-signed certificates to function.
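
The properties file itself has not survived in this copy of the post, so here is a hedged reconstruction based on the description above. The property names come from the JetS3t documentation and the host and port values are placeholders for your ViPR environment, so verify them against the JetS3t version bundled with your Hadoop distribution.

# jets3t.properties - point the JetS3t S3 client at a ViPR object store
s3service.s3-endpoint=vipr.example.local
s3service.s3-endpoint-http-port=9020
s3service.s3-endpoint-https-port=9021
# Self-signed certificates break HTTPS, so use plain HTTP
s3service.https-only=false
# No wildcard DNS for bucket names, so use path-style access
s3service.disable-dns-buckets=true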

Also in the same directory is the Hadoop core-site.xml. Set the following properties in this file:
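
The exact property list has also been lost from this copy. At a minimum the object credentials issued by ViPR need to be supplied to the s3/s3n drivers; the snippet below is an assumed reconstruction using the standard Hadoop s3n property names (the s3a driver uses fs.s3a.access.key and fs.s3a.secret.key instead).

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_VIPR_OBJECT_USER</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_VIPR_SECRET_KEY</value>
</property>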

Note 1: keep this in mind when using hdfs dfs -ls against an S3 bucket.
Note 2: when copying directories from HDFS to S3, ensure there are no nested directories, as we observed distcp will fail to copy the structure back into HDFS from the parent S3 folder.

We can also protect the file copies on Data Domain using the retention lock feature. Retention lock is used to prevent files from being deleted until a certain date and time. This technique is commonly used to comply with data retention regulations but can also be used to provide enhanced levels of protection from human error and malware which for this use case is important because the protection copies are directly exposed to the system being protected.

Here is an example of using retention lock. Set the access time of the file to a date in the future.
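
A hedged example of what this looks like from a host that has the Data Domain mtree NFS-mounted. The /dd path and the filename are placeholders, and expressing the retention period as a future access time is the convention commonly used to drive Data Domain retention lock, so confirm the details against your DD OS documentation.

import os
from datetime import datetime

path = "/dd/hadoop/20150107/files/report.csv"   # protection copy on the DD mtree (placeholder)

# Lock the file until the end of 2016 by setting its access time into the future;
# the modification time is left unchanged.
retain_until = datetime(2016, 12, 31, 23, 59).timestamp()
current = os.stat(path)
os.utime(path, (retain_until, current.st_mtime))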

Note: retention lock must be licensed and enabled on the Data Domain mtree that we are using to store protection copies.

Data Domain also supports a feature called fast copy. This enables the Data Domain file system namespace to be duplicated to different areas on the same Data Domain system. The duplication process is almost instantaneous and creates a copy of the pointers representing the directories and files we are interested in copying. The copy consumes no additional space as the data remains deduplicated against the original.

The fast copy technology can be used to support an incremental forever protection strategy for Hadoop data sets with versioning enabled. Let’s demonstrate how this works in practice.

Let’s say we want to create weekly versioned copies of the HDFS /files directory on a Data Domain system.

First we create a copy of the initial version using distcp to the Data Domain mtree that we have mounted under /dd.

The first version will be copied to a directory named with the current timestamp in YYYYMMDD format. In this example we will use the date command with backticks to evaluate the current system date on the command line.
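
The original command has not survived in this copy, so here is a hedged equivalent, shelling out from Python. It assumes the Data Domain mtree is NFS-mounted at /dd/hadoop on every data node so that distcp can write to it through a file:// URI; adjust the paths to your environment.

import subprocess
from datetime import date

stamp = date.today().strftime("%Y%m%d")        # e.g. 20150107
target = f"file:///dd/hadoop/{stamp}/files"    # DD mtree mounted on all data nodes

# Seed the first versioned copy of the HDFS /files directory onto the Data Domain.
subprocess.run(["hadoop", "distcp", "/files", target], check=True)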

Immediately after creating the copy on Data Domain we want to protect it further by taking a fast copy to an mtree that is not visible to Hadoop. To do this we need to have a second mtree setup on the Data Domain system (we created /data/col1/hadoop_fc ahead of time) that we fast copy into.

To run the fast copy command we need to have ssh keys set up between the controlling host and the Data Domain. This was set up ahead of time for a user called hadoop. Keep in mind this entire process does not need to run from a Hadoop node. It should be run from an independent, trusted, secure host that has access to both Hadoop and the Data Domain.

Let's fast copy the current version from the hadoop mtree to the hadoop_fc mtree for added protection.
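
The command itself is missing from this copy of the post. The sketch below is an assumption based on the DD OS filesys fastcopy CLI invoked over ssh; the mtree paths and the Data Domain hostname are placeholders, so check the exact syntax against your DD OS release.

import subprocess
from datetime import date

stamp = date.today().strftime("%Y%m%d")
src = f"/data/col1/hadoop/{stamp}"        # mtree exposed to Hadoop
dst = f"/data/col1/hadoop_fc/{stamp}"     # mtree hidden from Hadoop

# Invoke the Data Domain fast copy over ssh as the 'hadoop' DD user.
subprocess.run([
    "ssh", "hadoop@datadomain.example.local",
    f"filesys fastcopy source {src} destination {dst}",
], check=True)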

Now a week passes and we want to preserve today's version of the HDFS /files directory. To demonstrate the differences between the two versions, let's first delete a file from HDFS to show how this approach enables versioning.

Before we create our weekly protection copy we want to fast copy the previous week's copy into today's. This way, when we distcp today's version of HDFS /files, it should recognise that the 8 files already exist, and all it needs to do is delete the one file that no longer exists (the vmlinuz-3.10.0-123.el7.x86_64_1 file we deleted from HDFS).
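
A hedged sketch of that weekly cycle, continuing the assumptions made in the earlier snippets (-update and -delete are standard distcp options that synchronise the target with the source):

import subprocess
from datetime import date, timedelta

this_week = date.today().strftime("%Y%m%d")
last_week = (date.today() - timedelta(days=7)).strftime("%Y%m%d")

# 1. Pre-populate this week's directory from last week's copy using DD fast copy.
subprocess.run([
    "ssh", "hadoop@datadomain.example.local",
    f"filesys fastcopy source /data/col1/hadoop/{last_week} "
    f"destination /data/col1/hadoop/{this_week}",
], check=True)

# 2. Synchronise the pre-populated copy with the current state of HDFS /files.
subprocess.run([
    "hadoop", "distcp", "-update", "-delete",
    "/files", f"file:///dd/hadoop/{this_week}/files",
], check=True)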

As we expected, distcp synchronised the current HDFS version of /files with this week's protection copy. To achieve that it did very little work, since the distcp target was pre-populated from the previous version using Data Domain fast copy.

We can verify by comparing HDFS /files with the current version on the Data Domain.

The protection process can be independent of Hadoop so that human error and malware are less likely to propagate to the protection process and copies i.e. delineation of controls.

All the protection copies are deduplicated against each other and stored efficiently for versioning of large data sets.

The protection copies are stored in their native format and readily accessible directly from Hadoop for processing without requiring special access, tools or having to restore them back into HDFS first.

The protection copies can be replicated between Data Domain systems for efficient offsite protection.

One other possibility that we have yet to exploit is distributing the deduplication process. In the current strategy Data Domain is required to perform the deduplication processing on behalf of the Hadoop data nodes.

What would make a great deal of sense is if we could distribute the deduplication processing to the Hadoop data nodes. That would allow the protection strategy to scale uniformly. Data Domain Boost is designed to do exactly that. It is a software development kit (SDK) that allows deduplication processing to be embedded in the data sources (applications).

As of this time the SDK is not in the public domain and there is no integration with distcp. If you think this would be a worthwhile development please leave a comment.

Vendor specific Data Protection

Thus far we have touched on the data protection concepts and methods implemented with the standard Apache Hadoop distribution. Now let's look at how each vendor's Hadoop distribution provides protection capabilities unique to that distribution.

The Cloudera CDH distribution provides a Backup and Disaster Recovery solution based on a hardened version of Hadoop distributed copy (distcp). This solution requires a second Hadoop cluster to be established as a replication target.

The HortonWorks distribution comes with Apache Falcon. This provides a data management and processing framework that incorporates governance and regulatory compliance features.

The Pivotal HD distribution provides specific data protection tools for HAWQ the SQL query engine running on top of Hadoop.

The MapR distribution has taken a different approach to addressing protection. They developed a proprietary filesystem which is API-compatible with HDFS. This is a major shift from the standard Apache distribution and provides a rich set of data services that can be used for protection including support for atomic snapshots.

There are also a number of tool vendors that can be used for Hadoop protection.

WANdisco is an extension built on top of Apache Hadoop or any of the supported distribution partners and is designed to stretch a Hadoop cluster across sites providing a non-stop architecture. The WANdisco solution does not appear to address versioning which I consider a core requirement of a data protection solution.

A number of other tools exist from Syncsort and Attunity to move data in and out of Hadoop.

If I have omitted any others feel free to leave a comment.

Apart from software-derived solutions there are converged solutions such as those offered by EMC where we have Pivotal HD, Cloudera CDH or Hortonworks running on Isilon OneFS, which supports HDFS natively. In these architectures protection can be provided by the Isilon intelligent storage infrastructure. In this case Isilon OneFS supports snapshots which can be replicated and/or rolled over (at a file or directory level) to alternative storage for data protection and long term preservation.

Application Protection rather than File Protection

There is a significant caveat with all data protection methods we have discussed to date. We have discussed how to protect data stored in HDFS. What we have not considered is the applications layered on top of Hadoop.

HDFS file recovery does not constitute application recovery. In order to provide application recovery we need to consider which files require protection and whether the application can be put into a consistent state at the time the protection copy is created. If we cannot guarantee consistency then we need to consider whether the application can be brought into a consistent state after recovery from an inconsistent copy.

Application consistency is usually a property of the application. For example, transaction processing applications supported by relational databases provide consistency through ACID (Atomicity, Consistency, Isolation, Durability) properties. As part of the protection method we would call on the ACID properties of the application to create a consistent view in preparation for the copy process. If the application does not provide ACID properties then we rely on a volume, file system or storage-based function that allows IO to be momentarily frozen so that an atomic point-in-time view of an object (or group of objects) can be taken.

In the case of the standard Apache Hadoop architecture we would expect HDFS snapshots to provide consistent views of data sets, however as we demonstrated in practice this is not possible. And since we don’t have the luxury of intelligent storage we are at the mercy of the application to provide hooks that allow us to take consistent point-in-time views of application data sets. Without this functionality we would need to resort to archaic practices analogous to cold backups which is simply not feasible at scale (i.e. take application offline during HDFS snapshot operation).

HBase is one example of an application layered on top of Hadoop. This technology introduces the concept of tables to Hadoop. HBase includes its own snapshot implementation called HBase snapshots. This is necessary as only HBase understands the underlying structure, relationships and states of the data sets it layers on top of HDFS. However, even HBase snapshots are not perfect. The Apache HBase Operational Management guide includes the following warning:

There is no way to determine or predict whether a very concurrent insert or update will be included in a given snapshot, whether flushing is enabled or disabled. A snapshot is only a representation of a table during a window of time. The amount of time the snapshot operation will take to reach each Region Server may vary from a few seconds to a minute, depending on the resource load and speed of the hardware or network, among other factors. There is also no way to know whether a given insert or update is in memory or has been flushed.

While HBase understands how to snapshot a table it cannot guarantee consistency. Whether this presents a problem for you will depend on your workflow. If atomicity and consistency is important then some synchronisation with application workflows will be necessary to provide consistent recovery positions and versioning of application workload.

What are we Protecting against anyway?

I think we have exhausted the data protection options available to Hadoop environments. What I would like to end with is helping you answer the question: are Hadoop's inbuilt protection properties good enough?

When we think about data protection we usually think in terms of recovery point objective and recovery time objective. What we rarely consider are the events we are protecting against and whether the protection methods and controls we employ are effective against those events.

Whilst it is not my intention to raise fear and doubt, it is important to acknowledge there is no such thing as software that doesn't fail unexpectedly. HDFS has had a very good track record of avoiding data loss. However, despite this record, software bugs that can lead to data loss still exist (see HDFS-5042). The research community has also weighed in on the topic with a paper titled HARDFS: Hardening HDFS with Selective and Lightweight Versioning. It is worth reading if you want to understand what can go wrong.

To help you answer the question of whether Hadoop's inbuilt protection properties are good enough, we need to explore the events that can lead to data loss and then assess how Hadoop's inbuilt protection properties would fare in each case.

Below is a list of the events I come across in the Enterprise that can result in data loss.

Data corruption – usually a by-product of hardware and/or software failure. An example may be physical decay of the underlying storage media, which is sometimes referred to as bit rot.

Site failure – analogous to a disaster event that renders the site inoperable. An example may be a building fire, systematic software failure or malware event.

System failure – individual server, storage or network component failure. An example may be a motherboard or disk failure.

Operational failure – human error resulting from the day to day operations of the environment. An example may be an operator re-initialising the wrong system and erasing data.

Application software failure – failures of the application software layered on top of the infrastructure. An example may be a coding bug.

Infrastructure software failure – software failures relating to the infrastructure supporting the applications. An example may be a disk firmware or controller software bug.

Accidental user event – accidental user-driven acts that require data recovery. An example may be a user deleting the wrong file.

Malicious user event – intentional user-driven acts designed to harm the system. An example may be a user re-initialising a file system.

Malware event – automated systems designed to penetrate and harm the system. An example was the recent Synology worm that encrypted the data stored on Synology NAS devices and demanded payment in exchange for the encryption key.

Associated with each event is the probability of occurrence. Probability is influenced by many factors: environment, technology, people, processes, policies and experience. Each of these may help or hinder our ability to counter the risk of these events occurring. It would be unfair to try and generalise the probability of these events occurring in your environment (that is left up to the reader), so for the purpose of comparison I have assumed all events are equally probable and rated Hadoop's tolerance to these events using a simple low, medium and high scale. It is important to acknowledge these are my ratings, based on my judgement and my experience. By all means, if you have a different view leave a comment.

Data corruption – High. Support for N-way replication, data scanning and verify-on-read provides significant scope to respond to data corruption events.

Site failure – Medium. A Hadoop cluster by design is isolated to one site. Hadoop distributed copy can be used to create offsite copies, however there are challenges relating to application consistency and security that need careful consideration in order to maintain recovery positions.

System failure – Medium. The Hadoop architecture is fault tolerant and designed to tolerate server and network failures. However, system reliability provided by N-way replication is relative to data node count. As node count increases, reliability decreases. To maintain a constant probability of data loss the replication factor must be adjusted as the data node count increases.

Operational failure – Medium. Hadoop lacks role based access controls to Hadoop functions. This makes it difficult to enforce system-wide data protection policies and standards. For example, a data owner can change the replication factor on files they have write access to. This can lead to compromises in data protection policies which cannot be prevented, but can be audited by enabling HDFS audit logging.

Application software failure – Low. Data protection strategies that share the same software and controls with the application being protected are more vulnerable to software (and operational) failures that can propagate to the protection strategy. To address this situation it is common practice to apply defence in depth principles. This introduces a layered approach to data protection that mitigates the risk of a failure (e.g. software) compromising both the environment under protection and the protection strategies.

Infrastructure software failure – High. One can argue the Hadoop architecture has less reliance on infrastructure software compared to Enterprise architectures that rely on centralised storage systems. The nodes can be made up of a diverse set of building blocks, which mitigates against common, widespread infrastructure software failures.

Accidental user event – Medium. HDFS trash allows a user to recover from user-driven actions such as accidental file deletion. However, it can be bypassed and is only implemented in the HDFS client. Furthermore, if the event goes unnoticed and trash expires, data loss occurs. There are documented cases of this happening.

Malicious user event – Low. If an authorised malicious user (i.e. the insider threat) wanted to delete their data sets they could do so using standard Hadoop privileges. HDFS trash can be bypassed or expunged, and HDFS snapshots of a user's directories can be removed by the user even if the snapshots were created by the super user. Furthermore, in a secure relationship between two HDFS clusters the distcp end points are implicitly trusted and vulnerable by association.

Malware event – Low. A Hadoop client with super user privileges has access to the entire Hadoop cluster including all inbuilt protection methods.

Hadoop is very reliable when it comes to protecting against data corruption and physical component failures. Where Hadoop’s inbuilt protection strategies fall short is application software and human failures (intentional or otherwise).

To address these events we need to consider taking a copy of the data and placing it under the control of a different (and diverse) system. Separation of duties and system diversification is an important property of a data protection strategy that minimises events from propagating to secondary data sources (i.e. protection copies).

The idea is analogous to diversifying investments across fund managers and asset classes. When funds are split between different fund managers and asset classes, the likelihood of one fund manager or asset class affecting the other is reduced. In other words, we hedge our risk.

The same principle applies to data protection. When there is separation of duties and diversification between primary and protection storage, the likelihood of cross-contamination from either human events (accidental or otherwise) or software failures is minimised.

Taking copies of Hadoop data sets may sound daunting to begin with however we must remember to apply the hot, warm and cold data concepts to reduce the frequency (how often) and volume (how much) of data that needs to be processed to support our data protection requirements.

To achieve that we need to understand our data. First, we need to understand the value of the data and what the business impact would be if the data was not available or, worse, lost. Second, based on the business impact analysis (BIA), we need to determine how often and how much data needs to be protected, and whether there are other ways to reconstitute the data from alternative sources. Once we understand these classifications we can devise policies aligned with methods to protect the data.

If the data has low business value then it may be appropriate to utilise the method with the lowest cost. For example, HDFS trash with two replicas. If the data has high availability requirements and business value then it would be appropriate to utilise multiple methods that together provide the fastest time to recovery and broadest event coverage (i.e. lowest risk).

Fast time to recovery methods rely on technology that is tightly coupled with the primary data source and structure e.g. a snapshot that is not diversified from the primary data source or structure. Methods that yield the broadest event coverage and lowest risk profile rely on technologies that are loosely coupled (or decoupled) from the primary data source and structure e.g. a full independent copy on a diverse storage device and file system that is independently verified.

To achieve both fast time to recovery and the broadest event coverage requires the adoption of multiple complementary protection methods with opposing technical and risk profiles.

Summary

It is certainly clear in my mind that there is no universal approach to providing data protection for large scale Hadoop environments. The options available to you will ultimately depend on:

The Hadoop architecture you choose to adopt

The Hadoop distribution and the options it offers relative to your requirements

The value of data stored in Hadoop and your tolerance for risk

The applications layered on top of Hadoop

The application workflows and your ability to integrate with them for the purpose of protection and versioning

When I first thought about writing this article I never expected it would reach over 11,000 words. Looking back I think if you have reached this point you would agree data protection for Hadoop is a broad topic that warrants special attention.

I hope you found this useful and I look forward to the continued adoption of Hadoop in the Enterprise. This will drive data protection vendors to embrace the technology and mature our approach to protecting Hadoop environments.

Data Domain Boost over WAN is here!

Last week we made Avamar 7.1 Generally Available to the public. For more information see here. One of the features we introduced was support for Data Domain Boost over WAN. Naturally, I wanted to try it out by backing up my new MacBook (that I am still learning) from my EMC work office to my home lab over an SSL VPN tunnel.

The first thing I did was upgrade the lab to Avamar 7.1 and seed the first backup to the Data Domain over the LAN. The total size of the seed backup was 195GB. I stumbled a little getting the first seed backup as I didn't realise that when the MacBook sleeps the backup shuts down. Once I figured out how to use my MacBook I had my seed.

I kicked off the backup from work and after a few cycles it averaged 53 minutes. What’s really cool is it only sent 123MB as illustrated by the Activity Monitor.

In throughput terms that's 216 GB/hour. Not too bad for a full recovery point.

Naturally, my colleagues had a bunch of questions for me.

What is the latency to your home lab?

At first glance I thought it was reasonable. However, when I started looking into it I realised it was quite poor with a moderate amount of packet loss. At 64 bytes it was averaging 163ms with 2% packet loss. That's not healthy. Why so bad? It turns out that even though the distance between the EMC office and my home is only ~25 km, the IP traffic is routed via the US (Massachusetts specifically) and back. This is far from optimal, but it shows even under poor conditions DD Boost over WAN is rock solid.

So this had me thinking. How do I perform a WAN test and keep the test in country? How about over my mobile's 4G network? I did the same backup again, this time tethered via USB cable to my mobile phone network.

Network quality was better at 109ms with no packet loss. This time around the backup took 36 minutes (325 GB/hour).

The next question was how does DD Boost over WAN compare to the Avamar client-side protocol which writes to Avamar Data Store nodes?

After reconfiguring the Avamar client and seeding the first backup to Avamar storage I ran the subsequent backup over the 4G network to the Avamar storage. This time around the backup took 17 minutes (688 GB/hour) halving the time relative to DD Boost over WAN.
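For reference, the throughput figures above are simply the size of the recovery point divided by the elapsed time:

# 195 GB protected per recovery point
echo "scale=1; 195*60/36" | bc   # DD Boost over WAN via 4G: ~325 GB/hour
echo "scale=1; 195*60/17" | bc   # Avamar client-side protocol via 4G: ~688 GB/hour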

I expected this to be the case. One of the differences between the Avamar algorithm and DD Boost over WAN is the degree of client-side caching.

The Avamar algorithm maintains two levels of caching on the client. The file metadata hash cache and the file data hash cache. The file metadata hash cache is used to avoid exchanges with the Avamar storage when the file it is backing up has not changed. If the file has changed the file blocks get hashed and these hashes are compared to the local data hash cache. If the local data hash cache returns a hit we avoid an exchange with the Avamar storage. If we get a data hash cache miss we must ask the Avamar storage if the hash is present at the other end.

In the case of DD Boost over WAN with Avamar software, we only have at our disposal one level of client-side caching – the file metadata hash cache. In the event a file's contents have changed between backups, DD Boost over WAN relies on more exchanges with the Data Domain appliance to determine if a file data hash is present at the other end. As such, in the case where the round trip time is elevated, we should expect backups to take longer in the DD Boost over WAN use case compared to the Avamar algorithm.

As a comparison I also ran DD Boost over LAN backups against the Avamar algorithm. As expected, they both took the same time (11 minutes) as the benefit of multiple levels of caching only materialises when the round trip time is elevated. The time it took is essentially how long it takes to traverse the file system and hash the file metadata on my MacBook. Had the MacBook not been a bottleneck I would expect DD Boost to be more efficient on the client (because the client performs less work for the same outcome) in LAN use cases.

It’s always wise to know the differences and plan accordingly. Of course, your conditions and results will vary. Hope this helps. Peter..

EMC Data Protection User Group coming to a city near you

Come listen to my colleagues and me at the inaugural EMC Data Protection User Group series. This series kicks off at the end of May and will run throughout the rest of the year in 68 cities around the world. The agenda is structured in such a way that we can address both global and local data protection topics. We gather at a local location to share EMC product information, to enable you to exchange experiences and best practices, and to allow users to network with EMC experts and fellow peers.

Want to learn more? Click here to register for a local user group in your city (I will be in Melbourne). You can also join the EMC Community Network by clicking here. Finally, you can follow on Twitter @EMCProtectData and Facebook.

PS: We will be covering the present and future of Data Protection. To that end, please keep in mind you will be required to complete a Non Disclosure Agreement.

Now, before you jump off your chair in excitement, first some disclaimers. What I am about to demonstrate is not supported by me or EMC. Please don’t call EMC if you end up:

losing your backups

receiving a large bill from AWS

This is being shared in the spirit of experimentation. OK so first let me describe what is required.

We need a host to act as our Avamar to Glacier gateway. This is analogous to a Cloud Gateway. For this I am going to spin up a Linux virtual machine running CentOS.

Next we need a way to extract backups from Avamar into a flat file. It turns out the Avamar Extended Retention feature introduced a method to export and import Avamar backups to/from PAX streams. In this case we are going to turn the backup stream into a flat file and use that as the basis for archiving to Glacier.

We also need a way of shipping these flat file backup archives to Glacier. There are a few options available. I am going to use mtglacier. This tool allows a local file system to be kept in sync with one or more Glacier vaults. In our case we do not want to keep the files around, so once they have been uploaded they will be deleted.

The architecture looks like this:

We have our Avamar server (in this case Avamar Virtual Edition running 7.0SP1) with an Avamar Data Store (file system). We have our Cloud Gateway Server with Avamar client and mtglacier installed and we have Glacier supporting the vault.

I am not going to discuss the process of installing Avamar client or mtglacier. These are already well documented.

Exporting and Uploading Avamar Archived Backups to Glacier

To archive an Avamar backup into Glacier we need to go through 3 steps:

Identify the Avamar backup that needs to be exported

Export the Avamar backup to a flat file on the Cloud gateway

Upload the flat file into a Glacier vault

The workflow looks like this:

For this experiment I created a small backup that is 176588 KB in size from a Windows client with the name winco.mlab.local. This client is registered under the Avamar domain /mlab. To export this backup I need to determine its unique identifier (sequence number).

Below is the list of backups available for this client. Run this command from the Cloud gateway using the Avamar administrator (MCUser) account. If you want to lock down access substitute this user with an Avamar domain account.

The backup we are interested in is highlighted above with sequence #62 and backup label MOD-1395921964905.

To archive we need to export a copy of the backup to the Cloud gateway. Before we do that create an archive directory tree structure on the Cloud gateway that mirrors the Avamar domain structure. This will provide a mapping between archived backups in Glacier and the Avamar domain and client it originated from. If we were archiving backups from multiple Avamar servers then we may choose to prefix the structure with the Avamar server name. This would avoid conflicts.

For each backup archive create the following directory structure:

/archives/<Avamar domain path>/<Avamar client>/<Backup Sequence #>

In this example we create it as follows:

# mkdir -p /archives/mlab/winco.mlab.local/62

Now we can begin the export process. To do this we instruct Avamar's avtar command to extract a copy of the backup using the PAX archive format. PAX is short for Portable Archive Exchange and has similarities to tar and cpio. This is written to a file named data.avpax under the directory structure we previously created.

If we look at the exported file it contains some header and XML content followed by the backup data itself. The XML content is used to describe the backup if we ever wanted to bring it back into Avamar.

Before we archive this backup we also want to extract file lists and backup job metadata in order to service additional use cases such as a search archive service.

For example, a full text search engine could be introduced to index these files to support the process of identifying long term archives for retrieval. There are many free search engines available. One to consider is Elasticsearch. I may get to this in a subsequent blog post.

Now we are ready to upload the archive to Glacier. Let's instruct mtglacier to perform a dry run of the sync process to confirm what it will upload to Glacier. In this case we want to filter to files named data.avpax.
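Something along these lines (the vault name and paths are mine, and the exact option and filter syntax should be checked against the mtglacier documentation for your version):

mtglacier sync --config glacier.cfg --dir /archives --vault avamar-archives \
  --journal /archives/journal.log --concurrency 4 --partsize 16 \
  --filter '+*data.avpax -' --dry-run
# drop --dry-run to perform the real upload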

What we see here is mtglacier performed a multipart upload using 4 workers and 16MB chunks. This is necessary to drive parallelism and saturate bandwidth. About 9 minutes later the 181MB file upload completed. Here is what it looked like from the internet gateway.

Before we delete the archived backup from Avamar we should take a backup of the metadata we created on the Cloud gateway. This is necessary to ensure we can always maintain the relationship between Avamar archived backups and Glacier archives.
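A simple way to do this is to tar up the archive tree minus the payload files themselves (a sketch; adjust paths to suit your environment):

tar czf /root/avamar-archive-metadata-$(date +%Y%m%d).tar.gz \
  --exclude='*data.avpax' /archives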

Then delete the exported backup in the archive folder. This can be done automatically by searching the mtglacier journal log and deleting files that have been inserted with a CREATED record (CREATED means uploaded to Glacier). In this case we want to limit the search to files called data.avpax
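A rough sketch of that clean-up, assuming the relative file path is the last tab-separated field of each journal record (verify this against your journal format before trusting it with real data):

grep 'CREATED' /archives/journal.log | grep 'data.avpax' | \
  awk -F'\t' '{print $NF}' | while read f; do rm -f "/archives/$f"; done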

At this point we have successfully demonstrated how to archive an Avamar backup to Glacier.

Retrieving and Importing Avamar Archived Backups from Glacier

To import an archived backup into Avamar we need to go through 4 steps:

Identify the Glacier archive that needs to be retrieved

Request Glacier to retrieve the archive

Download the archive when it is ready to be retrieved

Import the archived backup into Avamar

The workflow looks like this:

[UPDATE: 20140801 – mtglacier now supports restoring individual files using the new --include and --exclude options. Using grep to extract the file from the journal and creating a new journal log is no longer required]

Unfortunately mtglacier does not support restoring individual files. Rather, it restores any files referenced in the journal log that do not exist in the local archive file system.

To work around this limitation we extract the record entry we want to restore from the journal log and create a new one. We then use the new log to initiate the restore. In this case we want to retrieve the backup that was recently archived for winco.mlab.local with sequence #62.
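The workaround looks roughly like this (vault name and paths are mine; the restore command options should be verified against your mtglacier version):

# Build a one-entry journal for the archive we want back
grep 'winco.mlab.local/62/data.avpax' /archives/journal.log > /tmp/restore.journal

# Ask Glacier to stage the archive, then collect it once it becomes available
mtglacier restore --config glacier.cfg --dir /archives --vault avamar-archives \
  --journal /tmp/restore.journal --max-number-of-files 1
mtglacier restore-completed --config glacier.cfg --dir /archives --vault avamar-archives \
  --journal /tmp/restore.journal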

After the retrieve request is issued Glacier takes several hours before the archive becomes available for download. If you try and restore an archive that is not available this is what mtglacier returns:

This graph looks significantly higher and narrower than the previous one. Can anyone guess why?

My broadband connection is asymmetric. That is, the download line rate is significantly higher than the upload. This is very common for home broadband connections. However, for this use case, it is not ideal. We would generate significantly more upload traffic than download. To make this feasible requires a symmetric link.

In this experiment the download was quick and ran @ 3 MB/s. My link is capable of 10 MB/s.

Now that we have downloaded the archive we need to import it back into Avamar. In this case we want to import it back into the original Avamar client. We specify the domain path as defined when it was exported.

We can see a new backup #63 was imported. Avamar did not use the original sequence number. It increments these when backups are created. The imported backup shares the same label as the original backup we exported which is OK.

We can now browse this backup using Avamar Administrator and restore it.

We should point out the imported backup has no expiration date. If we want to set this we would use the --expires argument to avtar during the import.

What about compression and encryption?

If you would prefer to compress and subsequently encrypt the backups before they are sent to Glacier then we can replace the --stream argument with --to-stdout in the case of exports and --from-stdin in the case of imports.

You could use gzip or bzip2 for compression and ccrypt for encryption. Make sure to compress before encrypting. For the encryption key we could use a combination of the Avamar backup label and sequence number.
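As a sketch only (the avtar export and import options themselves are omitted here, and the key file naming is just one way of combining the backup label and sequence number):

# Export: compress, then encrypt, on the way to the archive file
avtar <export options> --to-stdout | gzip -c | \
  ccrypt -e -k /root/keys/MOD-1395921964905-62.key \
  > /archives/mlab/winco.mlab.local/62/data.avpax.gz.cpt

# Import: decrypt, then decompress, on the way back into Avamar
ccrypt -d -k /root/keys/MOD-1395921964905-62.key \
  < /archives/mlab/winco.mlab.local/62/data.avpax.gz.cpt | \
  gunzip -c | avtar <import options> --from-stdin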

What about alternative archive targets?

Although Glacier was used as the target for this experiment, the options are endless. The same approach can be used to archive backups to many popular cloud and object stores including S3, Swift, Atmos, EVault, Azure, Google and Ceph, either through similar tools like mtglacier or alternatives such as FUSE modules.

Alternatively if you want to keep your archives on premise then traditional block or file storage systems could be consumed by the Cloud gateway. Ideally, these would implement erasure coding schemes to keep costs down.

So it can be done… But is it practical?

We have proven it is possible to archive Avamar backups but does that mean it is practical?

Let's put things into perspective. If we wanted to archive monthly backups for long term retention to Glacier what would we need?

Avamar comes in many flavours: virtual, physical, with and without Data Domain. The sizes range from 500GB (before dedup) to 124TB (Avamar 16 node grid). With Data Domain we can store 570TB in the active tier and have several attached to one Avamar server.

Now, let's assume we stopped storing backups older than 1 month in Avamar and instead use Glacier. To work out the size of our monthly backup we need to understand the ratio of front-end protected storage to backend consumed storage for a 30 day retention profile.

There are many factors that impact this ratio (data type, change rate, growth rate, etc) however for the purpose of this experiment we will use 1:1.

For a 500GB Avamar instance we would need to archive 500GB a month to Glacier. We have 30 days to complete the archive before the next cycle starts. Realistically we don't want to consume the entire 30 days; we need to give ourselves some tolerance. Therefore, let's say we want to complete each monthly cycle in half the available window, i.e. roughly 15 days. How much upload bandwidth would we need for 500GB?
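Here is the back-of-the-envelope calculation (whether you treat a GB as 10^9 or 2^30 bytes moves the answer between roughly 3.1 and 3.3 Mbps):

# 500 GB uploaded in half of a 30 day window (~15 days), sustained line rate in Mbps
echo "scale=2; 500 * 1024^3 * 8 / (15 * 86400) / 10^6" | bc   # ~3.31 Mbps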

We would need a 3.2 Mbps upload link. What about larger volumes of data?

Below is a table of volumes relative to time occupied between cycles. In the 100% case the archiving process is running 24×7.

What we can infer from this table is that the upload bandwidth requirements are very high.

For example, my broadband can only accommodate 2 Mbps. Even then home broadband plans are not appropriate as most of them have upload GB caps and throttle bandwidth to impractical levels when the cap is reached. My cap is 200 GB for upload and download combined and costs $80 AUD/month.

What we need is a symmetric link which is often reserved for businesses.

For example, a leading telco offers 10 Mbps business plans. That would support a 2TB monthly archive use case to Glacier at 75% busy. However, this type of connectivity is very costly at $7931/month. Compare this to Glacier’s cost of $0.01/GB/month or in this case $20/month (first month) scaling to $240/month (12 months) to store monthly archived backups for 1 year.

In this example, the cost of networking is 33x more than storage. This makes any cloud storage look expensive even at $0.01/GB/month.

The blended cost is $0.34/GB/month after year 1 (excluding AWS get/put request and restore costs).
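The arithmetic behind that blended figure, using the numbers above:

# After 12 monthly 2TB archives we are storing 24,000 GB in Glacier
echo "scale=2; 24000 * 0.01" | bc            # storage: ~$240/month
echo "scale=2; (7931 + 240) / 24000" | bc    # network + storage per GB: ~$0.34/GB/month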

Summary

Glacier is a very cost effective cold storage service. However, the cost of networking in this country makes it impractical to consume Glacier over the Internet for long term backup archives. To address this issue AWS offers alternative connectivity options that bypass traditional Internet connections and provides direct connectivity.

The product is called AWS Direct Connect and is designed to be more cost effective for large scale requirements. However, in addition to AWS Direct Connect usage costs there are line costs associated with AWS Direct Connect network partners. These prices are not in the public domain (that I could find) which makes it difficult to evaluate.

In part 2 we will explore if it is possible to minimise the networking requirements between Avamar and Glacier.

I recently wrote about a topic dear to my heart, backup deduplication and capacity planning. The paper was published by EMC Proven Professional Knowledge Sharing program and is available from the EMC Community Network. A link to the paper is here.

Many organisations have embraced disk-centric backup architectures by adopting purpose-built backup appliances to overcome the reliability and performance challenges associated with tape-centric architectures. The market for purpose-built backup appliances reached $2.4 billion in 2011 and continues to experience growth. This has resulted in many new vendors releasing solutions. The market is now saturated with a variety of solutions, from software-only to purpose-built backup appliances and combinations thereof. Each implementation has its strengths and weaknesses. This article will attempt to provide an objective comparison of the functional and architectural properties associated with deduplicated disk-centric backup implementations.

Furthermore, for those that have already adopted deduplicated disk for backup, we discuss capacity planning and why traditional planning models that we apply to primary storage do not work well for deduplicated disk backup systems. To support this discussion, we will provide a generic overview of deduplicated disk backup sizing and how backup requirements and data profiles affect storage consumption. Equipped with this knowledge, the reader will be in a better position to understand and forecast deduplication storage consumption.

This will probably be my last post for the year. Hope you enjoyed the content in 2013. Look out for some new experiments in the new year.

Fortunately, NetWorker includes the ability to push software updates to hosts from the NetWorker server.

To setup this feature the software you would like to distribute must first be added to the NetWorker server software repository. This can be done via the Software Administration GUI available from NetWorker Administrator or the command line. For the purpose of this demonstration we will use the nsrpush command from the NetWorker server but either way will work.

This particular NetWorker server runs on CentOS, which means we need to use a Windows system to help with adding the Windows software distribution to the repository. This system needs to be running a NetWorker client and be accessible from the NetWorker server. It is used once to ensure Windows file semantics are preserved as the repository is populated on the UNIX NetWorker server. The inverse applies if we were using a Windows NetWorker server.

First extract the Windows software distribution to a directory on the UNIX NetWorker server and the Windows cross-platform client. In this case I have used /nw on UNIX and c:\nw81_win_x64\win_x64 on Windows. Then tell NetWorker to add the software to the repository by specifying the location on the UNIX server and Windows cross-platform client (server.mlab.local).