Data Deduplication and Tivoli Storage Manager

Data Reduction Methods

Compression
- Encoding of data to reduce its size
- Typically localized, such as to a single file, directory tree, or storage volume

Single-instance store (SIS)
- A form of compression, usually applied to a large collection of files in a shared data store
- Only one instance of a file is retained in the data store; duplicate instances of the file reference the stored instance
- Also known as redundant file elimination

Data deduplication
- A form of compression, usually applied to a large collection of files in a shared data store
- In contrast to SIS, deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents)
- Only one instance is stored for each common chunk; duplicate instances of the chunk reference the stored instance

Note: This terminology is not used consistently throughout the industry. In particular, the terms SIS and deduplication are sometimes used interchangeably.

How Deduplication Works

[Figure: a data store before and after deduplication, with duplicate chunks a and b replaced by references to single stored copies]

1. Data chunks are evaluated to determine a unique signature for each
2. Signature values are compared to identify all duplicates
3. Duplicate data chunks are replaced with pointers to a single stored chunk, saving storage space

This section provides a generalized description of deduplication technology. Individual deduplication products and solutions may vary.
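The three steps above can be sketched in a few lines of Python. This is a generic illustration, not any particular product's implementation; it assumes fixed-size chunking and SHA-256 signatures, whereas real products often use variable-size chunking.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking, for simplicity


def deduplicate(data: bytes):
    """Split data into chunks, store each unique chunk once, and
    represent the object as an ordered list of chunk signatures."""
    store = {}   # signature -> chunk bytes (the single stored copy)
    recipe = []  # ordered signatures acting as pointers for reconstruction
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        sig = hashlib.sha256(chunk).hexdigest()  # step 1: unique signature
        if sig not in store:                     # step 2: compare against index
            store[sig] = chunk                   # first occurrence is stored
        recipe.append(sig)                       # step 3: duplicates become pointers
    return store, recipe


def reconstruct(store, recipe):
    """Rebuild the original object by following the pointers."""
    return b"".join(store[sig] for sig in recipe)
```

For example, an object containing the same 4 KB block twice ends up with only one copy of that block in `store`, while `recipe` records both occurrences.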

Data Deduplication Value Proposition

Potential advantages
- Reduced storage capacity required for a given amount of data; ability to store significantly more data on a given amount of disk
- Restore from disk rather than tape may improve the ability to meet recovery time objectives (RTO)
- Network bandwidth savings (some implementations)
- Lower storage-management and energy costs resulting from reduced storage requirements

Potential tradeoffs/limitations
- Significant CPU and I/O resources required for deduplication processing
- Deduplication might not be compatible with encryption
- Increased sensitivity to media failure, because many files could be affected by the loss of a common chunk
- Deduplication may not be suitable for data on tape, because increased fragmentation of data could greatly increase access time

Where Deduplication Is Performed

Source-side (client-side): deduplication performed at the data source (e.g., by a backup client), before transfer to the target location
- Advantages
  - Deduplication before transmission conserves network bandwidth
  - Awareness of data usage and format may allow more effective data reduction
  - Processing at the source may facilitate scale-out
- Disadvantages
  - Deduplication consumes CPU cycles on the file/application server
  - Requires software deployment at source (and possibly target) endpoints
  - Depending on design, may be subject to security attack via spoofing

Target-side (server-side): deduplication performed at the target (e.g., by backup software or a storage appliance)
- Advantages
  - No deployment of client software at endpoints
  - Possible use of direct comparison to confirm duplicates
- Disadvantages
  - Deduplication consumes CPU cycles on the target server or storage device
  - Duplicate data may be discarded only after it has been transmitted to the target

When Deduplication Is Performed

In-band: deduplication performed during data transfer from source to target
- Advantages
  - Immediate data reduction, minimizing disk storage requirements
  - No post-processing
- Disadvantages
  - May be a bottleneck for data ingestion (e.g., longer backup times)
  - Only one deduplication process for each I/O stream
  - May not support deduplication of legacy data on the target server

Out-of-band: deduplication performed after data ingestion at the target
- Advantages
  - No impact on data ingestion
  - Potential for deduplication of legacy data
  - Possibility of parallel deduplication processing
- Disadvantages
  - Data must be processed twice (during ingestion and during subsequent deduplication)
  - Storage is needed to retain data until deduplication occurs

Identification of Redundant Chunks

- A unique identifier is determined for each chunk
- Identifiers are typically calculated using a hash function that outputs a digest based on the data in each chunk
  - MD5 (message-digest algorithm)
  - SHA (secure hash algorithm)
- For each chunk, the identifier is compared against an index of identifiers to determine whether that chunk is already in the data store
- Selection of a hash function involves tradeoffs among
  - Processing time to compute hash values
  - Index space required to store hash values
  - Risk of false matches
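The index-space side of this tradeoff is easy to see with Python's standard `hashlib` module (a generic illustration; the chunk contents here are made up): every stored chunk needs one index entry at least as large as the digest, so longer digests cost more index space per chunk while reducing the risk of false matches.

```python
import hashlib

chunk = b"example chunk data"
for name in ("md5", "sha1", "sha256"):
    h = hashlib.new(name, chunk)
    # digest_size is in bytes; multiply by 8 for the digest length in bits
    print(f"{name}: {h.digest_size * 8}-bit digest -> "
          f"at least {h.digest_size} bytes of index space per chunk")
```

With a billion chunks, for example, moving from MD5 (16 bytes) to SHA-256 (32 bytes) adds roughly 16 GB of raw identifier storage to the index.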

False Matches

- The possibility exists that two different data chunks could hash to the same identifier (such an event is called a collision)
- Should a collision occur, the chunks could be falsely matched and data loss could result
- Collision probability can be calculated from the possible number of unique identifiers and the number of chunks in the data store
  - A longer digest means more unique identifiers, and therefore a lower probability of collisions
  - More chunks mean a higher probability of collisions
- Approaches to avoiding data loss due to collisions
  - Use a hash function that produces a long digest, to increase the possible number of unique identifiers
  - Combine values from multiple hash functions
  - Combine the hash value with other information about the chunk
  - Perform byte-wise comparison of chunks in the data store to confirm matches
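The collision probability can be computed directly from the birthday-paradox approximation (a standard estimate, not specific to any product); the chunk count below is an assumed example on the order of a large multi-petabyte store:

```python
import math


def collision_probability(num_chunks: int, digest_bits: int) -> float:
    """Birthday-paradox approximation p = 1 - exp(-k^2 / (2N)), with N = 2^L."""
    n = 2 ** digest_bits
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(-num_chunks ** 2 / (2 * n))


# Roughly 10^13 chunks (e.g., tens of petabytes at 4 KB per chunk):
chunks = 10 ** 13
for bits in (128, 160, 256):
    print(f"{bits}-bit digest: p = {collision_probability(chunks, bits):.3e}")
```

Even at 10^13 chunks, a 128-bit digest keeps the collision probability far below the probability of an undetected hardware error, and each additional digest bit halves it again.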

Hash Functions

Hash functions take a message of arbitrary length as input and output a fixed-length digest of L bits. They are published algorithms, normally standardized in an RFC.

[Table: comparison of MD5, three SHA variants, and Whirlpool across output size L (bits), performance in cycles/byte (C / assembly, Intel Xeon)*, the number of chunks at which the collision chance reaches 50% (or greater)**, the chance of one collision in a 40 PB archive*** (using 4 KB chunks), and the year of each standard; the numeric values were lost in extraction]

* Performance analysis and parallel implementation of dedicated hash functions, Proc. of EUROCRYPT 2002.
** The probability of one collision among k chunks is p ≈ 1 − e^(−k²/(2N)), where N = 2^L; setting p = 0.5 gives k ≈ N^(1/2) = 2^(L/2) (the birthday paradox).
*** For comparison, the probability of one hard-drive bit error is about [value lost in extraction].

The probability of collision is extremely low, and it can be reduced further, at the expense of performance, by using a hash function that produces a longer digest.
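The bound in footnote ** follows from the standard birthday-paradox approximation; the derivation below is generic and not tied to any particular hash function:

```latex
p \;\approx\; 1 - e^{-k^{2}/(2N)}, \qquad N = 2^{L}.
\quad\text{Setting } p = \tfrac{1}{2}:\quad
e^{-k^{2}/(2N)} = \tfrac{1}{2}
\;\Longrightarrow\;
k = \sqrt{2N \ln 2} \;\approx\; 1.18\,\sqrt{N} \;\approx\; 2^{L/2}.
```

For a 160-bit digest, for instance, on the order of 2^80 chunks must be generated before the collision chance approaches 50%.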

Elimination of Redundant Chunks

- For each redundant chunk, the index is updated to reference the matching stored chunk
- The index is updated with metadata indicating how to reconstruct each object from chunks, some of which may be shared with other objects
- Any space occupied by the redundant chunks can be deallocated and reused
- The deduplication index is critical
  - Integrity
  - Performance
  - Scalability
  - Protection
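Deallocating chunk space safely requires knowing when the last object referencing a chunk is gone. One common scheme, shown here as a generic sketch and not as TSM's actual implementation, is reference counting in the index:

```python
class ChunkStore:
    """Minimal reference-counted deduplication index (illustrative only)."""

    def __init__(self):
        self.chunks = {}    # signature -> the single stored copy of the chunk
        self.refcount = {}  # signature -> number of objects referencing it

    def add_reference(self, sig: str, data: bytes) -> None:
        """Record one object's use of a chunk, storing the data only once."""
        if sig not in self.chunks:
            self.chunks[sig] = data  # first instance is stored
        self.refcount[sig] = self.refcount.get(sig, 0) + 1

    def remove_reference(self, sig: str) -> None:
        """Drop one reference; free the chunk when no object needs it."""
        self.refcount[sig] -= 1
        if self.refcount[sig] == 0:   # last referencing object is gone:
            del self.chunks[sig]      # space can be deallocated and reused
            del self.refcount[sig]
```

The design choice here is the usual one: reference counts reclaim space immediately but must be updated transactionally with the index, which is one reason index integrity and protection are critical.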

Deduplication Highlights

- Deployment of new clients or API applications is not required
- Legacy data stored in, or moved to, deduplication-enabled FILE storage pools can be deduplicated
- Data migrated or copied to tape is reconstructed (re-duplicated) to avoid excessive mounting and positioning during subsequent access
- Ability to control the number, duration, and scheduling of CPU-intensive background processes for identification of duplicate data
- Reporting of space savings in deduplicated storage pools
- Deduplication processing skips client-encrypted objects, but should work with storage-device encryption
- Native TSM implementation, with no dependency on specific hardware

Considerations for Use of TSM Deduplication

Consider deduplication if
- Data recovery would improve by storing more data objects on a limited amount of disk
- Data will remain on disk for an extended period of time
- There is much redundancy in the data stored by TSM (e.g., common operating-system or project files)
- TSM server CPU and disk I/O resources are available for the intensive processing needed to identify duplicate chunks

Deduplication might not be indicated for
- Mission-critical data, whose recovery could be delayed by accessing chunks that are not stored contiguously
- TSM servers that do not have sufficient resources
- Data that will soon be migrated to tape
