StateTech E-Newsletter

Built-In Dedupe

Logan Kugler is a freelance tech writer based in Los Angeles. He has written about everything from IT security and data storage to mobile communications for more than 50 national and international publications. Follow Logan on Twitter @LoganKugler

If you manage a computer network in Louisiana, Hurricane Katrina made clear the importance of having a sound backup strategy. "Time here is measured as 'before Katrina' and 'after Katrina,'" says Peter Haas, the IT director responsible for the Louisiana Supreme Court network. "After Katrina, we're extremely picky about the backup tools we use."

While disaster recovery weighs heavily on Haas' mind, his day-to-day focus rests on the ability to retrieve files quickly, whether that's historical data from the court's extensive archives or lost data from a power outage, system failure or user error.

Eliminating Redundancy

Deduplication is a relatively new technology, but the principle is fairly simple. In every organization, there are pieces of data that are repeated dozens, hundreds or even thousands of times across all the files stored on a network. These could include whole files -- such as a memo sent to everyone in the organization and saved to every hard drive on every computer -- but much of the replication occurs within files; for instance, a signature block appended to every outgoing e-mail or a logo embedded in every PowerPoint.

Rather than save these scraps of data over and over again, deduplication scans every file for redundancy and replaces repeated data with a pointer to the original. "It's like a bouncer at a club," says Mike Fisch, senior contributing analyst at The Clipper Group. "To get in, you have to be original."
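The idea of replacing repeated data with a pointer can be sketched in a few lines of Python. This is a toy illustration, not how any particular product works: the `DedupStore` class, the 16-byte fixed chunk size and the sample memos are all assumptions made for the example, and production systems often use larger, variable-size chunks.

```python
import hashlib

CHUNK_SIZE = 16  # bytes; deliberately tiny so the example is easy to follow


class DedupStore:
    """Toy store: each unique chunk is kept once, files are lists of pointers."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (stored once)
        self.files = {}   # file name -> ordered digests (the "pointers")

    def put(self, name, data):
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only the first copy of a chunk is stored; repeats are skipped.
            self.chunks.setdefault(digest, chunk)
            pointers.append(digest)
        self.files[name] = pointers

    def get(self, name):
        # Rebuild the file by following its pointers in order.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


# Hypothetical data: two memos that end with the same signature block.
signature = b"Jane Doe, Clerk of Court".ljust(32, b".")
memo_a = b"Hearing moved to 9 a.m. Monday.".ljust(32) + signature
memo_b = b"Filing deadline is Friday.".ljust(32) + signature

store = DedupStore()
store.put("memo_a", memo_a)
store.put("memo_b", memo_b)

logical_bytes = len(memo_a) + len(memo_b)
print(logical_bytes, store.stored_bytes())  # 128 bytes written, 96 bytes stored
```

The signature block is stored once even though it appears in both memos, and `store.get("memo_a")` still returns the original bytes.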

Deduplication offers a number of benefits when integrated with a backup strategy. First, it reduces the size of individual backups by eliminating redundant data. It also reduces the storage capacity required for subsequent backups because today's backup image likely shares much, if not most, of its data with yesterday's. With deduplication, a set of backups can hold many times more data than the disk space it actually occupies. "You can easily get to 20 times, or even 50 times [the amount of data]," says Fisch. "So you can back up a lot more data to disk."

Because deduplicated backups are smaller than traditional backups, they run more quickly and can be transferred over the network or to offsite storage more quickly. That means backups consume less bandwidth and overhead, and recovery takes less time. "It's obviously quicker," says Haas. "We get a full backup every night. We wouldn't have been able to do that with traditional linear backup on tape."

Jim West, network manager for the city of Kissimmee, Fla., also appreciates the speed with which he can back up his entire network using deduplication built into EMC Avamar.

"Less than 1 percent of our files change on any given day," West says. "With traditional backup, we would have to back up the entire system." With deduplication, he can run daily backups instead of the weekly backups he used to start every Friday night.

Because of the small size of each backup, West is also able to maintain many more backups on disk instead of on tape. "I have two months of nightly backups on disk, where it's indexed and always available," says West. "If I had to find a specific file on tape, I'd have to go find the place on the tape where that file is. And the more data we accumulated, the bigger the number of tapes was getting."
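A rough back-of-the-envelope calculation shows why nightly backups become affordable. The sketch below assumes a hypothetical 1,000 GB data set, treats West's "less than 1 percent" as exactly 1 percent of new data per night, and counts 60 nights for two months; none of these figures describe Kissimmee's actual environment.

```python
def dedupe_estimate(full_size_gb, daily_change, nights):
    """Estimate logical vs. physical storage for nightly full backups.

    Assumes the first backup is stored whole and each later backup adds
    only the changed fraction as new, unique data.
    """
    logical = full_size_gb * nights
    physical = full_size_gb + full_size_gb * daily_change * (nights - 1)
    return logical, physical, logical / physical


logical, physical, ratio = dedupe_estimate(1000, 0.01, 60)
print(f"logical: {logical} GB, physical: {physical:.0f} GB, ratio: {ratio:.1f}x")
```

Under these assumptions, 60,000 GB of nightly full backups fit in roughly 1,590 GB of disk, a ratio of about 38 to 1, which is in line with the 20-to-50-times range Fisch describes.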

Deduplication is catching on for good reason -- it saves time, money and hassle, and it also opens up possibilities that no one had planned for. "We've always had to safeguard the data," says Haas, "but now we're able to go back three or four years to retrieve documents, something we could not do without deduplication." Because of this, deduplication has become an essential part of Haas' toolkit. "It has to remain an important fixture within our architecture. It's just that important."

Make the Most of Data Deduplication

1. Keep backups longer: The more backups you have, the greater the likelihood of finding the redundancy that makes deduplication work. Subsequent backups will get smaller and smaller -- incidentally freeing up the space you need to keep more backups.

2. Know your data: Video, photographs, scanned documents and audio tend not to yield much gain. If these file types make up much of your backups, consider bypassing deduplication to save overhead.

3. Don't encrypt or compress before dedupe: Encryption and compression eliminate many of the patterns that deduplication looks for. Apply them after the data has been deduplicated.

4. Dedupe as widely as possible: The more data deduplication has to work with, the more opportunities for finding redundant data, so include as many computers, servers and virtual devices as possible in your backups.

5. Let the stats worry about themselves: While it is satisfying to see you are saving 80 percent, 90 percent or even more of your disk space by using deduplication, don't chase a better ratio. Configure your software in the way that works best for you and let the statistics fall where they may.
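The third tip can be checked with a small experiment. The sketch below covers only the compression half of that tip, and it makes several assumptions: fixed 64-byte chunks, Python's standard zlib module as the compressor, and two made-up files that differ only in a 64-byte header. It measures how many chunks of the second file a deduplicating store would already hold after seeing the first.

```python
import zlib

CHUNK_SIZE = 64  # bytes; an arbitrary choice for the illustration


def chunks(data):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def fraction_already_stored(new_file, old_file):
    """Fraction of new_file's chunks that also appear in old_file."""
    known = set(chunks(old_file))
    new_chunks = chunks(new_file)
    return sum(c in known for c in new_chunks) / len(new_chunks)


# Hypothetical files: identical bodies, different 64-byte headers.
body = b"".join(
    b"line %d: the quick brown fox jumps over the lazy dog %d\n" % (i, i * 7919)
    for i in range(200)
)
file_a = b"TO: finance; RE: budget review".ljust(CHUNK_SIZE, b"-") + body
file_b = b"TO: legal; RE: contract renewal".ljust(CHUNK_SIZE, b"=") + body

raw_hit = fraction_already_stored(file_b, file_a)
compressed_hit = fraction_already_stored(
    zlib.compress(file_b), zlib.compress(file_a)
)

print(f"raw: {raw_hit:.0%} of chunks already stored")
print(f"compressed first: {compressed_hit:.0%} of chunks already stored")
```

On the raw files, nearly every chunk of the second file is already in the store. After compression, a small difference at the start of each file changes the encoded bytes that follow, so the two compressed streams typically share few or no chunks and there is little left to deduplicate.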