SANS Digital Forensics and Incident Response Blog

This post is based on a paper I published in December last year: "Overwriting Hard Drive Data: The Great Wiping Controversy" by Craig Wright, Dave Kleiman and Shyaam Sundhar R.S., as presented at ICISS2008 and published in the Springer-Verlag Lecture Notes in Computer Science (LNCS) series.

Background

Opinions on the number of passes required to correctly overwrite (wipe) a hard disk drive are controversial, and have remained so even though organizations such as NIST state that a single wipe pass is all that is needed to delete data such that it cannot be recovered.

The controversy has caused much misconception. This was the reason for this project.

It is common to see people claim that data can be recovered if it has been overwritten only once, often adding that it takes up to ten, or even as many as 35, passes to securely overwrite the previous data. The 35-pass figure is referred to as the Gutmann scheme, after Peter Gutmann's 1996 paper "Secure Deletion of Data from Magnetic and Solid-State Memory" [12].

To answer this once and for all, a project was started in 2007 to test whether data can be recovered from a wiped drive using an electron microscope. For the full details you will need to read the published paper, though this post offers a synopsis. In subsequent communications with Prof. Fred Cohen I have come to realize that there are other uses for the methods used in this effort. The recovery of data from damaged drives is possible. Further, using the mathematical methods employed in the experiment (Bayesian statistics), one can recover data from damaged drives by far simpler means than an MFM (magnetic force microscope).

On top of this, I have to note (thanks to Prof. Cohen) that on many larger modern drives, many sectors are never overwritten in the course of use because of the size of the drive.

Where did the controversy begin?

The belief that data can be recovered from a wiped drive rests on the presupposition that when a one (1) is written to disk, the actual effect is closer to obtaining 0.95 when a zero (0) is overwritten with a one (1), and 1.05 when a one (1) is overwritten with a one (1).

This can be demonstrated to be false.

The effect did hold for high-capacity floppy diskette drives, which have a rudimentary positioning mechanism. That observation was made at the bit level, and the testing did not consider the accumulated error. The argument arises from the statement that "each track contains an image of everything ever written to it, but that the contribution from each 'layer' gets progressively smaller the further back it was made." This is a misunderstanding of the physics of drive functions and magneto-resonance. There is in fact no time component and the image is not layered. It is rather a density plot.

MFM - Magnetic Force Microscopy

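To test this theory, we used an MFM. Magnetic force microscopy (MFM) images the spatial variation of magnetic forces on a sample surface. Strictly speaking, an MFM is a scanning probe microscope (a variant of the atomic force microscope), although it is commonly lumped in with what most people simply term an electron microscope.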

Partial Response Maximum Likelihood (PRML)

Partial Response Maximum Likelihood (PRML), a method for converting the weak analogue signal from the head of a magnetic disk or tape drive into a digital signal, and the newer Extended Partial Response Maximum Likelihood (EPRML) explain how encoding is implemented on a hard drive. The MFM reads the unprocessed analogue value. Complex statistical digital processing algorithms are used to determine the "maximum likelihood" value associated with the individual reads.

Older technologies read and interpreted bits using a method known as peak detection, which differs from the method used in modern hard drives. Peak detection is satisfactory while the peaks in magnetic flux sufficiently exceed the background signal noise. With the increase in the write density of hard drives, encoding schemes based on peak detection (such as Modified Frequency Modulation, also abbreviated MFM), which are still used with floppy disks, have been replaced in hard drive technologies. Hard disks are now encoded using PRML and EPRML technologies, which have allowed the write density on the hard disk to be increased by a full 30-40% over that granted by standard peak detection encoding.

Common misconceptions

Drive writes are magnetic field alterations; the belief that a write leaves a physical impression in the drive that can reveal the age of the impression is wrong. The magnetic flux density follows a function known as the hysteresis loop. The magnetic flux levels written to the hard drive platter vary in a stochastic manner, with variations in the magnetic flux related to head positioning, temperature and random error.

The surfaces of the drive platters can have differing temperatures at different points, and these may differ from the temperature of the read/write head. As a consequence, there are many problems with the belief that data is recoverable following a wipe. Differences in the expansion and contraction rates across the drive platters are handled by a stochastically derived thermal recalibration algorithm. All modern drives use this technology to minimize variance. However, even with this algorithm, data is written to the drive in what is in effect an analogue pattern of magnetic flux density.

A stochastic distribution

Complex statistically based detection algorithms are employed to process the analog data stream as data is read from the disk. This is the "partial response" component mentioned previously. A stochastic distribution of data not only varies on each read, but also over time and with temperature differentials. Worse, there is the hysteresis effect to be considered.

Hysteresis

Stochastic noise results in a level of controlled chaos [Carroll and Pecora (1993a, 1993b)]. When looking at the effects of a magnetically based data write process, the hysteresis effect ensures that data does not return to a starting point.

As such, you can never go to the original starting point. On each occasion that you add and delete data from a drive, the resulting value that is written to the disk varies. This begins with a low level format, changes whenever any data is written to the drive and fluctuates on each attempt to zero the platter. What in effect you get is a random walk that never quite makes it back to the original starting point (save exposure to a powerful magnetic force or annealing process).

Magnetic signatures are not time-stamped

There is no "unerase" capability [15] on a hard drive due to magnetic resonance. There are no "layers" of written data. The value of the magnetic field does vary on each write to the drive, but it does so due to other factors and influences. These include:

fluctuations in temperature,

movement of the head,

prior writes to the drive.

Any and all of these effects will influence the permeability of the platter in a statistically significant manner.

Variability in magnetic force

Jiles [21] notes that if the temperature of a drive platter is increased from 20 to 80 degrees centigrade, a typical ferrite can become subject to a 25% reduction in the permeability of the platter. Within a drive, the temperature of "normal" operation can vary significantly. With constant use, a drive can easily exceed 80 degrees centigrade internally. The system as a whole may not get this hot, but all that is required is for a segment of the platter to do so, and this is common.

Permeability is a material property that measures how much effort is required to induce a magnetic flux within a material. It is defined as the ratio of the flux density to the magnetizing force. This may be displayed with the formula: µ = B/H (where µ is the permeability, B is the flux density and H is the magnetizing force). Due to the changes experienced by a drive, MFM techniques (detailed above) as used with floppy drives do not work in modern hard drives.
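As a rough illustration of µ = B/H and the 25% reduction quoted from Jiles, the sketch below computes the flux density induced by the same magnetizing force at two permeabilities. It treats B as linear in H, which ignores the hysteresis loop, and the numeric values of µ and H are arbitrary placeholders chosen for illustration, not measurements from the paper.

```python
def flux_density(mu: float, h: float) -> float:
    """Return the flux density B = mu * H (linear approximation)."""
    return mu * h


def reduced_permeability(mu: float, reduction: float) -> float:
    """Return mu after a fractional reduction (0.25 means 25% lower)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return mu * (1.0 - reduction)


# Hypothetical values, for illustration only.
MU_COOL = 1.0e-3   # permeability at about 20 degrees centigrade
H_FIELD = 100.0    # magnetizing force applied by the write head

mu_hot = reduced_permeability(MU_COOL, 0.25)  # 25% reduction at about 80 degrees
b_cool = flux_density(MU_COOL, H_FIELD)
b_hot = flux_density(mu_hot, H_FIELD)
relative_drop = (b_cool - b_hot) / b_cool

print(f"B (cool platter): {b_cool:.4f}")
print(f"B (hot platter):  {b_hot:.4f}")
print(f"Relative drop in written flux density: {relative_drop:.0%}")
```

Under these assumptions, the same write applied to a hot and a cool segment of the platter leaves different flux levels, which is one reason a written value cannot be read back as a clean, repeatable number.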

The hypothesis and the experiment

To test the hypothesis, a number of drives of various ages and types and from several vendors were tested. In order to completely validate all possible scenarios, a total of 15 data types were used in 2 categories.

Category A divided the experiment into testing the raw drive (this is a pristine drive that has never been used), formatted drive (a single format was completed in Windows using NTFS with the standard sector sizes) and a simulated used drive (a new drive was overwritten 32 times with random data from /dev/random on a Linux host before being overwritten with all 0's to clear any residual data).

The experiment was conducted in order to test a number of write patterns. There are infinitely many possible ways to write data, so not all can be tested. The idea was to ensure that no particular pattern was significantly better or worse than another.

Category B consisted of the write pattern used both for the initial write and for the subsequent overwrites.

This category consisted of 5 dimensions:

all 0's,

all 1's,

a "01010101" pattern,

a "00110011" pattern, and

a "00001111" pattern.

The Linux utility "dd" was used to write these patterns with a default block size of 512 (bs=512). A selection of 17 models of hard drive were tested. These varied from an older Quantum 1 GB drive to current drives (at the time the test started) dated to 2006.
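For readers who want to see what these patterns look like as data, the following is a minimal sketch that builds one 512-byte block for each of the five Category B patterns. It is not the procedure used in the experiment (which used dd). The names are illustrative, and mapping each 8-bit pattern to a single repeated byte value is an assumption made for this example.

```python
BLOCK_SIZE = 512  # matches the bs=512 block size mentioned above

# The five Category B write patterns, expressed as 8-bit strings.
PATTERNS = {
    "all 0's": "00000000",
    "all 1's": "11111111",
    "01010101": "01010101",
    "00110011": "00110011",
    "00001111": "00001111",
}


def pattern_block(bits: str, size: int = BLOCK_SIZE) -> bytes:
    """Return `size` bytes, each byte holding the given 8-bit pattern."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("bits must be a string of exactly eight 0/1 characters")
    return bytes([int(bits, 2)]) * size


blocks = {name: pattern_block(bits) for name, bits in PATTERNS.items()}

for name, block in blocks.items():
    print(f"{name}: byte value 0x{block[0]:02X}, {len(block)} bytes")
```

Note that this only describes the user data handed to the drive. What is physically recorded on the platter depends on the drive's own encoding.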

The data patterns were written to each drive in all possible combinations. Each data write was a 1 kb file (1024 bits). It was necessary to carefully choose a size and location. Finding a segment on a drive without prior knowledge is like looking for the proverbial needle in the haystack. To do this, the following steps were taken:

Both the drive skew and the bit were read.

The process was repeated 5 times for an analysis of 76,800 data points.

The likelihood calculations were completed for each of the 76,800 points with the distributions being analyzed for distribution density and distance.

This calculation was based on the Bayesian likelihood where the prior distribution was known.

As has been noted, in real forensic engagements the prior distribution is unknown. When you are trying to recover data from a drive, you generally do not have an image of what you are seeking to recover. Without this forensic image, the experiment would have been exponentially more difficult. What we found is that, even on a single write, the overlap gives a probability of only just over 50% of correctly choosing a prior bit (the best read being a little over 56%).
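The effect of overlapping distributions can be sketched with a toy model. Assume, purely for illustration, that the analogue value read after an overwrite is Gaussian with mean 0.95 if the prior bit was 0 and mean 1.05 if it was 1 (the figures from the presupposition discussed earlier), with equal priors and a common standard deviation. The Gaussian model and the sigma values are assumptions of this sketch, not parameters measured in the paper.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def posterior_prior_bit_one(x: float, mu0: float = 0.95, mu1: float = 1.05,
                            sigma: float = 0.33, prior_one: float = 0.5) -> float:
    """Bayes' rule: P(prior bit = 1 | analogue read x) for Gaussian likelihoods."""
    like0 = math.exp(-((x - mu0) ** 2) / (2.0 * sigma ** 2))
    like1 = math.exp(-((x - mu1) ** 2) / (2.0 * sigma ** 2))
    numerator = like1 * prior_one
    return numerator / (numerator + like0 * (1.0 - prior_one))


def best_accuracy(mu0: float = 0.95, mu1: float = 1.05, sigma: float = 0.33) -> float:
    """Accuracy of the optimal guess with equal priors and equal variance.

    The best rule is to threshold at the midpoint, which is correct with
    probability Phi(|mu1 - mu0| / (2 * sigma)).
    """
    return normal_cdf(abs(mu1 - mu0) / (2.0 * sigma))


# Hypothetical noise levels: the wider the spread, the closer to a coin toss.
for sigma in (0.05, 0.10, 0.33, 1.00):
    print(f"sigma={sigma:.2f}: best per-bit accuracy = {best_accuracy(sigma=sigma):.4f}")

print(f"posterior at x=1.00: {posterior_prior_bit_one(1.00):.3f}")
print(f"posterior at x=1.20: {posterior_prior_bit_one(1.20):.3f}")
```

In this toy model, once the spread of the reads is several times larger than the 0.1 gap between the two means, even an optimal Bayesian guess is only a few points better than chance. The value sigma = 0.33 was picked here only because it lands near the 56% best read quoted above.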

This raises the issue that there is no way to determine whether the bit was correctly chosen or not.

Therefore, there is a chance of correctly choosing all of the bits in a selected byte (8 bits), but this equates to a probability of around 0.9% (or less), with a small confidence interval either side for error.

The Results of the Tests

The calculated values are listed below for the various drives. Not all data is presented here, but it is clear to see that use of the drive impacts the values obtained (through the hysteresis effect and residuals). The other issue is that all recovery is statistically independent (for all practical purposes). The probability of obtaining two bits is thus multiplied.
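Because independent probabilities multiply, the chance of recovering a run of n bits with a per-bit success probability p is p^n. The short sketch below takes the best per-bit read of 56% quoted above as p and treats every bit as fully independent, which is a simplifying assumption. The function name and the chosen run lengths are illustrative.

```python
def run_recovery_probability(p_bit: float, n_bits: int) -> float:
    """Probability of correctly recovering n independent bits in a row."""
    if not 0.0 <= p_bit <= 1.0:
        raise ValueError("p_bit must be a probability between 0 and 1")
    if n_bits < 0:
        raise ValueError("n_bits must be non-negative")
    return p_bit ** n_bits


P_BEST = 0.56  # best per-bit read reported in this post

for label, n_bits in (("1 byte", 8), ("4 bytes", 32), ("1 kb file", 1024)):
    probability = run_recovery_probability(P_BEST, n_bits)
    print(f"{label:>9} ({n_bits:4d} bits): {probability:.3e}")
```

Even at the best observed per-bit rate, a single byte comes out at under 1%, and the probability collapses toward zero for anything the size of the 1024-bit test file.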

Table of Probability Distributions for the older model drives.

Table of Probability Distributions for the "new" (ePRML) model drives.

What we see is that it quickly becomes practically impossible to recover anything (and this is not even taking the time to read data using an MFM into account).

What this means

The other overwrite patterns actually produced results as low as 36.08% (+/- 0.24). Because the distribution is based on a binomial choice, the chance of guessing the prior value is 50%. That is, if you toss a coin, you have a 50% chance of correctly choosing the value. In many instances, using an MFM to determine the prior value written to the hard drive was less successful than a simple coin toss.

The purpose of this paper was to categorically settle the controversy surrounding the belief that data can be recovered following a wipe procedure. This study has demonstrated that correctly wiped data cannot reasonably be retrieved, even if it is of a small size or found only over small parts of the hard drive, and not even with the use of an MFM or other known methods. The belief that a tool can be developed to retrieve gigabytes or terabytes of information from a wiped drive is in error.

Although there is a good chance of recovery for any individual bit from a drive, the chances of recovering any amount of data from a drive using an electron microscope are negligible. Even speculating on the possible recovery of an old drive, there is no likelihood that any data would be recoverable from the drive. The forensic recovery of data using electron microscopy is infeasible. This was true on old drives, and recovery has only become more difficult over time. Further, the data would need to have been written and then wiped on a raw, unused drive for there to be any hope of any level of recovery, even at the bit level, and this does not reflect real situations. It is unlikely that a recovered drive will not have been used for a period of time, and the interaction of defragmentation, file copies and general use that overwrites data areas negates any chance of data recovery. The fallacy that data can be forensically recovered using an electron microscope or related means needs to be put to rest.

Craig Wright, GCFA Gold #0265, is an author, auditor and forensic analyst. He has nearly 30 GIAC certifications, several post-graduate degrees and is one of a very small number of people who have successfully completed the GSE exam.

15 Comments

rsreese

simsong

I'm confused by this. I read the paper. I didn't see:

* That the authors validated their recovery approaches by attempting to read valid data.

* How the sectors on the disk were found.

* How the authors translate from the proprietary coding used by the vendors was translated to ascii.

Can someone point to where this is in the paper?

craigswright

To answer your query.

* That the authors validated their recovery approaches by attempting to read valid data.

You should note that a read was conducted on initial writes with known data. This means that the initial pattern was known and hence would be recovered in a manner that would lead to this point being validated. This is detailed in the paper.

* How the sectors on the disk were found.

Contiguous writes. All areas of the disk not checked were left blank, thus the only data was our data. Knowing where the data was written was simple. No magic to this.

* How the authors translate from the proprietary coding used by the vendors was translated to ascii.

Not particularly difficult. The various PRML and ePRML schemes are published and available. Clock cycles and patterns are simple. It is not as much of an art as you seem to believe.

Regards,
Craig

ralienpp

This is an interesting article; everything seems clear, the figures support your statements.

I just checked out NIST's "Updated DSS Clearing and Sanitization Matrix AS OF: June 28, 2007", and they indeed propose a much more simple procedure for hard drives.

Can you recommend some articles by the "opposing team", where they provide a rationale for the multi-pass approach? If they still have something to say, I'd like to hear it.

craigswright

Other than the original paper by Peter Gutmann there is little on the subject.

Even then I have found little science on the subject. Science requires proof. What starts as a hypothesis needs to be tested before it becomes science. I have found nothing along these lines.

lifelikeimage

This may add to your discussion as the firmware on the drive has had a SecureErase command in the ANSI set since 2002 (15GB or greater in size).

This link detailed the initial information that leads to the research: http://www.wilderssecurity.com/archive/index.php/t-177197.html

The above link leads to some excellent reading and the research of UCSD and NSA on the subject.

Mr. Data Recovery

So, basically there is the possibility to recover fragments of data but not a whole host of contiguous data? I am assuming this conclusion would mean that a program like DBAN would, for all intents and purposes, truly securely erase your hard drive. I have been using it with the intent that it does just this. However, I stumbled upon this idea, recently, that it may in fact be possible to recover formerly overwritten data.

Interesting. This is something I will have to dig a little deeper into. Thanks.

s

No, there's no real chance of recovering fragments.

Products that advertise multiple passes are unneeded.

And this paper is probably the only real paper on the subject, so you can't dig deeper unless you try it out yourself.

Mike Anthony

A most interesting article, thank you. I have a question, if it's not too late to post it.

I'm a PC tech and frequently 'overwrite' HDDs (both old or newer types) using software that writes 0's. I've never bothered to use more than one pass, so it was interesting to read that a single pass is sufficient to render old data obsolete.

However, my purpose in overwrites was not for security reasons but only to clean the HDD completely so that a new system could be 'installed' over the top with no chance of interference from old data to new.

Which brings me to my actual question: If a drive is NOT overwritten with zeros, but simply completely overwritten by installing a new Windows system, is it possible that the new install will ever be 'upset' or 'derailed' by the presence of old 'unzeroed' data?

Here's a case in point: I've set up a new laptop for a customer. My work included many hours of tweaking and customization. It's now finished and works well.

However, it was just one of three new units that he ordered, so now I have to duplicate the process on the other two.

I frequently 'clone' HDDs, using a number of apps, but most usually Acronis Migrate Easy (not too good for SATA drives) or Acronis True Image. The cloning process is usually 100% effective, with the cloned drive being a byte-for-byte copy of the original. However, I normally 'zero' the target drive before commencing the clone, a process that can take several hours on a 500GB drive.

I'm considering skipping this step with the two remaining laptops, ie, I'd just clone the first, fully tweaked SATA HDD directly onto the other two drives without 'zero-writing' them first.

Your comments would be appreciated. Thank you.

Troy Jollimore

I think what he's saying is that it was a small probability of retrieving the value of a particular bit (like he said, a coin toss would statistically be more accurate) KNOWING what you were looking for. If you didn't know, that small probability would become infinitely smaller.

@Mike Anthony: Not the right place to ask this question, but the answer is pretty much no. What you're doing is thorough, but not required.

Casey

Sorry about coming late to the discussion. I have read the original document, and for a long time I've been a firm believer in one pass secure deletions. However there are some things I don't follow.

1) This is a pet peeve of mine. Why do articles such as this talk about writing bits to a disk? Every man and his dog who have read this far knows that bits are never, and have never, been written to a disk. Bits are represented as the presence or absence of flux transitions. Furthermore, the universal RLL coding systems used to write data to disks use more than one Clock Sync Point per data bit, so a 1 might be represented by tn, or nt, or nn, as indeed might a 0. Overwriting a one with a zero, and vice-versa, is incorrect and misleading.

2) Whilst the Linux dd utility may have been used to write the various bit patterns, this is absolutely not what was written to disk. User data is scrambled several times by the disk controller before being written to disk using a RLL coding system. Neither you nor I, nor anyone else, knows what has been written to the disk, without taking it apart and looking at it. So overwriting a one with a zero, and vice versa, could never be constructed.

3) RLL coding (using the common 2,7 as an example) is a method of substituting groups of two, three or four data bits with four, six or eight flux transition groups. Depending on which group it's in, a data bit 1 might be represented by tn, or nt, or nn, as indeed might a 0. To interpret a data bit would require identifying the group and its start position, in other words decoding the full group, or possibly several groups. Just reading the two flux transitions for one bit would be meaningless, as would be the probability of recovering one bit in isolation.

4) I know little about MFM techniques, or electron or atomic force microscopes, as they are variously called in the article. They apparently produce an image of the magnetic fields on the disk. Using that, how is the track and sector located? Even if a track can be seen, how is the correct track identified? How are sectors found? Starting at a random point, the wave form would have to be decoded without error until a sector header is found and interpreted. As this would be unlikely to be the desired sector, the headers/data/trailers/servo wedges/etc would have to be read until the correct sector is found. This seems a very onerous task.

5) PRML is "a method for converting the weak analogue signal from the head of a magnetic disk ... into a digital signal", and "complex statistically based detection algorithms are employed to process the analogue data stream as it reads the disk". I'll add the Viterbi algorithm: it's worth a read on Wikipedia. It's difficult to grasp how these complex statistically based detection algorithms could be applied manually to an image produced by a MFM scan. Indeed there's very little detail in the original paper about the MFM scan process and findings at all.

I have my suspicions that this paper is based on theoretical probability more than what actually can or can't be extracted using a MFM. However I would be quite happy to be told that I have grasped the wrong end of many sticks, as I often do. A few answers would be nice, though.

Sarah

There's a very basic statistical error in this article. It claims that, in some scenarios, recovery techniques actually perform *worse* than chance. This just means that the authors simply chose the wrong recovery technique: If you have a technique that performs correctly 36% of the time at recovering a random bit, then just flipping the output produces a recovery technique that performs correctly 64% of the time.

Consequently, there's really no such thing as "worse than chance" in a scenario like this. Any significant deviation from 50%, in *either* direction, indicates that some data can be recovered in some scenarios. Find me a casino game where I can win 64% of the time against even odds and I'd never leave the table. That game would be the best investment strategy available anywhere.

So the authors were able to get statistically significant results in some situations, presumably with a relatively limited time and budget. It's safe to say that governments have spent many millions of dollars, if not billions, pushing recovery techniques as far as they could. The authors make a strong case that accurately recovering a large fraction of the overwritten bits is quite difficult, but they also proved that some overwrite patterns leave behind significant amounts of information.

To me, these results prove four things. First, only an adversary with significant resources can likely recover any amount of overwritten data. Second, perfect recovery of large sequences of overwritten data is likely impossible. Third, a single overwrite with a predictable pattern leaves significant amounts of information behind. Finally, and most importantly, a single overwrite with random data is probably more than enough security unless you are guarding critical state secrets.

Dale

Sarah, of course something can perform worse than chance. Please read up more on statistics before making such claims. Anyone who's tried playing the stock market and done it badly will be all too familiar with this concept.

Of course an algorithm that performs worse than chance isn't a very good one, and could be used to produce one that performs better than chance. But that assumes the algorithm consistently (or at least on average) performs worse than chance, so unless you can determine the scenarios in which it performs worse than chance, or it on average performs worse than chance, you can't improve the average reliability.

Luis

For us lesser mortals, I wish there were some sort of table that spelled out more clearly the unlikelihood of data retrieval from a wiped drive.

For example:

Writing all "0"s has this percentage of possible retrieval.

Writing all "1"s has this percentage of possible retrieval.

Writing "0"s and "1"s this many times has this percentage of possible retrieval.

But then from the sound of it, just writing "0"s once in any case, all you might be able to retrieve are a few bits, and bits by themselves don't really make up usable data.