Need help in C#, thwarted by sealed class...

First, some perspective: this is my first C# program. I only really know PowerShell well, and I don't have much experience with languages like C#, C++, Java, etc.

I wrote a program in PowerShell to search for duplicate files, optionally delete them, and report statistics, but it's far too intensive for interpreted PowerShell: running it against 17,000 files took 5 hours. The script has optimizations, like checking whether sizes match and only then computing the MD5, computing an MD5 at most once per file, and so on. But it's still way too slow, so I wanted to redo it in C# for speed. My problem is that I want the list of files to have custom properties, so I can track whether a file has already been matched and/or checked against the rest. The way I wanted to do this is not allowed, because I wanted to extend the FileInfo class, but it is sealed. So what's a better way to do this? I've included the PowerShell script so you can see what I'm trying to do, along with the C# code I have.

/// <summary>
/// Adds the specified file to the collection. If the item specified is a directory,
/// that directory will be crawled for files, and optionally (RecurseFolders) child
/// directories. If the name part of the path contains wild-cards they will be
/// considered throughout the folder tree, i.e. C:\Temp\*.tmp will yield all files
/// having an extension of .tmp. Again, if RecurseFolders is true you will get all
/// .tmp files anywhere in the C:\Temp folder.
/// </summary>
public void Add(string fileOrDirectory)
{
    if (fileOrDirectory == null)
        throw new ArgumentNullException();

/// <summary>
/// Raised when a new file is about to be added to the collection; setting e.Ignore
/// to true will cancel the addition of this file.
/// </summary>
public event EventHandler<FileFoundEventArgs> FileFound;

If I make a class, FileInfoExt, that contains a FileInfo instance, then in order to expose the FileInfo members/properties through FileInfoExt I have to write a pass-through for each and every member/property of FileInfo I wish to use? That's kind of a pita...
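Not necessarily every member. A minimal sketch of the composition approach, where only the FileInfo itself plus the extra bookkeeping state is exposed (the property names here are purely illustrative, not anything from the original code):

// Hypothetical wrapper: holds a FileInfo plus the extra state the duplicate
// scan needs. Callers reach the real FileInfo through the Info property
// instead of the wrapper duplicating every member.
using System.IO;

public class FileInfoExt
{
    public FileInfoExt(FileInfo info)
    {
        Info = info;
    }

    public FileInfo Info { get; private set; }

    // Custom bookkeeping properties for the duplicate search.
    public bool AlreadyMatched { get; set; }
    public bool AlreadyChecked { get; set; }
    public byte[] Md5 { get; set; }          // computed lazily, at most once per file

    // Forward only the members used often, if that is more convenient.
    public long Length { get { return Info.Length; } }
    public string FullName { get { return Info.FullName; } }
}

Usage would then be along the lines of: var ext = new FileInfoExt(new FileInfo(path)); if (!ext.AlreadyChecked) { ... }.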

Your algorithm checks every file against every other file (though it at least bails out if certain criteria aren't met). This is extraordinarily expensive.

You should enumerate the files and first collect lengths. Then you can bucket all files based on their length. Any buckets that have only one element in them can be completely ignored. For buckets that have two elements in them, you can just do a byte comparison of the files. For buckets with more than two elements, you can generate hashes for the files and use those to determine matches.

// Ignore any cases where there is only one file of that length. Each 'set' in this
// list is the set of files of the same length where there are at least two files.
var potentialMatches = lengthToFilesMap.Where(kvp => kvp.Value.Count() >= 2)
                                       .Select(kvp => kvp.Value.ToList())
                                       .ToList();

var lookups = potentialMatches.AsParallel().Select(list =>
{
    // TODO(cyrusn): optimize the case of just two files. No need to md5 in that case.

    // TODO(cyrusn): pass in a proper comparer that will compare the md5 byte hashes
    // properly.

    // Realize the groupings, and filter down to groups that have at least two items in them.
    return lookups.SelectMany(lookup => lookup)
                  .Select(grouping => new HashSet<string>(grouping))
                  .Where(set => set.Count >= 2);
}
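For readers following along, here is a compact, self-contained version of the idea in those excerpts: group files by length, drop singleton groups, then confirm duplicates with a hash. This is my own illustrative sketch (MD5 as the hash is just an example), not the code being quoted above:

// Illustrative sketch: bucket by length, ignore singleton buckets, then confirm
// duplicates within each bucket by comparing full-file MD5 hashes.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinder
{
    public static IEnumerable<List<string>> FindDuplicates(IEnumerable<string> files)
    {
        // Bucket by length; any length seen only once cannot have duplicates.
        var byLength = files.GroupBy(f => new FileInfo(f).Length)
                            .Where(g => g.Count() >= 2);

        foreach (var lengthGroup in byLength)
        {
            // Within a length bucket, bucket again by full-file hash.
            var byHash = lengthGroup.GroupBy(f => HashFile(f))
                                    .Where(g => g.Count() >= 2);
            foreach (var dupes in byHash)
                yield return dupes.ToList();
        }
    }

    static string HashFile(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(md5.ComputeHash(stream));
    }
}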

The benefit of the nonparallel version is the ordering you get in the lists of duplicated files. Specifically, for any given "List<string>" of duplicate paths, they will be ordered in the same order they were found when traversing the paths provided to this function. So if you always want to preserve files from earlier paths passed in, and only want to delete any later duplicates you find, you can do so by then writing:

I don't get why C# would make any of this faster. You're certainly IO bound here and C#'s not going to make that any better.

Thanks for all the help, I really appreciate it; I'm going to pore over all the info you guys have given later today. But to answer your question, PowerShell is just really not a number-crunching language. I made a test: I created 1000 files, all of different sizes (so MD5s would not be computed), and ran the C# code on it. It took 1.8 seconds. I ran the PowerShell code on it as well; it took 1 minute 39 seconds. After you enumerate the files, it's mostly number crunching. It makes a huge difference when you have 17,000 files and 2 TB of data like I do, and I want to check it all for duplicates occasionally; like I said, it took 5 hours to check that much data with the PowerShell script, and I calculate it would take about 5 minutes with C# (assuming no huge files have size matches, which I don't think there are). In the past I've made a request for a PowerShell compiler, and recently I suggested on the PowerShell Connect web site that they try JITing the code like IE9 does (with one core interpreting and another core compiling, then switching to the compiled native code when it's available), but at this point PowerShell is just not useful for large number-crunching operations because it's interpreted, which is unfortunate because I would love to have a fast PowerShell to do such things with.

And the reason the file/directory enumeration code is "so complicated", as you put it, is that, as I said, I am a complete C# noob; before yesterday I had never read a C# book or looked at C# code. I just pieced together some stuff I found on the web and adapted other things I found for this function; that code is just what popped up in Google. But as I said, I'll look into your suggestion about making it simpler. I really have to: that code to enumerate the files is crashy (on access-denied dirs/files), it doesn't work on the root directory for some reason, and I have no skill to fix it.

I wrote a program in PowerShell to search for duplicate files, optionally delete them, and report statistics, but it's far too intensive for interpreted PowerShell: running it against 17,000 files took 5 hours.

At 100 MB/second, 2 TB will take about 5.5 to 6 hours just to read (2,000,000 MB ÷ 100 MB/s ≈ 20,000 seconds ≈ 5.6 hours). No computation: just reading.

If that 2 TB is spread across multiple spindles then the raw read time will be lower... but this is still the right ballpark. I'm not really sure what kind of speed-up you're looking for; this is an I/O-bound task. It's a mix of random I/O (seeks, directory enumeration) and probably a reasonable amount of sequential I/O (for files with matching sizes).

Now, if your data profile is such that no files actually have matching sizes, then yes, 5 hours is long--but not because of number crunching. You keep talking about PowerShell being ill-suited to number crunching. I don't know what you're talking about--you aren't doing any number crunching in PowerShell. You're using the .NET framework to do the only numerically intensive task, the MD5 computations. The performance of that will be identical whether you call it from PowerShell or C#. That's one of the points of .NET, really....

I don't get why C# would make any of this faster. You're certainly IO bound here and C#'s not going to make that any better.

Thanks for all the help, I really appreciate it; I'm going to pore over all the info you guys have given later today. But to answer your question, PowerShell is just really not a number-crunching language.

FWIW, your task has almost no number crunching at all. What you have is an I/O-reading + confidence-building problem. Specifically, you're starting with N files and essentially no confidence about which of them are duplicates. Using lengths gives you more confidence about which are duplicates, but certainly not enough to start doing destructive operations. Using MD5 hashes gives you the most confidence, but it is enormously expensive, especially if you have several large files of the same length that you are testing.

The fortunate thing about confidence testing is that it can often tell you when things are not the same *very quickly*. I.e., if one file is 1024 bytes and another is 1025, they are definitely not the same. So what you want to do is structure your algorithm to eliminate possibilities as quickly as possible, and only do the most expensive checking if all else fails.

For example, in the algorithm I provided, I would likely insert a middle step. This middle step would take the first 1K of the file and hash that. Only if the files collided after that first 1K would I actually go on to something more expensive. With the length and the hash of the first 1K, you'd almost immediately eliminate anything without duplicates. At that point you'd actually have fairly high confidence (though long files of the same length with differences only after the first 1K would still collide), and you could switch to the full hash just to make certain.

That would limit the amount of IO you would need to do enormously.
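A sketch of what that middle step could look like in C#; the 1K size and the choice of MD5 are just illustrative, and a single Read call is good enough for a sketch (strictly, Read may return fewer bytes than requested):

using System;
using System.IO;
using System.Security.Cryptography;

static class QuickHash
{
    // Hash only the first 1 KB of a file: cheap, and usually enough to split
    // apart same-length files that are not actually duplicates.
    public static string HashOfFirstKilobyte(string path)
    {
        var buffer = new byte[1024];
        int read;
        using (var stream = File.OpenRead(path))
            read = stream.Read(buffer, 0, buffer.Length);

        using (var md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(buffer, 0, read));
    }
}

Files would only go on to the full-file hash if they agree on both length and this first-1K hash.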

Quote:

I made a test: I created 1000 files, all of different sizes (so MD5s would not be computed), and ran the C# code on it. It took 1.8 seconds. I ran the PowerShell code on it as well; it took 1 minute 39 seconds.

I don't see how you can compare the two bits of code. They're very dissimilar.

Quote:

After you enumerate the files, it's mostly number crunching. It makes a huge difference when you have 17,000 files and 2 TB of data like I do, and I want to check it all for duplicates occasionally; like I said, it took 5 hours to check that much data with the PowerShell script, and I calculate it would take about 5 minutes with C# (assuming no huge files have size matches, which I don't think there are). In the past I've made a request for a PowerShell compiler, and recently I suggested on the PowerShell Connect web site that they try JITing the code like IE9 does (with one core interpreting and another core compiling, then switching to the compiled native code when it's available), but at this point PowerShell is just not useful for large number-crunching operations because it's interpreted, which is unfortunate because I would love to have a fast PowerShell to do such things with.

The problem here is that you are computing hashes in PowerShell itself. Don't do that. Use the built-in .NET functions that will do it for you.

Also, your algorithm is just *very* inefficient. You're repeatedly enumerating all your data when checking a single file. There's no need to do this, and it is quite costly.

Quote:

And the reason the file/directory enumeration code is "so complicated", as you put it, is that, as I said, I am a complete C# noob; before yesterday I had never read a C# book or looked at C# code. I just pieced together some stuff I found on the web and adapted other things I found for this function; that code is just what popped up in Google. But as I said, I'll look into your suggestion about making it simpler. I really have to: that code to enumerate the files is crashy (on access-denied dirs/files), it doesn't work on the root directory for some reason, and I have no skill to fix it.

That's fine. I've certainly been in that position before. You write a whole bunch of code, only to realize you're duplicating something the Fx provides for you. That said, you definitely don't want to write your own enumeration. You also don't want to write your own hashing. Frankly, you don't want to write much of anything. All you want to do is write a simple query over your data that does what it can to minimize the amount of I/O that you need to perform.

IME, from writing my own code to do this long ago, hashing isn't really worthwhile anyway. For the data I had, most file lengths are unique, and most replicated lengths are found at most in just a few files. I got the best performance from simply memcmp()ing the files piecemeal, and aborting as soon as a difference was encountered. There's no point in hashing (which must read the whole file) if a memcmp() will abort at the first byte because the files aren't in fact duplicates (which is the common case).

I can envisage that there are usage scenarios that may have different assumptions (for example, if you had a bunch of dated directories containing source code, which is a common version control (anti-)pattern)--lots of duplicate lengths, lots of duplicate files. But even then, it's not clear that hashing is useful. Hashing doesn't save I/O. In fact, it makes it worse: for hashing, you must read every file in its entirety, and on those rare occasions that you do have an identical hashcode, you should really make a bytewise comparison anyway to ensure it's not merely a collision. In no situation do you perform less I/O than you would in straight comparisons.

Heh, very good point DrPizza: hashing the whole file (for large files) when the first byte could be different is wasteful; I never considered that for some reason. Makes sense to, say, read 1K at a time, compare, then move on to the next 1K, and so on, stopping when there's a difference.
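A rough sketch of that chunked compare in C# (the block size is arbitrary here, and the helper names are just illustrative):

// Compare two files block by block, bailing out at the first difference so
// non-duplicates cost as little I/O as possible.
using System.IO;

static class FileComparer
{
    public static bool ContentsEqual(string pathA, string pathB, int blockSize = 1024)
    {
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;                         // different sizes: definitely not duplicates

        var bufA = new byte[blockSize];
        var bufB = new byte[blockSize];
        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            while (true)
            {
                int readA = ReadBlock(a, bufA);
                int readB = ReadBlock(b, bufB);
                if (readA != readB)
                    return false;
                if (readA == 0)
                    return true;                  // both streams ended: files are identical

                for (int i = 0; i < readA; i++)
                    if (bufA[i] != bufB[i])
                        return false;             // first differing byte: stop reading
            }
        }
    }

    // Fill the buffer as fully as possible (Stream.Read may return short reads).
    static int ReadBlock(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}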

Meta: I think the comparison is pretty valid. For the PowerShell code I disabled the segment that computes MD5s and deletes files, etc., so basically it just did a "dir <filespec>" and looped through each file with the inner and outer loops, the same thing the C# code did, and the C# code (with the FileInfoExt fixed so it works) is two orders of magnitude faster at that. I tried with 1000 files (1.8 seconds vs. 1 minute 39 seconds) and 1800 files (3.5 seconds vs. 6 minutes and some seconds).

And please excuse me for saying "number crunching"; I meant processing-intensive, versus I/O-intensive. In my tests, because the MD5 hashing was disabled, the only file I/O was the initial file enumeration. I'm not a programmer by profession or anything, just a very light computer hobbyist basically, so I apologize for mangling the terminology like that.

IME, from writing my own code to do this long ago, hashing isn't really worthwhile anyway. For the data I had, most file lengths are unique, and most replicated lengths are found at most in just a few files. I got the best performance from simply memcmp()ing the files piecemeal, and aborting as soon as a difference was encountered. There's no point in hashing (which must read the whole file) if a memcmp() will abort at the first byte because the files aren't in fact duplicates (which is the common case).

I can envisage that there are usage scenarios that may have different assumptions (for example, if you had a bunch of dated directories containing source code, which is a common version control (anti-)pattern)--lots of duplicate lengths, lots of duplicate files. But even then, it's not clear that hashing is useful. Hashing doesn't save I/O. In fact, it makes it worse: for hashing, you must read every file in its entirety, and on those rare occasions that you do have an identical hashcode, you should really make a bytewise comparison anyway to ensure it's not merely a collision.

To have a collision probability of 10^-18 (already more reliable than almost anything else in the system), this would require approximately 2^98 unique blocks (2^115 bytes at 128 KB per block) to be written, well beyond the limits of any foreseeable storage platform.
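For the curious, that figure is the standard birthday-bound estimate. Assuming a 256-bit hash (e.g. the SHA-256 used by ZFS dedup; the hash width is my assumption here, not stated above), the number of unique blocks n that can be written before the collision probability reaches p = 10^-18 is roughly

$$ p \approx \frac{n^2}{2 \cdot 2^{256}} \;\Rightarrow\; n \approx \sqrt{2 \cdot 10^{-18} \cdot 2^{256}} \approx 2^{98.6}, $$

and 2^98 blocks of 128 KB (2^17 bytes) is 2^115 bytes, which matches the numbers above.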

There are storage platforms built on the idea of using hashing for dedupe, so I think it's OK for Agressiva to do the same. Given that, I think the above approach is reasonable, though knowing more about his expected data sets would make it possible to better adapt the algorithm to his needs.

Meta: I think the comparison is pretty valid. For the PowerShell code I disabled the segment that computes MD5s and deletes files, etc., so basically it just did a "dir <filespec>" and looped through each file with the inner and outer loops, the same thing the C# code did, and the C# code (with the FileInfoExt fixed so it works) is two orders of magnitude faster at that. I tried with 1000 files (1.8 seconds vs. 1 minute 39 seconds) and 1800 files (3.5 seconds vs. 6 minutes and some seconds).

I don't have your code, so I can't verify or contradict what you're saying. All I have is the code above (the two versions of which use very different techniques to do what they're doing). If you provide me with the actual code that you're testing, I can try to figure out why they're behaving differently.

MD5 collisions can be effectively forced, so it doesn't seem "utterly paranoid" to me, assuming he wants some kind of a reasonably general-purpose solution: there are files floating around that are known to have matching hashes but different content. SHA-1 may be a liability too, though I believe the attacks at the moment are all theoretical, computationally.

MD5 collisions can be effectively forced, so it doesn't seem "utterly paranoid" to me, assuming he wants some kind of a reasonably general-purpose solution: there are files floating around that are known to have matching hashes but different content. SHA-1 may be a liability too, though I believe the attacks at the moment are all theoretical, computationally.

In any case, I'm still not seeing a good argument for hashing.

That's why I said we needed more information from him. Does he really need his system to be safe against someone deliberately trying to force a SHA collision? Or is this a normal file system with non-hostile users?

The benefit of hashing for me is the following. Suppose I have eight files, A through H, each exactly the same length (1 GB apiece).

All these files are the same length and none of them are identical. Without hashing, I need to compare A against B, then A against C... all the way up to A against H. That's 7 comparisons, each of which needs to read in 2 GB of data, so 14 GB read just to clear A. And at the end of that, I still need to test B against C and B against D, etc., etc. Across all 28 pairs it's a total of about 56 GB that I need to read in order to do this.

With hashing, I instead read each file once and compute a hash: A → hash1, B → hash2, ... H → hash8. Where none of hash1 through hash8 collide, I will only read a total of 8 GB, and I will be able to eliminate the need to do any more reading of these files.

Hashing also doesn't hurt (unless you're in some scenario where a hostile attacker is trying to force a collision), because the CPU can easily compute the hash faster than you can read the data. So since you'll need to read the data anyway, you might as well have the hash, as it prevents you from having to do costly compares of files that are very similar.

Now, let's also look at the case where the files are not merely similar but identical:

In your system you'll need to perform 14 GB of reads: 2 GB to see that A=B and remove B, then 2 GB to see that A=C and remove C, etc., etc. With a hashing system you only need to do 8 GB of reads. After the reads you can then use the hashes to say (with EXTREMELY high confidence) "B through H are duplicates of A and can be deleted safely".

Given that the default for ZFS dedup is to behave in this manner, I think it can be a sensible default for aggressiva as well. Now, if he comes back and says that no matter how low the probability is, it is unacceptable to trust matching hashes, then we can address this.

IMO, given a normal FS, I would think it acceptable to take the hybrid approach I mentioned before: check length, check the hash of the first page, then check the hash of the whole file. In practice I believe the first two checks alone would eliminate nearly all of the false positives, meaning that the running time of this approach would be O(D + N), where D is the size of all the duplicated files on the system and N is the number of files examined. If D is low, then all you're doing is enumerating the files and doing a small amount of I/O on each of them. This means, depending on how many duplicates you have, you will range between looking at every file just for its metadata plus a tiny bit of additional data, all the way to having to look at every byte on the system. IMO this is ideal (barring special smarts in the FS layer): you can't really avoid at least examining every file, and in the worst case you perform no more I/O than it takes to read each file once. Other approaches may have substantially worse worst cases, and that is what I was trying to avoid.

    // Actually materialize the list. That way if someone 'foreachs' numerous times over this
    // query, they don't incur the cost of actually doing all the disk traversals after the
    // first time.
    return query.ToList();
}

I've tried to structure it in a very easy and natural-to-understand manner. The simple pattern I use in the main query is one where you repeatedly 'bucketize' files based on some criterion and then ignore any buckets with only one file in them. Note that each successive 'bucketizing' step operates on the previous bucket, not on the entire set of files again. I.e., when you're comparing the hashes of the first 1K of the files, that's only for files that already matched based on length. Similarly, when you're bucketizing based on the entire hash, that's only for files that had the same length *and* had an initial hash that was the same. So each successive 'ToLookup' and 'Where' clause pair continually filters the potential match groups down to fewer and fewer elements until you are confident enough to actually return the values. If you know things about your data that would allow you to put fast checks in between the "cheap but inaccurate" length check and the "expensive but accurate" hash check, then feel free to do so. This gives you flexibility to tune the application to your specific needs.
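To make the chained pattern concrete, here is my own condensed sketch of it (not the quoted code itself); it reuses the first-1K helper sketched earlier in the thread, and the whole-file hasher is likewise illustrative:

using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ChainedBuckets
{
    // 'files' is a sequence of paths; each returned group holds paths believed identical.
    public static IEnumerable<IGrouping<string, string>> FindDuplicateGroups(IEnumerable<string> files)
    {
        return files
            .ToLookup(f => new FileInfo(f).Length)                 // bucket 1: by length
            .Where(g => g.Count() >= 2)
            .SelectMany(g => g
                .ToLookup(f => QuickHash.HashOfFirstKilobyte(f))   // bucket 2: by hash of first 1K
                .Where(g2 => g2.Count() >= 2)
                .SelectMany(g2 => g2
                    .ToLookup(f => HashWholeFile(f))               // bucket 3: by full-file hash
                    .Where(g3 => g3.Count() >= 2)));
    }

    static string HashWholeFile(string path)
    {
        using (var md5 = System.Security.Cryptography.MD5.Create())
        using (var stream = File.OpenRead(path))
            return System.BitConverter.ToString(md5.ComputeHash(stream));
    }
}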

That's why I said we needed more information from him. Does he really need his system to be safe against someone deliberately trying to force a SHA collision? Or is this a normal file system with non-hostile users?

A lot of filesystems (on shared-use systems) do have hostile users.

Quote:

All these files are the same length and none of them are identical. Without hashing, I need to compare A against B, then A against C... all the way up to A against H. That's 7 comparisons, each of which needs to read in 2 GB of data, so 14 GB read just to clear A. And at the end of that, I still need to test B against C and B against D, etc., etc. Across all 28 pairs it's a total of about 56 GB that I need to read in order to do this.

Yes, because it's not possible to hold the buffered content of more than two files in memory simultaneously?

The way I would do this is to create a queue of sets of possibly identical files, along with a file offset meaning "guaranteed identical up to here". Then your operation is: dequeue one set. If there is only one element in the set, discard it and move on; otherwise start doing an N-way byte-by-byte comparison from the indicated offset. As soon as you get a difference, put all the files that don't agree with the first member in a new set, push it onto the queue, and continue until your current set is down to one file (unique) or end-of-file. This method is optimal in that it will only read each file up to the point at which it becomes unique, and at most once.
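A rough sketch of that queue-of-sets idea in C#, as I read it (untested, and byte-at-a-time reads are used only to keep the sketch short; a real version would read in larger blocks):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class NWaySplitter
{
    class WorkItem
    {
        public List<string> Files;   // possibly identical files
        public long Offset;          // guaranteed identical up to here
    }

    // Takes a group of same-length files; yields the subsets that are truly identical.
    public static IEnumerable<List<string>> FindIdenticalSets(IEnumerable<string> sameLengthFiles)
    {
        var queue = new Queue<WorkItem>();
        queue.Enqueue(new WorkItem { Files = sameLengthFiles.ToList(), Offset = 0 });

        while (queue.Count > 0)
        {
            var item = queue.Dequeue();
            if (item.Files.Count < 2)
                continue;                               // unique file: nothing left to compare

            var files = item.Files;
            var streams = files.Select(f => File.OpenRead(f)).ToList();
            try
            {
                foreach (var s in streams)
                    s.Position = item.Offset;

                long offset = item.Offset;
                while (true)
                {
                    int first = streams[0].ReadByte();

                    // Split off every file whose byte at 'offset' disagrees with the first member.
                    var disagree = new List<string>();
                    for (int i = streams.Count - 1; i >= 1; i--)
                    {
                        if (streams[i].ReadByte() != first)
                        {
                            disagree.Add(files[i]);
                            streams[i].Dispose();
                            streams.RemoveAt(i);
                            files.RemoveAt(i);
                        }
                    }
                    if (disagree.Count > 0)
                        queue.Enqueue(new WorkItem { Files = disagree, Offset = offset });

                    if (first == -1)                    // end of file: the survivors are identical
                    {
                        if (files.Count >= 2)
                            yield return files;
                        break;
                    }
                    if (files.Count < 2)                // set is down to one unique file
                        break;

                    offset++;
                }
            }
            finally
            {
                foreach (var s in streams)
                    s.Dispose();
            }
        }
    }
}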

The biggest downside I can see to this approach is that you have to read in largish blocks (access time × sustained transfer rate: at least a few MB) to avoid thrashing your drive too badly. If there are many files of the same length you could have memory issues. You could resort to hashing in that case, or you could make the block size dependent on the cardinality of the set. If you have hundreds of files of the same size, start doing comparisons with small blocks. Unless your use case is really pathological, most will compare unequal quickly and you can bump up the block size to something more optimal for the disk. If for some reason you have to deal with huge numbers of identical or nearly identical files, then hashing may be the way to go.

The real use case for hashing is if you are going to save the results for future use. If you can cache the hash values between runs and guarantee you can detect whether files change between runs, you can completely eliminate reading a file on the second run. This is easy to do if you are writing the filesystem yourself, a la ZFS; possibly harder in other cases.
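A sketch of what such a cache could look like outside the filesystem, using length plus last-write time as a cheap (not bulletproof) change check; all names are illustrative and the on-disk persistence is left out:

// Illustrative in-memory cache of file hashes, keyed by path and validated
// against length and last-write time so unchanged files are never re-read.
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class HashCache
{
    class Entry
    {
        public long Length;
        public DateTime LastWriteTimeUtc;
        public string Hash;
    }

    readonly Dictionary<string, Entry> entries =
        new Dictionary<string, Entry>(StringComparer.OrdinalIgnoreCase);

    public string GetHash(string path)
    {
        var info = new FileInfo(path);

        Entry cached;
        if (entries.TryGetValue(info.FullName, out cached) &&
            cached.Length == info.Length &&
            cached.LastWriteTimeUtc == info.LastWriteTimeUtc)
        {
            return cached.Hash;                    // unchanged since last run: no read needed
        }

        string hash;
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
            hash = BitConverter.ToString(md5.ComputeHash(stream));

        entries[info.FullName] = new Entry
        {
            Length = info.Length,
            LastWriteTimeUtc = info.LastWriteTimeUtc,
            Hash = hash
        };
        return hash;
    }

    // Persisting 'entries' to disk between runs (and reloading it at startup) is
    // left out here; a simple one-line-per-file text format would do.
}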

Now if only there was a dedupe program that could ignore metadata in media files and compare only the data streams. I must have 6 different copies of I Am The Walrus floating around on my computer after years of copying files to various places and fixing metadata differently in all of them. :sad:

Do I have to add some 'using X' directives or something? My paths variable is a string array of paths.

I just want a list of FileInfo objects for all the files in the paths variable, and/or all the files under all the paths in the paths variable, with optional recursion into directories, like the PowerShell command: $files = (dir $paths -Recurse)

OK, never mind, I got it to compile in Visual C# 2010 Express. Now, why does that code stop in $Recycle.Bin when I run it on C:\ with recurse set to true? It lists all 17,000 files on my G: drive, but only 24 on my C: drive, stopping, as I said, in $Recycle.Bin.

How can I get FileInfo objects instead of strings for the file listing, and how can I get it not to stop in $Recycle.Bin on my C: drive? I assume I have to catch more exceptions? Can you just specify to ignore/discard any exceptions? And also, why does it crash if I specify a file name instead of a directory?

What I want exactly: I have a list of file names, file specs, and directories in a variable, e.g. paths = {"c:\file.dat", "c:\notes\*.txt", "c:\data\"}, and I want back all files that match the file names and file specs, plus the files in the directories, as FileInfo objects.

Hmm, maybe I mixed up something earlier; anyhow, now I am getting the "error" message I print on exception e, and the app quits. How can I get the app to continue and just ignore invalid/access-denied files/paths?
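For what it's worth, here is a rough sketch of one way to do that kind of enumeration while skipping anything unreadable: manual recursion, so a single access-denied directory (like $Recycle.Bin) doesn't abort the whole walk. The wildcard handling for specs like c:\notes\*.txt is deliberately simplistic, and all names are illustrative:

// Illustrative: expand a list of files, wildcard specs, and directories into
// FileInfo objects, silently skipping entries we can't read.
using System;
using System.Collections.Generic;
using System.IO;

static class FileEnumerator
{
    public static List<FileInfo> Expand(IEnumerable<string> paths, bool recurse)
    {
        var results = new List<FileInfo>();
        foreach (var path in paths)
        {
            if (File.Exists(path))
            {
                results.Add(new FileInfo(path));              // plain file name
            }
            else if (Directory.Exists(path))
            {
                AddDirectory(new DirectoryInfo(path), "*", recurse, results);
            }
            else
            {
                // Treat "c:\notes\*.txt" as directory + wildcard pattern.
                var dir = Path.GetDirectoryName(path);
                var pattern = Path.GetFileName(path);
                if (!string.IsNullOrEmpty(dir) && Directory.Exists(dir))
                    AddDirectory(new DirectoryInfo(dir), pattern, recurse, results);
                // Otherwise the path doesn't exist; ignore it rather than crash.
            }
        }
        return results;
    }

    static void AddDirectory(DirectoryInfo dir, string pattern, bool recurse, List<FileInfo> results)
    {
        try
        {
            results.AddRange(dir.GetFiles(pattern));
        }
        catch (UnauthorizedAccessException) { }               // e.g. access-denied folder: skip it
        catch (IOException) { }

        if (!recurse) return;

        DirectoryInfo[] subdirs;
        try
        {
            subdirs = dir.GetDirectories();
        }
        catch (UnauthorizedAccessException) { return; }
        catch (IOException) { return; }

        foreach (var sub in subdirs)
            AddDirectory(sub, pattern, recurse, results);      // keep walking past failures
    }
}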