Is there a better way to store this data?

04-21-2010, 01:48 PM

I'm beginning a Proof of Concept that will store a checksum of each and every file of each and every server on my lan. I want to keep repeat data to a minimum, and still be able to "reconstruct" and search for files based on file size, names, locations and even the checksum's (hashes) themselves. I've attached a CSV that should give you a indication on how I envisioned the DB to be laid out. What I am curious about is if can be improved, if I can make any other optimizations to keep the DB as "light" as possible.
The reason it's designed the way it is should be obvious, however the "path-to-hash" section still troubles me that there will be a lot of repeated data over and over in all the fields.
I have hundreds of workstations and servers, so keeping repeat data and overall DB size to a minimum is an obvious design goal. As such, Filenames are independent of file hash(checksum), and ID numbers (in base36) are used instead. Exact copies of files will hash the same, even if the name is changed, so again hash's are independent of name. Root's, like partition's and drive letters are also independent of the rest of the path. Paths are also reduced to base36 number ID's. All of that seems well and good, but when cataloging which file names, their hash's, paths and on which computer they are on, it seems like I could probably reduce this further, but I'm wondering if I've already crossed a point of diminishing returns? Suggestions, questions and people telling me I'm crazy are welcome.

Comment

Your example is too complex for me to take the time to understand fully, but I think you are asking the wrong question. Start with a fully normalized design, and if you need to for performance, then denormalize it -- but not until you prove the need.

Comment

I agree with xaprb. With decent/proper indexes I'm not sure you're going to run into problems. Perhaps it's as simple as 1 table:
file_hash, file_name, file_path (machine, root, full path)

Obviously pulling the machine info and file hash out to new tables would be a smart move since multiple columns define each.

However, the duplicated paths would probably bug me too - way too much redundant data.

I guess it depends on how you're using the data. If you always search by file name or file hash the path is just a look up value. If, on the other hand, you want to see what all is in a directory on a specific machine or to compare the same directory on multiple machines then your version seems like it would work better.

I don't think I understand the "Filename-to-hash" table. I think some table would contain the ID from "Path-to-hash" (where you define properties of each file on each system) and "hashes" (where you define a unique file across the network). Honestly, you could add the hash ID column to "Path-to-hash" which would give you the entire path to each file and the hash in one spot. The "Filename-to-hash" table would go away, queries could be on any key, look ups would be simple.

Troy

Comment

Thanks for the replies!
The Filename_to_hash was part of not having to repeat filenames, so setup.exe or calc.exe, those (examples) filenames would only exist once in the DB, but they are common names so I should have a way to derive which hashes use that same filename. Different versions of files all have the same name, but their hash's are different. Even completely different files have common names, install/readme/contrib/kernel... etc... so I wanted to save space by only storing unique filenames, paths and root's.

I do need to "recall" where each file is on the network. The system may one day be used for a backup system as well as our forensics and compliance efforts. Again it's Proof of Concept currently, but I would like to reduce as many redundancies as possible in case this POC does come to fruition and can be rolled out on a global scale. Speed is always nice, but space saving is of the utmost importance currently. I can wait a minute or more if we need to run reports. An example scenario I've been using is pc_009 detects a virus in folder c:\windows\system32, with a daily dump of the machines hash's I could forensically give an estimate of when the file arrived, and possibly use it's hash to see if others might have the same file, even if hash wasn't used as a locator, perhaps the name is unique enough to locate on others... Again we have thousands of machines, and the average csv (root, path, filename, size, hash_1, hash_2)for them is 15megs or more, reducing to unique path's shaves off a few megs, and unique hash's down another 1 meg.

If you could, lay out a CSV like the one I did with your suggestions, I'm very visual and I am a novice who's ideas severely exceed his skill set. Thanks again, I do appreciate your input!
-rich