Detect “extra found files” which are contained within valid actual files

Posted: Tue May 07, 2019 2:10 am

by abolibibelot

Hi,

Is there a way to detect if a file found by R-Studio and appearing in the “Extra found files” section of the recovery tree is entirely contained within a valid actual file present in the “Root” section (found by the analysis of the filesystem)?
File detection by file signature analysis is not perfect and can lead to false positives, when a byte sequence that happens to match a signature is found within what is actually a valid and complete file. For some file types in particular, the number of false positives or redundancies can be staggering. I analysed a 4TB HDD which had just been formatted and was almost full before that. I haven't extracted the files yet, but it's highly likely that most of them are in perfect condition. And yet, beyond the 3.78TB of files found in “Root” (which is already a tad more than the drive's capacity), R-Studio found no less than 14.87TB of “Extra found files”. Within that mountain of data, 14.51TB are in “Multimedia Video”, 11.12TB are in “VOB video files”, and 580.89GB are in “MPEG Video”. Among those VOB and MPG files, many are in fact contained within each other, with each subsequent file being 2048 bytes shorter than the previous one, most likely because there is an MPG or VOB file signature every 2KB inside a valid VOB or MPG file. And most likely, all those “extra found” VOB and MPG files are in fact entirely contained inside DVD images which are present in “Root” and are probably 100% valid and complete. R-Studio should detect that, or at least let the user know that those files are 100% redundant, and should treat any sequence of contiguous 2KB segments, each starting with a VOB or MPG signature, as a single file.

(Here “.X” and “VTS_01_2.VOB” are most likely duplicates – see below – while, for instance, “0166563.vob” to “0166567.vob” are all mostly identical files, each subsequent one being 2048 bytes shorter than the previous one, all of them being most likely contained within an actual .vob file found through the filesystem analysis, but there's no easy way to detect this.)
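To illustrate the kind of check I have in mind, here is a minimal Python sketch. It is not how R-Studio works internally; it assumes that each file can be described as a single contiguous run of bytes on the device (no fragmentation) and that the files in “Root” do not overlap each other. The `Extent` type, the `find_contained` function, and the offsets in the example are all made up for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple


class Extent(NamedTuple):
    """A file seen as one contiguous byte range on the device (hypothetical model)."""
    name: str
    start: int   # offset of the first byte on the device
    length: int  # size in bytes

    @property
    def end(self) -> int:
        return self.start + self.length


def find_contained(root_files, extra_files):
    """Map each extra file name to the Root file that fully contains it, if any.

    Assumes contiguous files and non-overlapping Root extents.
    """
    roots = sorted(root_files, key=lambda e: e.start)
    starts = [r.start for r in roots]
    contained = {}
    for extra in extra_files:
        # Last Root file starting at or before the extra file's first byte.
        i = bisect_right(starts, extra.start) - 1
        if i >= 0 and extra.end <= roots[i].end:
            contained[extra.name] = roots[i].name
    return contained


# Example with invented offsets: a 10-sector VOB file and three signature hits.
SECTOR = 2048
root = [Extent("VTS_01_2.VOB", 500 * SECTOR, 10 * SECTOR)]
extra = [
    Extent("0166563.vob", 501 * SECTOR, 9 * SECTOR),
    Extent("0166564.vob", 502 * SECTOR, 8 * SECTOR),
    Extent("orphan.mpg", 5_000_000, 2 * SECTOR),
]
print(find_contained(root, extra))
```

With these invented values, the two “.vob” hits are reported as contained in “VTS_01_2.VOB” and “orphan.mpg” is left out. Fragmented files would need a list of runs per file instead of a single range.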

Side questions / suggestions :

– I read in the version changes for R-Studio 8.10.173857 that there was a new option to automatically exclude files in “Extra found files” which are strict duplicates of files found in “Root” (with the exact same name, size, attributes and location in this case – and if extracted at the same time those files are extracted as hard links), but I couldn't find that option. I also found mention of advanced information added to the “Technician” edition regarding “overlapping files”. Could this be exactly what I just requested above? But the price tag of the “Technician” license is way too steep for me at the moment...

– There should be an option to sort files by type / extension. Generally speaking, I would prefer to exclude files in “Extra found files” which are duplicates of files found in “Root” (files found by analysis of the partition's filesystem), to reduce the clutter and avoid having to uncheck them manually. (By the way, it's still much slower to check / uncheck a large number of files within a directory than to check / uncheck that whole directory: when there are thousands of files to check / uncheck, the whole program can stay frozen for several minutes.) But it can be useful to display them in order to detect which files no longer have their original content because they have been overwritten. For instance, if I find an MP3 file within the JPG directory, it means that this MP3 file has been overwritten by a JPG file and is therefore no longer valid as an MP3 file. So, in “Extra found files”, I usually extract all files which have no counterpart in “Root” (and are truly “extra found”), plus files which do have a counterpart but are of a different type from that indicated by their extension. As R-Studio extracts those files as hard links, that doesn't take more space, and it's a reminder that those files are not valid, or are not what they appear to be. But as it is now, I have to detect those abnormal / unexpected file extensions by going through the whole list, and check those few files manually. As I said, I haven't yet found the option to exclude all files which are hard-linked with files from “Root”, so I don't know what it looks like, but it would be nice to also provide the option to display only files from “Root” which do not have an extension consistent with their detected signature. (Obviously there will be many such files within directories like “Text document”, since many valid files contain plain text and can be detected as such.)

– Why is there no longer a right-click menu option to go from one hard-linked file to the other(s)? (This could be used for instance in cases like those described above, to directly examine a folder containing at least one file with invalid contents and check if other files in that folder have been overwritten as well, in which case it might not be worth extracting it, especially if it used to contain large files.) In its place, there's a “Get info” menu, which is nice, but it seems to systematically make the whole software crash a few seconds after it's been displayed. (On Windows 7 64-bit.)

– An option to calculate MD5 checksums during the scan would be very nice, as it would make it possible to detect duplicates before extraction, and it shouldn't slow down the analysis too much. An option to calculate the MD5 for individual files after the scan would be useful as well. That 4TB HDD I'm analysing was most likely in some kind of NAS enclosure with a Linux-based operating system, and it contains a lot of duplicates of large video files inside a “@sync” directory, with names like “.7”, “.Y”, “.Q”... (Those are not hard-linked with the actual files found in other (user-defined) directories, and have different timestamps, so as it is, it's not possible to be certain that they are indeed strictly identical before extracting all of them and comparing them.) If checksums could be optionally displayed in a column alongside the size, it would also make it possible to quickly check if some (large) files found on a given storage device, with the same size and the same or a similar name as files already present on another device, are indeed true duplicates, in which case it may not be necessary to extract them, or if they're different (in which case the one about to be recovered is most likely partially overwritten, but it could also be that the one already present elsewhere and directly accessible has been corrupted at some point).

– Files in “Extra found files” should have a unique name, based on the sector number or cluster number for instance, which would not change if the same device is analyzed again with different options, in particular a different “known file types” selection, or analyzed again with a newer version of R-Studio, or if the volume has been altered between two analyses. The fact that the names are random makes it difficult to identify files which are identical or similar. Once, I started analyzing a 3TB HDD with R-Studio 7.7, extracting everything it had found, then I analyzed it again with R-Studio 8.0, which had improved the detection of MP4 and WMV files for instance, and I spent quite a lot of time doing extensive comparisons with a duplicate files finder (DoubleKiller) so as to identify partial duplicates with different sizes based on their first few sectors, and compare them to keep the better one (for most of them, when they were different, the one extracted by R-Studio 8.0 was more complete, but for some neither was complete and the one extracted by the more recent version had added garbage to a truncated or fragmented file). It would have been much quicker if those files had had the same name.

– The information provided in the text file generated by the “Save file names to file” option should be more complete, and include at least the size and timestamps, like Recuva's similar “Save list to text file” feature. Other useful details would be the first LBA (first cluster number), and the MD5 if it's been calculated, in case this suggestion is implemented at some point. There could be a list of items to check / uncheck, or a customizable template, so as to reduce the clutter when only some of those details are required.

– In the Hex editor (by the way, I've never been able to actually edit anything with it, even after checking “enable write” in the settings, so it's more like a hex analyzer), it could be useful to be able to select the whole list of sectors of a file, or a part of them, with the SHIFT / CTRL keys. (Here is one instance where it would have been useful: I could extract such a list with Recuva or HD Sentinel, but R-Studio only allows copying one value at a time, which is not practical when there are hundreds or thousands of lines.)

– When there are two folders with the same name, extracted at the same time, they are merged with no warning and the timestamps applied are those of the last one extracted. It should work as it does for files, with a warning and an option to rename automatically.
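Regarding the MD5 suggestion above: until something like that exists in the program, the comparison can only be done after extraction. Here is a minimal Python sketch of how I would group extracted files into sets of true duplicates. The function names `md5_of` and `find_duplicates` are my own, and this says nothing about how R-Studio might implement it. Files are first grouped by size, so only files sharing a size with at least one other file get hashed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return lists of paths whose size and MD5 are both identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[Path(path).stat().st_size].append(path)

    by_hash = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        for path in group:
            by_hash[(size, md5_of(path))].append(path)

    return [sorted(map(str, group)) for group in by_hash.values() if len(group) > 1]
```

Calling `find_duplicates` on the extracted “@sync” files together with the files from the user-defined directories would list which of the “.7”, “.Y”, “.Q”... files are byte-for-byte copies, regardless of their names or timestamps.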

I hope that this is all clear enough, and not too overwhelming for a single post!

Re: Detect “extra found files” which are contained within valid actual files

– I read in the version changes for R-Studio 8.10.173857 that there was a new option to automatically exclude files in “Extra found files” which are strict duplicates of files found in “Root” (with the exact same name, size, attributes and location in this case – and if extracted at the same time those files are extracted as hard links), but I couldn't find that option.

This option is "Do not recover duplicate files from Extra Found Files" on the Recover dialog box.

Re: Detect “extra found files” which are contained within valid actual files

Posted: Wed May 08, 2019 11:36 pm

by abolibibelot

As a follow-up, here is another screenshot showing the contents of that “@sync” directory, with a nested structure of subfolders named with 1 character. Apparently the Ext4 filesystem is case sensitive, and here letters of both upper and lower case are employed. When extracting that directory to an NTFS partition on Windows 7, R-Studio merges the contents of subfolders “a” with “A”, “b” with “B”, etc., with no warning, so the extracted structure does not accurately reproduce the original structure (and it would be very tedious to fix manually as there are 1620 subfolders -- I would have to first extract all the lower-case subfolders, then rename them, then extract the upper-case subfolders... this is not going to improve my O.C.D. tendencies!).

Re: Detect “extra found files” which are contained within valid actual files

Thank you for reporting! We're looking into this problem.

Re: Detect “extra found files” which are contained within valid actual files

Actually, you cannot have both “A” and “a” folders on NTFS under Windows. We're going to add a specific option that would create something like “A” and “a_” when recovering to a case-insensitive environment.
If you need an immediate solution, this article may be of some help: Turn On Windows 10 NTFS Case Sensitivity.
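For readers who want to work around this before such an option exists, the renaming idea can be sketched in a few lines of Python. This is only an illustration of the “A” and “a_” scheme mentioned above, not the actual R-Studio implementation. The function name `plan_names` is hypothetical, and `str.casefold()` is used as an approximation of how Windows compares names (NTFS uses its own upper-case table, so edge cases may differ).

```python
def plan_names(names):
    """Return {original_name: safe_name} for the entries of one directory.

    Names that collide when case is ignored get trailing underscores
    until they are unique on a case-insensitive target.
    """
    used = set()  # case-folded names already taken in the target directory
    plan = {}
    for name in sorted(names):
        candidate = name
        while candidate.casefold() in used:
            candidate += "_"
        used.add(candidate.casefold())
        plan[name] = candidate
    return plan


# Example: the single-character subfolders from the "@sync" directory.
print(plan_names(["a", "A", "b", "B"]))
```

Because the names are processed in sorted order, the upper-case folder keeps its name and the lower-case one is renamed, so “a” becomes “a_” and “b” becomes “b_”. The mapping would have to be applied per directory, before the folders are written to the target.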