[HOWTO] File Location Catalog

Question

I've been seeing quite a few requests about knowing which files are on which drives, in case you need to recover unduplicated files. I know dpcmd.exe has some functionality for listing all files and their locations, but I wanted something I could "tweak" a little better to my needs, so I created a PowerShell script to get exactly what I need. I decided on PowerShell, as it allows me to do just about ANYTHING I can imagine, given enough logic. Feel free to use this, or let me know if it would be more helpful "tweaked" a different way...

Prerequisites:

You gotta know PowerShell (or be interested in learning a little bit of it, anyway)

All of your DrivePool drives need to be mounted as a path (I chose to mount all drives as C:\DrivePool\{disk name})

Your computer must be able to run PowerShell scripts (I set my execution policy to 'RemoteSigned')

I have this PowerShell script set to run each day at 3am, and it generates a .csv file that I can use to sort/filter all of the results. Need to know what files were on drive A? Done. Need to know which drives are holding all of the files in your Movies folder? Done. Your imagination is the limit.
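If you want the same daily 3am run, here's a minimal sketch using the built-in ScheduledTasks cmdlets (Windows 8 / Server 2012 and later); the script path is a placeholder for wherever you save the script:

```powershell
# Register a daily 3am task that runs the catalog script.
# C:\Scripts\DPFileList.ps1 is a placeholder path -- adjust to yours.
$action  = New-ScheduledTaskAction -Execute 'powershell.exe' `
    -Argument '-NoProfile -ExecutionPolicy RemoteSigned -File C:\Scripts\DPFileList.ps1'
$trigger = New-ScheduledTaskTrigger -Daily -At 3am
Register-ScheduledTask -TaskName 'DrivePool File Catalog' -Action $action -Trigger $trigger
```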

Here is a screenshot of the .CSV file it generates, showing the location of all of the files in a particular directory (as an example):

Here is the code I used (it's also attached in the .zip file):

# This saves the full listing of files in DrivePool
$files = Get-ChildItem -Path C:\DrivePool -Recurse -Force | Where-Object { !$_.PsIsContainer }

# This creates an empty table to store details of the files
$filelist = @()

# This goes through each file, and populates the table with the drive name, file name and directory name
# (the Substring offsets assume my mount paths, e.g. C:\DrivePool\{disk name} -- adjust them for yours)
foreach ($file in $files) {
    $filelist += New-Object psobject -Property @{
        Drive         = $file.DirectoryName.Substring(13,5)
        FileName      = $file.Name
        DirectoryName = $file.DirectoryName.Substring(64)
    }
}

# This saves the table to a .csv file so it can be opened later on, sorted, filtered, etc.
$filelist | Export-Csv F:\DPFileList.csv -NoTypeInformation
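Once the CSV exists, the sorting/filtering can also be done straight from PowerShell; a quick sketch (the drive and folder names here are examples):

```powershell
# Load the catalog produced by the script above.
$catalog = Import-Csv F:\DPFileList.csv

# Which files were on a given drive? ('DiskA' is an example mount name.)
$catalog | Where-Object { $_.Drive -eq 'DiskA' } | Select-Object FileName, DirectoryName

# Which drives hold the files under a Movies folder?
$catalog | Where-Object { $_.DirectoryName -like 'Movies*' } |
    Group-Object Drive | Select-Object Name, Count
```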

Let me know if there is interest in this, if you have any questions on how to get this going on your system, or if you'd like any clarification of the above.

Recommended Posts

Is there any way this could be used to restore only the missing files in the event of a drive failure?

Good day.

Of course; this is kinda the whole point. Do you have Excel? You can load a CSV from before the loss of a drive and a CSV from after the loss, and compare them. Excel has a function called VLOOKUP: load both files into the same workbook as sheets, add a column to one of them, and VLOOKUP the file paths in that sheet against the paths in the other; whatever fails to match is what you've lost. You could set up conditional highlighting to do the same thing (I think). Once you've got the list in Excel, you can sort it and copy/paste it into a text file. You can then automate the recovery by writing a small batch script that reads the text file and copies the missing files from backup back onto the pool.
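If you'd rather skip Excel, the same before/after comparison can be sketched with Compare-Object (the file paths here are examples, and the columns assume the catalog script's output):

```powershell
# Catalogs from before and after the drive loss (example paths).
$before = Import-Csv F:\DPFileList-before.csv
$after  = Import-Csv F:\DPFileList-after.csv

# Rows present only in the "before" catalog are the lost files.
Compare-Object -ReferenceObject $before -DifferenceObject $after `
    -Property DirectoryName, FileName |
    Where-Object { $_.SideIndicator -eq '<=' } |
    Export-Csv F:\DPMissingFiles.csv -NoTypeInformation
```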

If you do not use duplication at all, then it's even easier, just sort by disk and whatever's on the lost drive is what you need to recover.


Nice. Personally I just run 'tree' to a .txt file in my Google Drive folder, but I script it like yours.

Edited because I thought I might as well share what's in my script, in case anyone prefers text files. I have that folder regularly backed up to Google Drive. I happen to like the format, but yeah, the fancier format allows some nice things too. I run it weekly, as an 80MB file is not insignificant.

I just do the folders currently, because otherwise it becomes huge. I've been thinking about moving it off the SSD though, and if I do, I'll make it produce two files at a time: one with the files included, because trying to go through the entire structure with files is a bit intimidating, honestly, and hard to glance through.


If the cause is what I think it is (your folder/file paths are over the 255-260 character limit), there may not be a lot that can be done, except to fix the issue directly by renaming the files/folders so the paths are shorter.

You could use a utility like Bulk Rename Utility (http://www.bulkrenameutility.co.uk/Main_Intro.php) to rename files (if there are too many to do manually). You could also use the 'subst' command to temporarily create a drive letter shortcut to the file path, then use that drive letter to move/adjust as necessary.
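The 'subst' trick looks like this (the long path is just an example):

```powershell
# Map a deeply nested folder to a temporary drive letter...
subst X: "C:\DrivePool\Disk1\Some\Very\Deeply\Nested\Folder"

# ...work with the much shorter X:\ paths, then remove the mapping:
subst X: /D
```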

Hope that helps...maybe someone else can chime in if there's a better way?


Just a warning, something that got me recently: using -Recurse on Get-ChildItem makes it follow all symbolic links too, in case you have any. It might even follow through to .lnk file targets; I haven't tested that.
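One partial workaround is to filter out reparse points (symlinks, junctions, mount points) from the results. Note this is only a sketch: it drops the linked items themselves from the listing, but in Windows PowerShell, -Recurse may still descend into junctioned directories.

```powershell
# List files while excluding anything that is a reparse point.
$files = Get-ChildItem -Path C:\DrivePool -Recurse -Force |
    Where-Object { !$_.PsIsContainer -and
                   -not ($_.Attributes -band [IO.FileAttributes]::ReparsePoint) }
```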


I really liked the idea of doing this, but as others had also mentioned, I had issues with the maximum path limitation. I also wanted to log other information, like the drive model, serial number, etc. I also wanted to schedule it as a daily task, and to have it automatically compress the resulting files to save disk space.

I wrote a PowerShell script to do all of that. It relies on the dpcmd utility's output (you will need a recent version of dpcmd). DrivePool itself isn't limited by the maximum path length, so the dpcmd log output isn't constrained by it either. The script takes this log output, parses it with regex matches, retrieves the associated disk information, and then writes out a CSV, which it then compresses. The header of the file has a command you can paste into a CMD prompt to schedule it to run automatically every day at 1AM.

Please edit the file. Two variables need to be customized before using it, and the file describes requirements, where to put it, how to schedule it, what it logs, etc.

If you want to do a test run, you can just edit the two variables and then copy/paste the entire file into an elevated PowerShell window. The .CSV (generated once the .LOG is finished) can be viewed as it's being produced.

Also, if you're not familiar with PowerShell scripting or programming in general, you might want to hold off playing with this until a few other people report that they're making use of it without any issues.

@Christopher - If you want me to make this a separate post, just let me know. Thanks


Nice solution to the path issue! I also like the idea of creating an "archive" location for past reports and zipping them up to save HDD space. Since I now know there is at least some interest in keeping this going, what do you think of bringing this to Phase Two: a GUI? I've been meaning to get going on it for quite a while, but have had no time...


It certainly sounds good. I'm not sure it'll be worth the time required of us as end-users, though... If it were built into the application itself, it'd provide more utility to the entire userbase; but as end-users writing an accessory utility, only a small portion of the overall userbase will end up using it. And, being honest, I'm sure not everyone is even as paranoid about all of this as we are.

The only UI work I've done is in AutoIT, but from my limited experience with it, doing those first two tabs would be a lot of work. UIs, at least in AutoIT, are kind of a pain. Or, maybe I'm just less comfortable/experienced with it. Opening the CSVs directly and filtering on the headers with a spreadsheet program are probably things that anyone who would be using an accessory utility like this would be comfortable doing. I could be wrong though...

The third tab (and a UI for an installer which would put the file in a fixed path, run the task scheduler command, etc) would probably bring more utility to people for the time invested, though - I know some won't feel comfortable editing a script file, even to just modify a few variables. Even that would still probably be a fair amount of work, though... I work with powershell, regex stuff, etc a fair amount, so that was pretty easy. I'm not comfortable with trying to tackle a UI for it personally. If you feel inspired, though, then go for it!

The idea about retention is a good one - I think I'll update the script to add that as a variable and a routine to clean up old files.

Edit - I added a day-based retention setting and added the updated "V1.1" attachment to the post above.


I changed a couple of paths so I could put it on my D: drive, and I've created a separate directory for the logs.

I initiated it from within Task Scheduler and it's off to the races.

Now it has 40+TB to run through (a duplicated pool), which is balancing at present, so it will be interesting to see if I get files in more than one location in the log.

Also, will I get two instances at 1:00, as I guess it will not have finished by then?

A couple of thoughts:

1. Yes, a UI would be nice, but not essential.

2. More pressing with large pools is the size of the file it's going to create. I'm thinking of putting the info into a free DB program, e.g. SQLite, or something more powerful (depending on performance), rather than using Excel.

3. Some sort of summary file at the end, or one that's generated from the DB.

4. I've not looked at what's required, but there is an API for writing plugins for DrivePool; it could perhaps be integrated that way?

Will report back when the machine breaks or finishes.

I have experience with the above; it's not a big task, just needs a bit of time to code it up.


Also, will I get two instances at 1:00, as I guess it will not have finished by then?

Since you initiated it via Task Scheduler, it shouldn't run a second time as long as the first instance is still running.

I've done some work with SQLite via PowerShell before. It definitely allows for more flexibility and performance, and I'm comfortable running SQL queries, but without a UI, I think not being able to just open a CSV and filter it in Excel would limit the usefulness for a lot of people... It certainly might be a good alternate "mode" to have, though. Worst case, there's probably an easy way to convert the CSV over to a SQLite DB with one table.
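For the record, here's a rough sketch of that CSV-to-SQLite conversion using the community PSSQLite module (Install-Module PSSQLite); the table and column names are assumptions based on the catalog script:

```powershell
Import-Module PSSQLite
$db = 'F:\DPFileList.sqlite'   # example path

# One table mirroring the CSV columns.
Invoke-SqliteQuery -DataSource $db -Query `
    'CREATE TABLE IF NOT EXISTS files (Drive TEXT, FileName TEXT, DirectoryName TEXT)'

# Insert each CSV row (slow but simple; fine for a one-off conversion).
Import-Csv F:\DPFileList.csv | ForEach-Object {
    Invoke-SqliteQuery -DataSource $db `
        -Query 'INSERT INTO files VALUES (@Drive, @FileName, @DirectoryName)' `
        -SqlParameters @{ Drive = $_.Drive; FileName = $_.FileName; DirectoryName = $_.DirectoryName }
}
```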

It will definitely be interesting to see how this turns out for you, since it seems like you have an extremely large pool. If you get the resulting CSV open in excel when it's done, please jump to the end and let us know the number of rows, and how long it took to produce it (the difference between the timestamp on the "DPAuditLog-Current.log" and "DPAuditLog-Current.csv") along with the CSV file size (and the .ZIP file size). Thanks!

Well, it's generated a 150MB file in 20 minutes, so not huge, but it has a million or more files to run through.

Did your tests have the full UNC path including the volume name? As I don't have my drives lettered, since I have more than 26.

@dj80 also, will the CSV have the UNC path, or will it drop that to Device 1, 2, 3, etc.?

It doesn't rely on having any of the drives mounted. The path it records in the CSV is relative to the base of the pool, like: "BACKUPS\365\CLIENT\data\filename.file"

It then records, per line, the disk number (what you see in disk management), along with the disk model and serial number (to help physically identify it) and the file size in bytes.

...I guess one way of cutting down on the size would be to just record the disk number, and then record the associated disk models and serial numbers for each number as a separate file that is zipped up together with the CSV.


The script takes the trailing "[Device XX]" from the dpcmd output, matches it up to the "DeviceId" property from "Get-PhysicalDisk" in PowerShell, and then pulls the disk's serial number and model from other corresponding properties. So those are added to the CSV (which the script produces once the dpcmd output is finished being written to the .LOG file), rather than being logged directly by dpcmd.
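A minimal sketch of that matching; the dpcmd log line format here is a guess, so treat the regex as an assumption:

```powershell
# Build a DeviceId -> model/serial lookup from Get-PhysicalDisk.
$disks = @{}
Get-PhysicalDisk | ForEach-Object {
    $disks[[string]$_.DeviceId] = ('{0} / SN {1}' -f $_.Model, $_.SerialNumber)
}

# Hypothetical dpcmd log line -- the real format may differ.
$line = '\Movies\example.mkv [Device 3]'
if ($line -match '^(?<path>.+?)\s+\[Device\s+(?<id>\d+)\]$') {
    $path = $Matches['path']          # pool-relative file path
    $disk = $disks[$Matches['id']]    # e.g. model / serial for device 3
}
```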

You'll know the dpcmd portion is finished when you don't see it running any longer in task manager. At that point, you could stop the task, and you should have an incomplete .CSV to examine (if you don't want to wait until it's completely finished to check the format/etc).

I think PowerShell is not liking the task you have set it, though. It keeps climbing to 1.5GB of memory, falling back to 200MB, and climbing again to 1.5GB; I've watched it do this several times already. It's only taking about 6% of CPU though, so I will leave it for a while to see if it's actually doing anything.


Well, I got bored waiting and killed the PowerShell process; it was just cycling between high and low memory every 20 seconds or so.

I guess it was doing something, but very slowly.

Could you change your script to write as it goes, for testing, so I can see if it's doing what we expect?

Odd. I watched for any memory issues on my end, but it only seemed to go up very slightly as the array in memory grew. Maybe PowerShell is handling memory allocation differently between our PowerShell versions.

Anyway, I switched it over to use a StreamWriter method instead. I guess I should have done that to begin with. Now that I'm avoiding constructing objects for each line in an array, it's gone from 50 lines per second to over 700, and there's no memory buildup either. Plus, the .CSV file can now be read while it's being written.
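For anyone curious, the difference is roughly this: appending with `+=` rebuilds the array on every file, while a StreamWriter appends each line straight to disk. A minimal sketch (file paths mirror the original script; `::new` requires PowerShell 5+, use New-Object on older versions):

```powershell
$files = Get-ChildItem -Path C:\DrivePool -Recurse -Force |
    Where-Object { !$_.PsIsContainer }

# Write each row as it's produced instead of accumulating an array.
$writer = [System.IO.StreamWriter]::new('F:\DPFileList.csv')
try {
    $writer.WriteLine('"FileName","DirectoryName"')
    foreach ($file in $files) {
        $writer.WriteLine('"{0}","{1}"' -f $file.Name, $file.DirectoryName)
    }
}
finally {
    $writer.Close()
}
```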


1. It only lists one file location per file; it should show two, as I have 2x duplication (and a few files have 3x).

The idea is that you'd sort by the path column in Excel or another spreadsheet program, and then you'd see all the disks that a particular file resides on. I did test this on a pool without duplication, though... I just assumed the dpcmd output was going to work like that, since it listed a specific disk associated with each file, line by line. Did you sort by that column in Excel and still only see one copy of each file listed?

Edit: Oh! My bad... I just looked at the example output you posted earlier and saw how it's handling duplication. I'll have to revise the script. Thanks for catching that.

2. It does not list directory locations, which would be useful if you want to find all your files and copy them to a new disk, etc.

You mean that the path doesn't start with the drive letter?

3. Zip does not work; it produces an empty file.

I thought this might not work for some... Can you let me know:

* What version of Windows you're using

* What version of PowerShell you're using (paste $PSVersionTable.PSVersion in a PowerShell window)


Okay, uploaded a new V1.3 that fixes the bug where it wasn't recording all the duplicated entries for each file. In my test pool, I've now got some 2x-duplicated files (one spare copy per file), and those are all being logged properly. If anyone has a pool with 3x or more, I'd appreciate feedback; it should scale, but I couldn't test it. Thanks!


Aaaand added V1.4 just now. I realized it wasn't properly zipping files on my side either; it was due to something I added after I originally tested the zip functionality. That has been fixed, and it's zipping files again for me. Let me know if it doesn't work for you with V1.4. Thanks!