Topic: Utility to allow keeping only the newest copy of several files on a drive

This is a little difficult to explain, and currently I get it done using several different programs. On a number of old backup drives, I have multiple daily backups of entire directories of program files and folders, along with the data used by them. What I need is a way to extract all of the files of certain types and, of those files, remove all but one (the newest). Everything else can be deleted, leaving me with a folder holding a single copy of each file, that copy being the newest version.

The drives are all full (the smallest is 500GB, the largest 1TB).

The files to be kept have specific extensions like .doc, .docx, .pdf, etc. At this point, all of them will be document types, though the use of audio files is in the works. I have been doing this in steps, one extension at a time: for example, find all the .doc files and move them out to another folder. The second step is removing the duplicates created when I do this, as each file was backed up once each day, sometimes for months, making 60 copies of that file. Because the backups were done on each directory at different times, I have to deal with each drive as a whole and cannot just find the newest copy of a single directory. Even if I could, I still need to be sure I only keep the last version of each file, and this could end up being on a different drive.

In theory, for the reason they are kept, I do not need to keep the actual "path to the file", as in: C:\a\b\c\d\filename. This same path, including the filename, exists once in every backup. But having that information could be of some use one day, as it is possible that a given file could have been used in one project and then restarted in another. The project names and other information are in the path. But I was only asked to worry about the files themselves. I have tried various duplicate removers, each offering some advantage, but nothing I have found can do the whole thing in one step. To make it even harder, each path would probably have 15 or more files in it, and keeping the full path name attached to all the files would also be wasteful and cumbersome. The ideal would be to end up with the latest version of every file that exists in each path kept and all others discarded.

C:\a\b\c\d would end up with 1.doc, 2.pdf, 4.txt, 5.docx (example only; most of the files are PDFs). That would preserve the path, giving the logic of why the file was there to start with. As it is, that same path including all those files exists multiple times, and in most cases the files don't even change. In some cases they do, though, or I would sort the whole mess by "date of path", keep the newest version of the data directory and be done with it. However, doing it that way would also end up omitting a lot of documents that were deleted during the term of the project, and they want to keep all that were ever in each one, even if it was deleted during the term of the project.

As I said, the path is something I think they will one day wish they had kept, but all I was asked to do is keep all the documents, just one big pile of them.

Thanks for any ideas. There are at least 20 more of these drives I have to reduce to the newest single copies of stored documents only. The rest all gets deleted and the drives reused. I am probably approaching this with tunnel vision and there must be an easier way.
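The reduction described above (filter by extension, keep only the newest copy of each file, preserve the path) could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a tested tool: it assumes each backup sits in its own root folder containing the same directory tree, that "newest" means the latest modification time, and that paths compare case-insensitively as on Windows. The names `KEEP_EXTENSIONS`, `find_newest` and `copy_newest` are made up for illustration.

```python
import os
import shutil

# Assumption: the document types of interest; extend as needed.
KEEP_EXTENSIONS = {".doc", ".docx", ".pdf"}


def find_newest(backup_roots, extensions=KEEP_EXTENSIONS):
    """Map each path (relative to its backup root) to its newest copy."""
    newest = {}  # lowercased relative path -> (mtime, full path, relative path)
    for root in backup_roots:
        for dirpath, _dirs, filenames in os.walk(root):
            for name in filenames:
                if os.path.splitext(name)[1].lower() not in extensions:
                    continue  # not one of the wanted file types
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                mtime = os.path.getmtime(full)
                key = rel.lower()
                # Keep this copy only if it is newer than the one seen so far.
                if key not in newest or mtime > newest[key][0]:
                    newest[key] = (mtime, full, rel)
    return newest


def copy_newest(newest, destination):
    """Copy every selected file to destination, recreating its folder path."""
    for _mtime, full, rel in newest.values():
        target = os.path.join(destination, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(full, target)  # copy2 also preserves the timestamps
```

Because a file only needs to appear in one backup to be picked up, documents that were deleted partway through a project are still kept. Copying to a separate destination rather than deleting in place leaves the backup drives untouched until the result has been checked.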

Sounds like it would be easy on VMS. A DEC guy wrote some Windows command line utilities to simulate some of the VMS file features like copy with version number stuck on. For example, the VMS command purge /keep=1

Sounds like an interesting document control problem. I'm not sure I understand it very well though. The sorts of things I would examine would be pretty basic, before suggesting a solution. For example, just trying to find out some of the definitive constraints and limitations to what is known about the documents stored. Assuming this is a Windows OS:

(a) Have the files for any given project all been named according to a strict and consistent file-naming convention/rule, and if so, then what is that convention/rule? What exceptions might there be to this rule?

(b) Have the file meta-data/properties fields been used consistently to include uniquely distinguishing project meta-data? What exceptions might there be to this?

(c) Are the meta data (properties) and contents of the production and/or backup files being Indexed by WDS? (Has WDS been installed with all the necessary iFilters to enable this?) What exceptions might there be to this?

(d) Is the Date Modified field of any given set of latest file + its backup files, for any given filename, the only differentiator between the files which can be used to detect which is the latest? What exceptions might there be to this?

(e) Would you be able to connect all or several of the associated operational and backup drives to one computer, and search across them simultaneously for specific document types and dates using something like (say) Everything?

(f) Do you maintain/update a catalogue of every disk (operational and backup), and would you be able to concatenate the catalogues into one database for search and analysis?

(g) Was any versioning control method or added version-related meta-data applied to the documents, by the Windows OS or backup programs? What exceptions might there be to this?
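Questions (d) and (f) could be explored with a small catalogue before anything is deleted. The following is only a sketch, assuming Python with its built-in sqlite3 module, one root folder per backup, and file size as a cheap stand-in for "different content". The table layout and the names `build_catalogue` and `ambiguous_newest` are hypothetical.

```python
import os
import sqlite3


def build_catalogue(db_path, roots):
    """Record relative path, size and modification time of every file under roots."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(root TEXT, rel_path TEXT, size INTEGER, mtime REAL)"
    )
    for root in roots:
        for dirpath, _dirs, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                st = os.stat(full)
                con.execute(
                    "INSERT INTO files VALUES (?, ?, ?, ?)",
                    (root, os.path.relpath(full, root).lower(),
                     st.st_size, st.st_mtime),
                )
    con.commit()
    return con


def ambiguous_newest(con):
    """List paths whose newest copies share a timestamp but differ in size."""
    return con.execute(
        """
        SELECT f.rel_path, COUNT(DISTINCT f.size)
        FROM files AS f
        JOIN (SELECT rel_path, MAX(mtime) AS newest
              FROM files GROUP BY rel_path) AS m
          ON f.rel_path = m.rel_path AND f.mtime = m.newest
        GROUP BY f.rel_path
        HAVING COUNT(DISTINCT f.size) > 1
        """
    ).fetchall()
```

If `ambiguous_newest` returns nothing, the modification date may well be a sufficient differentiator. Any rows it does return are the exceptions worth inspecting by hand.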