I am trying to speed up my manual monthly archiving work by automating some of the steps. I am on Windows 8.1 Home, 64-bit.

[0) I am using SyncBackPro V8 to periodically move files (zip, rar, pdf, jpg and png) to a folder named "folder1". When this SyncBackPro profile has run, it can run a program after the above files have been moved ("Run after profile"; see image). I have discovered that I can run a .bat file which itself runs several .bat files (https://stackoverflo...es-within-a-bat-file).]

3) Reduce file paths to fewer than 260 characters (because some old programs can't open file paths longer than 260 characters later on). I do this manually with "Path Scanner" (http://www.parhelia-...canner/Download.aspx)

4) Delete PDF files smaller than 2 KB (because they are garbage). (I do this manually with the freeware Everything, sorting the PDF files by size.)
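For anyone following along, step 4 could be sketched roughly like this. It is in Python rather than the PowerShell used later in this thread, and the function names, the 2048-byte cutoff and the dry-run default are my own assumptions, not part of anyone's script here.

```python
from pathlib import Path


def find_small_pdfs(root, min_bytes=2048):
    """Return the PDF files under root (recursively) smaller than min_bytes."""
    return sorted(
        p for p in Path(root).rglob("*")
        # is_file() skips folders that merely happen to be named "something.pdf"
        if p.is_file() and p.suffix.lower() == ".pdf" and p.stat().st_size < min_bytes
    )


def delete_small_pdfs(root, min_bytes=2048, dry_run=True):
    """Return the small PDFs; delete them only when dry_run is False."""
    small = find_small_pdfs(root, min_bytes)
    if not dry_run:
        for pdf in small:
            pdf.unlink()
    return small
```

Running it once with the default dry_run=True only lists the candidates, which makes it easy to check the list before anything is deleted.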

Then I use ABBYY FineReader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically OCRs "folder2" every day. (Be careful if you follow this process, as it sometimes deletes PDF files without warning! So use the option to keep the original files in a separate folder! Then, at the end of the month, I manually use the freeware "pathsync" (https://www.cockos.com/pathsync/) to check for differences and find those FineReader bugs!)

Why is there a limit? I thought FineReader was just local Windows software. Why would there be a monthly limit? Is there a subscription that goes along with it? I don't remember this at all with FineReader, but I haven't looked at it for maybe 10 years.

I'm thinking of getting X1 Search or something similar to easily search through documents with full fidelity. I have Archivarius right now, but I wish the output were a little fancier than plain text.

3) Reduce file paths to fewer than 260 characters (because some old programs can't open file paths longer than 260 characters later on). I do this manually with "Path Scanner" (http://www.parhelia-...canner/Download.aspx)

How would you determine what the shortened name should be? Truncate, remove every second character, etc.?

It would be a simple matter of parsing the output files and issuing the appropriate Move-Item command. I'm assuming you also want to keep the existing folder structure?

Then I use ABBYY FineReader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically OCRs "folder2" every day. (Be careful if you follow this process, as it sometimes deletes PDF files without warning! So use the option to keep the original files in a separate folder!)

It would probably be easier to just clone folder2 to folder2-orig and then let it run on folder2, letting it delete the originals.

I'll look at patching something together, though it might be a few days. Also, I don't have ABBYY FineReader, so can you give me the various command-line options (both for keeping the originals and for not keeping them)?

Truncating is fine as long as it keeps the file extension. For instance: if a filename is longer than 200 characters (let's say my folder and subfolder paths are always under 60 characters), truncate the end of the filename so that it is under 200 characters and keep the extension.
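To make that rule concrete, here is a minimal sketch in Python (not the PowerShell script discussed in this thread). The function name and the max_len=199 default (my reading of "less than 200 characters") are assumptions.

```python
from pathlib import PurePath


def truncate_filename(name, max_len=199):
    """Shorten name to at most max_len characters while keeping its extension."""
    if len(name) <= max_len:
        return name
    suffix = PurePath(name).suffix            # e.g. ".pdf", or "" if there is none
    stem = name[: len(name) - len(suffix)]    # everything before the extension
    keep = max_len - len(suffix)              # characters left for the stem
    if keep < 1:
        raise ValueError("max_len is too small to keep the extension")
    return stem[:keep] + suffix
```

For example, a 300-character name ending in ".pdf" comes back as 195 characters of the original name followed by ".pdf".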

- It doesn't 'unzip' rar files (it works fine for zip files).
- If possible, do not delete the original zip or rar file if there is an error while unzipping.
- It does not find zip files inside zip files (even if I run it twice). Maybe that is because it doesn't look for zip and rar files inside subfolders?

>PDFTextChecker.exe. This version uses the pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters. If any are found, it considers that searchable.

[The following is very optional: from experience, maybe add a step that raises the threshold from "any alphanumeric characters" to a minimum of 1,000 (?) alphanumeric characters. (In the past I stumbled on some strange PDF files with more than 0 but fewer than 10 characters: the first page was OCRed and all the other pages were not.) Anyway, nothing is perfect for every kind of PDF file!]
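A rough sketch of that optional threshold, in Python rather than PowerShell. It assumes the text has already been extracted (for example by pdftotext.exe) and read into a string; the function name and the 1,000 default are placeholders taken from the suggestion above, not from PDFTextChecker itself.

```python
def looks_searchable(extracted_text, min_alnum=1000):
    """Return True if the extracted text has at least min_alnum alphanumeric characters."""
    count = sum(ch.isalnum() for ch in extracted_text)
    return count >= min_alnum
```

With this check, a file whose only extracted text is a short page footer would be treated as not searchable, while min_alnum=1 reproduces the "any alphanumeric characters" behaviour.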

I realize that I also remove some strange characters from the PDF filenames, such as accents, °, !, +, &, etc., with the freeware "Bulk Rename Utility", as PDFTextChecker can't check files that have them. I don't know if it is possible to do this in PowerShell?

[Just for the record, I actually do other steps manually:

- I run PDFInfoGUI (https://www.dcmember...download/pdfinfogui/), import "!Not_Searchable.txt" and sort by the "Encrypted" column.
- I copy/paste into Excel the PDF files with neither "yes" nor "no" in the "Encrypted" column. Then I run an Excel macro to delete those files, as they are 'buggy' and can't be opened by SumatraPDF.
- I copy/paste into Excel the PDF files with "no" in the "Encrypted" column, and I replace the path of the "!Not_Searchable.txt" file. Then I use this https://www.donation....msg330784#msg330784 to move the files, and then I do the OCR with FineReader.
- I ignore the PDF files with "yes" in the "Encrypted" column. In the past I tried a "Pdf Password Remover" with CLI commands. It worked well for most of the files but destroyed a few (maybe 2%?). I also realized that, once decrypted, most of the PDF files were already OCRed! So I decided that it was one step too many!]

For FineReader, I am sorry I was not clear enough. There is no need to send it a CLI command, as I use its "Hot Folder" feature, which lets it run automatically every day, and that is fine for me. (For the record: in FineReader I have 3 folders: folder_in (original PDF files), folder_moved (once FineReader has OCRed a file, it moves it from folder_in to folder_moved) and folder_out (the OCRed output). Then in pathsync I use the following settings: see image. After writing this I realize that I forgot one more step, as it has already happened to me in the past: I need to check whether FineReader deletes a few PDF files without moving them to folder_moved or simply leaving them in folder_in!)
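The kind of check pathsync is used for here could be approximated with a plain folder comparison. This is only a sketch in Python: the function names are made up, it compares relative file names only (not contents), and it ignores case because Windows does. Which two folders to compare is left to the reader.

```python
from pathlib import Path


def relative_files(root):
    """Return the lower-cased relative paths of all files under root."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix().lower()
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_trees(left, right):
    """Return (files only under left, files only under right) as sorted lists."""
    a, b = relative_files(left), relative_files(right)
    return sorted(a - b), sorted(b - a)
```

Anything that shows up in the first list exists in the left folder but has no counterpart with the same relative name in the right one, which is where a silently deleted file would appear.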

Strange, it works fine with zip and rar files here. Maybe a later version of the 7z libraries is required for them, or just switch to calling 7zsa.exe instead.

Do you have a rar file you can let me play with?

- If possible do not delete the original zip or rar file if there is an error unzipping.

OK, have to see if there is an error code returned.

- It does not find zip files inside zip files (even if I run it twice). Maybe that is because it doesn't look for zip and rar files inside subfolders?

That's because I missed the word 'recursively' in your OP.

Guess it'd have to keep looping until there were no archives left or something ... have to think about it.
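One way the "keep looping until there are no archives left" idea might look, sketched in Python instead of PowerShell. It only handles .zip, because Python's standard library has no rar support (the script in this thread relies on 7-Zip for that). The function name, and extracting each archive into a folder named after it, are my assumptions. An archive is deleted only after it extracts cleanly; failed ones are kept and skipped on later passes so the loop still ends.

```python
import zipfile
from pathlib import Path


def extract_all_zips(root):
    """Extract .zip files under root until none are left; return the archives that failed."""
    root = Path(root)
    failed = set()
    while True:
        # Rescan on every pass so archives unpacked from other archives are found too.
        pending = [
            p for p in root.rglob("*")
            if p.is_file() and p.suffix.lower() == ".zip" and p not in failed
        ]
        if not pending:
            return sorted(failed)
        for archive in pending:
            target = archive.with_suffix("")  # folder1\a.zip -> folder1\a
            try:
                with zipfile.ZipFile(archive) as zf:
                    zf.extractall(target)
            except (zipfile.BadZipFile, OSError, RuntimeError, NotImplementedError):
                failed.add(archive)           # keep the original archive on error
            else:
                archive.unlink()              # delete only after a clean extraction
```

A failed extraction can leave a partly filled target folder behind, so the returned list is worth checking by hand.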

>PDFTextChecker.exe. This version uses the pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters. If any are found, it considers that searchable.

Yeah, I created an image PDF for a test. After running pdftotext.exe on it there was a much smaller text file with 'Page 1/1' in it. So might end up thinking image PDFs are text PDFs due to headers/footers in the PDF.

If there's not likely to be headers/footers then it'd be easy enough to check and act on the result.

I realize that I also remove some strange characters from the PDF filenames, such as accents, °, !, +, &, etc., with the freeware "Bulk Rename Utility", as PDFTextChecker can't check files that have them. I don't know if it is possible to do this in PowerShell?

Should be easy enough by removing any characters not in the old ASCII table.
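As an illustration of the "drop anything outside ASCII" idea, here is a small Python sketch (the actual script in this thread is PowerShell and may clean names differently). It first decomposes accented letters so that "ç" becomes a plain "c" instead of disappearing, then drops whatever is still non-ASCII. The function name and the extra_remove parameter are assumptions; the default "!+&" just mirrors the characters mentioned above.

```python
import unicodedata


def to_ascii_name(name, extra_remove="!+&"):
    """Strip accents, other non-ASCII characters and the characters in extra_remove."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(
        ch for ch in decomposed
        if ord(ch) < 128 and ch not in extra_remove
    )
```

Symbols with no ASCII base letter, such as "°", are simply removed, so two different names can end up identical after cleaning.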

BTW, just as a matter of interest, are any files other than PDFs required?

ie. Extract archives then delete anything that's too small or not a PDF.

I meant if the path section of the full path to the file is longer than 260 characters, eg. "R:\a path\that is really\really long, like\over 260 characters\file.pdf"

Thanks for the explanation. I was only thinking about "filename_with_more_than_260_characters" and not "folder_name_path_with_more_than_260_characters"! I didn't realize. This is more difficult indeed! Let's keep it easy: either forget about this step, or just truncate "filename_with_more_than_260_characters", and I will manually check for long folder paths at the end of the month, just in case. Sorry about the trouble.

Well there you go - 7zip can't open that archive, was it created using some strange options?

Many thanks for your detailed explanations. Your last code update is working great for 'unzipping' zip and rar files.

(A simple note for those following: I downloaded this 7-Zip version (https://www.7-zip.org/a/7z1900-x64.exe) and extracted it to "C:\Program Files\7-Zip". Then I copied "7z.exe" and "7z.dll" into my working directory.)

Still doesn't do extracted archives ... still thinking about it

I thought that if I ran it twice it would find them (example: folder1\6\6\example.zip) during the next run (which would be fine), but no.

Checks for text/image based PDFs and moves them into different folders (recreating the folder tree). All it does is count the number of text lines in the first 5 pages (or fewer if there are fewer than 5 in total), and if the number is greater than a set threshold it regards the file as a text-based PDF.

New-Item : The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters.

ADDENDUM: Give the update a try, it should remove any special characters from file names and in theory it'll shorten filenames if it's the length of the filename that pushes the total path length over the 259 character limit.

If, however, it's the path that exceeds the 247 character limit then the file won't be touched, it'll remain in the initial folder ... so goes the theory

ie.
- If the filename has diacritics and various other strange characters, they'll be removed. (At this point no rename happens.) An example:
  This: François-Xavier!!#@$%^&()_+}{ €$¥£¢ ^$.+()[{ 0123456789.pdf
  will turn into this: FrancoisXavier()_ Yc () 0123456789.pdf
- If the folder path is less than 248 characters and the full path is less than 259 characters, the file will be renamed.
- If the folder path is less than 248 characters and the full path is greater than 259 characters, the new filename will be truncated and then the file is renamed.
- If the folder path is greater than 247 characters, nothing happens - the file isn't renamed, it will remain in the initial folder.
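Those rules could be written down roughly as follows. This is a Python sketch of the decision only, not the actual PowerShell code; plan_rename and the two constants are hypothetical names, and treating a full path of exactly 259 characters as acceptable is my assumption, since the rules above don't say. ntpath is used so the paths are joined Windows-style on any system.

```python
import ntpath

MAX_DIR = 247    # longest folder path that will still be processed
MAX_FULL = 259   # longest full path (folder + separator + filename)


def plan_rename(folder, clean_name):
    """Return the filename to rename to, or None if the file should be left alone."""
    if len(folder) > MAX_DIR:
        return None                          # folder path too long: don't touch the file
    full = ntpath.join(folder, clean_name)
    if len(full) <= MAX_FULL:
        return clean_name                    # fits as it is
    stem, ext = ntpath.splitext(clean_name)
    overflow = len(full) - MAX_FULL
    if overflow >= len(stem):
        return None                          # nothing sensible left after truncating
    return stem[: len(stem) - overflow] + ext  # shorten the stem, keep the extension
```

The clean_name argument is meant to be the filename after the special characters have been removed, so the length check runs on the name the file would actually get.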

I might have to tweak the Get-ChildItem statement in the Check-PDF routine to ignore file paths greater than 259 characters, see how you go.

Currently doesn't check for the existence of a file with the same name before renaming.

Not sure whether you mean folder1 should be completely empty at the end or not since files that aren't PDF will be remaining - ie. delete everything in folder1 after running the Check-PDF function.

In which case, what happens to the non-PDF files that were in the folders? Delete or move with the PDFs?

Keeping (like you have already done) what is left inside folder1 is fine.

I did some tests and here is a zip file with examples. The main bug (see "2-itextsharp.pdfa") is that when a folder inside folder1 is named "bla.pdf", it causes a "stackoverflow" error in PowerShell (it closes PowerShell and restarts it when I run it in edit mode). The other thing is strange filenames (maybe some Asian text?). Edit 1: it is because on my computer the folder path is too long for some of those! Otherwise it renames them fine!

The other issue is detecting invalid PDF files (see "1-malformed_pdf"; from what I understand there are at least 2 problems: 1) invalid PDF file and 2) the image file format has not been recognized). From experience it is a complex problem, and I think it is better if I do it by hand with PDFInfoGUI (*) and remove them with an Excel macro, as I can see very quickly whether I need to download some important files again. So please forget about those.
(*) Neither "yes" nor "no" in the "Encrypted" column and the other columns. Then I copy the list, except the important ones.
https://filebin.net/...tests.zip?t=6qfxn4l3

Also, many thanks for the detailed explanation of long names; I appreciate it. Your current filename-truncation code is already perfectly fine for me. Thanks.

Currently doesn't check for the existence of a file with the same name before renaming.

If I understand correctly, it is because "folder1\venise-.pdf" would end up with the same name as a file already in folder3, "folder3\venise.pdf". Renaming the new one (for instance with a counter, "venise1.pdf") would be fine.
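The counter idea might be sketched like this (Python, hypothetical function name; the real script would do the equivalent in PowerShell before its rename or move):

```python
from pathlib import Path


def unique_destination(dest):
    """Return dest if it is free, otherwise the first free name with a counter appended."""
    dest = Path(dest)
    if not dest.exists():
        return dest
    counter = 1
    while True:
        # venise.pdf -> venise1.pdf, venise2.pdf, ...
        candidate = dest.with_name(f"{dest.stem}{counter}{dest.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1
```

The check and the later move are two separate steps, so this is only safe when nothing else writes to the destination folder at the same time.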

I have added a small function to delete empty folders in folder3.
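The function itself is not shown here, so the following is not it, just a guess at the general shape in Python. It walks the tree bottom-up so that a folder which becomes empty once its empty subfolders are gone is removed as well; the top folder itself is kept. The name remove_empty_dirs is an assumption.

```python
import os


def remove_empty_dirs(root):
    """Delete every empty folder below root (not root itself); return the removed paths."""
    top = os.fspath(root)
    removed = []
    # topdown=False visits the deepest folders first.
    for dirpath, _dirnames, _filenames in os.walk(top, topdown=False):
        if dirpath == top:
            continue
        # Re-list the folder now, since its subfolders may just have been deleted.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```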

So I have copied "itextsharp.dll" to C:\Windows\System32\WindowsPowerShell\v1.0\itextsharp.dll. That may explain why, if I try to use 4wd's code on another hard drive (example: L:\), even if I copy the 7z.dll, 7z.exe and itextsharp.dll files and adapt the code for the new locations, the script doesn't show errors but fails to run the Check-PDF part properly: it moves some image-based PDFs to folder2 instead of folder3. So I stay on C:\