Saturday, April 23, 2011

I had a large number of emails in Thunderbird (an email program like Outlook, but open source freeware). I wanted to export each of those emails to its own distinct PDF file with a filename containing Date, Time, Sender, Recipient, and Subject information in this format:

2011-03-20 14.23 Email from Me to John Doe re Tomorrow.pdf

In that example, I might ultimately have eliminated the "from Me" part as understood, but of course other emails would be from John back to me, so for starting purposes I wanted all five of the fields just listed. The steps I went through are described below. There is a summary at the end of this post.
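For readers who would rather script the naming than build it in a spreadsheet, the target format can be sketched in a few lines of Python. This is only an illustration: the function name build_pdf_name and its parameters are hypothetical, and it assumes the sender, recipient, and subject have already been cleaned of characters that Windows does not allow in filenames.

```python
from datetime import datetime


def build_pdf_name(sent, sender, recipient, subject):
    """Return a name like '2011-03-20 14.23 Email from Me to John Doe re Tomorrow.pdf'.

    Assumes the text fields contain no characters that are illegal in
    Windows filenames (such as : ? * or quotation marks).
    """
    # Date as YYYY-MM-DD, time as HH.MM (a period, because a colon is not allowed).
    stamp = sent.strftime("%Y-%m-%d %H.%M")
    return f"{stamp} Email from {sender} to {recipient} re {subject}.pdf"
```

Putting the date first in year-month-day order means an alphabetical sort of the folder is also a chronological sort.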

So far, I had already worked through the process of exporting those emails to distinct EML files. I had also used a spreadsheet to rename those EML files so that they would provide clearer and more complete information about each file's contents. (I was using Excel 2003 for spreadsheeting. OpenOffice Calc was now able to handle a million rows (i.e., to rename a million files), but it had not been stable for me. One option, for those who had more than about 65,000 EMLs and therefore couldn't work within Excel 2003's 65,536-row limit, was to do part of the list at a time.) This post picks up from there, summarizing a more streamlined approach to the steps described at greater length in the two previous posts linked to in this paragraph.

I had previously tried to begin with the Index.csv file exported from Thunderbird via ImportExportTools, but that had been a very convoluted and unsatisfactory process. I did continue to use Index.csv, but my main effort was to work up a spreadsheet that would use and alter the filenames created when I exported EMLs from T-bird, also using ImportExportTools. As described previously, I had developed some rules for automated cleaning of various debris from filenames, such as the underscores that ImportExportTools inserted in place of quotation marks and other characters.

To summarize the approach described in more detail in the previous post, I got the filenames from the folder where ImportExportTools had put them by using this command at the CMD prompt (Start > Run > cmd): "DIR /b > dirlist.txt" and then I copied and pasted the contents of dirlist.txt into an Excel spreadsheet. There I extracted the Date, Sender, and Subject fields from those filenames using Excel functions, including FIND, MID, TRIM, and LEN, all described in Excel's Help feature and in the previous post. I also used Excel in a separate worksheet to massage the data on the individual emails as provided in Index.csv.
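The directory-listing step could also be done outside CMD. The following is a rough Python equivalent of "DIR /b > dirlist.txt", offered as a sketch rather than as what I actually ran. The function name write_dirlist is hypothetical. One deliberate difference: it leaves dirlist.txt itself out of the listing, whereas the CMD version lists its own output file.

```python
import os
from typing import List


def write_dirlist(folder, out_name="dirlist.txt"):
    # type: (str, str) -> List[str]
    """Write the bare names in `folder` to `out_name`, one per line, and return them."""
    # Skip the listing file itself so that re-running does not add a stray entry.
    names = sorted(n for n in os.listdir(folder) if n != out_name)
    with open(os.path.join(folder, out_name), "w", encoding="utf-8") as fh:
        fh.write("\n".join(names) + "\n")
    return names
```

The returned list can go straight into further processing; the text file is still written for pasting into a spreadsheet.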

The two worksheets did not produce the same information, and I needed them both. The one contained actual filenames, which I wanted to revise en masse to be more readable and to include the "To" field, which was contained in Index.csv. Many of the things that ImportExportTools screwed up about the subject fields of emails, for purposes of CMD-compatible filenames (and going well beyond that) involved the underscore character. Hence, the chief sections in the main worksheet (where I revised the data from dirlist.txt), going across the columns, were as follows:

That accounted for the bulk of the needed changes in the Subject field, in the files I was working with. I set these rules up to eliminate the first one, or in some instances two, occurrences of the underscore string in question. Few emails contained more than that; for those few, leaving the additional underscores in place was acceptable. There would be some predictable misfires of these rules, but they would generally improve the situation, and when dealing with a large number of EMLs that I didn't intend to rename manually, this was the best that I could hope for under the circumstances.
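The "first one or two occurrences" idea is easy to express in code. The sketch below is not a reconstruction of my spreadsheet rules; the function name strip_first and the sample debris string are hypothetical. It removes or replaces only the first few occurrences of a given string and then collapses extra spaces, much as Excel's TRIM does.

```python
def strip_first(text, debris, replacement=" ", max_hits=2):
    """Replace at most `max_hits` occurrences of `debris`, then tidy the spacing.

    Later occurrences are deliberately left alone, mirroring the
    'first one or two' rule described above.
    """
    cleaned = text.replace(debris, replacement, max_hits)
    # Collapse runs of whitespace and trim the ends, like Excel's TRIM.
    return " ".join(cleaned.split())
```

For example, strip_first("Re_ lunch_ plans_ today", "_", "") leaves the third underscore in place, which is the kind of acceptable leftover described above.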

Then I used VLOOKUP to search for a match with the Index.csv-style Date and Time (e.g., 19980102-0132) data in the Index.csv worksheet, and also for a match with the Index.csv Date+Time+From combination. (Sometimes the From field was necessary to distinguish two or more emails sent at the same time. Because of the underscores and other oddities about the EML filenames, subjects were too different to compare in most cases.) This identified precise matches between the two worksheets for about 80% of EMLs.
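The two-stage VLOOKUP can be pictured as a dictionary lookup. This is a minimal sketch under assumptions of my own: each row is a dict, the key names ("name", "stamp", "sender", "recipient") are hypothetical, and "stamp" holds the Index.csv-style value such as 19980102-0132. It tries the Date+Time match first and falls back on the From field only when two or more emails share a timestamp.

```python
from collections import defaultdict


def match_to_index(file_rows, index_rows):
    """Return ({filename: recipient}, [unmatched filenames])."""
    # Group the Index.csv rows by their date-time stamp.
    by_stamp = defaultdict(list)
    for row in index_rows:
        by_stamp[row["stamp"]].append(row)

    matched, unmatched = {}, []
    for f in file_rows:
        candidates = by_stamp.get(f["stamp"], [])
        if len(candidates) > 1:
            # Same timestamp more than once: use the sender to tell them apart.
            candidates = [c for c in candidates
                          if c["sender"].lower() == f["sender"].lower()]
        if len(candidates) == 1:
            matched[f["name"]] = candidates[0]["recipient"]
        else:
            unmatched.append(f["name"])
    return matched, unmatched
```

Anything that ends up in the unmatched list corresponds to the files I later had to name by date, time, and sender only.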

So now I was going to try using that same spreadsheet with another batch of emails exported from Thunderbird. I exported the Index.csv and the EMLs, and set to work on the spreadsheeting process of reconciling their names and producing MOVE commands for a CMD batch file that would automatically rename large numbers of EMLs to be readable and to include data from the To field.

This time around, I did a first pass to bulk-recognize and batch-rename that first 80% of the EMLs. The CMD command format was this:

MOVE /-y "Old Filename.eml" "Renamed\New Filename.eml" 2> errlog.txt

This renamed the old EMLs to the desired new EML filenames, put them into the Renamed subfolder, and gave me an error log to say what went wrong with any of the renames. The error log wasn't very useful, so I stopped creating it in these commands. What I had to do instead, to find out which EMLs had been successfully renamed, was to do a dirlist.txt for the Renamed folder, feed that back into the spreadsheet, and delete those lines that had executed successfully. For about 15% of the emails, I could not automatically detect matches between data from Index.csv and actual files, so I wound up naming those files according to date, time, and sender only. Finally, I got down to less than 1% of emails that I had to rename in a more manual fashion, mostly due to non-ASCII characters in their filenames. For that, I used Bulk Rename Utility.
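The generate-then-verify loop described above can be sketched as follows. The function names are hypothetical, and the sketch assumes the old-to-new name pairs are already worked out. It builds MOVE lines in the format shown earlier (without the error log, which I stopped using) and then compares the plan against a listing of the Renamed folder to see what still needs attention.

```python
def move_command(old_name, new_name, subfolder="Renamed"):
    """Build one line for the batch file, in the MOVE format used above."""
    return f'MOVE /-y "{old_name}" "{subfolder}\\{new_name}"'


def still_pending(planned, renamed_listing):
    """Return the part of `planned` ({old: new}) whose new name is not in the listing.

    Comparison ignores case, since Windows filenames are case-insensitive.
    """
    done = {n.lower() for n in renamed_listing}
    return {old: new for old, new in planned.items() if new.lower() not in done}
```

Feeding the Renamed folder's dirlist.txt into still_pending takes the place of pasting it back into the spreadsheet and deleting the rows that had worked.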

I was not sure whether this route wound up being better than the approach of using one of the shareware programs discussed in the previous post. I was not aware of the potential difficulties when I was looking at those programs, so for example I didn't try them out on emails with Chinese characters in their Subject fields. The other way always looks easier after a project like this. The approach I had taken had surely been more time-consuming than if I had known of a killer app that would do exactly what I wanted without unanticipated complications or failures. Absent a reliable, obvious solution at an affordable price, the main thing I could say at this point was that at least the conversion to EML was done.

Final Step: Converting EMLs into PDF

With EMLs thus exported from Thunderbird and mostly renamed to indicate date, time, sender, recipient, and subject, the remaining task was to convert the EMLs to PDF. This, it developed, might not be as simple as I had hoped. There was, first, the problem of finding a program that would do that. Some of the emails were simple text and could have been easily converted to TXT format just by changing their extensions from .eml to .txt. Acrobat and other PDF programs would readily print large numbers of text files, unlike EMLs. Other EMLs, however, contained HTML (e.g., different fonts, different colors of print, images). I wasn't sure what would happen if I changed their extensions and then printed. I noticed that the change to .txt caused the HTML codes to become visible in one message that I experimented with. When I converted that file to PDF using Acrobat, its header appeared in a relatively ugly form, but the colors and fonts seemed to be at least somewhat preserved. In another case, though, the PDF was largely a printout of code -- a truly undesirable replacement for what had been a pretty email with photos included. My version of Acrobat (ver. 8.2) did not provide any editable settings for conversion from text or HTML to PDF.

Thunderbird was my default program for displaying EMLs. I wondered if a different program could view them and would have better PDF printing capabilities, or if I should try converting them into another interim format in order to then convert them to PDF. A search led to the claim that Microsoft Word (or other programs) could display EMLs. I tried and found that this was essentially untrue: in Word, there was almost nothing left of that pretty email I had just tested. Converting EML to MSG seemed to be one option, but this looked like a dead end; that is, it didn't look like it would be any easier to PDF an MSG file than to PDF an EML. Getting the EMLs into Outlook wasn't likely to be a solution; as I recalled, my version of Outlook (2003) had been unable to batch print emails as individual PDFs. Marina Martin said that MBOX was the standard interoperable email file type. I could have exported from Thunderbird directly to MBOX using ImportExportTools, but I had not investigated that; I had assumed that MBOX meant one large file containing many emails, like PST, and I had wanted to rename my emails individually. Martin gave advice on using eml2mbox to convert EML to MBOX; hopefully I would not have lost anything by taking the route through EML format. But if MBOX was such a common format, there was surprisingly little interest in converting it to PDF. My search led to essentially nothing along those lines. Well, but couldn't Firefox or any other web browser read HTML emails? I tried; neither Firefox nor Internet Explorer was willing to open an EML. I renamed it to be an .html file. Both opened that, but here again the problem was that the header was so ugly and hard to read: it was just a paragraph-length jumble of text mixing up the generally important stuff (e.g., from, to) with technical information about the transmission. Even assuming I could work out a batch-PDF process for HTMLs, this was not the solution.
There were other possibilities, but in the end it did appear that I simply needed to buy an EML-to-PDF converter.

It tentatively appeared that MSGViewer Pro ($70) might be the most frequently downloaded program in this area, ahead of its own sister program PSTViewer Pro as well as Total Mail Converter. A search for reviews led to very little. It didn't appear that MSGViewer Pro had the ability to include image attachments within the PDF of an EML, as Total Mail Converter Pro ($100) supposedly did. On the other hand, MSGViewer Pro supposedly provided a free five-day trial. I decided that I did not have time to mess with endless numbers of attachments right now, and was therefore willing to just zip the EMLs into a single file for possible future processing, if I decided that there was sufficient need and time for that. Given my unlikelihood of using these programs very often, I also hoped that their prices would drop. I figured that if the MSGViewer Pro trial was fully functional, I might be able to take care of my need for it now, converting EMLs into PDFs without attachments, and otherwise let the matter sit for another year or more.

On that basis, I downloaded and installed MSGViewer Pro. It was apparently designed for an older version of Windows. When I installed it, I got one of those Win7 messages indicating that it might not have installed properly, and inviting me to reinstall using "recommended settings," whatever that meant. I accepted the offer. Once properly installed, I ran the program. A dialog came up saying, "Trial is not licensed for commercial use." I clicked "Run Trial." Right away, I found that its Refresh feature did not work: I copied some EMLs into a separate folder to experiment with, and could not get the program to find that folder. I killed the program and started over. Now it found the folder. I selected those messages, clicked the Export button, and told it to give the resulting PDF (one of the available output options; the others were TXT, JPG, BMP, PNG, TIFF, and GIF) the same names as the input files. It had a nice option, which I accepted, to copy failed messages to a separate folder. A dialog came up saying, "You can only export 50 emails in trial version of MsgViewer Pro." So that popped that fantasy. It ran pretty quickly and reported that all of the files had been successfully exported. Sadly, the results were no better-looking than I had been able to achieve on my own, with other measures described above. HTML codes were visible in some PDFs -- or perhaps I should say, not visible, but overwhelming: it looked like a piece of ordinary HTML coding. The typeface was tiny. Some lines were actually split down the middle horizontally, with the top half of a line of text appearing at the bottom of one page and the bottom half appearing at the top of the next page. In a word, the results were junk. I uninstalled MSGViewer Pro.

I decided to try Total Mail Converter Pro. No installation problems. When installation ended, the program started right up without giving me a choice. Then it decided I needed to log onto Gmail. This was not my plan, so I canceled that. I liked its interface better than MSGViewer Pro: smaller but still readable font, seemingly more options. I selected my test files and clicked the PDF button. It gave me options to combine the files into one PDF or produce separate files. It also provided a file name template, with choices of subject, sender, recipient, date, and source filename. I tried these. There were other options: which fields to export, whether to include attachments in the doc or put them in separate folders, header, footer, document properties. It did the conversion almost instantly. The date format was month-day-year. The subject data weren't cleaned up, so I would still have had to go through something like my spreadsheet process to get the filenames the way I wanted them. Moment of truth: the file contents included a colored top part, as I had encountered with Birdie (see previous post). HTML codes were still visible in some messages, but in others the HTML seemed to have been better converted into rich text. Typefaces were still tiny. Definitely a better program. But worth $100 for my needs?

Ideally, I would have been converting my emails to PDF as I went along, without converting them around and around, from Outlook to Thunderbird to EML and wherever else they might have gone over the past several years. This might have better preserved what I recalled as the colorful, more engaging look of some of them, and perhaps I would have come up with better ways of capturing those characteristics as I continued to become more experienced with the process. In the present circumstances, where I really just wanted to get the job done and move on, it seemed that playing with that sort of thing was not a short-term option.

Since I was planning to keep the EMLs anyway, and since I did not plan to view these emails frequently, I decided that I really didn't lose much in informational terms by going with the free option identified above. I took a larger sample of EMLs and, using Bulk Rename Utility, renamed them to be .txt files (though later I realized I could have just said "ren D:\Documents\*.eml *.txt"). Since I had installed Adobe Acrobat, I had a right-click option to convert to Adobe PDF. No doubt some freeware PDF programs provided similar functionality. The Acrobat conversion of these files into PDF was not nearly as fast as that performed by Total Mail Converter Pro. Acrobat put each of those newly created PDFs onscreen and obliged me to manually confirm that I wished to save them. I had converted 40 files, and wasn't interested in manually closing all 40; ultimately I had to use Task Manager to shut them down. That problem turned out to be just a result of the settings I was using for my default Bullzip PDF printer; changing those defaults and using Acrobat's Advanced > Document Processing > Batch Processing option made the process completely automatic. In terms of appearance, it seemed the fonts, HTML handling, and other features were more or less the same as I had gotten from those other programs (above). I probably could have made the average resulting email more readable (except where HTML formatting made clear who was responding to whom) by looking for a program that would strip the HTML codes out of those TXT files, but I didn't feel like investing the time at this point and wasn't sure the effort would yield a net improvement.

Briefly, then, the PDFing part of this process involved using a bulk renamer to replace the .eml extension with a .txt extension, and then using a bulk PDF printer or converter to convert those TXT files into PDF. This approach still preserved the look of some emails, while allowing others to be overrun with HTML codes.

I ran that batch process on a full year's set of EMLs. I converted 1,422 EMLs into TXT files by changing their extensions with Bulk Rename Utility. Somehow, though, Acrobat produced only 689 PDFs from that set. Which ones, and what had happened to the rest? Acrobat didn't seem to be offering a log file. My guess was that Acrobat went too fast for Bullzip. There was no real reason why I shouldn't have been using Acrobat's own PDF printer for this particular project -- in fact, I did not remember precisely what Acrobat snafu had prompted me to switch to Bullzip as my default PDF printer in the first place -- so I went into Start > Settings > Printers and made that change now. I also right-clicked and changed some of the Printing Preferences, for that printer, so that it would run automatically. I deleted the first set of PDFs and tried again. I noticed, this time, that Acrobat was not even trying to convert more than 689 files -- it was saying, "1 of 689," "2 of 689," etc. What was causing it to overlook these other files, I was not sure. It seemed I would have to do a "DIR /b > Printed.txt" command in CMD, and then convert Printed.txt into a Deleter.bat file that would delete the text files that were successfully printed, so as to highlight the ones that remained. (See previous post for details on these sorts of commands.)
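Working out which text files had been skipped is a set comparison. This sketch assumes each PDF keeps the base name of the TXT file it came from, which is what the batch process appeared to do. The function names are hypothetical. It splits the TXT list into converted and missing files, and builds the lines of a Deleter.bat for the converted ones.

```python
import os


def split_by_output(txt_names, pdf_names):
    """Return (converted, missing) lists of TXT names, matched on base name."""
    made = {os.path.splitext(n)[0].lower() for n in pdf_names}
    converted, missing = [], []
    for name in txt_names:
        base = os.path.splitext(name)[0].lower()
        (converted if base in made else missing).append(name)
    return converted, missing


def deleter_lines(converted):
    """One DEL command per successfully converted text file."""
    return [f'DEL "{name}"' for name in converted]
```

Looking over the missing list directly, before deleting anything, might also have shown whether the skipped files had something in common.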

(Incidentally, I had also noticed, now that I was working with the Acrobat batch options, that it had a "Remove File Attachments" option. While it did not seem to work with EMLs, possibly it would have been useful if these emails had been in MSG or PST format.)

The automated process got as far as file no. 2 in the list before it stalled. Why it stalled, I had no idea. I clicked on the X at the top right-hand corner of the dialog to kill it -- I even said "Close the Program" when Windows gave me that option -- and then Acrobat took off and printed a couple hundred more PDFs before stalling in that same way again. Possibly I had the Acrobat PDF printer's properties set to stop on encountering an error. I ran through most of that first set before spacing out and killing Acrobat (the whole program) at a stall, instead of just killing the stalled task. I deleted those that had printed successfully, creating a Deleter.bat file for the purpose as just mentioned, and ran another batch. This time, Acrobat was printing a total of 667 files. So I figured the situation was as follows: Acrobat would print PDFs through a glorified command-line kind of process, and that command line would accommodate only so many characters. If I'd had shorter file names, maybe it would have been willing to print thousands of TXT files at one go. If I had wanted to add complexity to the process, I could have renamed my files with names like 0001.txt, reserving a spreadsheet to change their names back to original form after conversion to PDF. But with my filenames as they were, it was only going to process 600 or 700 at a time. That was my theory.
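The numbered-filename idea that I decided against could be sketched like this. It is untested against Acrobat and does nothing to confirm my theory about the limit. The function names are hypothetical. The sketch pairs each long name with a short stand-in such as 0001.txt, then produces REN lines to shorten the files before conversion and to give the resulting PDFs their real names afterwards.

```python
import os


def numbered_plan(names, width=4):
    """Pair each original name with a short numbered name, keeping its extension."""
    plan = []
    for i, original in enumerate(names, start=1):
        ext = os.path.splitext(original)[1]
        plan.append((original, f"{i:0{width}d}{ext}"))
    return plan


def shorten_lines(plan):
    """REN commands to run before conversion."""
    return [f'REN "{original}" "{short}"' for original, short in plan]


def restore_lines(plan, new_ext=".pdf"):
    """REN commands to run after conversion, applied to the converted files."""
    lines = []
    for original, short in plan:
        short_out = os.path.splitext(short)[0] + new_ext
        final_out = os.path.splitext(original)[0] + new_ext
        lines.append(f'REN "{short_out}" "{final_out}"')
    return lines
```

The plan list plays the role of the spreadsheet I would have had to reserve for changing the names back.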

When Acrobat was done with the second set -- the first one that had run through to completion -- it showed me a list of warnings and errors. These were errors pertaining to maybe a dozen files. The errors included "File Not Found" (typically referring to GIFs that were apparently in the original email), "General Error" (hard to decipher, but in some cases apparently referring to ads that didn't get properly captured in the email), and several "Bad Image" errors (seemingly related to the absence of an image that was supposed to appear in the email). A spot check suggested that the messages with these errors tended to be commercial (e.g., advertising) messages, as distinct from personal or professional messages that I might actually care about. In a couple of cases a single commercial email would have several errors. But anyway, it looked like they were being converted, with or without errors.

I decided to try printing the next batch with Bullzip instead of the Acrobat printer. I had to set it as the default printer in Settings > Printers. I also had to adjust its settings (by going to its Options shortcut in the Start Menu > General and Dialogs tabs) so that it would run without opening dialogs or completed PDFs. Would it now process significantly more than 600 input files? The short answer: no. So for the next round, I tried selecting all the TXT files in a folder and right-clicking > Convert to Adobe PDF. This was a bad idea. Now Acrobat wanted to open a couple thousand documents onscreen. I had to force-reboot the system to stop this one.

So now I thought maybe I'd look for some other text-to-PDF converter. It sounded like ActivePDF was a leading solution for IT professionals, but I didn't care to spend $700+. Shivaranjan recommended Zilla TXT To PDF Converter ($30). Softpedia listed a dozen freeware converters, of which by far the most popular was Free EasyPDF. But I couldn't quite figure out what was going on there. There was no help file, and the program wasn't even listed on its supposed creator's webpage. CNET called it fatally crippled. I didn't know why 30,000 people would have downloaded it. Back to Softpedia's list: Free Text to PDF Converter was another possibility with a Good rating. Its webpage said it could batch-convert text to PDF files. I went into its Open option, selected a boatload of TXT files, and saw no sign that it had any intention of doing anything with them. Looking more closely at its starting screen, I saw it said this:

The documentation webpage said I was supposed to drag the TXT files into the window on the main screen to convert them. It also said this program would convert only plain text, not HTML. I wasn't sure what that meant for the EMLs that contained HTML code as plain text. The optional parameters had to do with font, paper size, etc. In the folder where I had my TXT files to be converted, I tried this command:

with quotation marks as shown, on the command line. It worked. It produced a PDF. There was no word wrap, so words would just break in the middle at the end of the line, like this:

We can't pledge that we've entirely emerged from th
at episode, but this
past summer I sat down and rewrote the entire man
ual in a way that makes
more sense. The guy just didn't know how to phrase

The print size was very large, though there were parameters to change that, but nothing, apparently, to persuade lines to break at the ends of words rather than in the middle. This could defeat Copernic text searching, rendering some PDF file contents unfindable, so it wasn't going to be a good solution for me. But it really seemed like the command line approach, which would let me name each file to be converted, was the answer to the problem of being able to process only ~600 text files at a time. Another possibility: AcroPad. The following command worked:

Acropad "File to Convert.txt" "File Converted.pdf" Courier 11

I could have named other typefaces and font sizes. Output was double-spaced. Lines were broken at the ends of words, not in the middle. HTML code in the file was just treated as text and printed out as-is. I kept searching. A post by Adam Brand said I could use a command to automate printing if I had Acrobat Reader installed. That prompted another search that led to several insights. First, it turned out I could print a file from the command line using a Notepad command in the form of "notepad.exe /p filename." Since my default printer was a PDF printer, it printed a PDF -- a nice one, too, for basic purposes, nicer than some of the output I was getting from the programs tested above. It put the output on the desktop. I changed the location for the output by going into the Desktop folder for my username. Since I was running as Administrator, the location was C:\Users\Administrator\Desktop. There, I right-clicked on the Desktop folder, went to Properties > Location tab and changed it. (Another Notepad option, which I didn't need, was to specify which printer I wanted to use: /pt.)

The Notepad approach did nothing with HTML codes in these plain text files. An alternative that would work with rich text, which might or might not help in my case, was supposedly to try the same switch with Wordpad: "wordpad.exe /p filename." But when I did that, I got an error message:

'wordpad.exe' is not recognized as an internal or external command, operable program or batch file.

This was odd. To fix it, I ran regedit (Start > Run) and went to
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\. There, following instructions, I right-clicked on App Paths and selected New > Key. I named the new key "Wordpad." I right-clicked on that Wordpad key and selected New > Expandable String Value. It apparently didn't matter what I called it. I called it ProgramPath. I right-clicked on ProgramPath and pasted in the path where Wordpad was, which I had obtained by going into the Properties of the Wordpad shortcut on my Start Menu. In other words, what I entered here included quotation marks and the name of the executable wordpad.exe, with extension. The instructions said that, to run Wordpad from the command line (as distinct from in Start > Run), the command would have to begin with the Start command. For present purposes, what I would type at the C prompt would be "start wordpad /p filename." This worked (and I exported the new registry key and added it to my Win7RegEdit.reg file for future installations), but it did not produce a superior PDF compared to that which Notepad had produced, and for some reason it truncated the filename of the resulting PDF.

Revised Final Step: Converting TXT to HTML to PDF

Searching onward, there was a possibility of treating them as HTML rather than TXT files. I had flirted with this earlier but had not grasped that, of course, these actually were HTML files in the first place; they had become EMLs and TXTs only later. I typed "ren *.txt *.htm" to rename them all as HTML files. To print them, there were some complicated approaches, but I hoped that PrintHTML.exe would do the trick. The syntax, for my purposes, was this:

printhtml.exe file="filename.htm"

with optional leftmargin=1, rightmargin=1, topmargin=1, and bottommargin=1 parameters, among others that I didn't need. The printhtml.exe file would of course have to be in the folder with the files being printed unless I wanted to add it to the registry as just described for Wordpad. PrintHTML wouldn't work until I installed the DHTML Editing Control. I did all that, and got no error messages, but also did not seem to get any output. I decided to put that on hold to look at another possibility: automated PDF printing using Foxit Reader on the command line. Pretty much the same command syntax as above:

"FoxitReader.exe" /p filename

Here, again, there was a need for a registry edit, unless I wanted to park a copy of Foxit in every folder where I would use it from the command line. But the instructions were only for using Foxit to print PDFs, so I got an error: "Could not parse [filename]." There was also an option of using Acrobat Reader to print a PDF silently or with a dialog box, but there again it wasn't what I needed: I was printing HTMLs. I returned to that printhtml.exe program mentioned above. The command ran, with no indication of any errors, but there did not seem to be any output. Another possibility was:

RUNDLL32.EXE MSHTML.DLL,PrintHTML "Filename.htm"

But for me, unfortunately, that produced an empty PDF. Turning again to freeware possibilities, I found an Xmarks list of top-ranked HTML to PDF programs. Most of the top-ranked items were online, one-file-at-a-time tools. Others required PHP knowledge that I didn't have (e.g., HTML_ToPDF, PDF-o-Matic). HTMLDOC looked promising for command-line usage; I found its manual; but when I downloaded and unzipped it, I couldn't find anything that looked like a setup or installation file. Apparently the version that's free is the source code, and I didn't know how to compile it. DomPDF and html2pdf (and, I suspect, some of these others) were apparently for Linux, not for Windows. I tried wkhtmltopdf. When I ran it, I got an error:

wkhtmltopdf.exe - System Error
The program can't start because libgcc_s_dw2-1.dll is missing from your computer. Try reinstalling the program to fix this problem.

Possibly the reason I got that error is that I was trying the same trick of running the program in a folder where my PDF files were. I had copied the executable (wkhtmltopdf.exe) to that folder, but had not brought along its libraries or whatever else it might need. I tried running it again -- I was just trying to use the help command, "wkhtmltopdf --help" -- but this time pointing to the place where the program files were installed:

"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" --help

and that worked. I got a long list of command options. What I understood from it was that I wanted, in part, a command like this:

Error: Failed loading page http: (sometimes it will work just to ignore this error with --load-error-handling ignore)

So I tried adding that long parameter to the command. It seemed like it worked: it gave the error message but then proceeded through the rest of its steps and announced, "Done." But I didn't see any output anywhere. Then I realized there was an error in what I had actually typed. I tried again. This time, it gave me a different error message: "You need to specify at least one input file, and exactly one output file." So the format I was supposed to use, aside from that additional "--load-error-handling ignore" parameter, was this:

And that worked. At last, I had a mass-production way of converting EMLs (by changing their extension to .htm, not .txt) to PDFs. It was too early to break out the champagne, but at least the computer and I were back on speaking terms. Now I just needed to run "DIR /s /b > dirlist.txt" in the top-level folder under which I had sorted my emails, convert that dirlist.txt file into a .bat file that would convert the file listings into batch commands, and run it. I was afraid the whole command, with the introductory reference to C:\Program Files, would be too long for Windows in some cases, so I edited the registry as described above, so that I would only have to type wkhtmltopdf.exe at the start of each command line. But now that registry edit wasn't working -- it certainly seemed to previously -- so I copied all of the wkhtmltopdf program files to the folder where I would be running this batch file. I didn't want the computer to crash itself by opening hundreds of simultaneous wkhtmltopdf processes, and I wanted to move the PDFs, so the format I used for these commands was:

That worked. Now I investigated the longer list of wkhtmltopdf command-line options, by typing "wkhtmltopdf -H" (with a capital H). Whew! The list was so long, I couldn't view it in the cmd window -- it scrolled past the point of recall. I tried again: "wkhtmltopdf -H > wkhtmltopdf_manual.txt." I couldn't add too much to the command line -- I was already afraid the long filenames would make some commands too long for CMD to process. But having viewed some output of these various PDFing programs, a few sets of commands seemed essential, including these:

-T 25 -B 25 -L 25 -R 25
--minimum-font-size 10

The first set would give me one-inch margins all around. Putting these on the already long command line increased my interest in another option: --read-args-from-stdin. This one, according to the manual, would also have the advantage of speeding up the process, since I would be starting wkhtmltopdf just once, and then re-running it with different arguments. The concept seemed to be that my conversion batch file (or, really, just a typed command) would contain this:

start wkhtmltopdf --read-args-from-stdin < do-this.txt

and then do-this.txt would contain line after line of instructions like this one:

Or perhaps they could be rearranged so that some of the contents of the second could be in the first, and therefore would not have to be repeated on every line in do-this.txt. In which case the main conversion command would look like this:

and do-this.txt would contain only the "before" and "after" filenames. I decided to try this approach. Unfortunately, it didn't work. It froze. So then I tried just the minimal one shown a moment ago, putting all options except --read-args-from-stdin in the do-this.txt file. Sadly, that froze too. I tried the minimal command plus just filenames, leaving out the several additional commands about margins and font size. Still no joy. So, plainly, I did not understand the manual. I decided to go back to the approach of just putting it all on one line and repeating all commands, in a batch file, for each HTM file that I was converting to PDF. Each line would begin with "start /wait," not just "start," for reasons stated above. This worked, but now I noticed a new problem that I really hadn't wanted to notice before, because I just wanted this project to be done already.
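For readers who would rather script the one-command-per-file approach than build it in a spreadsheet, here is a minimal sketch in Python. Python was not part of this project, and the exact command format used here is not reproduced in this post, so the function names, the example paths, and the ordering of the command line are assumptions for illustration. The margin and font-size options and the "start /wait" prefix are the ones discussed above.

```python
from pathlib import PureWindowsPath

# Options quoted earlier in this post.
OPTIONS = "-T 25 -B 25 -L 25 -R 25 --minimum-font-size 10"


def build_command(htm_path, pdf_dir):
    """Return one batch line that converts htm_path into a PDF inside pdf_dir.

    Hypothetical helper: the real command format is not shown in the post.
    """
    src = PureWindowsPath(htm_path)
    dst = PureWindowsPath(pdf_dir) / (src.stem + ".pdf")
    # "start /wait" makes CMD finish one conversion before launching the next.
    line = f'start /wait wkhtmltopdf {OPTIONS} "{src}" "{dst}"'
    # Inside a .bat file a literal percent sign has to be doubled.
    return line.replace("%", "%%")


def build_batch(htm_paths, pdf_dir):
    """Join one command per file into the text of a .bat file."""
    return "\r\n".join(build_command(p, pdf_dir) for p in htm_paths) + "\r\n"
```

The paths would come from a listing such as dirlist.txt, and the returned text would be saved with a .bat extension and run from the folder containing wkhtmltopdf.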

Separating EMLs With and Without HTML Code

The new problem was that emails that were originally in HTML format turned out best when they were now renamed with an .htm extension, and processed that way, but the ones that didn't have HTML codes in them were now reduced to a mess. Specifically, line and paragraph breaks were gone; everything was just jumbled together in one continuous stream of text. (An HTML renderer ignores plain line breaks unless tags mark them.) Every non-HTML email was now being represented by a single long paragraph. To get decent output, it seemed that I needed to separate the emails that contained HTML code from those that did not. I would then use wkhtmltopdf with the former, but not with the latter. But how could I tell whether a file contained HTML code? I decided that an occurrence of "</" would be good enough in most cases. But then it occurred to me that there might be programs that would sort this out for me. A search led to the FileID utility. Its read-me file led me to think that this command, entered in the top-level folder where the files to be checked were located, might do the job:

"D:\FileID Folder\fileid" /s /e /k /n

This would run FileID from the folder where its program files were stored, and would instruct FileID to check all files in all subdirectories, to automatically change file extensions to match contents, to delete null files, and not to prompt me for input. But it did not seem to be working. Regardless of whether I entered these options as upper- or lower-case (e.g., /S or /s), FileID paused after every screenful of information, and did not seem to be renaming anything. I decided to try again with another command-line program of similar purpose, TrID. TrID had an online version and a GUI. On second thought, I decided to give the GUI version a whirl. I downloaded the program and its XML definitions. (I already had the necessary .NET Framework installed.) As advised by Billy, I moved everything from the XML definition folder (after unzipping them with WinRAR) into the folder containing the TrIDNet.exe file. I double-clicked on that executable and saw that it would process only one file at a time.

I moved on to the command-line version. This called for a download of a different set of program files and definitions. I wasn't sure whether TrID would actually change incorrect extensions, or just detect them. Again, rather than plow into the support forums, I just tried it out. But in this case, that strategy didn't work: there were no manual or other usage instructions in the download. The forum contained a tip on using PowerShell to fix extensions, but I didn't know enough about PowerShell to be able to interpret and adapt that tip to my situation. But, silly me, I had forgotten about the program's built-in help. In the folder where I had unzipped TrID.exe, I opened a cmd window and typed "trid -?" and got the idea that I could type "trid -ce" or perhaps "trid *.* -ce" to have the program change file extensions as needed, for all files in the current directory. It didn't appear to have a subdirectory option, so I would have to do some file moving.

A different approach was to use a CHK recovery program to detect the proper extension for anything with a CHK extension. While FileCHK looked like the better program for recovering real CHK files, it looked like UnCHK would have more flexibility for my situation, provided I first ran "ren *.htm *.chk" to change the file extensions to .chk. When I tried to run unchk.exe, I got an error message:

The program can't start because MSVBVM50.DLL is missing from your computer. Try reinstalling the program to fix this problem.

Eric had already warned me, in the read-me file, that this meant I needed to download and install the Visual Basic 5 runtime. I did, and tried again. Now it ran. I couldn't find documentation or a /help option to explain its settings. It took me a while to realize it wasn't a command-line program, though it could run from the command line. It was very bare-bones. I started it, navigated to the first of the folders I wanted to repair, and (having renamed files to have .chk extensions), gave it a try. It gave me a dialog asking about Scan Depth. I knew from the read-me that I wanted the Whole Files option. It ran for a while and then disappeared. It didn't seem to have done anything. After some more searching around, I concluded that this CHK approach wasn't what I wanted.

So I looked elsewhere. If I wanted to spend a day or so refreshing my aging knowledge of BASIC programming, or invest some time in learning more about batch scripting or Microsoft Access or some other program, I was pretty sure I could work up a way to examine file contents. But I wanted a solution faster than that, if possible. The CMD batch FIND command looked like it might do the job. But the command that I thought should work,

FOR %G IN (*.txt) do (find /i "</" "%G")

didn't. It wasn't because "</" were weird characters; it wasn't finding files containing ordinary text either. I tried again with the FINDSTR command:

findstr /m /s "</" *.* > dirlist.txt

This looked promising. But when I examined dirlist.txt, I saw that many of the files listed in it were better presented as TXT than as HTM. Apparently I should have been looking for files with more substantial HTML content. A spot check of several emails suggested that the existence of an upper- or lower-case "<html" might be a good guide. So apparently I would have to run FINDSTR twice:

with two ">" symbols in the second one, so as to avoid overwriting the results of the first search with the results of the second. I tried that. There were some error messages, "Cannot open [filename]," apparently attributable to weird characters in the file's name; somehow it seemed I had still not entirely succeeded in cleaning those up. I assumed FINDSTR's failure in this regard would leave those files being treated as TXT by default, which would probably be OK since the majority of files overall appeared to be non-html. Ultimately, dirlist.txt contained a list of maybe 40% of all of the emails I was working on. That seemed like it might be about right. In other words, it seemed that about 60% of the emails were best treated as plain text, and I would be getting to those shortly. I put dirlist.txt into a spreadsheet to produce commands that would run wkhtmltopdf on the files that those two commands listed in dirlist.txt. The key formula from that spreadsheet:

That formula, applied to each file identified as containing "<html," produced PDFs that looked relatively good. I found that I needed a way of testing them, though, because in a number of cases wkhtmltopdf had produced PDFs that would not open. I also noticed that the batch file running these commands kept acting like it had died. Windows would say, "wkhtmltopdf.exe has stopped working," and I would click the option to "Close the program." And then, after a while, it would come roaring back to life. This may have happened especially when wkhtmltopdf was converting simple email messages into PDFs of a thousand pages or more. A thousand pages of gibberish. In a number of cases, too, the resulting PDF was a failure. When I tried to open those PDFs, Acrobat said this:

There was an error opening this document. The file is damaged and could not be repaired.

I was not sure what triggered these problems. I wondered if possibly the simpleminded conversion from EML to HTM by merely changing the extension caused problems in the case of EMLs that contained attachments. If that was the case, then what I should have done might have been to export from Thunderbird in HTML format in the first place -- to do two exports, in other words: one for EMLs, which would include attachments, to be zipped up into an archive and shelved until the future day when there would be a simple, cheap solution for the PDFing of emails plus their attachments; and another export in HTML, for purposes of PDFing here and now, without attachments. I tested this with one of the gibberished emails and found that, when exported from T-bird as HTML using ImportExportTools, it did print to a good-looking PDF. In that approach, the naming procedures used to rename the exported emails in the desired way -- containing date, time, sender, recipient, and subject information -- would apparently have to be preserved and reapplied, so that both exported sets -- the EMLs and the HTMLs -- would be named as desired.

To investigate these questions, I traced back one PDF that did not open -- that produced the error message quoted above -- and one that opened but that was filled with gibberish. The one that was damaged did not come from an email that originally contained attachments. I was able to print that email directly from Thunderbird without problems. So I wasn't sure what the problem was there. For a sample of one filled with gibberish, I chose the largest of them all. This was a 3,229-page PDF that was produced from a little two-page email that did originally have an attachment. I sampled three other PDFs containing gibberish. All three had come from emails that originally had attachments. So it did appear that attachments were foiling my simplistic approach of just changing file extensions from EML to HTM. I wondered if it was too late to just change the extensions back to .eml, for the ones that had not produced good PDFs, and maybe PDF them manually. I tried with one, and it worked. So that would have been a possibility, assuming I had time for printing emails one by one.
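Since the gibberish PDFs traced back to emails with attachments, one way to flag such EMLs before converting them would be to parse each file and ask whether it carries any attachment parts. The following is a minimal sketch using the email package from Python's standard library. It is not something used in this project, the function name is hypothetical, and it has not been run against the emails described here.

```python
from email import policy
from email.parser import BytesParser


def has_attachment(eml_bytes):
    """Return True if the parsed message contains at least one attachment part.

    Hypothetical helper: relies on the standard library's own notion of an
    attachment, which may not match every mail program's behavior.
    """
    msg = BytesParser(policy=policy.default).parsebytes(eml_bytes)
    # iter_attachments() yields nothing for a simple, non-multipart message.
    return any(True for _ in msg.iter_attachments())
```

The bytes would come from reading an exported EML file in binary mode. Files for which this returns True could be set aside for manual printing rather than being renamed to .htm.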

It seemed the gibberish might not be gibberish after all. It might be a digital representation of the photograph or whatever else was attached to the email. I didn't know of a way to test text for gibberish, so this didn't seem to be a problem that I could deal with very effectively at this point. I could name some files as HTM, as I had done, and just accept a certain amount of gibberish -- perhaps after screening out the really large PDFs (or, earlier in the process, the large EMLs, TXTs, or HTMs), which seemed most likely to have had attachments -- or I could rename them all as TXTs and print them that way, looking solely for the text content without regard to their appearance (and still probably getting gibberish). If I needed to know how they looked originally, I would have to go back to the archived EML version of the PDFd text. A third option was to go back to T-bird and re-export everything as HTML, thereby skimming off the attachments, and then use my saved renaming spreadsheets to rename the newly produced, roughly named HTMs, and then do my PDFing from those new HTMs. Presumably, that is, the new HTMs would print correctly, since they would not have attachments.
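The split attempted in this section, between files that contain "<html" in any letter case and files that do not, could also be done in a single pass by a short script. This is only a sketch in Python, which was not part of this project; the function names are hypothetical. Reading the files as raw bytes sidesteps character-encoding errors, but an HTML part stored in an encoded form (base64, for example) would not be detected this way.

```python
import os


def looks_like_html(path, marker=b"<html"):
    """Return True if the file contains the marker in any letter case."""
    with open(path, "rb") as handle:
        # bytes.lower() only changes ASCII letters, which is all the marker needs.
        return marker in handle.read().lower()


def split_by_html(top_folder):
    """Walk top_folder recursively and return (html_files, plain_files)."""
    html_files, plain_files = [], []
    for folder, _subfolders, names in os.walk(top_folder):
        for name in names:
            path = os.path.join(folder, name)
            if looks_like_html(path):
                html_files.append(path)
            else:
                plain_files.append(path)
    return sorted(html_files), sorted(plain_files)
```

The first list corresponds to the files that would go to wkhtmltopdf; the second holds the ones to be treated as plain text.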

Back to the Drawing Board: T-Bird to HTML to PDF

I decided to try that third option. I went back to Thunderbird and used ImportExportTools to export the emails as HTML rather than as EML. It would have been more logical to start by PDFing these HTMLs, to make sure that would work; but at this point I had such a clutter of emails in various formats that I decided to proceed, as before, with the renaming process first, so as to be able to delete those that I wasn't going to need. Having already worked through the process of renaming to the point of achieving final names, I used directory listings and spreadsheets to try to match up the "before" names (i.e., the names of the raw HTML exports) and the "after" names (i.e., the final names I had developed previously).

Once I had the emails in individual HTML files with workable filenames, I ran wkhtmltopdf again. I started by taking a directory listing of the files to be converted; I put those into a spreadsheet, as before; and in the spreadsheet I used more or less the same wkhtmltopdf formula shown above in order to produce working commands. These pretty much succeeded. I was now getting good PDFs from the emails. It seemed that wkhtmltopdf had a habit of wrapping lines severely or perhaps indenting them too much. That is, if I wrote an email in reply to someone else, the text of my email would look fine,

but the text of the message

to which I was replying,

typically shown below the

reply text, would be

indented and then broken

like this.

Wkhtmltopdf converted HTML files to PDF at a rate of somewhat more than one email per second. Of course, these were small files, as email messages tend to be. So many small files took up a lot of disk space; it seemed I might have been well-advised to format the drive with a smaller-than-default cluster size. The program slowed down considerably at times. I assumed it was running into complexities with some HTML files.

The batch file ran and finished, but it had converted only about half of the HTMLs into PDFs. I decided to test the PDFs before deleting the corresponding HTMLs. I opened a half-dozen of them without a problem. Then, for a more thorough test, as described in a separate post, I ran an IrfanView batch conversion from PDF into RAW format. I chose RAW because it would result in just one file. TIF might have been another possibility. It did appear that this process was all working well. Ultimately, these steps converted all of the HTMLs into PDFs.
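A much cruder test than the IrfanView conversion, but a quick one, would be to check that each output file at least begins and ends the way a PDF normally does. The sketch below is in Python, which was not part of this project, and the function name is hypothetical. It looks only for the "%PDF-" signature at the start and an "%%EOF" marker near the end, so a file can pass this check and still fail to open; it is meant to catch truncated or empty output, nothing more.

```python
def looks_like_valid_pdf(path, tail_bytes=1024):
    """Return True if the file starts with %PDF- and has %%EOF near its end.

    A rough screen only: it does not parse the PDF structure.
    """
    with open(path, "rb") as handle:
        head = handle.read(5)
        handle.seek(0, 2)                       # jump to the end to learn the size
        size = handle.tell()
        handle.seek(max(0, size - tail_bytes))  # back up to the last chunk
        tail = handle.read()
    return head == b"%PDF-" and b"%%EOF" in tail
```

Files that fail this screen would be the first candidates for reconversion.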

Summary

The first part of what I was able to achieve, at this point, was to export my emails from Thunderbird to EML format, using the ImportExportTools add-on for Thunderbird. Once I had exported all those EMLs, I used a zipping program (either WinRAR or 7-Zip) to bundle them together into a single file containing all of a year's emails. I took these steps because EML files, unlike HTML, PDF, JPG, TXT, or other formats, were able to contain email attachments along with the text of the email messages. I planned to keep these year-by-year ZIPs of EMLs until some point when I could find a cheap and broadly accepted program for printing both the email message and its attachment into a single PDF.

The other main achievement was to work out a process for converting HTMLs (also exported from Thunderbird via ImportExportTools) into PDFs. I used wkhtmltopdf for this purpose. I ran it in a batch file, produced by a spreadsheet, so that there was one command per file. I used DIR folder comparisons and other means to test that all files were being converted and that they were being converted into valid PDFs.
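The folder comparison mentioned above amounts to asking which HTM files have no PDF of the same name. A minimal sketch of that check in Python follows. Python was not part of this project; the function names are hypothetical, and the sketch assumes that each PDF keeps the base name of its source file and that each folder is compared without descending into subfolders.

```python
from pathlib import Path


def names_by_stem(folder, suffix):
    """Map lower-cased base names to file names for files with the given extension."""
    return {
        p.stem.lower(): p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    }


def missing_pdfs(htm_folder, pdf_folder):
    """Return the sorted names of .htm files that have no same-named .pdf."""
    sources = names_by_stem(htm_folder, ".htm")
    converted = names_by_stem(pdf_folder, ".pdf")
    return sorted(name for stem, name in sources.items() if stem not in converted)
```

An empty result would mean every HTM file in the folder has a PDF counterpart; anything listed would need to be run through the batch file again.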