The PDF files now come over with titles like PDF Document 123456_latest.pdf,
along with some of the contents of the file for PDFs that have text (as
opposed to scanned pictures). It works great. I'm going to hold off on the
"more adventurous" one for now :) Thanks again.

Dan Naughton

According to naughton@domino.danielwoodhead.com:
>
> There was a parse_doc.pl script that I downloaded with htdig. But the
> directions said that if acroread was in the path, it would find it and
> parse the .pdf's by default. If you wanted an external parser other than
> acroread, you would have to specify it in the htdig.conf. I tried it both
> ways, with similar results. I finally left it on the default (acroread).
>

Geoff's suggestion would work, but it could be tedious to manually enter
the file name (or parts of it) into the title field of each PDF, using
Adobe's Acrobat Exchange.

An alternative that doesn't really involve any programming is to install
the xpdf package and the conv_doc.pl script, and change the PDFINFO
definition in conv_doc.pl to "/bin/true". Then, add an external_parsers
definition in your htdig.conf, as shown in conv_doc.pl's comments. That
way, when it parses PDFs, it won't run the real pdfinfo program, so it
won't grab the real title field from the PDF (if one is defined). Instead,
it will fall back to making up a title like this:

PDF Document 123456_latest.pdf
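
For reference, the external_parsers definition mentioned above might look
something like the fragment below. Treat it as a sketch only: the
/usr/local/bin path is an assumption about where conv_doc.pl is installed,
and the comments at the top of your own copy of conv_doc.pl are the
authority on the exact syntax for your htdig version.

```
# htdig.conf fragment (sketch; install path is an assumption)
external_parsers: application/pdf->text/html /usr/local/bin/conv_doc.pl
```

With that in place, rerun htdig so the PDFs are reparsed through
conv_doc.pl rather than the default acroread handling.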

If you're feeling a bit more adventurous, and want the title to include
both the real contents of the PDF's title field and the file name, you
could instead define PDFINFO to be the real pdfinfo program, and then find
this section of conv_doc.pl:

# print out the title, if it's set, and not just a file name, or make one up
if ($title eq "" || $title =~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
@parts = split(/\//, $ARGV[2]); # get the file basename
$title = "$type Document $parts[-1]"; # use it in title
}

and change it to this:

# print out the title, if it's set, and not just a file name, or make one up
@parts = split(/\//, $ARGV[2]); # get the file basename
$title = "$type Document $parts[-1] - $title"; # use it in title

This will give you something like:

PDF Document 123456_latest.pdf - Title of your document
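
One wrinkle with the simple change above: if a PDF has no title field set,
the result ends in a dangling " - ". A possible variation, sketched here as
a standalone script, only appends the real title when there is one. The
make_title helper and the example.com URL are made up for illustration and
are not part of conv_doc.pl; inside the real script you would apply the
same logic to $type, $ARGV[2] and $title directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in conv_doc.pl): build a title from the document
# type, the path or URL of the file, and the title pdfinfo reported, which
# may be empty.
sub make_title {
    my ($type, $path, $title) = @_;
    my @parts = split(/\//, $path);           # get the file basename
    my $made  = "$type Document $parts[-1]";  # always include the file name
    # Append the real title only if it's set, and not just a file name.
    if ($title ne "" && $title !~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
        $made .= " - $title";
    }
    return $made;
}

# Example URL is an assumption, used only to exercise the helper.
my $url = "http://www.example.com/docs/123456_latest.pdf";
print make_title("PDF", $url, "Title of your document"), "\n";
print make_title("PDF", $url, ""), "\n";
```

The first call includes the real title after the file name; the second
falls back to just the type and file name, same as the /bin/true approach.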

Either way, the number in the filename will get parsed as a word in the
title. You'll need to keep your allow_numbers and valid_punctuation
attributes as shown in your earlier e-mail message, so that htdig will
parse and store the number separately.