I need to index all back issues of Mamluk Studies Review (open access, now digital only but formerly print) but have not had much luck finding ideas about how to go about it.
Searching the Web for info about indexing PDFs leads largely to results about indexing them on a computer for improved searches, or to indexing services.
I hope to find software (or scripts or something!) that can

read PDF files

understand the idea of page numbers

understand that each page in a PDF is a distinct entity

handle Unicode and diacritics (and, ideally, Arabic script)

see phrases or hyphenated words that break across pages as single items

I don't expect anything to happen automatically: I know I (or better yet an unwary grad student) will have to actually go through and mark words and phrases to be included in the index.
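Just to make the first few requirements concrete, here is the sort of per-page extraction I imagine a script doing. This is only a rough sketch using the PyMuPDF library (my assumption, not something I've actually run on these files), and the hyphen check at the end is just a guess at how page-break hyphenation might be detected:

    import fitz  # PyMuPDF

    def pages(pdf_path):
        """Yield (page_number, unicode_text) for every page of the PDF."""
        doc = fitz.open(pdf_path)
        for page in doc:
            yield page.number + 1, page.get_text("text")

    for page_no, text in pages("some_issue.pdf"):
        if text.rstrip().endswith("-"):
            # the page ends mid-word; an indexing tool would have to stitch
            # the two halves back together before treating the word as one item
            print(f"page {page_no} ends in a hyphenated word")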

Bonus points if it can be taught to ignore certain strings when alphabetizing. For example, since 'al-' is Arabic for 'the', it doesn't affect alphabetization (so al-Nasir Muhammad goes in the N section).
Similarly, there needs to be a way to instruct it that ā and a are the same for purposes of alphabetization, as are ṣ and s, etc.
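To show the kind of rule I mean, here is a quick sketch (nothing more than an illustration; it assumes the accented characters decompose cleanly into a base letter plus a combining mark):

    import unicodedata

    def index_sort_key(term):
        # ignore a leading 'al-' when alphabetizing
        if term.lower().startswith("al-"):
            term = term[3:]
        # NFD splits 'ā' into 'a' plus a combining macron; dropping the marks
        # makes ā sort with a, ṣ with s, and so on
        decomposed = unicodedata.normalize("NFD", term)
        return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

    entries = ["maqāmah", "al-Nāṣir Muḥammad", "Ṣubḥ"]
    print(sorted(entries, key=index_sort_key))
    # al-Nāṣir Muḥammad lands among the N's, Ṣubḥ among the S's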

Super bonus points if it can recognize (or learn to recognize) variations on a word or phrase in terms of spelling (often inconsistent when transliteration is involved), word order or intervening words.
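Even something crude would help here. Another rough sketch, just to illustrate; the similarity threshold is a guess, and word order or intervening words would need something smarter than this:

    import difflib
    import unicodedata

    def fold(term):
        # strip diacritics so Nāṣir and Nasir compare letter for letter
        decomposed = unicodedata.normalize("NFD", term.lower())
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    def looks_like_variant(a, b, threshold=0.8):
        return difflib.SequenceMatcher(None, fold(a), fold(b)).ratio() >= threshold

    print(looks_like_variant("maqāmah", "maqamah"))                      # True
    print(looks_like_variant("al-Nāṣir Muḥammad", "al-Nasir Mohammed"))  # True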

What I have: 23 issues of the journal as whole-book PDFs, as well as individual PDFs of all the articles. Unfortunately, the first half dozen or so were created without Unicode, using proprietary fonts with non-standard encodings. Messy, but I can work around it somehow. I also have InDesign files (various versions) for about half the issues. This will all be done on Windows (32-bit XP and 64-bit 7). I always have the latest version of Acrobat (not Reader, the full program).

The resulting index will be posted on the Web, probably both as a PDF and in some more dynamic and usable format(s).

I'm confused. Are you making a concordance (a list of the words/phrases present in the text, with pointers), or an index (a synthesized list of important terminology, with pointers to meaningful mentions and passing ones omitted)? They're not at all the same thing.

One thing I've been playing with today is going through a PDF and using the highlight tool on words/phrases, in the hope that I can then export the comments list (which has page numbers) to some format I can work with. It doesn't work very well for the older issues with the messy fonts, since you can't always tell what the word was supposed to be (Ṣubḥ becomes ˝ubh˝S and maqāmah becomes maqa≠mah or mah≠maqa, and words with lots of diacritics become almost unrecognizable as words). Those fonts were on long-dead Macs running OS 7-OS 9, so they aren't available to me now.
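For what it's worth, this is the sort of export I was hoping to get from the highlights. A sketch only, again assuming the PyMuPDF library; the text grabbed under a multi-line highlight may be sloppy, and the filename is a placeholder:

    import fitz  # PyMuPDF

    def highlighted_terms(pdf_path):
        doc = fitz.open(pdf_path)
        for page in doc:
            for annot in page.annots():
                if annot.type[0] == 8:  # 8 = Highlight annotation
                    # grab whatever text sits under the highlight rectangle
                    term = page.get_text("text", clip=annot.rect).strip()
                    yield term, page.number + 1

    for term, page_no in highlighted_terms("msr_issue.pdf"):
        print(f"{term}\t{page_no}")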

I believe you're looking for an automated way to create a back-of-the-book index, correct? 'Indexing' usually refers to building indices for information retrieval (the territory of tools like Terrier and Lucene and their PDF parsers), which is why you couldn't find it on Google.

Back-of-the-book indexing is a tough problem. Patrick Juola wrote about the need for such software and the technical challenges in Killer Applications for Digital Humanities. If I recall correctly, he had some early work in the area; I'm not sure what came of it.

I don't know of any software that would do what you need. However, since it's a tough problem, you can be sure that researchers have tried it. Your best bet is to look through the research literature and see whether any of them have released their code. A Google Scholar search for 'back-of-the-book indexing' along with keywords like 'unsupervised', 'semi-supervised', or 'automated' gave me some potentially useful articles. Still, you'd probably have to split the problem into two parts (parsing the PDFs to text, then generating the index), as I suspect there aren't any tools mature enough to include PDF parsing.

To be honest, your approach of going through manually and highlighting notable terms sounds more tractable to me. As for the OCR problems: have you tried re-applying text recognition to the older issues with the newest version of Acrobat Professional? Adobe's OCR improves regularly.

Not automated. Just more convenient, and perhaps with some automated features to help with the actual index creation.

I hadn't thought of running any OCR on the older files, since I made them many years ago from the original Word or Nisus files (i.e., they were never scanned or OCRed), but that's a great idea that I'm about to try. Don't know if OCR will ignore the text that's already 'live' though, or if I'll have to flatten them first.

I'll definitely take a stroll through the research and see what I can find.

The OCR idea seems to be a bust, unfortunately. It's a pain to convert to a "flat" file without renderable text, and even then the newest version of Acrobat has trouble with diacritics, italics, and anything else non-standard. I think I'll be better off working with mistakes that follow a regular pattern (such as a≠ always equals ā) and writing a script or something to do mass replacements.
Don't know why Adobe doesn't allow OCR of files with renderable text in them. What could be the harm?
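Something as simple as this might be enough for the regular patterns. A sketch only; the table would have to grow as I find more breakages, and the filenames are placeholders:

    # map the regular breakages from the old fonts onto proper Unicode;
    # only the one documented pattern is filled in here
    REPLACEMENTS = {
        "a≠": "ā",
        # add more pairs here as they turn up
    }

    def clean(text):
        for bad, good in REPLACEMENTS.items():
            text = text.replace(bad, good)
        return text

    with open("issue01_raw.txt", encoding="utf-8") as src:
        fixed = clean(src.read())
    with open("issue01_fixed.txt", "w", encoding="utf-8") as dst:
        dst.write(fixed)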

You could try indexing software such as http://www.pdfindexgenerator.com/. But it sounds as if the level of quality and detail to which you aspire would be best handled not so much by software as by hiring a professional indexer who already uses such software and can produce a strong index in a timely manner. Of course, if you have more time than money, that may not be feasible.