I'm considering seppuku and there is a little voice in the back of my head whispering "someone on that forum probably has the answer and could offer it in their sleep." I'm hoping that whisper in my head is right.

As I've mentioned in other places I am currently working on a large project importing very old epubs into calibre. As I bring them in I am trying to clean them up as much as my skill set will allow. I find that my skills are expanding with nearly every book! Anyway, many of these books probably started life as scanned images put through PDF OCR, converted a bazillion times using every free conversion software known to man, and have acquired code garbage that is becoming one of my own personal demons.

A lot of these books have gone through Word on their way to epub. In searching the forums I am seeing a lot of options for cleaning up the MsoNormal that I'm looking forward to trying. That's not my issue here.

At some point these books had images with "Top," "Back" and "Next" buttons that were links to the previous and next chapters or up to the main TOC. I've seen this in LIT files before. Now, however, there are no buttons, but some of the links, which are invisible in WYSIWYG, are still active (or are trying to be). They point to non-existent files on the C: drive of someone presumably long dead. Because each one is for a different numbered chapter, image, etc., there is no one universal search. They are unique, if only by a couple of letters or numbers.

This is an example of what I am faced with at the beginning of every chapter:

The classes change frequently, too. The only consistent thing I see in these piles of steaming...

is the "c:/DOCUME~1". I'd like to build a search that would get rid of that entire string in every instance that contains "c:/DOCUME~1", but I'm not sure how to write a search that says "find c:/DOCUME~1 and then delete everything between the span tags," or whatever other solution would work.

Did I just make any sense at all or am I shopping for ritual knives tomorrow?

I'd suggest not killing too many tags in a single go. While it can work well, mistakes are often pretty costly. Rather, I'd use a few simpler expressions to weed out the unwanted elements, then finally do a cleaning pass for empty spans (or anything else). For example:

anchors that seem to refer to a local filesystem (DOS style). Note: your example did not seem to have a closing a tag?

Code:

<a\b[^<>]*?[[:alpha:]]:/[^<>]*?>
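If you want to test that expression outside Sigil, here is a minimal Python sketch. It assumes Python's re module, which does not support POSIX classes like [[:alpha:]], so [A-Za-z] is swapped in. The sample markup and the name LOCAL_ANCHOR are made up for illustration, not taken from the book.

```python
import re

# Same idea as the expression above; [A-Za-z] stands in for [[:alpha:]],
# which Python's re module does not understand.
LOCAL_ANCHOR = re.compile(r'<a\b[^<>]*?[A-Za-z]:/[^<>]*?>')

# Hypothetical sample: an opening anchor with a drive-letter path and,
# as in the posted example, no closing a tag.
sample = '<p><a href="c:/DOCUME~1/x/next.htm"><span class="c1"></span>Chapter</p>'

# Remove only the opening anchor tag; the empty span is left for a later pass.
cleaned = LOCAL_ANCHOR.sub('', sample)
print(cleaned)  # <p><span class="c1"></span>Chapter</p>
```

One thing to watch: a letter followed by :/ also occurs in http:// (the "p:/"), so as written this would hit web links too. Counting the matches before replacing is the safer order.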

images which refer to next/prev

Code:

<img[^<>]*?alt="(next|prev)"[^<>]*/>
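A quick way to see what that one does and does not touch, sketched in Python. The sample markup and the name NAV_IMG are assumptions for illustration; the pattern is the one above, unchanged.

```python
import re

# The image expression from above, unchanged.
NAV_IMG = re.compile(r'<img[^<>]*?alt="(next|prev)"[^<>]*/>')

# Hypothetical sample: one navigation button and one real illustration.
sample = '<p><img src="next.gif" alt="next" /><img src="cover.jpg" alt="cover" />Text</p>'

# Only the image whose alt text is "next" or "prev" is removed.
cleaned = NAV_IMG.sub('', sample)
print(cleaned)  # <p><img src="cover.jpg" alt="cover" />Text</p>
```

Note that it only matches self-closed images (ending in />) and lowercase alt text, so alt="Next" or a bare > at the end would slip through unless the pattern is loosened.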

empty tags (leaves tags containing nbsp, to be safe)

Code:

<(\w+)\b[^<>]*>\s*</\1>
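Because removing an inner empty tag can leave its parent empty, this expression usually needs more than one pass. A minimal Python sketch of that, assuming a made-up sample and a hypothetical helper named strip_empty_tags:

```python
import re

# The empty-tag expression from above, unchanged.
EMPTY_TAG = re.compile(r'<(\w+)\b[^<>]*>\s*</\1>')

def strip_empty_tags(html):
    """Repeat the replace until nothing changes, so nested empties go too."""
    while True:
        stripped = EMPTY_TAG.sub('', html)
        if stripped == html:
            return stripped
        html = stripped

# Hypothetical sample: nested empty spans, plus paragraphs holding &nbsp;.
sample = ('<p><span class="c1"><span class="c2"> </span></span></p>'
          '<p>Text&nbsp;</p><p>&nbsp;</p>')

# The &nbsp; entity is not whitespace to the regex, so those paragraphs stay.
print(strip_empty_tags(sample))  # <p>Text&nbsp;</p><p>&nbsp;</p>
```

In Sigil the equivalent is simply hitting Replace All again until the count comes back as zero.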

Added bonus:
The next one is better than the simple anchor example above. When you load an epub into Sigil, all of the text ends up in the Text directory, and links that refer to those files use relative paths, like href="../Text/Blah.xhtml". This expression looks for anything which does not start with the .. (one level up), so it also catches references to external content (sites and such, hello watermarks). It will find any tags, as well as stuff inside them, so be careful and grep first.
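A rough stand-in for that idea in Python, not the original expression: it assumes a negative lookahead on the href value and only looks at opening a tags, so it is narrower than what is described above. The name NOT_RELATIVE and the sample links are hypothetical.

```python
import re

# Hypothetical pattern: an opening <a> tag whose href does NOT begin with "../".
NOT_RELATIVE = re.compile(r'<a\b[^<>]*?\bhref="(?!\.\./)[^"<>]*"[^<>]*>')

# Made-up sample: one normal internal link, one dead local path, one website.
sample = (
    '<a href="../Text/ch2.xhtml">Next</a>'
    '<a href="c:/DOCUME~1/x/top.htm">Top</a>'
    '<a href="http://example.com/">site</a>'
)

# "Grep first": list what would be hit before replacing anything.
hits = [m.group(0) for m in NOT_RELATIVE.finditer(sample)]
for tag in hits:
    print(tag)
```

Same-file links such as href="#note1" do not start with .. either, so a pattern like this would flag them as well. That is another reason to review the hits before running a replace.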

I'm sure it was operator error (on my part). I edited it out manually and I'm on to the next book. Here it is again, essentially the same thing but as you can see it's not EXACTLY the same. It exists for the same reason, it is trying to do the same things, it appears in the same place in the book, but the code is slightly different. This is a good example of what I'm up against. Each book has this stuff at the beginning of every chapter but it's never exactly the same code in each book nor is it exactly the same line from chapter to chapter.

Thank you. I'm getting no returns on any of those searches and I'm starting to think that it's got to be something I'm doing or missing. Obviously I need to spend more time brushing up on regex and getting to know the software better before I trouble folks.

Frustration, your middle name is Suz.

Thanks for your trouble, folks. I have bookmarked this link and will continue to refer back to it in the hopes that when I figure out what I'm doing wrong I will be able to make good use of these nuggets you've given me.

@Serpentine - there is a little secret trick in the 0.5.905 beta that not many will know about. If you Ctrl+click on the Find/Replace/Replace All/Count buttons then it will force the scope to be the current file, without changing the dropdown. So I permanently leave my dropdown set to all files, and then on the rare occasion I want to reduce the scope (like replacing within a stylesheet) I just Ctrl+click on the buttons. That way I don't accidentally forget to restore the scope dropdown afterwards...

Haha, just make sure your search option is set to Regex and not Normal, and that you're searching the current file/s.

I've switched and forgotten a number of times.
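Why Normal mode comes back empty can be shown with a small Python sketch: a plain search looks for the pattern's characters literally, and they never appear in the book. The sample line is made up; the expression is the empty-tag one from earlier in the thread.

```python
import re

pattern = r'<(\w+)\b[^<>]*>\s*</\1>'
text = '<p><span class="c3"></span>Chapter One</p>'  # hypothetical sample

# "Normal" mode: is the pattern text itself somewhere in the file?
literal_hit = pattern in text

# "Regex" mode: does the pattern, interpreted as a regex, match anything?
regex_hit = re.search(pattern, text) is not None

print(literal_hit, regex_hit)  # False True
```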

I cannot believe that I didn't know how to switch modes in the search. You guys kept saying to make sure, and I kept rummaging through the menus and editors thinking "how the heck do I know if I'm in regex or not? Isn't it all dependent on the search string?"

kiwidude posted about that really kewl ctrl+click feature and I had a lightbulb moment about the drop downs actually in the find/replace box. There it is, on the left where it has always said "normal."

*sigh* I've manually edited all the instances out of the piece I'm working on right now so I can't try it right away, but I have several set aside and marked "format" so I'm sure I'll be able to get to it this evening and try out your wonderful suggestions.