John M. Dlugosz has asked for the
wisdom of the Perl Monks concerning the following question:

Many years ago, I wrote a small Perl program that scans a hand-written HTML file and generates/updates a table of contents at the top of the file, with links to the various H\d tags.

That was fairly crude: it was line-oriented and required the header tags and matching names to be formatted just so. But it did recognise the output it had generated before and replaced it with a refreshed copy.

I'd like something modern that does this. A proper HTML parser would take any HTML without relying on special formatting conventions or restrictions. The generated table of contents can have fancy dynamic-expanding/collapsing features.

I would need two passes: the first to identify any headers and possibly modify them to add an id, and the second to insert the generated TOC in the correct position (removing the old one).
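To make the second pass concrete, here is a minimal sketch using only core Perl. It assumes the generated block is fenced by a pair of marker comments so that a later run can find and replace it. The marker text, the `build_toc` and `insert_toc` names, and the `[level, id, title]` record layout are all invented for illustration, and the titles are assumed to be HTML-safe already.

```perl
use strict;
use warnings;

# Hypothetical marker comments that fence the generated block.
my $TOC_BEGIN = '<!-- TOC begin -->';
my $TOC_END   = '<!-- TOC end -->';

# Build a flat list of links. Each heading is [level, id, title];
# the title is assumed to be HTML-escaped already.
sub build_toc {
    my @headings = @_;
    my @items = map {
        my ($level, $id, $title) = @$_;
        sprintf '  <li class="toc-h%d"><a href="#%s">%s</a></li>',
            $level, $id, $title;
    } @headings;
    return join "\n", $TOC_BEGIN, '<ul>', @items, '</ul>', $TOC_END;
}

# Replace an existing fenced block; if there is none, insert the TOC
# right after the opening <body> tag. Nothing else in the file is touched.
sub insert_toc {
    my ($html, $toc) = @_;
    return $html if $html =~ s/\Q$TOC_BEGIN\E.*?\Q$TOC_END\E/$toc/s;
    $html =~ s/(<body\b[^>]*>)/$1\n$toc/i;
    return $html;
}
```

Running `insert_toc` a second time on its own output should replace the fenced block rather than add another one, which is the "refresh" behaviour the old script had.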

HTML::Parser is pretty primitive, but it looks like enough to spot the header tags. But what about modifying the HTML? It needs to print out everything it reads, with the same formatting.
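One possible approach, sketched here rather than tested against real pages: HTML::Parser's version 3 API can hand each handler the original source text of an event through the `text` argspec, so a filter can copy everything through unchanged and rewrite only the header start tags that lack an id. The `add_heading_ids` name and the `toc-N` id scheme are made up for this example, and the sketch does not check generated ids against ids already in the file.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Returns the rewritten HTML and a list of [level, id, title] records.
# Titles are collected as raw source text, so entities stay escaped.
sub add_heading_ids {
    my ($html) = @_;
    my $out = '';
    my (@headings, $current);
    my $n = 0;

    my $p = HTML::Parser->new(
        api_version => 3,
        # Comments, declarations, etc. are copied through verbatim.
        default_h => [ sub {
            my ($text) = @_;
            $out .= $text if defined $text;
        }, 'text' ],
        start_h => [ sub {
            my ($tag, $attr, $text) = @_;
            if (my ($level) = $tag =~ /^h([1-6])$/) {
                my $id = $attr->{id};
                unless (defined $id) {
                    $id = 'toc-' . ++$n;            # hypothetical id scheme
                    $text =~ s/\s*>\z/ id="$id">/;  # edit only this one tag
                }
                $current = [ $level, $id, '' ];
            }
            $out .= $text;
        }, 'tagname, attr, text' ],
        text_h => [ sub {
            my ($text) = @_;
            $current->[2] .= $text if $current;     # gather the heading title
            $out .= $text;
        }, 'text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            if ($current && $tag =~ /^h[1-6]$/) {
                push @headings, $current;
                undef $current;
            }
            $out .= $text;
        }, 'tagname, text' ],
    );
    $p->parse($html);
    $p->eof;
    return ($out, \@headings);
}
```

Because every handler appends the source text it was given, whitespace, attribute quoting and tag case outside the edited start tags should come out as they went in.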

I also looked at HTML::TreeBuilder, but it can't output the same format that it read; it produces its own re-generation of the text.

HTML::Toc is probably what you need, but I'll throw in MS Word as an idea. You can read an HTML file into Word and have it generate a TOC with the "Insert/Index and Tables..." menu option. It will create a TOC with a hierarchy that follows the header tags. But afterwards Word will totally mess up the HTML, so it is only useful for a quick one-time job or for getting a quick look-and-feel.