Ok, this was like a horror movie where you think 'now it can't get worse' and then Boom

The Boom thing here was that little line about 'So he did the next best thing- he wrote a “translation” module that would, using regular expressions, convert the new-style XML files back into the old-style XML files.' The horror

And I even work with Perl daily and by and large like it (yes I know, just look for my horns and all :-)

But even I know enough that you should never parse XML with regexes!
(Obligatory stackoverflow link: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)
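
For anyone who wants to see it go wrong, here is a minimal Python sketch. The sample document and the names `doc`, `naive` and `parsed` are made up for illustration. It only shows two of the ways a quick-and-dirty regex gets fooled (a `>` inside an attribute value, and a commented-out element), next to the standard library's XML parser reading the same text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: well-formed XML with a '>' inside an attribute
# value and a commented-out <item>.
doc = ('<root><item note="a > b">one</item>'
       '<!-- <item>ghost</item> -->'
       '<item>two</item></root>')

# The "obvious" regex: find each <item ...>, capture up to the next </item>.
naive = re.findall(r'<item[^>]*>(.*?)</item>', doc)

# A real parser knows about attribute quoting and comments.
parsed = [elem.text for elem in ET.fromstring(doc).iter('item')]

print(naive)   # [' b">one', 'ghost', 'two']
print(parsed)  # ['one', 'two']
```

The regex splits the first tag in the middle of its attribute and happily resurrects the commented-out item. You can keep patching the pattern, but every patch is another piece of an XML parser written badly.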

It's been a while since I looked at it, but I believe if you only use a subset of HTML and only use a well-formed version of that subset, you can create an HTML parser with regex. The issue is that HTML was standardized with tags that can nest out of order or in ways that can trip up a regex. That's not a bad thing, mind. Since HTML is a markup language, it works well for what it does.

If you write HTML that follows strict self-imposed rules created with regex parsing in mind, you can use a regex to parse it.

I also thought I read that XML could be parsed using regex, but the better question for either XML or HTML is: Why on earth would you want that?

I've actually worked with someone who did that for a data import process. Instead of tweaking our import processes to match each format we were given, the data was imported as-is, then the table structures were massaged into one specific format so that everything could go through a single process. That kind of ignored the fact that some of the source systems could hold more information than that one format, so the extra information was just lost. I was a bit sad when I saw how all of that data conversion work was being handled, but ... it was being handled, and it was completely outside of my domain. It was the whole "when all you (think you) have is a hammer, everything's a nail" mindset.

When I got to the phrase "...it was a bit of a crapshoot...", my brain read that as "crashpoot". Which I have decided I really like and will have to start using in the future for WTF code that randomly flakes out.

oh, you made me think of a story about Regina Vacuum cleaners: they told their engineers to ignore the fact that Plastic melts at high temperatures, thus producing plastic motors that wore out VERY quickly, THEN tried to lie about their sales figures by NOT counting the numerous returns on their quarterly reports...when they got found out, that was pretty much IT for them!

Did Steven change the name of the system as well? How many people would want to keep working on a system with that name?
How about "Pearl Utility for Systematic Survey and Investigation - Enhanced Synergy"?

It's been a while since I looked at it, but I believe if you only use a subset of HTML and only use a well-formed version of that subset, you can create an HTML parser with regex

So... if you yourself create the HTML and only Netscape 2 HTML at that, in other words, if you have no reason to parse your HTML using regexes since you already know exactly what went into it in the first place, then and only then you can parse HTML using regexes.
Meanwhile, I can write haiku in Finnish, provided someone writes one Finnish haiku for me and I stick to copying that haiku.

You sort of can, and yet you can't, represent everything in regex. It depends on what you mean by regex. Regex as it's defined theoretically doesn't have any memory other than its current state: it looks at the current byte and moves on to the next state. It's common for people to get this wrong, either because they are using a regex flavor with advanced features, or because they are using multiple regexes and doing the part regex can't do in their language of choice.

You can, however, make a regex representing all permutations, which will match a lot of things you can't match with the normal thinking that comes with regex, as long as your domain isn't "infinite". Regex can represent some infinite sets of sequences finitely, but not all of them. For a finite domain, such as HTML nested to a depth of 10, you can represent all possible sequences in regex as literals. If you don't want to represent them literally, you need something more powerful.
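
To make the bounded-depth point concrete, here is a rough Python sketch. The function name `nested_pattern` and the single `<b>` tag are my own simplifications. It builds the pattern by pasting the previous level into the next one, so the result is still a plain regular expression with no recursion or backreferences; it just gets longer with every level you allow.

```python
import re

def nested_pattern(depth: int) -> str:
    """Regex source matching text with <b>...</b> nested at most `depth` deep."""
    inner = r"[^<>]*"  # depth 0: plain text only, no tags at all
    for _ in range(depth):
        # One more level: text, then any number of <b>previous level</b>
        # groups, each followed by more text.
        inner = rf"[^<>]*(?:<b>{inner}</b>[^<>]*)*"
    return inner

pattern = re.compile(nested_pattern(3))

print(bool(pattern.fullmatch("<b>x<b>y</b></b>")))                      # True
print(bool(pattern.fullmatch("<b><b><b><b>deep</b></b></b></b>")))      # False
print(bool(pattern.fullmatch("<b>never closed")))                       # False
```

Depth 4 fails because the pattern was only built out to depth 3. To handle "any depth" you would need the loop to run forever, which is exactly the part a theoretical regex can't do.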

I once saw someone use regex in such a way using webscale architecture. Essentially generating nearly all valid permutations, which took over a petabyte, then storing it on the cloud. You see, that's how the cloud works. It doesn't matter how inefficient your solution is anymore. If you need a brazillian gigabytes, the cloud's got your back. It brings a new meaning to overhead. I noticed the generator already had the logic needed to validate the text being processed. Five hours and a brand spanking new 100 lines of code later, company operating expenses dropped by several million a year. I was then fired for making things all automatic. The CTO insisted that my solution had destroyed the company because it was no longer possible to change something in only one sequence. I learnt that day just how important webscale is and how empty our lives would be without it. I moved on to be a chef.

So, what you are saying is that the important thing here is to pre-select a non-infinite domain, and then choose a flavor of "regexp" that extends the traditional Finite State Automaton by incorporating a theoretically unlimited look-ahead mechanism? Yes, that would work.

However, I beg leave to doubt that you have ever worked for a company that transfers petabytes of data to and from Teh Cloudz whenever such a regexp requires it. Best wishes in your new career as a chef!

There are some cases where it comes in useful to use such a subset. For example, imagine a program that finds prime numbers and prints them, one prime to a paragraph. Each time it detects a prime, it appends the new prime with p tags before and after to the already existing file and then tacks on the closing body and html tags in a footer. You know what goes in the file, and you're using a well formed subset. Other people on the internet can download your file and easily parse it with regular expressions.
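
A quick Python sketch of that prime-number page, with every name in it (`primes_up_to`, `render`, `parse`) invented for the example. It assumes the page contains nothing but digits inside bare `<p>` tags, which is the whole reason the regex is safe here.

```python
import math
import re

def primes_up_to(limit: int) -> list[int]:
    """Trial division; fine for a toy example."""
    return [n for n in range(2, limit + 1)
            if all(n % d for d in range(2, math.isqrt(n) + 1))]

def render(primes: list[int]) -> str:
    """Write one prime per paragraph, then the closing footer."""
    body = "".join(f"<p>{p}</p>\n" for p in primes)
    return f"<html><body>\n{body}</body></html>\n"

def parse(page: str) -> list[int]:
    """Works only because we know exactly what render() emits."""
    return [int(m) for m in re.findall(r"<p>(\d+)</p>", page)]

page = render(primes_up_to(30))
print(parse(page))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The moment someone adds an attribute, a nested tag or a comment to the page, `parse` silently starts dropping or mangling entries.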

It still doesn't explain WHY you'd want that in HTML as opposed to CSV or something else more useful, but the name of the game is Technically Possible.

Now if you'll excuse me, I have some soup waiting for me in a flour strainer.

And the Programmer spake unto his Computer, Three times shalt thou loop, and three times correspondingly shalt thou test thine exit condition.
And on the First loop, He created the Factory and the XML Object
And on the Second loop, He Tested the Return from that Factory, and saw that verily, it was very True.
And on the Third loop, He Rested, and entered a Sleep, and threescore thrice times ten Milliseconds slept He.

And on the Fourth loop, exited He the Loop, and looked upon his works, and saw that indeed, it was Object Oriented.

Worked with a contractor who wrote code like this. It's SOLID, you see, as in everything becomes a separate class and everything is abstracted to the point that it is meaningless. He basically admitted to me that the reason for doing so was to keep himself in a job. Something that could be done in a few lines would suddenly become 20 classes and 10 design patterns of abstraction away, such that you couldn't understand what was going on. The worst part was trying to help him fix bugs, as no simple fix was ever enough: every simple fix involved creating some other class and using some other obscure design pattern that didn't really fit. The biggest problem was that he was badly mismanaged and the project's requirements were constantly in flux, meaning he had plenty of excuses to keep writing code in this manner. To quote Dijkstra:

"The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise."

Perl has recursive regexes which can e.g. ensure XML tags match, and .NET has balancing groups which can be leveraged to a similar end, and there are probably other extensions around that give you enhanced capabilities beyond the usual regex.
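
Python's built-in `re` module has neither recursion nor balancing groups, so the equivalent there is to let the regex do only the tokenizing and keep the "memory" in an explicit stack. A minimal sketch, with the names `TAG` and `tags_balanced` made up for the example; it deliberately ignores comments, CDATA and `>` inside attribute values, so it checks tag matching and nothing else.

```python
import re

# Groups: optional leading '/', tag name, optional trailing '/' (self-closing).
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")

def tags_balanced(text: str) -> bool:
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for closing, name, self_closing in TAG.findall(text):
        if self_closing:
            continue                      # <br/> opens and closes itself
        if not closing:
            stack.append(name)            # remember what must close next
        elif not stack or stack.pop() != name:
            return False                  # stray or mismatched closing tag
    return not stack                      # anything left over was never closed

print(tags_balanced("<a><b>x</b><br/></a>"))  # True
print(tags_balanced("<a><b>x</a></b>"))       # False
```

The stack is the piece a recursive regex or a balancing group is smuggling in for you.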