This post has been locked due to the high amount of off-topic comments generated. For extended discussions, please use chat.

26

The funniest thing is that the question was about “matching”, while most of the answers are about “parsing” :}. Hmmmm…
–
thebodzioMar 12 '12 at 2:15

4

@thebodzio: why is that funny? Often you have to analyze (parse) a string in order to find out whether it matches a given pattern.
–
LarsHMay 22 '12 at 10:08

1

@LarsH: “Funniness” is in the fact that question was about matching tag in XML and the answers are about parsing_XML. You're right about “parsing string to get a match”, but I think you'll agree, there's a serious difference between parsing XML and parsing string.
–
thebodzioMay 24 '12 at 23:57

36 Answers
36

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

Have you tried using an XML parser instead?

Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.

Kobi: I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death.
–
bobinceNov 13 '09 at 23:18

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site. This was a limited, one-time job.

Also, scraping fairly regularly formatted data from large documents is going to be WAY faster with judicious use of scan & regex than any generic parser. And if you are comfortable with coding regexes, way faster to code than coding xpaths. And almost certainly less fragile to changes in what you are scraping. So bleh.
–
Michael JohnstonApr 17 '12 at 20:47

85

@MichaelJohnston "Less fragile"? Almost certainly not. Regexes care about text-formatting details than an XML parser can silently ignore. Switching between &foo; encodings and CDATA sections? Using an HTML minifier to remove all whitespace in your document that the browser doesn't render? An XML parser won't care, and neither will a well-written XPath statement. A regex-based "parser", on the other hand...
–
Charles DuffyJul 11 '12 at 16:03

15

@CharlesDuffy for an one time job it's ok, and for spaces we use \s+
–
quantumJul 12 '12 at 13:50

27

@xiaomao indeed, if having to know all the gotchas and workarounds to get an 80% solution that fails the rest of the time "works for you", I can't stop you. Meanwhile, I'm over on my side of the fence using parsers that work on 100% of syntactically valid XML.
–
Charles DuffyJul 12 '12 at 16:07

152

I once had to pull some data off ~10k pages, all with the same HTML template. They were littered with HTML errors that caused parsers to choke, and all their styling was inline or with <font> etc.: no classes or IDs to help navigate the DOM. After fighting all day with the "right" approach, I finally switched to a regex solution and had it working in an hour.
–
Paul A JungwirthSep 7 '12 at 7:14

The OP is asking to parse a very limited subset of XHTML: start tags. What makes (X)HTML a CFG is its potential to have elements between the start and end tags of other elements (as in a grammar rule A -> s A e). (X)HTML does not have this property within a start tag: a start tag cannot contain other start tags. The subset that the OP is trying to parse is not a CFG.
–
LarsHMar 2 '12 at 8:43

54

In CS theory, regular languages are a strict subset of context-free languages, but regular expression implementations in mainstream programming languages are more powerful. As noulakaz.net/weblog/2007/03/18/… describes, so-called "regular expressions" can check for prime numbers in unary, which is certainly something that a regular expression from CS theory can't accomplish.
–
Adam MihalcinMar 19 '12 at 23:50

4

@eyelidlessness: the same "only if" applies to all CFGs, does it not? I.e. if the (X)HTML input is not well-formed, not even a full-blown XML parser will work reliably. Maybe if you give examples of the "(X)HTML syntax errors implemented in real world user agents" you're referring to, I'll understand what you're getting at better.
–
LarsHMay 22 '12 at 5:09

39

@AdamMihalcin is exactly right. Most extant regex engines are more powerful than Chomsky Type 3 grammars (eg non-greedy matching, backrefs). Some regex engines (such as Perl's) are Turing complete. It's true that even those are poor tools for parsing HTML, but this oft-cited argument is not the reason why.
–
dubiousjimMay 31 '12 at 13:44

7

This is the most "full and short" answer here. It leads people to learn basics of formal grammars and languages and hopefully some maths so they will not wast time on hopeless things like solving NP-tasks in polynomial time
–
mishmashruApr 19 '13 at 12:15

so you do not actually solve the parsing problem with regexp only but as a part of the parser this may work. PS: working product doesn't mean good code. No offence, but this is how industrial programming works and gets their money
–
mishmashruApr 19 '13 at 12:18

Don't listen to these guys. You actually can parse context-free grammars with regex if you break the task into smaller pieces. Your pattern needs to do each of these in order:

Solve the Halting Problem.

Square a circle (use the "ruler and compass" method for this).

Work out the Traveling Salesman Problem in O(log n). It needs to be fast or your regex engine will hang.

The results will be pretty big, so make sure you have another algorithm that losslessly compresses random data.

Almost there - just divide the whole thing by zero. Easy-peasy.

I haven't figured out the last part yet, but it shouldn't be hard. My code keeps throwing CthulhuRlyehWgahnaglFhtagnExceptions lately, so I'm setting up an empty catch block to just consume those and keep parsing. I'll update with the code once I investigate this strange door that just opened in the wall. Hmm.

Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.

Divison by zero is a much easier problem than the others you mention. If you use intervals, rather than plain floating point arithmetic (which everyone should be but nobody is), you can happily divide something by [an interval containing] zero. The result is simply an interval containing plus and minus infinity.
–
rjmunroJun 14 '12 at 10:53

67

Fermat's small margin problem has been solved by soft margins in modern text-editing software.
–
kd4ttcMar 1 '13 at 20:24

7

Fermat's small margin problem has been solved by Randall Munroe by setting the fontsize to zero: xkcd.com/1381
–
heltonbikerOct 16 '14 at 19:55

There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid, if they want to use strange words). They are lying.

There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.

You can live in their reality or take the red pill.

Like the Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the Underverse Stack Based Regex-Verse and returned with powers knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.

I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this:

If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full blown parser, so it will only split the XML in its component tokens. It won't parse/integrate DTDs.

Oh... if you want the source code of the regex, with some auxiliary methods:

Good Lord, it's massive. My biggest question is why? You realize that all modern languages have XML parsers, right? You can do all that in like 3 lines and be sure it'll work. Furthermore, do you also realize that pure regex is provably unable to do certain things? Unless you've created a hybrid regex/imperative code parser, but it doesn't look like you have. Can you compress random data as well?
–
Justin MorganMar 8 '11 at 15:23

44

@Justin I don't need a reason. It could be done (and it wasn't illegal/immoral), so I have done it. There are no limitations to the mind except those we acknowledge (Napoleon Hill)... Modern languages can parse XML? Really? And I thought that THAT was illegal! :-)
–
xanatosMar 8 '11 at 15:31

31

Sir, I'm convinced. I'm going to use this code as part of the kernel for my perpetual-motion machine--can you believe those fools at the patent office keep rejecting my application? Well, I'll show them. I'll show them all!
–
Justin MorganMar 8 '11 at 17:55

17

@Justin So an Xml Parser is by definition bug free, while a Regex isn't? Because if an Xml Parser isn't bug free by definition there could be an xml that make it crash and we are back to step 0. Let say this: both the Xml Parser and this Regex try to be able to parse all the "legal" XML. They CAN parse some "illegal" XML. Bugs could crash both of them. C# XmlReader is surely more tested than this Regex.
–
xanatosMar 9 '11 at 15:08

@Kyle—jQuery does not parse XML, it uses the client's built–in parser (if there is one). Therefore you do not need jQuery to do it, but as little as two lines of plain old JavaScript. If there is no built–in parser, jQuery will not help.
–
RobGOct 31 '13 at 6:25

I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine. However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.

For this reason, I believe you CAN parse XML using regular expressions. Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.

Quote from article 1 cited above:

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by
a regular expression. However, the .NET regular expression engine
provides a few constructs that allow balanced constructs to be
recognized.

(?<group>) - pushes the captured result on the capture stack with
the name group.

(?<-group>) - pops the top most capture with the name group off the
capture stack.

(?(group)yes|no) - matches the yes part if there exists a group
with the name group otherwise matches no part.

These constructs allow for a .NET regular expression to emulate a
restricted PDA by essentially allowing simple versions of the stack
operations: push, pop and empty. The simple operations are pretty much
equivalent to increment, decrement and compare to zero respectively.
This allows for the .NET regular expression engine to recognize a
subset of the context-free languages, in particular the ones that only
require a simple counter. This in turn allows for the non-traditional
.NET regular expressions to recognize individual properly balanced
constructs.

System.Text is not part of C#. It's part of .NET.
–
John SaundersFeb 2 '12 at 19:07

4

In the first line of your regex ((?=<ul\s*id="matchMe"\s*type="square"\s*>) # match start with <ul id="matchMe"...), in between "<ul" and "id" should be \s+, not \s*, unless you want it to match <ulid=... ;)
–
C0deH4ckerJul 6 '12 at 2:49

1

Not that I really understand it, but I think your regex fails on <img src="images/pic.jpg" />
–
ScheintodSep 27 '13 at 17:05

1

@Scheintod Thank you for the comment. I updated the code. The previous expression failed for self closing tags that had a / somewhere inside which failed for your <img src="images/pic.jpg" /> html.
–
SamSep 27 '13 at 19:00

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

<([a-z]+) *[^/]*?>

If you add something to the regex, by backtracking it can be forced to match silly things like <a >>, [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be

<([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.

That is very true, and I did think about it, but I assumed the > symbol is properly escaped to &gt;.
–
KobiNov 13 '09 at 23:16

57

> is valid in an attribute value. Indeed, in the ‘canonical XML’ serialisation you must not use &gt;. (Which isn't entirely relevant, except to emphasise that > in an attribute value is not at all an unusual thing.)
–
bobinceNov 14 '09 at 0:15

4

@Kobi: what does the exlamation mark (the one you placed tpward the end) mean in a regexp?
–
Marco DemaioApr 30 '11 at 17:16

+100. I used to "meh" this mantra - "don't use Regexs to parse HTML" , a little regex hasnt killed anyone, right? Then I measured my performance and realised the HTMLAgilityPack approach is about 11 times faster. It "moves" through the string node-to-node, building a DOM object model. Which is a lot easier to operate by the way.
–
Serge ShultzJun 29 at 10:53

Basically just define the element node names that are self closing, load the whole html string into a DOM library, grab all elements, loop through and filter out ones which aren't self closing and operate on them.

I'm sure you already know by now that you shouldn't use regex for this purpose.

It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss.
If you only know yourself, but not your opponent, you may win or may lose.
If you know neither yourself nor your enemy, you will always endanger yourself.

In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.

I have composed a haiku describing the nature of HTML.

HTML has
complexity exceeding
regular language.

I have also composed a haiku describing the nature of regex in Perl.

The regex you seek
is defined within the phrase
<([a-zA-Z]+)(?:[^>]*[^/]*)?>

Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

I used a open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML as different treenode and you can easily use its API to get attributes out of the node. Check it out and see if this can help you.

It's called htmlsplit, splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep, sed, Perl, etc. I'm not even joking :) Enjoy.

It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.

Here is a PHP based parser that parses HTML using some ungodly regex. As the author of this project, I can tell you it is possible to parse HTML with regex, but not efficient. If you need a server-side solution (as I did for my wp-Typography WordPress plugin), this works.

attributes which value is bound either into single quotes or into double quotes

attributes containing single quotes when the delimiter is a double quote and vice versa

"unpretty" attributes with a space before the "=" symbol, after it and both before and after it.

Should you find something which does not work in the proof of concept above, I am available in analysing the code to improve my skills.

<EDIT>
I forgot that the question from the user was to avoid the parsing of self-closing tags.
In this case the pattern is simpler, turning into this:

$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*>/';

The user @ridgerunner noticed that the pattern does not allow unquoted attributes or attributes with no value. In this case a fine tuning brings us the following pattern:

$pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';

</EDIT>

Understanding the pattern

If someone is interested in learning more about the pattern, I provide some line:

the first sub-expression (\w+) matches the tagname

the second sub-expression contains the pattern of an attribute. It is composed by:

one or more whitespaces \s+

the name of the attribute (\w+)

zero or more whitespaces \s* (it is possible or not, leaving blanks here)

the "=" symbol

again, zero or more whitespaces

the delimiter of the attribute value, a single or double quote ('|"). In the pattern, the single quote is escaped because it coincides with the PHP string delimiter. This sub-expression is captured with the parentheses so it can be referenced again to parse the closure of the attribute, that's why it is very important.

the value of the attribute, matched by almost anything: (.*?); in this specific syntax, using the greedy match (the question mark after the asterisk) the RegExp engine enables a "look-ahead"-like operator, which matches anything but what follows this sub-expression

here comes the fun: the \4 part is a backreference operator, which refers to a sub-expression defined before in the pattern, in this case I am referring to the fourth sub-expression, which is the first attribute delimiter found

zero or more whitespaces \s*

the attribute sub-expression ends here, with the specification of zero or more possible occurrences, given by the asterisk.

Then, since a tag may end with a whitespace before the ">" symbol, zero or more whitespaces are matched with the \s* subpattern.

The tag to match may end with a simple ">" symbol, or a possible XHTML closure, which makes use of the slash before it: (/>|>). The slash is of course escaped, since it coincides with the regular expression delimiter.

Small tip: to better analyse this code it is necessary looking at the source code generated, since I did not provide any HTML special characters escaping.

Does not match valid tags having attributes with no value, i.e. <option selected>. Also does not match valid tags with unquoted attribute values, i.e. <p id=10>.
–
ridgerunnerJul 25 '11 at 15:01

1

@ridgerunner: Thanks very much for your comment. In that case the pattern must change a bit: $pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/'; I tested it and works in case of non-quoted attributes or attributes with no value.
–
Emanuele Del GrandeJul 25 '11 at 16:41

3

NO sorry, whitespaces before a tagname are illegal. Beyond being "pretty sure" why don't you provide some evidences of your objection? Here are mine, w3.org/TR/xml11/#sec-starttags referred to XML 1.1, and you can find the same for HTML 4, 5 and XHTML, as a W3C validation would also warn if you make a test. As a lot of other blah-blah-poets around here, I did not still receive any intelligent argumentation, apart some hundred of minus to my answers, to demonstrate where my code fails according to the rules of contract specified in the question. I would only welcome them.
–
Emanuele Del GrandeOct 6 '13 at 18:03

As many people have already pointed out, HTML is not a regular language which can make it very difficult to parse. My solution to this is to turn it into a regular language using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written using Java with the jtidy library to turn the HTML into XML and then Jaxen to xpath into the result.

About the question of the RegExp methods to parse (x)HTML, the answer to all of the ones who spoke about some limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion.

A RegExp-agnostic colleague notified me this discussion, which is not certainly the first on the web about this old and hot topic.

After reading some posts, the first thing I did was looking for the "?R" string in this thread. The second was to search about "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built onto, I was soon aware that nobody got the point.

If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.

The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Just try it.
It's written as a PHP string, so the "s" modifier makes classes include newlines.
Here's a sample note on the PHP manual I wrote on January: Reference

(Take care, in that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the RegExp engine, since no ^ or $ anchorage was used).

Now, we could speak about the limits of this method from a more informed point of view:

according to the specific implementation of the RegExp engine, recursion may have a limit in the number of nested patterns parsed, but it depends on the language used

although corrupted (x)HTML does not drive into severe errors, it is not sanitized.

Anyhow it is only a RegExp pattern, but it discloses the possibility to develop of a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performances are really great, both in execution times or in memory usage (nothing to do with other template engines which use the same syntax).

If you put something like that in production code, you would likely be shot by the maintainer. A jury would never convict him.
–
aehiilrsJul 5 '10 at 16:33

27

Regular expressions can't work because by definition they are not recursive. Adding a recursive operator to regular expressions basically makes a CFG only with poorer syntax. Why not use something designed to be recursive in the first place rather than violently insert recursion into something already overflowing with extraneous functionality?
–
WelbogJul 6 '10 at 18:38

12

My objection isn't one of functionality it is one of time invested. The problem with RegEx is that by the time you post the cutsey little one liners it appears that you did something more efficiently ("See one line of code!"). And of course no one mentions the half hour (or 3) that they spent with their cheat-sheet and (hopefully) testing every possible permutation of input. And once you get past all that when the maintainer goes to figure out or validate the code they can't just look at it and see that it is right. The have to dissect the expression and essentially retest it all over again...
–
OorangJul 10 '10 at 15:11

11

... to know that it is good. And that will happen even with people who are good with regex. And honestly I suspect that overwhelming majority of people won't know it well. So you take one of the most notorious maintenance nightmares and combine it with recursion which is the other maintenance nightmare and I think to myself what I really need on my project is someone a little less clever. The goal is to write code that bad programmers can maintain without breaking the code base. I know it galls to code to the least common denominator. But hiring excellent talent is hard, and you often...
–
OorangJul 10 '10 at 15:17

There are some nice regexes for replacing HTML with BBCode here. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

Although it's not suitable and effective to use regular expressions for that purpose sometimes regular expressions provide quick solutions for simple match problems and in my view it's not that horrbile to use regular expressions for trivial works.

I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and its folly (readability, maintainability, etc.), but if you reduce the scope of its applications it may fit your needs. Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes.

For your convenience I have set up a playground so you can test if the code matches your requirements: playground and Java code. Your feedback will be appreciated.