html - RegEx match open tags except XHTML self-contained tags - Stack Overflow
RegEx match open tags except XHTML self-contained tags
I need to match all of these opening tags:
<p>
<a>
But not these:
<br />
<hr />
I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.
<([a-z]+) *[^/]*?>
I believe it says:
Find a less-than, then
Find (and capture) a-z one or more times, then
Find zero or more spaces, then
Find any character zero or more times, greedy, except /, then
Find a greater-than
Do I have that right? And more importantly, what do you think?
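One way to check the reading above is to run the pattern against a few inputs. The sketch below uses Python's `re` module and the question's pattern unchanged. The sample strings beyond the four tags in the question, and the helper name `matched_names`, are assumptions added for illustration. Note that in Python (as in Perl-style flavors generally) `*?` is a lazy quantifier rather than a greedy one, and that excluding `/` everywhere also rejects opening tags whose attributes contain a slash.

```python
import re

# The pattern from the question, unchanged.
OPEN_TAG = re.compile(r"<([a-z]+) *[^/]*?>")


def matched_names(text):
    """Return the captured tag names for every match in text."""
    return OPEN_TAG.findall(text)


# Tags taken from the question.
print(matched_names("<p>"))      # ['p']
print(matched_names("<a>"))      # ['a']
print(matched_names("<br />"))   # []
print(matched_names("<hr />"))   # []

# Hypothetical extra case: an opening tag with a slash in an attribute value.
# The [^/] class cannot step over the slash, so the tag is missed.
print(matched_names('<a href="/index.html">'))  # []
```

The last case is a plain opening tag that the pattern fails to match, which is the kind of edge case the answers below are concerned with.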
html
regex
xhtml
edited
May 26 '12 at 20:37
community wiki
11 revs, 7 users 58%
Jeff
locked by Robert Harvey ♦ Jun 7 '12 at 19:41
This post has been locked due to the high amount of off-topic comments generated. For extended discussions, please use chat.
35 Answers
accepted
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow
it is too late it is too late we cannot be saved
the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied)
dear lord help us how can anyone survive this scourge
using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes
using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes
do not fi
​ght h
e com̡e̶s, ̕h̵i
​s un̨ho͞ly radiańcé de
stro҉ying all enli̍̈́̂̈́ghtenment, HTML tags
lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq
​uid p
ain, the song of re̸gular exp​re
ssion parsing
will exti
​nguish the voices of mor​
tal man from the sp
​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​
he f
inal snuf
fing o
f the lie​
s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T A
LL I​S L
OST th
e pon̷y he come
s he c̶̮om
es he co
me
s t
he
ich​
or permeat
es al
l MY FAC
E MY FACE ᵒh god n
o NO NOO̼
O​O N
Θ stop t
he an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨ
e̠̅s
͎a̧͈͖r̽̾̈́͒͑e
n
​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ T
O͇̹̺ͅƝ̴ȳ̳ TH̘
Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝
S̨̥̫͎̭ͯ̿̔̀ͅ
Have you tried using an XML parser instead?
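
As a rough sketch of what using a parser could look like for the question's task, the example below relies on Python's standard-library `html.parser` module. This is an assumption for illustration: the answer does not name a language or library, and `html.parser` is an HTML tokenizer rather than a strict XML parser. The class name `OpenTagCollector`, the helper `open_tags`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class OpenTagCollector(HTMLParser):
    """Collect names of opening tags, skipping XHTML-style self-closed tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for ordinary opening tags such as <p> or <a href="...">.
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for self-closed tags such as <br />; ignore them.
        pass


def open_tags(markup):
    collector = OpenTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.open_tags


sample = '<p>Hello<br />world <a href="/x/y">link</a></p><hr />'
print(open_tags(sample))  # ['p', 'a']
```

Here the slash inside the `href` value causes no trouble, because the parser tracks attribute boundaries instead of scanning for characters. A void tag written without the trailing slash (for example `<br>`) would still be reported as an opening tag by this sketch.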

Thanks, I am familiar with sed. As for the command options: sed -ne's/<([^>"]*("[^"]*")*)*)*>//g;/./p' <infile >outfile ... I am assuming that the first three occurrences of the symbols < and > in the options refer to the symbols I am trying to extract text between? A little more explanation of what each part does would be very helpful. I ran into this problem on other posts: I wanted to do some substitution but couldn't figure out what to substitute. Big thanks again.
– speld_rwong Dec 26 '15 at 16:06

WORKED - I created a bash script file and put in the code you posted above with the "proof of concept" (most appreciated) and substituted the url input for: sed ... test.txt > extracted_text.txt and it worked great. Yes, I am dealing with a lot of javascript, but it cleaned everything up. I'll remove the highly repeated terms and then remove the duplicates, and then I'm done. Elegant solution and it did exactly what I needed. Thanks for the help. I use sed and awk to crunch through combined htmls with 100k lines sometimes.
– speld_rwong Dec 26 '15 at 17:15

@speld_rwong - if it works, accept it. click the checkmark. as for how it works... please don't do that to me? it's... really long. i have a lot of sed answers on this site with much more basic (and far better explained) examples. can you start by looking through them and see if it helps?
– mikeserv Dec 26 '15 at 18:01

I like the choice of URL. +1 for that alone :)
– Sobrique Dec 28 '15 at 14:38

I upvoted you, but @speld_rwong might need to replace .* with [^<]* if his files might have several books per line. All depends on the manner in which the HTML is generated.
– Law29 Dec 26 '15 at 16:15
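
The difference between `.*` and `[^<]*` mentioned in this comment can be seen in a small sketch. It is written in Python rather than sed, and the sample line and the `book` tag name are assumptions made up for illustration, since the actual input file is not shown in the thread.

```python
import re

# Hypothetical line holding two items; the real input is not shown in the thread.
line = "<book>Dune</book><book>Emma</book>"

# Greedy .* runs to the last closing tag on the line and swallows the markup in between.
greedy = re.findall(r"<book>(.*)</book>", line)

# [^<]* stops at the next "<", so each item is captured separately.
bounded = re.findall(r"<book>([^<]*)</book>", line)

print(greedy)   # ['Dune</book><book>Emma']
print(bounded)  # ['Dune', 'Emma']
```

With one item per line both patterns give the same result, which is why the choice depends on how the HTML was generated.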

@Law29 - in case you missed it, there is an answer here that already does that.
– mikeserv Dec 26 '15 at 18:36

@mikeserv no, I don't see an answer that selects the text following book title> which I think is quite useful and corresponds quite well to what the OP requested. I would not have missed it.
– Law29 Dec 27 '15 at 0:09

@Law29 - that's not what is asked at all. book title is used in the for example phrase immediately followed by I want to extract the text between > and <. The book title is at best circumstantial, and, at the worst, it is generic filler. what is specifically requested is anything occurring between > and < which can only be assured by removing all between < and > and the [<>] chars themselves.
– mikeserv Dec 27 '15 at 0:12
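
The approach described in this comment, deleting everything between `<` and `>` together with the brackets and keeping what is left, might be sketched in Python as follows. This is a simplified illustration, not a translation of the sed command quoted earlier in the thread: it does not handle `>` inside quoted attribute values, it trims whitespace where the sed expression only drops empty lines, and the function name `strip_tags` and the sample text are assumptions.

```python
import re


def strip_tags(text):
    """Remove every <...> run, then keep only lines that still contain text."""
    without_tags = re.sub(r"<[^>]*>", "", text)
    return [ln.strip() for ln in without_tags.splitlines() if ln.strip()]


# Hypothetical input; the thread does not show the original file.
sample = "<html>\n<h1>Moby Dick</h1>\n<p>Call me <b>Ishmael</b>.</p>\n</html>"
print(strip_tags(sample))  # ['Moby Dick', 'Call me Ishmael.']
```

Because the substitution runs over the whole text and `[^>]` also matches newlines, a tag that spans two lines is removed as well. A line-by-line tool needs extra care for that case.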

@mikeserv - So if book title> is only an example of a tag for which speld_rwong is looking, where is the difficulty in replacing "book title" with the name of any other tag for which he is looking? The ONLY way for anyone to provide a complete and correct answer would be if speld_rwong provided us the exact file from which he is working and the exact data he is seeking.
– rcjohnson Dec 27 '15 at 1:21

Using regexes to extract information from HTML is rarely a good idea, especially if syntactic elements might span lines in your file.

If you only want to do this once, I'd open up the file in your favorite text editor and use search-and-replace macros to pare things down. I did this just today, in fact :) but it took me a relatively long time.

This is only tangentially related to what the question called for. Also, Awk reads files, so the cat is redundant, and you can just set the field separator to both < and >: -F'[<>]' '{print $2}'...
– jasonwryan Dec 28 '15 at 5:26
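
To make the field-separator idea concrete, the sketch below mimics splitting a line on either `<` or `>` in Python. The sample line and the helper name `fields` are assumptions for illustration. Awk numbers fields from 1, so `$2` corresponds to index 1 of the Python list.

```python
import re


def fields(line):
    """Split on either '<' or '>', similar to awk with -F'[<>]'."""
    return re.split(r"[<>]", line)


# Hypothetical input line; the thread does not show the original file.
line = "<title>Moby Dick</title>"
parts = fields(line)

print(parts)     # ['', 'title', 'Moby Dick', '/title', '']
print(parts[1])  # awk's $2 -> 'title'
print(parts[2])  # awk's $3 -> 'Moby Dick'
```

Which field holds the wanted text depends on whether the line begins with a tag or with text, so the field number has to be matched to the actual input.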