If you want us to make the effort to solve your problem, then you should probably make the effort to format it so that it is as easy to read as possible.
–
Dave Cross Nov 26 '10 at 9:55

1

...or, better, use [^>]* instead of .* or .*? because you're looking for any number of non-> characters, not for any number of any character at all. The negated character class version is often faster to process than a non-greedy "match anything", plus it more clearly expresses your intent to human readers.
–
Dave Sherohman Nov 26 '10 at 11:36
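A minimal sketch of the difference that comment describes, using a made-up sample string (the `$html` value and the variable names are illustrative assumptions, not taken from the question):

```perl
use strict;
use warnings;

# Hypothetical sample: two tags on one line.
my $html = '<a href="one.html">first</a> <b>bold</b>';

# Greedy .* runs to the LAST '>' in the string, swallowing everything between.
my ($greedy) = $html =~ /<(.*)>/;

# [^>]* cannot cross a '>', so it stops at the end of the first tag.
my ($negated) = $html =~ /<([^>]*)>/;

print "greedy:  $greedy\n";
print "negated: $negated\n";
```

Here the greedy version captures almost the whole string, while the negated character class captures only the contents of the first tag.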

3

@Chris, No, that answer is NOT enough to answer all such questions. It is a joke; it does not explain things. If you really believe that is the right approach to answering people, go ahead and make it a real answer so we can vote on it. Otherwise it’s just mindless parroting, and it doesn’t help the user one bit.
–
tchrist Nov 26 '10 at 15:09

2

@tchrist - Disagree. I obviously don't believe it should be an answer, since I didn't post it as one, and as a comment it is perhaps more humorous to others, but it is essentially the same thing the OP is doing, and the answer does say "use an HTML parser", which is the correct way to handle the situation (albeit after a lot of perhaps unnecessary setup). Comments are different from answers. That's why the comment I made has 6 upvotes and the answer someone else posted has only one (and two downvotes).
–
Chris Lutz Nov 26 '10 at 20:23

1

@DVK: I’m mortally uncomfortable with the hypocrisy of giving “do as I say, not as I do” advice. People here nearly always mischaracterize both problem and solution, which niggles me. They use “parse” far too freely, meaning no more than suss out or munge. Querents never explain the true constraints, and folks giving answers nearly never do, either. Rote repetition of soundbytes w/o deeper understanding oversimplifies into lies and leads to mindless cargo-cult programming. Regexes are great for lexing x/ʜᴛᴍʟ, plus work well on well-constrained fragments. Parsing is a different matter.
–
tchrist Feb 17 '11 at 17:37

4 Answers
4

SUMMARY

Using patterns on little, limited pieces of reasonably well-defined HTML is quick and easy. But using them on an entire document containing fully general, open-ended HTML of unforeseeable quirks is, while theoretically possible, in practice much too hard compared with using someone else’s parser that’s already been written for that express purpose. See also this answer for a more general discussion on using patterns on XML or HTML.

If you are quite certain that works for the particular specimen of HTML that you wish it to, then by all means use it. Notice several things that I do which you didn’t. One of them is not dealing with the HTML a line at a time. That virtually never works.
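A minimal sketch, not the answer’s original code, of why line-at-a-time processing tends to break. The sample HTML, the pattern, and the variable names are all made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample: the <a> tag is split across two lines.
my $html = "<p>See <a\n   href=\"http://example.com/\">this page</a></p>\n";

# Line at a time: neither line contains the whole tag, so nothing matches.
my @by_line;
for my $line (split /^/m, $html) {
    push @by_line, $1 while $line =~ /<a\s[^>]*\bhref="([^"]*)"/g;
}

# Whole document in one string: \s and [^>] both match newlines,
# so the tag is found even though it spans two lines.
my @whole = $html =~ /<a\s[^>]*\bhref="([^"]*)"/g;

printf "line at a time: %d match(es)\n", scalar @by_line;
printf "whole document: %d match(es)\n", scalar @whole;
```

When reading from a file rather than a string, undefining the input record separator (for example with `local $/;`) before the read makes a single `<FH>` return the entire file as one string.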

However, this sort of solution works only on extremely limited forms of valid HTML. You can only use it when you can guarantee that the HTML you’re working with really looks like what you expect it to.

The problem is that it quite often does not look all neat and tidy. For these situations, you are strongly advised to use an HTML parsing class. However, no one seems to have shown you the code to do that. That’s not very helpful.

Wizard-Level Regex Solution

And I’m going to be one of them myself. Because I am going to show you a more general solution for approaching what I believe your task to be, but unlike anyone else who ever posts on Stack Overflow, I’m going to use regexes to do it, just to show you that it can be done, but that you do not wish to do it this way:

The Choice Is Yours — Or Is It?

Both those solve your problem with regexes. It is possible that you will be able to use the first of my two approaches. I cannot say, because like seemingly all such questions asked here, you haven’t told us enough about the data for us (and perhaps also you) to know for sure whether the naïve approach will suffice.

When it doesn’t, you have two choices.

You can either use the more robust and flexible approach offered by my second technique. Just make certain that you understand it in all its aspects, because otherwise you won’t be able to maintain your code — and neither will anybody else.

Use an HTML parsing class.
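As one possible illustration of that second choice, here is a minimal sketch using HTML::LinkExtor, a module from the HTML-Parser distribution on CPAN. It is not part of core Perl, and picking it here is an assumption rather than something the answer specifies. The sample HTML is made up.

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # from the HTML-Parser distribution on CPAN

# Hypothetical sample input; the <a> tag spans two lines and the
# attributes are in an awkward order on purpose.
my $html = <<'END_HTML';
<p><a class="x"
      href="http://example.com/page">link</a>
<img alt="logo" src="logo.png"></p>
END_HTML

my @found;

# The callback receives the tag name and its link attributes
# as name/value pairs.
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @found, "a href: $attr{href}" if $tag eq 'a'   && defined $attr{href};
    push @found, "img src: $attr{src}" if $tag eq 'img' && defined $attr{src};
});

$parser->parse($html);
$parser->eof;

print "$_\n" for @found;
```

The parser deals with attribute order, quoting, and tags that span lines, so the calling code only has to decide which tags and attributes it cares about.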

I find it unlikely that even 1 person in 1,000 would reasonably make the first of those two choices. In particular, I find it extremely unlikely that someone who asks for help with regexes as simple as those in my first solution would be a person capable of managing the regexes given in my second solution.

Which really leaves you with only one “choice” — if I may use that word so loosely.

To the downvoters I repeat: if you disagree with my advising them not to use regexes for this, please offer an alternative. I suspect you have all failed to read what I wrote and somehow think I am advising the opposite of what I really am. Please don't be dumb.
–
tchrist Jan 18 '11 at 20:50

It would be great if it's in Perl regex.
–
user524707 Nov 26 '10 at 10:02

I have edited it now. This is an HTML piece, modified.
–
user524707 Nov 26 '10 at 10:04

1

@user510749: You said it would be great if this were a perl regex. I dunno for sure that that follows, but I have given you two different perl regex approaches in my answer elsewhere on this page. Whether they’re great or not I leave to your judgement. BTW, if you edit your user profile, you can give yourself a name of your choosing, not just @user510749.
–
tchrist Nov 26 '10 at 15:11

First of all, your code only reads the first line of your input. If you want to iterate through all the lines of your input, you should use this:

while (my $str = <FILENAME>) {
    chomp $str;    # strip the trailing newline
    # ... process $str here ...
}

Assuming your input is well formatted, the href attribute always comes right after the 'a' tag name, the src attribute always follows the 'img' tag name, you don't have spaces in your URLs, and you don't have more than one strong tag per line, then you could use the following: