12 posts in this topic

Original Poster 324

I am trying to mine microdata from websites that use the schema.org vocabulary, but I am having trouble getting everything I want in a simple way.

As you can see below, I am doing this in a very roundabout way, but when I try to google it I'm getting loads of trash that isn't helpful or is too much for what I require. I simply want to grab the microdata and that's it for now. Any help?

snaphat (Myles Landwehr) 414

The program reads each line until it finds itemprop and splits at that point; if not, it does nothing. This works on random sites with these in :)

My point was that it looks like it only searches for servesCousine and cell1 on the same line where itemprop was found, and I was wondering why that would be a valid assumption, i.e. why couldn't those tags start on a later line after itemprop is found?
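To make the concern concrete, here is a minimal sketch in Python. The markup and the helper name `values_on_same_line` are made up for illustration (this is not the OP's code); it only shows how a per-line search comes back empty when the text sits on the line after the itemprop attribute.

```python
# Hypothetical markup: the itemprop attribute and its text are on different lines.
html = """<div itemscope itemtype="http://schema.org/Restaurant">
  <span itemprop="servesCuisine">
    Italian
  </span>
</div>"""


def values_on_same_line(text, prop):
    """Naive per-line search: only sees text that follows the tag on the same line."""
    marker = 'itemprop="%s">' % prop
    found = []
    for line in text.splitlines():
        if marker in line:
            # Take whatever sits between the marker and the next "<" on this line.
            value = line.split(marker, 1)[1].split("<", 1)[0].strip()
            found.append(value)
    return found


naive = values_on_same_line(html, "servesCuisine")
print(naive)  # the marker is found, but the captured value is empty
```

The attribute is found, but "Italian" is on the following line, so the per-line approach has nothing to return.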

snaphat (Myles Landwehr) 414

I believe Stack Overflow states this better than me, but regex should not be used to parse HTML.

That link is essentially saying that you can't build a parser for HTML as a language using regex (the answer marked correct actually does a very poor job of explaining that). The reason is that HTML has a context-free grammar, and regex is strictly not a powerful enough construct for that *. Strictly speaking, you would need some form of LL/LR parser for HTML (see the wiki for these).

That being said, the OP is not trying to parse the language itself as a whole and so there is a distinct difference. He's just trying to find a few tidbits of information within a document and regex could very well be appropriate for that job. It really depends on the nature of what he wants to parse. In general, speaking from an engineering perspective, you should use the simplest possible solution. He shouldn't build an HTML parser if he doesn't have to. But, as The_Decryptor said, using an already existing HTML Parser is a good (better) alternative (and probably less work for the OP).
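As a rough illustration of the "use an existing parser" route, here is a minimal sketch built on Python's standard-library `html.parser`. The class name `ItempropCollector`, the sample page, and the simplified rules (take the `content` attribute if present, otherwise the element's text; nested itemprops are not handled) are my own assumptions, not a full microdata implementation.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collects (itemprop, value) pairs using a simplified reading of microdata."""

    def __init__(self):
        super().__init__()
        self.items = []     # finished (name, value) pairs
        self._name = None   # itemprop currently being read, if any
        self._tag = None    # tag name of the element that carries it
        self._depth = 0     # open same-named tags inside that element
        self._parts = []    # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._name is not None:
            # Already inside a property: only track nesting of the same tag.
            if tag == self._tag:
                self._depth += 1
            return
        name = attrs.get("itemprop")
        if name is None:
            return
        if attrs.get("content") is not None:
            # e.g. <meta itemprop="priceRange" content="$$">
            self.items.append((name, attrs["content"]))
        else:
            self._name, self._tag, self._depth, self._parts = name, tag, 1, []

    def handle_endtag(self, tag):
        if self._name is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace so multi-line text becomes one clean value.
                text = " ".join("".join(self._parts).split())
                self.items.append((self._name, text))
                self._name = None

    def handle_data(self, data):
        if self._name is not None:
            self._parts.append(data)


# Hypothetical page fragment; the value of servesCuisine spans several lines.
page = """<div itemscope itemtype="http://schema.org/Restaurant">
  <h1 itemprop="name">Luigi's</h1>
  <span itemprop="servesCuisine">
    Italian
  </span>
  <meta itemprop="priceRange" content="$$">
</div>"""

collector = ItempropCollector()
collector.feed(page)
collector.close()
items = collector.items
print(items)
```

Because the parser works on elements rather than lines, it does not matter where the line breaks fall in the markup.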

* Note: there are languages that can be generated and parsed by regex, though.
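A small sketch of that distinction, in Python with made-up sample strings: picking out flat `itemprop="..."` attribute values is a regular pattern that a regex handles fine, whereas matching nested elements needs counting, which a plain regex does not do. The pattern names here are only for illustration.

```python
import re

# A flat pattern: an attribute name, "=", and a double-quoted value.
ITEMPROP_ATTR = re.compile(r'itemprop\s*=\s*"([^"]*)"')

sample = '<span itemprop="name">A</span>\n<span\n  itemprop="servesCuisine">B</span>'
names = ITEMPROP_ATTR.findall(sample)
print(names)  # ['name', 'servesCuisine']

# Nested structure: the non-greedy match stops at the first closing tag,
# so the outer <div> is paired with the inner </div>.
nested = re.findall(r"<div>(.*?)</div>", "<div>a<div>b</div>c</div>")
print(nested)  # ['a<div>b']
```

The first pattern only works for double-quoted attribute values; single-quoted or unquoted attributes would need extra cases.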

The_Decryptor 1,088

Yeah, you don't want something overly complex, but at the same time the one constant about HTML is that authors screw it up; that's why the parsing rules for it are so arcane.

If you're dealing with random content from the wild, I wouldn't bother trying to write my own parser. No matter what, somebody will come up with a way to break it (like including HTML unescaped in XML, or using invalid encodings so that you get presented with random bytes, missing tags, backwards declarations, nested tags that can't be nested, etc.). And it's easier to use a parser that works with their content than to try to get them to fix their content ("But it works for us, it's your code that's broken", etc.).

Edit: I actually just remembered something related to this that I saw recently: somebody saved a document with "smart quotes" as UTF-8, then loaded that as Latin-1, then converted it to Windows-1252, then saved that as UTF-8, and complained when the browser didn't render them properly.
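For anyone who hasn't seen this kind of breakage, here is a minimal Python sketch of a simplified version of it: a single wrong decode (UTF-8 bytes read as Windows-1252) rather than the full chain described above. The sample string is made up.

```python
# "It's" written with a right single "smart" quote (U+2019).
original = "It\u2019s"

utf8_bytes = original.encode("utf-8")   # how the file was saved
misread = utf8_bytes.decode("cp1252")   # loaded with the wrong codec
resaved = misread.encode("utf-8")       # the damage is now baked into the file

print(misread)  # Itâ€™s

# With only one wrong step, the damage can still be undone by reversing it.
repaired = resaved.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

Each extra round of wrong decoding and re-saving makes the text longer and harder to repair, which is roughly what the browser was being blamed for.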

snaphat (Myles Landwehr) 414
