
If I write an HTML parsing module (a single .py file), will it be inefficient, i.e. will it be slow or use more memory than needed?

Would splitting the string into multiple parts (for example, by following a BNF grammar) be faster, or is it faster to use str.find() and then slice out the substrings?

From your second question it seems that you don't know too much about parsers, so the answer to the first question is yes, it will be inefficient. You can probably learn if you want to, but it will take some time. There are some tools for this:

The funny thing is that just today I found that (HTMLParser) page; I wasn't searching the right terms before.

I'm not too savvy when it comes to performance or knowing what to search for, but this is how I've set up my class:

Method #1
1) Every HTML tag and tag attribute is stored into a node.
2) Each node is within a nodelist's list (not dict) as a reference
3) Every node is also a nodelist, but the cardinality of the nodelist can be 0
4) Two nodes are adjacent iff they are nested within the same tag block (i.e. "title" and "script" are adjacent if they are inside a "head" tag).
5) Every adjacent node can call getPrev or getNext to get the tag behind or in front of it, respectively; if there is none, it returns None
6) A nodelist has the methods getElementById, getElementsByTagName, getElementsByClassName
7) Each of the getElement(s) methods iterates through all of the nodes in the nodelist, including the nodes within each node's own nodelist (recursively), to find matching ids, tag names, or class names.
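To make the trade-off easier to discuss, here is a minimal sketch of Method #1 as described in the seven points above. The class name `Node` and the `append`, `walk` and `_sibling` helpers are assumptions made for illustration; only getPrev, getNext and the three getElement(s) names come from the post.

```python
class Node:
    """One HTML element; it doubles as the list of its own children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.parent = None
        self.children = []  # the node's own nodelist; may be empty

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def _sibling(self, offset):
        # Adjacent means "nested in the same tag block", i.e. same parent.
        if self.parent is None:
            return None
        siblings = self.parent.children
        i = siblings.index(self) + offset
        return siblings[i] if 0 <= i < len(siblings) else None

    def getPrev(self):
        return self._sibling(-1)

    def getNext(self):
        return self._sibling(1)

    def walk(self):
        """Yield every descendant, depth first."""
        for child in self.children:
            yield child
            yield from child.walk()

    def getElementById(self, id_):
        return next((n for n in self.walk() if n.attrs.get("id") == id_), None)

    def getElementsByTagName(self, tag):
        return [n for n in self.walk() if n.tag == tag]

    def getElementsByClassName(self, name):
        return [n for n in self.walk() if name in n.attrs.get("class", "").split()]


html = Node("html")
head = html.append(Node("head"))
title = head.append(Node("title", {"id": "t"}))
script = head.append(Node("script", {"class": "a b"}))
```

Every lookup here walks the whole subtree, so each call costs time proportional to the number of nodes below the starting point, but no memory is spent beyond the tree itself.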

So here's another method I originally tried, but thought it... used extra memory?

Method #2
1) The nodelist has members tagnames, classes, and ids, each of which is a dict whose values are lists (not dicts) of node references.
2) Each node is within the (single) nodelist's tagnames (is a list) as a reference
3) Each node has a predecessor and successor member. The only difference is that the parent of a node can be obtained through getPrev (i.e. being in the same tag block does not matter).
4) Because there are members for getElement(s)By(Id/ClassName/TagName)(), it only needs to either: a) return the self.ids['myid'], b) return self.classes['classname'], or c) return self.tagnames['tagname']
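A minimal sketch of the indexing idea in Method #2, under some assumptions: the class name `IndexedNodeList`, its `add` method and the plain-dict node representation are hypothetical, and the predecessor/successor links from point 3 are left out to keep the example short.

```python
from collections import defaultdict


class IndexedNodeList:
    """Single nodelist that indexes every node by tag name, class and id."""

    def __init__(self):
        self.tagnames = defaultdict(list)
        self.classes = defaultdict(list)
        self.ids = defaultdict(list)

    def add(self, tag, attrs):
        # The node is stored once; the three dicts only hold references to it.
        node = {"tag": tag, "attrs": attrs}
        self.tagnames[tag].append(node)
        for name in attrs.get("class", "").split():
            self.classes[name].append(node)
        if "id" in attrs:
            self.ids[attrs["id"]].append(node)
        return node

    def getElementById(self, id_):
        matches = self.ids.get(id_, [])
        return matches[0] if matches else None

    def getElementsByClassName(self, name):
        return list(self.classes.get(name, []))

    def getElementsByTagName(self, tag):
        return list(self.tagnames.get(tag, []))


nodes = IndexedNodeList()
first = nodes.add("div", {"id": "main", "class": "post featured"})
second = nodes.add("div", {"class": "featured"})
```

The extra memory is three dicts of references, not copies of the nodes, and each lookup becomes a dict access instead of a walk over every node.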

Also, for each "run" I get about 8 class/tag names from a page that has about 4000 tags in total.

Based on my descriptions, would it have been better if I stuck with Method #2, or is Method #1 fine as is? Or would it be better to use regular expressions?

imo Method #2 is better. Some things you can think about:
- relative to 2#1 (notation: method#item), ids and classes are attributes, so you only need to store elements (tag names) and attributes
- a page is a tree whose root is 'html', with two children, 'head' and 'body', and you can continue this way to describe the whole page (something like what you described in Method #1). That is why I think the best way is to implement a tree class and extend it afterwards.
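As a rough illustration of the "page is a tree" idea, the sketch below builds such a tree with Python 3's standard html.parser module (the Python 3 name of the HTMLParser module mentioned earlier in the thread). The `TreeBuilder` class, the `build_tree` helper, the dict-based nodes and the `VOID` set are assumptions for this example, and text content is ignored.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name;
        # a stray end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def build_tree(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


tree = build_tree(
    "<html><head><title>x</title></head>"
    "<body><p>a<br>b</p></body></html>"
)
```

The stack holds the chain of currently open elements, so the parent of a new node is always the top of the stack. Attributes such as id and class simply live in each node's attrs dict.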

There are three things in a parser:
1. A lexical analyzer: code that identifies each token in the input. Here you can use regex, string functions, or whatever you find easy and fast to do.
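A lexical analyzer of the kind described in item 1 could look roughly like the sketch below, which uses one regular expression to split the input into tokens. The token kinds, the `TOKEN` pattern and the `tokenize` function are assumptions for illustration. It is deliberately simplistic: an attribute value containing ">" would confuse it, and a stray "<" is silently skipped.

```python
import re

# One alternative per token kind; comments are tried first so that
# "<!-- ... -->" is never mistaken for a tag.
TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|</\s*(?P<end>[A-Za-z][\w:-]*)\s*>"
    r"|<(?P<start>[A-Za-z][\w:-]*)(?P<attrs>[^<>]*)>"
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)


def tokenize(source):
    """Yield (kind, value) pairs for tags, comments and text runs."""
    for m in TOKEN.finditer(source):
        if m.group("comment") is not None:
            yield ("comment", m.group("comment"))
        elif m.group("end") is not None:
            yield ("end", m.group("end").lower())
        elif m.group("start") is not None:
            yield ("start", m.group("start").lower())
        else:
            yield ("text", m.group("text"))


tokens = list(tokenize("<p class='x'>Hi</P><!-- c -->"))
```

The token stream is only the first stage; turning it into a nested structure is a separate step that consumes these pairs.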

Yes, well, parsing HTML just wasn't a good idea on my end. I'm going to just use plain old regular expressions to search for things. If you are able to prove that BeautifulSoup is much faster than regular expressions at a) matching an element with ID x, b) matching elements of a certain class, and c) getting the elements before an element plus its parent.

Your challenge assumes that you're getting well-formed, even valid HTML source code. But the real world is full of soup. What happens if the source code you're evaluating has two or three elements that have the same ID? What happens when you're parsing elements that have multiple classes that are defined in varying order, like:
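As a hypothetical illustration of this point (the markup below is made up for the example and is not taken from the post), the sketch compares a regex written with one attribute layout in mind against the standard library's html.parser on markup with a duplicated id and classes in varying order. The names `SOUP`, `NAIVE` and `ClassCounter` are assumptions.

```python
import re
from html.parser import HTMLParser

SOUP = (
    '<div id="main" class="post featured">one</div>'
    '<div id="main" class="featured post">two</div>'
    "<div class='sticky post'>three</div>"
)

# Assumes double quotes and "post" as the first class in the attribute.
NAIVE = re.compile(r'<div[^>]*class="post[^"]*"')


class ClassCounter(HTMLParser):
    """Count start tags carrying a given class and record every id seen."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.hits = 0
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Splitting on whitespace makes the class order irrelevant.
        if self.wanted in (attrs.get("class") or "").split():
            self.hits += 1
        if attrs.get("id"):
            self.ids.append(attrs["id"])


counter = ClassCounter("post")
counter.feed(SOUP)
counter.close()

regex_hits = len(NAIVE.findall(SOUP))
```

Here `regex_hits` is 1 while `counter.hits` is 3, and `counter.ids` shows the id "main" twice. A more careful regex can be written to handle these variations, so this says nothing about which approach is faster; it only shows how easily a pattern tuned to one layout misses the others.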

It seems you are making a lot of incorrect assumptions about what I know and what I want. Your response is a fallacy. Just because you know that this module makes scraping easier does not mean its performance is better than using RegEx. Therefore your retort is invalid.

Regardless, this thread is done. My questions have long since been answered (more than they needed to be) by oesxyl.

Thank you.
I agree that using a regex in a specific case, for a given task, is many times faster than a general parser. But only a few parsers implement error recovery and deal with errors and invalid markup; Samhain13 is right about BeautifulSoup. I just noticed that I didn't even mention this problem, my bad.