I'm new to python, and I'm building a web crawler for funsies/educational purposes. I don't use any recursive functions, but I still get the 'RuntimeError: maximum recursion depth exceeded' error. I'm really confused, and kinda feel like I'm missing something obvious or just misunderstanding something. Am I somehow recursing, or could it be related to my large loops? The idea is to crawl the web until you've crawled 10k pages.

Looks like my issue is trying to call str() on a URL that is not encoded properly. I've tried decoding the URLs and then encoding them to unicode, but I was never able to do it successfully. Any advice would be greatly appreciated!
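
To illustrate what I mean (hypothetical URL; this is Python 2, where str() implicitly encodes with the ascii codec):

    url = u'http://example.com/caf\xe9'   # unicode URL with a non-ASCII character
    str(url)   # raises UnicodeEncodeError: the implicit ascii encode fails on u'\xe9'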

Please do include the full traceback of Python exceptions. That saves us all having to guess where the error occurred.
– Martijn Pieters Feb 26 '13 at 18:23

If you're parsing HTML, you are using recursive functions... I suggest that when it fails you dump the page variable to see if the HTML is malformed.
– Crisfole Feb 26 '13 at 18:23

I'm at work now; I'll post the full traceback when I get home tonight. Sorry.
– Brian Feb 26 '13 at 18:36

@ChristopherPfohl: Not necessarily. You can write a purely iterative parser, and instead of recursively crawling any links found, just push them onto the end of the queue (roughly like the sketch below).
– abarnert Feb 26 '13 at 18:40
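
(A minimal sketch of that queue-based approach, with fetch_links as a hypothetical helper returning the hrefs found on a page:)

    from collections import deque

    queue = deque(['http://example.com'])   # hypothetical seed URL
    seen = set(queue)
    crawled = 0

    while queue and crawled < 10000:        # the 10k-page target from the question
        url = queue.popleft()
        crawled += 1
        for link in fetch_links(url):       # hypothetical helper
            if link not in seen:
                seen.add(link)
                queue.append(link)          # enqueue instead of recursing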

Did you really shorten this as much as you could? For example, is opening an mdb database essential to the problem? (That's something most of us can't test—even if we had your database, which we don't.)
– abarnert Feb 26 '13 at 19:02

1 Answer

The code you gave us doesn't actually run: it's missing all the import statements, it has indentation errors, and so on. It also requires a JET database we don't have (and a third-party module to read it), and it's hardcoded to use pre-existing directories in your home directory.

I've attempted to fix all of that at http://pastebin.com/rCJriEu5 (which requires lxml and bs4; if you were using a different parsing library or BS3, I can try it that way).

And when I run it, it seems to work. It's 31.73% complete, with no errors yet. Even if I do a sys.setrecursionlimit(50) at the start of the file, it still seems to work (3.67% complete so far).
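
(For reference, that test is just this at the top of the script:)

    import sys
    sys.setrecursionlimit(50)   # fail fast if anything recurses deeply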

So, whatever is wrong in your code is apparently in code you haven't shown us.

It seems to fail for me around 1500 pages scanned. I know that doesn't make sense; it's just my observation. Sorry about removing the import statements, I thought that would be helpful (I was wrong). I've uploaded the full code here: pastebin.com/q9KCrQrL (that should get rid of the indentation errors). As for the database, it's just a local mysql db named python, with a user and password both set to python (see the connection sketch below). All the log files should auto-create (the directory should too, thanks to your code). I was getting the traceback wrong, so I'm still working on getting that.
– Brian Feb 27 '13 at 23:24
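
(For anyone trying to reproduce this, a minimal connection sketch using the MySQL-python driver and the credentials described above:)

    import MySQLdb   # the MySQL-python driver

    # Matching the described setup: database, user, and
    # password are all 'python' on a local server.
    conn = MySQLdb.connect(host='localhost', user='python',
                           passwd='python', db='python')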

@Brian: Without the traceback, it's hard to be sure what's going on… but my guess is that the recursion limit is actually being hit inside the HTML parser. See the docs for details on the different parsers bs4 can use, but if you're falling back to the default, especially on 2.7.2/3.1.0 or earlier, it sucks in many ways. When I test with 2.7.2, it fails on 4 of the first 100 files (at least two different errors, neither of which is a recursion error). Pinning an explicit parser is sketched below.
– abarnert Feb 27 '13 at 23:36
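
(A minimal sketch of pinning the parser explicitly, assuming lxml is installed:)

    from bs4 import BeautifulSoup

    html = '<html><body><a href="http://example.com">x</a></body></html>'
    # Ask for lxml explicitly instead of letting bs4 fall back to the
    # old stdlib parser, which misbehaves on malformed markup.
    soup = BeautifulSoup(html, 'lxml')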

@Brian: And a really quick way to test that: whichever link it failed on, put that link at the start of the started deque (as sketched below), and see if it now fails on the first page instead of the 1500th.
– abarnert Feb 27 '13 at 23:38
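
(Concretely, something like this, where the URL is a hypothetical stand-in for whichever link it actually died on:)

    # 'started' is the crawl deque from the posted code
    started.appendleft('http://the-failing-link.example')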

Huzzah! I finally have a full stack trace: pastebin.com/9GzAxZM9 This is my updated code: pastebin.com/4v5GT7ft Looks like the problem comes from the log function (like @uselpa stated), specifically trying to convert the URL to a string so I could log it properly. This is a symptom of a larger problem I've been having, which is the URL encodings. I tried decoding them and then encoding them to unicode, but I never had any luck (that's why I check for unicode errors everywhere). Any advice on how to properly decode/encode would be greatly appreciated.
– Brian Mar 1 '13 at 13:55
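
(A minimal sketch of the encode-before-logging approach under discussion, assuming Python 2 and that a URL may arrive as either bytes or unicode:)

    def to_log_bytes(url):
        # Hypothetical helper: convert unicode to UTF-8 bytes before any
        # str()-based logging; byte strings pass through unchanged.
        if isinstance(url, unicode):
            return url.encode('utf-8')
        return url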

@Brian: OK, so which link is this happening on? If it's happening in the same place consistently, but it takes waiting through 1500-odd fetches to repro it, you're making it much harder for people to debug this for you.
– abarnert Mar 1 '13 at 18:50