web spider

If one would like to develop a web spider, one that runs fast and can also be built in the minimum time possible, would C++ be the language of choice? Some languages offer HTTP libraries and web/HTML parsing out of the box, and I suspect that implementing the same functionality in C++ would prove harder than learning a language such as Perl solely to build the spider. I would like to hear what you think.

The bandwidth of your internet connection is likely to be the limiting factor. If your DSL connection maxes out at, say, 1 MB/s, then my guess is that any language would suffice if the only thing to be done is parsing HTML files for text (for indexing) and links (for fetching).

But if you've got a server farm with multiple giga-bit connections, a different analysis would be appropriate.

EDIT: As for Salem's concern, I'm guessing that since you are posting this question here, a server farm with multiple gigabits of bandwidth is not at your disposal.

It sounds like you are more concerned with having a solid, effective and efficient end product, instead of this being a vehicle to learn a new language. In that light, here's what I think:

It is usually much easier to learn to use a library in a language you are already proficient in than it is to learn a new language well enough to tackle larger problems (like a web spider). So in your case, I would only consider C++ or Python as implementation languages.

The next step is to consider how well you know each of those languages. If you're a C++ guru and only so-so at Python, then C++ is the way to go, hands down. Any benefit Python may offer in faster development time will likely be offset by you being slower at development and making more errors and bad design decisions. If you are a Python guru and so-so at C++, then develop in Python.

Now, pick an HTTP library. libcurl comes to mind: it's efficient, stable, widely used, and well documented (and free). It was originally a C library, but it has C++ and Python bindings. I'm sure there are plenty of others if you Google. Python probably even has a native library. EDIT: Just checked; it does.
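To back up that edit: Python's native HTTP support lives in the standard library's urllib.request. A minimal fetch helper might look like the sketch below (the function name and signature are my own invention; a real spider would also need robots.txt handling, retries, and rate limiting):

```python
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Fetch a URL and return (status_code, body_bytes).

    Bare-bones sketch: no error handling, no redirect policy,
    no politeness delays -- all of which a spider needs.
    """
    with urlopen(url, timeout=timeout) as resp:
        return resp.getcode(), resp.read()
```

The same job in C++ would typically go through libcurl's easy interface (curl_easy_init / curl_easy_setopt / curl_easy_perform) rather than the standard library, since C++ has no built-in HTTP support.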

The only other thing you might need for the HTML parsing (libcurl only works at the HTTP level) is some sort of HTML/XML library. Python has its own DOM-based XML library, and a quick Google search will turn up plenty of options for C++.
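For what it's worth, Python's standard library also ships a tolerant event-driven HTML parser (html.parser), which is enough for the "find the links" half of a spider. A minimal link extractor, with names of my own choosing, might look like:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolving relative
    links against a base URL (hypothetical helper, not a library API)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Because the parser is event-driven rather than DOM-based, it copes with the sloppy markup real pages contain instead of rejecting them outright.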

I think I am going to go with C++ because its IDEs are far more developed than those for Python and Perl, which makes for a very conducive environment for developing a web spider. I am also better versed in C++ than in Python.

I daresay that the IDEs have less of a role to play concerning a "very conducive environment for development of a web spider" than the availability and ease of use of relevant libraries.[1] In this area, Perl supposedly has an advantage as I have heard it touted for its text processing capabilities. Of course, if you know C++ considerably better than Python, and don't know Perl at all, then for the short term C++ would be the best option even if an experienced developer in those other languages can get started faster with stronger relevant library support.

[1] That said, there are fairly well developed IDEs for Python, and I presume Perl as well.

Originally Posted by Bjarne Stroustrup (2000-10-14)

I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.

Now in C++, if you want to do it all yourself and not use any libraries except the basic socket API, you're looking at hundreds (if not thousands) of lines of code to achieve the same thing. The time to write and debug such a thing is probably measured in weeks.
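To make that concrete, here is what "doing it yourself" at the socket level amounts to, sketched in Python for brevity (the equivalent C++ against the raw socket API is considerably longer). It is a bare HTTP/1.0 GET: no redirects, no chunked transfer encoding, no TLS, no error recovery, all of which a real spider would eventually need.

```python
import socket

def http_get(host, port=80, path="/"):
    """Bare-bones HTTP/1.0 GET over a raw socket (illustration only).
    Returns (status_line_and_headers, body_bytes)."""
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    raw = b"".join(chunks)
    headers, _, body = raw.partition(b"\r\n\r\n")
    return headers.decode("iso-8859-1"), body
```

Even this toy skips everything hard; once you add status-code handling, redirects, timeouts per phase, and HTTP/1.1 keep-alive, the line count balloons, which is the argument for libcurl.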

Now, pick a HTTP library. libcurl comes to mind, it's very efficient, stable, widely used and well documented (and free).

Only a few minutes ago I started writing a web spider in C to download entire forums, using libcurl. I still need to parse the pages and search for links. I think there are C libraries (written years ago) for parsing HTML, but web pages nowadays are made up of HTML, CSS, JavaScript, and god knows what else. Can anyone recommend a good C library? I'll probably just write the parsing code myself; that's probably quicker than learning a new library.

"regex" is shorthand for "regular expression", so they're the same thing (just in case you were wondering ). It's one of the most useful things you'll ever learn. It's for matching complex patterns in strings and extracting substrings.

"Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes."