Integration with Web?

How would I go about creating a program that will "read" a webpage, then find a list on that page and save it into variables? I'm looking to create a program that does that, and I'm not even sure where to start.

You need a program that acts like a web browser, i.e. an HTTP client. Basically, a web browser (I'll just call it the "client") is a program that sends requests and receives responses over a socket, using (in general) the HTTP protocol.

Example: you open your web browser (Firefox, Internet Explorer, whatever it may be) and go to http://www.google.ca. This is what happens in the background:

- Your browser creates a socket on your computer and connects to the web server's socket, in this case www.google.ca:80 (a socket address is an IP address or DNS name plus a port, separated by ":"; 80 is the default port for HTTP).
- The web server (www.google.ca:80) sees your request for a connection, accepts it, and waits for you to tell it something.
- Your web browser uses HTTP's "GET" method to request a resource on that web server. With no path in the URL it asks for "/", which the server typically maps to a default document such as "index.html" (google.ca =~ google.ca/index.html). It sends this HTTP GET request over the socket.
- The web server reads your request from the socket and, among other things, tries to find the resource (here "/index.html"). If it exists, the server sends it back along with other information (i.e. the HTTP header).
- Your web browser reads from this socket until the response is complete. The header ends at the first blank line, and the length of what follows is given by the Content-Length header, by chunked transfer encoding, or simply by the server closing the connection. What the socket delivers is the HTTP header, then the resource, so the web browser separates the two and displays the resource (the HTML file).
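Here's a rough sketch of the steps above using plain POSIX sockets (Linux/macOS; on Windows the Winsock calls are similar but not identical). The function names build_get_request and http_get are made up for this example. It speaks HTTP/1.0 with "Connection: close" so that the server closes the socket when it is done, which lets the client simply read until end-of-stream. It does no SSL, no redirects and no chunked decoding, so treat it as a starting point only.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <stdexcept>
#include <string>

// Build the text of a minimal HTTP/1.0 GET request (hypothetical helper).
// Every header line ends in "\r\n"; an empty line ends the request.
std::string build_get_request(const std::string& host, const std::string& path) {
    assert(!path.empty() && path[0] == '/');  // request paths start with '/'
    return "GET " + path + " HTTP/1.0\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n"
           "\r\n";
}

// Connect to host:port, send a GET for path, and return the raw response
// (header and body together). Throws std::runtime_error on failure.
std::string http_get(const std::string& host, const std::string& port,
                     const std::string& path) {
    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;      // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;  // TCP

    addrinfo* results = nullptr;
    if (getaddrinfo(host.c_str(), port.c_str(), &hints, &results) != 0)
        throw std::runtime_error("could not resolve " + host);

    // Try each resolved address until one accepts the connection.
    int fd = -1;
    for (addrinfo* a = results; a != nullptr; a = a->ai_next) {
        fd = socket(a->ai_family, a->ai_socktype, a->ai_protocol);
        if (fd == -1) continue;
        if (connect(fd, a->ai_addr, a->ai_addrlen) == 0) break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(results);
    if (fd == -1) throw std::runtime_error("could not connect to " + host);

    // send() may write only part of the buffer, so loop until all is sent.
    const std::string request = build_get_request(host, path);
    std::size_t sent = 0;
    while (sent < request.size()) {
        ssize_t n = send(fd, request.data() + sent, request.size() - sent, 0);
        if (n <= 0) {
            close(fd);
            throw std::runtime_error("send failed");
        }
        sent += static_cast<std::size_t>(n);
    }

    // Read until the server closes the connection (recv returns 0)
    // or an error occurs (recv returns -1).
    std::string response;
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0)
        response.append(buf, static_cast<std::size_t>(n));
    close(fd);
    return response;
}
```

A call such as http_get("www.google.ca", "80", "/") would hand back the header and the resource as one string. What the server actually answers (a page, a redirect, an error) is up to the server.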

You can do the exact same thing, and it's surprisingly easy (at least sending a request and getting the response with the file; rendering its HTML or whatever other language is quite a bit more complicated). Also note that the "resource" can be any type of file, an image for example (though that will be binary data rather than plain text, and I don't think you said you are interested in those files).

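So, to start: learn the basics of the HTTP protocol and of socket programming in C++ (creating a socket, reading from it, writing to it) and you're done. C++ has no socket library of its own, so this means your platform's API: BSD sockets or Winsock. Once you get the file from the socket (the "web server"), you have a copy of the file, so there's nothing special about working with it (i.e. looking for a list and creating variables for it, whatever).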
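Once the raw response is in a string, separating header from resource is ordinary string handling. A minimal sketch, assuming the whole response has already been read into memory. The names split_response and status_code are invented for this example.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split a raw HTTP response into {header, body}.
// The header ends at the first blank line, i.e. the sequence "\r\n\r\n".
// If no blank line is found, the whole input is returned as the header.
std::pair<std::string, std::string> split_response(const std::string& raw) {
    const std::string separator = "\r\n\r\n";
    const std::string::size_type pos = raw.find(separator);
    if (pos == std::string::npos) return {raw, std::string()};
    return {raw.substr(0, pos), raw.substr(pos + separator.size())};
}

// Extract the three-digit status code from the status line,
// e.g. 200 from "HTTP/1.0 200 OK". Returns -1 if the line is malformed.
int status_code(const std::string& header) {
    const std::string::size_type space = header.find(' ');
    if (space == std::string::npos || space + 4 > header.size()) return -1;
    int code = 0;
    for (std::string::size_type i = space + 1; i < space + 4; ++i) {
        if (header[i] < '0' || header[i] > '9') return -1;
        code = code * 10 + (header[i] - '0');
    }
    return code;
}
```

Checking the status code before parsing the body is worthwhile, since a 301/302 redirect or a 404 page would otherwise be searched as if it were the page you wanted.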

The best thing to do is to learn the two basics above, make an attempt, and let us know your specific problems if you run into any.

Downloading an HTML page and parsing it, using C++? Sounds like a recipe for intense pain.

I don't think it'd be so horrible if he's looking for, say, a specific tag, such as <ul>. It's certainly doable, considering WebKit and Gecko are written in C++, and they have to do far more than simply read in markup that loosely resembles XML.

You can probably use a regular expression library to find a specific tag. The C++11 standard includes a regex library (<regex>), which is based on Boost.Regex. VS 2010 has it. GCC has shipped the header since around 4.3, but its implementation was not usable until GCC 4.9. If your compiler doesn't have a working one, you can install the Boost library and use Boost.Regex instead.
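For what it's worth, here is a small sketch of the regex approach using std::regex. Boost.Regex offers nearly the same interface under the boost:: namespace. The function name extract_list_items is hypothetical, and the pattern assumes simple, non-nested <li> elements; real-world HTML can easily defeat it.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Return the inner text of every <li>...</li> element found in html.
// (?:\s[^>]*)? allows optional attributes after "<li" without also
// matching other tags that merely start with "li", such as <link>.
// [\s\S]*? matches any characters, including newlines, as few as possible.
std::vector<std::string> extract_list_items(const std::string& html) {
    static const std::regex item_pattern(R"(<li(?:\s[^>]*)?>([\s\S]*?)</li>)",
                                         std::regex::icase);
    std::vector<std::string> items;
    for (std::sregex_iterator it(html.begin(), html.end(), item_pattern), end;
         it != end; ++it) {
        items.push_back((*it)[1].str());  // capture group 1 is the inner text
    }
    return items;
}
```

The result is a plain std::vector<std::string>, which covers the "save a list into variables" part of the original question.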

It really isn't that difficult, especially if you're using some regular expression library. If you know exactly which characters can appear within the tags, then you can use basic string comparisons without any libraries or regex. Doing it "manually" like this is of course more work to write and maintain than a regular expression, though plain string comparisons are not necessarily any slower.

For example, if you know that the format you are expecting is exactly something like:

Code:

some stuff
you dont care about
<tag>
value1
value2
</tag>
other stuff

then iterate over every line until you see the first line that is exactly "<tag>" (assuming the first one is what you're looking for), and save all lines until the first line that is exactly "</tag>". The problem comes when you don't know exactly what can be in it: "tag"s embedded within "tag"s or whatever.
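A minimal sketch of that line-by-line approach, assuming each tag sits alone on its own line as in the sample above. The name lines_between is just for illustration, and nothing here handles nested tags.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Collect the lines between the first line equal to open_tag and the next
// line equal to close_tag. Returns an empty vector if open_tag never
// appears; if close_tag is missing, everything up to the end is returned.
std::vector<std::string> lines_between(const std::string& text,
                                       const std::string& open_tag,
                                       const std::string& close_tag) {
    std::vector<std::string> values;
    std::istringstream input(text);
    std::string line;
    bool inside = false;
    while (std::getline(input, line)) {
        // Web servers often send "\r\n" line endings; drop the trailing '\r'.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (!inside) {
            inside = (line == open_tag);  // still searching for the opener
        } else if (line == close_tag) {
            break;                        // first closer ends the block
        } else {
            values.push_back(line);
        }
    }
    return values;
}
```

Run on the sample text above with "<tag>" and "</tag>", this yields the two strings "value1" and "value2".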

If you're restricted to C++ then of course you can't get around it by using a different language. If you aren't restricted to any language and you think you're more comfortable with something else, then use that. Others have suggested it can be done faster and in a more straightforward way using a higher-level language (I'm biased and would suggest Perl).

Interesting to note: many higher-level languages are implemented in C or C++, and their open source interpreters can be embedded in C/C++ programs. I can use Python, PHP, Lua (and probably Perl) to download and parse a website in very few lines from within C/C++. I never have though, since I've never found myself restricted and would simply switch languages instead. Plus that seems cryptic and inefficient. Just a note.

Haha, damn. I was curious if you used an open source library. I'm working on a higher level library for http "browsing" but haven't gotten around to parsing the HTML tree (nodes/etc); and I doubt I will. I'll probably just keep using Boost.Regex from time to time.

Does yours use raw sockets directly or Asio (Boost.Asio)? I ask because a while ago I had trouble getting Asio working with SSL websites, so I gave up and switched to libcurl. If I could find someone (Google didn't really provide much info) using Asio and SSL successfully it would give me the confidence to try again.

Internally, it uses raw sockets, but unfortunately it doesn't handle SSL either. It's on my TODO list, though...so I should have that up and running sometime within the next decade or so.