Partial web page downloading

This is a discussion on Partial web page downloading within the C++ Programming forums, part of the General Programming Boards category; I'm trying to download part of the html code of a web site. I have attempted to use the libcurl ...

I'm trying to download part of the HTML code of a web site. I have attempted to use the libcurl library and have managed to achieve this for some web pages using the range option. However, I cannot get it to work with the web page I actually need; it would seem that the server does not support ranges. Does anyone know any way of getting round this?

The part of the web page I want to download is at the beginning of the file. If it's not possible to get round the range issue, is it possible to terminate the download after a specified number of bytes have been received? I have looked in the libcurl library but cannot determine how to do this.

> it would seem that the server does not support ranges
IIRC, ranges appeared in a later spec for HTTP (byte ranges came in with HTTP/1.1), so check the version number returned by the server, and check in which RFC (www.rfc-editor.org) ranges were added to the HTTP protocol.

From further research I have come to the conclusion that the server I am trying to download from doesn't support ranges.

For this reason I am now focusing on trying to download only part of the web page from the beginning, for instance the first 1024 bytes, or downloading one byte at a time, analysing it, and then telling the program whether or not to download the next byte. However I have had no success. If someone could suggest a method, or a library to use, the help would be much appreciated.

I agree I am stuck because the server will not allow me to control the download from its end. But is there not some way to detect what I've downloaded as it arrives, and then kill the connection when I've received the data that I want?

An example of what I am aiming for: when downloading a file, the data is received sequentially from the beginning. If your connection drops before the whole file has been downloaded, you only get part of it. Is there a way to do this in a controlled manner, i.e. to terminate the connection after a certain number of bytes, or a certain series of characters, has been received?

So how big are these files you're downloading, and over what kind of connection?

A few KB, or tens of KB, over broadband just isn't worth the effort: just take what you need and discard the rest until the end of the stream. That processing isn't going to take up much CPU time or network bandwidth.

A few MB over dialup would be another matter, if all you wanted was the head of the file.

Whilst I'm sure you can unilaterally decide to close the connection, you have to consider that this might have a negative impact on the server at the other end (say waiting for a connection to timeout after a long period rather than closing it cleanly after a short period). Consider that if you do this a lot that the site may regard this as a denial of service attack on your part.

The network is robust enough that connections can drop for random network breakages, but to exploit that for your own sense of urgency is probably not on.

I would suggest you read the RFC in detail to find out what is and is not allowed within the normal HTTP protocol.

Function pointer that should match the following prototype:

Code:

size_t function( void *ptr, size_t size, size_t nmemb, void *stream);

This function gets called by libcurl as soon as there is data received that needs to be saved. The size of the data pointed to by ptr is size multiplied by nmemb; it will not be zero-terminated. Return the number of bytes actually taken care of. If that amount differs from the amount passed to your function, it'll signal an error to the library and it will abort the transfer and return CURLE_WRITE_ERROR.

> Whilst I'm sure you can unilaterally decide to close the connection, you have to consider that this might have a negative impact on the server at the other end (say waiting for a connection to timeout after a long period rather than closing it cleanly after a short period). Consider that if you do this a lot that the site may regard this as a denial of service attack on your part.

I press the Stop button all the time. I doubt that any HTTP server is going to have a problem with stopping partially complete downloads, given that this functionality is built into every browser.

Salem, I understand what you mean about cutting the connection possibly having a negative impact on the server, especially if I did it many times in a row, and I certainly don't want to do that. However, as Anonytmouse says, all web browsers have the ability to stop connections built in, and the ability to click on a different link before all the data has been received from the server. In any case, is it not possible to let the server send the data but not actually receive it all at my end (while telling the server I did), or to tell it to stop sending part way through the transfer, thus avoiding a timeout?

Anonytmouse, thank you for the information on the libcurl CURLOPT_WRITEFUNCTION option; however, I could not get it to work. In the end I decided to try downloading the file using sockets, to allow me to be more flexible, and so wrote the following code, but I cannot get it to work either.

If I try www.cprogramming.com/begin.html I get 404 Not Found
If I try www.google.co.uk/index.html I get an error from the recv() part of the code: 10053, which is "Software caused connection abort. A connection was aborted by the software in your machine, possibly due to a TCP/IP configuration error, data transmission time-out or protocol error."

Note that each line is terminated with a CR-LF and that the request is terminated with a blank line. Although a host field was not required with HTTP1.0, ip addresses which house multiple web sites will fail without it.

Anonytmouse, again thank you for the response; the code now works well and allows me to download a web page.

However, any help from anybody on achieving my initial aim would be much appreciated: to download part of the beginning of a web page that is hosted on a server which doesn't support ranges.

From the code it does seem possible to download, say, the first 1KB of a web page by adding the following code to the while loop.

Code:

if (y >= 1024)
break;

However, would this cause the server to time out, as Salem suggested previously? Instead of just closing the connection part way through the transfer, is it possible to send a message to the server telling it to close the connection before it has sent all the data, or to send a new GET message that overrides the previous one?

I have been reading about persistent connections and the "Connection: close" header, so it does seem possible to tell a server to close the connection. Though this is not exactly what I am trying to achieve, would it be possible to use this, or something similar, somehow?

I agree with anonytmouse. Just close the connection on your side. The server will cope.

You cannot send any message to the server that will make it stop sending midway through the document. The Connection header is a hint about what to do with the connection once the current request is completed: whether the server should close it, or should keep it open waiting for another request.