"plork123" <> wrote in message
news:...
> hi all ,
>
> can anyone give me an example of screen scraping a website and return
> the result e.g. html as a string
>
> or lead me to some tutorials / examples

Typically, for websites, you don't need to do screen scraping. It's
much easier to just parse the HTML data.
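
As a rough illustration of what parsing the HTML can mean in Java, here is
a minimal sketch that uses the lenient HTML parser bundled with the JDK
(javax.swing.text.html) to list the link targets in a page. The class
name LinkLister and the method listLinks are made up for this example,
and the sample markup in main is invented.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, for illustration only.
final class LinkLister {

    /** Returns the href value of every <a> start tag, in document order. */
    static List<String> listLinks(String html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // true = ignore any charset declared in the markup; the Reader
        // has already decoded the bytes into characters.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Invented sample page.
        String html = "<html><body><a href=\"a.html\">A</a> "
                + "<a href=\"b.html\">B</a></body></html>";
        System.out.println(listLinks(html));
    }
}
```

The callback sees tags as the parser meets them, so you never build a
full document tree. How well this copes with badly broken markup depends
on the page, so treat it as a starting point.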

On 29 Sep 2005 09:30:09 -0700, "plork123" <> wrote
or quoted :
>can anyone give me an example of screen scraping a website and return
>the result e.g. html as a string
Here is a rough and ready screen scraper I wrote to pluck exchange
rates off the Oanda site. They sicced the legal people on me so I had
to quickly desist using it. However, you could use it as a model.

This is not beautiful code. It is just to show proof of concept. It
uses unnecessarily low-level socket code. It was easier at the time
than trying to decipher how Sun's classes worked.

/**
 * Used to read until EOF on an InputStream that
 * sometimes returns 0 bytes because data have
 * not arrived yet. Does not close the stream.
 *
 * @param is InputStream to read from.
 * @param estimatedLength
 *        Estimated number of bytes that will be read.
 *        -1 or 0 mean you have no idea. Best to make
 *        some sort of guess a little on the high side.
 * @return String representing the contents of the entire
 *         stream.
 */
public static String readEverything( InputStream is, int estimatedLength ) throws IOException
    {
    if ( estimatedLength <= 0 )
        {
        estimatedLength = 10*1024;
        }

/**
 * Reads exactly len bytes from the input stream
 * into the byte array. This method reads repeatedly from the
 * underlying stream until all the bytes are read.
 * InputStream.read is often documented to block like this, but in
 * actuality it does not always do so, and returns early with just
 * a few bytes.
 * readBlocking blocks until all the bytes are read,
 * the end of the stream is detected,
 * or an exception is thrown. You will always get as many bytes as
 * you asked for unless you get an EOF or other exception.
 * Unlike readFully, you find out how many bytes you did get.
 *
 * @param b the buffer into which the data is read.
 * @param off the start offset of the data in the array,
 *        not offset into the file!
 * @param len the number of bytes to read.
 * @return number of bytes actually read.
 * @exception IOException if an I/O error occurs.
 */
public static final int readBlocking( InputStream in, byte[] b, int off, int len ) throws IOException
    {
    int totalBytesRead = 0;
    while ( totalBytesRead < len )
        {
        int bytesRead = in.read( b, off + totalBytesRead, len - totalBytesRead );
        if ( bytesRead < 0 )
            {
            break;
            }
        if ( bytesRead == 0 )
            {
            try
                {
                // no data for now
                // wait a while before trying again to see if data has arrived.
                // avoid hogging cpu in a tight loop
                Thread.sleep( 100 );
                }
            catch ( InterruptedException e )
                {
                Thread.currentThread().interrupt();
                }
            }
        else
            {
            totalBytesRead += bytesRead;
            }
        }
    return totalBytesRead;
    }
// end readBlocking
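
The body of readEverything is cut off in the listing above. As a rough
sketch only, and not the original code, this is one way to read a whole
stream into a String and to fetch a page through java.net.URL instead of
raw sockets. The names PageFetcher, readAll and fetch, the 10-second
timeouts, and the assumption that the page is UTF-8 are all mine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class PageFetcher {

    /** Reads the stream until EOF and decodes the bytes. Does not close the stream. */
    static String readAll(InputStream is, int estimatedLength, Charset charset)
            throws IOException {
        if (estimatedLength <= 0) {
            estimatedLength = 10 * 1024; // same fallback guess as the listing above
        }
        ByteArrayOutputStream collected = new ByteArrayOutputStream(estimatedLength);
        byte[] chunk = new byte[4096];
        int n;
        // read returns -1 at EOF; a return of 0 simply adds nothing and retries.
        while ((n = is.read(chunk, 0, chunk.length)) >= 0) {
            collected.write(chunk, 0, n);
        }
        return new String(collected.toByteArray(), charset);
    }

    /** Fetches the resource at url and returns its body as a String. */
    static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(10_000); // assumed timeouts, in milliseconds
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            // getContentLength is -1 when the server does not say.
            // Assumption: the page is UTF-8; a real scraper should check
            // the Content-Type header.
            return readAll(in, conn.getContentLength(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo so the sketch runs without touching the network.
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page), -1, StandardCharsets.UTF_8));
    }
}
```

To scrape a live page you would call fetch with the page's URL and then
pick the resulting String apart.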

I think the term has come to be used more generically than the original
literal meaning of grabbing chars out of the regen buffer. When
people say screen scraping today, what they mean is extracting data
from web pages designed for humans to read.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Again taking new Java programming contracts.

"plork123" <> wrote in message
news:...
> Thanks for your reply
>> It's much easier to just parse the HTML data.
>
> Do you have an example/tutorial of how this is done?

I think Roedy posted an example in another branch of this thread (which
I presume you saw since you replied to it). Otherwise, general questions
that need to be answered include:

Does the structure (as opposed to the content) of the HTML data change
regularly?
Can you be assured that what you get is valid HTML?
Is the author of the site actively trying to discourage "screen
scraping"?

On Fri, 30 Sep 2005 14:57:54 GMT, "Oliver Wong" <>
wrote or quoted :
> Can you be assured that what you get is valid HTML?
This is a biggy. A parser can get blown out of the water pretty easily.
So in practice you use indexOf to look for begin/end marker strings,
and strip out what does not matter. Usually the data part you want is
pretty regular, often machine generated.
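
A minimal sketch of that indexOf approach, assuming the value you want
sits between two marker strings you have picked out by looking at the
page source. The class MarkerScraper, its methods, and the sample markup
are invented for illustration.

```java
// Hypothetical helper, for illustration only.
final class MarkerScraper {

    /**
     * Returns the text between the first occurrence of begin and the
     * next occurrence of end after it, or null if either marker is
     * missing (for example because the site changed its layout).
     */
    static String between(String page, String begin, String end) {
        int start = page.indexOf(begin);
        if (start < 0) {
            return null;
        }
        start += begin.length();
        int stop = page.indexOf(end, start);
        if (stop < 0) {
            return null;
        }
        return page.substring(start, stop);
    }

    /** Crude clean-up: drops anything that looks like a tag, then trims. */
    static String stripTags(String fragment) {
        return fragment.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        // Invented sample markup.
        String page = "<html><td class=\"rate\"> <b>1.2345</b> </td></html>";
        String cell = between(page, "<td class=\"rate\">", "</td>");
        System.out.println(cell == null ? "markers not found" : stripTags(cell));
    }
}
```

Returning null when a marker is missing gives the caller a cheap way to
notice that the page layout has changed.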

You are always at the mercy of the provider. They can make changes at
any time, perhaps intentionally to foil you.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Again taking new Java programming contracts.
