

because that's what screen scraping usually is -- theft of copyright material

There are countless legitimate uses of scrapers - I've written more than a dozen here at work in the last year, none of which steal copyrighted material, and all of which were the only option available for what we needed. Most of them scrape content from our own servers (e.g. pulling in dashboard data from a myriad of different monitoring utilities that don't provide alternative means of access such as SOAP or RSS). For the ones that reach out across the internet, I took great pains to make them as "friendly" as possible: they connect directly to what they need and nothing more, and all of them implement local caching, so at most I'm only pulling down the remote page once per hour (most are cached for a full 24 hours).
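For anyone wondering what that kind of local caching might look like, here is a minimal sketch in Python. It is not the poster's actual code: the cache directory, the `cached_fetch` and `http_get` names, and the one-hour default are all assumptions made for illustration. The idea is just to store each page on disk and only go back to the remote server when the stored copy is older than the limit.

```python
import hashlib
import time
import urllib.request
from pathlib import Path

CACHE_DIR = Path("scrape_cache")  # hypothetical cache location
MAX_AGE_SECONDS = 3600            # one hour; use 86400 for a 24-hour cache


def http_get(url):
    """Download a page body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def cached_fetch(url, max_age=MAX_AGE_SECONDS, fetch=http_get, cache_dir=CACHE_DIR):
    """Return the page body, contacting the remote server only if the cached copy is stale."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One cache file per URL, named by a hash so any URL maps to a safe filename.
    cache_file = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if cache_file.exists():
        age = time.time() - cache_file.stat().st_mtime
        if age < max_age:
            return cache_file.read_bytes()
    # Cache miss or stale copy: fetch once and store the result for later calls.
    body = fetch(url)
    cache_file.write_bytes(body)
    return body
```

Calling `cached_fetch(url)` repeatedly within the hour reads from disk, so the remote site sees at most one request per URL per hour. The `fetch` parameter is only there so the download step can be swapped out, for example when testing.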

bulldog's previous thread was about large databases for sale on the web, for example lyric databases of 500,000+, recipe databases and so on, and how people get the data for those databases... and i answered "they scrape them from other sites" ... and then he started this new thread

If you are "scraping" your own sites, then you would be better off writing a simple API instead. In most cases you would finish the code faster, and it would be a lot more efficient.
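As a rough illustration of what a "simple API" could mean here, the sketch below serves status data as JSON using only the Python standard library. It assumes you control the code on the server side; `collect_status`, the `/status.json` path, and the field names are hypothetical placeholders, not anything from the tools discussed above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_status():
    """Hypothetical stand-in for the data a dashboard would otherwise scrape."""
    return {"service": "example", "healthy": True, "queue_depth": 0}


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve exactly one endpoint; everything else is a 404.
        if self.path != "/status.json":
            self.send_error(404)
            return
        body = json.dumps(collect_status()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the example quiet; drop this override to get request logging.
        pass


def make_server(port=8000):
    """Build the server bound to localhost on the given port."""
    return HTTPServer(("127.0.0.1", port), StatusHandler)
```

Running `make_server().serve_forever()` would then let a client read `http://127.0.0.1:8000/status.json` and parse it with `json.loads`, with no HTML parsing involved.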

Well, now that we have the context for this thread, it sounds pretty shady.
