The third script - the one I am having problems with - reads in this text file and searches the first ten pages of Google results for email addresses belonging to each domain, then writes each unique email address to a text file.
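(For reference, the de-duplication part of such a script - pulling addresses for one domain out of fetched HTML and keeping only the unique ones - can be sketched roughly as below; the regex, file name, and function are purely illustrative, not necessarily what the original script uses:)

import re

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def unique_emails(html_pages, domain):
    """Return the set of unique addresses ending in @domain found in the pages."""
    found = set()
    for html in html_pages:
        for address in EMAIL_RE.findall(html):
            if address.lower().endswith("@" + domain.lower()):
                found.add(address.lower())
    return found

with open("emails.txt", "w") as out:
    for address in sorted(unique_emails(["mailto:info@example.com"], "example.com")):
        out.write(address + "\n")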

This used to work just fine, but recently it has started failing with HTTP error 503 (Service Unavailable) in the traceback.

You should not be scraping Google search results pages. Instead, you should be using Google Search APIs. The 503 error you're getting is probably due to Google detecting that the requests are being made by a script and refusing to service them.
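For example, the Custom Search JSON API returns results as plain JSON over HTTPS. A minimal sketch, assuming the requests library (the API key and search engine ID below are placeholders you would create in your own Google account):

import requests

API_KEY = "YOUR_API_KEY"          # placeholder - created in the Google API console
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # placeholder - Custom Search Engine ID

def search_page(query, start=1):
    """Fetch one page (up to 10 results) from the Custom Search JSON API."""
    params = {
        "key": API_KEY,
        "cx": SEARCH_ENGINE_ID,
        "q": query,
        "start": start,  # 1-based index of the first result on this page
    }
    response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
    response.raise_for_status()
    return response.json().get("items", [])

for item in search_page('"@example.com"'):
    print(item["link"])

The API has its own quota, but requests made through it are served rather than blocked as scraping.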

sheffieldlad wrote:I am aware of the ethics and in this case I believe them to be one sided.

Google is the biggest web scraper in the world and I have never heard of them paying for scraping information from websites.

You are not paying for scraping but for the use of Google's search engine and database on an "industrial" scale, i.e. going beyond what is expected of an average (human) user. You're using bandwidth and putting a load on Google's servers. This is not the same as scraping static content -- you are using a service not just accessing web pages. It is Google's prerogative to charge you for that if they want.

When it comes to ethics of scraping, there is one golden rule: respect the robots.txt for the site. All Google's scrapers follow that rule, so there is nothing one-sided about it. If you examine Google's robots.txt, you will see that searching is explicitly prohibited:

Google's robots.txt wrote:
User-agent: *
Disallow: /search

Ethics aside, this is about what Google lets you get away with. You can try introducing a sizable delay (of a few seconds) between making the requests -- that might fool Google into believing that the requests are being made by a human (obviously, that would also mean waiting much longer for your results).
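A rough sketch of such a delay, assuming the requests library (the exact delay range is just a guess at what might be acceptable):

import random
import time

import requests

def polite_get(url, min_delay=3.0, max_delay=8.0):
    """Wait a few (randomised) seconds before fetching, so requests arrive at a human-ish pace."""
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url)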

Thanks for your reply.

There are delays built into the script - anywhere between 1 and 35 seconds. I guess you may be right and Google has smelt a script, but then again I would expect captchas for legitimate searches from my IP, which I don't get. Ethics aside, does anyone have any advice on how to handle errors in Python?

sheffieldlad wrote: Does anyone have any advice on how to handle errors in Python?

In Python, you handle errors by catching the exception and dealing with it. For HTTPError this may often involve simply logging it and moving on to the next request. In your case, that won't be of much help, since once you get a 503, subsequent requests are going to return the same thing. So you might want to just terminate the script (or move on to the next part with whatever results you already have).
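A minimal sketch of that pattern, assuming the requests library is doing the fetching (with urllib it would be urllib.error.HTTPError instead):

import logging
import sys

import requests

logging.basicConfig(level=logging.INFO)

def fetch(url):
    """Fetch a URL; log HTTP errors and stop the run on a 503."""
    try:
        response = requests.get(url)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
        return response.text
    except requests.exceptions.HTTPError as err:
        logging.error("Request for %s failed: %s", url, err)
        if err.response is not None and err.response.status_code == 503:
            # Once the server starts returning 503s, later requests will get
            # the same answer, so there is little point in carrying on.
            sys.exit("Giving up: the server is refusing further requests (503).")
        return None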

If by that question you mean "how do I get around this particular error and get Google to give me the results?", then, if the delays aren't helping, you could try using mechanize to emulate a browser more fully (rather than just faking the user-agent header).
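A hedged sketch of what that might look like with mechanize (note that set_handle_robots(False) is exactly the robots.txt issue discussed above, so use it knowingly):

import mechanize

browser = mechanize.Browser()
browser.set_handle_robots(False)   # ignore robots.txt - see the discussion above
browser.addheaders = [("User-Agent",
                       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")]

response = browser.open("https://www.google.com/search?q=example")
html = response.read()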

micseydel wrote:One way scraping is detected is when you don't modify the user-agent in your requests.
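(As an illustration of that point, a sketch with the requests library - its default User-Agent, "python-requests/x.y.z", is an immediate giveaway that a script is making the request:)

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get("https://www.google.com/search?q=example", headers=headers)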

Thanks.

I have managed to get around the problem for the time being by minimizing the number of requests I'm sending and by adding looooong delays between requests. I don't need the first 10 pages of results; between 3 and 5 usually gets me the information I need. I do need to come up with a more permanent solution, but what I'm doing isn't a long-term thing. I don't intend to scrape Google forever; it's a means to an end, but I need to get on top of my code just to satisfy my own mind and hopefully to learn.

Since my last post I have introduced code to inform the user what is happening and to handle errors gracefully, which is something I wasn't sure how to do before.

I would like to take my little project further, but soon I won't have a real need for it (apart from the learning aspect), and there are other things I would enjoy coding a lot more.