Here is a quick hack that I wrote. It's a Python library to search Google without using their API. It's quick and dirty, just the way I love it.

Why didn't I use the REST API that Google provides? Because its documentation says "you can only get up to 8 results in a single call and you can't go beyond the first 32 results". Seriously, what am I gonna do with just 32 results?

I wrote it because I want to automate various Google hacks, monitor the popularity of some keywords and sites, and use it for various other things.

One of my next posts is going to extend this library and build a tool that perfects your English. I have been using Google for a while to find the correct usage of various English idioms, phrases, and grammar. For example, "i am programmer" vs. "i am a programmer": the first one is missing the indefinite article "a", while the second is correct. Googling for these terms reveals that the first has 6,230 results but the second has 136,000 results, so I pretty much trust that the second is more correct than the first.
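The idea, as a quick sketch (I'm assuming here that the search object exposes the total hit count as num_results once the first page of results has been fetched):

from xgoogle.search import GoogleSearch

# compare Google hit counts for two phrasings of the same sentence;
# num_results is assumed to hold the total count after the first fetch
for phrase in ('"i am programmer"', '"i am a programmer"'):
    gs = GoogleSearch(phrase)
    gs.get_results()
    print "%s: %d results" % (phrase, gs.num_results)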

Subscribe to my posts via catonmat's rss, if you are intrigued and would love to receive my posts automatically!

The following code fragment sets up a search for "quick and dirty" and specifies that a result page should have 50 results. Then it calls get_results() to get a page of results. Finally, it prints the title, description, and URL of each search result.
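from xgoogle.search import GoogleSearch, SearchError
try:
    gs = GoogleSearch("quick and dirty")
    gs.results_per_page = 50
    results = gs.get_results()
    for res in results:
        print res.title.encode("utf8")
        print res.desc.encode("utf8")
        print res.url.encode("utf8")
        print
except SearchError, e:
    print "Search failed: %s" % e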

Here is the output from running this program:

Quick-and-dirty - Wikipedia, the free encyclopedia
Quick-and-dirty is a term used in reference to anything that is an easy way to implement a kludge. Its usage is popular among programmers, ...
http://en.wikipedia.org/wiki/Quick-and-dirty
Grammar Girl's Quick and Dirty Tips for Better Writing - Wikipedia ...
"Grammar Girl's Quick and Dirty Tips for Better Writing" is an educational podcast that was launched in July 2006 and the title of a print book that was ...Writing - 39k -
http://en.wikipedia.org/wiki/Grammar_Girl%27s_Quick_and_Dirty_Tips_for_Better_Writing
Quick & Dirty Tips :: Grammar Girl
Quick & Dirty Tips(tm) and related trademarks appearing on this website are the property of Mignon Fogarty, Inc. and Holtzbrinck Publishers Holdings, LLC. ...
http://grammar.quickanddirtytips.com/
[...]

You could also have specified which search page to start the search from. For example, the following code will get 25 results per page and start the search at the 2nd page. Compare its results to the output above.

gs=GoogleSearch("quick and dirty")gs.results_per_page=25gs.page=2results=gs.get_results()

You can also quickly write a scraper to get all the results for a given search term:

from xgoogle.search import GoogleSearch, SearchError

try:
    gs = GoogleSearch("quantum mechanics")
    gs.results_per_page = 100
    results = []
    while True:
        tmp = gs.get_results()
        if not tmp: # no more results were found
            break
        results.extend(tmp)
    # ... do something with all the results ...
except SearchError, e:
    print "Search failed: %s" % e

You can use this library to constantly monitor how your website ranks for a given search term. Suppose your website is at the domain "catonmat.net" and the search term you want to find your position for is "python videos".

Here is code that outputs your ranking (it looks through the first 100 results; if you need more, wrap it in a loop over result pages):

import re
from urlparse import urlparse
from xgoogle.search import GoogleSearch, SearchError

target_domain = "catonmat.net"
target_keyword = "python videos"

def mk_nice_domain(domain):
    """ convert domain into a nicer one (eg. www3.google.com into google.com) """
    domain = re.sub("^www(\d+)?\.", "", domain)
    # add more here
    return domain

gs = GoogleSearch(target_keyword)
gs.results_per_page = 100
results = gs.get_results()
for idx, res in enumerate(results):
    parsed = urlparse(res.url)
    domain = mk_nice_domain(parsed.netloc)
    if domain == target_domain:
        print "Ranking position %d for keyword '%s' on domain %s" % (idx+1, target_keyword, target_domain)

Output of this program:

Ranking position 6 for keyword 'python videos' on domain catonmat.net
Ranking position 7 for keyword 'python videos' on domain catonmat.net

Here is a more wicked example. It uses the GeoIP Python module to find which of the top 10 websites for the keyword "wicked code" are physically hosted in California or New York in the USA. Make sure you download the GeoLiteCity database from "http://www.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz" and extract it to "/usr/local/geo_ip".
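A minimal sketch of that example (I'm assuming the classic GeoIP bindings here: GeoIP.open() and record_by_name(), with the US state code in the record's 'region' field):

import GeoIP
from urlparse import urlparse
from xgoogle.search import GoogleSearch, SearchError

# assumes the extracted GeoLiteCity.dat lives in /usr/local/geo_ip
gi = GeoIP.open("/usr/local/geo_ip/GeoLiteCity.dat", GeoIP.GEOIP_STANDARD)

try:
    gs = GoogleSearch("wicked code")
    gs.results_per_page = 10
    for res in gs.get_results():
        domain = urlparse(res.url).netloc
        rec = gi.record_by_name(domain) # returns None if the lookup fails
        if rec and rec.get('country_code') == 'US' and rec.get('region') in ('CA', 'NY'):
            print "%s is hosted in %s, USA" % (res.url, rec['region'])
except SearchError, e:
    print "Search failed: %s" % e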

I am going to expand on this library and add search for Google Sets, Google Sponsored Links, Google Suggest, and perhaps some other Google searches. Then I'm going to build various tools on top of them, like a sponsored-links competitor finder, use Google Suggest together with Google Sets to find various phrases in English, and apply them to tens of my other ideas.

Jorge, it does. You've got to be careful; put a sleep between calls if you are doing a lot of scraping.
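Something like this between result pages usually does it (gs and results as in the scraper example above):

import time, random

while True:
    tmp = gs.get_results()
    if not tmp:
        break
    results.extend(tmp)
    time.sleep(random.uniform(10, 20)) # random pause so the requests look less robotic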

Daniel, me too. I have had this idea for a while as well. :)

Steve, can you tell me the query you used? Google sometimes displays 2 results from the same site (the 2nd usually indented to the right); that's normal behavior. One way to escape that is to keep a list or dict of seen URLs, then check if you have already seen the URL.
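For example, something like this:

seen_urls = set()
unique_results = []
for res in results:
    if res.url not in seen_urls: # skip the indented duplicate results
        seen_urls.add(res.url)
        unique_results.append(res)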

Can you tell me what corpora are? I am trying to find tutorials that would help me extract everything based on a query. For example, if I type "Programming in C++" in Python, it should generate all types of links. Please help me if you know.

Hi Peter, I was looking at your code. I have written a much simpler utility for searching Google. I used to use BeautifulSoup too, but now I use lxml and XPath. It produces much quicker and cleaner code. Here is an example that returns an array of the URLs and text:
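Roughly like this (the XPath selector is from memory, so adjust it to Google's current markup):

import urllib, urllib2
from lxml import etree as et

def google_search(query):
    url = "http://www.google.com/search?q=" + urllib.quote_plus(query)
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    results = urllib2.urlopen(req).read()
    tree = et.fromstring(results, et.HTMLParser())
    # each result link is assumed to be an <a> inside <h3 class="r">
    return [(a.get('href'), a.xpath('string()')) for a in tree.xpath('//h3[@class="r"]/a')]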

The code was just a snippet of some other code I have, but in regards to your points:

1. There are only 3 points where errors can creep in, as far as I can see:
1 - if the urlopen fails, or
2 - during htmlparser() if the HTML is super-malformed (same with BeautifulSoup),
3 - if Google changes their HTML format (but that will break almost any scraper).

The XPath and the rest of the code will work without problems, since an XPath query simply returns [] when it matches nothing.

4. If you change the line:
tree = et.fromstring(results, et.HTMLParser())
to:
tree = et.fromstring(results, et.HTMLParser(recover=True))
then lxml handles malformed HTML almost as well as BeautifulSoup.

Anyways, I enjoy your blog, and just thought that I'd throw that out there.

How would you recommend folding this into a script that uses a fixed Google query and fixed parameters and writes the output to a file? That way it could be used regularly without feeding it all the variables over and over... (Sorry if this is obvious to others out there!)

Hello,
Thank you for this nice library; it is very useful, I think. I am getting 503 errors, even though I sleep between search actions. I think this can be related to the agent setting. How can we set a custom browser agent in your code?

Doug, about bitbucket or github: Sure. I will. I just have to automate my tools more, to push out changes from my repo to bitbucket or github. I don't want to do anything manually. I haven't yet done this, but I soon will. At the moment the latest version is always at http://www.catonmat.net/download/xgoogle.zip.

Doug, about licensing: All my work is open source. You may use it any way you wish.

Hello,
thanks for a very nice lib.
I have added two more parameters, domain and hl.

However, changing the language parameter gives no results. It seems the HTML is slightly different when using e.g. hl=sv (Swedish). I have been poking at your code for some hours now, as this is my first time using Python. Would you have a solution for this? Even though google.com?hl=en is probably the most used version, I am interested in the local versions as well.

I've used this library for a variety of things, so I thought I would just pop in and say thanks for providing something that works well and is easy to use.

I just started writing a replacement Google library for my company's internal use. Like chad, I am a huge fan of lxml for a variety of reasons. I also would like to make the library a bit more "pythonic" in general by adding smart generators, so that you can iterate over results without worrying about what page they are on. I already wrap your SearchResult objects, so I will probably provide a class/function hook in the constructor that can yield instances of WhateverClass.
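As a sketch of what I mean (WhateverClass would plug into the wrap hook; the xgoogle calls are the ones shown in the post):

from xgoogle.search import GoogleSearch

def iter_results(query, per_page=100, wrap=lambda r: r):
    # lazily yield results one at a time, fetching new pages as needed
    gs = GoogleSearch(query)
    gs.results_per_page = per_page
    while True:
        page = gs.get_results()
        if not page:
            return
        for res in page:
            yield wrap(res)

for res in iter_results("python videos"):
    print res.title.encode('utf8')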

balcon: that 503 error is 99% likely to mean that you tried to scrape Google too quickly. The bare minimum time between searches is about 10 seconds if you're doing more than 5-6 requests.

The thing is that Google can claim it has 10 billion results, but in reality it will return only 1000 for any search. And if it thinks there are duplicates among those 1000, it will return even fewer. In this case it returned 618 results.

The strange thing is that I can view results through a browser on the same machine (same IP, etc.) with no problem. I am already masquerading the User-Agent to mimic Firefox. I am also collecting the cookies and feeding them back with the request.

Does anyone have an idea how Google might be telling the Python screen scraper and the browser apart? Perhaps some other header in the request?

I would be very grateful for anybody's thoughts and experiences on this.

Thanks for sharing this code. I was wondering, is there a particular reason why you package the BeautifulSoup module inside xgoogle?

Recently I ran into trouble using a new version of BeautifulSoup with soup2text (http://svn.tools.ietf.org/svn/tools/ietfdb/sprint/73/rjs/ietf/utils/soup2text.py), and using an older version fixed the problem.

Did you encounter a similar situation, and if so, do you know which changes in BeautifulSoup broke your code?

Yes, I encountered a similar situation. I package BeautifulSoup with my code because it's the most stable version I have ever used. The new BS uses a different parsing engine, and in my tests it would throw unexpected errors such as EncodeError, IndexError and others, while the old one parses everything just fine.

Looks awesome, very thorough. In my script I wrote a simple class with a static search method to grab result links from the page source... all I really needed at the time. Though this will definitely be useful to me in the future. Nicely done.

Your example code in the readme file works great in the interactive interpreter but fails when I put it in a file.
This is the code:
>>> from xgoogle.search import GoogleSearch
>>> gs = GoogleSearch("catonmat")
>>> gs.results_per_page = 25
>>> results = gs.get_results()
>>> for res in results:
... print res.title.encode('utf8')

"Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google."

If every website adopted this policy, Google themselves would be out of business tomorrow, or maybe left competing with DMOZ. There probably exists no company that sends out more automated queries than Google. It's not exactly the height of hypocrisy, because they do respect robots.txt, and if you don't want to be there, you don't have to be. However, we all know that if you can't be found on Google, you don't exist. Descartes famously said, "Cogito ergo sum": I think, therefore I am. Today he would say, "Above the fold on Google, ergo sum."

Yes, Google redesigned the structure of the search-results count that sits at the top right of the search page.
Google used to return the first and last result numbers of all the search results in plain text, but now these numbers are generated by a script.

These changes break xgoogle's function _extract_info(self, soup) in the file search.py.

The crude solution I took is the following: change the function get_results(self) in the file search.py.

Hi. Xgoogle is currently broken and I don't have enthusiasm to fix it. Someone pasted a patch above but it's not indented right and again I don't have enough enthusiasm to figure out how to indent it right.

Well, I am trying out the correctly tabbed patch that Andrew posted now. My command line test worked (Andrew, I failed to post the rest of the commands that showed my work). Since Goo has me pegged already, I am very careful about how I run my program now.

Would you by chance need a hand with the development of this library? I am semi-skilled in Python (self-taught) and can offer help to keep it afloat. If you need web hosting, I can offer that for free also :D I have two dedicated Linux servers at my disposal.

It's a good library, and I am in the process of developing some simple SEO tools with it for a company my web design team works for.

I'm also getting 0 results. The patch above seems to just comment out the error checking. Can anyone fix this for the new format? (The new format says "About XXXXX results" instead of "Results X-XX of about XXXXX results".)

Hi, thanks for this. It's amazing and works perfectly for me. Any idea how you would construct the URL in class GoogleSearch in search.py to do a Google Image or Google News search? Is it possible? Sorry if this is obvious; I am a Python newbie.

Very nice program! I'm finding Google hates being scraped unless I put in a huge delay. I need to thoroughly sift a single domain for tens of thousands of pages of data. This is going to take weeks at this rate. Does anyone know where I can get or buy archived search data, so I could sort it locally without the lag and the terms-of-service issues?

Is there a way to handle the new Google UI?
I'm looking for a way to count the number of pages indexed for my website in Google, and this library is awesome.
The fix that the person gave above just removes all of this... any plan on fixing the regexp?
I've tried all I could, but I'm very bad at regexps...

It's working, but you have to change this part in search.py:
=====
matches = re.search(r'%s (\d+) - (\d+) %s (?:%s )?(\d+)' % self._re_search_strings, txt, re.U)
if not matches:
    return empty_info
return {'from': int(matches.group(1)), 'to': int(matches.group(2)), 'total': int(matches.group(3))}
=====
because the regex above doesn't suit the current Google template. I don't post mine because it is only for parsing all the results of site:blabla.com. I made a quick fix for this.

Hi, I think Google is blocking the search... every time I run the programs google_fight or google_fight2, they answer 0 for all words. Can someone tell me what is happening and how I can resolve this?

Hello!
This piece looks very nice. However, a few people have asked about language and domain restrictions. I know the domain can be specified in the query itself, like "%s site:.edu" % word; however, this is not possible for the language, since lr=? is defined elsewhere. That would be handy, don't you think? How could this be done?

I had this problem the first time I tried to run the library. It turned out I was calling the library from the wrong folder, because I didn't realise the xgoogle folder that holds the library was inside another folder, also called xgoogle.

Hi Peter, I am new to Python. Can you tell me how to make this work on my Python instance? I thought it was something like 'python setup.py install'; however, there is no setup.py in your zip file.

Hello,
I was running into issues calculating the number of search results (num_results) returned from Google, as they changed their layout and formatting. The following is one solution I found to the problem: I edited the regular expression to accept the new format, along with other small tweaks.

Hello, you did a great job and your code really helped me. For the results number I have a very dirty solution that works with the current Google version, and that might be improved by someone with regexp skills.
Just add this function (adapted from the previous _extract_info):
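Something along these lines (the resultStats id is the one Google uses at the moment; verify it against the live page):

def _extract_info(self, soup):
    empty_info = {'from': 0, 'to': 0, 'total': 0}
    # Google now prints "About 1,234,000 results" in a div with id
    # "resultStats"; check the live HTML if this id has changed again
    stats = soup.find('div', attrs={'id': 'resultStats'})
    if not stats:
        return empty_info
    txt = ''.join(stats.findAll(text=True)).replace(',', '')
    matches = re.search(r'About (\d+) results', txt, re.U)
    if not matches:
        return empty_info
    # the new format only exposes the total, so 'from' and 'to' stay 0
    return {'from': 0, 'to': 0, 'total': int(matches.group(1))}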

Hi, I applied all the patches and num_results still equals zero. Could you send your code to my mail (googcheng at gmail.com) or post the website where you host it? I hope you can help; I want to calculate the PMI.

After applying all the patches mentioned above, I got search.py working, but I don't think it is stable. Sometimes it gives me 10 results from the Google result page, but sometimes it gives me just two reference entries.

In search.py in the xgoogle folder, look for the two methods named _extract_results and _extract_description.
Change the assignments of results and desc_div to the new ones given. The old code is commented out and the new code follows. Thanks securda!!! This works like a charm. :)
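The gist of the change is something like this (both the old and the new class names here are illustrative; verify them against Google's live HTML):

# in _extract_results:
#results = soup.findAll('p', {'class': 'g'})    # old
results = soup.findAll('li', {'class': 'g'})    # new

# in _extract_description:
#desc_div = result.find('div', {'class': 's'})  # old
desc_div = result.find('span', {'class': 'st'}) # new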

Same problem. Google has probably changed the format of their output to break third-party solutions and to force people to use their front-end.

The people behind Google don't offer a library for their search engine, yet they offer a lot of libraries for their other applications, like Calendar, Maps, and videos. Google does that to collect as much data as they can.

In the end, Google offers no solution: we don't have the source code, so we can't own the application. In Google's view we are just users who hand over more information :S

After some searches Google redirects me to their captcha page, probably because no cookies are set... Does anyone have a solution?
If not, I will code a small search tool using scoogle; I think that should work without cookies, and the HTML response is much easier to parse ;)

How do I change the search country? It seems like the basic search returns New York data. I want to be able to choose the state or country. Could someone update this code to handle that?
from xgoogle.search import GoogleSearch, SearchError
try:
    gs = GoogleSearch("quick and dirty")
    gs.results_per_page = 50
    results = gs.get_results()
    for res in results:
        print res.title.encode("utf8")
        print res.desc.encode("utf8")
        print res.url.encode("utf8")
        print
except SearchError, e:
    print "Search failed: %s" % e

I hope that someone fixes the xgoogle problem; I still get 0 results from Google search. Could anybody help me calculate the Google distance similarity, or fix the problem of getting 0 results?
Thanks.

It seems that Spiros fixed the problem. However, I've never before been in the position of needing a nightly build of Python, and when I look for one now, I don't find anything.

How does one apply (or even find) patches like that indicated in issue 16119?

Even if I figure out how to fix this (maybe by just rolling my installation back to 2.7.2), my guess, based on a) a previous Django exercise (which I'm trying to double-check with xgoogle) and b) the most recent posts above, is that Google is now able to detect programmatic access and doesn't return any results, in order to discourage (prevent) it.

I downloaded the code. First I chmod 755'd english.py and got an error at line 24: from xgoogle.search import GoogleSearch, SearchError, ParseError. Then I went into the xgoogle folder and also chmod 755'd search.py, but I still get the same error. What else can I do?

I'm trying to use this lib to search for PDFs on the internet. The problem I'm having is that if I search for, e.g., "Medicine:pdf", the first page it returns to me is not the first page Google returns, i.e., if I actually use Google... I don't know what's wrong.

I am seeing the same thing just using the first example above ("quick and dirty"). If I type it into Google I get one result, but when I run the script I get different results. Is there a way to get the same results as a Google search returns?

Hi, I tried xgoogle with a list of words to get the number of hits (the only thing I really need; the number of hits might be used as a proxy for word frequency, as proposed by Grefenstette and Nioche (2000)). I am experiencing many problems, mainly because my word lists contain about 250 words. Google returns nothing: xgoogle gives me a list of all zero values... Any idea? I have code for Bing frequencies, but those are weird, especially when I want frequency estimates for "small" languages such as Croatian or Slovak.

Hi, can you please let me know if it is possible to modify the code to set the different Google domains as a parameter and return rankings from them? E.g., google.com returns different results from google.co.uk, etc. Also, why do I get very different results if I run a search on google.com from a web browser?