When scraping the Google search engine, we need to be careful that
Google doesn't detect our automated tool as a bot; otherwise it will
redirect us to a captcha page, where we'll need to enter the captcha in
order to continue. We don't want that, since Google will then block us
and we won't be able to perform any more searches without entering the
captcha. And we certainly won't take the time to investigate whether the
Google captcha can be broken, so that we could automatically send
captcha strings to the server to unblock ourselves. We just need to be
careful enough not to overdo it.

GGGoogleScan

GGGoogleScan is a Google scraper that performs automated searches and
returns the results of search queries in the form of URLs or hostnames.
Datamining Google's search index is useful for many applications.
Despite this, Google makes it difficult for researchers to perform
automatic search queries. The aim of GGGoogleScan is to make automated
searches possible by avoiding the search activity that is detected as
bot behaviour [1]. Basically, we can enumerate hostnames and URLs with
the GGGoogleScan tool, which can prove a valuable resource later on.

This tool has a number of ways to avoid being detected as a bot. One of
them is horizontal searching, where we search for multiple search words
in parallel without requesting the contents of, say, the first 50
results found by each query. Instead of digging deep into any single
query, we make a large number of search queries, save the results, and
request only a small number of the web pages found by the scan.

When we first run the tool without any arguments, it displays its help
page listing the available options. A rough synopsis, based on the
options we'll use in this article (the exact help text may differ
between versions), looks like this:
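
gggooglescan [-d pages] [-s seconds] [-c country] [-x proxy] [-l logfile] [-e wordlist | -i wordlist] <query>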

We can see that there are a number of options we can use with this tool.
Usually, we want to use the -d option to fetch the specified number of
result pages. That number shouldn't be too large, so that we stay in
horizontal search mode and aren't detected as a bot. We can also use the
-s option to sleep the specified number of seconds between requests,
which further hides our activity. The options -e and -i can be used to
specify input files containing a whole or a part of the search query,
respectively. In the end we can use a command like this:

# gggooglescan -l output.log -d 1 -e wordlist "test"

The wordlist file is provided with GGGoogleScan by default and contains
97,070 arbitrary words. The above invocation will save all the found
URLs in the output.log file and will search with queries like the ones
presented below:
- test entry1
- test entry2
- test ...
- test entryN

Where entry1, entry2, ..., entryN are the line entries in the wordlist
file. One of the queries issued by the above command was the
"test aaliyah's" query, which returned the following results:

But how can we be sure that the tool returns the right results? We can
simply run the same query in the Google search engine ourselves. We can
see the results of the same query in the picture below:

We can see that the links in the picture are the same as those obtained
with the GGGoogleScan tool. Now we're ready to use some of the more
advanced functions of the GGGoogleScan tool. We can restrict the results
by country by using the -c option the GGGoogleScan tool provides. So if
we wanted to search only for sites in the United Kingdom, we would form
the search as follows (notice the -c uk option added to our earlier
command):
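
# gggooglescan -l output.log -d 1 -e wordlist -c uk "test"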

We can also use the -x option to go through a proxy if we need one. Keep
in mind that the format of the -x option is the same as that of the curl
command-line program, which can transfer data to and from a server using
a number of supported protocols, including HTTP and HTTPS.
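
For example, to route the requests through a local HTTP proxy (a
hypothetical one listening on 127.0.0.1:8080, given in curl's
host:port notation):

# gggooglescan -l output.log -d 1 -x http://127.0.0.1:8080 "test"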

If we would like to get the first 100 links of some web page, we can
fetch ten result pages, since Google returns ten results per page by
default, with a command like the following:
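
# gggooglescan -l output.log -d 10 "site:www.target.com"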

We can do more than that: we can download a list of Google dorks and
scan with those. What we gain is automatic enumeration of the hostnames
and URLs that are detected with one of the Google dorks on a specific
site. But first we must obtain a list of Google dorks. We know that
Google dorks are published on a webpage like
http://www.exploit-db.com/google-dorks/, but the site doesn't provide a
button to download them all; we would need to go through the pages one
by one and download them by hand, or write a script that does it for us.
But of course, we don't want to go to all that trouble, since there's an
easier way. We can download the Search Diggity tool.
After downloading and installing the tool, we can use it to do basically
the same thing, except that we're limited by the Google API, which
doesn't allow many search requests per day. But when we're doing a
penetration test, we would like to check all Google dorks at once for
the new customer. With the use of GGGoogleScan this is possible, if
we're careful. To obtain the list of Google dorks, we can go to the
C:\Program Files\Stach & Liu\SearchDiggity folder and copy the
"Google Queries.txt" file from there. This file contains all the Google
dorks that SearchDiggity uses to do its job. The Google dorks in this
file are represented with lines like these (illustrative entries in the
same format, not copied verbatim from the file):
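
GoogleHacking;;Files Containing Passwords;;filetype:sql "insert into" pass
GoogleHacking;;Advisories and Vulnerabilities;;inurl:backup intitle:index.of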

In order to understand that list, we also need to view the Queries menu
of the SearchDiggity tool. That menu is presented in the picture below:

We can immediately see how the "Google Queries.txt" file is parsed to
produce that menu: first comes the database name, followed by the ';;'
separator, then the category, and finally the actual Google dork for
that category (again separated from its category with ';;'). We can use
this knowledge to quickly throw away the database name and the category,
which leaves us with just the Google dorks that we can use as input to
the GGGoogleScan tool.

To parse the "Google Queries.txt" file, let's first rename it to
queries.txt for easier manipulation. To quickly grab only the Google
queries, we can split each line on the ';;' separator, take the last
column, and save it into the file queries2.txt:
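
Something along these lines should do the job (using awk with ';;' as
the field separator and keeping only the last field of each line):

# awk -F';;' '{print $NF}' queries.txt > queries2.txt

We can then feed the resulting dork list to GGGoogleScan against the
target with a command along these lines:

# gggooglescan -l output.log -d 1 -e queries2.txt "site:www.target.com"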

The reason this works without getting us blocked too quickly is that
www.target.com usually doesn't return any pages for most of the queries
submitted to the Google search engine. This is because the search
queries are too specific to particular environments to return any
results, and the target can't be using all of the existing technologies,
just some specific ones. An example of such a search query (a
hypothetical one, for illustration) is:
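
intitle:"Adva Guestbook" inurl:guestbook.php "powered by Adva"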

Is the above query really going to return any results? Probably not,
because it's way too complicated and way too specific to the Adva
guestbook. But if www.target.com actually uses the Adva guestbook, it
can certainly return some results, which are more than welcome.

With the above command we won't know which query turned up a certain
hostname or URL, but that doesn't really matter. We have the URL, which
we can enter into the web browser to inspect it, and most of the time it
will be immediately clear what the problem is; therefore, we don't need
to know the search query that was used, since the whole point is to find
a vulnerability or an inconsistency.

We can also search the www.target.com site for any common extensions. A
list of common extensions is provided below (an illustrative set,
written with Google's filetype: operator so that each generated query
restricts results to one extension):
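- filetype:php
- filetype:asp
- filetype:aspx
- filetype:jsp
- filetype:cfm
- filetype:txt
- filetype:log
- filetype:sql
- filetype:bak
- filetype:xls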

We can save these entries in the ext.txt file and then run a command
like the following, which will get all the URLs of resources with any of
the above extensions:

# gggooglescan -l output.log -d 10 -e ext.txt "site:www.target.com"

There's only one problem with GGGoogleScan: it doesn't tell you when
you've been blocked, so you don't actually know whether you've been
blocked or not. The scan will keep going, but every request will be
redirected to the captcha site and won't return any results. We can
detect this when no results are being written to the output file; in
that case we can be pretty sure that Google has detected our automation
tool and blocked us. A blocked request for "site:google.com php" was
intercepted with Wireshark and can be seen below:
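
The relevant part of the exchange looked roughly like this (simplified,
with only the interesting lines shown):

GET /search?q=site%3Agoogle.com+php HTTP/1.1
Host: www.google.com

HTTP/1.1 302 Found
Location: http://www.google.com/sorry/?continue=http://www.google.com/search%3Fq%3Dsite%3Agoogle.com%2Bphp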

We can see that we were redirected to the http://www.google.com/sorry/
page, where the captcha is waiting to be filled out. The captcha can be
seen in the picture below:

It would certainly be good if GGGoogleScan could detect this and wait
for us to fill in the captcha before continuing, or at least detect it
and stop, so we would know when we had been blocked and could continue
afterwards.

After careful observation of the script, we can determine that it has a
simple way of handling the case where Google blocks it and it can't
continue because a redirect occurs. That piece of code is presented
below:
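
The check presumably looks something like this (a sketch in PHP, the
language the script is written in; $result here is a hypothetical
variable name holding the fetched results page, and BOT_SLEEP is assumed
to be defined in seconds):

// hypothetical reconstruction of the block-detection logic
if (strpos($result, "http://sorry.google.com/sorry/?continue=") !== false) {
    // Google redirected us to the captcha page: warn and wait out the block
    print "# You're acting like a bot. Wait now.\n";
    sleep(BOT_SLEEP);   // BOT_SLEEP = 3600 seconds (60 minutes)
}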

This looks right: if the returned response contains the string
"http://sorry.google.com/sorry/?continue=", the script will sleep for
BOT_SLEEP time, which is 60 minutes, because it needs to wait for Google
to unblock us. At that point, the error message "# You're acting like a
bot. Wait now." will also be displayed to let us know that we've been
blocked and that the script can't continue. But when running our own
GGGoogleScan scenario, after being blocked the following is printed on
the screen:

# next page of results link missing

This certainly isn't the error message that we should get, so what's
going on? The problem is that the script hasn't been updated lately:
Google changed the redirection URL from
"http://sorry.google.com/sorry/?continue=" (the string the script checks
for) to "http://www.google.com/sorry/?continue=" (the URL Google
actually redirects to now), and this is the reason the script can't
detect that we've been blocked. We must change the string in the script
to the one Google currently returns, so the script will detect the block
and wait 60 minutes before continuing.
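
A quick way to apply this one-line fix (assuming the script file is
named gggooglescan) is:

# sed -i 's|http://sorry.google.com/sorry/|http://www.google.com/sorry/|' gggooglescan

After rerunning the script, we can see that we have indeed been blocked
and that the script will wait now: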

# You're acting like a bot. Wait now.

Conclusion

We've seen how we can use the GGGoogleScan tool to avoid being limited
to 100 queries per day, which is as much as Google allows when using the
Google search engine API.