In desperate need to organize my collection of scientific papers, I had
a look at various tools which could help me organize them. Probably
one of the best out there is Mendeley.
Mendeley seems to be a very good tool to keep your massive collection of
PDFs under control. Unfortunately a very basic function, namely looking
up a newly imported paper in Google Scholar to get attributes like
authors, year, etc. right, is tied to a Mendeley account. I guess that's
their way of forcing the user to participate in their community stuff,
since without the Google Scholar lookup Mendeley is pretty useless
unless you want to fill in all the attributes manually.

So I decided to write my own tool to do the lookup. Unfortunately
Google does not really want to give away that precious data: they don't
provide an API and even block certain User-Agents from accessing the
page. Then, there is also the problem of scraping the results page to
get the right data.

The first problem can be trivially solved by setting a common User-Agent
string; the second one can be elegantly circumvented by using the BibTeX
files provided in the search results. The BibTeX entries are, however,
only shown if you enabled them in the settings, which are stored in a
cookie. After a few tries, I figured out that the CF attribute (citation
format?) controls which bibliography format is offered on the results
page, and that CF=4 corresponds to BibTeX. Generating a fake cookie is
easy, but you have to know what must be included. In this case it looks
like a 16-digit hex string as the ID plus the CF attribute is
sufficient. The ID is probably supposed to be your ID, but a randomly
generated one also works like a charm.

The resulting cookie looks like this: GSP=ID=762a112b5c765732:CF=4
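
Building such a cookie takes only a few lines. The following is a
minimal sketch, not the code from the repository: the function name
make_gsp_cookie is made up for illustration, and it assumes that any
random 16-digit hex ID is accepted, as described above.

```python
import secrets


def make_gsp_cookie(citation_format=4):
    """Return a GSP cookie string with a random 16-digit hex ID.

    citation_format=4 is the value that selected BibTeX in my tests.
    """
    # token_hex(8) produces 8 random bytes, rendered as 16 hex digits.
    random_id = secrets.token_hex(8)
    return "GSP=ID={}:CF={}".format(random_id, citation_format)


print(make_gsp_cookie())
```

Every call prints a cookie of the same shape as the one above, just
with a different ID.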

All you have to do now is to query Google Scholar using the User-Agent
string and the cookie.
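
A query along these lines could look roughly like the sketch below. It
is only an illustration under a few assumptions: the search URL, the
"q" parameter, the User-Agent value and the helper names build_request
and fetch_results are my placeholders, not taken from the actual code.

```python
import secrets
import urllib.parse
import urllib.request

# Assumptions for illustration: endpoint and User-Agent are placeholders.
SEARCH_URL = "https://scholar.google.com/scholar"
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def build_request(query):
    """Prepare a search request carrying the User-Agent and the cookie."""
    url = SEARCH_URL + "?" + urllib.parse.urlencode({"q": query})
    # Random 16-digit hex ID, CF=4 to ask for BibTeX links.
    cookie = "GSP=ID={}:CF=4".format(secrets.token_hex(8))
    headers = {"User-Agent": USER_AGENT, "Cookie": cookie}
    return urllib.request.Request(url, headers=headers)


def fetch_results(query):
    """Send the request and return the results page as text."""
    with urllib.request.urlopen(build_request(query)) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_results("some author or title") would then return the
HTML of the results page, provided the request is not blocked.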

Google Scholar will then offer you links to the BibTeX files of the
results. Getting those links is easy since they all start with
"/scholar.bib". Just search for those and download the targets.
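
Searching for the links can be done with a simple regular expression.
Again a hedged sketch rather than the real implementation: it assumes
the links appear as href attributes in double quotes, and the sample
page and the name extract_bibtex_links are invented for the example.

```python
import html
import re
import urllib.parse

# Matches href values that start with "/scholar.bib".
BIB_LINK = re.compile(r'href="(/scholar\.bib[^"]*)"')


def extract_bibtex_links(page, base="https://scholar.google.com"):
    """Return absolute URLs for all BibTeX links found in the page."""
    links = []
    for match in BIB_LINK.findall(page):
        # HTML attributes encode "&" as "&amp;", so undo that first.
        relative = html.unescape(match)
        links.append(urllib.parse.urljoin(base, relative))
    return links


# Made-up snippet standing in for a real results page.
sample = (
    '<a href="/scholar.bib?q=info:abc&amp;output=citation">BibTeX</a>'
    '<a href="/scholar?q=something">other link</a>'
)
for link in extract_bibtex_links(sample):
    print(link)
```

Each returned URL can then be downloaded with the same User-Agent and
cookie as the search request.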

The complete code is available on GitHub. It can be used as a Python
library or as a standalone application. You just call it like this:
gscolar "some author or title" and it will print the first ten results
as BibTeX to stdout.