Retrieve bibtex entries with Python

The INSPIRE search engine provides an API for automated
searches of physics bibliographic references, so information
can be retrieved through a programmatic query interface.
Several applications use this API. For instance, some look up
all the \cite{...} references in a file and output the
corresponding \bibitem{...} entries. INSPIRE itself provides
an online reference extractor and a bibliography generator.

Pyinspire (modified BSD license) retrieves results from the
INSPIRE HEP database from the command line. A complete list of
options is available through pyinspire.py --help. Let's
consider the following query from the command line:

We instructed pyinspire to download bibtex references
(flag -b) resulting from the query passed in the string
after the -s flag. The string can include any option accepted
by the INSPIRE query format. In this case we looked for
references by author Maldacena, dated 1997 and cited more than
1000 times. This matched one entry in the INSPIRE database,
and the corresponding bibtex entry was printed on screen.
Output formats that are easier for databases to handle, such
as JSON, are also available.
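
The command-line call described above can also be assembled
from Python, which makes its parts explicit. The sketch below
is only illustrative: the flags -b and -s come from the text,
while the search string (written here in SPIRES-style syntax),
the helper names build_pyinspire_command and run_query, and the
assumption that the script is on the PATH as pyinspire.py are
all introduced for this example.

```python
import subprocess

# Assumed SPIRES-style query: author Maldacena, year 1997,
# cited more than 1000 times.
QUERY = 'find a maldacena and date 1997 and topcite 1000+'


def build_pyinspire_command(search, bibtex=True):
    """Return the argument list for a pyinspire.py call (hypothetical helper)."""
    command = ['pyinspire.py']
    if bibtex:
        command.append('-b')      # -b: ask for bibtex output
    command += ['-s', search]     # -s: the INSPIRE search string
    return command


def run_query(search):
    """Run the command and return what it prints (needs pyinspire and network)."""
    completed = subprocess.run(build_pyinspire_command(search),
                               capture_output=True, text=True, check=True)
    return completed.stdout
```

Calling print(run_query(QUERY)) would then print whatever
pyinspire returns for that search, provided the package is
installed and INSPIRE is reachable.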

While pyinspire is meant to be used as a command, its main
function can easily be imported into Python scripts. This
makes it possible to handle particular cases programmatically.
Although the interface is very simple, it retrieves
bibliographic references automatically and with good
flexibility, thanks to the feature-rich INSPIRE API query
format.

Let's suppose that a bibtex reference file has to be created
starting from heterogeneous reference lists. For example, one
reference list may be a .bib bibtex file, and another may be a
list of bibitem entries. The two lists may have overlapping
references, but their citation labels differ from each other
and do not correspond to those in the INSPIRE database.
However, let's say that both lists report the arXiv number of
each entry. Then we can parse the files to look up all arXiv
numbers and send a query to INSPIRE for each of them. The
queries will return a consistent .bib bibtex file.

First, we have to match all arXiv references. This is easily
done using regular expressions (regexps). The following
function reads a string (which will be the content of a file)
and returns a list containing arXiv numbers. It matches
patterns such as arXiv:1234.5678 or arXiv:gr-qc/1234567,
ignoring case, since the prefix may appear in any case
combination (e.g., arxiv, arXiv, ARXIV).
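
A minimal sketch of such a function is given below. The exact
regular expression is an assumption: it accepts new-style
identifiers (four digits, a dot, four or five digits) and
old-style ones (archive name, a slash, seven digits), and it
keeps the prefix in each returned match, as the code further
down expects.

```python
import re

PREFIX = 'arxiv:'

# New style: 1234.5678 or 1234.56789; old style: gr-qc/1234567 or math.AG/1234567.
ARXIV_PATTERN = re.compile(
    re.escape(PREFIX) + r'(?:\d{4}\.\d{4,5}|[a-z-]+(?:\.[a-z]{2})?/\d{7})',
    re.IGNORECASE,
)


def get_arxiv_ids(string):
    """Return the list of arXiv numbers (prefix included) found in string."""
    # With no capturing groups, findall() returns the full matches.
    return ARXIV_PATTERN.findall(string)
```

A trailing version suffix such as v2 is not part of the match,
so arxiv:hep-th/9711200v2 is returned as arxiv:hep-th/9711200.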

(The variable PREFIX does not need to be global and would be
better passed as an argument of the function. Here it is
declared as a global variable only to keep this simple example
free of an extra function argument.)

Then, we define a function that prints to stdout the INSPIRE
query result based on the arXiv numbers listed in a file.

    from pyinspire.pyinspire import get_text_from_inspire


    def get_bibtex(myfile):
        """Print to stdout the inspire query result based on arxiv numbers
        listed in myfile.
        """
        with open(myfile, 'r') as f:
            string = f.read()
        # A set is used only to drop duplicate matches.
        arxiv_ids = set(get_arxiv_ids(string))
        resultformat = 'bibtex'
        tags = None
        for arxiv in sorted(arxiv_ids):
            # Strip the prefix by length, so any case combination is handled.
            result = get_text_from_inspire(search=arxiv[len(PREFIX):],
                                           resultformat=resultformat,
                                           ot=tags)
            print(result)

The function above starts by reading the given file. It calls
the function get_arxiv_ids() to retrieve the arXiv numbers
listed anywhere in the input file. The list is converted to a
set not because we need to operate on a set, but simply as an
easy way to remove duplicates. Then, for each arXiv number in
alphabetical order, it calls the function
get_text_from_inspire() to send a query to INSPIRE. This
function is provided by the pyinspire package. It is simple to
use, as it takes just three clear arguments:

search: search string to use in the query. In our case it
coincides with the arXiv number. Note that we remove PREFIX
(i.e., 'arxiv:') from the query, as it may be problematic with
older arXiv number formats. To do that we use string slicing
instead of Python's str.replace() method, since replace() is
case sensitive and the prefix may appear in any case
combination (arxiv, arXiv, ARXIV, ...).

resultformat: string containing the name of the
format ('brief', 'bibtex', 'latexEU', 'latexUS'). In our
case it is the bibtex format.

ot: tags to be included in MarcXML or JSON output. We
don't need it.
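
The remark on slicing in the description of search can be
checked directly. The sketch below assumes PREFIX = 'arxiv:'
as in the text and shows that str.replace() misses prefixes
written in a different case, while slicing by length does not.
The helper strip_prefix is a hypothetical name introduced only
for this illustration.

```python
PREFIX = 'arxiv:'  # assumed value, as in the text


def strip_prefix(arxiv_id):
    """Drop the leading prefix by length, whatever its case (hypothetical helper)."""
    return arxiv_id[len(PREFIX):]


matches = ['arXiv:1234.5678', 'ARXIV:1234.5678', 'arxiv:gr-qc/1234567']

# replace() is case sensitive, so only the all-lowercase prefix is removed.
replaced = [m.replace(PREFIX, '') for m in matches]

# Slicing removes the first len(PREFIX) characters regardless of case.
sliced = [strip_prefix(m) for m in matches]

# Deduplicating after stripping also merges ids that differ only in prefix case.
unique_ids = sorted(set(sliced))
```

One consequence worth noting: if the set of matches is built
before the prefix is removed, the same number written once as
arXiv:... and once as ARXIV:... counts as two entries and is
queried twice. Deduplicating after stripping, as in unique_ids
above, is one possible way to avoid that.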

The function can be called as get_bibtex('refs.tex') from
Python, where refs.tex is a text file containing all the arXiv
numbers. It does not need to be a TeX file or any other
specific format; it just has to contain arXiv numbers to be
parsed. If the input file contains Unicode characters, using
Python 3 is recommended over Python 2.

The INSPIRE API documentation does not seem to mention limits
on the number of requests allowed per time interval. Of
course, sending queries for a large reference list in a very
short time should be avoided. While concurrent downloads would
significantly improve the speed of the script above, we used a
sequential implementation to avoid inadvertently launching a
DoS (Denial of Service) attack.
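
One way to keep the sequential approach polite is to pause
between requests. The sketch below is an illustration under
stated assumptions and is not part of pyinspire: fetch stands
for any callable that takes a search string and returns text
(for instance a small wrapper around get_text_from_inspire()),
and the one-second default delay is an arbitrary choice, not a
documented INSPIRE limit.

```python
import time


def fetch_sequentially(searches, fetch, delay=1.0):
    """Call fetch(search) for each search, one at a time, pausing in between.

    fetch is a caller-supplied callable (hypothetical here); delay is the
    pause in seconds between two consecutive requests.
    """
    results = []
    for index, search in enumerate(searches):
        if index > 0:
            time.sleep(delay)  # wait before every request except the first
        results.append(fetch(search))
    return results
```

With a wrapper such as
lambda s: get_text_from_inspire(search=s, resultformat='bibtex', ot=None)
passed as fetch, the loop in get_bibtex() could be replaced by
a single call to fetch_sequentially().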