Note

The output from all the example programs from PyMOTW has been generated with Python 2.7.8, unless otherwise noted. Some of the features described here may not be available in earlier versions of Python.

robotparser implements a parser for the robots.txt file format, including a simple function for checking whether a given user agent can access a resource. It is intended for use in well-behaved spiders or other crawler applications that need to be throttled or otherwise restricted.

Note

The robotparser module has been renamed urllib.robotparser in Python 3.0.
Existing code using robotparser can be updated using 2to3.
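
To check whether an agent may fetch a given URL, construct a RobotFileParser, point it at the site's robots.txt with set_url(), download the file with read(), and then ask can_fetch(). Here is a minimal sketch; the base URL is a placeholder, and the check works the same against any site:

import robotparser
import urlparse

AGENT_NAME = 'PyMOTW'
URL_BASE = 'http://www.example.com/'  # placeholder site

# Download and parse the site's access rules
parser = robotparser.RobotFileParser()
parser.set_url(urlparse.urljoin(URL_BASE, 'robots.txt'))
parser.read()

# can_fetch() returns True if the agent may access the URL
url = urlparse.urljoin(URL_BASE, 'PyMOTW/')
print '%6s : %s' % (parser.can_fetch(AGENT_NAME, url), url)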

The robots.txt file format is a simple text-based access control system for computer programs that automatically access web resources (“spiders”, “crawlers”, etc.). The file is made up of records that specify the user agent identifier for the program followed by a list of URLs (or URL prefixes) the agent may not access.
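
For example, a robots.txt refusing all agents access to the administrative and download areas might look like the following (this sample is made up for illustration):

User-agent: *
Disallow: /admin/
Disallow: /downloads/

A record beginning with User-agent: * applies to any crawler that does not have a more specific record of its own.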

An application that takes a long time to process the resources it downloads, or that is throttled to pause between downloads, may want to check for new robots.txt files periodically based on the age of the content it has downloaded already. The age is not managed automatically, but the convenience methods mtime() and modified() make tracking it easier: mtime() reports when robots.txt was last fetched, and modified() records the current time as the fetch time.

import robotparser
import time

AGENT_NAME = 'PyMOTW'

parser = robotparser.RobotFileParser()
# Using the local copy of robots.txt
parser.set_url('robots.txt')
parser.read()
parser.modified()

PATHS = [
    '/',
    '/PyMOTW/',
    '/admin/',
    '/downloads/PyMOTW-1.92.tar.gz',
    ]

for path in PATHS:
    print
    # How long ago was robots.txt last read?
    age = int(time.time() - parser.mtime())
    print 'age:', age,
    if age > 1:
        print 're-reading robots.txt'
        parser.read()
        parser.modified()
    else:
        print
    print '%6s : %s' % (parser.can_fetch(AGENT_NAME, path), path)
    # Simulate a delay in processing
    time.sleep(1)

This extreme example downloads a new robots.txt file if the one it has is more than 1 second old.

A “nicer” version of the long-lived application might request the modification time for the file before downloading the entire thing. On the other hand, robots.txt files are usually fairly small, so it isn’t that much more expensive to just grab the entire document again.
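
One way to do that is with an HTTP HEAD request, comparing the Last-Modified header against the value seen on the previous fetch and only re-reading the file when it changes. Neither the hostname nor the helper function below is part of robotparser; this is just a sketch of the idea:

import httplib
import robotparser

parser = robotparser.RobotFileParser()
parser.set_url('http://www.example.com/robots.txt')  # placeholder host
parser.read()
parser.modified()

last_modified = None  # Last-Modified header from the previous fetch

def refresh_if_changed():
    """Re-read robots.txt only when the server reports a change."""
    global last_modified
    conn = httplib.HTTPConnection('www.example.com')
    conn.request('HEAD', '/robots.txt')
    response = conn.getresponse()
    header = response.getheader('last-modified')
    conn.close()
    if header != last_modified:
        last_modified = header
        parser.read()
        parser.modified()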