The spider middleware is a framework of hooks into Scrapy’s spider processing
mechanism where you can plug in custom functionality to process the responses
that are sent to spiders and the requests and items that spiders generate.

The SPIDER_MIDDLEWARES setting is merged with the SPIDER_MIDDLEWARES_BASE
setting defined in Scrapy (and not meant to be overridden) and then sorted by
order to get the final list of enabled middlewares: the first middleware is the
one closest to the engine and the last is the one closest to the spider. In
other words, the process_spider_input() method of each middleware will be
invoked in increasing middleware order (100, 200, 300, ...), and the
process_spider_output() method of each middleware will be invoked in
decreasing order.
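
The merge-and-sort behaviour described above can be sketched in plain Python.
This is only an illustration under stated assumptions, not Scrapy's own code:
the names BASE, PROJECT and enabled_middlewares, and all middleware paths and
order values, are placeholders invented for the example.

```python
# Hypothetical stand-ins for SPIDER_MIDDLEWARES_BASE and SPIDER_MIDDLEWARES.
BASE = {"builtin.A": 100, "builtin.B": 500, "builtin.C": 900}
PROJECT = {"myproject.Custom": 543}


def enabled_middlewares(base, project):
    """Merge both settings and return the middleware paths sorted by order."""
    merged = {**base, **project}  # project entries override base entries
    return sorted(merged, key=merged.get)


chain = enabled_middlewares(BASE, PROJECT)
input_order = chain                   # process_spider_input(): engine to spider
output_order = list(reversed(chain))  # process_spider_output(): spider to engine

print(input_order)
print(output_order)
```

With these placeholder values the custom middleware (543) sits between
builtin.B (500) and builtin.C (900) on the way in, and the sequence is reversed
on the way out.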

To decide which order to assign to your middleware, see the
SPIDER_MIDDLEWARES_BASE setting and pick a value according to where
you want to insert the middleware. The order matters because each
middleware performs a different action, and your middleware could depend on
some previous (or subsequent) middleware being applied.

If you want to disable a built-in middleware (one of those defined in
SPIDER_MIDDLEWARES_BASE and enabled by default), you must define it
in your project’s SPIDER_MIDDLEWARES setting and assign None as its
value. For example, if you want to disable the off-site middleware:
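
A minimal settings sketch follows. The custom middleware path and its order
value are placeholders, and the off-site middleware path is the one used by
Scrapy versions that ship it as a spider middleware; check the
SPIDER_MIDDLEWARES_BASE of your installed version before copying it.

```python
# settings.py (sketch)
SPIDER_MIDDLEWARES = {
    # Placeholder path and order for a project-specific middleware.
    "myproject.middlewares.CustomSpiderMiddleware": 543,
    # None disables a middleware that SPIDER_MIDDLEWARES_BASE enables by default.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
}
```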

If process_spider_exception() returns None, Scrapy will continue processing
this exception, executing any other process_spider_exception() in the
following middleware components, until no middleware components are left and
the exception reaches the engine (where it’s logged and discarded).
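
As a rough sketch of this hook, the class below recovers from one exception
type and passes every other exception on by returning None. The class name and
the choice of ValueError are hypothetical; the non-None branch relies on the
documented alternative of returning an iterable, which stops the exception
chain and is treated as spider output.

```python
import logging

logger = logging.getLogger(__name__)


class RecoverValueErrorMiddleware:
    """Hypothetical spider middleware: absorbs ValueError, passes on the rest."""

    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, ValueError):
            logger.warning("Recovered from %r while parsing %s", exception, response)
            # An iterable ends the exception chain; here it carries no output.
            return []
        # None lets the following middlewares, and finally the engine, see it.
        return None


# Plain stand-ins are used instead of real Response and Spider objects.
mw = RecoverValueErrorMiddleware()
handled = mw.process_spider_exception("<response>", ValueError("bad row"), spider=None)
passed_on = mw.process_spider_exception("<response>", KeyError("id"), spider=None)
```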

process_start_requests() is called with the start requests of the spider, and
works similarly to the process_spider_output() method, except that it
has no associated response and must return only requests (not
items).

It receives an iterable (in the start_requests parameter) and must
return another iterable of Request objects.

Note

When implementing this method in your spider middleware, you
should always return an iterable (that follows the input one) and
never consume the whole start_requests iterator, because it can be very
large (or even unbounded) and exhaust memory. The Scrapy
engine is designed to pull start requests while it has capacity to
process them, so the start requests iterator can be effectively
endless when there is some other condition for stopping the spider
(like a time limit or an item/page count).
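
One way to follow this advice is to write the method as a generator, so each
request is handed over only when the engine asks for the next one. The sketch
below assumes nothing beyond the method signature; the class name and
endless_start_requests are hypothetical, and plain strings stand in for
Request objects so the example runs without Scrapy.

```python
from itertools import count, islice


class LazyStartRequestsMiddleware:
    """Hypothetical middleware that passes start requests through lazily."""

    def process_start_requests(self, start_requests, spider):
        # A generator yields one request at a time, so the input iterator
        # is never materialised in memory.
        for request in start_requests:
            yield request


def endless_start_requests():
    # Stand-in for an unbounded start_requests iterator.
    for page in count(1):
        yield f"https://example.com/page/{page}"


mw = LazyStartRequestsMiddleware()
lazy = mw.process_start_requests(endless_start_requests(), spider=None)
first_three = list(islice(lazy, 3))  # only three items are ever produced
print(first_three)
```

Calling list() on start_requests inside the method would never return for this
input, which is exactly the failure mode the note warns about.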

Filter out unsuccessful (erroneous) HTTP responses so that spiders don’t
have to deal with them, which (most of the time) imposes an overhead,
consumes more resources, and makes the spider logic more complex.

According to the HTTP standard, successful responses are those whose
status codes are in the 200-299 range.

If you still want to process response codes outside that range, you can
specify which response codes the spider is able to handle using the
handle_httpstatus_list spider attribute or the
HTTPERROR_ALLOWED_CODES setting.

For example, if you want your spider to handle 404 responses you can do
this:

    from scrapy.spiders import CrawlSpider

    class MySpider(CrawlSpider):
        handle_httpstatus_list = [404]

The handle_httpstatus_list key of Request.meta can also be used to specify which response codes to
allow on a per-request basis. You can also set the meta key handle_httpstatus_all
to True if you want to allow any response code for a request.
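
To make the interplay of these options concrete, here is a simplified model of
the decision written in plain Python. It is not Scrapy's implementation:
is_response_allowed and its parameters are hypothetical, and the order in
which the per-request keys, the spider attribute and the setting are consulted
is an assumption of this sketch.

```python
def is_response_allowed(status, meta=None, spider_codes=None, setting_codes=()):
    """Simplified model of whether a response reaches the spider."""
    meta = meta or {}
    if 200 <= status < 300:
        return True  # successful responses always pass
    if meta.get("handle_httpstatus_all"):
        return True  # per-request: allow any status code
    if "handle_httpstatus_list" in meta:
        return status in meta["handle_httpstatus_list"]  # per-request list
    # Assumed fallback: spider attribute if defined, otherwise the setting.
    allowed = setting_codes if spider_codes is None else spider_codes
    return status in allowed


print(is_response_allowed(404))                                          # False
print(is_response_allowed(404, spider_codes=[404]))                      # True
print(is_response_allowed(500, meta={"handle_httpstatus_list": [404]}))  # False
print(is_response_allowed(500, meta={"handle_httpstatus_all": True}))    # True
```

In a real spider the per-request keys are passed through the meta argument of
the Request, for example meta={"handle_httpstatus_list": [404]}.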

Keep in mind, however, that it’s usually a bad idea to handle non-200
responses, unless you really know what you’re doing.

Filters out Requests for URLs outside the domains covered by the spider.

This middleware filters out every request whose host name isn’t in the
spider’s allowed_domains attribute.
All subdomains of any domain in the list are also allowed.
E.g. the rule www.example.org will also allow bob.www.example.org
but neither www2.example.com nor example.com.
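
The subdomain rule can be illustrated with a small helper. This is a sketch of
the matching rule as described, not the middleware's own code; host_matches is
a hypothetical name.

```python
def host_matches(host, allowed_domains):
    """Return True if host equals an allowed domain or is a subdomain of one."""
    host = host.lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


allowed = ["www.example.org"]
print(host_matches("www.example.org", allowed))      # True
print(host_matches("bob.www.example.org", allowed))  # True
print(host_matches("www2.example.com", allowed))     # False
print(host_matches("example.com", allowed))          # False
```

Requiring a leading dot before the allowed domain is what keeps a host such as
notwww.example.org from matching the rule www.example.org.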

When your spider returns a request for a domain not belonging to those
covered by the spider, this middleware will log a debug message similar to
this one:

    DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>

To avoid filling the log with too much noise, it will only print one of
these messages for each new domain filtered. So, for example, if another
request for www.othersite.com is filtered, no log message will be
printed. But if a request for someothersite.com is filtered, a message
will be printed (but only for the first request filtered).

If the spider doesn’t define an
allowed_domains attribute, or the
attribute is empty, the offsite middleware will allow all requests.

If the request has the dont_filter attribute
set, the offsite middleware will allow the request even if its domain is not
listed in allowed_domains.
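
The rules above (one message per filtered domain, no filtering when
allowed_domains is missing or empty, and the dont_filter escape hatch) can be
modelled with a toy class. Everything here is hypothetical and only mirrors
the described behaviour: OffsiteFilterSketch, should_follow and the messages
list (which stands in for the debug log) are not part of Scrapy.

```python
from urllib.parse import urlparse


class OffsiteFilterSketch:
    """Toy model of the described behaviour; not Scrapy's implementation."""

    def __init__(self, allowed_domains):
        self.allowed_domains = list(allowed_domains or [])
        self.domains_seen = set()  # offsite domains already reported
        self.messages = []         # stands in for the debug log

    def should_follow(self, url, dont_filter=False):
        host = urlparse(url).hostname or ""
        if dont_filter or not self.allowed_domains:
            return True  # explicit opt-out, or no restriction configured
        if any(host == d or host.endswith("." + d) for d in self.allowed_domains):
            return True
        if host not in self.domains_seen:  # report each offsite domain once
            self.domains_seen.add(host)
            self.messages.append(f"Filtered offsite request to {host!r}: <GET {url}>")
        return False


offsite = OffsiteFilterSketch(["example.org"])
results = [
    offsite.should_follow("http://www.othersite.com/a"),  # filtered, reported
    offsite.should_follow("http://www.othersite.com/b"),  # filtered, silent
    offsite.should_follow("http://someothersite.com/"),   # filtered, reported
    offsite.should_follow("http://www.othersite.com/c", dont_filter=True),
    offsite.should_follow("http://www.example.org/"),
]
print(results, len(offsite.messages))
```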