In Scrapy, objects such as Requests, Responses and Items have a finite
lifetime: they are created, used for a while, and finally destroyed.

Of all those objects, the Request is probably the one with the longest
lifetime, as it stays waiting in the Scheduler queue until it’s time to process
it. For more information, see Architecture overview.

As these Scrapy objects have a (rather long) lifetime, there is always the risk
of accumulating them in memory without releasing them properly and thus causing
what is known as a “memory leak”.

To help debug memory leaks, Scrapy provides a built-in mechanism for
tracking object references called trackref,
and you can also use a third-party library called Guppy for more advanced memory debugging (see below for more
info). Both mechanisms must be used from the Telnet Console.
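
For example, with the telnet console extension enabled, you can connect to a
running crawl and inspect live references interactively. A minimal
illustrative session, assuming the default telnet port (6023); recent Scrapy
versions also print the telnet credentials in the crawl log:

    $ telnet localhost 6023
    >>> prefs()

prefs() is a telnet console shortcut for printing live references, and is
shown in more detail below.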

It happens quite often (sometimes by accident, sometimes on purpose) that the
Scrapy developer passes objects referenced in Requests (for example, through the
meta attribute or the request callback function),
which effectively binds the lifetime of those referenced objects to the
lifetime of the Request. This is, by far, the most common cause of memory leaks
in Scrapy projects, and quite a difficult one for newcomers to debug.
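
Here is a minimal sketch of that anti-pattern; the spider name, URLs, and CSS
selectors are hypothetical. Stowing the whole Response in meta keeps every
listing page (and everything it references) alive until the product Request is
finally processed:

    import scrapy

    class LeakySpider(scrapy.Spider):
        # Hypothetical spider illustrating the leak; do not copy this pattern.
        name = "leaky"
        start_urls = ["https://example.com/products"]

        def parse(self, response):
            for href in response.css("a.product::attr(href)").getall():
                # BAD: storing the full response in meta ties its lifetime
                # to the new Request, which may sit in the scheduler queue
                # for a long time before being processed.
                yield response.follow(
                    href,
                    callback=self.parse_product,
                    meta={"listing_response": response},
                )

        def parse_product(self, response):
            yield {
                "url": response.url,
                # Only the small values you actually need should travel in
                # meta; pass response.url (a plain string) instead of the
                # whole response object.
                "listing_url": response.meta["listing_response"].url,
            }

Passing only the data you actually need (here, the listing URL as a plain
string) breaks the tie between the two lifetimes.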

In big projects, the spiders are typically written by different people, and
some of those spiders could be “leaking”, affecting the rest of the
(well-written) spiders when they run concurrently, which, in turn,
affects the whole crawling process.

The leak could also come from a custom middleware, pipeline or extension that
you have written, if you are not releasing the (previously allocated) resources
properly. For example, allocating resources on spider_opened
but not releasing them on spider_closed may cause problems if
you’re running multiple spiders per process.
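
As a sketch of the correct pattern (the extension and its file-per-spider
resource are hypothetical), every allocation in spider_opened is paired with a
release in spider_closed:

    from scrapy import signals

    class ResourceExtension:
        """Hypothetical extension that opens one file handle per spider."""

        def __init__(self):
            self.files = {}

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_opened(self, spider):
            # Allocate the resource when the spider starts...
            self.files[spider.name] = open(f"{spider.name}.log", "a")

        def spider_closed(self, spider):
            # ...and release it when the spider finishes. Skipping this step
            # leaks one file handle per spider run in a long-lived process.
            self.files.pop(spider.name).close()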

By default Scrapy keeps the request queue in memory; it includes
Request objects and all objects
referenced in Request attributes (e.g. in meta).
While not necessarily a leak, this can take a lot of memory. Enabling
a persistent job queue can help keep memory usage
under control.
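
To see what is actually accumulating, call prefs() from the telnet console (a
shortcut for scrapy.utils.trackref.print_live_refs()). The class names,
counts, and ages below are illustrative of a leaking crawl:

    >>> prefs()
    Live References

    SomenastySpider                 1   oldest: 15s ago
    HtmlResponse                 3890   oldest: 265s ago
    Selector                        2   oldest: 0s ago
    Request                      3878   oldest: 250s ago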

As you can see, that report also shows the “age” of the oldest object in each
class. If you’re running multiple spiders per process, chances are you can
figure out which spider is leaking by looking at the oldest request or response.
You can get the oldest object of each class using the
get_oldest() function (from the telnet console).

The fact that there are so many live responses (and that they’re so old) is
definitely suspicious, as responses should have a relatively short lifetime
compared to Requests. The number of responses is similar to the number
of requests, so it looks like they are tied together in some way. We can now go
and check the code of the spider to discover the nasty line that is
generating the leaks (passing response references inside requests).

Sometimes extra information about live objects can be helpful.
Let’s check the oldest response; the URL in this illustrative session is
hypothetical, but in a real session it points straight at the site (and hence
the spider) responsible:
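
    >>> from scrapy.utils.trackref import get_oldest
    >>> r = get_oldest("HtmlResponse")
    >>> r.url
    'http://www.somenastyspider.com/product.php?pid=123'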

If your project has too many spiders executed in parallel,
the output of prefs() can be difficult to read.
For this reason, that function has an ignore argument which can be used to
ignore a particular class (and all its subclasses). For
example, this won’t show any live references to spiders:
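
    >>> from scrapy.spiders import Spider
    >>> prefs(ignore=Spider)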

trackref provides a very convenient mechanism for tracking down memory
leaks, but it only keeps track of the objects that are most likely to cause
memory leaks (Requests, Responses, Items, and Selectors). However, there are
other cases where the memory leaks come from other (more or less obscure)
objects. If that is your case, and you can’t find your leaks using trackref,
you still have another resource: the Guppy library.

If you use pip, you can install Guppy with the following command:

    pip install guppy

The telnet console also comes with a built-in shortcut (hpy) for accessing
Guppy heap objects. Here’s an example, with illustrative numbers, of viewing
all Python objects available in the heap using Guppy:
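
    >>> x = hpy.heap()
    >>> x.bytype
    Partition of a set of 297033 objects. Total size = 52587824 bytes.
     Index  Count   %     Size   % Cumulative  % Type
         0  22307   8 16423880  31  16423880  31 dict
         1 122285  41 12441544  24  28865424  55 str
    ...

You can then drill into a row (for example, x.bytype[0].byvia) to see which
references are keeping those objects alive.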

Sometimes you may notice that the memory usage of your Scrapy process only
increases, but never decreases. Unfortunately, this can happen even
though neither Scrapy nor your project is leaking memory. It is due to a
(not so well) known problem of Python, which may not return released memory to
the operating system in some cases. For more information on this issue, see
Evan Jones’s work on Python memory management.

The improvements proposed by Evan Jones in his paper on improving Python’s
memory allocator were merged in Python 2.5, but they only reduce the problem;
they don’t fix it completely. To quote the paper:

Unfortunately, this patch can only free an arena if there are no more
objects allocated in it anymore. This means that fragmentation is a large
issue. An application could have many megabytes of free memory, scattered
throughout all the arenas, but it will be unable to free any of it. This is
a problem experienced by all memory allocators. The only way to solve it is
to move to a compacting garbage collector, which is able to move objects in
memory. This would require significant changes to the Python interpreter.

To keep memory consumption reasonable, you can split the job into several
smaller jobs or enable the persistent job queue
and stop/start the spider from time to time.
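
A minimal sketch of the persistent-queue approach, with a hypothetical job
directory: setting JOBDIR makes the scheduler keep pending requests on disk
rather than in memory, and lets you stop a crawl and resume it later:

    $ scrapy crawl somespider -s JOBDIR=crawls/somespider-1
    # Stop gracefully (e.g. with a single Ctrl-C), then run the same
    # command again to resume from the persisted queue.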