After an item has been scraped by a spider, it is sent to the Item Pipeline
which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes referred to as just an “Item Pipeline”) is a
Python class that implements a simple method. It receives an item and performs
an action on it, also deciding whether the item should continue through the
pipeline or be dropped and no longer processed.

The process_item() method is called for every item pipeline component. It
must do one of the following: return a dict with data, return an Item
(or any descendant class) object, return a Twisted Deferred, or raise a
DropItem exception. Dropped items are no longer
processed by further pipeline components.

If present, the from_crawler() classmethod is called to create a pipeline
instance from a Crawler. It must return a new instance
of the pipeline. The Crawler object provides access to all Scrapy core
components, such as settings and signals; it is the way for a pipeline to
access them and hook its functionality into Scrapy.

Let’s take a look at the following hypothetical pipeline that adjusts the
price attribute for those items that do not include VAT
(price_excludes_vat attribute), and drops those items which don’t
contain a price:
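A minimal sketch of such a pipeline could look like the following. It assumes dict-like items and an illustrative VAT factor of 1.15. The fallback DropItem class is only there so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        """Placeholder for scrapy.exceptions.DropItem."""


class PricePipeline:
    # Assumed VAT rate, for illustration only.
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get("price"):
            # Add VAT only to items flagged as excluding it.
            if item.get("price_excludes_vat"):
                item["price"] = item["price"] * self.vat_factor
            return item
        # No price (or a falsy one): stop processing this item.
        raise DropItem("Missing price in %s" % item)
```

Returning the item hands it to the next component, while raising DropItem stops it from going any further.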

This example demonstrates how to return a Deferred from the process_item() method.
It uses Splash to render a screenshot of the item URL. The pipeline
makes a request to a locally running instance of Splash. After the request is downloaded
and the Deferred callback fires, it saves the screenshot to a file and adds the filename to the item.

    import scrapy
    import hashlib
    from urllib.parse import quote


    class ScreenshotPipeline(object):
        """Pipeline that uses Splash to render screenshot of
        every Scrapy item."""

        SPLASH_URL = "http://localhost:8050/render.png?url={}"

        def process_item(self, item, spider):
            encoded_item_url = quote(item["url"])
            screenshot_url = self.SPLASH_URL.format(encoded_item_url)
            request = scrapy.Request(screenshot_url)
            dfd = spider.crawler.engine.download(request, spider)
            dfd.addBoth(self.return_item, item)
            return dfd

        def return_item(self, response, item):
            if response.status != 200:
                # Error happened, return item.
                return item

            # Save screenshot to file, filename will be hash of url.
            url = item["url"]
            url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
            filename = "{}.png".format(url_hash)
            with open(filename, "wb") as f:
                f.write(response.body)

            # Store filename in item.
            item["screenshot_filename"] = filename
            return item

The integer values you assign to classes in the ITEM_PIPELINES setting determine the
order in which they run: items go through from lower-valued to higher-valued
classes. It’s customary to define these numbers in the 0-1000 range.
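As an illustration of the ordering rule, the sketch below shows what such a setting might look like in a project's settings module. The module path myproject.pipelines and the values 300 and 800 are assumptions. The final line is not something Scrapy requires; it only makes the resulting order visible.

```python
# settings.py (module paths below are hypothetical)
ITEM_PIPELINES = {
    "myproject.pipelines.PricePipeline": 300,
    "myproject.pipelines.ScreenshotPipeline": 800,
}

# Illustration only: sorting by the assigned integer reproduces the order
# in which items pass through the components (lowest value first).
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

With these values, items reach PricePipeline before ScreenshotPipeline, so items dropped for a missing price never trigger a screenshot request.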