Scrapinghub Support Center

Cache and Pricing

chops

started a topic
over 1 year ago

Hello,

I'm new to web scraping and have read a lot of tutorials, but some questions are still open. I want to scrape about 1M housing offers daily from 7 domains (= sources) and want to use Scrapy Cloud, Splash for screenshots, and Crawlera for this job. Scrapy comes with an HTTP cache middleware, and my questions relate to this cache mechanism and to pricing:

1.) Only a small percentage of the 1M housing offers are new or changed in a daily crawl. With the cache middleware (RFC2616 policy) enabled, the crawler first checks the ETag or other headers from the server (and then consults the cache or the server for a fresh response). Do these ETag or header-only requests via Crawlera count as full/successful requests towards the quota (for pricing)? Or isn't it necessary to send the ETag / header-only requests via Crawlera?

2.) A screenshot is only necessary if the housing offer page is new or changed. As I understand it, a second request must be sent by Splash via Crawlera. Does this mean that a second request via Crawlera is required? Or do Splash and Scrapy Cloud use the same cache, so that the second request is answered from the cache? Or does Crawlera cache a response for a short time, so that a second request is answered from this cache and isn't counted towards the quota?
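
For reference, the cache setup described in question 1 corresponds roughly to the Scrapy settings below. This is only a sketch: the storage backend and cache directory are assumed values (Scrapy's defaults), not something stated in the post. With this policy, stale cached responses are revalidated with conditional requests, which are still HTTP requests sent through whatever proxy the spider uses.

```python
# settings.py (sketch): enable Scrapy's HTTP cache with the RFC2616 policy.
# The storage backend and directory are assumptions (Scrapy's defaults).
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_DIR = "httpcache"
```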

nestor

2) If you request the website with Scrapy + Crawlera to check whether the page has changed and, if it has, then make a request via Splash that also goes through Crawlera, then yes, that would be 2 requests. However, you don't have to route the Splash screenshot request through Crawlera.
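
To make the request counting concrete, here is a rough sketch of the "check first, screenshot only on change" idea. The helper names (`fingerprint`, `needs_screenshot`, `crawlera_requests`) are hypothetical and not part of Scrapy, Splash, or Crawlera. The sketch also assumes that change detection is done by hashing the response body and that every check counts as one proxied request.

```python
import hashlib
from typing import Optional


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the page body (hypothetical helper)."""
    return hashlib.sha256(body).hexdigest()


def needs_screenshot(previous: Optional[str], body: bytes) -> bool:
    """True when the page is new (no stored fingerprint) or its body changed."""
    return previous is None or previous != fingerprint(body)


def crawlera_requests(pages: int, changed: int, splash_via_crawlera: bool) -> int:
    """Count proxied requests: one check per page, plus one more per changed
    page only if the Splash screenshot request also goes through the proxy."""
    return pages + (changed if splash_via_crawlera else 0)
```

For example, with 1,000,000 pages and an assumed 50,000 changed pages per day (an illustrative number, not from the thread), `crawlera_requests(1_000_000, 50_000, True)` gives 1,050,000, while sending the Splash screenshot requests directly keeps it at 1,000,000.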