IMNSHO, most web scraping 'frameworks' are pretty stupid, in the sense that they expect you to get everything right the first time. Of course you don't, so you end up re-querying quite a lot. There are actually three distinct phases in web scraping:

getting all your data from the web site, and preferably storing it for offline processing

parsing the offline data and getting the information out of the html

storing the parsed data in a database-friendly format.

Offline processing has the advantage that you don't harass/overload the site with your broken queries. Also, if you find that you've forgotten to scrape some items, you can always re-parse the stored HTML documents. This is all pretty trivial, of course, but scrapy, for example, doesn't let you save the requested pages to the filesystem and work from your 'cached' copies.
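A minimal sketch of that fetch-then-parse-offline split (the URL, file layout, and the choice of requests/lxml here are just assumptions, any HTTP client and parser will do):

    # Phase 1: download once, store the raw HTML on disk.
    # Phase 2: parse the stored files -- re-runnable without hitting the site again.
    import pathlib
    import requests
    import lxml.html

    CACHE = pathlib.Path("pages")
    CACHE.mkdir(exist_ok=True)

    def fetch(url, name):
        """Download a page once and keep the raw bytes on disk."""
        path = CACHE / f"{name}.html"
        if not path.exists():
            path.write_bytes(requests.get(url, timeout=30).content)
        return path

    def parse(path):
        """Extract data from the stored HTML; tweak and re-run freely."""
        tree = lxml.html.fromstring(path.read_bytes())
        return {"title": tree.findtext(".//title")}

    # Phase 3 would be pushing these dicts into a database.
    records = [parse(fetch(f"https://example.com/items/{i}", f"item-{i}")) for i in range(3)]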

This is the way I tend to write scrapers with scrapy: first implement the "get all the pages" part, which handles traversing listing pages and generating requests for content pages, and leave that running while working on data extraction from a few sample pages. It's then super fast to re-run while tweaking the extraction part, roughly along the lines of the sketch below.
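A rough sketch of that pattern (the start URL and selectors are placeholders; it also leans on Scrapy's built-in HTTP cache so re-runs during extraction tweaking are served from disk):

    import scrapy

    class ListingSpider(scrapy.Spider):
        name = "listing"
        start_urls = ["https://example.com/page/1"]  # placeholder listing URL
        custom_settings = {
            # Scrapy's HTTP cache: repeated runs read responses from disk
            "HTTPCACHE_ENABLED": True,
            "HTTPCACHE_DIR": "httpcache",
            "HTTPCACHE_EXPIRATION_SECS": 0,  # never expire cached pages
        }

        def parse(self, response):
            # the "get all the pages" part: follow item links and the next listing page
            for href in response.css("a.item::attr(href)").getall():
                yield response.follow(href, callback=self.parse_item)
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

        def parse_item(self, response):
            # the data-extraction part: cheap to tweak and re-run against cached pages
            yield {
                "url": response.url,
                "title": response.css("h1::text").get(),
            }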

BeautifulSoup is not really needed unless the page is broken in a way only its parser can handle: lxml has lxml.html, which generally does a pretty good job, and if that's not good enough you can plug in html5lib for parsing, which should behave identically to modern browsers (for instance, html5lib will automatically inject a <tbody> into tables, since it's "implied" per spec).
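A quick way to see the difference, assuming both lxml and html5lib are installed (the table snippet is just a toy example):

    import lxml.html
    import html5lib

    snippet = "<table><tr><td>cell</td></tr></table>"

    # lxml's own lenient HTML parser: fast, but it does not add the implied <tbody>
    table = lxml.html.fromstring(snippet)
    print(lxml.html.tostring(table))   # b'<table><tr><td>cell</td></tr></table>'

    # html5lib parses per the HTML5 spec, like a browser, so the implied <tbody> appears
    doc = html5lib.parse(snippet)      # default xml.etree treebuilder, namespaced tags
    print(any(el.tag.endswith("tbody") for el in doc.iter()))   # True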

Did a lot of data parsing with PHP, DOMDocument, XPath, etc. There was a caching layer in between in MySQL so I wouldn't need to grab the data repeatedly. PHP's GD library for image resampling where needed, and Amazon S3 (with the respective PHP API wrappers) for storing the results. (I much prefer Python's syntax and language structure to PHP's; I'm just writing the above as a random side note in case anyone's interested.)

Sometimes you can create new historical archiving websites out of Fair Use re-tooling of historical data that's online, combined with offline scanning of books, crawling of DVD data, data people send in, etc. For instance, for http://VintageAdBrowser.com and http://CoverBrowser.com, I sent several catalogues to a scanning company in another country, where they ripped apart the bindings of the books for mass scanning; then I applied manual tagging and cropping tools as well as automated cropping tools for the image export. Some of these endeavours are years of continuous work (on and off), so the scripts of course only take you so far.

Once that is available, one can apply, for example, site-wide color search algorithms, and others can continue to mine the data for what they find interesting. One group did an analysis, using Mechanical Turk workers, of the race of the people on the 2800 covers of Sports Illustrated going back to 1954.
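A very simple version of such a color search could look something like this (Pillow assumed, cover filenames made up): index each cover by its average colour, then rank covers by distance to a query colour.

    from pathlib import Path
    from PIL import Image

    def average_color(path):
        """Shrink the image and take the mean RGB as a crude colour fingerprint."""
        img = Image.open(path).convert("RGB").resize((32, 32))
        pixels = list(img.getdata())
        n = len(pixels)
        return tuple(sum(channel) / n for channel in zip(*pixels))

    index = {p: average_color(p) for p in Path("covers").glob("*.jpg")}

    def search(query_rgb, top=10):
        """Return the covers whose average colour is closest to query_rgb."""
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(c, query_rgb))
        return sorted(index, key=lambda p: dist(index[p]))[:top]

    print(search((200, 30, 30)))   # covers that are predominantly red-ish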

If you are interested in this field, one thing you can do is look at which data, imagery, music, etc. is in the public domain, or would be Fair Use to republish and mash up. Just the amount of public domain paintings, illustrations, and comic books is mind-blowing.

Sometimes sites exist dealing with this data, but they don't open it up and make it easily discoverable and searchable. For instance, one site has tons of public domain comic books, but they are all in individual zips, without structured galleries; you need to log in to download; there is no OCR search across the speech bubbles; there is no crowd tagging of content; and so on. Having a comic book search engine would be such fun, and it might even be Fair Use for modern material that is still copyrighted -- because in the results you'd only see a panel or two! There are even full DVDs of scanned, copyrighted material of works such as Amazing Spider-Man (going back decades) which could be OCR'd, databased, and made discoverable and searchable for such purposes...
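The OCR/indexing step such a search engine would need is not that exotic; a rough sketch, assuming Tesseract via pytesseract and a made-up directory of scanned pages:

    from pathlib import Path
    from PIL import Image
    import pytesseract

    def index_pages(scan_dir):
        """Map each scanned page to the text Tesseract can pull out of it."""
        index = {}
        for page in Path(scan_dir).glob("*.png"):
            text = pytesseract.image_to_string(Image.open(page))
            index[str(page)] = text.lower()
        return index

    def search(index, term):
        """Naive full-text search: return pages whose OCR'd text mentions the term."""
        return [page for page, text in index.items() if term.lower() in text]

    pages = index_pages("scans/amazing-spider-man")   # hypothetical directory
    print(search(pages, "responsibility"))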

It's fucking GET-ing a fucking URL with no cookies, no session state, and no auth. Know how much code doing that with requests will remove? None whatsoever, while adding a completely unwarranted and unneeded third-party dependency.
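For what it's worth, the plain-GET case really is a one-liner either way (the URL is a placeholder):

    # stdlib only
    from urllib.request import urlopen
    body = urlopen("https://example.com/data").read()

    # with requests: same single call, one extra dependency
    import requests
    body = requests.get("https://example.com/data").content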

You're just trying to justify unwarranted fanboyism and senseless assertions.

Maybe. I started out using requests and only later dived into urllib, so it could just be preference. However, I find it much more intuitive to work with requests' Response objects than to be handed a file-like object by urllib.
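Roughly the difference I mean, for a hypothetical JSON endpoint (the URL is made up):

    import json
    from urllib.request import urlopen
    import requests

    url = "https://example.com/api/items"

    # urllib hands you a file-like HTTPResponse: you decode and parse it yourself
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        data = json.loads(resp.read().decode(charset))

    # requests' Response object does the bookkeeping for you
    r = requests.get(url)
    r.raise_for_status()   # turn HTTP error statuses into exceptions
    data = r.json()        # encoding handling + JSON decoding in one call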