Creating a recursive web scraper with Node Crawler

So you want to scrape lots of data from the internet using Node Crawler? No problem. You’ll need to know JavaScript, jQuery, and a bit of Node.

First things first:

1) Go to the terminal, create a new JavaScript folder called node-crawler, and save a file in it called “craigslist.js”
2) In your folder, run
npm install crawler
3) Make sure you know how to select elements from the DOM using jQuery
4) Pick a site you want to scrape. In our example, I’m using Craigslist.

Let’s jump into it. I’ll show you the code in full and then break it down piece by piece:

We want to append that data to the JSON array, and when the page is done, find the next page.
Go recursion go!
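The recursive flow described above can be sketched like this. Here `fetchPage` is a stand-in for node-crawler’s queue plus its jQuery callback, returning canned data so the pattern is runnable on its own; the helper names and page contents are illustrative assumptions, not the article’s actual code:

```javascript
// Sketch of the recursive scrape loop. `fetchPage` stubs out node-crawler's
// queue + jQuery callback with canned data so the pattern runs standalone.
var results = [];

// Pretend each "page" yields a listing and reports the next range offset.
function fetchPage(url, callback) {
  var offset = Number(url.split('s=')[1] || 0);
  var listings = [{ title: 'listing at offset ' + offset }];
  callback(listings, offset + 100);
}

function scrape(url) {
  fetchPage(url, function (listings, nextOffset) {
    // Append this page's data to our JSON array...
    results = results.concat(listings);
    // ...then recurse until the base case stops us.
    if (nextOffset < 1000) {
      scrape('http://sfbay.craigslist.org/search/bia?s=' + nextOffset);
    }
  });
}

scrape('http://sfbay.craigslist.org/search/bia?s=0');
console.log(results.length); // 10 — one entry per stubbed page
```

Swapping the stub for a real `crawler.queue()` call keeps the same shape: the callback collects data, computes the next URL, and queues it again.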

Line 26: Here we look at an item in the HTML document that we can pass into our crawler to find the next page. Thankfully, Craigslist gives us a range element that we can pass into the URL to find the next 100 elements. We extract that, of course:

var rangeNumber = $($('.range')[0]).text().split(' ')[2]

Line 30: Create a ‘link’ to pass into the crawler based on the range we just scraped:

var toQueueUrl = 'http://sfbay.craigslist.org/search/bia?s=' + rangeNumber
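Put together, the range extraction and URL construction can be tried on a sample of the range element’s text (the sample string is an assumption about what Craigslist’s `.range` element contained at the time):

```javascript
// The .range element's text looks something like "1 - 100 of 2500"
// (an assumption about the markup); split(' ')[2] grabs the upper bound.
var rangeText = '1 - 100 of 2500';
var rangeNumber = rangeText.split(' ')[2];

// Build the next page's URL from that offset, as in the article.
var toQueueUrl = 'http://sfbay.craigslist.org/search/bia?s=' + rangeNumber;
console.log(toQueueUrl); // http://sfbay.craigslist.org/search/bia?s=100
```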

Lines 32-38: We want our base case, which is: “If the range is less than 1000, queue the next link into our crawler. Otherwise, save all the data to our system.”