This series of Amazon case-study tutorials covers difficult-to-handle situations that come up when scraping data from Amazon. In this tutorial, I will show you how to scrape Amazon product reviews and ratings.

List of features covered in this case study:

Build a URL Loop List

Set Regular Expression

Pagination

Run Local Extraction

Step 1. Set up basic information

Click "Quick Start"

Create a new task in Advanced Mode

Complete the basic information

Step 2. Create a loop list

Octoparse can retrieve data from a set of web pages when you enter a list of target URLs: it traverses the list and opens each page in turn.

First, drag a "Loop Item" action into the Workflow Designer pane to create a loop for a given list of URLs.

Copy the list of URLs you want to crawl and paste it into the "List of URLs" text box. → Click "OK". → Click "Save".

Now you can see that the list of URLs has been saved as the "Loop Item". Once the URL list is built, Octoparse opens the first URL automatically.

(Note: These web pages should share a similar layout so that the extraction actions set up for the first page can be applied automatically to the rest of the list.)
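For readers who want to see the idea behind the URL loop outside of Octoparse, here is a minimal Python sketch. It is not how Octoparse works internally. The function names parse_url_list and visit_each are hypothetical, and the example.com addresses are placeholders rather than real Amazon URLs. The sketch cleans a pasted list (one URL per line) and then applies the same extraction routine to every entry, which is roughly what the "Loop Item" does.

```python
from urllib.parse import urlparse


def parse_url_list(pasted_text):
    """Turn pasted text (one URL per line) into a clean, de-duplicated list."""
    urls = []
    seen = set()
    for line in pasted_text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and repeats
        parts = urlparse(url)
        # Keep only absolute http(s) URLs; anything else is ignored.
        if parts.scheme in ("http", "https") and parts.netloc:
            seen.add(url)
            urls.append(url)
    return urls


def visit_each(urls, extract):
    """Apply one extraction routine to every URL, in list order."""
    return [extract(url) for url in urls]


if __name__ == "__main__":
    pasted = """
    https://example.com/product-a
    https://example.com/product-b
    https://example.com/product-a
    """
    for url in parse_url_list(pasted):
        print(url)
```

Because the same extract routine runs on every URL, the pages need a similar layout, which is the reason for the note above.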

Step 3. Select the data to be extracted and rename data fields

Now we will begin to extract the overall reviews and ratings from the first web page.

Click the movie title on the first web page. → Select "Extract text".

Follow the same steps to extract the other data fields (ratings, reviews).

Rename the fields if necessary. → Click "Save".

Step 4. Set up regular expression

Regular expressions describe patterns to look for in the data. Octoparse lets you use regular expressions to reformat captured data.

In this case, if you look closely at the extracted “Ratings” field, you will find that it contains the unnecessary text “out of 5 stars”. To fix this, we can use the RegEx Tool to capture only the rating value.

Select data field "Ratings", click the icon for "Customize Field"

Choose "Re-format extracted data"

Click "Add step"

Select "Replace with Regular Expression"

Click “Try RegEx Tool” to remove the “out of 5 stars” suffix from the string (or, if you know how to write regular expressions, enter "(.+?)(?=out of 5 stars)" directly in the “Regular Expression” field)

Check “End with” and paste “out of 5 stars” into the text box. This marks “out of 5 stars” as the end of the match, so only the text before it is captured
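To check what the pattern above captures before using it in a task, you can try it in Python's standard re module. This is only a sketch: the helper name extract_rating and the sample string are assumptions made for illustration, and Octoparse's own RegEx engine may differ in small details.

```python
import re

# Lazy capture of everything that comes before "out of 5 stars".
# The lookahead (?=...) checks for the suffix without including it in the match.
RATING_PATTERN = re.compile(r"(.+?)(?=out of 5 stars)")


def extract_rating(raw):
    """Return the text before 'out of 5 stars', or the trimmed input if absent."""
    match = RATING_PATTERN.search(raw)
    if match is None:
        return raw.strip()
    # The captured group keeps the space before "out", so trim it.
    return match.group(1).strip()


if __name__ == "__main__":
    print(extract_rating("4.5 out of 5 stars"))  # prints 4.5
```

The captured group still ends with the space that precedes "out", which is why the sketch trims the result.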

(Note: Octoparse Cloud Extraction allows you to run the task without keeping your machine turned on, and the task also runs much faster (see the screenshots below). Features such as scheduled extraction, IP rotation, and API access are also supported in the Cloud. Find out more about Octoparse Cloud Service here.)

Speed of Octoparse Cloud Extraction

Speed of Octoparse Local Extraction

Step 9. Check the data and export

The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.

Now that you have learned how to crawl ratings and reviews from Amazon, get started with your own crawling task to extract any data you want.
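If you export to CSV, a quick script can help you check the data before you use it. The sketch below assumes a column named "Ratings" that already holds plain numbers such as 4.5. Your export will use whatever field names you chose in Step 3, and the function name summarize_ratings is hypothetical.

```python
import csv
import io


def summarize_ratings(csv_text, rating_column="Ratings"):
    """Count rows, count unparseable ratings, and average the rest."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ratings = []
    skipped = 0
    for row in reader:
        value = (row.get(rating_column) or "").strip()
        try:
            ratings.append(float(value))
        except ValueError:
            skipped += 1  # empty or still-messy rating text
    average = sum(ratings) / len(ratings) if ratings else None
    return {"rows": len(ratings) + skipped, "skipped": skipped, "average": average}


if __name__ == "__main__":
    # Made-up sample rows, standing in for an exported file's contents.
    sample = "Title,Ratings\nItem A,4.5\nItem B,3.5\nItem C,\n"
    print(summarize_ratings(sample))
```

A non-zero "skipped" count usually means some ratings were empty or still contain extra text, so the reformatting step is worth rechecking.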