How Does the Simple Scraper Work?

The Simple Scraper combines an automated web browser with XPath, a query language for selecting elements in XML and HTML documents. When you highlight parts of a website using the scraper, it uses AI to work out the XPath that locates the element(s) you selected. The automated browser then loads the pages you've entered, and the XPath determines which data to scrape.
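
For the curious, here is a minimal sketch of what an XPath selection looks like, using Python's standard library. The page markup, class names, and URLs are made up for illustration, and this is not the Simple Scraper's own code; it only shows how one XPath expression can match every similar element on a page.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a results page (hypothetical markup).
PAGE = """
<html>
  <body>
    <div class="result"><a href="https://example.com/a">First result</a></div>
    <div class="result"><a href="https://example.com/b">Second result</a></div>
    <div class="ad"><a href="https://example.com/ad">Sponsored</a></div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# One XPath expression matches every element with the same structure,
# which is why a single selection can cover all the results on a page.
links = root.findall(".//div[@class='result']/a")

titles = [link.text for link in links]
urls = [link.get("href") for link in links]

print(titles)
print(urls)
```

The "ad" block is skipped because it does not match the expression, even though it also contains a link.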

Of course, you don't need to know any of that - you just need to click a few buttons and it will do the rest. It's that simple!

Single-Level Navigation

This first tutorial will go through scraping a Single Level of web pages. You can also think of this as a flat, or horizontal, scraping template. Examples of this would be scraping data from the search result pages on Google, without loading the results themselves.

>>Choose the "Single Level" Template if you want to scrape data without going inside result pages. For a better understanding of what template to choose based on what data you're looking for, watch the full tutorials at the bottom of the page.<<

1. Let's Start Scraping:

Step 1: Load the Outscrape Simple Scraper, and choose the "New Template" Button in the top right. Choose the Single-Level Template.

In the window that appears, enter a URL that contains data you want to scrape. In this example you can see a Google search results URL.

TIP: The scraper cannot enter text or click buttons while it's running, so you need to start from the URL that you get AFTER typing any text or clicking any buttons. Just get there in another browser, copy the URL, and paste it here. For example, if you want to scrape search results, use the URL of the results page, not www.Google.com.
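
If you'd rather build the starting URL than copy it from a browser, the sketch below shows the idea. It assumes the site takes its search term as a query parameter (Google uses "q"); the helper names are hypothetical, and the check at the end is only a rough heuristic that expects a full URL including "https://".

```python
from urllib.parse import urlencode, urlsplit

def search_url(base, **params):
    """Return the base address plus a URL-encoded query string."""
    return f"{base}?{urlencode(params)}"

def looks_like_results_page(url):
    """Rough heuristic: a bare home page has no path and no query."""
    parts = urlsplit(url)
    return bool(parts.query) or parts.path not in ("", "/")

start_url = search_url("https://www.google.com/search", q="web scraping tools")
print(start_url)

# The home page is not a useful starting point for a template.
print(looks_like_results_page("https://www.google.com/"))
print(looks_like_results_page(start_url))
```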



Choose the Single-Level Navigation Template to start.

2. Choose the Next Button (Or Arrow):

Step 2: Click the drop-down to choose a command. Choose "Select Next Page Button".

Locate the button, text, or element that you click to get to the next page. For example, on Google this is either the word "Next" or the ">" arrow.

RIGHT CLICK this. It will be highlighted in red to show that it has been selected.

TIP: If there is no next button, watch the tutorial on scraping from lists of URLs and come back after you've seen it!

Choose the "Next Button" that advances to the next page of results.

3. Choose the Data's Region:

Step 3: Click the drop-down to choose a command. Choose the "Select Region Around Data To Scrape" button.

Move your mouse cursor over some of the data that you want to scrape. You'll notice an orange, rectangular highlight. Right click when the highlight covers the correct area of the data you want to scrape. A red highlight will then appear on all the samples of data that match your selection.

Your goal here is to move the mouse until the highlight covers one complete example of the data that you want to scrape. For example, if you want to scrape the URL and description of a Google search result, you could highlight any of the 10 results on the page, as long as the rectangle covers the data entirely. If there is only a SINGLE example of the data you want to scrape on this page, you probably chose the wrong template.

Choose the region of the data you want to scrape.

TIP: You can do this highlighting more than once, so if you make a mistake, you can simply re-select the appropriate area. And if you cannot highlight all of the data at once, that's fine too. Remember - you are only selecting a sample of the data you want.
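
To picture what the red highlights mean, here is a rough sketch in Python. The markup and the XPath are invented for the example, and the scraper works this out for you; the point is simply that one sample region should match every repeated block, and a count of one hints that the Single-Level template is the wrong choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical results page: three repeated blocks plus unrelated content.
PAGE = """<html><body>
  <div class="header">Site navigation</div>
  <div class="result"><h3>Alpha</h3><p>First description</p></div>
  <div class="result"><h3>Beta</h3><p>Second description</p></div>
  <div class="result"><h3>Gamma</h3><p>Third description</p></div>
</body></html>"""

def matching_regions(page, xpath):
    """Return every element on the page that the region XPath matches."""
    return ET.fromstring(page).findall(xpath)

regions = matching_regions(PAGE, ".//div[@class='result']")
region_count = len(regions)
print(f"{region_count} regions matched")

if region_count <= 1:
    # Mirrors the advice above: a single match suggests the wrong template.
    print("Only one region matched; consider a different template.")
```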

4. Choose the Data:

Step 4: Click the drop-down and choose the "Select Sample Data" button. Inside the region that you selected, use your mouse to create an orange highlight around an example piece of data that you want to scrape. Right click.

In this step, you should imagine each piece of data that you want to select as a separate row in a spreadsheet. The results will appear that way. You will want to highlight individual pieces of data and right click them.

In the window that appears, use the checkboxes to choose which pieces of data inside your selection you want to scrape. Use the "Field Name" box to change the name of your "Column" of data in the CSV. The resulting column name is your field name, a dot, and the attribute listed beside the checkbox (for example, Description.Inner_Text).
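
The naming rule can be sketched as follows. Only "Description.Inner_Text" comes from the tool itself; the other field names, the "href" attribute, the markup, and the one-row-per-region layout are assumptions made for this example, so your real output file may be arranged differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

PAGE = """<html><body>
  <div class="result"><h3>Alpha</h3><p>First description</p><a href="https://example.com/alpha">more</a></div>
  <div class="result"><h3>Beta</h3><p>Second description</p></div>
</body></html>"""

# (field name, path inside the region, attribute) - a hypothetical template.
FIELDS = [
    ("Title", "h3", "Inner_Text"),
    ("Description", "p", "Inner_Text"),
    ("Link", "a", "href"),
]

def column_name(field, attribute):
    """Field name DOT attribute, e.g. Description.Inner_Text."""
    return f"{field}.{attribute}"

def extract(region, path, attribute):
    """Return the requested value, or an empty string if it is missing."""
    node = region.find(path)
    if node is None:
        return ""
    if attribute == "Inner_Text":
        return "".join(node.itertext()).strip()
    return node.get(attribute, "")

root = ET.fromstring(PAGE)
header = [column_name(field, attr) for field, _, attr in FIELDS]
rows = [
    [extract(region, path, attr) for _, path, attr in FIELDS]
    for region in root.findall(".//div[@class='result']")
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

The second result has no link, so its Link.href cell is left empty rather than filled with a guess.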

Highlight the data inside your selection that you want to scrape and right click.

TIP: You can hit cancel if your selection isn't accurate. You can also go back to the previous step after choosing your Data in this step if you'd like to select more than one Region of Data.

5. Save and Run Your Template:

Step 5: Ready to run your template? Click "Save Template", name it something you will remember, and click OK. Then click Run Template and choose your saved template.

TIP: Your scraper will run until it can no longer find a "Next" button. Sometimes, the HTML code that the scraper uses to recognize the next button continues to exist after it is no longer clickable. In those instances, the scraper will need to be shut down manually. (You will notice it repeating the final page.)
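
Here is a small simulation of why the final page can repeat. It is not the scraper's code, and the page addresses are invented; it models a site as a dictionary and shows a "next" link that still exists on the last page but points back at that same page.

```python
def crawl(pages, start, max_pages=100):
    """Follow 'next' links until none remains or a page repeats."""
    visited = []
    seen = set()
    url = start
    while url is not None and len(visited) < max_pages:
        if url in seen:
            # The "next" link led somewhere we have already been: stop.
            break
        seen.add(url)
        visited.append(url)
        url = pages[url].get("next")
    return visited

# Hypothetical site: the last page keeps a stale "next" link to itself.
PAGES = {
    "/results?page=1": {"next": "/results?page=2"},
    "/results?page=2": {"next": "/results?page=3"},
    "/results?page=3": {"next": "/results?page=3"},
}

order = crawl(PAGES, "/results?page=1")
print(order)
```

Without the repeat check, the loop would keep revisiting page 3 until it hit the page limit, which is the repetition you would see in your results.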

Click save, name your template, then click run and choose it.

6. View Your Results:

Step 6: Load your results by visiting the Outscrape_Sessions folder.

The Sessions folder is located inside your Outscrape directory. It will usually be the last (newest) folder, but the folders are named according to the time and day they were created:

Session_170630104525 = 2017/06/30 10:45:25

The file to open will be called "output_[TemplateName].csv". This file can be opened while results are still being saved, even if it appears to be 0 KB. Don't edit and save the file, however, until the scraper has stopped.
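
If you have many session folders, a few lines of Python can decode the names. This is a sketch based only on the naming pattern shown above (two-digit year, month, day, hour, minute, second); the helper names and the extra sample folders are hypothetical.

```python
from datetime import datetime

def session_time(folder_name):
    """Decode a name like Session_170630104525 into a datetime."""
    stamp = folder_name.split("_", 1)[1]
    return datetime.strptime(stamp, "%y%m%d%H%M%S")

def newest_session(folder_names):
    """Return the folder whose name encodes the latest time."""
    return max(folder_names, key=session_time)

def output_file_name(template_name):
    """Build the results file name for a given template name."""
    return f"output_{template_name}.csv"

folders = ["Session_170630104525", "Session_170701090000", "Session_170629235959"]

print(session_time("Session_170630104525"))
print(newest_session(folders))
print(output_file_name("MyTemplate"))
```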


Your results are accumulated inside the Sessions folder, in a CSV file.

Multi-Level Navigation and Scraping

This second tutorial will go through scraping multiple levels of web pages. You can also think of this as a vertical, or deep, template. Examples of this would be scraping the data from products or pages INSIDE of a search result on Amazon, Craigslist, or a real estate site.

>>Choose the "Multi-Level" Template if you want to scrape inside result pages. For a better understanding of what template to choose based on what data you're looking for, watch the full tutorials at the bottom of the page.<<

1. Let's Start Scraping:

Step 1: Load the Outscrape Simple Scraper, and choose the "New Template" Button in the top right. Choose the Multi-Level Template.

In the window that appears, enter a URL that contains data you want to scrape. In this example you can see a Craigslist results page URL.

TIP: The scraper cannot enter text or click buttons while it's running, so you need to start from the URL that you get AFTER typing any text or clicking any buttons. Just get there in another browser, copy the URL, and paste it here. For example, if you want to scrape search results, use the URL of the results page, not www.craigslist.org.

2. Choose the Next Button (Or Arrow):

Step 2: Click the drop-down to choose a command. Choose "Select Next Page Button".

Locate the button, text, or element that you click to get to the next page. For example, on Craigslist this is the word "next".

RIGHT CLICK this. It will be highlighted in red to show that it has been selected.

TIP: If there is no next button, watch the tutorial on scraping from lists of URLs and come back after you've seen it!

3. Choose an Example Result Link:

Step 3: Move your mouse cursor over a link that goes to a page that you want to scrape data from. You'll notice an orange, rectangular highlight. Right click when the highlight covers the correct link. A red highlight will then appear on all the links that match your selection.

Your goal here is to find a sample link that, when selected, creates a red highlight on all the links you want the scraper to visit.

Once you've done this, click to load one of the sample link pages you'd like to scrape data from.
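
Behind the scenes, the idea looks roughly like the sketch below: one sample link yields a pattern that matches every result link, and each matched link becomes a page to visit. The markup, class names, and addresses are invented for illustration, and this is not the scraper's own code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical address of the results page being scraped.
BASE = "https://example.org/search/"

RESULTS = """<html><body>
  <ul>
    <li class="row"><a class="title" href="/post/1.html">Bike</a></li>
    <li class="row"><a class="title" href="/post/2.html">Desk</a></li>
  </ul>
  <a class="nav" href="/about">About</a>
</body></html>"""

root = ET.fromstring(RESULTS)

# The pattern matches result links only, not the navigation link.
link_nodes = root.findall(".//li[@class='row']/a[@class='title']")

# Relative links are resolved against the results page address.
detail_urls = [urljoin(BASE, node.get("href")) for node in link_nodes]

for url in detail_urls:
    print("would visit:", url)
```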

TIP: You can do this highlighting more than once, so if you make a mistake, you can simply re-select the appropriate area.

4. Choose the Data:

Step 4: Choose the "Select Data" button. Use your mouse to create an orange highlight around an example piece of data that you want to scrape. Right click.

In this step, you should imagine each piece of data that you want to select as a separate row in a spreadsheet. The results will appear that way. You will want to highlight individual pieces of data and right click them.

In the window that appears, use the checkboxes to choose which pieces of data inside your selection you want to scrape. Use the "Field Name" box to change the name of your "Column" of data in the CSV. The resulting column name is your field name, a dot, and the attribute listed beside the checkbox (for example, Description.Inner_Text).

TIP: You can hit cancel if your selection isn't accurate. If the sample page doesn't suit you (for example, if some pages contain data you'd like to scrape that isn't on this page), you can hit the "Back" button and load a new sample result page, as long as you haven't highlighted and chosen any data yet.

5. Save and Run Your Template:

Step 5: Ready to run your template? Click "Save Template", name it something you will remember, and click OK. Then click Run Template and choose your saved template.

TIP: Your scraper will run until it can no longer find a "Next" button. Sometimes, the HTML code that the scraper uses to recognize the next button continues to exist after it is no longer clickable. In those instances, the scraper will need to be shut down manually. (You will notice it repeating the final page.)

6. View Your Results:

Step 6: Load your results by visiting the Outscrape_Sessions folder.

The Sessions folder is located inside your Outscrape directory. It will usually be the last (newest) folder, but the folders are named according to the time and day they were created:

Session_170630104525 = 2017/06/30 10:45:25

The file to open will be called "output_[TemplateName].csv". This file can be opened while results are still being saved, even if it appears to be 0 KB. Don't edit and save the file, however, until the scraper has stopped.


Video Tutorials and Walkthroughs

Have more questions? Watch each 10-20 minute video to get a better understanding of how the software works.

Basic Introduction:

The Simple Scraper grabs data from Craigslist and Google, two popular sites, in this quick feature demo.

In-Depth Walkthrough

Demonstration of scraping TEDx (an event site), Yelp (a review site), and Slickdeals (a forum). Watch this in-depth tutorial to see more of how the Simple Scraper works.

Scraping from a List of URLs

Just load a list of URLs from a CSV while building your scraping template, and as long as all the pages are formatted the same way, you can scrape every single one of them!
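
If you need to prepare or tidy such a list first, something like the sketch below can help. It assumes a CSV with a single header named "url"; that header, the helper name, and the sample addresses are all assumptions for the example, so check what format the scraper expects before relying on it.

```python
import csv
import io

# Hypothetical CSV: one header, a blank line, and a duplicate entry.
CSV_TEXT = (
    "url\n"
    "https://example.com/page-1\n"
    "https://example.com/page-2\n"
    "\n"
    "https://example.com/page-1\n"
)

def load_urls(handle, column="url"):
    """Read URLs from a CSV, skipping blanks and duplicates, keeping order."""
    urls = []
    for row in csv.DictReader(handle):
        url = (row.get(column) or "").strip()
        if url and url not in urls:
            urls.append(url)
    return urls

url_list = load_urls(io.StringIO(CSV_TEXT))
print(url_list)
```

With a real file, you would pass open("urls.csv", newline="") in place of the io.StringIO object.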

Scrape Data from a Link on a Result Page (3rd Tier)

Outscrape's Simple Scraper can even scrape data three tiers deep. For example, you could load a series of product search pages, visit the pages for each resulting product, and from there enter the author's profile and scrape data from their page.

Troubleshooting, Tips, and Tricks

While it is easy to use, the Simple Scraper isn't perfect! It sometimes requires a little extra work on certain pages. This project is essentially an alpha version of the final product - I am working on a much more robust version. When that is released, you will get first dibs (and a discount equal to how much you paid for this scraper).

My Virus Scanner is Reporting the Simple Scraper

Because the Simple Scraper uses an automated browser to scrape according to the template that you build, some virus scanners may report it as a potential threat. I assure you it is not a threat. This error may appear during installation, or when you first load the software.

You may need to whitelist the software to let it run. If you'd like instructions on this, search online for "Whitelist [your virus scanner]".

The Simple Scraper isn't Getting the Data I Expect!

This is usually caused by minor changes in the formatting of the pages, or of how the data is structured from page to page, which isn't obvious to you but makes a big difference to the Simple Scraper. Sometimes the result is that you see big sections missing in your data, or almost no data at all.

To make sure you get the correct data, follow these steps:

Try selecting different samples of your data. Select different regions, on different pages, as well. A good example of an issue you might experience: Some Google result pages include videos or images at the top, especially on the first page. Try starting on the second page to limit the variations in formatting.

Try multiple templates. If there is a huge change in formatting that means you can't scrape all the data with a single template, just create two separate templates! You can see an example of this in the in-depth tutorial, around 5 minutes in.

If you see big sections of data missing, make sure it's actually on the pages that you are using! Often, some pages will contain the information that you want, but others won't. The result will be a CSV file with gaps in the data, which is to be expected - you can't scrape what isn't there!

The Scraper is Taking a Really Long Time / Uses a Lot of Memory

The scraper takes a long time between pages! You may have found a page that loads a lot of images or JavaScript. It might not be possible to scrape any faster, but there may be workarounds, such as using a different version of the site. Contact me if you're unsure.

The scraper loops on the last page! This might be because it doesn't recognize that the "Next Page" button no longer exists. Check the data that it's scraping - is this data repetitive, especially at the very end? If so, you may simply need to manually turn off the scraper when it's finished. Unfortunately there is no way around this.

The scraper can grow in size when you have extremely large websites or a long list of pages. I HIGHLY recommend running it on a Virtual Private Server so that it does not interfere with your regular work. You can get a free Amazon Virtual Private Server. For instructions, see this blog post.

Additional Suggestion: After a manual shutdown, make an additional template that picks up where your bot stopped running. If you find that the software moves slowly, or you check the task manager and it's using a lot of your memory, I recommend using the log file (which you can find in the Sessions folder for the appropriate template) to find the last relevant URL that was visited and scraped.

Then, you should be able to copy the template and make a duplicate of it. Change the URL that the template starts on to the appropriate last URL, and run that new template. The URL you're looking for is in the