Important Note: The tutorials on this blog may become outdated as new versions of the program are released. We have now added a series of built-in tutorials in the application, accessible from the Help menu. You should run these to discover the Hub.

This tutorial was created using version 0.8.2. The Scraper Editor interface has changed considerably since then: many features have been added and some controls have been renamed. The following can still be a good complement to get acquainted with scrapers. The Scraper Editor can now be found in the ‘Scrapers’ view instead of ‘Source’, but the principle remains fundamentally the same.

In many cases the automatic data extraction functions (tables, lists, guess) will be enough, and you will manage to extract and export the data in just a few clicks.

If, however, the page is too complex, or if your needs are more specific, there is a way to extract data manually: create your own scraper.

Scrapers will be saved to your personal database and you will be able to re-apply them to the same URL or to other URLs starting, for instance, with the same domain name.

In our present example, the data could be extracted simply using the ‘List’ view in the data section.

If you don’t see anything in the list view, reload the page.

In the ‘Lists’ view, like in most other views, right-clicking on selected rows gives you access to a wealth of features to edit and clean the data.

If the data, as extracted in the list view, is not structured enough for your needs you will have to create a customized scraper for this page.

The Scraper Editor is on the right side of the ‘Source’ view, with the colorized HTML source of the page.

The text in black is the content actually displayed on the page. This colorization makes it very easy to identify the data you are interested in.

Building a scraper simply means telling the program what comes immediately before and after the data you want to extract, and/or its format.
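To make the marker idea concrete, here is a minimal Python sketch of the same logic (the HTML snippet, function, and values are invented for illustration; this is not the Hub’s actual code):

```python
# Hypothetical snippet of page source; "Marker Before" and "Marker After"
# are the strings the scraper editor asks for around each piece of data.
html = "<li>Tokyo (35.68, 139.69) pop. 13,960,000</li>"

def scrape(source, marker_before, marker_after):
    """Return the text found between the two markers, or None."""
    start = source.find(marker_before)
    if start == -1:
        return None
    start += len(marker_before)
    end = source.find(marker_after, start)
    return source[start:end] if end != -1 else None

print(scrape(html, "<li>", " ("))  # → Tokyo
print(scrape(html, "(", ")"))      # → 35.68, 139.69
```

Everything the program does when you fill in the Marker Before and Marker After cells boils down to this kind of delimited search, repeated for each row of data.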

So let’s create a scraper for this list.

Click on ‘New,’ type in the URL of the page and a name for your new scraper.

Fill the cells with the most logical markers you find around the different pieces of data (don’t look below for the solution… your computer is watching and you would lose ten points).

Your first version should logically look like this:

Hit ‘Save,’ and that’s it! You are ready to run your first scraper.

If you now go to the ‘Scraper’ view and hit refresh, the results are there.

They are not bad… but not totally satisfying:

The first row contains text instead of the Coordinates, and the City is missing.

Another look at the source code explains it: the parenthesis ‘(’, which is used as the Marker Before for Coordinates, also appears in a comment hidden in the source code:

You must, therefore, be a little more precise and define the format of the first character that must be found after the marker.

Here, a good way is to use the Regular Expression syntax in the Format field. RegExps can become pretty tricky if you need to find complex patterns, but here, what you want to say is simple: “a string that starts with a digit”.

For this, you need to type \d.+ (a digit \d, followed by a series of one or more characters .+)
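You can check what \d.+ accepts with Python’s re module (the sample strings below are invented, not taken from the page):

```python
import re

# \d.+ means "a digit, followed by one or more further characters".
pattern = re.compile(r"\d.+")

# Coordinate-like text passes the check because it starts with a digit;
# comment text beginning with a letter does not.
print(bool(pattern.match("35.68, 139.69")))   # True
print(bool(pattern.match("hidden comment")))  # False
```

This is why adding the Format field makes the scraper skip the hidden comment: the comment’s text fails the “starts with a digit” test, so only real coordinates are captured.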

Hit Save.

Back to the scraper view, the new result is pretty good.

Reload to see the updates.

One last problem, though, the first city took its continent along with it…

Let’s have a look at the source code one last time.

<li>, our Marker Before City, also appears before the continent.

A simple way, here, is to select all the characters between the beginning of the line and the city name, and copy them into the scraper editor. It makes the marker more specific, and it will keep working because all cities are at the same indentation level:

Our final scraper looks like this:

Don’t forget to hit ‘Save’, and we’re done!

OK, the present example is not all that exciting, and the figures are already out of date. It would almost be faster to enter the 15 rows manually.

But, what if the data filled 20 pages and we decided to update the population figures tomorrow?

Better: what if the data was changing every morning, like job ads, sport results, or stock market indices?… No problem, you would simply re-apply your new scraper.

If you want some tips concerning Scrapers and Regular Expressions, you can refer to the “Help” menu.

This entry was posted
on Friday, August 22nd, 2008 at 4:05 pm and is filed under Tutorials (Web Scraper).

43 Responses to “Create your First Web Scraper to Extract Data from a Web Page”

Here’s my question: The tutorial says you can apply a scraper to a list of URLs.

But I can’t figure out how to actually LOAD a list of URLs into the program.

Do I have to individually browse to each page, if I already have my list of URLs? Or is there a way to load a list of URLs [say, if they were in a text file, for example] and send the scraper out to browse those pages?

Thanks a lot for your comment. You are perfectly right! These have been on the to-do list for months. You just made their priority climb. Here is what we plan to do:
– Drag & drop to the Page (so that you can drag items from the catch for instance and use them as if it was a Web page),
– Allow to work on documents without a DOM (like a simple text file) as we work on HTML pages,
– Progressively add scripting capacities, so that one can rapidly perform a series of tasks on batches of sources….

I would second the request for a list-of-URLs capability. More importantly, I would second the compliment on this fantastic tool. It really fills a hole for tasks that are too large for manual extraction but too small for building a Perl script from scratch. I think that with even basic scripting capability and the ability to apply against a list of URLs, this will be a powerful tool.

I.e. the guess formats the data into columns,
but there is one column I would like to expand into 3 columns.
Can I access the scraper created by the guess and extend it?
That way I can do less work.

Thanks in advance.

2. From the guess there is a column which contains a URL; can we get OutWit to follow the URL and then further extract the data on the following page?
I.e. like an “explore links of this page”, but applied to only one or two columns of the table.
A bit like a recursive find with a depth of 1.

I am sad to say it, but the answer is no. We agree that this is a necessary feature, however, and are discussing how to add it in a future release. It’s no easy task, so this could take a while. Thank you for your patience and please, keep the questions and/or suggestions coming. We love hearing from our users and couldn’t do this without you.

It’s really a fantastic tool; I sincerely thank the developers who created it.
It’s a mind-blowing tool. Thank you a lot for giving this world a better tool, which it is always in need of. Apart from all this, it’s free. Thank you, and God bless you.

IT WORKS!!! AMAZING!
In my previous question, I didn’t look in the right place!
In the Extraction menu (on the left), I had my results!
So, for everyone: if you check “move to the catch” and “save the files in the catch”, it’s all good!
Man, it’s an incredible piece of work!

Thanks for the amazing program. I do miss one feature though.
The emails view is good, but it lacks the text that is displayed for those emails (usually the name of the person to whom the email belongs).

It could be solved either in the emails view or in the scraper. But the scraper lacks the ability to catch the HTML tags, so I can catch the name of the person but not their email.

Hello and thanks. (Please, do use our feedback form on outwit.com for this type of message; it allows us to keep our bug/wish list up to date.)
We are replacing returns by “;” (after having replaced all “;” by “,”) to make cutting and pasting to a spreadsheet easier. It is not a completely satisfying solution, but in most cases it is a better choice than simply removing the returns, as it allows you to recover a little of the original layout in the destination program by replacing the “;” back with returns. We are adding a “clean text” option (checked by default) in all views of the next version. It means that, if you uncheck it, you will be able to keep HTML tags like <br> in the scraped text.
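The round trip described above can be sketched in a few lines of Python (the function names are hypothetical, chosen here for illustration):

```python
def flatten_for_spreadsheet(text):
    """Replace ';' by ',' first, then returns by ';', so that the
    original line breaks survive as semicolons in a single cell."""
    return text.replace(";", ",").replace("\n", ";")

def restore_layout(cell):
    """In the destination program, turn the semicolons back into
    line breaks to recover a little of the original layout."""
    return cell.replace(";", "\n")

scraped = "line one;\nline two"
flat = flatten_for_spreadsheet(scraped)
print(flat)  # → line one,;line two
```

Note the trade-off: any “;” that was in the original text comes back as “,” after the restore, which is why the solution is described as not completely satisfying.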

Actually, this gave us a pretty simple idea: we will add a find/replace function in all datasheets of a next version. It should help in these cases.

Since the last version, it’s possible to keep HTML tags in scrapers (and also in tables and lists). You’ll see a “Clean data” checkbox in the bottom panel; just uncheck it.
We’ll probably work later on getting text in the email view.

For this, you can use the right arrow (next) or the double right arrow (browse) if they are active. You can also collect a list of URLs in your catch for instance, select them and use the right click menu option ‘Browse through selected URLs’.

(NOTE: For technical questions, please, do use the feedback link rather than posting a comment to the blog for support tickets to be followed.)

@Jim: The ‘Apply to URL’ field in the scraper can be a partial URL, a RegExp, or just a string to be found in the current URL, which is how the program decides whether the scraper applies to the page. The idea is that, once on a page, simply going to the ‘scraped’ view will execute the process. Imagine you put ‘mySite.com/search?’ in the Apply to URL field: the scraper will then apply to any result page on mySite; in this case, we cannot load the page, as we only have a partial URL. Macros, however, will allow this in the Pro version.
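The matching behavior described in this reply could be sketched like this in Python (an assumption about the logic for illustration, not the Hub’s actual code):

```python
import re

def scraper_applies(apply_to_url, current_url):
    """A scraper runs when its 'Apply to URL' string or RegExp is
    found anywhere in the current page's URL (assumed behavior)."""
    return re.search(apply_to_url, current_url) is not None

# A partial URL matches every result page on that site:
print(scraper_applies(r"mySite\.com/search\?",
                      "http://mySite.com/search?q=cities&p=2"))  # True
print(scraper_applies(r"mySite\.com/search\?",
                      "http://otherSite.com/home"))              # False
```

Because the field is only a filter on the current URL, it cannot by itself tell the program which page to load; that is why macros are needed to drive the browsing.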

The automatic scraper does a great job of only placing the text that would be seen on the web page, instead of the HTML code. However, when I build my own scraper, I have to build it where there are numerous HTML fields in a single capture. Hence, the scraper returns both the HTML code and the text that would appear on the website. I do not want the HTML code in my catch.
Is there a button, when you make your own scraper, that will not catch the HTML code but will select only the text that appears on the website?

If you have the pro version, you can use the ‘queries’ view and create a directory of URLs to be used manually or called in a macro. There are several ways to add URLs to a directory of queries, the drag and drop is the easiest.
If you are using the light version, the best way is simply to put your URLs in a .txt file and open the file with the Hub.
Cheers,
JC

You’ll be able to do all sorts of tweaks after reading the help.
Go to links, select all your URLs, right-click on one of them and choose “Browse through selected URLs…” or “Apply scraper to selected URLs…”. These are the simplest ways.

But, please, do read the help and do not use this blog for support tickets: we have a perfectly good support system on the site… as it says in red in the comment form you used to type this.

The last post was last week, in fact. And yes, it’s still very much operational and supported. This page, however, as it says in red at the top, is about a very old tutorial. You should use the ones built into the Hub instead. Feel free to send us bug reports or suggestions with our contact form.