
In this post, I’ll start to share our experiences with different web scrapers on the Testing Ground project. The first test, which I thought would be the simplest one, proved to be irksome and discouraging. With a struggle, I completed the test drive on Table Report. This test evaluates the ability of different scrapers to cope with difficult table layouts: merged cells, missing values and so on. See the test table layout:

Overview

Basically, the scrapers are well able to extract a simple HTML table layout, but a sophisticated table layout is much more difficult for them.

OutWit Hub did the best at table scraping, far superior to its competitors except Visual Web Ripper. It grabs tables or lists as they are and immediately allows them to be exported, so spending no time on project creation is one of its basic merits. OutWit Hub does handle merged cells, but it failed to keep the table header column names (Product, Items, Amount, etc.) in separate columns. See the complex header:

Quarter | Product 1 (split into Items and Amount sub-columns) | Total Amount

The program could extract only the whole table, not particular rows from it. The upcoming release, version 3.1, will allow multiple selection criteria in the bottom panel on all pages of the software, making it possible to choose rows after “x” and before “y”. OutWit Hub also does not recreate merged cells (such as the ‘Total for 2000’ value) in the final table, though it grabs the values from merged cells and duplicates them into separate cells. Yet, unlike most other scrapers, it doesn’t shift values when splitting merged cells; instead, it generates the needed cells and auto-fills them. It also parses vertically merged cells excellently. Most scrapers treat a merged cell as a single cell in a row, so a data shift is unavoidable:
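The fill-instead-of-shift behaviour described above can be sketched in plain Python. This is a minimal illustration under my own assumptions, not OutWit Hub’s actual code: a small HTML parser that expands rowspan/colspan attributes so every merged value is duplicated into each cell it covers, keeping the grid rectangular.

```python
from html.parser import HTMLParser

class TableDenormalizer(HTMLParser):
    """Expand rowspan/colspan so each merged value is copied into every
    cell it covers, instead of shifting the cells that follow it."""
    def __init__(self):
        super().__init__()
        self.grid = []        # finished rows
        self.pending = {}     # (row, col) -> value owed by an earlier rowspan
        self.row = -1
        self.col = 0
        self.rowspan = self.colspan = 1
        self.text = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.row += 1
            self.col = 0
            self.grid.append([])
        elif tag in ("td", "th"):
            self._fill_pending()
            self.rowspan = int(attrs.get("rowspan", 1))
            self.colspan = int(attrs.get("colspan", 1))
            self.text = ""

    def handle_data(self, data):
        if self.text is not None:
            self.text += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.text is not None:
            for dc in range(self.colspan):
                self.grid[self.row].append(self.text)   # fill this row
                for dr in range(1, self.rowspan):       # owe rows below
                    self.pending[(self.row + dr, self.col + dc)] = self.text
            self.col += self.colspan
            self.text = None
        elif tag == "tr":
            self._fill_pending()

    def _fill_pending(self):
        # insert values owed by rowspans from earlier rows
        while (self.row, self.col) in self.pending:
            self.grid[self.row].append(self.pending.pop((self.row, self.col)))
            self.col += 1

p = TableDenormalizer()
p.feed('<table><tr><td rowspan="2">2000</td><td>Q1</td><td>100</td></tr>'
       '<tr><td>Q2</td><td>120</td></tr></table>')
print(p.grid)  # [['2000', 'Q1', '100'], ['2000', 'Q2', '120']]
```

Note how the vertically merged “2000” cell is auto-filled into the second row rather than the second row’s values shifting left.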

Visual Web Ripper performed better than OutWit, though with much time wasted. The only way Visual Web Ripper beats OutWit is that it can be set to extract only a certain year’s data from a table. Initially, I lost much time trying to traverse columns in PageArea patterns while creating a project. Visual Web Ripper support worked on it, providing results I never imagined would be possible. The support team suggested that users would simply use a direct copy/paste into Excel, which would yield a perfect result. However, the software doesn’t just grab HTML data and save it into Excel; data may be saved to a structured database or other formats as well.

Support team: It’s not impossible to use the software to copy the HTML table into Excel, but it’ll not yield a perfect result like a direct copy/paste into Excel.

You need to follow these steps to extract a simple copy of the HTML table:

Use a Transformation element to denormalize the table.

Use two PageArea templates, one for each direction in the table (rows and columns).

Configure the second PageArea template to extract data into the parent table instead of a separate table.

The “Save data method” option is configured on the Advanced options tab:

Visual Web Ripper can also be configured to extract only part of a table. For this, use an Input Parameter for the year, so you can specify the year you are interested in before running the project. Define a yearArea template with input parameters right after the table denormalization template:

Scraping the table with Content Grabber, especially filtering for certain years, took me more effort.

To set up all the following commands, you need to switch the Content Grabber layout to Expert Layout in the Application Settings ribbon menu. (Otherwise, all the advanced features are hidden so as not to confuse novice users.)

Choose a case (link)

First, we create a Navigate Link command that will click a Case link (e.g., 1 product, 1 quarter). We can see its XPath in the command configuration, on the Properties tab:

You may also have noticed the position() function in it, a custom Content Grabber XPath function. The custom Content Grabber XPath functions are described in the Content Grabber documentation.

We may change the XPath expression of the command on the Selection tab of the upper ribbon menu (there you can also add a second or third XPath for the command):

Now we set an input parameter, case, under Agent Settings tab -> Input Parameters. This case number will later be fetched when the Case command is executed:

Denormalize table

After you’ve defined the Case command, you definitely need to do a table transformation. Select the table and apply a transformation command, which should be Denormalize HTML table. This removes all the joined columns/rows and makes the table easily scrapable.

To extract all the table data, we create two commands: Rows and Columns.

How to choose rows of a particular year?

Let’s reconfigure the Rows command. To do this, we change its XPath expression (on the Selection tab) to the following:

The first XPath, which selects a certain year when the year input parameter is set, looks like this:

Mozenda isn’t convenient for scraping complex tables. It failed to avoid data shifts caused by merged cells, though it did process an indefinite number of table columns, and you can tune it to get certain rows from the table.

Mozenda support explained that unstructured data (like merged/unmerged cells) don’t fit a pattern, but “it could capture data indefinitely if the data followed the same pattern… Mozenda can select all the items in an infinite grid if the pattern holds true…. but I don’t think that we will be able to gather the data in these tables without some distortion or having a finite amount of columns or rows”.

Mozenda proved able to scrape data from tables that may change dimensions. A support engineer sent us an Agent that does it. Yet the output is a list of items rather than a table.

If you want only a certain year’s data scraped in a project: as you build an Agent, in the Actions area (bottom left corner) choose ‘Begin item list – <custom name>’; then, in the Tools area (upper left corner), in the ‘Selected action‘ sub-area, click ‘Refine list‘ and tick ‘Only include items at the following indexes‘ (there you define the rows pertaining to a certain year). See the image below:

With this scraper, I failed to build a project that could extract anything, and soon gave up and asked for help. The support replied: “The scraper is for fixed-dimensions data only”. They composed an extraction project, but it wasn’t up to standard.

They created three patterns for extracting Year, Quarter and the Items & Amount, respectively.

Helium Scraper did well, though only after I asked support for help, which was kindly given. Helium Scraper coped with the merged cells. For this, a special premade JS gatherer needs to be loaded first (ask support); then check Project –> Options and Select Property Gatherers under the kind-defining tab.

After some communication, the support team sent me a project file that completes the extraction.

The Helium Scraper team: “The database-way to store this kind of information is to use multiple related tables.” Two tables are related by year: the Details (Items & Amount) data table and the Year (totals for each year) table (see the image below). This information can then be exported as XML, HTML or any other format using the Custom Export feature.

A multi-table join is easy. After the data has been extracted into tables, go to the ‘Queries‘ tab in the database panel, write a query and run it. The following image shows an SQL query run on the two extracted tables to create a joined table:
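The join the image illustrates can be reproduced with any SQL engine. Below is a self-contained sketch using Python’s sqlite3; the table and column names are my own illustrative assumptions, not Helium Scraper’s actual schema.

```python
import sqlite3

# Two related tables, keyed by year, as described above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Details (year INT, quarter TEXT, items INT, amount REAL);
CREATE TABLE YearTotals (year INT, total REAL);
INSERT INTO Details VALUES
    (2000, 'Q1', 5, 100.0), (2000, 'Q2', 7, 140.0), (2001, 'Q1', 3, 60.0);
INSERT INTO YearTotals VALUES (2000, 240.0), (2001, 60.0);
""")

# Join the per-quarter details with the per-year totals.
rows = con.execute("""
    SELECT d.year, d.quarter, d.items, d.amount, y.total
    FROM Details d JOIN YearTotals y ON d.year = y.year
    ORDER BY d.year, d.quarter
""").fetchall()
for r in rows:
    print(r)
```

The same SELECT, typed into Helium Scraper’s Queries tab against the extracted tables, produces the joined table shown.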

Basically, you’d need to create kinds for every product. As for setting up an indefinite number of table columns in a project, you need to create additional JS gatherers, which does require some advanced skills.

When I asked for year-specific information to be extracted from a table, support proposed creating a new kind, a kinds intersection (with the Create Set kind button in the kinds editor). This kind selects only the cells under the year that matches the year in the YearFilter kind, which has a single property, JS_Year, equal to the year of interest. Another actions tree then extracts only that one year, because it uses the filtered version of the kind:
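The filtering idea behind the kinds intersection can be stated in a few lines of plain Python. This is only a conceptual sketch with illustrative names, not Helium Scraper’s API: keep the cells that fall under the header column matching the year of interest.

```python
# Toy table: one header row of years, then data rows aligned beneath it.
header = ["Quarter", "2000", "2001", "2002"]
rows = [
    ["Q1", "100", "90", "110"],
    ["Q2", "120", "95", "130"],
]

js_year = "2001"             # stands in for the JS_Year property
col = header.index(js_year)  # the column the 'kinds intersection' selects

print([row[col] for row in rows])  # ['90', '95']
```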

My impression is that for a difficult scrape with Helium Scraper, you’ll need to ask the support center.

A Screen Scraper project is a challenge to compose. I requested help from support, and soon Jason (a support member) sent me a project with patterns for the extraction.

To handle the indefinite column number, the support team composed two patterns, applied sequentially, and the following script to save data to a log file. This regex-like approach with unfolding patterns improved my opinion of Screen Scraper. See the patterns:
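The two-stage pattern idea can be sketched with ordinary regular expressions: the first pattern isolates each table row, the second unfolds the cells inside it, so the column count never has to be fixed in advance. These are illustrative regexes of my own, not the actual Screen Scraper extractor patterns.

```python
import re

html = ("<table><tr><td>Q1</td><td>100</td></tr>"
        "<tr><td>Q2</td><td>120</td><td>130</td></tr></table>")

row_pattern = re.compile(r"<tr>(.*?)</tr>", re.S)    # stage 1: rows
cell_pattern = re.compile(r"<td>(.*?)</td>", re.S)   # stage 2: cells

# Each row may unfold into a different number of columns.
rows = [cell_pattern.findall(r) for r in row_pattern.findall(html)]
print(rows)  # [['Q1', '100'], ['Q2', '120', '130']]
```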

Screen Scraper does extract the data, but it also has the merged-cells issue: the totals for each year, and the year designations themselves, were not captured. Truly, this task is not for a casual user. The result is in the session log (see the red boxes):

Screen Scraper support promised “to map each row to account for rowspans (colspans are easy)”, but hasn’t replied yet.

Web Content Extractor is good for scraping fixed-width tables, but lacks some expected features (merged-cell handling, etc.). The support team admitted: “Sorry, our software is not able to extract data from this type of table.” See the obvious bugs after scraping the sophisticated table layout:

After some struggle to harvest a table, I emailed SysNucleus, and the developers replied, “…Support for scraping data from HTML tables is low in WebHarvy. The tool has been designed with more focus on ability to scrape data from normal listings in webpages (search results), again focusing on ease-of-use.” According to them, it works better to “extract data from a site like yellow-pages where data is displayed in a well formatted manner”. See the poor result:

Web Data Extractor is focused on collecting lists of emails, URLs, etc. The output was always a list of items. I asked the developers about table output, and the Spadix software specialist admitted, “Information in any case will be given as list of items. Current version doesn’t have this ability.” See the result at right.

Easy Web Extractor did poorly as well. The data shift is obvious, and it also failed to get the totals. Data were not found in some tables (too large and too small) when the project was applied to several table links. The developers said: “Easy Web Extract is not designed for excel-like table extracting. That means EWE doesn’t try to detect and convert any number from the table.” In spite of that, I think you can use this tool for a simple table layout extraction. See the image below:

Conclusion

After the first round of the Scraper Test Drive, we might nominate OutWit Hub and Visual Web Ripper for consistent HTML table data scraping, but Content Grabber did the best of all. Content Grabber proved to be visual yet highly developer-adjustable scraping software. It supported tables with an indefinite number of rows and columns, tolerated merged columns and rows, parsed currency formats, and was able to extract only a certain year from a given table.

Most of the scrapers require composing a new project for each table with a different column count, which diminishes a project’s value. Since most scrapers are designed to extract data from commercial web catalogs (e.g., yellow pages or eBay), their scope of work doesn’t include tabular data. Three web scrapers openly admitted to not being able to extract such tabular info. OutWit Hub stands out for its capability to scrape both tabular and listed data. Content Grabber and Visual Web Ripper proved to be monster scrapers in comparison with the others.

2 Comments

Vibhu Bansal, Oct 30, 2012 @ 12:50

Your comparison is indeed praiseworthy, but it missed an important parameter: ease of use for a layman. Most of these scrapers require complex configurations that demand programming skills and a high level of HTML understanding. It would also have been useful if you had considered and quoted the costs involved when support was needed. I am aware they charge for creating the project scripts too, and the price is astronomical.

Comparatively, our service is much cheaper, does all the dirty work, and provides the client with data in a format they can readily use.

Dear Vibhu,
You’re right, the difficulty of composing scraping projects in the off-the-shelf products differs, and I’ve briefly mentioned it in the post.
But I’d also like to emphasize that here we talk about off-the-shelf web scraping software, considered as “buy once, scrape forever”. You speak about a service; of course, scraping services can do anything on demand, but we do not consider them here.
Some projects are charged for, yet that might not be the case with all the software vendors.
Later, we’ll shed more light on the difficulty of composing a scraping project. Get in touch. Thanks.