Now we start a new Scraper Test Drive stage called 'Invalid HTML'. How do scrapers behave with broken HTML code? On the whole they did well, with one nearly universal problem: failing to recognize a link with unmatched quotes.

This scraper extracts by regex, so one needs to set the extraction pattern with regular expressions or something similar. That might suit scraping invalid HTML, but one cannot predict what the mistake in the target page will be. So I did the scrape with one general pattern. Result:

It does not pick up the link with unmatched quotes (<a href="scrapetools.com'>). Since you can see "проверка (utf-8) wrong header" but not "(windows-1251) wrong meta", this scraper pays more attention to the HTTP header than to the meta tag.

The scraper paid more attention to the HTTP header than to the meta tag.
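To make the header-versus-meta distinction concrete, here is a minimal Python sketch of how a header-first tool might choose the encoding. It is only an illustration, not the logic of any scraper tested here; the function name `pick_charset` and the 2048-byte lookahead are my own assumptions.

```python
import re

def pick_charset(content_type_header, html_bytes, default="utf-8"):
    """Hypothetical header-first charset choice: HTTP header, then meta tag."""
    # 1. Prefer the charset parameter of the HTTP Content-Type header.
    m = re.search(r'charset\s*=\s*["\']?([\w-]+)',
                  content_type_header or "", re.IGNORECASE)
    if m:
        return m.group(1).lower()
    # 2. Otherwise look for a charset declared in a <meta> tag near the top.
    head = html_bytes[:2048].decode("ascii", errors="ignore")
    m = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)',
                  head, re.IGNORECASE)
    if m:
        return m.group(1).lower()
    # 3. Nothing declared anywhere: fall back to the default.
    return default
```

A meta-first tool would simply swap steps 1 and 2, which is why the same test page can come out readable in one scraper and garbled in another.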

As for extracting links with unmatched quotes, Content Grabber can be programmed to do it. Just grab the whole area, choose Inner HTML and apply a regex in the transformation script to refine the link.
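The exact pattern used in Content Grabber is not shown here. As a rough illustration of the idea only, the Python sketch below pulls a link out of a grabbed chunk of inner HTML even when the opening and closing quotes do not match. The pattern and the names `HREF_RE` and `extract_links` are my own assumptions, not Content Grabber's transformation script.

```python
import re

# Hypothetical pattern: accept either quote type (or none) after href=,
# then capture everything up to the next quote, whitespace or '>'.
HREF_RE = re.compile(r"""href\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)

def extract_links(inner_html):
    """Return all href values found in the HTML fragment, quotes stripped."""
    return HREF_RE.findall(inner_html)

# Usage: a link opened with a double quote and closed with a single one.
broken = '<a href="scrapetools.com\'>link</a>'
links = extract_links(broken)
```

Because the capture group stops at any quote character, the pattern does not care whether the closing quote matches the opening one. The trade-off is that it would also cut short a URL that legitimately contains a quote or a space.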

Summary

Overall the scrapers performed satisfactorily, typically passing 5 of the 7 tasks. Web Content Extractor, Visual Web Ripper and Content Grabber did best, passing 6 of the 7 tasks each. WCE could scrape the links with unmatched quotes, while VWR and CG are good at applying a regex to a chosen page area (text transformation). The rest failed to recognize unmatched quotes (a single quote ' in place of a double quote "). The scrapers also differed in whether they gave priority to the meta tag or to the HTTP header. See the table above.