Diffbot API uses visual learning to parse web content

Diffbot is making its visual learning application programming interface (API) available to developers who want to create apps that need to understand the structure of web pages efficiently. Traditionally, developers who wanted to understand a web page would start by looking at its code. From there, many algorithms can be used to extract information such as the article content or the author name. However, because each page’s HTML code is different, it is often difficult to get consistent results by looking at the HTML alone.

On the other hand, every web page is built for “humans”, and that’s precisely what Diffbot uses as the foundation of its technology. Instead of looking at the HTML code, Diffbot uses computer vision technology to determine the nature of the content. For example, a title often uses larger text, and the author name is usually near the top of the article. Of course, Diffbot’s algorithm can handle a variety of situations, but you get the point.

Diffbot has two APIs:

1/ On-demand processing of web pages. For example, this can be used to extract the elements of a web page that are of interest, like the title, content, and images, while ignoring other features like ads or navigation elements.

2/ A Follow API, which is used to detect changes in a web page and extract the relevant information that illustrates the change.
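To make these two ideas concrete, here is a toy Python sketch. It does not call Diffbot’s API and is not Diffbot’s algorithm: the `Block` type, its fields, the `extract` and `changed_fields` helpers, and the sample page are all hypothetical, and the sketch simply assumes that ads and navigation have already been labeled, which is the hard part in practice. It picks the title by largest rendered font, takes the author from the block nearest the top below the title, and then compares two extractions to report what changed.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered text block with hypothetical visual features."""
    text: str
    font_size: float     # rendered font size in pixels
    top: float           # distance from the top of the page in pixels
    kind: str = "text"   # "text", "ad", or "nav" (assumed to be labeled upstream)


def extract(blocks):
    """Pick title, author, and body text using simple visual cues."""
    # Ignore ads and navigation elements.
    content = [b for b in blocks if b.kind == "text"]
    # Title: the content block rendered with the largest font.
    title = max(content, key=lambda b: b.font_size)
    # Everything below the title, ordered from top to bottom.
    below = sorted((b for b in content if b.top > title.top), key=lambda b: b.top)
    # Author: the block nearest the top, right under the title.
    author = below[0].text if below else None
    body = " ".join(b.text for b in below[1:])
    return {"title": title.text, "author": author, "text": body}


def changed_fields(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    return {
        key: (old.get(key), new.get(key))
        for key in old.keys() | new.keys()
        if old.get(key) != new.get(key)
    }


# Made-up sample page: the ad has the biggest font but is filtered out.
page_v1 = [
    Block("Home | News | Reviews", 12, 0, kind="nav"),
    Block("Diffbot opens its API", 32, 40),
    Block("Buy cheap gadgets!", 40, 60, kind="ad"),
    Block("By Jane Doe", 11, 90),
    Block("Diffbot reads pages visually.", 14, 120),
]
# Same page later, with an edited headline.
page_v2 = [
    Block("Diffbot opens its API to developers", 32, 40) if b.top == 40 else b
    for b in page_v1
]

first = extract(page_v1)
second = extract(page_v2)
changes = changed_fields(first, second)
```

Each extracted field can be read on its own, for example `first["title"]`, and `changes` contains only the `title` entry because that is the only field that differs between the two versions.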

Information extracted from an Ubergizmo page. Each data chunk can be accessed independently.

It’s really up to developers to use these building blocks to create great applications, but I can tell you that if this works as advertised (I haven’t had time to try it yet), it should add a lot of value, because this kind of extraction is hard to build. For instance, AOL Editions is already using Diffbot’s technology.

The API is free up to a relatively large number of API calls. Beyond that limit, developers will have to pay “per API call”, which means that they will have to monetize their application. Companies that handle sensitive information can also get a license that runs on a private server inside their firewall.

Using computer vision technology to look at web pages is a great idea, and one that would bypass a lot of tricks designed for “bots”. Of course, you can expect some glitches here and there, but for most developers who need this type of functionality, this looks like a gold mine.