Lightweight HTML data extraction with regular expressions

"HTML-structured-engine" is an engine that parses HTML into structured data.

Introduction

"HTML-structured-engine" is an open-source project of this site. It uses regular expressions to scrape web pages. Given a set of rules, it follows links, fetches the HTML content, and parses it into structured records.

This project is:

Lightweight: It does not store the whole HTML page; it parses the page and extracts only the wanted data. You can use it like an RPC call.

Configurable: A set of rules and patterns tells the engine which links to follow and which data to extract.

Multi-level: In complex cases where a single pattern is hard to write, you can write further patterns that parse the result of the previous step.

Able to follow links: It can follow links from one page to another during multi-level processing.
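
To make the rule-driven, multi-level idea concrete, here is a minimal Python sketch. It is not the engine's actual code or rule syntax: the rule format (a list of regex patterns, one per level), the convention that a named group called "link" is followed to the next level, the extract function, and the in-memory PAGES table standing in for real HTTP fetches are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List

def extract(url: str, rules: List[str], fetch: Callable[[str], str]) -> List[Dict[str, str]]:
    """Apply rules[0] to the page at url; follow captured links with rules[1:].

    Hypothetical convention: every named group becomes a record field,
    except a group named "link", which is fetched and parsed by the next rule.
    """
    if not rules:
        return []
    html = fetch(url)  # the page text is used here and then discarded
    records = []
    for match in re.finditer(rules[0], html, re.S):
        fields = match.groupdict()
        link = fields.pop("link", None)
        children = []
        if link is not None and len(rules) > 1:
            children = extract(link, rules[1:], fetch)
        if children:
            # Merge the fields of this level into each record found below it.
            records.extend({**fields, **child} for child in children)
        else:
            records.append(fields)
    return records

# Stand-in for the web: a real fetch function would download the URL.
PAGES = {
    "index.html": '<a href="a.html">Apple</a><a href="b.html">Banana</a>',
    "a.html": "<span class='price'>1.20</span>",
    "b.html": "<span class='price'>0.50</span>",
}

RULES = [
    r'<a href="(?P<link>[^"]+)">(?P<name>[^<]+)</a>',   # level 1: list page
    r"<span class='price'>(?P<price>[^<]+)</span>",      # level 2: detail page
]

records = extract("index.html", RULES, PAGES.__getitem__)
print(records)
```

Passing the fetch function as a parameter keeps the sketch runnable without network access; swapping in a function that downloads the URL would let the same rules run against live pages.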

Download

The source code is hosted on Google Code. You can download the stable archive or check out the source code through svn.