About

Background

The Bixo project came about because two different companies needed the same thing – a web mining toolkit that could easily fit into an existing Cascading-based workflow.

In discussing various ways to solve this problem, it became clear that refactoring Nutch to work in this environment would be a painful and error-prone process. In addition, the known limitations of Nutch would still need to be worked around, while the resulting massive fork would have little to no chance of being rolled back into the main Nutch codebase.

So the shortest distance between the two points was a new, slimmed-down implementation that satisfied the following constraints:

Used Cascading to manage internal workflow and to integrate with external data sources and sinks (outputs).

Supported only http and https protocols, at least initially.

Crawled white lists covering a limited number of discrete domains, efficiently yet politely.

Was testable at multiple levels (unit, integration, and simulated web crawl).

Powered By

The following is a partial list of companies using Bixo, along with any public details of use cases.

Thanks to YourKit for providing a free license to their excellent Java Profiler. YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit’s leading software products: YourKit Java Profiler and YourKit .NET Profiler.