Using in Python

Output

NASA's Curiosity rover is continuing to help scientists piece together the mystery of how Mars lost its
surface water over the course of billions of years. The rover drilled into a piece of Martian rock called
Cumberland and found some ancient water hidden within it. Researchers were then able to test a key ratio
in the water with Curiosity's onboard instruments...

Using as a command line tool:

Note: Windows users may have to add the C:directory to their “path” so that the
command line tool works from any directory, not only the ..directory.

Requirements

requests
lxml

Motivation

After searching through the deepest crevices of the internet for some
tool|library|module that could effectively extract the main content
from a website (ignoring text from ads, sidebar links, etc.), I was
slightly disheartened by the apparent ambiguity caused by this
content-extraction problem.

The number of research papers I found on the subject far outweighs
the number of available open-source projects. This is my attempt at
balancing out the disparity.

In the process of coming up with a solution, I made two unoriginal
observations:

XPath’s select-all (//) and parent-node (..) queries, together with
functions such as ‘string-length’, are remarkably powerful when combined

Unnecessary machine learning is unnecessary

By assuming a minimum sentence length (a trivial assumption), one can
query for text nodes that satisfy it, then build a frequency
distribution (histogram) over the parents of those nodes. The argmax of
that distribution is the XPath shared among the most likely sentences.
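
The idea above can be sketched in a few lines with lxml. This is a minimal illustration, not necessarily the code this package ships: the function name `main_content_xpath`, the 20-character threshold, and the choice to skip only script and style tags are all assumptions made for the example.

```python
from collections import Counter

from lxml import html

# Assumed threshold: text nodes longer than this count as likely sentences.
MIN_LENGTH = 20


def main_content_xpath(page_source, min_length=MIN_LENGTH):
    """Return (xpath, texts) for the container holding the most long text nodes.

    Hypothetical helper written for illustration only.
    """
    root = html.fromstring(page_source)
    tree = root.getroottree()

    # Select-all (//), string-length and parent (..) used together:
    # find text nodes that are long enough, then step up to their elements.
    query = (
        "//body//*[not(self::script or self::style)]"
        "/text()[string-length(normalize-space()) > %d]/.." % min_length
    )
    elements = root.xpath(query)

    # Histogram over the XPath of each element's parent container.
    paths = [tree.getpath(el.getparent()) for el in elements]
    histogram = Counter(paths)
    if not histogram:
        return None, []

    # The argmax of the histogram is the container shared by most sentences.
    best_path, _count = histogram.most_common(1)[0]
    texts = [
        el.text_content().strip()
        for el, path in zip(elements, paths)
        if path == best_path
    ]
    return best_path, texts
```

Calling `main_content_xpath(page_source)` on a page's HTML string returns the winning container's XPath and the text of the sentence-like elements directly beneath it. Ties in the histogram are broken by first occurrence, which is an arbitrary choice for this sketch.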

The results were surprisingly good. I personally prefer this approach to
the others as it seems to lie somewhere in between the purely rule-based
and the drowning-in-ML approaches.