For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.

It’s also very handy for content extraction, which is why I wanted to port it to PHP in the first place. Here’s an example of how to use the PHP port:

Differences between the PHP port and the original

Arc90’s Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page’s CSS styles have been applied and JavaScript code executed. This PHP port does not run inside a browser. We use PHP’s ability to parse HTML to build our DOM tree, but we cannot rely on CSS or JavaScript support. As such, the results will not always match Arc90’s Readability. (For example, if a web page contains CSS style rules or JavaScript code that hide certain HTML elements from display, Arc90’s Readability will dismiss those from consideration, but our PHP port, unable to understand CSS or JavaScript, will not know any better.)

Another significant difference is that the aim of Arc90’s Readability is to re-present the main content block of a given web page so users can read it more easily in their browsers. Correct identification, clean-up, and separation of the content block is only a part of this process. This PHP port is only concerned with that part. It does not include code that relates to presentation in the browser (Arc90 already do that extremely well), and for PDF output there’s FiveFilters.org’s PDF Newspaper.

Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don’t want to do because it makes debugging and updating more difficult), I’ve tried to make it a little more developer-friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.

I’m a French speaker. I tried it on a French page and found a bug with accented characters (I know a lot of languages use characters such as À, É, È and so on) on a website not using the normal É. Is there a way to fix it?

thinkery: PHP Readability does not automatically convert relative URLs to absolute ones, but it’s not difficult to do. Is that what you’re trying to do?
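As a rough illustration of that conversion, here is a minimal sketch of a resolver. The function name `make_absolute_url` is hypothetical and is not part of PHP Readability. It assumes an http(s)-style base URL with a host, ignores user:password credentials, and only approximates the standard resolution rules.

```php
<?php
// Hypothetical helper (not part of PHP Readability): resolve $relative against $base.
function make_absolute_url(string $relative, string $base): string
{
    // Already absolute (starts with a scheme such as "http:" or "mailto:").
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $relative)) {
        return $relative;
    }
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $relative; // Base URL is unusable, so leave the link untouched.
    }
    // Protocol-relative link, e.g. //cdn.example.com/a.js
    if (strpos($relative, '//') === 0) {
        return $parts['scheme'] . ':' . $relative;
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath  = isset($parts['path']) ? $parts['path'] : '/';
    $baseQuery = isset($parts['query']) ? '?' . $parts['query'] : '';

    if ($relative === '') {
        return $root . $basePath . $baseQuery;
    }
    if ($relative[0] === '#') {
        return $root . $basePath . $baseQuery . $relative;
    }
    if ($relative[0] === '?') {
        return $root . $basePath . $relative;
    }
    // Split off the query string and fragment so dot segments are only resolved in the path.
    preg_match('/^([^?#]*)(.*)$/s', $relative, $m);
    $path   = $m[1];
    $suffix = $m[2];
    if ($path[0] !== '/') {
        // Relative to the directory of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }
    // Resolve "." and ".." segments.
    $rawSegments = explode('/', substr($path, 1));
    $segments = [];
    foreach ($rawSegments as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }
    // A path ending in "." or ".." refers to a directory, so keep the trailing slash.
    $last = end($rawSegments);
    if ($last === '.' || $last === '..') {
        $segments[] = '';
    }
    return $root . '/' . implode('/', $segments) . $suffix;
}

$base = 'http://example.com/blog/post.html';
$image = make_absolute_url('images/pic.png', $base);
$style = make_absolute_url('../style.css?v=2', $base);
```

To apply it to extracted content, one option is to loop over the `href` and `src` attributes in the resulting DOM and pass each value through the helper together with the page URL.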

Regarding github contributions: the source code has now moved to code.fivefilters.org using Indefero. You can now grab it with git. I hope that makes it easier for those of you who’d like to fork it and modify it. If you do make changes, please share them – I’ll consider incorporating any changes once tested.

Hi,
this is a great tool. It works like a charm, and I will use it in combination with the Bing search engine. Is there a known algorithm for extracting this kind of information? I just want to learn about the different algorithms used for this purpose.

Al: sorry, there was an error in the example code on this page (the example in the repository should work fine). tidy_parse_string expects the character encoding in the third argument but in the code I’d posted up it was being passed as the second argument. I’ve fixed it now by adding an empty array() as the second argument – please try copying again from the code on this page and let me know if you still get an error.
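To make the argument order concrete, here is a minimal sketch of the corrected call. The wrapper name `tidy_clean` is hypothetical, and returning the input unchanged when the tidy extension is not installed is an assumption made for this illustration, not necessarily what the example on this page does.

```php
<?php
// Hypothetical wrapper showing the argument order of tidy_parse_string().
function tidy_clean(string $html): string
{
    // The tidy extension is optional, so fall back to the raw HTML without it.
    if (!function_exists('tidy_parse_string')) {
        return $html;
    }
    // 1st argument: the markup, 2nd: config options (empty here), 3rd: encoding.
    $tidy = tidy_parse_string($html, array(), 'utf8');
    if ($tidy === false) {
        return $html;
    }
    $tidy->cleanRepair();
    return tidy_get_output($tidy);
}

$clean = tidy_clean('<p>Hello <b>world</p>');
```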

Al: that will work if you expect $url to have exactly one redirect. But if the URL returned by get_redirect_url() has further redirects, you might not catch the hash-bang.

A more robust solution would be to follow redirects one by one, resolving relative URLs and rewriting any hash-bangs you encounter. A simpler option is to use cURL: let it handle redirects, grab the effective URL (the final URL it fetches, see http://www.php.net/manual/en/function.curl-getinfo.php), and if that contains a hash-bang, rewrite it and fetch again. I haven’t tested whether cURL preserves the fragment identifier when it returns the effective URL, though. If it doesn’t, this solution will be no good.
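The rewriting step itself could look something like the sketch below. It assumes the target format is the `_escaped_fragment_` query parameter from Google's AJAX crawling scheme, which the comments above do not spell out. The function name `rewrite_hash_bang` is hypothetical.

```php
<?php
// Hypothetical helper: turn "#!fragment" into "?_escaped_fragment_=fragment".
// Assumes the _escaped_fragment_ convention; URLs without "#!" are returned unchanged.
function rewrite_hash_bang(string $url): string
{
    $pos = strpos($url, '#!');
    if ($pos === false) {
        return $url;
    }
    $base     = substr($url, 0, $pos);
    $fragment = substr($url, $pos + 2);
    // Percent-encode only characters that would break the query string:
    // control characters and space, "#", "%", "&", "+", and bytes 0x7F-0xFF.
    $escaped = preg_replace_callback(
        '/[\x00-\x20#%&+\x7F-\xFF]/',
        function (array $m): string {
            return rawurlencode($m[0]);
        },
        $fragment
    );
    // Append to an existing query string if the URL already has one.
    $separator = (strpos($base, '?') === false) ? '?' : '&';
    return $base . $separator . '_escaped_fragment_=' . $escaped;
}

$rewritten = rewrite_hash_bang('http://twitter.com/#!/fivefilters');
```

With the cURL approach, the same function would be applied to the effective URL before fetching a second time, provided the fragment survives in that value.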

Al: I’ll soon be collecting URLs of pages which fail extraction in an effort to improve PHP Readability. I’ve deliberately held off the desire to change the code because at the moment I don’t have a decent test framework in place to allow me to see the impact of the changes on sites other than the one in question. Once I have something in place, I’ll post up here and hopefully get help from anyone interested in improving the PHP Readability code.

I think with the new readability.com service Arc90 are unlikely to continue developing their open source version, so perhaps a community effort can keep it alive.

Just want to say thank you for this code. There are other PHP readability libraries popping up on GitHub, but none seem to work as well as this one. Would you be willing to create a repository for this on there too?

I’m curious if it’s possible to use this to extract content from custom comment tags. For example, if the page I’m looking at contains something like: <code><!-- MYTAG myname=myvalue --></code>, is there a clear way to pull out the name and value using Readability?

<em><strong>By Emily Kaiser, Asia economics correspondent</strong></em>
SINGAPORE – The roller-coaster ride for Asian currencies, which saw only the yen and yuan post significant gains for the year against the U.S. dollar, is set to continue in 2012.
While Japan actively sought to stem the yen’s rise — drawing U.S. criticism last week — China intervened to ensure the yuan ended the year at a new high. Both currencies appreciated roughly 5% in 2011 against the dollar.
The opposite approaches illustrate a dilemma facing Asian policymakers as they try to smooth out foreign exchange rate volatility, which shows no sign of abating in the new year. If the currency is too strong, exports get more expensive. Too weak, and imported inflation spikes and domestic buying power fades.

Keyvan, another heuristic to consider is adding elements with an explicit style="display:none" to the unlikely candidates list. I ran into some examples where a hidden DIV contained a bunch of text that the user would never see, and modified my copy of the library to throw these out.
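One way such a heuristic might be sketched, as a pre-processing step on the DOM rather than as a change to the library's candidate list, is shown below. The function name `remove_hidden_elements` is hypothetical. It only catches inline `style` attributes, so anything hidden through a stylesheet or script is still missed.

```php
<?php
// Hypothetical pre-processing step: drop elements hidden with an inline display:none.
function remove_hidden_elements(DOMDocument $dom): int
{
    $xpath = new DOMXPath($dom);
    // translate() lowercases the style value and strips spaces,
    // so "DISPLAY: none" matches as well as "display:none".
    $query = "//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ ', "
           . "'abcdefghijklmnopqrstuvwxyz'), 'display:none')]";
    $removed = 0;
    // Copy the matches into an array before modifying the tree.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
            $removed++;
        }
    }
    return $removed;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div style="display: none">secret text</div><p>visible text</p></body></html>');
$count = remove_hidden_elements($dom);
$output = $dom->saveHTML();
```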

Hey Keyvan,
Thanks for sharing! Quick question about images.
In a comment above, you were saying you were going to look into it as they seem to get killed in the process. Have you worked on a fix?
I’d like to be able to use them along with text in a small app I’m building.
Thanks!

Frank: regarding images, we’ve made a few changes to PHP Readability that will go into the release of Full-Text RSS 3.1. The changes should preserve more images and embedded videos. Once we’re ready with that release I’ll update the PHP Readability code linked here.

Hello. I’ve tried the class you posted and it’s really great! I’ve tested it on different types of web pages and on most of them it gives awesome results!

But there’s a type of web page that holds multiple blocks of content of similar size. I turned debug on and it showed me scores of 42 to 52, and I think that instead of grabbing just one of them it would be better to get several.

I was thinking about having some threshold, say 20-30% of the top candidate’s score, and taking all candidates that fall within it. In my case 20% of 52 is 10.4, so all candidates with a score above 41.6 (52 - 10.4) would be included in the output.
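The threshold idea can be sketched independently of the library as a filter over a list of scores. The function name `select_top_candidates` and the array-of-scores input are assumptions for illustration. The real grabArticle code keeps scores on DOM nodes, and this sketch assumes the top score is positive.

```php
<?php
// Hypothetical filter: keep every candidate whose score is within
// $threshold (a fraction, e.g. 0.2 for 20%) of the top score.
function select_top_candidates(array $scores, float $threshold = 0.2): array
{
    if ($scores === []) {
        return [];
    }
    $top = max($scores);
    // For a top score of 52 and a 20% threshold: 52 - 10.4 = 41.6.
    $cutoff = $top - $top * $threshold;
    // array_filter preserves the original keys of the surviving candidates.
    return array_filter($scores, function ($score) use ($cutoff) {
        return $score > $cutoff;
    });
}

$scores = ['a' => 52, 'b' => 42, 'c' => 41, 'd' => 47];
$selected = select_top_candidates($scores, 0.2);
```

Here candidates `a`, `b` and `d` pass the cutoff, while `c` (41) falls just below it.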

Before I dive into rewriting it for my needs I wanted to ask this: do you have a version that extracts X top candidates instead of one or the way I described with the threshold? I’ve looked into the grabArticle code and it looks like it won’t be a quick fix to implement something like that.

Hi Michael, that’s interesting. I’m not aware of a version that does that. For use on FiveFilters.org, we write custom extraction rules for sites where PHP Readability doesn’t extract what we want. In cases where we want to extract multiple elements, we use XPath to select them.

I’m afraid I can’t help with what you’re trying to achieve, but it does sound interesting.
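For anyone curious about the XPath route mentioned above, here is a minimal sketch of selecting several blocks at once with PHP's DOM extension. The markup and the class name `story` are made up for this example. Real extraction rules depend on the site in question.

```php
<?php
// Made-up markup: two content blocks separated by an unrelated block.
$html = '<html><body>'
      . '<div class="story"><p>First</p></div>'
      . '<div class="ad">Buy</div>'
      . '<div class="story lead"><p>Second</p></div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match every div whose class list contains the word "story".
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' story ')]";

$blocks = [];
foreach ($xpath->query($query) as $node) {
    $blocks[] = trim($node->textContent);
}
```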
