Introducing Iris: A Big Leap Forward in Drawing Meaning from the Web

Today we’re incredibly proud to be launching the next generation of our content normalization engine, codenamed Iris. It’s live right now and you’ll start to see improved results as you use Readability over the next few weeks.

For the past year, while Readability’s adoption and growth accelerated, we began thinking about how we wanted the core engine to evolve. Iris is the culmination of months of learning, re-architecting, design and development.

The foundational principle behind Iris is this: information on the web today exists in myriad contexts—and the sphere of differing contexts is growing. Iris is designed to untangle the web content you want and bring it forward.

The Context of Content

Much of the Readability engine’s success is attributed to the fact that a large portion of the reading web happens to share a common context: the article. It’s what we’ve focused on to date and we’re very proud of the work we’ve done there.

Still, today’s web includes far more than the generic article. Wikipedia is not just an article. A forum thread is not an article. A YouTube video, of course, is not an article. Not even articles themselves are all created equal.

With Iris, we’ve built an engine you might call abstract. Inspired by IBM’s Watson, the machine that beat contestants on Jeopardy!, Iris’s first order of business is to figure out what type of content source is at hand. It analyzes a page, determines the likely context based on a number of factors, and extracts what a human would expect as meaningful information from that source. Each context is fully malleable and can be modified and improved upon individually.
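To make the idea concrete, here is a rough sketch of what “determine the context, then extract” might look like. This is purely illustrative: the signals, context names, and extractors below are hypothetical, not Iris’s actual implementation.

```python
# Hypothetical sketch of context-aware parsing (NOT Iris's actual code).
# The parser first classifies the page, then dispatches to a
# context-specific extractor -- each of which can be improved independently.

def detect_context(page):
    """Guess the content type from simple signals in the page dict."""
    url = page.get("url", "")
    if "youtube.com" in url or "vimeo.com" in url:
        return "video"
    if "wikipedia.org" in url:
        return "encyclopedia"
    if page.get("post_count", 0) > 1:
        return "forum_thread"
    return "article"  # the common case Readability has always handled

# One extractor per context; each can be tuned without touching the others.
EXTRACTORS = {
    "video": lambda p: {"title": p.get("title"), "embed": p.get("embed_url")},
    "encyclopedia": lambda p: {"title": p.get("title"), "sections": p.get("sections", [])},
    "forum_thread": lambda p: {"title": p.get("title"), "posts": p.get("posts", [])},
    "article": lambda p: {"title": p.get("title"), "body": p.get("body")},
}

def parse(page):
    context = detect_context(page)
    return {"context": context, **EXTRACTORS[context](page)}

result = parse({"url": "https://www.youtube.com/watch?v=abc", "title": "Demo"})
# result["context"] == "video"
```

Classifying first also explains the speed win described later in the post: a video page never runs the article-body heuristics at all.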

This results in a dramatically improved user experience across the myriad apps, services and tools built on Readability. Expect better results on all sites, even non-article pages on sites like Wikipedia, YouTube, Vimeo and many more.

A Richer Readability Experience

Once the content type is determined, there’s still the complex task of knowing precisely what to tease out of a web resource. Even web articles, Readability’s wheelhouse, comprise much more than just a headline and body text. With Iris, Readability gains the ability to glean a whole new level of insight into which facets of a web resource matter to readers and developers: titles and headlines, subheadlines, lead images, videos, excerpts, authors, languages, and captions. Beyond just a great end-user experience, Iris represents a powerful bridge to the new ways content is being consumed beyond the browser.

Faster Engine, Faster Improvement

Finally, we wanted an engine that evolves and improves as quickly as the web. Rather than tethering improvements to software updates, Iris enjoys an extensible framework that allows for improvement from week to week and day to day. Users and developers will immediately benefit from updates—whether introducing support for new content types or better targeting.

As an added benefit, Iris is also significantly faster than the previous iteration, as determining a context before parsing allows us to skip a lot of unnecessary work. We’ll be able to scale the platform more effectively now, and that means a better experience for our users.

Live Now

Readers will see the benefits of Iris starting today. Developers will see the new fields (currently “dek” (subhead), lead image URL, text direction, and better content distillation including image captions) immediately, and those fields will be populated over the next couple of weeks as we bring new contexts online.
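For a sense of what a parse result with the new fields might look like, here is a small sketch. Only “dek” is named in this post; the other field names and the overall shape are assumptions for illustration, not Readability’s actual wire format.

```python
# Illustrative parse result shape (field names beyond "dek" are assumed).
import json

response = {
    "title": "Example Headline",
    "dek": "A one-line subhead summarizing the piece.",  # new: subhead
    "lead_image_url": "https://example.com/lead.jpg",    # new: lead image
    "direction": "ltr",                                  # new: text direction
    "content": "<p>Distilled body text, with image captions preserved.</p>",
}
print(json.dumps(response, indent=2))
```

Until a given context comes online, a client should expect the new fields to be absent or null and fall back gracefully.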

We’re incredibly excited about this step in our growth as a platform. We couldn’t have made such dramatic strides without the fantastic developers and partners we’ve had long conversations with about the value and diversity of the web. We’re happy to provide all we can for the network we’ve become a part of.

With Readability’s newfound flexibility, we’re now in an even better position to work with publishers and developers to make sure their content looks absolutely impeccable within Readability. If that sounds like you, please get in touch.

This is both fascinating and extremely vague. I’d love to hear more specifics.

It’d be particularly interesting to hear about how the new system treats images and videos. I suppose there are two choices:

1. It takes them from the source and re-hosts them. This seems less likely, as it is probably something the legal team said “Uh, don’t do that!” about.

2. It embeds such files, hosted in their original location, and re-serves them to users, thereby using them without the advertising that is intended to pay for their hosting. This is essentially what the world used to call “hotlinking” in simpler times. (But, to be sure, providing a “dramatically improved user experience” along the way!)

David, what happens when a large chunk of your audience is using ad-blocking/cookie-blocking extensions, or apps like Ad Muncher? You could foam about them being thieves, or maybe those people could be monetized with a model like this.