Behind the Scenes at News Aggregator Topix.Net

Topix.net combines an excellent news search engine with two other hot technologies: local search and personalization.

The Topix database includes full text news stories from over 4,000 sources, including a great deal of content that's difficult to quickly access elsewhere. The real power of this nifty news search engine comes from its easy-to-use pre-built pages that aggregate news and other information into more than 150,000 topic-specific pages.

These specialized pages cover local news and information for every zip code in the United States. There are also news pages dedicated to specific companies, industries, sports teams, actors, and many other subjects.

We interviewed Rich Skrenta, CEO of Topix.Net, via email.

Q. Where did the idea for Topix.Net come from? What made you decide that this service was needed in the current marketplace? What does Topix.Net offer that's not available from other companies?

In 1998 we did a project called NewHoo, which was acquired by Netscape/AOL, and is now called the Open Directory Project (ODP). It used a massive group of volunteers to build the web's largest human-edited directory. The ODP now has 60,000 volunteer editors, and the data powers Google Directory.

Our team left Netscape/AOL in 2002, and rather than using human labor again, we wanted to explore emerging AI (artificial intelligence) techniques for classifying and extracting structured data from the web. The goal for Topix.net is to make a web page about everything -- every person, place, and thing in the world -- constantly machine-summarized from the Internet. Since the web can be a messy place, surfing a well-constructed encyclopedia based on live content from the web would be a win for users.

Rather than starting with a full web crawl, which covers 4 billion+ pages, we started with news, which has 4,000 sources and is very dynamic, high-quality content. We don't cover everything in the world yet, but we do have every place in the U.S., every sports team, music artist, movie personality, health condition, public company, business vertical, and many other topics.

Q. Can you share some background about how Topix.Net builds a page? Are pages built automatically or is there some human intervention? Is the technology your own? How long did it take to get it up and running?

We developed separate software modules to crawl, cluster and categorize articles. The heart of our system is a proprietary AI categorizer that uses a massive Knowledge Base (KB) to determine the geographic location and subject categorization for each story.

The final step is the Robo-Editor, which picks the best stories for display. For example, our 2004 Presidential Election page may have seen 1,000 articles for the past hour. The Robo-Editor's job is to pick the 10 best articles to show the user to give them a good overview of the news.
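The select-the-best-N shape of the Robo-Editor's job can be sketched in a few lines. The scoring function below is a hypothetical stand-in (Topix's actual ranking criteria are proprietary); it simply trades off source prominence against story age, which are two signals the interview mentions elsewhere:

```python
import heapq
import time

def score(article, now):
    """Hypothetical ranking: weight source prominence against story age.
    The real Robo-Editor's criteria are not public; this only
    illustrates the pick-top-N structure of the problem."""
    age_hours = (now - article["published"]) / 3600
    return article["prominence"] / (1.0 + age_hours)

def robo_edit(articles, now, n=10):
    """Pick the n best-scoring articles for display."""
    return heapq.nlargest(n, articles, key=lambda a: score(a, now))

# Simulate the "1,000 articles in the past hour" scenario from the text.
now = time.time()
articles = [
    {"title": f"Story {i}", "prominence": i % 7 + 1, "published": now - i * 600}
    for i in range(1000)
]
top = robo_edit(articles, now)
```

`heapq.nlargest` keeps the selection at O(n log k) rather than sorting all thousand candidates, which matters when every topic page is re-edited continuously.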

Our system is fully automated; there is no human involvement at any stage. We developed the technology in-house over the past two years. The AI was particularly tricky to get right, since an accuracy rate in excess of 99% was necessary to make the system useful.

Q. Do you have any plans to market your crawling and categorization technology as a source of revenue, or to provide your services to create Topix.Net pages for companies and other organizations?

We have a commercial feed business for companies that want to enhance their own website offerings with deeply categorized news content. Topix.net offers an extremely rich newsfeed -- in addition to the standard URL, title, and summary, we have the latitude/longitude of the news source, the latitude/longitude for the subjects of the story, the prominence of the news source, the subject categorizations, and more. We can also "geo-spin" any subject category, to produce a locally focused version. These features give us a lot of flexibility to customize feeds for clients.
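Since each story carries latitude/longitude metadata, "geo-spinning" a category can be pictured as filtering geo-tagged stories by distance from a target point. The field names, coordinates, and 50-mile radius below are illustrative assumptions, not Topix's actual feed schema:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * asin(sqrt(a))

def geo_spin(stories, lat, lon, radius_miles=50):
    """Keep only stories whose subject location falls within the radius."""
    return [s for s in stories
            if haversine_miles(s["lat"], s["lon"], lat, lon) <= radius_miles]

# Invented sample stories tagged with subject coordinates.
stories = [
    {"title": "Palo Alto council meeting", "lat": 37.44, "lon": -122.14},
    {"title": "Chicago transit update",    "lat": 41.88, "lon": -87.63},
]
local = geo_spin(stories, 37.39, -122.08)  # centered near Mountain View, CA
```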

We're also excited about applying our categorization technology to other areas beyond news, such as local web search.

Q. Are you crawling and aggregating web content other than news sources? Do you include press release material?

In addition to newspapers, Topix.net is crawling radio and TV station websites, college papers, and some high school papers and weblogs. We're also crawling government websites with "newsy" public information, such as police department crime alerts, health department reports, OSHA violation announcements, coast guard notices, and news releases from other city, county and state level government entities. We are crawling and including press releases too.

Our focus is on hyperlocal, deep coverage of the U.S. We love police blotters and little papers with extremely local coverage. If your local PTA has online meeting minutes, that's the kind of source we want to add.

Q. Does Topix.Net offer any type of RSS/syndication options?

We have an RSS feed for each of our 150,000 categories. This includes an RSS feed for every ZIP code in the U.S. Topix.net is the largest publisher of non-weblog RSS on the net.
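Consuming one of these category feeds requires nothing beyond a standard RSS 2.0 parser. The sketch below extracts headlines from a minimal, made-up feed document using only the Python standard library; the sample content stands in for a Topix category feed and is not its actual output:

```python
import xml.etree.ElementTree as ET

# A minimal, invented RSS 2.0 document standing in for a category feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example ZIP code news</title>
    <item>
      <title>Local road closure announced</title>
      <link>http://example.com/story1</link>
      <description>Main Street closed for repairs.</description>
    </item>
  </channel>
</rss>"""

def parse_headlines(rss_text):
    """Return (title, link) pairs for each item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

headlines = parse_headlines(SAMPLE_FEED)
```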

Each of our pages also has an "Add to My Yahoo" button, which drops Topix.net headlines onto your My Yahoo desktop. We worked with the My Yahoo team to pre-load 35,000 of our newsfeeds into their new RSS reader module.

In addition to the RSS feeds, we also have free javascript headline syndication. Website owners can easily add a Topix headline box from any of our categories to their site by including a bit of HTML.

Q. What are Topix.Net's current sources of revenue?

Website advertising and commercial newsfeed sales.

Q. What do you have in the pipeline to further enhance Topix? In other words, what will Topix.Net offer in a year that's not available today? What about local pages for areas outside of the U.S.?

Expanding beyond the U.S. to full worldwide coverage is something we'd like to do. We're also looking at adding personalization features to the site, and at applying our categorization technology to content beyond news.
