Streaming Fact Extraction for Wikipedia Entities at Web-Scale

Wikipedia.org is the largest online resource for free information and is maintained by a small number of volunteer editors. The site contains 4.3 million English articles; these pages can easily be neglected, becoming out of date. Any newsworthy event may require updates to several pages. To address this issue of stale articles, we create a system that reads a stream of diverse web documents and recommends facts to be added to specified Wikipedia pages. We developed a three-stage streaming system that builds models of Wikipedia pages, filters out irrelevant documents, and extracts facts relevant to Wikipedia pages. The system is evaluated over a 500M-page web corpus and 139 Wikipedia pages. Our results show a promising framework for fast fact extraction from arbitrary web pages for Wikipedia.
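The three-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the entity model here is a simple bag-of-words profile, the relevance filter is a term-overlap threshold, and the fact extractor keeps sentences mentioning the entity; all names (`EntityModel`, `build_model`, `is_relevant`, `extract_facts`, `recommend`) and the overlap heuristic are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class EntityModel:
    """Stage 1: a bag-of-words profile built from a Wikipedia page.
    (Illustrative stand-in for the paper's page models.)"""
    name: str
    terms: set = field(default_factory=set)

def build_model(name: str, page_text: str) -> EntityModel:
    return EntityModel(name, set(page_text.lower().split()))

def is_relevant(model: EntityModel, doc: str, min_overlap: int = 3) -> bool:
    """Stage 2: keep a streamed document only if it shares enough
    vocabulary with the entity model (hypothetical threshold)."""
    return len(model.terms & set(doc.lower().split())) >= min_overlap

def extract_facts(model: EntityModel, doc: str) -> list:
    """Stage 3: emit sentences that mention the entity as candidate
    facts to recommend for the Wikipedia page."""
    return [s.strip() for s in doc.split(".")
            if model.name.lower() in s.lower() and s.strip()]

def recommend(model: EntityModel, stream):
    """Run the filter-then-extract stages over a document stream."""
    for doc in stream:
        if is_relevant(model, doc):
            yield from extract_facts(model, doc)
```

In a real deployment each stage would be far richer (e.g. statistical relevance classification and learned extraction), but the staged streaming structure is the same: cheap filtering discards most of the web-scale stream before the more expensive extraction runs.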