Research Directions in the UF DSR Lab

I recently wrote an article for the CISE department newsletter describing the research we do in the DSR Lab. In particular, I highlighted the knowledge base expansion work we are doing in collaboration with Google Research, and the TREC Knowledge Base Acceleration competition we participated in over the summer. I am using it to kick off the UF CISE Data Science Research blog! Enjoy!

———————–

Turning Big Data to Big Knowledge

Daisy Zhe Wang is an Assistant Professor in the CISE department at the University of Florida. She is also the director of the UF Data Science Research Lab. She obtained her Ph.D. degree from the EECS Department at the University of California, Berkeley in 2011 and her Bachelor’s degree from the ECE Department at the University of Toronto in 2005. At Berkeley, she was a member of the Database Group and the AMP/RAD Lab. She is particularly interested in bridging scalable data management and processing systems with probabilistic models and statistical methods. She currently pursues research topics such as probabilistic databases, probabilistic knowledge bases, large-scale inference engines, query-driven interactive machine learning, and crowd-assisted machine learning. Her research is currently funded by DARPA, Google, Greenplum/EMC, SurveyMonkey, and the Law School at UF.

One of the main research projects Dr. Wang is currently working on aims at constructing and maintaining large-scale knowledge bases extracted from a plethora of text data, from the Web or other domain-specific sources. This research is motivated by the Google Knowledge Graph project. The Knowledge Graph (KG) is Google’s attempt to improve search by understanding the concepts (e.g., entities and relations) in documents and in queries, so that it can provide answers beyond keywords and strings. As far as we know, the current Google KG contains 580 million objects and 18 billion facts about relations between them. While this is the largest knowledge graph constructed, it is also a very sparse graph: on average, there are only ~30 relations per entity.
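The sparsity figure follows directly from the two counts quoted above. A quick check, assuming the average is simply the number of facts divided by the number of objects (how the average was actually computed is not stated in the article):

```python
# Counts quoted in the article for the Google Knowledge Graph.
objects = 580_000_000
facts = 18_000_000_000

# Assumption: "relations per entity" is taken as facts divided by objects.
avg_facts_per_object = facts / objects
print(f"{avg_facts_per_object:.1f} facts per object on average")
```

The result is consistent with the rounded ~30 figure.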

A large number of relations are missing because (1) only high-confidence data sources are used to construct the Google KG; and (2) some of the relations are never recorded explicitly in any of the data sources. Dr. Wang and her students are working on new algorithms and systems to expand the knowledge graph by inferring missing links using two methods. First, design a probabilistic knowledge base that can incorporate uncertain data sources in addition to the high-confidence data sources. The key techniques used here are Markov Logic Networks (MLNs) and Markov chain Monte Carlo (MCMC) inference algorithms. Second, develop a scalable statistical inference engine that can probabilistically deduce missing links based on an existing KB and a set of first-order rules. The key techniques used here are parallel inference algorithms over graphs on multi-core and/or distributed frameworks, and parallel query processing.
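To make the idea of probabilistically deducing a missing link more concrete, here is a minimal sketch of Gibbs sampling (one MCMC method) over a tiny, hand-grounded Markov Logic Network. It is an illustration under stated assumptions, not the DSR Lab's system: the entity names, the two rules, their weights, and every function name are hypothetical, and a real engine would ground the first-order rules automatically and run in parallel.

```python
import math
import random

# Hypothetical ground rules (body atoms, head atom, weight); illustrative only.
RULES = [
    (("born_in(Ann,Gainesville)", "city_in(Gainesville,Florida)"),
     "born_in(Ann,Florida)", 2.0),
    (("born_in(Ann,Florida)", "state_in(Florida,USA)"),
     "citizen_of(Ann,USA)", 1.0),
]

# Facts already in the knowledge base (treated as certain evidence here).
EVIDENCE = {
    "born_in(Ann,Gainesville)": True,
    "city_in(Gainesville,Florida)": True,
    "state_in(Florida,USA)": True,
}

# Missing links whose probabilities we want to estimate.
QUERY = ["born_in(Ann,Florida)", "citizen_of(Ann,USA)"]


def score(world):
    """Sum of weights of satisfied ground rules (log of the unnormalized probability)."""
    total = 0.0
    for body, head, weight in RULES:
        # An implication is satisfied unless its body is true and its head is false.
        if not all(world[atom] for atom in body) or world[head]:
            total += weight
    return total


def gibbs_marginals(num_samples=20000, burn_in=1000, seed=0):
    """Estimate P(atom is true) for each query atom by Gibbs sampling."""
    rng = random.Random(seed)
    world = dict(EVIDENCE)
    for atom in QUERY:
        world[atom] = rng.random() < 0.5
    counts = {atom: 0 for atom in QUERY}
    for step in range(burn_in + num_samples):
        for atom in QUERY:
            # Resample one atom conditioned on the current value of all others.
            world[atom] = True
            score_true = score(world)
            world[atom] = False
            score_false = score(world)
            p_true = 1.0 / (1.0 + math.exp(score_false - score_true))
            world[atom] = rng.random() < p_true
        if step >= burn_in:
            for atom in QUERY:
                counts[atom] += world[atom]
    return {atom: counts[atom] / num_samples for atom in QUERY}


if __name__ == "__main__":
    for atom, prob in gibbs_marginals().items():
        print(f"{atom}: {prob:.2f}")
```

With only two unknown atoms, the marginals can also be computed exactly by enumerating the four possible worlds, which gives roughly 0.83 and 0.69 for these made-up weights; the sampler's estimates should land close to those values. Sampling becomes worthwhile when the number of unknown atoms makes enumeration infeasible.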

In addition to developing technologies important to Google Research, the DSR Lab also participated in NIST’s TREC competition this year on Knowledge Base Acceleration. The problem is that it takes one year on average for a fact (e.g., the spouse or occupation of a person, the location of a company) to be updated in Wikipedia after it has changed. The challenge is to update Wikipedia as new information streams in, in the form of text from news, tweets, and blogs. The team from the UF DSR Lab processed 5 terabytes of compressed text data from the web and social media in order to provide updates to 13 attributes of 170 entities specified by the competition. The techniques used include large-scale parallel stream processing, keyword search engines, named entity extraction, relation extraction, and cross-document co-reference. The models used include Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs).

Dr. Wang’s group also looks at (1) using crowd-sourcing as a way to combine the power of human intelligence and machine intelligence to improve the quality of a probabilistic knowledge base, (2) generating visualizations for search over knowledge bases, and (3) translating text analysis and knowledge base construction techniques to image analysis and retrieval.

Finally, the DSR Lab is working on applying the systems and algorithms built in this core probabilistic knowledge base research to solve Big Data challenges in other domains, including health informatics, education, law enforcement, and ecology research. Early research results obtained by the DSR Lab in collaboration with UF Health show that better outcome prediction models can be built using the knowledge extracted from text data in Electronic Health Records (EHRs).