Big data analytics has become an important and challenging problem in disciplines such as industrial engineering, computer science, biology, and medicine. As massive amounts of data become available for analysis, scalable integration techniques and knowledge bases are growing in importance. For example, the Google search engine integrates knowledge from various sources to provide direct answers to users. At the same time, integration has an adverse side effect: new privacy issues arise, because one's sensitive information can easily be inferred from large amounts of data.

In my talk, I will first focus on the problem of entity resolution (ER), which identifies database records that refer to the same real-world entity. In practice, ER is not a one-time process; it is continually refined as the data, schema, and application are better understood. I will address the problem of keeping the ER result up-to-date when the ER logic "evolves" frequently. A naive approach that re-runs ER from scratch may be too slow for large datasets. I will show when and how we can instead exploit previously "materialized" ER results to avoid redundant work when running the evolved logic. I will also briefly explain how I recently used crowdsourcing techniques to enhance ER.
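To make the ER setting concrete, here is a minimal sketch; the record data, similarity measure, and threshold are my own illustrative assumptions, not the algorithms from the talk. Two records "match" when the Jaccard similarity of their normalized token sets exceeds a threshold, and matched records are grouped into entities with union-find:

```python
def tokens(s):
    # Lowercase and strip punctuation before splitting into tokens.
    return set("".join(c if c.isalnum() else " " for c in s.lower()).split())

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def resolve(records, threshold=0.5):
    parent = list(range(len(records)))       # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)    # union the matched records

    clusters = {}
    for i, r in enumerate(records):
        clusters.setdefault(find(i), []).append(r)
    return list(clusters.values())

records = ["International Business Machines",
           "International Business Machines Corp.",
           "Apple Inc", "Apple Inc."]
print(resolve(records))  # two resolved entities: the IBM variants, the Apple variants
```

Note that if the matching logic "evolves", say the threshold is lowered, re-running `resolve` from scratch repeats every pairwise comparison; the talk's point is that previously materialized results can often be reused instead.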

Next, I will briefly introduce my work on managing information leakage, where the goal is to prevent sensitive pieces of information from being linked together by ER, thereby preserving data privacy. As more of our sensitive data is exposed to a variety of merchants, health care providers, employers, social sites, and so on, there is a higher chance that an adversary can "connect the dots" and piece together our information, leading to even greater loss of privacy. I will explain our information leakage model and propose "disinformation" as a tool for reducing information leakage.
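The intuition behind disinformation can be shown with a toy model of my own; it is not the talk's formal leakage model. Suppose an adversary links a quasi-identifier (here, a ZIP code) to candidate records and, lacking other signals, is confident in inverse proportion to the number of consistent candidates. Planting plausible decoy records then dilutes that confidence:

```python
def adversary_confidence(records, quasi_id):
    # Candidates consistent with the observed quasi-identifier.
    candidates = [r for r in records if r["zip"] == quasi_id]
    # Toy assumption: the adversary picks uniformly among candidates.
    return 1 / len(candidates) if candidates else 0.0

real = [{"name": "Alice", "zip": "94305"}]
print(adversary_confidence(real, "94305"))           # 1.0: unique match

decoys = [{"name": f"Decoy{i}", "zip": "94305"} for i in range(3)]
print(adversary_confidence(real + decoys, "94305"))  # 0.25 after disinformation
```

The design question, which the talk addresses with a proper model, is where and how much disinformation to plant so that the adversary's linking fails while legitimate uses of the data survive.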

Finally, I will talk about how knowledge bases help search engines understand data, and I will describe a new ontology called Biperpedia, which was developed at Google Research and is specialized for search applications. For example, given the query "brazil coffee production 2016," a search engine can use Biperpedia to understand that the user is asking for a numeric attribute (coffee production) of the country Brazil in the year 2016. I will also show how Biperpedia attributes can be used to find "latent" subsumption relationships among concepts on the Web. For example, although the concept "coffee shop" is not strictly subsumed by the concept "restaurant", we can use attributes to infer that Web users "consider" coffee shops as if they were restaurants.
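The query-understanding step can be sketched in miniature; the entity and attribute dictionaries below are made up for illustration and are not Biperpedia's data or API. The idea is simply to segment the query into a known entity, a known attribute, and a year:

```python
import re

# Hypothetical lookup tables standing in for an entity catalog and an
# attribute ontology such as Biperpedia.
ENTITIES = {"brazil": "Country:Brazil", "france": "Country:France"}
ATTRIBUTES = {"coffee production", "population", "gdp"}

def parse_query(query):
    words = query.lower().split()
    year = next((w for w in words if re.fullmatch(r"(19|20)\d{2}", w)), None)
    entity = next((ENTITIES[w] for w in words if w in ENTITIES), None)
    rest = " ".join(w for w in words if w not in ENTITIES and w != year)
    attribute = rest if rest in ATTRIBUTES else None
    return {"entity": entity, "attribute": attribute, "year": year}

print(parse_query("brazil coffee production 2016"))
# {'entity': 'Country:Brazil', 'attribute': 'coffee production', 'year': '2016'}
```

With such a parse in hand, the engine can answer directly from structured data rather than returning ten blue links; attribute sets per concept can likewise be compared to surface the "latent" subsumptions mentioned above.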