I'm an associate professor in the InfoLab, affiliated with DAWN, the Statistical Machine Learning Group, PPL, and SAIL (bio). I work on the foundations of the next generation of data analytics systems. These systems extend ideas from databases, machine learning, and theory, and our group is active in all three areas. One application of our work is making it dramatically easier to build machine learning systems that process dark data, including text, images, and video. Our latest project is Snorkel; our code is on github, and there are blog posts about our work. By pushing the limits of weak supervision and data augmentation, we hope to make it radically easier to build machine learning systems and to deepen our understanding of machine learning's underpinnings.

The DeepDive (one pager) project was commercialized as Lattice. As of 2017, Lattice is part of Apple. A messy, incomplete log of old updates is here.

Christopher (Chris) Ré is an associate professor in the Department of Computer Science at Stanford University. He is a member of the InfoLab and is affiliated with the Statistical Machine Learning Group, the Pervasive Parallelism Lab, and the Stanford AI Lab. The goal of his work is to enable users and developers to build applications that more deeply understand and exploit data. His contributions span database theory, database systems, and machine learning, and his work has won best-paper awards at a premier venue in each area: PODS 2012, SIGMOD 2014, and ICML 2016. In addition, work from his group has been incorporated into major scientific and humanitarian efforts, including the IceCube neutrino detector, PaleoDeepDive, and MEMEX in the fight against human trafficking, as well as into commercial products from major web and enterprise companies. He cofounded a company based on his research that was acquired by Apple in 2017. He received a SIGMOD Dissertation Award in 2010, an NSF CAREER Award in 2011, an Alfred P. Sloan Fellowship in 2013, a Moore Data-Driven Discovery Investigator Award in 2014, the VLDB Early Career Award in 2015, a MacArthur Foundation Fellowship in 2015, and an Okawa Research Grant in 2016.

(1) DeepDive is a new type of system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing data systems. DeepDive's most popular use case is to transform the dark data of web pages, PDFs, and other databases into rich SQL-style databases. In turn, these databases can be used to support both SQL-style and predictive analytics. Recently, some DeepDive-based applications have exceeded human volunteer annotators in both precision and recall on complex scientific articles. Data produced by DeepDive is used by several law enforcement agencies and NGOs to fight human trafficking. The technical core of DeepDive is an engine that combines extraction, integration, and prediction, with probabilistic inference as its core operation. A one pager with key design highlights is here. PaleoDeepDive is featured in the July 2015 issue of Nature.
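To make the combination of SQL-style and predictive analytics concrete, here is a minimal sketch in Python. The relation name, schema, rows, and probabilities are hypothetical illustrations, not DeepDive's actual output format; but DeepDive-produced relations do attach a marginal probability to each extracted fact, which can be filtered and aggregated in this spirit.

```python
from collections import defaultdict

# Hypothetical extracted facts: (doc_id, gene, phenotype, probability).
# Rows and probabilities are made up for illustration.
gene_phenotype = [
    ("doc:101", "BRCA1", "breast cancer",  0.97),
    ("doc:102", "BRCA1", "ovarian cancer", 0.91),
    ("doc:103", "TP53",  "breast cancer",  0.42),
]

# SQL-style analytics: keep only high-confidence facts, as in
#   SELECT gene, phenotype FROM gene_phenotype WHERE probability > 0.9;
confident = [(g, p) for _, g, p, pr in gene_phenotype if pr > 0.9]

# Predictive analytics: expected support per (gene, phenotype) pair,
# summing probabilities instead of counting rows.
expected_support = defaultdict(float)
for _, g, p, pr in gene_phenotype:
    expected_support[(g, p)] += pr

print(confident)  # [('BRCA1', 'breast cancer'), ('BRCA1', 'ovarian cancer')]
print(dict(expected_support))
```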

DeepDive is our attempt to understand a new type of database system. Our new approach can be summarized as follows: the data, the output of various tools, and the input from users (including the program the developer writes) are all observations from which the system statistically infers the answer. This view is a radical departure from traditional data processing systems, which assume that the data is one hundred percent correct. A key problem in DeepDive is that the system needs to consider many possible interpretations of each data item. In turn, we need to explore a huge number of combinations during probabilistic inference, which is one of the core technical challenges. Our goal is to acquire more sources of data for DeepDive to understand more deeply, and thereby to change the way that science and industry operate.
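To see why the combinations blow up, here is a toy sketch of marginal inference by brute-force enumeration over a tiny factor graph. The variables, factors, and weights are all made up for illustration; this is not DeepDive's actual inference engine, which must scale far beyond enumerating every assignment.

```python
from itertools import product

# Two boolean "interpretation" variables x1, x2; each factor scores a
# joint assignment (weights are hypothetical).
def f_extractor(x1):     # an extractor weakly believes x1 holds
    return 2.0 if x1 else 1.0

def f_rule(x1, x2):      # a developer-written rule correlates x1 and x2
    return 3.0 if x1 == x2 else 1.0

def marginal(i):
    """P(x_i = True), summing factor products over all 2^n assignments."""
    num = den = 0.0
    for x in product([False, True], repeat=2):
        w = f_extractor(x[0]) * f_rule(x[0], x[1])
        den += w
        if x[i]:
            num += w
    return num / den

print(marginal(0))  # 8/12: probability that interpretation x1 is correct
```

With n correlated variables the sum ranges over 2^n assignments, which is why scalable probabilistic inference is the core operation of the system.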

(2) Fundamentals of Data Processing. Almost all data processing systems have their intellectual roots in first-order logic. The most computationally expensive (and most interesting) operation in such systems is the relational join. Recently, I helped discover the first join algorithm with optimal worst-case running time. This result uses a novel connection between logic, combinatorics, and geometry. We are using this connection to develop new attacks on classical problems in listing patterns in graphs and in statistical inference. Two themes have emerged:

The first theme is that these new worst-case-optimal algorithms are fundamentally different from the algorithms used in (most of) today's data processing systems. Although our algorithm is optimal in the worst case, commercial relational database engines have been tuned by smart people to work well on real data sets for about four decades. And so a difficult question is: how does one translate these insights into real data processing systems?
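To give the flavor of the difference, here is a minimal sketch in Python of a generic-join-style evaluation of the triangle query Q(a, b, c) :- R(a, b), S(b, c), T(a, c), on hypothetical toy relations. Rather than joining two relations at a time, it proceeds one attribute at a time, intersecting the candidate values from every relation that mentions that attribute; the full worst-case-optimality analysis additionally requires each intersection to run in time proportional to its smaller argument, which Python's set intersection roughly provides.

```python
from collections import defaultdict

def triangles(R, S, T):
    # Index each relation by its first attribute.
    R_a, S_b, T_a = defaultdict(set), defaultdict(set), defaultdict(set)
    for a, b in R:
        R_a[a].add(b)
    for b, c in S:
        S_b[b].add(c)
    for a, c in T:
        T_a[a].add(c)

    out = []
    S_bs = set(S_b)                    # all b-values appearing in S
    for a in set(R_a) & set(T_a):      # attribute a: constrained by R and T
        for b in R_a[a] & S_bs:        # attribute b: constrained by R and S
            for c in S_b[b] & T_a[a]:  # attribute c: constrained by S and T
                out.append((a, b, c))
    return out

R = [(1, 2), (1, 3)]
S = [(2, 3), (3, 1)]
T = [(1, 3)]
print(triangles(R, S, T))  # [(1, 2, 3)]
```

A traditional pairwise plan, say (R join S) join T, can materialize an intermediate result far larger than the final output; the attribute-at-a-time order avoids that blowup, which is the source of the worst-case guarantee.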

The second theme is that we may need new techniques to get theoretical results strong enough to guide practice. As a result, I've started thinking about "beyond worst-case analysis" and ideas like conditioning for combinatorial problems, in the hope of building theory that can inform practice to a greater extent. The first papers have just been posted.

Demos, Examples, and Papers.

Worst-case Optimal Joins. We have posted a survey in SIGMOD Record about recent advances in join algorithms. Our goal is to give a high-level view of the results for practitioners and applied researchers. We also managed to simplify the arguments. A full version of our join algorithm with worst-case optimal running time is here. The LogicBlox guys have their own commercial worst-case optimal algorithm. Our new system, EmptyHeaded, is based on this theory.

Beyond Worst-case Joins. This work is our attempt to go beyond worst-case analysis for join algorithms. We (with Dung Nguyen) developed a new algorithm, called Minesweeper, based on these ideas. The main theoretical idea is to formalize the amount of work any algorithm spends certifying (using a set of propositional statements) that the output set is complete (and not, say, a proper subset). We call this set of propositions the certificate. We manage to establish a dichotomy theorem for this stronger notion of complexity: if a query is what Ron Fagin calls beta-acyclic, then Minesweeper runs in time linear in the certificate; if a query is beta-cyclic, then on some instance any algorithm takes time that is superlinear in the certificate. The results get sharper and more fun from there.
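Schematically, the dichotomy above can be written as follows. This is my paraphrase of the prose, not the paper's exact statement: C denotes the certificate for the instance, Z the output size (which any algorithm must pay just to emit the output), and the tilde hides factors, such as polylogs and query-size terms, that depend on the precise model in the paper.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Schematic dichotomy (paraphrase of the prose above; see the paper
% for the exact model of computation and the hidden factors).
\[
  Q \text{ is } \beta\text{-acyclic}
  \;\Longrightarrow\;
  \text{Minesweeper runs in time } \tilde{O}(|C| + Z);
\]
\[
  Q \text{ is } \beta\text{-cyclic}
  \;\Longrightarrow\;
  \text{on some instance, any algorithm takes time } \omega(|C|).
\]
\end{document}
```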

Almost one algorithm to rule them all? We have a much better description of beyond-worst-case optimality, with a resolution framework and a host of new results for different indexing strategies. This paper supersedes many of the results in Minesweeper, and in a much nicer way! We also hope to connect more of geometry and resolution... but we'll see!

A first part of our attack on conditioning for combinatorial problems appeared at NIPS and is on arXiv.

It is not difficult to get me interested in a theory problem. Ask around the InfoLab if you don't believe me.

Our goal is to understand the fundamentals of data processing systems.

Our course material from CS145, our intro databases course, is here, and we'll continue to update it throughout the year. We're aware of a handful of courses that are using these materials; drop us a note if you do!