The Last Decade of Database Research and Its Blindingly Bright Future, or, Database Research: A Love Song

by Michael Cafarella and Chris Ré, 11 Apr 2018

To go by Twitter and many hallway conversations, the database research
community has been unsettled lately in a way that we have never seen
before. Many people are unhappy with the review process, many types of
useful work seem to be more difficult to pursue, and our relationship
with adjacent fields such as machine learning is unclear. Turing Award
winner – and giant of the field – Mike Stonebraker made some (though
not all) of these points in a recent talk that, like everything Mike
says, is worth taking seriously.

All of these points of view have merit and deserve consideration. But we
think it is worth reflecting on a different viewpoint.

Data management has had an impact that has surpassed our
wildest dreams, and it is arguably the most exciting time for data
management research. Ever.

The Hallowed Recent Past (The Enormous Burrito)

What has happened in data management in the last ten years? Try this on
for size:

Structured data in billions of pockets. The iPhone came out
in 2007. Every iPhone and Android device — billions of them
— has an SQL engine in it.
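That engine is SQLite, and its appeal is easy to demonstrate: a complete SQL database with no server process, embeddable in any app. A minimal sketch of the kind of structured storage every phone app gets for free (the contacts table is a made-up example):

```python
import sqlite3

# An in-memory database; on a phone this would be a file in app storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("Ada", "555-0100"), ("Grace", "555-0101")],
)
rows = conn.execute(
    "SELECT name FROM contacts ORDER BY name"
).fetchall()
print([r[0] for r in rows])  # a full SQL engine, no server required
```

The same engine, with the same SQL surface, sits underneath Core Data on iOS and the Android platform APIs.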

Hadoop, Spark, and other open source triumphs. The first Hadoop
Summit was in 2008. Now Hadoop powers Facebook, Twitter, and (we
think!) the NSA, and has generated $3B+ in market cap from Cloudera
and Hortonworks.
The terrific Spark and SparkSQL projects have had huge impact.
Hadoop and Spark aren’t the end of it. According to
http://projects.apache.org/statistics.html,
8 of the 10 busiest (by number of commits) Apache open source
projects in the past year are data-oriented: Ambari, Ignite,
Hadoop, Beam, HBase, Flink, Lucene-solr, and Spark. Spark and
Flink even have founding members from the DB research community.
Some might object that these projects are not from the SIGMOD
community per se; we would answer that they embody many (not all!)
of our community’s ideas. We should pitch a big tent around as
much of data management as possible.

Information extraction from intellectual project to mainstream.
In 2008, information extraction was a weird corner of AI and
database conferences. The database community took a huge
leadership role on this topic, with Yago, WebTables, DeepDive, and
many more systems. The technology advanced enough to allow the
authors of this post to found Lattice Data, which was purchased by
Apple last year.

Cloud and infrastructure. DB people run or are influential in
many groups at data-centric companies, including Google, Microsoft
(including Office and the cloud), Twitter, Amazon, and many more!

Analytics go mainstream. OLAP used to be an obscure database
research topic, and an add-on for certain Oracle products. Now
Actian Vector (VectorWise) and MonetDB are high-quality analytics systems, Tableau
is worth $6.5B, Facebook and Google are unimaginable without
analytics, and these analytics may have the power to shape
democracy [1].

The Juicy Middle

Second, the field hasn’t just rested on its laurels, exploiting past
discoveries. The last decade of database research has made progress on a
lot of hard intellectual problems that underpin our technical world:

Approximate query answering

Data management for machine learning primitives

Distributed RDBMSes (with transactional guarantees) on huge
clusters

Transaction processing in peer-to-peer (Blockchain)

Improved models of data privacy

New and asymptotically improved algorithms for graph and
relational querying, and for parallel query processing [2]

That’s not even including all the interesting data work going on in
machine learning and visualization conferences. It’s not in SIGMOD, but
it’s very relevant to us. It’s a good thing we have connections to
other relevant fields. It’s amazing that the field has obtained truly
international reach as industry and research roar together in the US, China,
Europe, the Middle East, and the world. We should be proud of our contributions and
thrilled that we are able to contribute to the most exciting problems of
the day.

The Blindingly Bright Future

That said, many of the points we hear about – bad paper reviews,
projects that are worthy but hard to pursue, too many papers – have a
grain of truth. In many cases, reviews are not what they should be. It
is true that we are no longer the only game in town for data management.
There are more conferences, more intellectual threads, and our big claim
to fame – the RDBMS – is now a much smaller fraction of the data
management systems picture. Maybe we have to pick and choose our
intellectual agenda more carefully than we used to, in order to make an
impact. And, yes, it’s harder to build (and get funding for) really big
software projects than it used to be. These are all problems, but most
are also symptoms of data management’s incredible ongoing success.

All in all, the horizons and opportunities for data management are far
broader and more exciting than they were 10 (or 20 or 30 or 40) years
ago. Our broader field is at the forefront of many problems. Here is a
woefully incomplete list of threads that we think are insanely exciting
(apologies for inevitably missing so many other threads!):

The golden age of ML data management is upon us, both
intellectually and in terms of commercial investment.

Programming is changing. The world builds ML into every data
product, but has essentially no compiler and debugging
infrastructure for it. Projects like
Snorkel
fundamentally re-examine how to program the ML stack.
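To make the point concrete, here is a minimal weak-supervision sketch in plain Python: labeling functions vote on unlabeled examples, and their noisy votes are combined (here by simple majority). The functions and combiner are hypothetical illustrations, not Snorkel’s actual API, which models labeling-function accuracies statistically rather than taking a raw majority:

```python
# Weak supervision sketch: each labeling function votes SPAM (1),
# NOT_SPAM (0), or ABSTAIN (-1); votes are combined by majority.
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    # Hypothetical heuristic: links often indicate spam.
    return SPAM if "http://" in text else ABSTAIN

def lf_short_message(text):
    # Hypothetical heuristic: very short messages are usually benign.
    return NOT_SPAM if len(text) < 20 else ABSTAIN

LFS = [lf_contains_link, lf_short_message]

def weak_label(text):
    votes = [lf(text) for lf in LFS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

print(weak_label("win $$$ at http://spam.example"))  # 1
```

The "program" here is the set of labeling functions: training data is generated, not hand-labeled, which is exactly the shift in the ML programming stack the paragraph describes.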

Hardware is changing the core of data processing. Consider
projects like Quickstep, the use of FPGAs in
data processing,
rethought query architectures like
HyPer, and massively influential projects
like column-store pioneer
MonetDB.

Data cleaning has been a hugely important problem, with great
progress from companies like Tamr and Trifacta, and from research
projects like BoostClean and
HoloClean, among many more. Much of this
work is built on methods for managing uncertainty from Lise
Getoor and others.
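One flavor of uncertainty-aware cleaning can be caricatured in a few lines: resolving conflicting values for the same fact by reliability-weighted voting over sources. The data and weights below are invented for illustration; systems like HoloClean learn such weights statistically rather than assuming them:

```python
from collections import defaultdict

# Conflicting reports of a city's population from sources of
# varying (assumed) reliability; pick the value with the most
# reliability-weighted support.
observations = [
    ("census", 8_400_000, 0.9),
    ("blog", 8_000_000, 0.2),
    ("blog2", 8_000_000, 0.2),
]

def resolve(obs):
    support = defaultdict(float)
    for _source, value, weight in obs:
        support[value] += weight
    return max(support, key=support.get)

print(resolve(observations))  # 8400000: census outweighs two blogs
```

The hard research questions begin where this sketch stops: estimating source reliability, handling correlated sources, and propagating the residual uncertainty downstream.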

Data science is an organizing principle at many campuses, with
impacts on almost every aspect of society, including:

Database people have leadership roles in new data-centric
institutions. The Moore-Sloan centers are spearheaded by core DB
people in the e-Science institute at UW (like Bill
Howe) and at
NYU (like Juliana Freire). The initiative at UChicago is run by Mike
Franklin,
who also cofounded the Berkeley initiative. Hector Garcia-Molina
and Chris were among the cofounders of the Stanford Data Science
Initiative. Internationally, QCRI is led by Ahmed
Elmagarmid.

We may not have the same level of ownership over these topics as we did
over the RDBMS [3], but the chance for our ideas to have impact is
manyfold greater than it has ever been. It is thrilling to participate in
such a wide range of societal-scale problems.

A Different Take on the Challenges Facing the Field

We have some problems that are “good problems to have,” but are still
problems. The field can always improve, and in particular, we agree with
many of Mike Stonebraker’s concerns–the guy knows his stuff. We think
an effective mindset when considering these problems is: how can we
continue to attract the best minds on the planet to our area, and how
can we build a community that allows those people to do their best work?
Here are some ideas:

The paper-as-a-prize model is broken. We agree with Mike: no
paper counting! But that’s not because we want people to save up
their paper writing for a few large tomes. Conference papers
should be a way to share progress on shared endeavors, not a
reward at the finish line.

a. LPUs aren’t the issue. Surajit Chaudhuri once persuasively
argued that we need papers closer to LPUs (least publishable
units) to share progress more rapidly. We agree! The field is
larger, and frequent structured communication is a good way to
disseminate good ideas efficiently and quickly. We should look
for ways to disseminate good ideas even more quickly, perhaps
by encouraging different paper lengths, by having paper
deadlines immediately prior to conferences, or by creating more
high-visibility venues (a la CIDR) outside the summertime
conference season.

b. It’s true there’s no single center, and papers are much harder
to track. That’s not because people are bad or lazy, but
because the world is bigger and better. This is one reason why
reviews have probably gone down in quality, despite many
quality controls around the review-writing process itself:
shared related work is declining.
More focused subtracks are one option. We should also consider
simply admitting a lot more papers, and recognizing that the
average paper’s fit-and-finish will go down, the career value
of a paper acceptance will go down, but the stress of a single
publication decision will also go down, and the average utility
to the reader will go up. Papers will become less like a
high-stakes grant application, and more like a careful note to
peers [4]. This path seems better than the current cycle of
violence, and more practical than reducing the number of
papers (thereby forcing the career value of a single
acceptance even higher).

Projects should live fully, then die explicitly. If we are
reconciled to a world of lots of papers, researchers can at least do
everyone a favor and intellectually organize their efforts into
a small number of projects. A project should have an online
presence that supports the goal of effective peer communication and
shared progress. A few ideas on what that could include:

For systems, a downloadable VM that allows someone to test-drive it with a minimum of fuss

A Viking funeral when there are no more updates coming

These ideas are hardly breakthroughs, but they are observed
only haphazardly today (including by the authors of this post!), and they
would help enormously.

We have done a better job than most fields at recognizing impact
via software and startups, not just papers. We have started doing
that for reproducing results. Let’s extend that same generosity to
datasets, models, and data science findings. Building the RDBMS
provided fantastic focus for the community, and room for people to
contribute via software, often commercialized via startups. It’s a
tradition we can be proud of. But many intrinsically interesting
(if currently zero-billion-dollar) problems don’t have much of a TAM
(total addressable market) and don’t lend themselves to successful
startups. The community
has taken steps in the right direction when it comes to
recognizing reproducible results, with appropriate awards and an
almost-standard practice of open-sourcing experimental code. We
can follow the same playbook and recognize a new class of
interesting problems by taking data science outputs seriously. We
should consider adding SIGMOD awards to recognize the best
dataset, the best data science analysis, and so on.
This is the right thing to do intellectually, but it’s also
pragmatic. Zero-billion-dollar problems don’t always stay that
way.

We need both theory and systems work. Theory publications shouldn’t come at the expense of systems-centric ones, but neither should we miss out on theoretical advances. We need them! A lot of critical data management topics – including data privacy, machine learning, and data cleaning – remain relatively poorly understood. These curiosity-driven investigations have a way of attracting the best minds and opening new vistas to explore; more pragmatically, the right theory will make the systems sing. One cannot imagine building privacy or ML tools without at least basic guidance from theory.

Let’s define our field by intellectual challenges, not tools. We
should not focus on the RDBMS binary, but on the ideas it contains:
the world’s first massively successful DSL, one of the maybe four
data models that ever saw wide adoption [5], optimizations,
transactions, recovery, and more. We may well miss the next grand
challenge if we insist that database research involve the binary
or all of the previous ideas.

We should pitch a big tent. Data management is huge and
exciting, and we have a better shot than anyone else at making
contributions. We should avoid the tendency to argue about what
really constitutes data management research. It is worth noting
that the machine learning community is massive, has huge impact,
and is not homogeneous at all. OSDI and NSDI have also pitched
big tents. If we enforce a purity test, we will alienate the
brightest young minds on the planet.

This is in some ways the golden age of data management – but it is also
fraught with real and public risks. Let’s take our task seriously, do
the hard work to change the world for the better with data, and have a
good time doing it.

Thanks to many other unnamed folks who helped read and contribute to this post!

1. Having a lot of impact doesn’t mean the impact will always be
positive. This is something for us to work on. ↩

2. See the pioneering work of
Ngo and others at
LogicBlox, Ullman’s work on MapReduce, or Suciu, Koutris, and
Salihoglu’s new
book. ↩

3. Though Larry Ellison might disagree with you about how much
ownership we truly had. ↩

4. In economics, the bar for journal publication has become so high
– publication of an article can take years – that it has ceased
to function as an effective method for rapid peer communication.
Instead, researchers share (privately and publicly)
non-peer-reviewed paper drafts, with the expectation that the drafts
will be revised and improved. It’s great! ↩