Meteor, Volt, and webalchemy frameworks run on top of databases, so they’re ultimately constrained by the realtime functionality and scalability of existing database systems. We’ve been collaborating with the Meteor team to ensure our design will work well with these and other similar projects.

But this says nothing about data driven backends, which are no less “forward-thinking” platforms than Meteor.

VoltDB has push functionality as well.

Do we need new databases to support the growth of data science?

DevOps engineers and data scientists translate data from a dashboard or Tableau into value by proposing business actions inducted from that data.

The more integrity that data has, the more value the data scientists produce.

Data integrity is a function of the speed of that data pipeline and the structure of it. The presentation layer (dashboard or Tableau) can render the data with more fidelity if the data has better in-database shape.

The databases covered this week exemplified the above in several ways.

How does the movement from batch to streaming affect database architecture?

While MemSQL and Hadoop are both data stores, they fill different roles in the data processing and analytics stack. The Hadoop Distributed File System (HDFS) enables businesses to store large volumes of immutable data, but by design, it is used almost exclusively for batch processing. Moreover, newer execution frameworks, that are faster and storage agonistic, are challenging MapReduce as businesses’ batch processing interface of choice.

Lambda Architecture

A number of MemSQL customers have implemented systems using the Lambda Architecture (LA). LA is a common design pattern for stream-based workloads where the hot, recent data requires fast updates and analytics, while also maintaining long-term history on cheaper storage. Using MemSQL as the real-time path and HDFS as the historical path has been a winning combination for many companies. MemSQL serves as a real-time analytics serving layer, ingesting and processing millions of streaming data points a second. MemSQL gives analysts immediate access to operational data via SQL. Long-term analytics and longer running, batch-oriented workflows are pushed to Hadoop.

The VoltDB blog discusses the implementation of a batch + streaming data pipeline.

Are databases only ever written in highly performant languages like Rust, Go, and C++?

Garbage collection is both a blessing and a curse.

InfluxDB is written in Go, and leverages GC. RethinkDB is written in C++ partially to avoid GC. Neo4j is written in Java.

What are the biggest database pain points for high throughput, data driven application developers?

A recent talk from a Twitter engineer described a problem his team solves with Storm. They count unique users of mobile apps for given time periods. The tricky part is the volume: reportedly 800,000 messages per second, making this a poor fit for more traditional systems. The stack they use involves Kafka, Storm, Cassandra, and of course, ZooKeeper.

We can assume ZooKeeper, Kafka, Storm and Cassandra all use at least three nodes each to run with reasonable safety. Three is a bit of a magic number in distributed systems; two node clusters have a harder time agreeing on the state of the cluster in failure scenarios. So now we’re operating and monitoring four systems, at least twelve nodes, and the interops/glue between all of the above. If a network between two Cassandra nodes fails, are the symptoms the same as if the network between Storm nodes failed, or ZooKeeper nodes failed? Each of these systems has different failure semantics, and can cause different symptoms or cascading failures in other parts of the stack.

While four systems wouldn’t be odd in an HDFS-based batch system, the crucial difference here is that user data ownership is passed between systems, rather than being wholly the responsibility of HDFS. Sometimes state only has meaning if data from multiple systems are combined. Development and testing stacks with four systems isn’t 33% harder than stacks with three systems either; it can be as much as four times as hard, depending on how the systems connect with each other.