Saturday, 13 May 2017

Smart Memory: How tight will the timing loop need to be?

Smart memory systems that integrate scalable data capture, computation and storage with machine-learning technologies represent an exciting opportunity area for research and product development. But where should a smart memory live in today's data center computing "stack"?

Cornell's work on Derecho has focused on two design points. One centers on Derecho itself, which is offered as a C++ software library. Derecho is blazingly fast, both in throughput for data capture and replication/storage and in latency, but it can only be used by C++ programmers: unlike many C++ libraries that can easily be loaded from languages like Java or Python, modern C++ has features that currently have no direct mappings into what those languages can support (variadic template expansion, constant-expression evaluation, conditional compile-time logic), so it will be a while before this kind of very modern C++ library is accessible from other languages.

The other main design point has simply been a file system API: POSIX plus snapshots. Here Derecho incorporates an idea from our Freeze Frame File System: snapshots are generated lazily, so there is no "overhead" for having a snapshot and no need to preplan them or anything of that kind, and the snapshots are temporally precise and causally consistent.

Today, Derecho v1 (the version on GitHub) lacks this integration, but fairly soon we will upload a Derecho v2 that saves data into a novel file system that applications can access through these normal file system APIs. One can think of the snapshots as read-only views frozen in time, with very high temporal accuracy. Of course the output of programs running against a snapshot would be files too, but they wouldn't be written back into the snapshot itself: they would go to some other, read/write file system.

This leads to a model in which one builds a specialized data capture service that receives incoming sensor data (which could include video streams or other large objects) and computes on it: compressing, deduplicating, segmenting/tagging, discarding uninteresting data, and so forth. The high-value content is then replicated and indexed. Concurrently, machine learning algorithms would run on snapshots of the file representation, letting the user leverage today's powerful "off the shelf" solutions by running them on a series of temporal snapshots at fine-grained resolution: our snapshots can be as precise as you like, so there is no problem accessing data even at sub-millisecond resolution (although in the limit, the quality of your GPS time sources does impose limitations).

The puzzle is whether this will work well enough to cover the large class of real-time applications that must respond continuously as events occur. In the anticipated Derecho smart warehouse the performance-critical path is roughly this: data is received, processed and stored; then the machine learning applications can process it, at which point they can initiate action. The key lag is the delay between receipt of data and stability of the file system snapshot: at least a few milliseconds, although this will depend on the size of the data and of the computed artifacts. One can imagine making this path quite fast, but even so, the round-trip cycle time before an action can occur in the real world could easily grow to tens of milliseconds or more. We plan to do experiments to see how tight a round-trip response time we can guarantee.

The alternative would be to move the machine learning into the C++ service itself, but this would require coding new ML applications directly in C++. I assume that would be a large undertaking, simply given how much existing technology is out there that would have to be duplicated. The benefit, though, is that we could then imagine response times measured in milliseconds, not tens or hundreds of milliseconds.

I would be curious to hear about known real-time data warehousing use cases, and the round-trip response times they require.