A Fast Method to Stream Data from Big Data Sources

Dan Kuster
March 14, 2016

Maybe you’re training a machine learning model on a really big dataset. Perhaps you’ve got a big database dump and you want to extract some information. Or maybe you’re crawling web scrapes or mining text files. Modern computers are really quite powerful for processing streams of data. You shouldn’t have to resort to a Hadoop cluster just to process data you want to use locally. There has to be a better way, right?

Why yes, there is! You can:

- Increase sequential processing speed enough to make it feasible on a single machine (i.e., speed hacks).

- Introduce an abstraction to decouple processing speed from the size of your data.

Let’s do both. In this post, we’ll show you how to sample lines from big data sources, out-of-core, as efficiently as possible on a laptop or workstation. No MapReduce required!

Why should you care?

If you process datasets in sequential batches (e.g., using spreadsheet programs like Excel), you can do much better. For example, the maximum batch size for Excel is roughly 1 million records/rows. We routinely process datasets that are more than 5 orders of magnitude larger, at throughputs exceeding 1M records per second.

- You’ve already spent a lot of effort optimizing your machine learning models to train as fast as possible, and now you want to scale up to more data? Let’s rid your pipeline of sequence bias and make sure your models aren’t waiting on disk I/O.

- Using UNIX command-line tools like cat, head, tail, sed, awk, grep as filters in a data processing pipeline? Those tools traverse a file sequentially; you can do better.

- Sorting and hashing things? Computing hashes seems fast, until you do it billions of times…

- If you are me, in the future, and you’ve searched for this post online because, hey, it beats remembering everything… scroll down for the code!

This post has two parts.

If you’re here for “the answer” or the first two bullets resonate, try Part #1 first, and come back for Part #2 when you need more scale or speed! If you have already optimized your data pipeline and are looking for new tricks — and the latter bullets sound familiar — I recommend scanning Part #1, then digging into Part #2.

And if you’re me…


Part #1

Don’t have an efficient data processing pipeline? This is the place to start. There is no downside to improving performance…but incremental improvements probably aren’t enough to help you scale several orders of magnitude.

Goal:

Write a simple reusable module that streams records efficiently from an arbitrarily large data source. Something like Python’s for line in file: idiom.

Requirement: Works on any delimited, serializable data type

So that streaming data is not the slow/limiting step for any downstream process.

Requirement: Modular

So it is easy to put in a Python module or class, and reuse everywhere with predictable results.

Solution #1

Let’s start with the straightforward pythonic way to read a sequence of records from a file:

def get_data(input_filename, delimiter=b','):
    with open(input_filename, 'rb') as f:  # binary mode handles arbitrary data
        for record in f:                   # traverse sequentially through the file
            x = record.split(delimiter)    # parsing logic goes here (binary, text, JSON, markup, etc.)
            yield x                        # emit a stream of things
                                           # (e.g., words in the line of a text file,
                                           # or fields in the row of a CSV file)

Here we exploit Python’s lazy evaluation and generators, slurping a sequence of records sequentially (i.e., line after line) from the file on disk. By reading binary data, we can handle any arbitrary data type. However, you’ll need some knowledge about how to split the stream into records; since we assume text-like data above, the easy thing is to split on whitespace (i.e., a record is a word) or commas (i.e., a record is a field in a CSV file). You might also want to parse the records further, into fields, words, etc.

This solution avoids reading the whole input into memory, is readable, and is simple enough that there’s really no point in wrapping it for reuse. Depending on your system/context, you can probably emit a stream of data at 100+ MB/s or better throughput. This is the straightforward pythonic way to do it. Yay, Python!
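For example, here is a runnable variant of the generator in action (the temp file just makes the snippet self-contained; in practice you would point it at your real data):

```python
import os
import tempfile

def get_data(input_filename, delimiter=b','):  # repeated here so the snippet runs standalone
    with open(input_filename, 'rb') as f:
        for record in f:
            yield record.split(delimiter)

# Write a tiny CSV to stand in for a big data source.
with tempfile.NamedTemporaryFile('wb', delete=False, suffix='.csv') as tmp:
    tmp.write(b'id,text\n1,hello\n2,world\n')
    path = tmp.name

rows = list(get_data(path))  # fields keep their trailing newline until parsed further
os.remove(path)
```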

The solution from Part #1 traverses the data source sequentially, yielding a predictable/ordered/biased stream of records. When you are feeding a machine learning model, this is not good. If the data source is small, you can use UNIX tools to pre-sort the data…but what if you can’t afford to pre-sort everything?

Part #2: Decouple processing speed from data traversal

By definition, the state of a sequential process depends on the previous state; for a random process, the state is independent of any previous state. Thus, by sampling randomly from the data source, we decouple the process of reading one record from the process of reading any other record. This is powerful because it enables two things:

- Unbiased streams: records no longer arrive in file order, so the stream you feed downstream carries no sequence bias (more on this under “Avoid sampling bias” below).

- Scalability: when order doesn’t matter, scalability becomes a function of how many (concurrent) streams you instantiate; a stateless/randomized reader is a textbook case for concurrency.

How to do it?

1. Quickly scan the file to identify the location of each record. Don’t actually load any data (for max speed).

2. Store the locations (i.e., offsets) in an array. This is the sequential traversal path.

3. Randomize the traversal path, by shuffling the array of offsets.

4. (Optional) Divide the randomized path into N chunks to be sampled by N concurrent workers.

5. Walk the path, reading and yielding the data at each location.

First we need to separate the function of constructing a traversal path through the file, from the function of emitting a stream of data. Constructing a traversal path can be done many ways. We’ll implement a straightforward one that is fast enough for anything hosted on a single machine. It runs in linear time, but does need to scan the whole file once up front. There may be faster ways to guess the right locations to seek in a file, but those are beyond scope here.

Requirement: As fast as possible

So you can handle data sources that can fit on a single machine (i.e., up to trillions of unique records, < 10 TB). Unless you are a tech giant with your own cloud/distributed hardware infrastructure (looking at you, Google!), this should cover the vast majority of cases where you are feeding machine learning models. It’ll be much faster than most models can compute.

Requirement: Works on input larger than available RAM

Because that’s when scalability really starts to get painful. We’ll solve this by memory-mapping the data source into a 64-bit address space.

Requirement: Avoid sampling bias

If you are training a machine learning model, then you should be aware of the distributions of data you emit as input to your models. Biased distributions of training data can give undesirable results, even when the only difference is the point in training when the model gets to evaluate a given type of record. For example, tweets and Wikipedia articles may both be text data, but the distribution of words is very different between Twitter and Wikipedia. Training a model first on tweets, then on articles could have a different outcome than training a model first on articles, then on tweets. If your data sources are so big you cannot afford to hash, sort, or randomize your data on disk…how do you control the curriculum of data you are using to train your model?

By yielding a random sample from the data source, the problem of sampling bias is solved—in the limit of big data, random samples become unbiased. For reproducibility, simply pass a random seed to the randomization method.
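For instance, seeding Python’s built-in shuffle (a Fisher-Yates implementation) makes the traversal path deterministic:

```python
import random

offsets = list(range(10))     # stand-in for real record offsets
a, b = list(offsets), list(offsets)
random.Random(42).shuffle(a)  # same seed...
random.Random(42).shuffle(b)  # ...same shuffled traversal path
assert a == b
assert sorted(a) == offsets   # a permutation: no records lost or duplicated
```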

Requirement: Works in any data ingestion pipeline

The UNIX philosophy is a gift that keeps giving:
> Write programs that do one thing, and do it well.
> Write programs to work together.
> Write programs to handle text streams, because that is a universal interface.

We embrace the fundamental API of the UNIX shell, using operations that read from a data source and emit delimited text records. Practically speaking, the solution described here can be a drop-in replacement for cat in a UNIX processing pipeline. But where cat yields a sequential stream of text from a file, here we yield a random stream of samples from a file.

Requirement: Linear scaling

Overall, the solution here scales linearly, as O(n). This is no worse than an efficient sequential traversal, and we have earned important benefits: samples are unbiased and the get_data method is essentially stateless. Here’s the breakdown of runtime complexity:

- Scanning a file (sequentially) to find locations: O(n), with a relatively small constant. My workstation can scan 100M records in less than a minute. Do this once up front.

- Storing locations in an array: negligible impact on runtime.

- Shuffling the array of locations (using the Fisher-Yates method): O(n), with a very small constant.

- Yielding data for each record: O(n), with a relatively large constant, probably dominated by parsing/processing logic. Do this once for each time you traverse your dataset.

import mmap

def get_offsets(input_filename):
    offsets = []
    with open(input_filename, 'rb') as f:
        # Lazy eval-on-demand: mmap exploits the POSIX filesystem to map the
        # file into a 64-bit address space, so we never load it all at once.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        loc = mm.tell()                       # 0: the start of the first record
        for record in iter(mm.readline, b''): # sentinel value comparison (b'' at EOF)
            offsets.append(loc)               # store this position as another point on the traversal path
            loc = mm.tell()                   # now at the start of the next record
    return offsets # alternatively, convert to a numpy `uint64` array for compactness and return that

This is the first (sequential) pass through the data source, collecting the position of the beginning of each line. Using memory-mapped virtual addressing allows us to read files much larger than available RAM.
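The shuffle-and-walk half of the recipe (steps 3 and 5 above) can be sketched as follows; `sample_records` is a name chosen here for illustration, not code from the original post:

```python
import mmap
import random

def sample_records(input_filename, offsets, seed=None):
    """Yield raw records from input_filename in shuffled order.

    offsets is the array produced by get_offsets(); pass a seed
    for a reproducible traversal path.
    """
    path = list(offsets)               # copy, so the caller's array is untouched
    random.Random(seed).shuffle(path)  # Fisher-Yates under the hood
    with open(input_filename, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for loc in path:
            mm.seek(loc)         # jump to the start of this record
            yield mm.readline()  # read one record; parsing happens downstream
```

Because each read depends only on its own offset, this generator is effectively stateless between records, which is exactly what makes the concurrent variant below to work.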

Cons:

Random access is fast on solid-state drives, so this works great there. Spinning disk drives (i.e., HDDs) will incur a slowdown due to increased seek time.

Randomly seeking to binary locations in a large file may prevent your disk cache (OS/firmware) from exploiting low-level optimizations.

It’s in Python. That’s a pro or con depending on your perspective. For pure pipeline processing, a multithreaded C implementation would be nice.

Obviously this is a bad idea if order between records must be preserved, for example, time-series data. However, even in these cases, granularity matters: you may have many independent time series, where order matters within each series but not between them. For example, a time series of stock prices for AAPL, versus a time series of stock prices for MSFT.

Notes & Caveats:

Context matters. Of course, “more data than you can process locally” has different meanings in different contexts. Using a desktop workstation with a Samsung 850 Evo solid-state drive, my real-world I/O limit is 500-560 MB/s. On a MacBook Pro with an older SSD, it’s somewhat slower. With the newer PCIe SSD devices, you might be able to hit 2000 MB/s or more. Since it is very difficult to do complex processing (e.g., training a machine learning model) of any sort at 500 MB/s, practically speaking, most readers will find that sampling data at several hundred MB/s is “fast enough”. Anything faster than that, and computation, memory, and bandwidth are probably the limiting factors.

How do I iterate more than once? Multiple possible solutions: (1) instantiate a new generator, with a different random seed, and call each one an epoch; (2) copy the list of offsets some number of times, concatenate, and shuffle. If you want an infinite stream of randomized samples from a fixed data source, the offset array could be a generator function 😉
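A minimal sketch of option (1), assuming you already have the array of offsets (the function name `epoch_offsets` is illustrative):

```python
import random

def epoch_offsets(offsets, n_epochs, seed=0):
    """Yield offsets for n_epochs passes over the data source,
    each pass in a freshly shuffled order (one 'epoch' per pass)."""
    for epoch in range(n_epochs):
        path = list(offsets)
        random.Random(seed + epoch).shuffle(path)  # new seed per epoch => new order
        for loc in path:
            yield loc  # seek here and read the record, as in Part #2
```

For an infinite stream of randomized samples, swap `range(n_epochs)` for `itertools.count()`.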

Operating systems and tools. If you are using a POSIX-compliant operating system like Linux, Mac/OSX, BSD, etc…then everything should work as written above. If you are using a Windows operating system, we recommend installing Cygwin to access a shell environment.

Efficient arrays. If you don’t mind adding an extra dependency to the code, consider using Numpy data structures to store record locations, and the Numpy random method to shuffle the array. The runtime speedup will probably be negligible, but you’ll save memory, which could be significant on very large files.

Concurrency. The solution here is well suited to a divide-and-conquer concurrent processing solution. Divide the array of offsets into N chunks and distribute amongst N workers.
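The split itself is a small helper; a sketch (the name `chunk_offsets` is hypothetical, chosen for this example):

```python
def chunk_offsets(offsets, n_workers):
    """Split a (shuffled) traversal path into n_workers near-equal chunks,
    one per concurrent worker."""
    k, r = divmod(len(offsets), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + k + (1 if i < r else 0)  # spread the remainder evenly
        chunks.append(offsets[start:end])
        start = end
    return chunks
```

Each worker then walks its own chunk independently; since the reads are stateless, no coordination between workers is needed.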

MapReduce. If you were expecting to see a “big data” essay about using MapReduce to process silly amounts of data in parallel, well, this isn’t that kind of thing, because there are already many of those lurking about online. This post was written for folks who are working on a local data source, and need to do more complex operations than map and reduce. Unless you are operating in the cloud for everything, sending data over the wire to a remote cluster may be prohibitive compared to a local read. Plus, unless you are deploying at scale, the mental overhead of configuring a MapReduce cluster just to process a bunch of data is probably unnecessary.

What about _______? (big binary data, protobuf, Avro, etc.) A number of purpose-built tools exist for handling big streams of data. An old-school/hard-core programmer (you know, the folks who name their programs in ALLCAPS and write ANSI C) might put everything into big binaries and slice directly into the data. But oh man, bugs can be painful, especially in a collaborative environment…to say nothing of maintainability or portability. Google’s protobufs and Apache’s Avro aim to solve the data serialization problem, but at the expense of forcing you into another abstraction or schema. Also, it’s probably somewhat foreign to your data science/machine learning workflow. The solutions above are not a full-on engineering solution for data serialization. We present a nice little hack: a fast, lightweight, readable, reusable way to pull unbiased streams of data from your local data sources.

Does this work for my ___ data? If you can store it on your local filesystem, chances are you can modify a class to implement whatever custom parsing logic you might need.

Wait, there’s no magic here! Aren’t all these tricks just sensible engineering?