
Despite the extensive use of distributed databases and filesystems in data-driven workflows, there remains a persistent need to rapidly read text files on single machines. Surprisingly, most modern text file readers fail to take advantage of multi-core architectures, leaving much of the I/O bandwidth of high-performance storage systems unused. ParaText, introduced here, reads text files in parallel on a single multi-core machine to consume more of that bandwidth. The alpha release includes a parallel Comma Separated Values (CSV) reader with Python bindings.

For almost 50 years, CSV has been the format of choice for tabular data. Given the ubiquity of CSV and the pervasive need to deal with it in real workflows — where speed, accuracy, and fault tolerance are a must — we decided to build a CSV reader that runs in parallel.

We conducted extensive benchmarks of ParaText against 7 CSV readers and 5 binary readers; please refer to our benchmarking whitepaper for the details. In our tests, ParaText can load a CSV file at a rate of 2.5 GB/second from a cold disk and parse it out-of-core at 4.2 GB/second from a warm disk. ParaText can parse and perform out-of-core computations on a 5 TB CSV file in under 30 minutes.

Why CSV?

The simplicity of CSV is enticing. CSV is conceptually easy to parse, and it is human readable. Spreadsheet programs and COBOL-era legacy databases can at least write CSV. Indeed, it has become widely used for exchanging tabular data. Unfortunately, the RFC standard is so loosely followed in practice that malformed CSV files proliferate. The format also lacks a universally accepted schema, so even “proper” CSV files may have ambiguous semantics that each application interprets differently.
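As a small illustration (our own example, using Python's standard csv module rather than ParaText), the same line parses differently depending on how strictly a reader honors RFC 4180 quoting:

import csv, io

row = 'Ada,"1,5"\n'
print(next(csv.reader(io.StringIO(row))))                          # quote-aware: ['Ada', '1,5']
print(next(csv.reader(io.StringIO(row), quoting=csv.QUOTE_NONE)))  # naive split: ['Ada', '"1', '5"']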

In spite of CSV’s issues, the community needs robust tools to process CSV data. We set out to build a fast, memory-efficient, generic multi-core text reader. Our CSV reader is the first to make use of this infrastructure.
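To make the discussion below concrete, here is a minimal sketch of the column iterator interface. The entry point name load_csv_as_iterator is assumed for this sketch and may not match the released Python bindings exactly; the expand and forget keywords are the ones described next.

import paratext

# NOTE: the function name is illustrative; consult the ParaText bindings for
# the exact iterator entry point.
for name, column in paratext.load_csv_as_iterator("cars.csv", expand=True, forget=True):
    print(name, len(column))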

Here, the expand keyword forces ParaText to use strings to represent categories rather than integers. The forget keyword causes the iterator to free each column’s memory in the parser after the column has been visited, which avoids doubling the memory usage.

Data

The files used in our benchmarks ranged in size from 21 MB to 5.076 TB. The whitepaper describes the characteristics of each data set and how to download them.

File sizes for each data set and each format. Binary files are more compact than CSV files.

1. ParaText is fast!

ParaText had a higher throughput than any of the other CSV readers tested, on every dataset tried.

What makes a reader fast? A fast reader exploits the capabilities of the storage system. The plot shows the throughput of each CSV loader on four data sets: car (6.71 GB, categorical-heavy), floats4 (25.5 GB, float-heavy), mnist8m (14.96 GB, small integers), and messy2 (2.1 GB, text-heavy). The I/O bandwidth is shown in black for comparison. Bars are omitted where a method crashed, produced an error, or was incompatible with the data set.

2. ParaText is memory-efficient!

ParaText had the lowest overall memory footprint. Dato SFrame had very low memory usage on text data as long as the data frame stayed in Dato’s kernel. Spark reserves a large heap up-front, which makes it difficult both to assess its memory efficiency and to use our measurements to inform how to provision resources for Spark jobs.

3. Binary readers leave bandwidth unused!

The binary readers we tested achieve throughputs significantly below the I/O bandwidth, whereas ParaText comes much closer to it.

Throughput matters. Throughput gives insight into how well each method uses the available bandwidth. Although ParaText’s runtimes are higher (the CSV files are larger than their binary counterparts), its throughput is higher than that of the binary methods. H5Py (HDF5) may need better defaults for parallel reads.

4. Fast conversion of DataFrames

Spark DataFrame, Dato SFrame, and ParaText can convert from their internal representations to a Python object in one line of code. This enhances the interactive experience of the data scientist.

# Spark
df = spark_data_frame.toPandas()

# Dato
df = dato_sframe.to_dataframe()
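For completeness, ParaText’s Python bindings go from a CSV file straight to a pandas DataFrame in one call as well (the helper name below is taken from the ParaText repository and may change in later releases):

# ParaText
import paratext
df = paratext.load_csv_to_pandas("data.csv")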

This conversion matters for the data scientist’s interactive experience: ParaText can convert a multi-gigabyte data set in seconds, while Spark and Dato take minutes.

Interactive Experience? It took many minutes to convert Spark DataFrames and Dato SFrames to an equivalent Python representation.

5. ParaText is cheaper!

Costs matter. The pro-rated cost of each method as a multiple of ParaText’s cost.

6. ParaText approaches the limits of hardware!

We defined two baseline tasks to establish upper bounds on the throughput of CSV loading: newline counting and out-of-core CSV parsing. Depending on the task, ParaText achieves a throughput very close to the estimated I/O bandwidth.
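To give a feel for the simpler of the two baselines, here is our own sketch of a newline-counting pass (not the benchmark harness from the whitepaper). It streams the file in large chunks and counts newline bytes, so its throughput approximates the raw read bandwidth of the storage system.

def count_newlines(path, chunk_size=16 * 1024 * 1024):
    # Read the file in 16 MB chunks and count newline bytes.
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total

print(count_newlines("data.csv"))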

How much overhead? We compared CSV file loading, out-of-core CSV parsing, and newline counting with the bandwidth of the storage system.

7. ParaText is Medium Data

ParaText can handle multi-terabyte data sets with ease. In our tests, ParaText and Spark were the only methods that successfully loaded and summed a 1+ TB file on a single machine.

Have TB+ data? We tried to load medium1.csv (1.015 TB) with each method.

Interested?

Dr. Damian Eads is a co-founder of Wise.io and main creator of its core machine learning technologies. He spent a decade as a machine learning researcher at Los Alamos National Laboratory. After his PhD in Computer Science at UC Santa Cruz, he was a visiting scholar at UC Berkeley and later a postdoctoral scholar at Cambridge University.

ps. We’re looking for amazing engineers to help us build out our novel infrastructure to orchestrate massive machine learning pipelines. If you’re the one, get in touch!

I think all of the above numbers are very impressive. It's a tragedy that the differences are so large that you need log-scale y-axes, which at first glance make it look as if the differences weren't actually large at all.
What I found a tad misleading is section 3, in particular its relevance. If I understand this correctly, imagine I had data in an HDF5 file of size x and a CSV file of size 10x. Your plot allows me to infer that reading the former file would be approximately 4 times slower than reading the latter relative to its size. But that would still make reading it 2.5 times faster overall. The factor of 10 I suggested here may be far from the truth for HDF5, but it's difficult for the audience to get an idea of the kind of factors one would see in the wild. So this section gives me the impression that you're trying to say "we're slower than HDF5, naturally, but there's this way of looking at it that makes us come out on top", where this new way seems a bit contrived. You make such a convincing argument that I'm not sure it's even necessary to go down that road. But if it is, would you mind also showing how much slower it would be to read the same data in CSV rather than HDF5, in wall-clock time? I ask because I think reading CSV cannot reasonably be expected to be as fast as reading a binary format. But CSV has many strengths, like human readability (you can put CSV files under version control and run diff on them if you feel like it; maybe not for a 5 TB file, but for smaller ones certainly). So if you manage to get within, say, a factor of 3 in reading speed, I think that's already the kind of number that would convince you not to go with HDF5, which is a complex, cumbersome library that'll happily corrupt your data if you let it.

Damian Eads

Thanks for your feedback. The whitepaper shows the runtimes of each binary reader in detail. This blog post is meant to summarize the whitepaper. In light of your feedback, we have updated the post to include information about the file sizes and the run times of the binary readers.

Ivan

Agreed, the HDF5 "benchmark" seems completely obscure in this context. What was the file size, the dataset shape, the type of data? Was chunking used? Was shuffling used? Was compression used (because compression filters like Blosc oftentimes speed things up)? In other words, give me your dataset and, believe me, I'll read it a few times faster from HDF5 than what's claimed here, both runtime-wise and throughput-wise.

Damian Eads

Thanks for your feedback. The paratext/bench/convert.py script converts CSV files to HDF5, Feather, and NPY. For HDF5, we did ds=f.create_dataset("mydataset", X.shape, dtype=X.dtype) followed by ds[...] = X. If you can suggest a better way to generate the HDF5 file, please let us know.
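Expanded into a self-contained form, that conversion looks roughly as follows (X here is a stand-in NumPy array; in convert.py it would be the data parsed from the CSV file):

import h5py
import numpy as np

X = np.random.rand(1000, 10)  # stand-in for the parsed CSV data
with h5py.File("mydataset.h5", "w") as f:
    ds = f.create_dataset("mydataset", X.shape, dtype=X.dtype)
    ds[...] = X  # one contiguous write; no chunking, shuffling, or compression filters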

statquant

This looks very impressive. Given that I understand the implementation is in C++, do you have any plans to provide an R package?

Damian Eads

Thank you for your interest in ParaText. We welcome pull requests to add support for R bindings and other languages. A few people have already expressed interest in helping with this.

Bill Gale

Hi Damian,
Very impressive and clever. But you have to read the fine print on the GitHub page to see that datetimes are unsupported. That is the bulk of our processing time for loading CSV data. When you support the full set of data types, that will be a truly interesting head-to-head comparison. I guess the approach for now would be to load datetimes as strings and process them internally.

Damian Eads

Thanks for your feedback. Support for DateTimes is something that others want badly, but there simply was not enough time to support all features in the first go. None of the data sets in our benchmarks had DateTime data. The type-checking step effectively runs in the cache, so it is not a dominant bottleneck compared to parsing and storing the data. DateTime support will probably not affect performance much on these data sets.
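In the meantime, a rough sketch of the workaround Bill describes is to load the datetime column as strings and parse it afterwards with pandas (the column name and the load_csv_to_pandas helper are illustrative):

import pandas as pd
import paratext

df = paratext.load_csv_to_pandas("events.csv")
# "timestamp" is an illustrative column name; parse the strings after loading.
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)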

statquant

For R, handling ISO datetimes expressed in UTC would be easy; there is a parsing package called fasttime that could be used.

Gus G

You lament that "the RFC standard is so loosely followed", yet the GitHub page says you need a special "extra overhead" option to actually parse CSV correctly, and the source examples on this page don't mention that.
Is ParaText still fast if you actually follow the spec?

Damian Eads

Thank you for your interest. This is our first release, so our main goal here is to checkpoint where each method stands today. Then, different projects can prioritize how important it is for them to optimize further. For most of the methods we tested, we found a case where they failed to produce correct results. A full analysis of CSV spec compliance was beyond the scope of this blog post.