How-to: Convert Existing Data into Parquet

Learn how to convert your data to the Parquet columnar format to get big performance gains.

Using a columnar storage format for your data offers significant performance advantages for a large subset of real-world queries.

Last year, Cloudera, in collaboration with Twitter and others, released a new Apache Hadoop-friendly, binary, columnar file format called Parquet. (Parquet was recently proposed for the ASF Incubator.) In this post, you will get an introduction to converting your existing data into Parquet format, both with and without Hadoop.

Implementation Details

The Parquet format is described here. However, it is unlikely that you’ll actually need this repository. Rather, the code you’ll need is the set of Hadoop connectors that you can find here.

The underlying implementation for writing data as Parquet requires a subclass of parquet.hadoop.api.WriteSupport that knows how to take an in-memory object and write Parquet primitives through parquet.io.api.RecordConsumer. Currently, there are several WriteSupport implementations, including ThriftWriteSupport, AvroWriteSupport, and ProtoWriteSupport, with more on the way.
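To make the contract concrete, here is a rough sketch of what a custom WriteSupport looks like. The `Person` record type and the schema are hypothetical, invented for illustration; method names follow the parquet-mr connectors of this era, and details may vary by version:

```java
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import parquet.hadoop.api.WriteSupport;
import parquet.io.api.Binary;
import parquet.io.api.RecordConsumer;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

// Hypothetical in-memory record type used for this sketch.
class Person {
  String name;
  int age;
}

// A WriteSupport that knows how to decompose a Person into Parquet primitives.
public class PersonWriteSupport extends WriteSupport<Person> {
  private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
      "message Person { required binary name; required int32 age; }");

  private RecordConsumer consumer;

  @Override
  public WriteContext init(Configuration configuration) {
    // Declare the file schema (and optional extra key/value metadata).
    return new WriteContext(SCHEMA, new HashMap<String, String>());
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    this.consumer = recordConsumer;
  }

  @Override
  public void write(Person person) {
    // Emit one record as a sequence of Parquet primitive writes.
    consumer.startMessage();
    consumer.startField("name", 0);
    consumer.addBinary(Binary.fromString(person.name));
    consumer.endField("name", 0);
    consumer.startField("age", 1);
    consumer.addInteger(person.age);
    consumer.endField("age", 1);
    consumer.endMessage();
  }
}
```

In practice you rarely write one of these yourself; the shipped Thrift, Avro, and protobuf implementations cover the common object models.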

These WriteSupport implementations are then wrapped as ParquetWriter objects or ParquetOutputFormat objects for writing as standalone programs or through the Hadoop MapReduce framework, respectively.
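For example, outside of MapReduce you can use AvroParquetWriter, which wraps AvroWriteSupport in a standalone writer. This is a sketch against the parquet-avro API of the time (the file path and schema are made up for illustration; constructor signatures may differ across versions):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class StandaloneWriterSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Avro schema; substitute your own.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":"
            + "[{\"name\":\"name\",\"type\":\"string\"}]}");

    AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(new Path("people.parquet"), schema);

    GenericRecord record = new GenericData.Record(schema);
    record.put("name", "Ada");
    writer.write(record);

    // Closing the writer flushes the last row group and writes the footer;
    // a Parquet file without its footer is unreadable.
    writer.close();
  }
}
```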

// set a large block size to ensure a single row group. see discussion
AvroParquetOutputFormat.setBlockSize(job, 500 * 1024 * 1024);

job.setMapperClass(Avro2ParquetMapper.class);
job.setNumReduceTasks(0);

return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Avro2Parquet(), args);
    System.exit(exitCode);
  }
}

The mapper class is extremely simple:

public class Avro2ParquetMapper extends
    Mapper<AvroKey<GenericRecord>, NullWritable, Void, GenericRecord> {

  @Override
  protected void map(AvroKey<GenericRecord> key, NullWritable value,
      Context context) throws IOException, InterruptedException {
    context.write(null, key.datum());
  }
}

You can find the code for the MapReduce Avro-to-Parquet converter here.

Notes on Compression

The Parquet specification allows you to specify a separate encoding/compression scheme for each column individually. However, this feature is not yet implemented on the write path. Currently, choosing a compression scheme will apply the same compression to each column (which should still be an improvement over row-major formats, since each column is still stored and compressed separately). As a general rule, we recommend Snappy compression as a good balance between size and CPU cost.
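Selecting a codec is a one-line job-configuration call on ParquetOutputFormat (a configuration fragment, assuming `job` is your already-constructed MapReduce Job):

```java
import org.apache.hadoop.mapreduce.Job;
import parquet.hadoop.ParquetOutputFormat;
import parquet.hadoop.metadata.CompressionCodecName;

// Applies the same codec to every column, per the limitation described above.
ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
```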

Notes on Block/Page Size

The Parquet specification allows you to specify block (row group) and page sizes. The page size refers to the amount of uncompressed data for a single column that is read before it is compressed as a unit and buffered in memory to be written out as a “page”. In principle, the larger the page size, the better the compression should be, though the 1MB default size already starts to achieve diminishing returns. The block size refers to the amount of compressed data that should be buffered in memory (comprising multiple pages from different columns) before a row group is written out to disk. Larger block sizes require more memory to buffer the data; the 128MB default size also shows good performance in our experience.
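Both knobs are exposed on ParquetOutputFormat. This fragment simply sets the defaults discussed above explicitly (values are in bytes; `job` is assumed to be your MapReduce Job):

```java
import org.apache.hadoop.mapreduce.Job;
import parquet.hadoop.ParquetOutputFormat;

// Row group (block) size: compressed data buffered before a row group is flushed.
ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);  // 128MB default

// Page size: uncompressed per-column data compressed and written as one page.
ParquetOutputFormat.setPageSize(job, 1 * 1024 * 1024);     // 1MB default
```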

Impala prefers that Parquet files contain a single row group (aka a “block”) in order to maximize the amount of data that is stored contiguously on disk. Separately, given a single row group per file, Impala prefers that the entire Parquet file fit into an HDFS block, in order to avoid network I/O. To achieve that goal with MapReduce, each map must write only a single row group. Set the HDFS block size to a number that is greater than the size of the total Parquet output from a single input split — that is, if the HDFS block size is 128MB, and assuming no compression and rewriting the data doesn’t change the total size significantly, then the Parquet block size should be set slightly smaller than 128MB. The only concern here is that the output for the entire input split must be buffered in memory before writing it to disk.
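A configuration fragment illustrating that sizing relationship (the 120MB figure is a hypothetical choice, picked only so that one row group plus the file footer fits inside one 128MB HDFS block):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();

// HDFS block size: 128MB.
conf.setLong("dfs.block.size", 128L * 1024 * 1024);

// Parquet row group size: set slightly smaller than the HDFS block size,
// so the single row group each map writes (plus the footer) stays local.
conf.setLong("parquet.block.size", 120L * 1024 * 1024);
```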

You should now have a good understanding of how to convert your data to Parquet format. The performance gains are substantial.

Uri Laserson (@laserson) is a data scientist at Cloudera. Questions/comments are welcome. Thanks to the Impala team, Tom White, and Julien Le Dem (Twitter) for help getting up-and-running with Parquet.

15 responses on “How-to: Convert Existing Data into Parquet”

Impala will read fastest when it can read contiguous data off of the disks. The Parquet format stores column groups contiguously on disk; breaking the file into multiple row groups will cause a single column to store data discontiguously. Therefore, to maximize the size of the column group, you want to have only a single row group. And based on the response to Jakub Kukul, you want to make sure that single row group fits into a single HDFS block, to avoid network I/O.

Whoops, it appears we wrote it backwards here (will edit shortly). The ideal is that the HDFS block size and the size of the Parquet file/row group are exactly the same, but this is obviously impossible to achieve. If the size of the Parquet file is larger than the HDFS block size, then reading the full file will require I/O over the network instead of local disk, which is slow. Therefore, you want to make the entire Parquet file fit into the HDFS block (so Parquet block size should be smaller than HDFS block size).

Put another way, Impala wants to avoid network I/O whenever possible. But also see response to Daniel Gomez.

Hi,
Thanks for this helpful article.
I could convert my data into parquet but I see that along with the parquet file, corresponding .crc file is created too.
How can I avoid the creation of this .crc file?

I read that “Currently, there are several WriteSupport implementations, including ThriftWriteSupport, AvroWriteSupport, and ProtoWriteSupport, with more on the way.”

However I have not seen this choice available elsewhere. For instance Hive seems to pick one of these and not give us a choice when writing. I don’t know whether it will read all of them without hinting.

I have one design question for which I am not able to get an answer.

I have to store data in columnar format and Apache Parquet is a good choice for it, and I also want to use HCatalog for universal access through Pig and Hive. But I read somewhere that Parquet is not compatible with HCatalog, so I wanted to know how I can go about this. I also wanted to know whether Apache Avro fits into this category. How is Avro different from Parquet? Do they work in parallel with each other or opposite of each other?

I am trying to convert Avro to Parquet in standalone mode (without using Hadoop) on the Windows platform. When I execute the code, I get the following exception:

Exception in thread “main” java.io.IOException: Cannot run program “D:\winutil\bin\winutils.exe”: CreateProcess error=216, This version of %1 is not compatible with the version of Windows you’re running. Check your computer’s system information to see whether you need a x86 (32-bit) or x64 (64-bit) version of the program, and then contact the software publisher
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
at parquet.hadoop.ParquetFileWriter.&lt;init&gt;(ParquetFileWriter.java:176)
at parquet.hadoop.ParquetWriter.&lt;init&gt;(ParquetWriter.java:211)
at parquet.hadoop.ParquetWriter.&lt;init&gt;(ParquetWriter.java:175)
at parquet.hadoop.ParquetWriter.&lt;init&gt;(ParquetWriter.java:146)
at parquet.hadoop.ParquetWriter.&lt;init&gt;(ParquetWriter.java:113)
at parquet.hadoop.ParquetWriter.&lt;init&gt;(ParquetWriter.java:87)
at parquet.hadoop.ParquetWriter.&lt;init&gt;(ParquetWriter.java:62)
at parquet.avro.AvroParquetWriter.&lt;init&gt;(AvroParquetWriter.java:43)
at com.cerner.odw.CreateUserDatasetGenericParquet.main(CreateUserDatasetGenericParquet.java:59)
Caused by: java.io.IOException: CreateProcess error=216, This version of %1 is not compatible with the version of Windows you’re running. Check your computer’s system information to see whether you need a x86 (32-bit) or x64 (64-bit) version of the program, and then contact the software publisher
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.&lt;init&gt;(ProcessImpl.java:386)
at java.lang.ProcessImpl.start(ProcessImpl.java:137)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
… 21 more

Hi, I used your ‘Non-Hadoop (Standalone) Writer’ to write Parquet format to HDFS. But my application needs to append to the same file, because I’m writing records on arrival. Is there a way to achieve this? Caching the writer object is not feasible in my case.