Fixed an issue where the File Sink plugin was failing when writing byte array records.

Description

If you try writing a byte[] to a TPFSParquet sink (at least in Spark), you get an exception when Parquet's AvroWriteSupport attempts to cast the byte[] to a ByteBuffer. To reproduce, create a realtime pipeline that reads from Kafka and writes to TPFSParquet. The source will read data, but the pipeline will fail to write anything, with an exception like:

java.lang.ClassCastException: [B cannot be cast to java.nio.ByteBuffer
at parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:208) ~[parquet-avro-1.6.0.jar:1.6.0]
at parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:112) ~[parquet-avro-1.6.0.jar:1.6.0]
at parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:87) ~[parquet-avro-1.6.0.jar:1.6.0]
at parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:44) ~[parquet-avro-1.6.0.jar:1.6.0]
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121) ~[org.apache.hive.hive-exec-1.2.1.jar:1.2.1]
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123) ~[org.apache.hive.hive-exec-1.2.1.jar:1.2.1]
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42) ~[org.apache.hive.hive-exec-1.2.1.jar:1.2.1]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1113) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
at org.apache.spark.scheduler.Task.run(Task.scala:89) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
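The root cause can be reproduced in isolation: a byte[] ("[B" in JVM type notation) is not a subtype of ByteBuffer, so the unchecked cast performed inside AvroWriteSupport fails at runtime. A minimal standalone sketch (the class name here is hypothetical, for illustration only):

```java
import java.nio.ByteBuffer;

// Standalone reproduction of the ClassCastException above: casting an
// Object that actually holds a byte[] to ByteBuffer always fails, because
// Java arrays and ByteBuffer are unrelated types.
public class CastDemo {
    public static void main(String[] args) {
        Object value = new byte[] {1, 2, 3};
        try {
            ByteBuffer buf = (ByteBuffer) value;  // the cast AvroWriteSupport effectively performs
        } catch (ClassCastException e) {
            // Exact message text varies by JVM version
            System.out.println(e);
        }
    }
}
```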

More investigation is needed to determine whether this also affects the Avro sink, and whether the issue is specific to Spark or MapReduce is affected as well. In any case, the fix is probably to convert any byte[] to a ByteBuffer when a StructuredRecord is converted to a GenericRecord, since Avro represents BYTES fields as java.nio.ByteBuffer.
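The proposed fix can be sketched as follows. This is a minimal illustration, not the actual CDAP record-transform code; the class and method names are hypothetical:

```java
import java.nio.ByteBuffer;

// Hypothetical helper illustrating the proposed fix: when a field value
// taken from a StructuredRecord is placed into an Avro GenericRecord,
// wrap any raw byte[] in a ByteBuffer, since Avro (and hence Parquet's
// AvroWriteSupport) expects BYTES-typed values as java.nio.ByteBuffer.
public class BytesFieldConverter {
    public static Object convertFieldValue(Object value) {
        if (value instanceof byte[]) {
            // ByteBuffer.wrap avoids copying the underlying array.
            return ByteBuffer.wrap((byte[]) value);
        }
        // All other value types pass through unchanged.
        return value;
    }

    public static void main(String[] args) {
        Object converted = convertFieldValue(new byte[] {1, 2, 3});
        System.out.println(converted instanceof ByteBuffer); // prints "true"
    }
}
```

Applying this conversion at StructuredRecord-to-GenericRecord time would fix the problem for all sinks and execution engines at once, rather than patching each sink individually.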