Parquet file source cannot read certain types of parquet files

Details

Fixed a bug that caused errors in the File source when it read Parquet files that were not generated through Hadoop.

Description

When the File source is used to read a Parquet file that does not contain 'avro.read.schema', 'parquet.avro.schema', or 'avro.schema' in its footer, the job fails with:

java.lang.Exception: java.lang.LinkageError: loader constraint violation: when resolving method "org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lorg/codehaus/jackson/JsonNode;)V" the class loader (instance of co/cask/cdap/internal/app/runtime/plugin/PluginClassLoader) of the current class, org/apache/parquet/avro/AvroSchemaConverter, and the class loader (instance of co/cask/cdap/internal/app/runtime/ProgramClassLoader) for the method's defining class, org/apache/avro/Schema$Field, have different Class objects for the type org/codehaus/jackson/JsonNode used in the signature
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[org.apache.hadoop.hadoop-mapreduce-client-common-2.8.0.jar:na]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) ~[org.apache.hadoop.hadoop-mapreduce-client-common-2.8.0.jar:na]
java.lang.LinkageError: loader constraint violation: when resolving method "org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lorg/codehaus/jackson/JsonNode;)V" the class loader (instance of co/cask/cdap/internal/app/runtime/plugin/PluginClassLoader) of the current class, org/apache/parquet/avro/AvroSchemaConverter, and the class loader (instance of co/cask/cdap/internal/app/runtime/ProgramClassLoader) for the method's defining class, org/apache/avro/Schema$Field, have different Class objects for the type org/codehaus/jackson/JsonNode used in the signature
at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:222) ~[parquet-avro-1.8.1.jar:1.8.1]
at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:209) ~[parquet-avro-1.8.1.jar:1.8.1]
at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:124) ~[parquet-avro-1.8.1.jar:1.8.1]
at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:179) ~[parquet-hadoop-1.8.1.jar:1.8.1]
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:201) ~[parquet-hadoop-1.8.1.jar:1.8.1]
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145) ~[parquet-hadoop-1.8.1.jar:1.8.1]
at co.cask.hydrator.plugin.batch.source.PathTrackingInputFormat$TrackingRecordReader.initialize(PathTrackingInputFormat.java:140) ~[1510704095912-0/:na]
at co.cask.hydrator.plugin.batch.source.PathTrackingInputFormat$TrackingParquetRecordReader.initialize(PathTrackingInputFormat.java:232) ~[1510704095912-0/:na]
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper.initialize(CombineFileRecordReaderWrapper.java:69) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.8.0.jar:na]
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initialize(CombineFileRecordReader.java:59) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.8.0.jar:na]
at co.cask.cdap.internal.app.runtime.batch.dataset.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:73) ~[na:na]
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.8.0.jar:na]
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.8.0.jar:na]
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.8.0.jar:na]
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270) ~[org.apache.hadoop.hadoop-mapreduce-client-common-2.8.0.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]

I believe the root cause is that the app exports Avro classes in order to implement the error dataset. This can cause classloader errors like the one above. We may be able to work around this in the plugin until error datasets are removed and the app no longer exposes Avro classes.
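The mechanism behind the LinkageError can be illustrated in plain Java: class identity in the JVM is the pair (class name, defining classloader), so the same bytes defined by two different loaders (here, PluginClassLoader and ProgramClassLoader) produce two distinct, incompatible Class objects. A minimal, self-contained sketch (IsolatingLoader and Payload are illustration-only names, not part of CDAP or Parquet):

```java
import java.io.InputStream;

public class LoaderDemo {
    // A classloader that defines its own copy of one target class
    // instead of delegating to the parent -- roughly what happens
    // when a plugin classloader and a program classloader each
    // carry their own copy of org.codehaus.jackson.JsonNode.
    static class IsolatingLoader extends ClassLoader {
        private final String target;

        IsolatingLoader(String target) {
            this.target = target;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (name.equals(target)) {
                try (InputStream in = getResourceAsStream(name.replace('.', '/') + ".class")) {
                    byte[] bytes = in.readAllBytes();
                    // Define the class in THIS loader, creating a new Class object.
                    return defineClass(name, bytes, 0, bytes.length);
                } catch (Exception e) {
                    throw new ClassNotFoundException(name, e);
                }
            }
            return super.loadClass(name, resolve);
        }
    }

    // Stand-in for a shared type like org.codehaus.jackson.JsonNode.
    public static class Payload {}

    public static void main(String[] args) throws Exception {
        String name = Payload.class.getName();
        Class<?> a = new IsolatingLoader(name).loadClass(name);
        Class<?> b = new IsolatingLoader(name).loadClass(name);
        // Same bytes, same name -- but two distinct Class objects:
        System.out.println(a == b);                           // false
        System.out.println(a.getName().equals(b.getName()));  // true
    }
}
```

When a method signature resolved through one loader references such a type, and the defining class of that method saw the type through the other loader, the JVM rejects the resolution with exactly the "loader constraint violation" seen in the trace above.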