Month: November 2018

Consider the following situation: a bundle of .avro files is stored on HDFS, and they need to be converted to Impala tables. Schemas are not provided externally; each schema is embedded in the header of its .avro file. But Impala has a known issue with Avro tables, and their usage is pretty limited: we can create an Avro-based table only if all the columns and their types are declared manually in the CREATE TABLE statement; otherwise it fails with an error.

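For illustration, a manual declaration that Impala does accept looks roughly like this (the table name, columns, and location are hypothetical):

```sql
-- Hypothetical example: Impala accepts an Avro table only when
-- every column is spelled out explicitly in the DDL.
CREATE TABLE events_avro (
  event_id BIGINT,
  event_time STRING,
  payload STRING
)
STORED AS AVRO
LOCATION '/data/avro_files/';
```

With hundreds of columns, writing this out by hand is exactly the pain point this post works around.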

But what if we have hundreds of columns, or are not completely sure of the schema, and would like to automate the table-creation process?

Disclaimer: you can’t do that directly, but there is a workaround: create a temporary Avro-backed table in Hive, then create a Parquet table from it with CREATE TABLE ... AS SELECT, and finally run INVALIDATE METADATA in Impala so that it picks up the changes in the table set.
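The Hive/Impala part of the workaround can be sketched roughly as follows (table names, HDFS paths, and the schema file location are hypothetical; extracting the .avsc schema file itself is covered step by step below):

```sql
-- 1. In Hive: a temporary Avro-backed table that takes its columns
--    from the extracted schema file, so nothing is declared by hand.
CREATE EXTERNAL TABLE tmp_avro_table
STORED AS AVRO
LOCATION '/data/avro_files/'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/file.avsc');

-- 2. In Hive: materialize it as Parquet; the column list is
--    derived automatically from the Avro table.
CREATE TABLE final_parquet_table
STORED AS PARQUET
AS SELECT * FROM tmp_avro_table;

-- 3. In Impala (impala-shell): pick up the new tables.
INVALIDATE METADATA;
```

Once the Parquet table exists, the temporary Avro table can be dropped.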

Step-by-step algorithm:

1. Check if you already have file.avsc alongside file.avro; if yes, skip step #2.

2. Create an external Avro schema file (file.avsc) from file.avro:
   - Download avro-tools-1.7.7.jar – the official Apache Avro command-line tool. Pay attention to the version: it must be 1.7.7.
   - Place it somewhere on your server’s local file system, e.g. ~/jars/ (the path used below).
   - Run the commands below.

# parameters you will need
# <tmp_local_path>=/tmp/get_avro any random path inside /tmp folder of local file system
# <file_name> name of the file.avro from where you want to extract schema
# <hdfs_file_path> absolute path on hdfs to target file.avro
# <result_schema_file_path> path on hdfs where you would like to get created schema
# create tmp folder
mkdir -p <tmp_local_path>
# read the first 50 KB of file.avro with cat and store it as a sample file on the local file system
# depending on the number of columns, 50 KB may not be enough; increase the size if the extracted schema comes out truncated
hdfs dfs -cat <hdfs_file_path> | head --bytes 50K > <tmp_local_path>/<file_name>_sample
# use the avro-tools jar to extract the schema from the sample file and store it under the name of the original file.avro with extension .avsc
java -jar ~/jars/avro-tools-1.7.7.jar getschema <tmp_local_path>/<file_name>_sample > <tmp_local_path>/<file_name>.avsc
# copy our created file.avsc (the Avro schema) from the local file system back to HDFS
hdfs dfs -put <tmp_local_path>/<file_name>.avsc <result_schema_file_path>
# clean up the temporary folder
rm -rf <tmp_local_path>