Hadoop’s Schema-on-Read model does not impose any requirements when loading data into Hadoop.

Data can be simply loaded into HDFS without association of a schema or preprocess the data. Although creating a carefully structured and organized repository of your data will provide many benefits. It allows for enforcing access and quota controls to prevent accidental deletion or corruption.

Data model will be highly dependent on the specific use case. For example, data warehouse implementations and other event stores are likely to use a schema similar to the traditional star schema, including structured fact and dimension tables. Unstructured and semi-structured data, on the other hand, are likely to focus more on directory placement and metadata management.

Make sure your design will work well with the tools you are planning to use. The schema design is highly dependent on the way the data will be queried.

Keep usage patterns in mind when designing a schema. Different data processing and querying patterns work better with different schema designs. Understanding the main use cases and data retrieval requirements will result in a schema that will be easier to maintain and support in the long term as well as improve data processing performance.

Optimize organisation of data with partitioning, bucketing, and denormalizing strategies. Keeping in mind, storing a large number of small files in Hadoop can lead to excessive memory use for the NameNode.

A good average bucket size is a few multiples of the HDFS block size. Having an even distribution of data when hashed on the bucketing column is important because it leads to consistent bucketing. Also, having the number of buckets as a power of two is quite common.

Hadoop schema consolidates many of the small dimension tables into a few larger dimensions by joining them during the ETL process.

Hadoop-specific file formats include columnar format such as Parquet and RCFile, serialization formats like Avro, and file-based data structures such as sequence files.

Splittability and compression are the key consideration for storing data in Hadoop. It allows large files to be split for input to MapReduce and other types of jobs. Splittability is a fundamental part of parallel processing.

SequenceFiles

SequenceFiles store data as binary key-value pairs and can be uncompressed or compressed. SequenceFiles are well supported within the Hadoop ecosystem, however their support outside of the ecosystem is limited. They are also only supported in Java.

Storing a largenumber of small files in Hadoop can cause a couple of issues. A common use case for SequenceFiles is as a container for smaller files.

Seriaization Formats

Serialization is the process of turning data structures into byte streams. Data storage and transmission are main purpose of serialization. The main serialization format utilized by Hadoop is Writables. Writables are compact and fast, but limited to Java.

However, other serialization frameworks getting more reputation within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro. Avro is the most efficient and specifically created to address limitations of Hadoop Writables.

Thrift

Thrift was designed at Facebook as a framework for developing cross-language interfaces to services. Using Thrift allowed Facebook to implement a single interface that can be used with different languages to access different underlying systes.

Thrift does not support compression of records, it’s not splittable, and have no native MapReduce support.

Protocol Buffers

The Protocol Buffer (protobuf) was developed at Google to facilitate data exchange between services written in different languages. Protocol Buffers are not splittable, do not support internal compression of records, and have no native MapReduce support.

Avro

Avro is a language-neutral data serialization system designed to address the downside of Hadoop Writables: lack of language portability. Since Avro stores
the schema in the header of each file, it’s self-describing and Avro files can easily be read from a different language than the one used to write the file. Avro is splittable.

Avro stores the data definition in JSON format making it easy to read and interpret, the data itself is stored in binary format making it compact and efficient.

Avro supports native MapReduce and schema evolution. The scehma used to read a file does not need to match the schema used to write the file which provides great flexibility with requirement change.

Avro supports a number of data types such as Boolean, int, float, and string. It also supports complex types such as array, map, and enum.

Columnar Formats

Most RDBMS stored data in a row-oriented format. This is efficient when many columns of the record need to be fetched. This
option can also be more efficient when you’re writing data, particularly if all columns of the record are available at write time because the record can be written with a single disk seek.

More recently, a number of databases have introduced columnar data storage which is well suited for data warehousing and queries that only access a small subset of columns. Columnar data sets provides more efficient compression.

The RCFile was developed to provide fast data loading, fast query processing, and efficient processing for MapReduce applications, although it’s only seen use as a Hive storage format.

The RCFile format breaks files into row splits, then within each split uses column-oriented storage. It also has some deficiencies that prevent optimal performance for query times and compression. RCFile is still a fairly common format used with Hive storage.

ORC

The ORC format has come to life to address some of the weaknesses with the RCFile format, specifically around storage and query performance efficiency. The ORC provides lightweight, always-on compression provided by type-specific readers
and writers. Supports the Hive type model, including new primitives such as decimal and complex types. Is a splittable storage format.

Parquet

Parquet documents says: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Parquet shares many of the same design goals as ORC, but is intended to be a general-purpose storage format for Hadoop. The goal is to create a format that’s suitable for different MapReduce interfaces such as Java, Hive, and Pig, and also suitable for other processing engines

such as Impala and Spark. Parquet provides the following benefits, many of which it shares with ORC:

• Similar to ORC files, Parquet allows for returning only required data fields, thereby reducing I/O and increasing performance.
• Is designed to support complex nested data structures.
• Compression can be specified on a per-column level.
• Fully supports being able to read and write to with Avro and Thrift APIs.
• Stores full metadata at the end of files, so Parquet files are self-documenting.
• Uses efficient and extensible encoding schemas—for example, bit-packaging/run length encoding (RLE).

Conclusion

Having a single interface to all the files in your Hadoop cluster is valuable. Speaking of picking a file format, you will want to pick one with a schema because, in the end, most data in Hadoop will be structured or semistructured data.

So if you need a schema, Avro and Parquet are great options. However, we don’t want to have to worry about making an Avro version of the schema and a Parquet version.

Thankfully, this isn’t an issue because Parquet can be read and written to with Avro APIs and Avro schemas.

We can meet our goal of having one interface to interact with our Avro and Parquet files, and we can have a block and columnar options for storing our data.