Hive is one of the leading SQL engine running on Hadoop. Hive has had a long relationship with Hadoop from the start to support SQL like syntax. Even though Hive supports SQL like syntax there are some differences the in the Hive data types vs. SQL data types. Let’s walk through the different data types in Hive.

Traditional SQL vs. SQL on Hadoop

Why are we switching to SQL on Hadoop vs. traditional relational database? It’s all about the data…well particularly the data growth. In the past 2 years we’ve doubled the total data in human history and we will do the same in the next 2 years. Most of all that data being generated is unstructured data. Relational database depend on the schema being in place before storing the data vs. Hadoop defining the data only before it’s read. Read more about Schema on Read vs. Schema on Write.

Traditional relational database are really fast at processing data, but start to break down as data scales. Ever tried to run a complex join on a 10 TB table? Not fun on a relational database! Hive was built to run joins on TBs and PBs of data in a table. Remember our data growth is exploding and we are going to need solutions that will scale. Hadoop was built to help scale data and Hive is the SQL engine that runs on top of Hadoop.

Why do we have data types in Hive

Traditional SQL or HiveQL requires data types be defined on each column. You have probably wondered why do we declare a data type? The two reasons for declaring the data type is storage and analysis. First the storage allows for Hive to more efficiently allocate the storage for the data. There is a huge difference between storing the number 1 vs. a precision number with a million decimal places. Developers have the option to allocate only the storage needed. The second reason is allowing for analysis of the columns. Without data types 1 + 1 wouldn’t be 2 when the data types were strings. Every developer has run across in all programming language where they have had to convert from string to integer. Hive is no different declare the data types to prevent errors when analyzing data.

Data Types in Hive

In Hive data types are categorized as either complex or primitive. The primitive data types are the largest and are typically broken into sub categories. If you familiar with SQL you will recognize the primitive data types as datetime, numeric, string, and boolean. The sub categories in Hive are officially Hive are Date Time, Numeric, String, and Misc. The complex data types aren’t as complex with the categories include Struct, Array, and Map. Let’s break the categories and sub categories down with examples.

Example Data

Before we learn the different data types in Hive let’s look at the sample data set we will use. For the example below we will refer the sample data set from NASA’s public record. The sample set is a list of all recorded Meteorite landings.

NASA Meteorite Landings Data Set

Primitive Hive Data Types

say a few sentances about primitice data types

Numeric Hive Data Types

The numeric data types is the largest sub category because well numbers are really important in calculations. Also numbers tend to have the most variable storage needs.

TinyINT – smallest numeric type only covering a hundred.

1

2

3

TINYINT

....

-128-127

SmallINT – small numeric type covering into the tens of thousands.

1

2

3

SMALLINT

....

-32,768-32,767

INT is the most popular type as 4 bytes covering into the billions.

1

2

3

INT

....

-2,147,483,684-2,147,483,647

BIGINT is a bigger version of INT (hence the name) 8 byte covering into the trillions.

1

2

3

BIGINT

....

-9,223,372,036,854,775,808-9,223,372,036,854,775,807

Float is the second most used numeric type that is a 4-byte precise number.

1

2

3

FLOAT

....

1.0

DOUBLE or DOUBLE PRECISION is an 8-byte precise numeric type.

1

2

3

DOUBLEorDOUBLEPRECISION

....

1.0

Decimal or Numeric is a new numeric data type that can have a user defined size.

1

2

3

4

5

DECIMAL(n,n)

....

1.0

Cast decimal(2,2)

Date Time Hive Data Types

TIMESTAMPS the most popular time data type in Hive. UNIX timedate supporting nanoseconds.

1

2

3

4

TIMESTAMP

....

1880-01-0100:00:00(optional9nanoseconds)

YYYY-MM-DD HH-MM-SS

DATE mostly used to cast STRINGs or DATETIME to data only.

1

2

3

4

DATE

....

1880-01-01

YYYY-MM-DD

INTERVAL used when calculating time or converting from one unit (seconds, minutes, hours, etc) to another unit.

1

2

3

4

INTERVAL

....

60SECONDS

Above depends on the unit being converted.Example above1minute converted toseconds.

String Hive Data Types

STRING literals cast with single or double quotes.

1

2

3

STRING

....

("Abee")or('Abee')

Misc Hive Data Types

BOOLEAN is a standard true or false field

1

2

3

BOOLEAN

....

TRUE-FALSE

BINARY is a MISC type used for storing binary data

1

2

3

BINARY

....

(2323)<--Binary data with variable storage

Complex Hive Data Types

Hive supports complex data which are strongly correlated to Java. Remember Hive is at the simplest level an abstraction of Java for Hadoop Analyst who didn’t have Java skills. Complex data types give developer/analyst the ability to encapsulate values.

Array Hive Data Types

Array hold a specific number of data types in storage. While a complex data type in Hive, arrays are basic building block of programming.

1

2

3

ARRAY

....

ARRAY<INT(any numeric data type)>

Map Hive Data Types

Map acts as container that hold values and key identifiers to the values. Think of map as creating a database table with numeric id for each value in that table and associating the values by id vs. value.

1

2

3

MAP

....

MAP<INT(any primitive data type),INT(any primitive data type)>

Struct Hive Data Types

Structs are similar to array and Map in allowing to encapsulate multiple values into columns. Use caution with Structs because they are loosely defined which opens them to abuse.

1

2

3

4

5

STRUCTS

....

STRUCT<id:INT,name:STRING,address:STRING>

See how quickly thiscould get messy?

Recapping

Remember Hive is one the leading SQL engines on Hadoop because it’s ease of use. Hive data types can be broken down into two categories primitive and complex. Once you’ve conquered the data types in Hive get ready to move higher in the development stack with complex queries or using Pig to move unstructured data in Hive. Make sure to sign up for my newsletter to make sure you have all the skills a Data Engineer needs to be ROCKSTAR developer.