Apache Hive is data warehouse infrastructure built on top of Apache Hadoop for providing platform for data summarization, ad-hoc query, data aggregration and analysis of datasets.The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data stored in Apache Hadoop.

Changing the default Metastore in Hive

Even though Derby database is the default metastore in Hive ,we can change it by editing hive-site.xml in the hive installation directory .

If a table is to be used in ACID writes (insert, update, delete) then the table property “transactional” must be set on that table, starting with Hive 0.14.0. Without this value, inserts will be done in the old style; updates and deletes will be prohibited.

Retrieve Maximum Salary in Hive Table

Let us suppose we have a table employee consisting of name and salary

Second Highest Salary

SELECT *
FROM (SELECT salary ,ROW_NUMBER() over (ORDER BY salary) as row_no FROM employee GROUP BY SALARY) res
WHERE res.row_no = 2 ;

Third Highest Salary

SELECT * FROM
(SELECT salary ,ROW_NUMBER() over (ORDER BY salary) as row_num FROM
employee GROUP BY SALARY) res
WHERE res.row_num = 3;

preventing a large job from running for a long time

This can be done by setting the MapReduce jobs to execute in strict mode by using set hive.mapred.mode=strict;
Use of this strict mode ensures that the queries on partitioned table cannot execute
The strict mode ensures that the queries on partitioned tables cannot execute without defining a WHERE clause.

Order by vs cluster by vs SORT in hive ?

The ORDER BY clause is familiar from other SQL dialects. It performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.

ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up one sorted file as output.

SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.

DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn’t sort the output of each reducer. You end up with N or unsorted files with non-overlapping ranges.

CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.

Types of UDF in Hive

User Designed Function*(UDF)

User Defined Aggregrate Functions(UDAF)

User Defined Table Function(UDTF)

Developing and Deploying the UDF(User Defined Function )

Build a Jar file containing UDF and add it to Hive instance to be used later.

Recover Partition in Hive

Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (manually by hadoop fs -put command), the metastore will not be aware of these partitions.
you need to add partition for every partition

ALTERTABLE<tablename>ADDPARTITION

Or in short you can run

MSCKREPAIRTABLE<tablename>;

available since Hive 0.11
It will add any partitions that exist on HDFS but not in metastore to the metastore.

Copy the partition folders and data to a table folder.
Create a table with dynamic partitions on the table folder.
Run MCSK Repair on the new table.

Creating Indexes in Hive

An Index is a pointer on a particular column of a table. Creating an index means creating a pointer on a particular column of a table.

var_samp(col)

stddev_pop(col)

stddev_samp(col)

Returns the unbiased sample standard deviation of a numeric column in the group

covar_pop(col1, col2)

Returns the population covariance of a pair of numeric columns in the group

covar_samp(col1, col2)

Returns the sample covariance of a pair of a numeric columns in the group

corr(col1, col2)

Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group

percentile(BIGINT col, p)

Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

percentile(BIGINT col, array(p1 [, p2]…))

Returns the exact percentiles p1, p2, … of a column in the group (does not work with floating point types). pi must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

histogram_numeric(col, b)

Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights

collect_set(col)

Returns a set of objects with duplicate elements eliminated

collect_list(col)

Returns a list of objects with duplicates (as of Hive 0.13.0)

NTILE

This function divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics. (As of Hive 0.11.0.).

Cube Dimensions in Hive

Cubes are the physical implementations of dimensional data model. A cube captures the structure in the data model and organizes measures and dimensions in an optimal layout. Queries on cubes are highly efficient and can support online applications and dashboards.

Windowing functions in Hive

Windowing allows you to create a window on a set of data further allowing aggregation surrounding that data. Windowing in Hive is introduced from Hive 0.11.