Category: Hive

What is Apache HCatalog?


What is HCatalog?

Apache HCatalog is a storage management layer for Hadoop that helps users of different data processing tools in the Hadoop ecosystem, such as Hive, Pig and MapReduce, easily read and write data on the cluster. HCatalog provides a relational view of data stored on HDFS in formats such as RCFile, Parquet, ORC and SequenceFile. It also exposes a REST API that external systems can use to access the metadata. (more…)


Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive, as in SQL, you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY behave differently in Hive.

Sort By vs Order By vs Group By vs Cluster By in Hive

SORT BY

Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column type: a numeric column is sorted in numeric order, and a string column is sorted in lexicographical order.

Ordering : It orders the data within each of the ‘N’ reducers, but the reducers can hold overlapping ranges of data, so there is no total order across them.

Outcome : N or more sorted files with overlapping ranges. (more…)
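
The per-reducer behaviour described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Hive code: the helper names `sort_by` and `order_by` are hypothetical, and the round-robin split is an assumption that stands in for however Hive happens to spread rows across reducers.

```python
def sort_by(rows, num_reducers):
    """Toy model of SORT BY: each reducer sorts only the rows it received."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Assumption: round-robin stands in for Hive's row distribution.
        buckets[i % num_reducers].append(row)
    return [sorted(bucket) for bucket in buckets]


def order_by(rows):
    """Toy model of ORDER BY: one reducer sees every row, so the order is total."""
    return [sorted(rows)]


rows = [5, 1, 2, 9, 7, 3]

# Two "reducers": each output is sorted, but the ranges overlap (2..7 and 1..9).
print(sort_by(rows, 2))

# One "reducer": a single, globally sorted output.
print(order_by(rows))

# Column type matters: numbers sort numerically, strings lexicographically.
print(sorted([10, 9, 2]))
print(sorted(["10", "9", "2"]))
```

Each inner list plays the role of one reducer's output file. With SORT BY the lists are individually sorted yet cover overlapping ranges, while the ORDER BY model produces one fully ordered list.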

Hive Strict Mode


What is Hive Strict Mode?

Hive Strict Mode (hive.mapred.mode=strict) lets Hive restrict certain performance-intensive operations, such as –

  • It rejects queries on partitioned tables that have no WHERE clause filtering on a partition column.

  • It rejects ORDER BY operations without a LIMIT clause (since ORDER BY uses a single reducer, which can choke your processing if not handled properly).
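
To make the two restrictions above concrete, here is a toy checker in plain Python. It is a sketch only and not part of Hive: the function `strict_mode_violations` and the dictionary keys it reads are hypothetical, and a real query would of course be validated by Hive itself at compile time.

```python
def strict_mode_violations(query):
    """Return the strict-mode rules that a simplified query description breaks.

    `query` is a hypothetical dict, not a Hive object:
      partitioned      - True if the table being read is partitioned
      partition_filter - True if the WHERE clause filters on a partition column
      order_by         - True if the query has an ORDER BY clause
      limit            - the LIMIT value, or None if there is no LIMIT
    """
    problems = []
    if query.get("partitioned") and not query.get("partition_filter"):
        problems.append("partitioned table queried without a partition filter")
    if query.get("order_by") and query.get("limit") is None:
        problems.append("ORDER BY without LIMIT")
    return problems


# A full scan of a partitioned table with an unbounded ORDER BY breaks both rules.
print(strict_mode_violations({"partitioned": True, "order_by": True}))

# Adding a partition filter and a LIMIT satisfies both rules.
print(strict_mode_violations(
    {"partitioned": True, "partition_filter": True, "order_by": True, "limit": 100}
))
```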

Also for dynamic partitions (hive.exec.dynamic.partition.mode=strict) –

This is the default setting. It prevents all partitions from being dynamic and requires at least one static partition.



How-To: Connect HiveServer2 service with JDBC Client?

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC. (more…)

How-To: Configure MySQL Metastore for Hive?

By default, Hive comes with Derby as its metastore storage, which is suited only for testing purposes. In most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide on how to configure a MySQL metastore for Hive in place of the default Derby metastore.

Assumptions : Basic knowledge of Unix is assumed. It is also assumed that the Hadoop and Hive configurations are in place, and that Hive with the default Derby metastore is properly configured and tested.

  1. Install MySQL –

Note: You will be prompted to set a password for root.

(more…)
