In Apache Hive, as in SQL, you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY behave differently in Hive.
Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column type: a numeric column is sorted in numeric order, and a string column is sorted in lexicographical order.
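To see why the column type matters, here is a small sketch in Python (used only as a stand-in, this is not HiveQL) that sorts the same made-up values once as numbers and once as strings. It assumes plain ASCII digits, where Python's string comparison matches a lexicographical ordering.

```python
# Illustration only: Python's sorted() stands in for Hive's comparison rules.
values = [2, 10, 1, 33, 4]  # made-up sample data

# Numeric column: values are compared as numbers.
numeric_order = sorted(values)

# String column: values are compared character by character.
string_order = sorted(str(v) for v in values)

print(numeric_order)  # [1, 2, 4, 10, 33]
print(string_order)   # ['1', '10', '2', '33', '4']
```

Note how "10" lands before "2" in the string case, which is a common surprise when a numeric value is stored in a string column.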
Ordering: Data is sorted within each of the N reducers, but the reducers can hold overlapping ranges of data.
Outcome: N or more sorted files with overlapping ranges.
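The effect of sorting per reducer can be sketched with a toy model in Python. This is not how Hive is implemented; the function name `simulate_sort_by` and the round-robin assignment of rows to reducers are assumptions made purely for illustration.

```python
# Illustration only: a toy model of SORT BY, not Hive's actual implementation.
def simulate_sort_by(rows, num_reducers):
    """Deal rows out to reducers, then sort each reducer's rows independently."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Assumption: round-robin assignment stands in for however rows
        # happen to reach the reducers.
        reducers[i % num_reducers].append(row)
    # Each reducer sorts only the rows it received.
    return [sorted(part) for part in reducers]

rows = [9, 1, 2, 7, 5, 3]  # made-up sample data
outputs = simulate_sort_by(rows, num_reducers=2)
for n, part in enumerate(outputs):
    print(f"reducer {n}: {part}")
# reducer 0: [2, 5, 9]
# reducer 1: [1, 3, 7]
```

Each output is sorted on its own, but the ranges 2 to 9 and 1 to 7 overlap, so concatenating the two outputs does not give a globally sorted result.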