Hive dialect supports a commonly-used subset of Hive’s DQL. The following lists some parts of HiveQL supported by the Hive dialect.
- Sort/Cluster/Distributed BY
- Group By
- Set Operation
- Lateral View
- Window Functions
- Table Sample
The following section describes the overall query syntax. The SELECT clause can be part of a query which also includes common table expressions (CTE), set operations, and various other clauses.
[WITH CommonTableExpression [ , ... ]] SELECT [ALL | DISTINCT] select_expr [ , ... ] FROM table_reference [WHERE where_condition] [GROUP BY col_list] [ORDER BY col_list] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [LIMIT [offset,] rows]
SELECTstatement can be part of a set query or a sub-query of another query
CommonTableExpressionis a temporary result set derived from a query specified in a
table_referenceindicates the input to the query. It can be a regular table, a view, a join or a sub-query.
- Table names and column names are case-insensitive
WHERE Clause #
WHERE condition is a boolean expression. Hive dialect supports a number of operators and UDFs
WHERE clause. Some types of sub queries are supported in
GROUP BY Clause #
Please refer to GROUP BY for more details.
ORDER BY Clause #
ORDER BY clause is used to return the result rows in a sorted manner in the user specified order.
Different from SORT BY,
ORDER BY clause guarantees
a global order in the output.
Note: To guarantee global order, there has to be single one task to sort the final output. So if the number of rows in the output is too large, it could take a very long time to finish.
CLUSTER/DISTRIBUTE/SORT BY #
Please refer to Sort/Cluster/Distributed BY for more details.
ALL and DISTINCT Clauses #
DISTINCT options specify whether duplicate rows should be returned or not.
If none of these two options are given, the default is
ALL (all matching rows are returned).
DISTINCT specifies removal of duplicate rows from the result set.
LIMIT Clause #
LIMIT clause can be used to constrain the number of rows returned by the
LIMIT takes one or two numeric arguments, which must both be non-negative integer constants.
The first argument specifies the offset of the first row to return and the second specifies the maximum number of rows to return.
When a single argument is given, it stands for the maximum number of rows and the offset defaults to 0.
Following is an example of using hive dialect to run some queries.
Note: Hive dialect no longer supports Flink SQL queries. Please switch to default dialect if you’d like to write in Flink syntax.
Flink SQL> create catalog myhive with ('type' = 'hive', 'hive-conf-dir' = '/opt/hive-conf'); [INFO] Execute statement succeed. Flink SQL> use catalog myhive; [INFO] Execute statement succeed. Flink SQL> load module hive; [INFO] Execute statement succeed. Flink SQL> use modules hive,core; [INFO] Execute statement succeed. Flink SQL> set table.sql-dialect=hive; [INFO] Session property has been set. FLINK SQL> set sql-client.execution.result-mode=tableau; Flink SQL> select explode(array(1,2,3)); -- call hive udtf +----+-------------+ | op | col | +----+-------------+ | +I | 1 | | +I | 2 | | +I | 3 | +----+-------------+ Received a total of 3 rows Flink SQL> create table tbl (key int,value string); [INFO] Execute statement succeed. Flink SQL> insert into table tbl values (5,'e'),(1,'a'),(1,'a'),(3,'c'),(2,'b'),(3,'c'),(3,'c'),(4,'d'); [INFO] Submitting SQL update statement to the cluster... [INFO] SQL update statement has been successfully submitted to the cluster: FLINK SQL> set execution.runtime-mode=batch; -- change to batch mode Flink SQL> select * from tbl cluster by key; -- run cluster by 2021-04-22 16:13:57,005 INFO org.apache.hadoop.mapred.FileInputFormat  - Total input paths to process : 1 +-----+-------+ | key | value | +-----+-------+ | 1 | a | | 1 | a | | 5 | e | | 2 | b | | 3 | c | | 3 | c | | 3 | c | | 4 | d | +-----+-------+ Received a total of 8 rows