GROUP BY 子句
描述
GROUP BY
子句用于根据一组指定的 grouping 表达式对行进行分组,并基于一个或多个指定的聚合函数对行组进行聚合计算。Spark 还支持高级聚合,通过 GROUPING SETS
、CUBE
、ROLLUP
子句对同一输入记录集进行多个聚合。grouping 表达式和高级聚合可以在 GROUP BY
子句中混合使用,并嵌套在 GROUPING SETS
子句中。有关详细信息,请参阅“混合/嵌套分组分析”部分。当 FILTER 子句附加到聚合函数时,只有匹配的行才会传递给该函数。
语法
GROUP BY group_expression [ , group_expression [ , ... ] ] [ WITH { ROLLUP | CUBE } ]
GROUP BY { group_expression | { ROLLUP | CUBE | GROUPING SETS } (grouping_set [ , ...]) } [ , ... ]
聚合函数定义为
aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ]
参数
-
group_expression
指定将行分组在一起的条件。行的分组基于 grouping 表达式的结果值执行。grouping 表达式可以是列名(如
GROUP BY a
)、列位置(如GROUP BY 0
)或表达式(如GROUP BY a + b
)。 -
grouping_set
grouping 集由括号中零个或多个逗号分隔的表达式指定。当 grouping 集只有一个元素时,可以省略括号。例如,
GROUPING SETS ((a), (b))
与GROUPING SETS (a, b)
相同。语法:
{ ( [ expression [ , ... ] ] ) | expression }
-
GROUPING SETS
为 GROUPING SETS 之后指定的每个 grouping 集对行进行分组。例如,
GROUP BY GROUPING SETS ((warehouse), (product))
在语义上等同于GROUP BY warehouse
和GROUP BY product
结果的并集。此子句是UNION ALL
的简写形式,其中UNION ALL
运算符的每个分支都执行GROUPING SETS
子句中指定的每个 grouping 集的聚合。类似地,GROUP BY GROUPING SETS ((warehouse, product), (product), ())
在语义上等同于GROUP BY warehouse, product
、GROUP BY product
和全局聚合的结果的并集。注意: 为了与 Hive 兼容,Spark 允许使用
GROUP BY ... GROUPING SETS (...)
。GROUP BY 表达式通常会被忽略,但如果它包含比 GROUPING SETS 表达式更多的表达式,则额外的表达式将包含在 grouping 表达式中,并且该值始终为空。例如,SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b)
,列c
的输出始终为空。 -
ROLLUP
在单个语句中指定多个级别的聚合。此子句用于根据多个 grouping 集计算聚合。
ROLLUP
是GROUPING SETS
的简写形式。例如,GROUP BY warehouse, product WITH ROLLUP
或GROUP BY ROLLUP(warehouse, product)
等同于GROUP BY GROUPING SETS((warehouse, product), (warehouse), ())
。GROUP BY ROLLUP(warehouse, product, (warehouse, location))
等同于GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ())
。ROLLUP
规范的 N 个元素会生成 N+1 个GROUPING SETS
。 -
CUBE
CUBE
子句用于根据GROUP BY
子句中指定的 grouping 列的组合执行聚合。CUBE
是GROUPING SETS
的简写形式。例如,GROUP BY warehouse, product WITH CUBE
或GROUP BY CUBE(warehouse, product)
等同于GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ())
。GROUP BY CUBE(warehouse, product, (warehouse, location))
等同于GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (product, warehouse, location), (warehouse), (product), (warehouse, product), ())
。CUBE
规范的 N 个元素会生成 2^N 个GROUPING SETS
。 -
混合/嵌套分组分析
GROUP BY 子句可以包含多个
group_expression
和多个CUBE|ROLLUP|GROUPING SETS
。GROUPING SETS
也可以包含嵌套的CUBE|ROLLUP|GROUPING SETS
子句,例如GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))
、GROUPING SETS(warehouse, GROUPING SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))
。CUBE|ROLLUP
只是GROUPING SETS
的语法糖,请参阅上面的部分了解如何将CUBE|ROLLUP
转换为GROUPING SETS
。在此上下文中,group_expression
可以视为单个组的GROUPING SETS
。对于GROUP BY
子句中的多个GROUPING SETS
,我们通过对原始GROUPING SETS
进行叉积来生成单个GROUPING SETS
。对于GROUPING SETS
子句中的嵌套GROUPING SETS
,我们只需获取其 grouping 集并将其剥离。例如,GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())
和GROUP BY warehouse, ROLLUP(product), CUBE(location, size)
等同于GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse, product, size), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse, size), (warehouse))
。GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))
等同于GROUP BY GROUPING SETS((warehouse), (warehouse, product))
。 -
aggregate_name
指定聚合函数名称(MIN、MAX、COUNT、SUM、AVG 等)。
-
DISTINCT
在将输入行传递给聚合函数之前删除其中的重复项。
-
FILTER
过滤输入行,其中
WHERE
子句中的boolean_expression
计算结果为 true 的行将传递给聚合函数;其他行将被丢弃。
示例
CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
INSERT INTO dealer VALUES
(100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, 'Fremont', 'Honda CRV', 7),
(200, 'Dublin', 'Honda Civic', 20),
(200, 'Dublin', 'Honda Accord', 10),
(200, 'Dublin', 'Honda CRV', 3),
(300, 'San Jose', 'Honda Civic', 5),
(300, 'San Jose', 'Honda Accord', 8);
-- Sum of quantity per dealership. Group by `id`.
SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 32|
|200| 33|
|300| 13|
+---+-------------+
-- Use column position in GROUP by clause.
SELECT id, sum(quantity) FROM dealer GROUP BY 1 ORDER BY 1;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 32|
|200| 33|
|300| 13|
+---+-------------+
-- Multiple aggregations.
-- 1. Sum of quantity per dealership.
-- 2. Max quantity per dealership.
SELECT id, sum(quantity) AS sum, max(quantity) AS max FROM dealer GROUP BY id ORDER BY id;
+---+---+---+
| id|sum|max|
+---+---+---+
|100| 32| 15|
|200| 33| 20|
|300| 13| 8|
+---+---+---+
-- Count the number of distinct dealer cities per car_model.
SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
+------------+-----+
| car_model|count|
+------------+-----+
| Honda Civic| 3|
| Honda CRV| 2|
|Honda Accord| 3|
+------------+-----+
-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
SELECT id, sum(quantity) FILTER (
WHERE car_model IN ('Honda Civic', 'Honda CRV')
) AS `sum(quantity)` FROM dealer
GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 17|
|200| 23|
|300| 5|
+---+-------------+
-- Aggregations using multiple sets of grouping columns in a single statement.
-- Following performs aggregations based on four sets of grouping columns.
-- 1. city, car_model
-- 2. city
-- 3. car_model
-- 4. Empty grouping set. Returns quantities for all city and car models.
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
ORDER BY city;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| null| HondaAccord| 33|
| null| HondaCRV| 10|
| null| HondaCivic| 35|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- Group by processing with `ROLLUP` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY city, car_model WITH ROLLUP
ORDER BY city, car_model;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- Group by processing with `CUBE` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY city, car_model WITH CUBE
ORDER BY city, car_model;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| null| HondaAccord| 33|
| null| HondaCRV| 10|
| null| HondaCivic| 35|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
--Prepare data for ignore nulls example
CREATE TABLE person (id INT, name STRING, age INT);
INSERT INTO person VALUES
(100, 'Mary', NULL),
(200, 'John', 30),
(300, 'Mike', 80),
(400, 'Dan', 50);
--Select the first row in column age
SELECT FIRST(age) FROM person;
+--------------------+
| first(age, false) |
+--------------------+
| NULL |
+--------------------+
--Get the first row in column `age` ignore nulls,last row in column `id` and sum of column `id`.
SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
+-------------------+------------------+----------+
| first(age, true) | last(id, false) | sum(id) |
+-------------------+------------------+----------+
| 30 | 400 | 1000 |
+-------------------+------------------+----------+