NULL 语义

一个表由一组行组成，每行包含一组列。列与数据类型相关联，表示实体的一个特定属性（例如，age 是一个名为 person 的实体的列）。有时，特定行的列值在行创建时是未知的。在 SQL 中，这些值表示为 NULL。本节详细介绍了在各种运算符、表达式和其他 SQL 构造中处理 NULL 值的语义。

比较运算符
逻辑运算符
表达式
WHERE、HAVING 和 JOIN 子句中的条件表达式
聚合运算符 (GROUP BY, DISTINCT)
排序运算符 (ORDER BY 子句)
集合运算符 (UNION, INTERSECT, EXCEPT)
EXISTS/NOT EXISTS 子查询
IN/NOT IN 子查询

下文展示了一个名为 person 的表的模式布局和数据。数据在 age 列中包含 NULL 值，该表将在以下各节的各种示例中使用。

表格: person

Id	姓名	年龄
100	Joe	30
200	Marry	NULL
300	Mike	18
400	Fred	50
500	Albert	NULL
600	Michelle	30
700	Dan	50

比较运算符

Apache Spark 支持标准比较运算符，例如 >、>=、=、< 和 <=。当一个操作数或两个操作数未知或为 NULL 时，这些运算符的结果为未知或 NULL。为了比较 NULL 值的相等性，Spark 提供了一个 null-safe 等于运算符 (<=>)，当其中一个操作数为 NULL 时返回 False，当两个操作数都为 NULL 时返回 True。下表说明了当一个或两个操作数为 NULL 时，比较运算符的行为。

左操作数	右操作数	>	>=	=	<	<=	<=>
NULL	任意值	NULL	NULL	NULL	NULL	NULL	False
任意值	NULL	NULL	NULL	NULL	NULL	NULL	False
NULL	NULL	NULL	NULL	NULL	NULL	NULL	True

示例

-- Normal comparison operators return `NULL` when one of the operand is `NULL`.
SELECT 5 > null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Normal comparison operators return `NULL` when both the operands are `NULL`.
SELECT null = null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Null-safe equal operator return `False` when one of the operand is `NULL`
SELECT 5 <=> null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+

-- Null-safe equal operator return `True` when one of the operand is `NULL`
SELECT NULL <=> NULL;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

逻辑运算符

Spark 支持标准逻辑运算符，例如 AND、OR 和 NOT。这些运算符接受布尔表达式作为参数并返回布尔值。

下表说明了当一个或两个操作数为 NULL 时，逻辑运算符的行为。

左操作数	右操作数	OR	AND
True	NULL	True	NULL
False	NULL	NULL	False
NULL	True	True	NULL
NULL	False	NULL	False
NULL	NULL	NULL	NULL

操作数	NOT
NULL	NULL

示例

-- Normal comparison operators return `NULL` when one of the operands is `NULL`.
SELECT (true OR null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- Normal comparison operators return `NULL` when both the operands are `NULL`.
SELECT (null OR false) AS expression_output
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Null-safe equal operator returns `False` when one of the operands is `NULL`
SELECT NOT(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

表达式

比较运算符和逻辑运算符在 Spark 中被视为表达式。除了这两种表达式之外，Spark 还支持其他形式的表达式，例如函数表达式、转换表达式等。Spark 中的表达式大致可以分为：

不容忍 NULL 的表达式
可以处理 NULL 值操作数的表达式
- 这些表达式的结果取决于表达式本身。

不容忍 NULL 的表达式

当表达式的一个或多个参数为 NULL 时，不容忍 NULL 的表达式返回 NULL，大多数表达式都属于此类别。

示例

SELECT concat('John', null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT positive(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT to_date(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

可以处理 NULL 值操作数的表达式

这类表达式旨在处理 NULL 值。表达式的结果取决于表达式本身。例如，函数表达式 isnull 在 NULL 输入时返回 true，在非 NULL 输入时返回 false，而函数 coalesce 返回其操作数列表中第一个非 NULL 值。然而，当所有操作数都为 NULL 时，coalesce 返回 NULL。下面是此类表达式的不完整列表。

COALESCE
NULLIF
IFNULL
NVL
NVL2
ISNAN
NANVL
ISNULL
ISNOTNULL
ATLEASTNNONNULLS
IN

示例

SELECT isnull(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- Returns the first occurrence of non `NULL` value.
SELECT coalesce(null, null, 3, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|                3|
+-----------------+

-- Returns `NULL` as all its operands are `NULL`. 
SELECT coalesce(null, null, null, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT isnan(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+

内置聚合表达式

聚合函数通过处理一组输入行来计算单个结果。以下是聚合函数处理 NULL 值的规则。

所有聚合函数在处理时都会忽略 NULL 值。
- 唯一例外是 COUNT(*) 函数。
当所有输入值都为 NULL 或输入数据集为空时，某些聚合函数返回 NULL。
这些函数列表如下：
- MAX
- MIN
- SUM
- AVG
- EVERY
- ANY
- SOME

示例

-- `count(*)` does not skip `NULL` values.
SELECT count(*) FROM person;
+--------+
|count(1)|
+--------+
|       7|
+--------+

-- `NULL` values in column `age` are skipped from processing.
SELECT count(age) FROM person;
+----------+
|count(age)|
+----------+
|         5|
+----------+

-- `count(*)` on an empty input set returns 0. This is unlike the other
-- aggregate functions, such as `max`, which return `NULL`.
SELECT count(*) FROM person where 1 = 0;
+--------+
|count(1)|
+--------+
|       0|
+--------+

-- `NULL` values are excluded from computation of maximum value.
SELECT max(age) FROM person;
+--------+
|max(age)|
+--------+
|      50|
+--------+

-- `max` returns `NULL` on an empty input set.
SELECT max(age) FROM person where 1 = 0;
+--------+
|max(age)|
+--------+
|    null|
+--------+

WHERE、HAVING 和 JOIN 子句中的条件表达式

WHERE 和 HAVING 运算符根据用户指定的条件过滤行。JOIN 运算符用于根据连接条件组合来自两个表的行。对于这三个运算符，条件表达式是布尔表达式，可以返回 True、False 或未知 (NULL)。如果条件结果为 True，则它们“满足”。

示例

-- Persons whose age is unknown (`NULL`) are filtered out from the result set.
SELECT * FROM person WHERE age > 0;
+--------+---+
|    name|age|
+--------+---+
|Michelle| 30|
|    Fred| 50|
|    Mike| 18|
|     Dan| 50|
|     Joe| 30|
+--------+---+

-- `IS NULL` expression is used in disjunction to select the persons
-- with unknown (`NULL`) records.
SELECT * FROM person WHERE age > 0 OR age IS NULL;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- Person with unknown(`NULL`) ages are skipped from processing.
SELECT age, count(*) FROM person GROUP BY age HAVING max(age) > 18;
+---+--------+
|age|count(1)|
+---+--------+
| 50|       2|
| 30|       2|
+---+--------+

-- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
-- The persons with unknown age (`NULL`) are filtered out by the join operator.
SELECT * FROM person p1, person p2
    WHERE p1.age = p2.age
    AND p1.name = p2.name;
+--------+---+--------+---+
|    name|age|    name|age|
+--------+---+--------+---+
|Michelle| 30|Michelle| 30|
|    Fred| 50|    Fred| 50|
|    Mike| 18|    Mike| 18|
|     Dan| 50|     Dan| 50|
|     Joe| 30|     Joe| 30|
+--------+---+--------+---+

-- The age column from both legs of join are compared using null-safe equal which
-- is why the persons with unknown age (`NULL`) are qualified by the join.
SELECT * FROM person p1, person p2
    WHERE p1.age <=> p2.age
    AND p1.name = p2.name;
+--------+----+--------+----+
|    name| age|    name| age|
+--------+----+--------+----+
|  Albert|null|  Albert|null|
|Michelle|  30|Michelle|  30|
|    Fred|  50|    Fred|  50|
|    Mike|  18|    Mike|  18|
|     Dan|  50|     Dan|  50|
|   Marry|null|   Marry|null|
|     Joe|  30|     Joe|  30|
+--------+----+--------+----+

聚合运算符 (GROUP BY, DISTINCT)

如前一节比较运算符中所述，两个 NULL 值不相等。然而，为了分组和去重处理的目的，两个或多个包含 NULL 数据的值被分组到同一个桶中。此行为符合 SQL 标准和企业级数据库管理系统。

示例

-- `NULL` values are put in one bucket in `GROUP BY` processing.
SELECT age, count(*) FROM person GROUP BY age;
+----+--------+
| age|count(1)|
+----+--------+
|null|       2|
|  50|       2|
|  30|       2|
|  18|       1|
+----+--------+

-- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
SELECT DISTINCT age FROM person;
+----+
| age|
+----+
|null|
|  50|
|  30|
|  18|
+----+

排序运算符 (ORDER BY 子句)

Spark SQL 支持在 ORDER BY 子句中指定 NULL 排序。Spark 根据 NULL 排序规范将所有 NULL 值放在开头或末尾来处理 ORDER BY 子句。默认情况下，所有 NULL 值都放在开头。

示例

-- `NULL` values are shown at first and other values
-- are sorted in ascending way.
SELECT age, name FROM person ORDER BY age;
+----+--------+
| age|    name|
+----+--------+
|null|   Marry|
|null|  Albert|
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|    Fred|
|  50|     Dan|
+----+--------+

-- Column values other than `NULL` are sorted in ascending
-- way and `NULL` values are shown at the last.
SELECT age, name FROM person ORDER BY age NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|     Dan|
|  50|    Fred|
|null|   Marry|
|null|  Albert|
+----+--------+

-- Columns other than `NULL` values are sorted in descending
-- and `NULL` values are shown at the last.
SELECT age, name FROM person ORDER BY age DESC NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  50|    Fred|
|  50|     Dan|
|  30|Michelle|
|  30|     Joe|
|  18|    Mike|
|null|   Marry|
|null|  Albert|
+----+--------+

集合运算符 (UNION, INTERSECT, EXCEPT)

在集合操作的上下文中，NULL 值以 null-safe 的方式进行相等性比较。这意味着在比较行时，两个 NULL 值被认为是相等的，这与常规的 EqualTo(=) 运算符不同。

示例

CREATE VIEW unknown_age SELECT * FROM person WHERE age IS NULL;

-- Only common rows between two legs of `INTERSECT` are in the 
-- result set. The comparison between columns of the row are done
-- in a null-safe manner.
SELECT name, age FROM person
    INTERSECT
    SELECT name, age from unknown_age;
+------+----+
|  name| age|
+------+----+
|Albert|null|
| Marry|null|
+------+----+

-- `NULL` values from two legs of the `EXCEPT` are not in output. 
-- This basically shows that the comparison happens in a null-safe manner.
SELECT age, name FROM person
    EXCEPT
    SELECT age FROM unknown_age;
+---+--------+
|age|    name|
+---+--------+
| 30|     Joe|
| 50|    Fred|
| 30|Michelle|
| 18|    Mike|
| 50|     Dan|
+---+--------+

-- Performs `UNION` operation between two sets of data. 
-- The comparison between columns of the row ae done in
-- null-safe manner.
SELECT name, age FROM person
    UNION 
    SELECT name, age FROM unknown_age;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|     Joe|  30|
|Michelle|  30|
|   Marry|null|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
+--------+----+

EXISTS/NOT EXISTS 子查询

在 Spark 中，EXISTS 和 NOT EXISTS 表达式允许出现在 WHERE 子句中。这些是布尔表达式，返回 TRUE 或 FALSE。换句话说，EXISTS 是一个成员条件，当其引用的子查询返回一行或多行时返回 TRUE。类似地，NOT EXISTS 是一个非成员条件，当子查询不返回任何行或返回零行时返回 TRUE。

这两个表达式不受子查询结果中 NULL 存在的影响。它们通常更快，因为可以将它们转换为半连接/反半连接，而无需特殊的 NULL 感知规定。

示例

-- Even if subquery produces rows with `NULL` values, the `EXISTS` expression
-- evaluates to `TRUE` as the subquery produces 1 row.
SELECT * FROM person WHERE EXISTS (SELECT null);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when
-- subquery produces no rows. In this case, it returns 1 row.
SELECT * FROM person WHERE NOT EXISTS (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- `NOT EXISTS` expression returns `TRUE`.
SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

IN/NOT IN 子查询

在 Spark 中，查询的 WHERE 子句中允许使用 IN 和 NOT IN 表达式。与 EXISTS 表达式不同，IN 表达式可以返回 TRUE、FALSE 或 UNKNOWN (NULL) 值。从概念上讲，IN 表达式在语义上等同于一组由析取运算符 (OR) 分隔的相等条件。例如，c1 IN (1, 2, 3) 在语义上等同于 (C1 = 1 OR c1 = 2 OR c1 = 3)。

至于 NULL 值的处理，其语义可以从比较运算符 (=) 和逻辑运算符 (OR) 中 NULL 值的处理方式推导出来。总而言之，以下是计算 IN 表达式结果的规则。

当列表中找到非 NULL 值时，返回 TRUE。
当列表中未找到非 NULL 值且列表中不包含 NULL 值时，返回 FALSE。
当值为 NULL，或者列表中未找到非 NULL 值且列表包含至少一个 NULL 值时，返回 UNKNOWN。

当列表中包含 NULL 时，无论输入值是什么，NOT IN 始终返回 UNKNOWN。这是因为如果值不在包含 NULL 的列表中，IN 返回 UNKNOWN，而 NOT UNKNOWN 再次是 UNKNOWN。

示例

-- The subquery has only `NULL` value in its result set. Therefore,
-- the result of `IN` predicate is UNKNOWN.
SELECT * FROM person WHERE age IN (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- The subquery has `NULL` value in the result set as well as a valid 
-- value `50`. Rows with age = 50 are returned. 
SELECT * FROM person
    WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
|Fred| 50|
| Dan| 50|
+----+---+

-- Since subquery has `NULL` value in the result set, the `NOT IN`
-- predicate would return UNKNOWN. Hence, no rows are
-- qualified for this query.
SELECT * FROM person
    WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
+----+---+

Spark SQL 指南