@PublicEvolving public interface SupportsBucketing
Enables writing bucketed data into a DynamicTableSink.
Buckets enable load balancing in an external storage system by splitting data into disjoint subsets. These subsets group rows with potentially "infinite" keyspace into smaller and more manageable chunks that allow for efficient parallel processing.
Bucketing depends heavily on the semantics of the underlying connector. However, a user can influence the bucketing behavior by specifying the number of buckets, the bucketing algorithm, and (if the algorithm allows it) the columns which are used for target bucket calculation.
All bucketing components (i.e. bucket number, distribution algorithm, bucket key columns) are optional from a SQL syntax perspective. This ability interface defines which algorithms (listAlgorithms()) are effectively supported and whether a bucket count is mandatory (requiresBucketCount()). The planner will perform necessary validation checks.
Given the following SQL statements:
-- Example 1
CREATE TABLE MyTable (uid BIGINT, name STRING) DISTRIBUTED BY HASH(uid) INTO 4 BUCKETS;
-- Example 2
CREATE TABLE MyTable (uid BIGINT, name STRING) DISTRIBUTED BY (uid) INTO 4 BUCKETS;
-- Example 3
CREATE TABLE MyTable (uid BIGINT, name STRING) DISTRIBUTED BY (uid);
-- Example 4
CREATE TABLE MyTable (uid BIGINT, name STRING) DISTRIBUTED INTO 4 BUCKETS;
Example 1 declares a hash function on a fixed number of 4 buckets (i.e. HASH(uid) % 4 = target bucket). Example 2 leaves the selection of an algorithm up to the connector, represented as TableDistribution.Kind#UNKNOWN. Additionally, Example 3 leaves the number of buckets up to the connector. In contrast, Example 4 only defines the number of buckets.
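The expression HASH(uid) % 4 from Example 1 can be illustrated with a minimal sketch. The hash function actually used is connector-specific; the use of Long.hashCode and the class and method names below are assumptions made only for illustration.

```java
// Hypothetical sketch, not part of the Flink API.
final class HashBucketSketch {

    // Maps a key to a bucket index in [0, numBuckets).
    // Assumption: Long.hashCode stands in for the connector's HASH function.
    static int targetBucket(long uid, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash values.
        return Math.floorMod(Long.hashCode(uid), numBuckets);
    }

    public static void main(String[] args) {
        for (long uid : new long[] {1L, 7L, 42L}) {
            System.out.println("uid " + uid + " -> bucket " + targetBucket(uid, 4));
        }
    }
}
```

Every row with the same uid lands in the same bucket, so the four buckets are disjoint subsets of the table.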
A sink can implement both SupportsPartitioning and SupportsBucketing. Conceptually, a partition can be seen as a kind of "directory" whereas buckets correspond to "files" per directory. Partitioning splits the data on a small, human-readable keyspace (e.g. by year or by geographical region). This enables efficient selection via equality, inequality, or ranges due to knowledge about existing partitions. Bucketing operates within partitions on a potentially large or even infinite keyspace.
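The "directory" and "file" analogy can be sketched as a function that derives a storage location from a partition value and a bucket key. The path layout (year=.../bucket-...), the hash function, and all names below are hypothetical; real connectors decide their own layout.

```java
// Hypothetical sketch, not part of the Flink API.
final class PartitionBucketLayoutSketch {

    // The partition value selects the "directory"; the bucket index selects the "file".
    // Assumption: Long.hashCode stands in for the connector's hash function.
    static String targetFile(int year, long uid, int numBuckets) {
        int bucket = Math.floorMod(Long.hashCode(uid), numBuckets);
        return "year=" + year + "/bucket-" + bucket;
    }

    public static void main(String[] args) {
        System.out.println(targetFile(2024, 42L, 4));
    }
}
```

The partition key (year) has few, readable values, while the bucket key (uid) can take arbitrarily many values that are folded into a fixed number of buckets.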
See also: SupportsPartitioning
Modifier and Type | Method and Description |
---|---|
Set<TableDistribution.Kind> | listAlgorithms() Returns the set of supported bucketing algorithms. |
boolean | requiresBucketCount() Returns whether the DynamicTableSink requires a bucket count. |
Set<TableDistribution.Kind> listAlgorithms()
Returns the set of supported bucketing algorithms.
The set must be non-empty. Otherwise, the planner will throw an error during validation.
If specifying an algorithm is optional, this set must include TableDistribution.Kind#UNKNOWN.
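As a rough illustration, a sink that accepts both an explicit HASH clause and an omitted algorithm could return a set like the one below. To keep the sketch self-contained, the enum Kind and the interface BucketingAbility are local stand-ins for TableDistribution.Kind and SupportsBucketing; the sink class itself is hypothetical, and a real sink would also implement DynamicTableSink.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch using local stand-in types instead of the Flink classes.
final class ListAlgorithmsSketch {

    // Stand-in for TableDistribution.Kind.
    enum Kind { UNKNOWN, HASH, RANGE }

    // Stand-in mirroring the two methods of SupportsBucketing.
    interface BucketingAbility {
        Set<Kind> listAlgorithms();
        boolean requiresBucketCount();
    }

    // Hypothetical sink: HASH is supported, and UNKNOWN is included so that
    // DISTRIBUTED BY (uid) without an algorithm is accepted as well.
    static final class HashOrDefaultSink implements BucketingAbility {
        @Override
        public Set<Kind> listAlgorithms() {
            return EnumSet.of(Kind.HASH, Kind.UNKNOWN);
        }

        @Override
        public boolean requiresBucketCount() {
            return false;
        }
    }
}
```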
boolean requiresBucketCount()
Returns whether the DynamicTableSink requires a bucket count.
If this method returns true, the DynamicTableSink will require a bucket count.
If this method returns false, the DynamicTableSink may or may not consider the provided bucket count.
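The rules stated for both methods can be combined into a small check. This is only a sketch of the kind of validation described here, not the planner's actual implementation; the enum Kind is a local stand-in for TableDistribution.Kind, and the method name and error messages are hypothetical.

```java
import java.util.Optional;
import java.util.OptionalInt;
import java.util.Set;

// Hypothetical sketch, not the planner's real validation code.
final class BucketingValidationSketch {

    // Stand-in for TableDistribution.Kind.
    enum Kind { UNKNOWN, HASH, RANGE }

    // Returns an error message if the declared distribution violates
    // what the sink reports, or an empty Optional if it is acceptable.
    static Optional<String> check(
            Set<Kind> supported,
            boolean requiresBucketCount,
            Kind declared,
            OptionalInt bucketCount) {
        if (supported.isEmpty()) {
            return Optional.of("sink lists no bucketing algorithms");
        }
        if (!supported.contains(declared)) {
            return Optional.of("algorithm " + declared + " is not supported");
        }
        if (requiresBucketCount && !bucketCount.isPresent()) {
            return Optional.of("a bucket count is required");
        }
        return Optional.empty();
    }
}
```

Under these assumptions, Example 1 (HASH with 4 buckets) passes for a sink that lists HASH, while Example 3 (no algorithm, no bucket count) is rejected by a sink whose requiresBucketCount() returns true.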