Interface SupportsPartitioning

  • All Known Implementing Classes:
    FileSystemTableSink

    @PublicEvolving
    public interface SupportsPartitioning
    Enables to write partitioned data in a DynamicTableSink.

    Partitions split the data stored in an external system into smaller portions that are identified by one or more string-based partition keys. A single partition is represented as a Map < String, String > which maps each partition key to a partition value. Partition keys and their order are defined by the catalog table.

    For example, data can be partitioned by region and within a region partitioned by month. The order of the partition keys (in the example: first by region then by month) is defined by the catalog table. A list of partitions could be:

       List(
         ['region'='europe', 'month'='2020-01'],
         ['region'='europe', 'month'='2020-02'],
         ['region'='asia', 'month'='2020-01'],
         ['region'='asia', 'month'='2020-02']
       )
     

    Given the following partitioned table:

    
     CREATE TABLE t (a INT, b STRING, c DOUBLE, region STRING, month STRING) PARTITION BY (region, month);
     

    We can insert data into static table partitions using the INSERT INTO ... PARTITION syntax:

    
     INSERT INTO t PARTITION (region='europe', month='2020-01') SELECT a, b, c FROM my_view;
     

    If all partition keys get a value assigned in the PARTITION clause, the operation is considered an "insertion into a static partition". In the above example, the query result should be written into the static partition region='europe', month='2020-01' which will be passed by the planner into applyStaticPartition(Map). The planner is also able to derived static partitions from literals of a query:

    
     INSERT INTO t SELECT a, b, c, 'asia' AS region, '2020-01' AS month FROM my_view;
     

    Alternatively, we can insert data into dynamic table partitions using the SQL syntax:

    
     INSERT INTO t PARTITION (region='europe') SELECT a, b, c, month FROM another_view;
     

    If only a subset of all partition keys get a static value assigned in the PARTITION clause or with a constant part in a SELECT clause, the operation is considered an "insertion into a dynamic partition". In the above example, the static partition part is region='europe' which will be passed by the planner into applyStaticPartition(Map). The remaining values for partition keys should be obtained from each individual record by the sink during runtime. In the two examples above, the month field is the dynamic partition key.

    If the PARTITION clause contains no static assignments or is omitted entirely, all values for partition keys are either derived from static parts of the query or obtained dynamically.

    A sink can implement both SupportsPartitioning and SupportsBucketing. Conceptually, a partition can be seen as kind of "directory" whereas buckets correspond to "files" per directory. Partitioning splits the data on a small, human-readable keyspace (e.g. by year or by geographical region). This enables efficient selection via equality, inequality, or ranges due to knowledge about existing partitions. Bucketing operates within partitions on a potentially large and infinite keyspace.

    See Also:
    SupportsBucketing
    • Method Detail

      • applyStaticPartition

        void applyStaticPartition​(Map<String,​String> partition)
        Provides the static part of a partition.

        A single partition maps each partition key to a partition value. Depending on the user-defined statement, the partition might not include all partition keys.

        See the documentation of SupportsPartitioning for more information.

        Parameters:
        partition - user-defined (possibly partial) static partition
      • requiresPartitionGrouping

        default boolean requiresPartitionGrouping​(boolean supportsGrouping)
        Returns whether data needs to be grouped by partition before it is consumed by the sink. By default, this is not required from the runtime and records arrive in arbitrary partition order.

        If this method returns true, the sink can expect that all records will be grouped by the partition keys before consumed by the sink. In other words: The sink will receive all elements of one partition and then all elements of another partition. Elements of different partitions will not be mixed. For some sinks, this can be used to reduce the number of partition writers and improve writing performance by writing one partition at a time.

        The given argument indicates whether the current execution mode supports grouping or not. For example, depending on the execution mode a sorting operation might not be available during runtime.

        Parameters:
        supportsGrouping - whether the current execution mode supports grouping
        Returns:
        whether data need to be grouped by partition before consumed by the sink. If supportsGrouping is false, it should never return true, otherwise the planner will fail.