A TableSource provides access to data that is stored in external systems (database, key-value store, message queue) or files. After a TableSource is registered in a TableEnvironment, it can be accessed by Table API or SQL queries.
A TableSink emits a Table to an external storage system, such as a database, key-value store, message queue, or file system (in different encodings, e.g., CSV, Parquet, or ORC).
A TableFactory allows for separating the declaration of a connection to an external system from the actual implementation. A table factory creates configured instances of table sources and sinks from normalized, string-based properties. The properties can be generated programmatically using a Descriptor or via YAML configuration files for the SQL Client.
Have a look at the common concepts and API page for details on how to register a TableSource and how to emit a Table through a TableSink. See the built-in sources, sinks, and formats page for examples of how to use factories.
A TableSource is a generic interface that gives Table API and SQL queries access to data stored in an external system. It provides the schema of the table and the records that are mapped to rows with the table's schema. Depending on whether the TableSource is used in a streaming or batch query, the records are produced as a DataStream or a DataSet.
If a TableSource is used in a streaming query, it must implement the StreamTableSource interface; if it is used in a batch query, it must implement the BatchTableSource interface. A TableSource can also implement both interfaces and be used in both streaming and batch queries.
StreamTableSource and BatchTableSource extend the base interface TableSource, which defines the following methods:
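A simplified sketch of the base interface is shown below. The helper types TableSchema and TypeInformation are stubbed locally so the snippet compiles on its own; the real classes live in the flink-table module and are richer than shown here.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-ins for Flink's TableSchema and TypeInformation,
// stubbed so this sketch is self-contained.
class TableSchema {
    private final List<String> fieldNames;
    TableSchema(List<String> fieldNames) { this.fieldNames = fieldNames; }
    List<String> getFieldNames() { return fieldNames; }
}

interface TypeInformation<T> {}

// Simplified shape of the base TableSource interface.
interface TableSource<T> {
    // Logical schema of the table: field names and types.
    TableSchema getTableSchema();
    // Physical type of the records produced as a DataStream or DataSet.
    TypeInformation<T> getReturnType();
    // Optional description, used for display purposes only.
    String explainSource();
}
```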
getTableSchema(): Returns the schema of the table, i.e., the names and types of the fields of the table. The field types are defined using Flink's TypeInformation (see Table API types and SQL types).

getReturnType(): Returns the physical type of the DataStream (StreamTableSource) or DataSet (BatchTableSource) and of the records that are produced by the TableSource.

explainSource(): Returns a String that describes the TableSource. This method is optional and used for display purposes only.
The TableSource interface separates the logical table schema from the physical type of the returned DataStream or DataSet. As a consequence, all fields of the table schema (getTableSchema()) must be mapped to a field with a corresponding type in the physical return type (getReturnType()). By default, this mapping is done based on field names. For example, a TableSource that defines a table schema with two fields [name: String, size: Integer] requires a TypeInformation with at least two fields called name and size of type String and Integer, respectively. This could be a PojoTypeInfo or a RowTypeInfo that has two fields named name and size with matching types.
However, some types, such as Tuple or CaseClass types, do not support custom field names. If a TableSource returns a DataStream or DataSet of a type with fixed field names, it can implement the DefinedFieldMapping interface to map field names from the table schema to field names of the physical return type.
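As a sketch, the mapping boils down to a map from logical schema field names to physical field names. The interface shape below is simplified, and the Tuple-style example with fields f0/f1 is illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified shape of the DefinedFieldMapping interface: a mapping from
// logical schema field names to physical field names of the return type.
interface DefinedFieldMapping {
    Map<String, String> getFieldMapping();
}

// Illustrative example: the physical type is a Tuple2 with the fixed
// field names f0/f1, mapped to the schema fields "name" and "size".
class TupleBackedMapping implements DefinedFieldMapping {
    @Override
    public Map<String, String> getFieldMapping() {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("name", "f0"); // schema field -> physical field
        mapping.put("size", "f1");
        return mapping;
    }
}
```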
The BatchTableSource interface extends the TableSource interface and defines one additional method:
getDataSet(execEnv): Returns a DataSet with the data of the table. The type of the DataSet must be identical to the return type defined by the TableSource.getReturnType() method. The DataSet can be created using a regular data source of the DataSet API. Commonly, a BatchTableSource is implemented by wrapping an InputFormat or a batch connector.

The StreamTableSource interface extends the TableSource interface and defines one additional method:
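A sketch of both subinterfaces is shown below. The environment, collection, and base types are local stand-ins (the real ones come from Flink's DataSet/DataStream APIs), so the snippet compiles on its own:

```java
// Local stand-ins for the environment, collection, and base types.
class ExecutionEnvironment {}
class StreamExecutionEnvironment {}
class DataSet<T> {}
class DataStream<T> {}
interface TableSource<T> {} // simplified base interface

// Simplified shapes of the two subinterfaces: each adds a single
// method that produces the table's data.
interface BatchTableSource<T> extends TableSource<T> {
    DataSet<T> getDataSet(ExecutionEnvironment execEnv);
}

interface StreamTableSource<T> extends TableSource<T> {
    DataStream<T> getDataStream(StreamExecutionEnvironment execEnv);
}
```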
getDataStream(execEnv): Returns a DataStream with the data of the table. The type of the DataStream must be identical to the return type defined by the TableSource.getReturnType() method. The DataStream can be created using a regular data source of the DataStream API. Commonly, a StreamTableSource is implemented by wrapping a SourceFunction or a stream connector.

Time-based operations of streaming Table API and SQL queries, such as windowed aggregations or joins, require explicitly specified time attributes.
A TableSource defines a time attribute as a field of type Types.SQL_TIMESTAMP in its table schema. In contrast to all regular fields in the schema, a time attribute must not be matched to a physical field in the return type of the table source. Instead, a TableSource defines a time attribute by implementing a certain interface.
Processing time attributes are commonly used in streaming queries. A processing time attribute returns the current wall-clock time of the operator that accesses it. A TableSource defines a processing time attribute by implementing the DefinedProctimeAttribute interface. The interface looks as follows:
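A simplified sketch of the interface (the real one lives in the flink-table module):

```java
// Simplified shape of the DefinedProctimeAttribute interface.
interface DefinedProctimeAttribute {
    // Name of the processing time attribute in the table schema,
    // or null if no processing time attribute is defined.
    String getProctimeAttribute();
}
```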
getProctimeAttribute(): Returns the name of the processing time attribute. The specified attribute must be defined with type Types.SQL_TIMESTAMP in the table schema and can be used in time-based operations. A DefinedProctimeAttribute table source can define no processing time attribute by returning null.

Attention: Both StreamTableSource and BatchTableSource can implement DefinedProctimeAttribute and define a processing time attribute. In the case of a BatchTableSource, the processing time field is initialized with the current timestamp during the table scan.
Rowtime attributes are attributes of type TIMESTAMP and are handled in a unified way in stream and batch queries. A table schema field of type SQL_TIMESTAMP can be declared as a rowtime attribute by specifying a TimestampExtractor that computes the actual value for the attribute (usually from one or more other fields) and a WatermarkStrategy that specifies how watermarks are generated for the rowtime attribute.

A TableSource defines a rowtime attribute by implementing the DefinedRowtimeAttributes interface. The interface looks as follows:
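A simplified sketch of the interface and its descriptor class is shown below. The extractor and strategy types are local stand-ins, and the descriptor is reduced to its three properties:

```java
import java.util.Collections;
import java.util.List;

// Local stand-ins for the extractor and strategy types.
interface TimestampExtractor {}
interface WatermarkStrategy {}

// Simplified shape of RowtimeAttributeDescriptor: the attribute name
// plus how its value and its watermarks are derived.
class RowtimeAttributeDescriptor {
    final String attributeName;
    final TimestampExtractor timestampExtractor;
    final WatermarkStrategy watermarkStrategy;

    RowtimeAttributeDescriptor(String name, TimestampExtractor extractor, WatermarkStrategy strategy) {
        this.attributeName = name;
        this.timestampExtractor = extractor;
        this.watermarkStrategy = strategy;
    }
}

// Simplified shape of the DefinedRowtimeAttributes interface.
interface DefinedRowtimeAttributes {
    List<RowtimeAttributeDescriptor> getRowtimeAttributeDescriptors();
}
```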
getRowtimeAttributeDescriptors(): Returns a list of RowtimeAttributeDescriptor. A RowtimeAttributeDescriptor describes a rowtime attribute with the following properties:

attributeName: The name of the rowtime attribute in the table schema. The field must be defined with type Types.SQL_TIMESTAMP.

timestampExtractor: The timestamp extractor extracts the timestamp from a record with the return type. For example, it can convert a Long field into a timestamp or parse a String-encoded timestamp. Flink comes with a set of built-in TimestampExtractor implementations for common use cases. It is also possible to provide a custom implementation.

watermarkStrategy: The watermark strategy defines how watermarks are generated for the rowtime attribute. Flink comes with a set of built-in WatermarkStrategy implementations for common use cases. It is also possible to provide a custom implementation.

Attention: Although the getRowtimeAttributeDescriptors() method returns a list of descriptors, only a single rowtime attribute is supported at the moment. We plan to remove this restriction in the future and support tables with more than one rowtime attribute.
Attention: Both StreamTableSource and BatchTableSource can implement DefinedRowtimeAttributes and define a rowtime attribute. In either case, the rowtime field is extracted using the TimestampExtractor. Hence, a TableSource that implements both StreamTableSource and BatchTableSource and defines a rowtime attribute provides exactly the same data to streaming and batch queries.
Flink provides TimestampExtractor implementations for common use cases. The following TimestampExtractor implementations are currently available:

ExistingField(fieldName): Extracts the value of a rowtime attribute from an existing LONG, SQL_TIMESTAMP, or timestamp-formatted STRING field. One example of such a string would be '2018-05-28 12:34:56.000'.

StreamRecordTimestamp(): Extracts the value of a rowtime attribute from the timestamp of the DataStream StreamRecord. Note that this TimestampExtractor is not available for batch table sources.

A custom TimestampExtractor can be defined by implementing the corresponding interface.
Flink provides WatermarkStrategy implementations for common use cases. The following WatermarkStrategy implementations are currently available:

AscendingTimestamps: A watermark strategy for ascending timestamps. Records with timestamps that are out-of-order will be considered late.

BoundedOutOfOrderTimestamps(delay): A watermark strategy for timestamps that are at most out-of-order by the specified delay.

PreserveWatermarks(): A strategy which indicates that watermarks should be preserved from the underlying DataStream.

A custom WatermarkStrategy can be defined by implementing the corresponding interface.
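To make the ascending-timestamps behavior concrete, the following self-contained sketch mimics the logic described above: the watermark tracks the maximum timestamp seen so far, so a record with a lower timestamp than an earlier one is late. The class and method names are illustrative, not Flink's internals:

```java
// Illustrative model of an ascending-timestamps watermark strategy:
// the watermark follows the highest timestamp seen so far, and records
// arriving with a lower timestamp are considered late.
class AscendingWatermarks {
    private long maxTimestamp = Long.MIN_VALUE;

    // Returns true if the record is on time, false if it is late.
    boolean onRecord(long timestamp) {
        boolean late = timestamp < maxTimestamp;
        maxTimestamp = Math.max(maxTimestamp, timestamp);
        return !late;
    }

    long currentWatermark() { return maxTimestamp; }
}
```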
A TableSource supports projection push-down by implementing the ProjectableTableSource interface. The interface defines a single method:
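A simplified sketch of the interface, together with an illustrative implementation in which the physical fields are modeled as a plain list of names (the real method works on the source's return type):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified shape of the ProjectableTableSource interface: the planner
// passes the indexes of the physical fields the query actually needs.
interface ProjectableTableSource<T> {
    // Returns a copy of this source that produces only the given fields.
    ProjectableTableSource<T> projectFields(int[] fields);
}

// Illustrative implementation: projection returns a copy with only the
// requested field indexes; the original source is left unchanged.
class NameListSource implements ProjectableTableSource<Object> {
    final List<String> physicalFields;

    NameListSource(List<String> physicalFields) { this.physicalFields = physicalFields; }

    @Override
    public NameListSource projectFields(int[] fields) {
        List<String> projected = new ArrayList<>();
        for (int i : fields) projected.add(physicalFields.get(i));
        return new NameListSource(projected); // copy, not mutation
    }
}
```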
projectFields(fields): Returns a copy of the TableSource with an adjusted physical return type. The fields parameter provides the indexes of the fields that must be provided by the TableSource. The indexes relate to the TypeInformation of the physical return type, not to the logical table schema. The copied TableSource must adjust its return type and the returned DataStream or DataSet. The TableSchema of the copied TableSource must not be changed, i.e., it must be the same as that of the original TableSource. If the TableSource implements the DefinedFieldMapping interface, the field mapping must be adjusted to the new return type.

The ProjectableTableSource adds support for projecting flat fields. If the TableSource defines a table with a nested schema, it can implement the NestedFieldsProjectableTableSource interface to extend the projection to nested fields. The NestedFieldsProjectableTableSource is defined as follows:
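A simplified sketch of the interface (the nested-field paths are modeled here as string arrays; the exact parameter types in Flink may differ):

```java
// Simplified shape of the NestedFieldsProjectableTableSource interface:
// in addition to top-level field indexes, the planner supplies, per
// index, the paths of the nested fields the query actually accesses.
interface NestedFieldsProjectableTableSource<T> {
    NestedFieldsProjectableTableSource<T> projectNestedFields(
            int[] fields, String[][] nestedFields);
}
```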
projectNestedFields(fields, nestedFields): Returns a copy of the TableSource with an adjusted physical return type. Fields of the physical return type may be removed or reordered, but their type must not be changed. The contract of this method is essentially the same as that of the ProjectableTableSource.projectFields() method. In addition, the nestedFields parameter contains, for each field index in the fields list, a list of paths to all nested fields that are accessed by the query. All other nested fields do not need to be read, parsed, and set in the records that are produced by the TableSource. IMPORTANT: The types of the projected fields must not be changed, but unused fields may be set to null or to a default value.

The FilterableTableSource interface adds support for filter push-down to a TableSource. A TableSource extending this interface is able to filter records such that the returned DataStream or DataSet returns fewer records.

The interface looks as follows:
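A simplified sketch of the interface, with Flink's Expression type stubbed locally, plus an illustrative source that accepts every offered predicate:

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for Flink's Expression type.
class Expression {
    final String description;
    Expression(String description) { this.description = description; }
}

// Simplified shape of the FilterableTableSource interface. The source
// "accepts" a predicate by removing it from the mutable list; whatever
// remains is evaluated by a downstream filter operator.
interface FilterableTableSource<T> {
    FilterableTableSource<T> applyPredicate(List<Expression> predicates);
    boolean isFilterPushedDown();
}

// Illustrative implementation that accepts all offered predicates.
class AcceptAllSource implements FilterableTableSource<Object> {
    private boolean pushedDown = false;

    @Override
    public FilterableTableSource<Object> applyPredicate(List<Expression> predicates) {
        AcceptAllSource copy = new AcceptAllSource();
        copy.pushedDown = true; // the copy must report push-down from now on
        predicates.clear();     // accept (consume) all offered predicates
        return copy;
    }

    @Override
    public boolean isFilterPushedDown() { return pushedDown; }
}
```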
applyPredicate(predicates): Returns a copy of the TableSource with added predicates. The predicates parameter is a mutable list of conjunctive predicates that are "offered" to the TableSource. The TableSource accepts a predicate for evaluation by removing it from the list. Predicates that are left in the list will be evaluated by a subsequent filter operator.

isFilterPushedDown(): Returns true if the applyPredicate() method was called before. Hence, isFilterPushedDown() must return true for all TableSource instances returned from an applyPredicate() call.

A TableSink specifies how to emit a Table to an external system or location. The interface is generic such that it can support different storage locations and formats. There are different table sinks for batch tables and streaming tables.
The general interface looks as follows:
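A simplified sketch of the general interface, with TypeInformation stubbed locally, plus an illustrative implementation whose configure() returns a new configured instance:

```java
import java.util.Arrays;

// Local stand-in for Flink's TypeInformation.
interface TypeInformation<T> {}

// Simplified shape of the base TableSink interface.
interface TableSink<T> {
    TypeInformation<T> getOutputType();
    String[] getFieldNames();
    TypeInformation<?>[] getFieldTypes();
    // Receives the schema of the Table to emit and returns a configured copy.
    TableSink<T> configure(String[] fieldNames, TypeInformation<?>[] fieldTypes);
}

// Illustrative implementation: configure() leaves the original untouched.
class ExampleSink implements TableSink<String> {
    private String[] fieldNames;
    private TypeInformation<?>[] fieldTypes;

    @Override
    public TypeInformation<String> getOutputType() { return new TypeInformation<String>() {}; }
    @Override
    public String[] getFieldNames() { return fieldNames; }
    @Override
    public TypeInformation<?>[] getFieldTypes() { return fieldTypes; }

    @Override
    public TableSink<String> configure(String[] fieldNames, TypeInformation<?>[] fieldTypes) {
        ExampleSink configured = new ExampleSink(); // new, configured instance
        configured.fieldNames = fieldNames;
        configured.fieldTypes = fieldTypes;
        return configured;
    }
}
```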
The TableSink#configure method is called to pass the schema of the Table to emit (field names and types) to the TableSink. The method must return a new instance of the TableSink that is configured to emit the provided Table schema.
A BatchTableSink defines an external TableSink to emit a batch table.

The interface looks as follows:
An AppendStreamTableSink defines an external TableSink to emit a streaming table with only insert changes.

The interface looks as follows:
If the table is also modified by update or delete changes, a TableException will be thrown.
A RetractStreamTableSink defines an external TableSink to emit a streaming table with insert, update, and delete changes.

The interface looks as follows:
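A simplified sketch of the emitting method, with the stream and tuple types stubbed locally:

```java
// Local stand-ins for the stream and tuple types.
class DataStream<T> {}
class Tuple2<A, B> {
    final A f0; final B f1;
    Tuple2(A f0, B f1) { this.f0 = f0; this.f1 = f1; }
}

// Simplified shape of RetractStreamTableSink: the table is emitted as a
// stream of (flag, record) pairs, where true means insert/accumulate
// and false means delete/retract.
interface RetractStreamTableSink<T> {
    void emitDataStream(DataStream<Tuple2<Boolean, T>> dataStream);
}
```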
The table will be converted into a stream of accumulate and retraction messages which are encoded as a Java Tuple2. The first field is a boolean flag to indicate the message type (true indicates insert, false indicates delete). The second field holds the record of the requested type T.
An UpsertStreamTableSink defines an external TableSink to emit a streaming table with insert, update, and delete changes.

The interface looks as follows:
The table must have unique key fields (atomic or composite) or be append-only. If the table does not have a unique key and is not append-only, a TableException will be thrown. The unique key of the table is configured by the UpsertStreamTableSink#setKeyFields() method.
The table will be converted into a stream of upsert and delete messages which are encoded as a Java Tuple2. The first field is a boolean flag to indicate the message type. The second field holds the record of the requested type T.
A message with a true flag is an upsert message for the configured key. A message with a false flag is a delete message for the configured key. If the table is append-only, all messages will have a true flag and must be interpreted as insertions.
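These message semantics can be illustrated with a self-contained sketch that applies a stream of (flag, record) pairs to a keyed table. The types are simplified stand-ins, not Flink's runtime classes:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Applies a stream of upsert/delete messages to a table keyed by the
// first record field. Each message mirrors the Tuple2 encoding: a
// boolean flag (true = upsert, false = delete for the key) plus the record.
class UpsertApplier {
    static Map<String, Integer> apply(List<SimpleEntry<Boolean, String[]>> messages) {
        Map<String, Integer> table = new HashMap<>();
        for (SimpleEntry<Boolean, String[]> msg : messages) {
            String key = msg.getValue()[0];
            if (msg.getKey()) {
                table.put(key, Integer.parseInt(msg.getValue()[1])); // upsert
            } else {
                table.remove(key); // delete for the configured key
            }
        }
        return table;
    }
}
```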
A TableFactory allows for creating different table-related instances from string-based properties. All available factories are matched against the given set of properties and the requested factory class.
Factories leverage Java's Service Provider Interface (SPI) for discovery. This means that every dependency and JAR file should contain a file org.apache.flink.table.factories.TableFactory in the META-INF/services resource directory that lists all available table factories that it provides.
Every table factory needs to implement the following interface:
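A simplified sketch of the base factory interface (the real one lives in org.apache.flink.table.factories):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Simplified shape of the TableFactory interface: the framework matches
// factories by their required context and validates the supported keys.
interface TableFactory {
    // Properties and values that must be present for this factory to match.
    Map<String, String> requiredContext();
    // Property keys (beyond the context) that this factory can handle.
    List<String> supportedProperties();
}
```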
requiredContext(): Specifies the context that this factory has been implemented for. The framework guarantees to match this factory only if the specified set of properties and values are met. Typical properties might be connector.type, format.type, or update-mode. Property keys such as connector.property-version and format.property-version are reserved for future backwards compatibility cases.

supportedProperties(): Lists the property keys that this factory can handle. This method is used for validation. If a property is passed that this factory cannot handle, an exception will be thrown. The list must not contain the keys that are specified by the context.

In order to create a specific instance, a factory class can implement one or more interfaces provided in org.apache.flink.table.factories:
BatchTableSourceFactory: Creates a batch table source.

BatchTableSinkFactory: Creates a batch table sink.

StreamTableSourceFactory: Creates a stream table source.

StreamTableSinkFactory: Creates a stream table sink.

DeserializationSchemaFactory: Creates a deserialization schema format.

SerializationSchemaFactory: Creates a serialization schema format.

The discovery of a factory happens in multiple stages: all available factories are discovered, filtered by the requested factory class (e.g., StreamTableSourceFactory), and then filtered by their matching context and supported properties. If no factory matches or more than one factory matches, the discovery fails with a NoMatchingTableFactoryException or an AmbiguousTableFactoryException, respectively.

The following example shows how to provide a custom streaming source with an additional connector.debug property flag for parameterization.
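A sketch of such a factory is shown below. The factory and source interfaces are stubbed locally so the snippet compiles on its own, and the connector type my-system is illustrative:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local stand-ins for the factory and source interfaces; the real ones
// live in org.apache.flink.table.factories.
interface StreamTableSource<T> {}
interface StreamTableSourceFactory<T> {
    Map<String, String> requiredContext();
    List<String> supportedProperties();
    StreamTableSource<T> createStreamTableSource(Map<String, String> properties);
}

// Hypothetical factory for a "my-system" connector with a connector.debug flag.
class MySystemTableSourceFactory implements StreamTableSourceFactory<Object> {
    @Override
    public Map<String, String> requiredContext() {
        Map<String, String> context = new HashMap<>();
        context.put("update-mode", "append");
        context.put("connector.type", "my-system"); // match on connector type
        return context;
    }

    @Override
    public List<String> supportedProperties() {
        return Arrays.asList("connector.debug"); // keys beyond the context
    }

    @Override
    public StreamTableSource<Object> createStreamTableSource(Map<String, String> properties) {
        // A real implementation would read connector.debug from `properties`
        // and construct the configured source; a marker stands in here.
        return new StreamTableSource<Object>() {};
    }
}
```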
In a SQL Client environment file, the previously presented factory could be declared as:
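A hypothetical environment file for such a source might look like this (the table name MySystemTable and the connector type my-system are illustrative):

```yaml
tables:
  - name: MySystemTable
    type: source
    update-mode: append
    connector:
      type: my-system
      debug: true
```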
The YAML file is translated into flattened string properties and a table factory is called with those properties that describe the connection to the external system:
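Continuing the hypothetical example above, the flattened properties handed to the factory would look roughly like:

```
update-mode=append
connector.type=my-system
connector.debug=true
```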
Attention: Properties such as tables.#.name or tables.#.type are SQL Client specifics and are not passed to any factory. The type property decides, depending on the execution environment, whether a BatchTableSourceFactory/StreamTableSourceFactory (for source), a BatchTableSinkFactory/StreamTableSinkFactory (for sink), or both (for both) need to be discovered.
For a type-safe, programmatic approach with explanatory Scaladoc/Javadoc, the Table & SQL API offers descriptors in org.apache.flink.table.descriptors that translate into string-based properties. See the built-in descriptors for sources, sinks, and formats as a reference.
A connector for MySystem in our example can extend ConnectorDescriptor as shown below:
The descriptor can then be used in the API as follows:
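As a self-contained sketch, a descriptor boils down to emitting flat string properties. The base class is stubbed locally, and the MySystemConnector name and connector.debug key mirror the example above rather than a real API:

```java
import java.util.HashMap;
import java.util.Map;

// Local stand-in for the descriptor base class: a descriptor only has
// to translate itself into flat string properties.
abstract class ConnectorDescriptor {
    abstract Map<String, String> toConnectorProperties();
}

// Hypothetical descriptor for the "my-system" connector with its
// connector.debug flag.
class MySystemConnector extends ConnectorDescriptor {
    final boolean isDebug;

    MySystemConnector(boolean isDebug) { this.isDebug = isDebug; }

    @Override
    Map<String, String> toConnectorProperties() {
        Map<String, String> properties = new HashMap<>();
        properties.put("connector.type", "my-system");
        properties.put("connector.debug", Boolean.toString(isDebug));
        return properties;
    }
}
```

In the actual API, the descriptor instance is then handed to the table environment's connect() method together with schema, format, and update-mode information.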