pyflink.datastream.connectors.file_system.FileSource
class FileSource(j_file_source)
A unified data source that reads files - both in batch and in streaming mode.
This source supports all (distributed) file systems and object stores that can be accessed via Flink’s FileSystem class.
Start building a file source via one of the following calls:
for_record_stream_format()
This creates a FileSourceBuilder on which you can configure all the properties of the file source.
<h2>Batch and Streaming</h2>
This source supports both bounded/batch and continuous/streaming data inputs. For the bounded/batch case, the file source processes all files under the given path(s). In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
When you start creating a file source (via the FileSourceBuilder created through one of the above-mentioned methods), the source is in bounded/batch mode by default. Call monitor_continuously() to put the source into continuous streaming mode.
<h2>Format Types</h2>
Each file is read by file readers defined by <i>file formats</i>, which define the parsing logic for the contents of the file. The source supports multiple format classes; their interfaces trade off simplicity of implementation against flexibility/efficiency.
A StreamFormat reads the contents of a file from a file stream. It is the simplest format to implement and provides many features out of the box (such as checkpointing logic), but it is limited in the optimizations it can apply (such as object reuse and batching).
<h2>Discovering / Enumerating Files</h2>
The way the source lists the files to be processed is defined by the FileEnumeratorProvider. The FileEnumeratorProvider is responsible for selecting the relevant files (for example, filtering out hidden files) and for optionally splitting files into multiple regions (= file source splits) that can be read in parallel.
Methods
for_bulk_file_format(bulk_format, *paths)
for_record_stream_format(stream_format, *paths)
Builds a new FileSource using a StreamFormat to read record-by-record from a file stream.
get_java_function()