FileSource (Flink : 1.19-SNAPSHOT API)

java.lang.Object
- org.apache.flink.connector.file.src.AbstractFileSource<T,FileSourceSplit>
- - org.apache.flink.connector.file.src.FileSource<T>

Type Parameters:

T - The type of the events/records produced by this source.

All Implemented Interfaces:

Serializable, DynamicParallelismInference, Source<T,FileSourceSplit,PendingSplitsCheckpoint<FileSourceSplit>>, SourceReaderFactory<T,FileSourceSplit>, ResultTypeQueryable<T>
```
@PublicEvolving
public final class FileSource<T>
extends AbstractFileSource<T,FileSourceSplit>
implements DynamicParallelismInference
```
A unified data source that reads files - both in batch and in streaming mode.
This source supports all (distributed) file systems and object stores that can be accessed via Flink's FileSystem class.
Start building a file source via one of the following calls:
- forRecordStreamFormat(StreamFormat, Path...)
- forBulkFileFormat(BulkFormat, Path...)
This creates a FileSource.FileSourceBuilder on which you can configure all the properties of the file source.
Batch and Streaming

This source supports both bounded/batch and continuous/streaming data inputs. In the bounded/batch case, the file source processes all files under the given path(s). In the continuous/streaming case, the source periodically checks the paths for new files and starts reading them.
When you start creating a file source (via the FileSource.FileSourceBuilder created through one of the above-mentioned methods) the source is by default in bounded/batch mode. Call AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration) to put the source into continuous streaming mode.
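The difference between the two modes can be pictured as a single listing of the paths versus repeated listings that only hand out paths not seen before. The following is a minimal plain-Java sketch of that idea, not Flink's implementation; the class `NewFileTracker` and its method `discover` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Flink): remembers which paths were already handed out. */
class NewFileTracker {
    private final Set<String> alreadySeen = new HashSet<>();

    /** Returns the paths from this listing that no earlier call has returned. */
    List<String> discover(List<String> currentListing) {
        List<String> fresh = new ArrayList<>();
        for (String path : currentListing) {
            // Set.add returns false when the path was recorded by a previous scan.
            if (alreadySeen.add(path)) {
                fresh.add(path);
            }
        }
        return fresh;
    }
}
```

In this sketch, a bounded run would correspond to calling `discover` once, while a continuous run would call it again after every check interval.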
Format Types

Each file is read by a file reader that is defined by a file format. The format defines the parsing logic for the contents of the file. The source supports multiple format interfaces, which trade off simplicity of implementation against flexibility/efficiency.
- A StreamFormat reads the contents of a file from a file stream. It is the simplest format to implement, and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply (such as object reuse, batching, etc.).
- A BulkFormat reads batches of records from a file at a time. It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
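To make the trade-off between the two styles concrete, here is a small plain-Java sketch that reads from an in-memory iterator rather than a file. It does not use the StreamFormat or BulkFormat interfaces; the class `ReadingStyles` and both method names are hypothetical, and returning `null` at the end of input is a convention chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration of record-at-a-time versus batched reading. */
class ReadingStyles {

    /** Stream style: one record per call, null once the input is exhausted. */
    static String readRecord(Iterator<String> records) {
        return records.hasNext() ? records.next() : null;
    }

    /** Bulk style: up to batchSize records per call, null once the input is exhausted. */
    static List<String> readBatch(Iterator<String> records, int batchSize) {
        List<String> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && records.hasNext()) {
            batch.add(records.next());
        }
        return batch.isEmpty() ? null : batch;
    }
}
```

The record-at-a-time method is simpler to write, whereas the batched method leaves room for optimizations such as amortizing per-call overhead across many records.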
Discovering / Enumerating Files

The way the source lists the files to be processed is defined by the FileEnumerator. The FileEnumerator is responsible for selecting the relevant files (for example, filtering out hidden files) and for optionally splitting files into multiple regions (= file source splits) that can be read in parallel.
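Splitting a file into regions amounts to covering its byte range with (offset, length) pairs. The sketch below shows one simple way to do that with a fixed maximum split size; it is an assumption-laden illustration, not the FileEnumerator's actual logic, and the class `SplitSketch`, the method `splitFile`, and the parameter `maxSplitSize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of cutting a file into contiguous regions. */
class SplitSketch {

    /** Returns {offset, length} pairs that together cover a file of the given length. */
    static List<long[]> splitFile(long fileLength, long maxSplitSize) {
        if (maxSplitSize <= 0) {
            throw new IllegalArgumentException("maxSplitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += maxSplitSize) {
            // The last region may be shorter than maxSplitSize.
            long length = Math.min(maxSplitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }
}
```

Each region produced this way could be read by a different parallel reader. Note that this sketch yields no regions for an empty file.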
See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`FileSource.FileSourceBuilder<T>` The builder for the `FileSource`, to configure the various behaviors.
- Nested classes/interfaces inherited from class org.apache.flink.connector.file.src.AbstractFileSource
  AbstractFileSource.AbstractFileSourceBuilder<T,SplitT extends FileSourceSplit,SELF extends AbstractFileSource.AbstractFileSourceBuilder<T,SplitT,SELF>>
- Nested classes/interfaces inherited from interface org.apache.flink.api.connector.source.DynamicParallelismInference
  DynamicParallelismInference.Context

Field Summary

Fields
Modifier and Type	Field and Description
`static FileEnumerator.Provider`	`DEFAULT_NON_SPLITTABLE_FILE_ENUMERATOR` The default file enumerator used for non-splittable formats.
`static FileSplitAssigner.Provider`	`DEFAULT_SPLIT_ASSIGNER` The default split assigner, a lazy locality-aware assigner.
`static FileEnumerator.Provider`	`DEFAULT_SPLITTABLE_FILE_ENUMERATOR` The default file enumerator used for splittable formats.

Method Summary

Methods
Modifier and Type	Method and Description
`static <T> FileSource.FileSourceBuilder<T>`	`forBulkFileFormat(BulkFormat<T,FileSourceSplit> bulkFormat, Path... paths)` Builds a new `FileSource` using a `BulkFormat` to read batches of records from files.
`static <T> FileSource.FileSourceBuilder<T>`	`forRecordFileFormat(FileRecordFormat<T> recordFormat, Path... paths)` Deprecated. Please use `forRecordStreamFormat(StreamFormat, Path...)` instead.
`static <T> FileSource.FileSourceBuilder<T>`	`forRecordStreamFormat(StreamFormat<T> streamFormat, Path... paths)` Builds a new `FileSource` using a `StreamFormat` to read record-by-record from a file stream.
`SimpleVersionedSerializer<FileSourceSplit>`	`getSplitSerializer()` Creates a serializer for the source splits.
`int`	`inferParallelism(DynamicParallelismInference.Context dynamicParallelismContext)` The method is invoked on the master (JobManager) before the initialization of the source vertex.

Methods inherited from class org.apache.flink.connector.file.src.AbstractFileSource
createEnumerator, createReader, getAssignerFactory, getBoundedness, getContinuousEnumerationSettings, getEnumeratorCheckpointSerializer, getProducedType, restoreEnumerator

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - DEFAULT_SPLIT_ASSIGNER
```
public static final FileSplitAssigner.Provider DEFAULT_SPLIT_ASSIGNER
```
    The default split assigner, a lazy locality-aware assigner.
  - DEFAULT_SPLITTABLE_FILE_ENUMERATOR
```
public static final FileEnumerator.Provider DEFAULT_SPLITTABLE_FILE_ENUMERATOR
```
    The default file enumerator used for splittable formats. The enumerator recursively enumerates files, splits files that consist of multiple distributed storage blocks into multiple splits, and filters hidden files (files starting with '.' or '_'). Files with suffixes of common compression formats (for example '.gzip', '.bz2', '.xz', '.zip', ...) will not be split.
  - DEFAULT_NON_SPLITTABLE_FILE_ENUMERATOR
```
public static final FileEnumerator.Provider DEFAULT_NON_SPLITTABLE_FILE_ENUMERATOR
```
    The default file enumerator used for non-splittable formats. The enumerator recursively enumerates files, creates one split for the file, and filters hidden files (files starting with '.' or '_').
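    The two filtering rules described for the default enumerators (hidden files are skipped, compressed files are not split) can be expressed as simple predicates on the file name. The following plain-Java sketch is illustrative only and is not the enumerators' code; the class `EnumeratorRules` and its methods are hypothetical, and the suffix list is deliberately limited to a few of the suffixes named above.

```java
import java.util.List;

/** Hypothetical illustration of the file-name rules described for the default enumerators. */
class EnumeratorRules {

    // Assumption: only a few of the suffixes named in the text; the real list is longer.
    private static final List<String> COMPRESSED_SUFFIXES = List.of(".gzip", ".bz2", ".zip");

    /** Hidden files start with '.' or '_' and are filtered out. */
    static boolean isHidden(String fileName) {
        return fileName.startsWith(".") || fileName.startsWith("_");
    }

    /** Files with a known compression suffix are read as a single split. */
    static boolean isSplittable(String fileName) {
        for (String suffix : COMPRESSED_SUFFIXES) {
            if (fileName.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }
}
```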
- Method Detail
  - getSplitSerializer
```
public SimpleVersionedSerializer<FileSourceSplit> getSplitSerializer()
```
    Description copied from interface: Source
    
    Creates a serializer for the source splits. Splits are serialized when sending them from enumerator to reader, and when checkpointing the reader's current state.
    
    Specified by:
    
    getSplitSerializer in interface Source<T,FileSourceSplit,PendingSplitsCheckpoint<FileSourceSplit>>
    
    Specified by:
    
    getSplitSerializer in class AbstractFileSource<T,FileSourceSplit>
    
    Returns:
    
    The serializer for the split type.
  - inferParallelism
```
public int inferParallelism(DynamicParallelismInference.Context dynamicParallelismContext)
```
    Description copied from interface: DynamicParallelismInference
    
    The method is invoked on the master (JobManager) before the initialization of the source vertex.
    
    Specified by:
    
    inferParallelism in interface DynamicParallelismInference
    
    Parameters:
    
    dynamicParallelismContext - The context that provides the information needed for the dynamic parallelism decision.
  - forRecordStreamFormat
```
public static <T> FileSource.FileSourceBuilder<T> forRecordStreamFormat(StreamFormat<T> streamFormat,
                                                                        Path... paths)
```
    Builds a new FileSource using a StreamFormat to read record-by-record from a file stream.
    When possible, stream-based formats are generally easier to use than file-based formats and are therefore preferable, because they support better default behavior around I/O batching and progress tracking (checkpoints).
    Stream formats also automatically de-compress files based on the file extension. This supports files ending in ".deflate" (Deflate), ".xz" (XZ), ".bz2" (BZip2), ".gz", ".gzip" (GZip).
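    Choosing a decompressor from the file extension is essentially a suffix lookup. The sketch below encodes the extensions listed above as a plain-Java map; it only returns the name of the compression format and performs no decompression. The class `CodecLookup` and the method `codecFor` are hypothetical and do not correspond to a Flink API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of picking a compression format by file extension. */
class CodecLookup {

    private static final Map<String, String> CODEC_BY_EXTENSION = new LinkedHashMap<>();

    static {
        // Extensions and format names as listed in the description above.
        CODEC_BY_EXTENSION.put(".deflate", "Deflate");
        CODEC_BY_EXTENSION.put(".xz", "XZ");
        CODEC_BY_EXTENSION.put(".bz2", "BZip2");
        CODEC_BY_EXTENSION.put(".gz", "GZip");
        CODEC_BY_EXTENSION.put(".gzip", "GZip");
    }

    /** Returns the format name for a compressed file, or empty for any other file. */
    static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODEC_BY_EXTENSION.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }
}
```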
  - forBulkFileFormat
```
public static <T> FileSource.FileSourceBuilder<T> forBulkFileFormat(BulkFormat<T,FileSourceSplit> bulkFormat,
                                                                    Path... paths)
```
    Builds a new FileSource using a BulkFormat to read batches of records from files.
    Examples for bulk readers are compressed and vectorized formats such as ORC or Parquet.
  - forRecordFileFormat
```
@Deprecated
public static <T> FileSource.FileSourceBuilder<T> forRecordFileFormat(FileRecordFormat<T> recordFormat,
                                                                                  Path... paths)
```
    Deprecated. Please use forRecordStreamFormat(StreamFormat, Path...) instead.
    
    Builds a new FileSource using a FileRecordFormat to read record-by-record from a file path.
    A FileRecordFormat is more general than the StreamFormat, but also often requires more careful parametrization.
