T
- The type of the events/records produced by this source.@PublicEvolving public final class FileSource<T> extends AbstractFileSource<T,FileSourceSplit> implements DynamicParallelismInference
This source supports all (distributed) file systems and object stores that can be accessed via
the Flink's FileSystem
class.
Start building a file source via one of the following calls:
This creates a FileSource.FileSourceBuilder
on which you can configure all the
properties of the file source.
This source supports both bounded/batch and continuous/streaming data inputs. For the bounded/batch case, the file source processes all files under the given path(s). In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
When you start creating a file source (via the FileSource.FileSourceBuilder
created
through one of the above-mentioned methods) the source is by default in bounded/batch mode. Call
AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)
to put the source into
continuous streaming mode.
The reading of each file happens through file readers defined by file formats. These define the parsing logic for the contents of the file. There are multiple classes that the source supports. Their interfaces trade of simplicity of implementation and flexibility/efficiency.
StreamFormat
reads the contents of a file from a file stream. It is the simplest
format to implement, and provides many features out-of-the-box (like checkpointing logic)
but is limited in the optimizations it can apply (such as object reuse, batching, etc.).
BulkFormat
reads batches of records from a file at a time. It is the most "low
level" format to implement, but offers the greatest flexibility to optimize the
implementation.
The way that the source lists the files to be processes is defined by the FileEnumerator
. The FileEnumerator
is responsible to select the relevant files (for
example filter out hidden files) and to optionally splits files into multiple regions (= file
source splits) that can be read in parallel).
Modifier and Type | Class and Description |
---|---|
static class |
FileSource.FileSourceBuilder<T>
The builder for the
FileSource , to configure the various behaviors. |
AbstractFileSource.AbstractFileSourceBuilder<T,SplitT extends FileSourceSplit,SELF extends AbstractFileSource.AbstractFileSourceBuilder<T,SplitT,SELF>>
DynamicParallelismInference.Context
Modifier and Type | Field and Description |
---|---|
static FileEnumerator.Provider |
DEFAULT_NON_SPLITTABLE_FILE_ENUMERATOR
The default file enumerator used for non-splittable formats.
|
static FileSplitAssigner.Provider |
DEFAULT_SPLIT_ASSIGNER
The default split assigner, a lazy locality-aware assigner.
|
static FileEnumerator.Provider |
DEFAULT_SPLITTABLE_FILE_ENUMERATOR
The default file enumerator used for splittable formats.
|
Modifier and Type | Method and Description |
---|---|
static <T> FileSource.FileSourceBuilder<T> |
forBulkFileFormat(BulkFormat<T,FileSourceSplit> bulkFormat,
Path... paths)
Builds a new
FileSource using a BulkFormat to read batches of records from
files. |
static <T> FileSource.FileSourceBuilder<T> |
forRecordFileFormat(FileRecordFormat<T> recordFormat,
Path... paths)
Deprecated.
Please use
forRecordStreamFormat(StreamFormat, Path...) instead. |
static <T> FileSource.FileSourceBuilder<T> |
forRecordStreamFormat(StreamFormat<T> streamFormat,
Path... paths)
Builds a new
FileSource using a StreamFormat to read record-by-record from a
file stream. |
SimpleVersionedSerializer<FileSourceSplit> |
getSplitSerializer()
Creates a serializer for the source splits.
|
int |
inferParallelism(DynamicParallelismInference.Context dynamicParallelismContext)
The method is invoked on the master (JobManager) before the initialization of the source
vertex.
|
createEnumerator, createReader, getAssignerFactory, getBoundedness, getContinuousEnumerationSettings, getEnumeratorCheckpointSerializer, getEnumeratorFactory, getProducedType, restoreEnumerator
public static final FileSplitAssigner.Provider DEFAULT_SPLIT_ASSIGNER
public static final FileEnumerator.Provider DEFAULT_SPLITTABLE_FILE_ENUMERATOR
public static final FileEnumerator.Provider DEFAULT_NON_SPLITTABLE_FILE_ENUMERATOR
public SimpleVersionedSerializer<FileSourceSplit> getSplitSerializer()
Source
getSplitSerializer
in interface Source<T,FileSourceSplit,PendingSplitsCheckpoint<FileSourceSplit>>
getSplitSerializer
in class AbstractFileSource<T,FileSourceSplit>
public int inferParallelism(DynamicParallelismInference.Context dynamicParallelismContext)
DynamicParallelismInference
inferParallelism
in interface DynamicParallelismInference
dynamicParallelismContext
- The context to get dynamic parallelism decision infos.public static <T> FileSource.FileSourceBuilder<T> forRecordStreamFormat(StreamFormat<T> streamFormat, Path... paths)
FileSource
using a StreamFormat
to read record-by-record from a
file stream.
When possible, stream-based formats are generally easier (preferable) to file-based formats, because they support better default behavior around I/O batching or progress tracking (checkpoints).
Stream formats also automatically de-compress files based on the file extension. This supports files ending in ".deflate" (Deflate), ".xz" (XZ), ".bz2" (BZip2), ".gz", ".gzip" (GZip).
public static <T> FileSource.FileSourceBuilder<T> forBulkFileFormat(BulkFormat<T,FileSourceSplit> bulkFormat, Path... paths)
FileSource
using a BulkFormat
to read batches of records from
files.
Examples for bulk readers are compressed and vectorized formats such as ORC or Parquet.
@Deprecated public static <T> FileSource.FileSourceBuilder<T> forRecordFileFormat(FileRecordFormat<T> recordFormat, Path... paths)
forRecordStreamFormat(StreamFormat, Path...)
instead.FileSource
using a FileRecordFormat
to read record-by-record
from a a file path.
A FileRecordFormat
is more general than the StreamFormat
, but also
requires often more careful parametrization.
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.