T - The type of the events/records produced by this source.
@PublicEvolving public final class FileSource<T> extends AbstractFileSource<T,FileSourceSplit>
This source supports all (distributed) file systems and object stores that can be accessed via
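Start building a file source via one of the following calls: forRecordStreamFormat(StreamFormat, Path...), forBulkFileFormat(BulkFormat, Path...), or forRecordFileFormat(FileRecordFormat, Path...).
This creates a FileSource.FileSourceBuilder on which you can configure all the properties of the file source.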
This source supports both bounded/batch and continuous/streaming data inputs. For the bounded/batch case, the file source processes all files under the given path(s). In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
When you start creating a file source (via the FileSource.FileSourceBuilder created through one of the above-mentioned methods), the source is in bounded/batch mode by default. Call AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration) to put the source into continuous streaming mode.
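The difference between the two modes can be pictured with a small, JDK-only sketch. It is not Flink's implementation: `NewFileDiscovery` and `discover` are hypothetical names, and the sketch assumes that "new files" simply means paths not returned by an earlier listing. A single call corresponds to the bounded/batch case; repeated calls on a timer correspond to the continuous/streaming case.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Flink: remembers which files were already discovered. */
final class NewFileDiscovery {
    private final Set<Path> alreadySeen = new HashSet<>();

    /** Lists regular files directly under dir and returns only those not returned before. */
    List<Path> discover(Path dir) throws IOException {
        List<Path> fresh = new ArrayList<>();
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(Files::isRegularFile)
                 .sorted()
                 .forEach(p -> {
                     // Set.add returns true only the first time a path is seen.
                     if (alreadySeen.add(p)) {
                         fresh.add(p);
                     }
                 });
        }
        return fresh;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("file-source-sketch");
        NewFileDiscovery discovery = new NewFileDiscovery();

        Files.createFile(dir.resolve("a.txt"));
        // Bounded/batch: one listing covers everything under the path.
        System.out.println("first check:  " + discovery.discover(dir));

        Files.createFile(dir.resolve("b.txt"));
        // Continuous/streaming: a later check reports only the file added since.
        System.out.println("second check: " + discovery.discover(dir));
    }
}
```

In a real job the interval between checks would be the Duration passed to monitorContinuously(Duration); the sketch leaves scheduling out to stay short.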
The reading of each file happens through file readers defined by file formats. These define the parsing logic for the contents of the file. The source supports multiple format classes. Their interfaces trade off simplicity of implementation against flexibility/efficiency.
A StreamFormat reads the contents of a file from a file stream. It is the simplest format to implement and provides many features out of the box (like checkpointing logic), but it is limited in the optimizations it can apply (such as object reuse, batching, etc.).
A BulkFormat reads batches of records from a file at a time. It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
A FileRecordFormat is in the middle of the trade-off spectrum between the StreamFormat and the BulkFormat.
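To make the trade-off concrete, here is a JDK-only sketch contrasting record-at-a-time reading with batch reading over line-delimited text. The class `ReadingStyles` and its methods are hypothetical and are not Flink's StreamFormat or BulkFormat interfaces; they only illustrate why a batch-oriented reader has more room for optimization (one call returns many records) at the price of a slightly more involved contract.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration, not Flink's format interfaces. */
final class ReadingStyles {

    /** Stream style: one record per call, or null at end of input. */
    static String readOne(BufferedReader in) throws IOException {
        return in.readLine();
    }

    /** Bulk style: up to batchSize records per call; an empty list means end of input. */
    static List<String> readBatch(BufferedReader in, int batchSize) throws IOException {
        List<String> batch = new ArrayList<>(batchSize);
        String line;
        // The size check comes first so no line is read and then dropped once the batch is full.
        while (batch.size() < batchSize && (line = in.readLine()) != null) {
            batch.add(line);
        }
        return batch;
    }

    public static void main(String[] args) throws IOException {
        String content = "r1\nr2\nr3\nr4\nr5\n";

        BufferedReader single = new BufferedReader(new StringReader(content));
        for (String rec = readOne(single); rec != null; rec = readOne(single)) {
            System.out.println("record: " + rec);
        }

        BufferedReader bulk = new BufferedReader(new StringReader(content));
        for (List<String> b = readBatch(bulk, 2); !b.isEmpty(); b = readBatch(bulk, 2)) {
            System.out.println("batch:  " + b);
        }
    }
}
```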
The way that the source lists the files to be processed is defined by the FileEnumerator. The FileEnumerator is responsible for selecting the relevant files (for example, filtering out hidden files) and for optionally splitting files into multiple regions (= file source splits) that can be read in parallel.
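The two responsibilities of an enumerator, filtering and splitting, can be sketched without Flink as follows. Everything here is hypothetical: `EnumeratorSketch`, `Region`, the fixed region size, and the rule that names starting with "." or "_" count as hidden are assumptions made for illustration, not the behavior of Flink's default enumerators.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of file filtering and splitting; not Flink's FileEnumerator. */
final class EnumeratorSketch {

    /** A byte range [offset, offset + length) of one file. */
    static final class Region {
        final String file;
        final long offset;
        final long length;

        Region(String file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + "[" + offset + ", " + (offset + length) + ")";
        }
    }

    /** Assumed convention: names starting with '.' or '_' are treated as hidden. */
    static boolean isHidden(String fileName) {
        return fileName.startsWith(".") || fileName.startsWith("_");
    }

    /** Cuts a file of fileLength bytes into regions of at most maxRegionSize bytes. */
    static List<Region> split(String file, long fileLength, long maxRegionSize) {
        if (maxRegionSize <= 0) {
            throw new IllegalArgumentException("maxRegionSize must be positive");
        }
        List<Region> regions = new ArrayList<>();
        // An empty file produces no regions in this sketch.
        for (long offset = 0; offset < fileLength; offset += maxRegionSize) {
            regions.add(new Region(file, offset, Math.min(maxRegionSize, fileLength - offset)));
        }
        return regions;
    }

    public static void main(String[] args) {
        String[] names = {"data.csv", ".tmp", "_SUCCESS"};
        for (String name : names) {
            if (!isHidden(name)) {
                System.out.println(split(name, 250, 100));
            }
        }
    }
}
```

A non-splittable format would skip the `split` step and treat each file as a single region, which is the distinction the two default enumerators on this page draw.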
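|Modifier and Type|Class and Description|
|---|---|
|static class|FileSource.FileSourceBuilder<T>: The builder for the FileSource.|

Nested classes inherited from class AbstractFileSource: AbstractFileSource.AbstractFileSourceBuilder<T,SplitT extends FileSourceSplit,SELF extends AbstractFileSource.AbstractFileSourceBuilder<T,SplitT,SELF>>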
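|Modifier and Type|Field and Description|
|---|---|
|static FileEnumerator.Provider|DEFAULT_NON_SPLITTABLE_FILE_ENUMERATOR: The default file enumerator used for non-splittable formats.|
|static FileSplitAssigner.Provider|DEFAULT_SPLIT_ASSIGNER: The default split assigner, a lazy locality-aware assigner.|
|static FileEnumerator.Provider|DEFAULT_SPLITTABLE_FILE_ENUMERATOR: The default file enumerator used for splittable formats.|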
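|Modifier and Type|Method and Description|
|---|---|
|static <T> FileSource.FileSourceBuilder<T>|forBulkFileFormat(BulkFormat<T,FileSourceSplit> bulkFormat, Path... paths): Builds a new FileSource using a BulkFormat to read batches of records from files.|
|static <T> FileSource.FileSourceBuilder<T>|forRecordFileFormat(FileRecordFormat<T> recordFormat, Path... paths): Builds a new FileSource using a FileRecordFormat to read record-by-record from a file path.|
|static <T> FileSource.FileSourceBuilder<T>|forRecordStreamFormat(StreamFormat<T> streamFormat, Path... paths): Builds a new FileSource using a StreamFormat to read record-by-record from a file stream.|
|SimpleVersionedSerializer<FileSourceSplit>|getSplitSerializer(): Creates a serializer for the source splits.|

Methods inherited from class AbstractFileSource: createEnumerator, createReader, getAssignerFactory, getBoundedness, getContinuousEnumerationSettings, getEnumeratorCheckpointSerializer, getProducedType, restoreEnumerator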
public static final FileSplitAssigner.Provider DEFAULT_SPLIT_ASSIGNER
public static final FileEnumerator.Provider DEFAULT_SPLITTABLE_FILE_ENUMERATOR
public static final FileEnumerator.Provider DEFAULT_NON_SPLITTABLE_FILE_ENUMERATOR
public SimpleVersionedSerializer<FileSourceSplit> getSplitSerializer()
public static <T> FileSource.FileSourceBuilder<T> forRecordStreamFormat(StreamFormat<T> streamFormat, Path... paths)
Builds a new FileSource using a StreamFormat to read record-by-record from a file stream.
When possible, stream-based formats are generally easier to use than (and preferable to) file-based formats, because they support better default behavior around I/O batching or progress tracking (checkpoints).
Stream formats also automatically de-compress files based on the file extension. This supports files ending in ".deflate" (Deflate), ".xz" (XZ), ".bz2" (BZip2), ".gz", ".gzip" (GZip).
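As a rough picture of extension-based decompression, the sketch below picks a decompressing wrapper from the file name using only JDK classes. It is not Flink's code: `ExtensionDecompression`, `decompress`, and `gzip` are hypothetical names, the ".deflate" branch assumes zlib-wrapped data as read by `InflaterInputStream`, and the ".xz" and ".bz2" branches are left unsupported because the JDK ships no codec for them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical illustration of choosing a decompressor by file extension. */
final class ExtensionDecompression {

    /** Wraps raw in a decompressing stream chosen from the file name's extension. */
    static InputStream decompress(String fileName, InputStream raw) throws IOException {
        if (fileName.endsWith(".gz") || fileName.endsWith(".gzip")) {
            return new GZIPInputStream(raw);
        }
        if (fileName.endsWith(".deflate")) {
            // Assumption: the data is zlib-wrapped deflate.
            return new InflaterInputStream(raw);
        }
        if (fileName.endsWith(".xz") || fileName.endsWith(".bz2")) {
            // The JDK has no XZ or BZip2 codec; a real reader needs a third-party library.
            throw new UnsupportedOperationException("no JDK codec for " + fileName);
        }
        // Unknown extension: treat the file as uncompressed.
        return raw;
    }

    /** Test helper: gzip-compresses the given bytes in memory. */
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("hello".getBytes(StandardCharsets.UTF_8));
        try (InputStream in = decompress("data.txt.gz", new ByteArrayInputStream(compressed))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```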
public static <T> FileSource.FileSourceBuilder<T> forBulkFileFormat(BulkFormat<T,FileSourceSplit> bulkFormat, Path... paths)
Builds a new FileSource using a BulkFormat to read batches of records from files.
Examples for bulk readers are compressed and vectorized formats such as ORC or Parquet.
public static <T> FileSource.FileSourceBuilder<T> forRecordFileFormat(FileRecordFormat<T> recordFormat, Path... paths)
Builds a new FileSource using a FileRecordFormat to read record-by-record from a file path.
A FileRecordFormat is more general than the StreamFormat, but often also requires more careful parametrization.
Copyright © 2014–2023 The Apache Software Foundation. All rights reserved.