@PublicEvolving public interface FileRecordFormat<T> extends Serializable, ResultTypeQueryable<T>
This format is for cases where the readers need access to the file directly or need to create
a custom stream. For readers that can directly on input streams, consider using the
StreamFormat, which is more robust.
The outer class
FileRecordFormat acts mainly as a configuration holder and factory for
the reader. The actual reading is done by the
FileRecordFormat.Reader, which is created
based on an input stream in the
createReader(Configuration, Path, long, long) method and
restored (from checkpointed positions) in the method
long, long, long).
File splitting means dividing a file into multiple regions that can be read independently.
Whether a format supports splitting is indicated via the
Splitting has the potential to increase parallelism and performance, but poses additional constraints on the format readers: Readers need to be able to find a consistent starting point within the file near the offset where the split starts, (like the next record delimiter, or a block start or a sync marker). This is not necessarily possible for all formats, which is why splitting is optional.
Readers can optionally return the current position of the reader, via the
FileRecordFormat.Reader.getCheckpointedPosition(). This can improve recovery speed from a
By default (if that method is not overridden or returns null), then recovery from a checkpoint works by reading the split again and skipping the number of records that were processed before the checkpoint. Implementing this method allows formats to directly seek to that position, rather than read and discard a number or records.
The position is a combination of offset in the file and a number of records to skip after this
CheckpointedPosition). This helps formats that cannot describe all record
positions by an offset, for example because records are compressed in batches or stored in a
columnar layout (e.g., ORC, Parquet). The default behavior can be viewed as returning a
CheckpointedPosition where the offset is always zero and only the
CheckpointedPosition.getRecordsAfterOffset() is incremented with each emitted record.
Like many other API classes in Flink, the outer class is serializable to support sending instances to distributed workers for parallel execution. This is purely short-term serialization for RPC and no instance of this will be long-term persisted in a serialized form.
Internally in the file source, the readers pass batches of records from the reading threads (that perform the typically blocking I/O operations) to the async mailbox threads that do the streaming and batch data processing. Passing records in batches (rather than one-at-a-time) much reduce the thread-to-thread handover overhead.
This batching is by default based a number of records. See
RECORDS_PER_FETCH to configure that handover batch size.
|Modifier and Type||Interface and Description|
The actual reader that reads the records.
|Modifier and Type||Field and Description|
Config option for the number of records to hand over in each fetch.
|Modifier and Type||Method and Description|
Creates a new reader to read in this format.
Gets the type produced by this format.
Checks whether this format is splittable.
Restores a reader from a checkpointed position.
static final ConfigOption<Integer> RECORDS_PER_FETCH
The number should be large enough so that the thread-to-thread handover overhead is amortized across the records, but small enough so that the these records together do not consume too memory to be feasible.
FileRecordFormat.Reader<T> createReader(Configuration config, Path filePath, long splitOffset, long splitLength) throws IOException
restoreReader(Configuration, Path, long, long, long)for details.
FileRecordFormat.Reader<T> restoreReader(Configuration config, Path filePath, long restoredOffset, long splitOffset, long splitLength) throws IOException
FileRecordFormat.Reader.getCheckpointedPosition()a value with non-negative
offset. That value is supplied as the
If the reader never produced a
CheckpointedPosition with a non-negative offset
before, then this method is not called, and the reader is created in the same way as a fresh
reader via the method
createReader(Configuration, Path, long, long) and the
appropriate number of records are read and discarded, to position to reader to the
top-level JavaDocs (section "Splitting") for details.
Copyright © 2014–2022 The Apache Software Foundation. All rights reserved.