StreamFormat
instead. The main motivation for removing it is the
inherent design flaw in the batching of FileRecordFormat: StreamFormat can guarantee that
only a certain amount of memory is being used (unless a single record exceeds that already),
but FileRecordFormat can only batch by the number of records. By removing FileRecordFormat,
we relay the responsibility of implementing the batching to the format developer; they need
to use BulkFormat and find a better way than batch by number of records.@Deprecated @PublicEvolving public interface FileRecordFormat<T> extends Serializable, ResultTypeQueryable<T>
This format is for cases where the readers need access to the file directly or need to create
a custom stream. For readers that can directly on input streams, consider using the StreamFormat
, which is more robust.
The outer class FileRecordFormat
acts mainly as a configuration holder and factory for
the reader. The actual reading is done by the FileRecordFormat.Reader
, which is created
based on an input stream in the createReader(Configuration, Path, long, long)
method and
restored (from checkpointed positions) in the method restoreReader(Configuration, Path,
long, long, long)
.
File splitting means dividing a file into multiple regions that can be read independently.
Whether a format supports splitting is indicated via the isSplittable()
method.
Splitting has the potential to increase parallelism and performance, but poses additional constraints on the format readers: Readers need to be able to find a consistent starting point within the file near the offset where the split starts, (like the next record delimiter, or a block start or a sync marker). This is not necessarily possible for all formats, which is why splitting is optional.
Readers can optionally return the current position of the reader, via the FileRecordFormat.Reader.getCheckpointedPosition()
. This can improve recovery speed from a
checkpoint.
By default (if that method is not overridden or returns null), then recovery from a checkpoint works by reading the split again and skipping the number of records that were processed before the checkpoint. Implementing this method allows formats to directly seek to that position, rather than read and discard a number or records.
The position is a combination of offset in the file and a number of records to skip after this
offset (see CheckpointedPosition
). This helps formats that cannot describe all record
positions by an offset, for example because records are compressed in batches or stored in a
columnar layout (e.g., ORC, Parquet). The default behavior can be viewed as returning a CheckpointedPosition
where the offset is always zero and only the CheckpointedPosition.getRecordsAfterOffset()
is incremented with each emitted record.
Like many other API classes in Flink, the outer class is serializable to support sending instances to distributed workers for parallel execution. This is purely short-term serialization for RPC and no instance of this will be long-term persisted in a serialized form.
Internally in the file source, the readers pass batches of records from the reading threads (that perform the typically blocking I/O operations) to the async mailbox threads that do the streaming and batch data processing. Passing records in batches (rather than one-at-a-time) much reduce the thread-to-thread handover overhead.
This batching is by default based a number of records. See RECORDS_PER_FETCH
to configure that handover batch size.
Modifier and Type | Interface and Description |
---|---|
static interface |
FileRecordFormat.Reader<T>
Deprecated.
This interface is Deprecated, use
StreamFormat.Reader instead. |
Modifier and Type | Field and Description |
---|---|
static ConfigOption<Integer> |
RECORDS_PER_FETCH
Deprecated.
Config option for the number of records to hand over in each fetch.
|
Modifier and Type | Method and Description |
---|---|
FileRecordFormat.Reader<T> |
createReader(Configuration config,
Path filePath,
long splitOffset,
long splitLength)
Deprecated.
Creates a new reader to read in this format.
|
TypeInformation<T> |
getProducedType()
Deprecated.
Gets the type produced by this format.
|
boolean |
isSplittable()
Deprecated.
Checks whether this format is splittable.
|
FileRecordFormat.Reader<T> |
restoreReader(Configuration config,
Path filePath,
long restoredOffset,
long splitOffset,
long splitLength)
Deprecated.
Restores a reader from a checkpointed position.
|
static final ConfigOption<Integer> RECORDS_PER_FETCH
The number should be large enough so that the thread-to-thread handover overhead is amortized across the records, but small enough so that these records together do not consume too much memory.
FileRecordFormat.Reader<T> createReader(Configuration config, Path filePath, long splitOffset, long splitLength) throws IOException
restoreReader(Configuration, Path, long, long, long)
for details.IOException
FileRecordFormat.Reader<T> restoreReader(Configuration config, Path filePath, long restoredOffset, long splitOffset, long splitLength) throws IOException
FileRecordFormat.Reader.getCheckpointedPosition()
a
value with non-negative offset
. That value is
supplied as the restoredOffset
.
If the reader never produced a CheckpointedPosition
with a non-negative offset
before, then this method is not called, and the reader is created in the same way as a fresh
reader via the method createReader(Configuration, Path, long, long)
and the
appropriate number of records are read and discarded, to position to reader to the
checkpointed position.
IOException
boolean isSplittable()
See top-level JavaDocs
(section "Splitting") for details.
TypeInformation<T> getProducedType()
getProducedType
in interface ResultTypeQueryable<T>
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.