T
- The type of records created by this format reader.@PublicEvolving public abstract class SimpleStreamFormat<T> extends Object implements StreamFormat<T>
StreamFormat
, for formats that are not splittable.
This format makes no difference between creating readers from scratch (new file) or from a
checkpoint. Because of that, if the reader actively checkpoints its position (via the Reader#getCheckpointedPosition()
method) then the checkpointed offset must be a byte offset in
the file from which the stream can be resumed as if it were the beginning of the file.
For all other details, please check the docs of StreamFormat
.
StreamFormat.Reader<T>
FETCH_IO_SIZE
Constructor and Description |
---|
SimpleStreamFormat() |
Modifier and Type | Method and Description |
---|---|
abstract StreamFormat.Reader<T> |
createReader(Configuration config,
FSDataInputStream stream)
Creates a new reader.
|
StreamFormat.Reader<T> |
createReader(Configuration config,
FSDataInputStream stream,
long fileLen,
long splitEnd)
Creates a new reader to read in this format.
|
abstract TypeInformation<T> |
getProducedType()
Gets the type produced by this format.
|
boolean |
isSplittable()
This format is always not splittable.
|
StreamFormat.Reader<T> |
restoreReader(Configuration config,
FSDataInputStream stream,
long restoredOffset,
long fileLen,
long splitEnd)
Restores a reader from a checkpointed position.
|
public abstract StreamFormat.Reader<T> createReader(Configuration config, FSDataInputStream stream) throws IOException
If the reader previously checkpointed an offset, then the input stream will be positioned
to that particular offset. Readers checkpoint an offset by returning a value from the method
Reader#getCheckpointedPosition()
method with an offset other than CheckpointedPosition.NO_OFFSET
).
IOException
public abstract TypeInformation<T> getProducedType()
getProducedType
in interface ResultTypeQueryable<T>
getProducedType
in interface StreamFormat<T>
public final boolean isSplittable()
isSplittable
in interface StreamFormat<T>
public final StreamFormat.Reader<T> createReader(Configuration config, FSDataInputStream stream, long fileLen, long splitEnd) throws IOException
StreamFormat
StreamFormat.restoreReader(Configuration, FSDataInputStream, long, long, long)
for details.
If the format is splittable
, then the stream
is positioned
to the beginning of the file split, otherwise it will be at position zero.
The fileLen
is the length of the entire file, while splitEnd
is the offset
of the first byte after the split end boundary (exclusive end boundary). For non-splittable
formats, both values are identical.
createReader
in interface StreamFormat<T>
IOException
public final StreamFormat.Reader<T> restoreReader(Configuration config, FSDataInputStream stream, long restoredOffset, long fileLen, long splitEnd) throws IOException
StreamFormat
StreamFormat.Reader.getCheckpointedPosition()
a value with
non-negative offset
. That value is supplied as the
restoredOffset
.
If the format is splittable
, then the stream
is positioned
to the beginning of the file split, otherwise it will be at position zero. The stream is NOT
positioned to the checkpointed offset, because the format is free to interpret this offset in
a different way than the byte offset in the file (for example as a record index).
If the reader never produced a CheckpointedPosition
with a non-negative offset
before, then this method is not called, and the reader is created in the same way as a fresh
reader via the method StreamFormat.createReader(Configuration, FSDataInputStream, long, long)
and
the appropriate number of records are read and discarded, to position to reader to the
checkpointed position.
Having a different method for restoring readers to a checkpointed position allows readers to seek to the start position differently in that case, compared to when the reader is created from a split offset generated at the enumerator. In the latter case, the offsets are commonly "approximate", because the enumerator typically generates splits based only on metadata. Readers then have to skip some bytes while searching for the next position to start from (based on a delimiter, sync marker, block offset, etc.). In contrast, checkpointed offsets are often precise, because they were recorded as the reader when through the data stream. Starting a reader from a checkpointed offset may hence not require and search for the next delimiter/block/marker.
The fileLen
is the length of the entire file, while splitEnd
is the offset
of the first byte after the split end boundary (exclusive end boundary). For non-splittable
formats, both values are identical.
restoreReader
in interface StreamFormat<T>
IOException
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.