FileRecordFormat (Flink : 1.14-SNAPSHOT API)

All Superinterfaces:

ResultTypeQueryable<T>, Serializable
```
@PublicEvolving
public interface FileRecordFormat<T>
extends Serializable, ResultTypeQueryable<T>
```
A reader format that reads individual records from a file.
This format is for cases where the readers need access to the file directly or need to create a custom stream. For readers that can work directly on input streams, consider using the StreamFormat, which is more robust.
The outer class FileRecordFormat acts mainly as a configuration holder and factory for the reader. The actual reading is done by the FileRecordFormat.Reader, which is created for a file region in the createReader(Configuration, Path, long, long) method and restored (from checkpointed positions) in the method restoreReader(Configuration, Path, long, long, long).
Splitting

File splitting means dividing a file into multiple regions that can be read independently. Whether a format supports splitting is indicated via the isSplittable() method.
Splitting has the potential to increase parallelism and performance, but poses additional constraints on the format readers: Readers need to be able to find a consistent starting point within the file near the offset where the split starts (like the next record delimiter, a block start, or a sync marker). This is not necessarily possible for all formats, which is why splitting is optional.
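The constraint on finding a consistent starting point can be illustrated with a minimal sketch that uses only the Java standard library and no Flink classes. It assumes newline-delimited records and the convention that a record belongs to the split containing its first byte. The class SplitAlignmentSketch and the method readSplit are hypothetical names chosen for this illustration and are not part of the Flink API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not a Flink class.
class SplitAlignmentSketch {

    /** Returns the records whose first byte lies in [splitOffset, splitOffset + splitLength). */
    static List<String> readSplit(byte[] file, long splitOffset, long splitLength) {
        int end = (int) Math.min(file.length, splitOffset + splitLength);
        int pos = (int) splitOffset;
        if (pos > 0) {
            // Start one byte early so that a record beginning exactly at the
            // split offset is detected via the delimiter that precedes it.
            pos--;
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            pos++; // step past the delimiter to the first record of this split
        }
        List<String> records = new ArrayList<>();
        while (pos < end) {
            // The last record may extend beyond the split end; it is read in full.
            int recordEnd = pos;
            while (recordEnd < file.length && file[recordEnd] != '\n') {
                recordEnd++;
            }
            records.add(new String(file, pos, recordEnd - pos, StandardCharsets.UTF_8));
            pos = recordEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(data, 0, 12));
        System.out.println(readSplit(data, 12, 11));
    }
}
```

With this convention, two adjacent splits emit every record exactly once, even when the split boundary falls in the middle of a record. Formats without such a recognizable marker cannot align this way, which is the case where isSplittable() would return false.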
Checkpointing

Readers can optionally return the current position of the reader via FileRecordFormat.Reader.getCheckpointedPosition(). This can improve recovery speed from a checkpoint.
By default (if that method is not overridden or returns null), recovery from a checkpoint works by reading the split again and skipping the number of records that were processed before the checkpoint. Implementing this method allows formats to directly seek to that position, rather than read and discard a number of records.
The position is a combination of offset in the file and a number of records to skip after this offset (see CheckpointedPosition). This helps formats that cannot describe all record positions by an offset, for example because records are compressed in batches or stored in a columnar layout (e.g., ORC, Parquet). The default behavior can be viewed as returning a CheckpointedPosition where the offset is always zero and only the CheckpointedPosition.getRecordsAfterOffset() is incremented with each emitted record.
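The offset-plus-skip-count idea can be sketched with a toy model that uses only the Java standard library. It assumes a file made of blocks in which only block starts are addressable by offset, similar in spirit to the batch-compressed and columnar formats mentioned above. The classes Position and BlockReader are hypothetical stand-ins for illustration; Position only mimics the two values carried by CheckpointedPosition and is not the Flink class.

```java
import java.util.Arrays;

// Hypothetical illustration, not a Flink class.
class CheckpointPositionSketch {

    /** Stand-in for a checkpointed position: a file offset plus a number of records to skip. */
    static final class Position {
        final long offset;
        final long recordsAfterOffset;

        Position(long offset, long recordsAfterOffset) {
            this.offset = offset;
            this.recordsAfterOffset = recordsAfterOffset;
        }
    }

    /** Toy reader over a non-empty list of blocks; blockOffsets must be sorted ascending. */
    static final class BlockReader {
        private final long[] blockOffsets;
        private final String[][] blocks;
        private int block;
        private int indexInBlock;

        BlockReader(long[] blockOffsets, String[][] blocks) {
            this.blockOffsets = blockOffsets;
            this.blocks = blocks;
        }

        /** Returns the next record, or null at the end of the file. */
        String read() {
            while (indexInBlock == blocks[block].length) {
                if (block == blocks.length - 1) {
                    return null;
                }
                block++;
                indexInBlock = 0;
            }
            return blocks[block][indexInBlock++];
        }

        /** Current position: start of the current block plus records consumed since then. */
        Position position() {
            return new Position(blockOffsets[block], indexInBlock);
        }

        /** Seeks to the block at the given offset, then reads and discards the skipped records. */
        void restore(Position position) {
            block = Arrays.binarySearch(blockOffsets, position.offset);
            if (block < 0) {
                throw new IllegalArgumentException("Not a block start: " + position.offset);
            }
            indexInBlock = 0;
            for (long i = 0; i < position.recordsAfterOffset; i++) {
                read();
            }
        }
    }

    public static void main(String[] args) {
        long[] offsets = {0L, 100L, 250L};
        String[][] blocks = {{"a", "b"}, {"c", "d", "e"}, {"f"}};
        BlockReader reader = new BlockReader(offsets, blocks);
        for (int i = 0; i < 4; i++) {
            reader.read();
        }
        BlockReader restored = new BlockReader(offsets, blocks);
        restored.restore(reader.position());
        System.out.println(restored.read()); // prints "e"
    }
}
```

In this model, restoring from the position (100, 2) discards two records, whereas the default-style position (0, 4) reaches the same place by discarding four. The gap grows with the distance between the split start and the checkpoint.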
Serializable

Like many other API classes in Flink, the outer class is serializable to support sending instances to distributed workers for parallel execution. This is purely short-term serialization for RPC and no instance of this will be long-term persisted in a serialized form.
Record Batching

Internally in the file source, the readers pass batches of records from the reading threads (that perform the typically blocking I/O operations) to the async mailbox threads that do the streaming and batch data processing. Passing records in batches (rather than one at a time) greatly reduces the thread-to-thread handover overhead.
This batching is by default based on a number of records. See RECORDS_PER_FETCH to configure that handover batch size.
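The effect of batching on the number of handovers can be shown with a small sketch that uses only the Java standard library. It is not the file source's actual handover mechanism: the class BatchHandoverSketch, the method run, and the use of a BlockingQueue with an empty batch as end marker are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration, not a Flink class.
class BatchHandoverSketch {

    /**
     * Produces totalRecords records on a separate "I/O" thread and hands them to the
     * calling thread in batches of recordsPerFetch. Returns the number of handovers.
     */
    static int run(int totalRecords, int recordsPerFetch, List<Integer> sink) {
        BlockingQueue<List<Integer>> handover = new ArrayBlockingQueue<>(2);
        Thread ioThread = new Thread(() -> {
            try {
                List<Integer> batch = new ArrayList<>(recordsPerFetch);
                for (int record = 0; record < totalRecords; record++) {
                    batch.add(record);
                    if (batch.size() == recordsPerFetch) {
                        handover.put(batch); // one thread-to-thread handover per batch
                        batch = new ArrayList<>(recordsPerFetch);
                    }
                }
                if (!batch.isEmpty()) {
                    handover.put(batch); // final, partially filled batch
                }
                handover.put(new ArrayList<>()); // an empty batch marks the end of input
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        ioThread.start();
        int handovers = 0;
        try {
            while (true) {
                List<Integer> batch = handover.take();
                if (batch.isEmpty()) {
                    break;
                }
                handovers++;
                sink.addAll(batch);
            }
            ioThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while consuming batches", e);
        }
        return handovers;
    }

    public static void main(String[] args) {
        System.out.println(run(10, 4, new ArrayList<>())); // prints 3
        System.out.println(run(10, 1, new ArrayList<>())); // prints 10
    }
}
```

Each queue operation stands for one synchronization point between threads, so a larger batch size spreads that cost over more records at the price of holding more records in memory at once.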

Nested Class Summary

Nested Classes
Modifier and Type	Interface and Description
`static interface`	`FileRecordFormat.Reader<T>` The actual reader that reads the records.

Field Summary

Fields
Modifier and Type	Field and Description
`static ConfigOption<Integer>`	`RECORDS_PER_FETCH` Config option for the number of records to hand over in each fetch.

Method Summary

Modifier and Type	Method and Description
`FileRecordFormat.Reader<T>`	`createReader(Configuration config, Path filePath, long splitOffset, long splitLength)` Creates a new reader to read in this format.
`TypeInformation<T>`	`getProducedType()` Gets the type produced by this format.
`boolean`	`isSplittable()` Checks whether this format is splittable.
`FileRecordFormat.Reader<T>`	`restoreReader(Configuration config, Path filePath, long restoredOffset, long splitOffset, long splitLength)` Restores a reader from a checkpointed position.

- Field Detail
  - RECORDS_PER_FETCH
```
static final ConfigOption<Integer> RECORDS_PER_FETCH
```
    Config option for the number of records to hand over in each fetch.
    The number should be large enough so that the thread-to-thread handover overhead is amortized across the records, but small enough so that these records together do not consume too much memory.
- Method Detail
  - createReader
```
FileRecordFormat.Reader<T> createReader(Configuration config,
                                        Path filePath,
                                        long splitOffset,
                                        long splitLength)
                                 throws IOException
```
    Creates a new reader to read in this format. This method is called when a fresh reader is created for a split that was assigned from the enumerator. This method may also be called on recovery from a checkpoint, if the reader never stored an offset in the checkpoint (see restoreReader(Configuration, Path, long, long, long) for details).
    
    Throws:
    
    IOException
  - restoreReader
```
FileRecordFormat.Reader<T> restoreReader(Configuration config,
                                         Path filePath,
                                         long restoredOffset,
                                         long splitOffset,
                                         long splitLength)
                                  throws IOException
```
    Restores a reader from a checkpointed position. This method is called when the reader is recovered from a checkpoint and the reader has previously stored an offset into the checkpoint, by returning from FileRecordFormat.Reader.getCheckpointedPosition() a value with a non-negative offset. That value is supplied as the restoredOffset.
    If the reader never produced a CheckpointedPosition with a non-negative offset before, then this method is not called. Instead, the reader is created in the same way as a fresh reader via the method createReader(Configuration, Path, long, long), and the appropriate number of records are read and discarded to position the reader at the checkpointed position.
    
    Throws:
    
    IOException
  - isSplittable
```
boolean isSplittable()
```
    Checks whether this format is splittable. Splittable formats allow Flink to create multiple splits per file, so that Flink can read multiple regions of the file concurrently.
    See top-level JavaDocs (section "Splitting") for details.
  - getProducedType
```
TypeInformation<T> getProducedType()
```
    Gets the type produced by this format. This type will be the type produced by the file source as a whole.
    
    Specified by:
    
    getProducedType in interface ResultTypeQueryable<T>
    
    Returns:
    
    The data type produced by this function or input format.
