Package org.apache.flink.api.common.io
Class DelimitedInputFormat<OT>
- java.lang.Object
-
- org.apache.flink.api.common.io.RichInputFormat<OT,FileInputSplit>
-
- org.apache.flink.api.common.io.FileInputFormat<OT>
-
- org.apache.flink.api.common.io.DelimitedInputFormat<OT>
-
- All Implemented Interfaces:
Serializable, CheckpointableInputFormat<FileInputSplit,Long>, InputFormat<OT,FileInputSplit>, InputSplitSource<FileInputSplit>
- Direct Known Subclasses:
GenericCsvInputFormat, TextInputFormat
@Public public abstract class DelimitedInputFormat<OT> extends FileInputFormat<OT> implements CheckpointableInputFormat<FileInputSplit,Long>
Base implementation for input formats that split the input into records at a delimiter. Subclasses must implement the parsing of the record bytes into a record in the readRecord(Object, byte[], int, int) method.
The default delimiter is the newline character '\n'.
- See Also:
- Serialized Form
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.flink.api.common.io.FileInputFormat
FileInputFormat.FileBaseStatistics, FileInputFormat.InputSplitOpenThread
-
-
Field Summary
Fields
Modifier and Type, Field, Description
protected byte[] currBuffer
protected int currLen
protected int currOffset
protected static String RECORD_DELIMITER
The configuration key to set the record delimiter.
-
Fields inherited from class org.apache.flink.api.common.io.FileInputFormat
currentSplit, enumerateNestedFiles, INFLATER_INPUT_STREAM_FACTORIES, minSplitSize, numSplits, openTimeout, READ_WHOLE_SPLIT_FLAG, splitLength, splitStart, stream, unsplittable
-
-
Constructor Summary
Constructors
Modifier, Constructor, Description
DelimitedInputFormat()
protected DelimitedInputFormat(Path filePath, Configuration configuration)
-
Method Summary
Methods
Modifier and Type, Method, Description
void close()
Closes the input by releasing all buffers and closing the file input stream.
void configure(Configuration parameters)
Configures this input format by reading the path to the file from the configuration and the string that defines the record delimiter.
int getBufferSize()
Charset getCharset()
Get the character set used for the row delimiter.
Long getCurrentState()
Returns the split currently being read, along with its current state.
byte[] getDelimiter()
int getLineLengthLimit()
int getNumLineSamples()
FileInputFormat.FileBaseStatistics getStatistics(BaseStatistics cachedStats)
Obtains basic file statistics containing only file size.
protected void initializeSplit(FileInputSplit split, Long state)
Initialization method that is called after opening or reopening an input split.
protected static void loadConfigParameters(Configuration parameters)
OT nextRecord(OT record)
Reads the next record from the input.
void open(FileInputSplit split)
Opens the given input split.
boolean reachedEnd()
Checks whether the current split is at its end.
protected boolean readLine()
abstract OT readRecord(OT reuse, byte[] bytes, int offset, int numBytes)
This function parses the given byte array which represents a serialized record.
void reopen(FileInputSplit split, Long state)
Restores the state of a parallel instance reading from an InputFormat.
void setBufferSize(int bufferSize)
void setCharset(String charset)
Set the name of the character set used for the row delimiter.
void setDelimiter(byte[] delimiter)
void setDelimiter(char delimiter)
void setDelimiter(String delimiter)
void setLineLengthLimit(int lineLengthLimit)
void setNumLineSamples(int numLineSamples)
-
Methods inherited from class org.apache.flink.api.common.io.FileInputFormat
acceptFile, createInputSplits, decorateInputStream, extractFileExtension, getFilePaths, getFileStats, getFileStats, getInflaterInputStreamFactory, getInputSplitAssigner, getMinSplitSize, getNestedFileEnumeration, getNumSplits, getOpenTimeout, getSplitLength, getSplitStart, getSupportedCompressionFormats, registerInflaterInputStreamFactory, setFilePath, setFilePath, setFilePaths, setFilePaths, setFilesFilter, setMinSplitSize, setNestedFileEnumeration, setNumSplits, setOpenTimeout, testForUnsplittable, toString
-
Methods inherited from class org.apache.flink.api.common.io.RichInputFormat
closeInputFormat, getRuntimeContext, openInputFormat, setRuntimeContext
-
-
-
-
Field Detail
-
currBuffer
protected transient byte[] currBuffer
-
currOffset
protected transient int currOffset
-
currLen
protected transient int currLen
-
RECORD_DELIMITER
protected static final String RECORD_DELIMITER
The configuration key to set the record delimiter.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
DelimitedInputFormat
public DelimitedInputFormat()
-
DelimitedInputFormat
protected DelimitedInputFormat(Path filePath, Configuration configuration)
-
-
Method Detail
-
loadConfigParameters
protected static void loadConfigParameters(Configuration parameters)
-
getCharset
@PublicEvolving public Charset getCharset()
Gets the character set used for the row delimiter. Subclasses also use it to interpret field delimiters and comment strings, and to configure FieldParsers.
- Returns:
- the charset
-
setCharset
@PublicEvolving public void setCharset(String charset)
Sets the name of the character set used for the row delimiter. Subclasses also use it to interpret field delimiters and comment strings, and to configure FieldParsers.
These fields are interpreted at the time they are set, so changing the charset afterwards may cause unexpected results.
- Parameters:
charset
- name of the charset
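The caution about changing the charset can be illustrated without Flink: the same delimiter string encodes to different byte sequences under different charsets, so a delimiter encoded under one charset will not match input written in another. The following is a minimal stand-alone sketch; the class DelimiterEncodingSketch and the helper encode are hypothetical names chosen for this illustration and are not part of the Flink API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-alone sketch; it does not use or reproduce Flink's classes.
final class DelimiterEncodingSketch {

    // Encodes a delimiter string into the bytes a reader would search for.
    static byte[] encode(String delimiter, Charset charset) {
        return delimiter.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = encode("\n", StandardCharsets.UTF_8);
        byte[] utf16le = encode("\n", StandardCharsets.UTF_16LE);
        System.out.println(Arrays.toString(utf8));    // prints [10]
        System.out.println(Arrays.toString(utf16le)); // prints [10, 0]
    }
}
```

A one-byte newline delimiter would never match the two-byte sequence in UTF-16LE input, which is one way the "unexpected results" mentioned above could show up.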
-
getDelimiter
public byte[] getDelimiter()
-
setDelimiter
public void setDelimiter(byte[] delimiter)
-
setDelimiter
public void setDelimiter(char delimiter)
-
setDelimiter
public void setDelimiter(String delimiter)
-
getLineLengthLimit
public int getLineLengthLimit()
-
setLineLengthLimit
public void setLineLengthLimit(int lineLengthLimit)
-
getBufferSize
public int getBufferSize()
-
setBufferSize
public void setBufferSize(int bufferSize)
-
getNumLineSamples
public int getNumLineSamples()
-
setNumLineSamples
public void setNumLineSamples(int numLineSamples)
-
readRecord
public abstract OT readRecord(OT reuse, byte[] bytes, int offset, int numBytes) throws IOException
This function parses the given byte array, which represents a serialized record. The function returns a valid record or throws an IOException.
- Parameters:
reuse
- An optionally reusable object.
bytes
- Binary data of serialized records.
offset
- The offset at which to start reading the record data.
numBytes
- The number of bytes that can be read starting at the offset position.
- Returns:
- Returns the read record if it was successfully deserialized.
- Throws:
IOException
- if the record could not be read.
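As an illustration of the readRecord contract, the following stand-alone sketch decodes the slice of the buffer that starts at offset and spans numBytes bytes into a String, dropping a trailing carriage return. It is not Flink's implementation: the class ReadRecordSketch, the static method, and the fixed UTF-8 charset are assumptions made for this example. In a real subclass the logic would live in the overridden readRecord method and could use getCharset() instead of a fixed charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch of the readRecord contract; not Flink code.
final class ReadRecordSketch {

    // Decodes numBytes bytes starting at offset. The reuse argument is ignored
    // here because a String is immutable and cannot be refilled.
    static String readRecord(String reuse, byte[] bytes, int offset, int numBytes) {
        // Drop a trailing '\r' so that "\r\n"-terminated lines decode cleanly.
        if (numBytes > 0 && bytes[offset + numBytes - 1] == '\r') {
            numBytes--;
        }
        return new String(bytes, offset, numBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] buffer = "id=1\r\nid=2\r\n".getBytes(StandardCharsets.UTF_8);
        // The first record occupies bytes 0..4; the '\n' delimiter is excluded.
        System.out.println(readRecord(null, buffer, 0, 5)); // prints id=1
    }
}
```

The method only touches the bytes inside the given range, since the buffer may contain other records before and after it.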
-
configure
public void configure(Configuration parameters)
Configures this input format by reading the file path and the string that defines the record delimiter from the configuration.
- Specified by:
configure
in interface InputFormat<OT,FileInputSplit>
- Overrides:
configure
in class FileInputFormat<OT>
- Parameters:
parameters
- The configuration object to read the parameters from.
- See Also:
InputFormat.configure(org.apache.flink.configuration.Configuration)
-
getStatistics
public FileInputFormat.FileBaseStatistics getStatistics(BaseStatistics cachedStats) throws IOException
Description copied from class: FileInputFormat
Obtains basic file statistics containing only file size. If the input is a directory, then the size is the sum of all contained files.
- Specified by:
getStatistics
in interface InputFormat<OT,FileInputSplit>
- Overrides:
getStatistics
in class FileInputFormat<OT>
- Parameters:
cachedStats
- The statistics that were cached. May be null.
- Returns:
- The base statistics for the input, or null, if not available.
- Throws:
IOException
- See Also:
InputFormat.getStatistics(org.apache.flink.api.common.io.statistics.BaseStatistics)
-
open
public void open(FileInputSplit split) throws IOException
Opens the given input split. This method opens the input stream to the specified file, allocates read buffers, and positions the stream at the correct position, making sure that any partial record at the beginning is skipped.
- Specified by:
open
in interface InputFormat<OT,FileInputSplit>
- Overrides:
open
in class FileInputFormat<OT>
- Parameters:
split
- The input split to open.
- Throws:
IOException
- Thrown if the split could not be opened due to an I/O problem.
- See Also:
FileInputFormat.open(org.apache.flink.core.fs.FileInputSplit)
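The idea of skipping a partial record at the start of a split can be sketched in plain Java. The example below is not Flink's implementation. It assumes a simple convention: a split that does not start at offset 0 skips everything up to and including the first delimiter, and every split keeps reading records that begin at or before its end, even if the last one extends past the split. The class SplitBoundarySketch and the method readSplit are hypothetical names for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; the boundary convention is an assumption.
final class SplitBoundarySketch {

    // Returns the records owned by the split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length, byte delimiter) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Skip the partial (or boundary) record: advance past the first
            // delimiter found at or after the split start.
            while (pos < data.length && data[pos] != delimiter) {
                pos++;
            }
            pos++;
        }
        List<String> records = new ArrayList<>();
        // A record belongs to this split if it begins at or before the split end;
        // the last record may extend beyond the split.
        while (pos <= end && pos < data.length) {
            int recordStart = pos;
            while (pos < data.length && data[pos] != delimiter) {
                pos++;
            }
            records.add(new String(data, recordStart, pos - recordStart, StandardCharsets.UTF_8));
            pos++; // step over the delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(data, 0, 5, (byte) '\n')); // prints [aa, bbb]
        System.out.println(readSplit(data, 5, 5, (byte) '\n')); // prints [cc]
    }
}
```

With this convention the two splits together yield every record exactly once, although the boundary at offset 5 falls in the middle of the record "bbb".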
-
reachedEnd
public boolean reachedEnd()
Checks whether the current split is at its end.
- Specified by:
reachedEnd
in interface InputFormat<OT,FileInputSplit>
- Returns:
- True, if the split is at its end, false otherwise.
-
nextRecord
public OT nextRecord(OT record) throws IOException
Description copied from interface: InputFormat
Reads the next record from the input.
When this method is called, the input format is guaranteed to be opened.
- Specified by:
nextRecord
in interface InputFormat<OT,FileInputSplit>
- Parameters:
record
- Object that may be reused.
- Returns:
- Read record.
- Throws:
IOException
- Thrown, if an I/O error occurred.
-
close
public void close() throws IOException
Closes the input by releasing all buffers and closing the file input stream.
- Specified by:
close
in interface InputFormat<OT,FileInputSplit>
- Overrides:
close
in class FileInputFormat<OT>
- Throws:
IOException
- Thrown, if the closing of the file stream causes an I/O error.
-
readLine
protected final boolean readLine() throws IOException
- Throws:
IOException
-
getCurrentState
@PublicEvolving public Long getCurrentState() throws IOException
Description copied from interface: CheckpointableInputFormat
Returns the split currently being read, along with its current state. This will be used to restore the state of the reading channel when recovering from a task failure. In the case of a simple text file, the state can correspond to the last read offset in the split.
- Specified by:
getCurrentState
in interface CheckpointableInputFormat<FileInputSplit,Long>
- Returns:
- The state of the channel.
- Throws:
IOException
- Thrown if the creation of the state object failed.
-
reopen
@PublicEvolving public void reopen(FileInputSplit split, Long state) throws IOException
Description copied from interface: CheckpointableInputFormat
Restores the state of a parallel instance reading from an InputFormat. This is necessary when recovering from a task failure. When this method is called, the input format is guaranteed to be configured.
NOTE: The caller has to make sure that the provided split is the one to which the state belongs.
- Specified by:
reopen
in interface CheckpointableInputFormat<FileInputSplit,Long>
- Parameters:
split
- The split to be opened.
state
- The state from which to start. This can contain the offset, but also other data, depending on the input format.
- Throws:
IOException
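The interplay of getCurrentState and reopen can be sketched with an offset-based reader over an in-memory byte array. This is a simplified illustration under stated assumptions, not Flink's implementation: the class OffsetCheckpointSketch is hypothetical, the state is just the offset of the next unread record, and the split argument of the real reopen method is omitted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; it mirrors the idea, not Flink's classes.
final class OffsetCheckpointSketch {

    private final byte[] data;
    private int pos;

    OffsetCheckpointSketch(byte[] data) {
        this.data = data;
    }

    // Returns the next '\n'-delimited record, or null at the end of the data.
    String nextRecord() {
        if (pos >= data.length) {
            return null;
        }
        int start = pos;
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        String record = new String(data, start, pos - start, StandardCharsets.UTF_8);
        pos++; // step over the delimiter
        return record;
    }

    // Snapshot: the offset of the next unread record.
    Long getCurrentState() {
        return (long) pos;
    }

    // Restore: continue reading from a previously captured offset.
    void reopen(Long state) {
        this.pos = state.intValue();
    }

    public static void main(String[] args) {
        byte[] data = "a\nb\nc\n".getBytes(StandardCharsets.UTF_8);
        OffsetCheckpointSketch reader = new OffsetCheckpointSketch(data);
        reader.nextRecord();                        // consumes "a"
        Long checkpoint = reader.getCurrentState(); // offset 2

        // Simulate recovery: a fresh reader resumes from the checkpoint.
        OffsetCheckpointSketch recovered = new OffsetCheckpointSketch(data);
        recovered.reopen(checkpoint);
        System.out.println(recovered.nextRecord()); // prints b
    }
}
```

Because the state is only meaningful relative to the data it was taken from, the recovered reader must be given the same input, which matches the note that the caller has to pair the state with the right split.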
-
initializeSplit
protected void initializeSplit(FileInputSplit split, @Nullable Long state) throws IOException
Initialization method that is called after opening or reopening an input split.
- Parameters:
split
- Split that was opened or reopened
state
- Checkpointed state if the split was reopened
- Throws:
IOException
-
-