Package org.apache.flink.api.common.io
Class FileInputFormat<OT>
java.lang.Object
  org.apache.flink.api.common.io.RichInputFormat<OT,FileInputSplit>
    org.apache.flink.api.common.io.FileInputFormat<OT>

All Implemented Interfaces:
Serializable, InputFormat<OT,FileInputSplit>, InputSplitSource<FileInputSplit>

Direct Known Subclasses:
AbstractCsvInputFormat, AvroInputFormat, BinaryInputFormat, DelimitedInputFormat

@Public public abstract class FileInputFormat<OT> extends RichInputFormat<OT,FileInputSplit>
The base class for RichInputFormats that read from files. For specific input types, the InputFormat.nextRecord(Object) and InputFormat.reachedEnd() methods need to be implemented. Additionally, one may override open(FileInputSplit) and close() to change the life cycle behavior.

After the open(FileInputSplit) method has completed, the file input data is available from the stream field.

See Also:
Serialized Form
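
The life cycle described above can be illustrated with a minimal sketch. It deliberately uses only JDK types: FileFormatSketch and LineFormatSketch are hypothetical stand-ins invented for this illustration, not Flink classes, and the sketch opens a whole file rather than a FileInputSplit. It only shows the order of calls: open, then reachedEnd and nextRecord in a loop, then close.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the life cycle described above; not a Flink type. */
abstract class FileFormatSketch<OT> {
    /** Plays the role of the protected stream field. */
    protected InputStream stream;

    /** Opens the file; subclasses may override to add their own setup. */
    public void open(Path file) throws IOException {
        stream = Files.newInputStream(file);
    }

    /** Closes the stream; subclasses may override to release extra resources. */
    public void close() throws IOException {
        if (stream != null) {
            stream.close();
            stream = null;
        }
    }

    /** Must be implemented per input type. */
    public abstract boolean reachedEnd() throws IOException;

    /** Must be implemented per input type. */
    public abstract OT nextRecord(OT reuse) throws IOException;
}

/** Reads one text line per record. */
class LineFormatSketch extends FileFormatSketch<String> {
    private BufferedReader reader;
    private String lookahead;

    @Override
    public void open(Path file) throws IOException {
        super.open(file); // after this call, the stream field is usable
        reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        lookahead = reader.readLine();
    }

    @Override
    public boolean reachedEnd() {
        return lookahead == null;
    }

    @Override
    public String nextRecord(String reuse) throws IOException {
        String current = lookahead;
        lookahead = reader.readLine();
        return current;
    }
}
```

A caller would invoke open once, loop while reachedEnd() returns false, and call close at the end. An actual subclass would extend FileInputFormat itself and receive a FileInputSplit in open.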
-
-
Nested Class Summary
static class FileInputFormat.FileBaseStatistics
Encapsulation of the basic statistics the optimizer obtains about a file.

static class FileInputFormat.InputSplitOpenThread
Obtains a DataInputStream in a thread that is not interrupted.
-
Field Summary
protected FileInputSplit currentSplit
The current split that this parallel instance must consume.

protected boolean enumerateNestedFiles
The flag to specify whether recursive traversal of the input directory structure is enabled.

protected static Map<String,InflaterInputStreamFactory<?>> INFLATER_INPUT_STREAM_FACTORIES
A mapping of file extensions to decompression algorithms based on DEFLATE.

protected long minSplitSize
The minimal split size, set by the configure() method.

protected int numSplits
The desired number of splits, as set by the configure() method.

protected long openTimeout
Stream opening timeout.

protected static long READ_WHOLE_SPLIT_FLAG
The splitLength is set to -1L for reading the whole split.

protected long splitLength
The length of the split that this parallel instance must consume.

protected long splitStart
The start of the split that this parallel instance must consume.

protected FSDataInputStream stream
The input stream reading from the input file.

protected boolean unsplittable
Some file input formats are not splittable on a block level (for example, deflate). Therefore, the FileInputFormat can only read whole files.
-
Constructor Summary
public FileInputFormat()
protected FileInputFormat(Path filePath)
-
Method Summary
boolean acceptFile(FileStatus fileStatus)
A simple hook to filter files and directories from the input.

void close()
Closes the file input stream of the input format.

void configure(Configuration parameters)
Configures the file input format by reading the file path from the configuration.

FileInputSplit[] createInputSplits(int minNumSplits)
Computes the input splits for the file.

protected FSDataInputStream decorateInputStream(FSDataInputStream inputStream, FileInputSplit fileSplit)
Allows wrapping or decorating the raw FSDataInputStream for a certain file split, e.g., for decoding.

protected static String extractFileExtension(String fileName)
Returns the extension of a file name (not of a path).

Path[] getFilePaths()
Returns the paths of all files to be read by the FileInputFormat.

protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path[] filePaths, ArrayList<FileStatus> files)

protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path filePath, FileSystem fs, ArrayList<FileStatus> files)

protected static InflaterInputStreamFactory<?> getInflaterInputStreamFactory(String fileExtension)

LocatableInputSplitAssigner getInputSplitAssigner(FileInputSplit[] splits)
Returns the assigner for the input splits.

long getMinSplitSize()

boolean getNestedFileEnumeration()

int getNumSplits()

long getOpenTimeout()

long getSplitLength()
Gets the length or remaining length of the current split.

long getSplitStart()
Gets the start of the current split.

FileInputFormat.FileBaseStatistics getStatistics(BaseStatistics cachedStats)
Obtains basic file statistics containing only file size.

static Set<String> getSupportedCompressionFormats()

void open(FileInputSplit fileSplit)
Opens an input stream to the file defined in the input format.

static void registerInflaterInputStreamFactory(String fileExtension, InflaterInputStreamFactory<?> factory)
Registers a decompression algorithm through an InflaterInputStreamFactory with a file extension for transparent decompression.

void setFilePath(String filePath)

void setFilePath(Path filePath)
Sets a single path of a file to be read.

void setFilePaths(String... filePaths)
Sets multiple paths of files to be read.

void setFilePaths(Path... filePaths)
Sets multiple paths of files to be read.

void setFilesFilter(FilePathFilter filesFilter)

void setMinSplitSize(long minSplitSize)

void setNestedFileEnumeration(boolean enable)

void setNumSplits(int numSplits)

void setOpenTimeout(long openTimeout)

protected boolean testForUnsplittable(FileStatus pathFile)

String toString()
-
Methods inherited from class org.apache.flink.api.common.io.RichInputFormat
closeInputFormat, getRuntimeContext, openInputFormat, setRuntimeContext
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface org.apache.flink.api.common.io.InputFormat
nextRecord, reachedEnd
-
-
-
-
Field Detail
-
INFLATER_INPUT_STREAM_FACTORIES
protected static final Map<String,InflaterInputStreamFactory<?>> INFLATER_INPUT_STREAM_FACTORIES
A mapping of file extensions to decompression algorithms based on DEFLATE. Such compressions lead to unsplittable files.
-
READ_WHOLE_SPLIT_FLAG
protected static final long READ_WHOLE_SPLIT_FLAG
The splitLength is set to -1L for reading the whole split.
See Also:
Constant Field Values
-
stream
protected transient FSDataInputStream stream
The input stream reading from the input file.
-
splitStart
protected transient long splitStart
The start of the split that this parallel instance must consume.
-
splitLength
protected transient long splitLength
The length of the split that this parallel instance must consume.
-
currentSplit
protected transient FileInputSplit currentSplit
The current split that this parallel instance must consume.
-
minSplitSize
protected long minSplitSize
The minimal split size, set by the configure() method.
-
numSplits
protected int numSplits
The desired number of splits, as set by the configure() method.
-
openTimeout
protected long openTimeout
Stream opening timeout.
-
unsplittable
protected boolean unsplittable
Some file input formats are not splittable on a block level (for example, deflate). Therefore, the FileInputFormat can only read whole files.
-
enumerateNestedFiles
protected boolean enumerateNestedFiles
The flag to specify whether recursive traversal of the input directory structure is enabled.
-
-
Constructor Detail
-
FileInputFormat
public FileInputFormat()
-
FileInputFormat
protected FileInputFormat(Path filePath)
-
-
Method Detail
-
registerInflaterInputStreamFactory
public static void registerInflaterInputStreamFactory(String fileExtension, InflaterInputStreamFactory<?> factory)
Registers a decompression algorithm through an InflaterInputStreamFactory with a file extension for transparent decompression.
Parameters:
fileExtension - of the compressed files
factory - to create an InflaterInputStream that handles the decompression format
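
The idea of keying decompression on the file extension can be sketched with JDK classes alone. StreamFactorySketch and DecompressionRegistrySketch below are hypothetical names made up for this illustration; they only model the concept and are not Flink's InflaterInputStreamFactory or its registry. The sketch assumes that a file without a registered extension is read unchanged.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.GZIPOutputStream;

/** Hypothetical stand-in for a factory that wraps a raw stream in a decompressing one. */
interface StreamFactorySketch {
    InputStream create(InputStream raw) throws IOException;
}

/** Hypothetical extension-to-factory registry; a model of the concept, not Flink's code. */
final class DecompressionRegistrySketch {
    private static final Map<String, StreamFactorySketch> FACTORIES = new ConcurrentHashMap<>();

    private DecompressionRegistrySketch() {}

    /** Associates a file extension (without the dot) with a factory. */
    static void register(String fileExtension, StreamFactorySketch factory) {
        FACTORIES.put(fileExtension, factory);
    }

    /** Wraps the stream if a factory is registered for the extension; otherwise returns it as is. */
    static InputStream decorate(String fileName, InputStream raw) throws IOException {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1);
        StreamFactorySketch factory = FACTORIES.get(extension);
        return factory == null ? raw : factory.create(raw);
    }

    /** Helper for trying the sketch out: gzip-compresses a string. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Helper for trying the sketch out: reads a stream fully as UTF-8 text. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            bytes.write(buffer, 0, n);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With this model, registering `raw -> new java.util.zip.GZIPInputStream(raw)` under "gz" makes every file name ending in ".gz" decompress transparently, while other files pass through untouched.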
-
getInflaterInputStreamFactory
protected static InflaterInputStreamFactory<?> getInflaterInputStreamFactory(String fileExtension)
-
getSupportedCompressionFormats
@VisibleForTesting public static Set<String> getSupportedCompressionFormats()
-
extractFileExtension
protected static String extractFileExtension(String fileName)
Returns the extension of a file name (not of a path).
Returns:
the extension of the file name, or null if there is no extension.
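
A minimal sketch of the documented contract, assuming the extension is everything after the last period. FileExtensionSketch is a hypothetical class name used only for illustration; the actual method is protected and static on FileInputFormat.

```java
/** Hypothetical illustration of the documented contract; not the Flink source. */
final class FileExtensionSketch {
    private FileExtensionSketch() {}

    /**
     * Returns the text after the last period of a file name,
     * or null if the name contains no period.
     */
    static String extractFileExtension(String fileName) {
        int lastPeriod = fileName.lastIndexOf('.');
        return lastPeriod < 0 ? null : fileName.substring(lastPeriod + 1);
    }
}
```

Under this assumption, "archive.tar.gz" yields "gz" and "README" yields null. Passing a full path such as "dir.v2/README" would yield "v2/README", which shows why the argument has to be a file name and not a path.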
-
getFilePaths
public Path[] getFilePaths()
Returns the paths of all files to be read by the FileInputFormat.
Returns:
The list of all paths to read.
-
setFilePath
public void setFilePath(String filePath)
-
setFilePath
public void setFilePath(Path filePath)
Sets a single path of a file to be read.
Parameters:
filePath - The path of the file to read.
-
setFilePaths
public void setFilePaths(String... filePaths)
Sets multiple paths of files to be read.
Parameters:
filePaths - The paths of the files to read.
-
setFilePaths
public void setFilePaths(Path... filePaths)
Sets multiple paths of files to be read.
Parameters:
filePaths - The paths of the files to read.
-
getMinSplitSize
public long getMinSplitSize()
-
setMinSplitSize
public void setMinSplitSize(long minSplitSize)
-
getNumSplits
public int getNumSplits()
-
setNumSplits
public void setNumSplits(int numSplits)
-
getOpenTimeout
public long getOpenTimeout()
-
setOpenTimeout
public void setOpenTimeout(long openTimeout)
-
setNestedFileEnumeration
public void setNestedFileEnumeration(boolean enable)
-
getNestedFileEnumeration
public boolean getNestedFileEnumeration()
-
getSplitStart
public long getSplitStart()
Gets the start of the current split.
Returns:
The start of the split.
-
getSplitLength
public long getSplitLength()
Gets the length or remaining length of the current split.
Returns:
The length or remaining length of the current split.
-
setFilesFilter
public void setFilesFilter(FilePathFilter filesFilter)
-
configure
public void configure(Configuration parameters)
Configures the file input format by reading the file path from the configuration.
Parameters:
parameters - The configuration with all parameters (note: not the Flink config but the TaskConfig).
See Also:
InputFormat.configure(org.apache.flink.configuration.Configuration)
-
getStatistics
public FileInputFormat.FileBaseStatistics getStatistics(BaseStatistics cachedStats) throws IOException
Obtains basic file statistics containing only file size. If the input is a directory, then the size is the sum of all contained files.
Parameters:
cachedStats - The statistics that were cached. May be null.
Returns:
The base statistics for the input, or null, if not available.
Throws:
IOException
See Also:
InputFormat.getStatistics(org.apache.flink.api.common.io.statistics.BaseStatistics)
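
The "sum of all contained files" rule can be sketched against the local file system with java.nio. FileSizeSketch and the includeNested parameter are hypothetical; the parameter is assumed to play a role similar to the enumerateNestedFiles flag, which is an assumption of this illustration and not something the description above states.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of summing file sizes; not the Flink implementation. */
final class FileSizeSketch {
    private FileSizeSketch() {}

    /**
     * Returns the size of a single file, or the summed size of the files in a directory.
     * Subdirectories are only descended into when includeNested is true (an assumption).
     */
    static long totalSize(Path input, boolean includeNested) throws IOException {
        if (!Files.isDirectory(input)) {
            return Files.size(input);
        }
        long total = 0L;
        try (DirectoryStream<Path> children = Files.newDirectoryStream(input)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    if (includeNested) {
                        total += totalSize(child, true);
                    }
                } else {
                    total += Files.size(child);
                }
            }
        }
        return total;
    }
}
```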
-
getFileStats
protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path[] filePaths, ArrayList<FileStatus> files) throws IOException
- Throws:
IOException
-
getFileStats
protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path filePath, FileSystem fs, ArrayList<FileStatus> files) throws IOException
- Throws:
IOException
-
getInputSplitAssigner
public LocatableInputSplitAssigner getInputSplitAssigner(FileInputSplit[] splits)
Description copied from interface: InputSplitSource
Returns the assigner for the input splits. The assigner determines which parallel instance of the input format gets which input split.
Returns:
The input split assigner.
-
createInputSplits
public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException
Computes the input splits for the file. By default, one file block is one split. If more splits are requested than blocks are available, then a split may be a fraction of a block and splits may cross block boundaries.
Parameters:
minNumSplits - The minimum desired number of file splits.
Returns:
The computed file splits.
Throws:
IOException
See Also:
InputFormat.createInputSplits(int)
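
The relationship between blocks, the requested number of splits, and the minimal split size can be made concrete with a simplified sketch for a single file. SplitSketch, Range, and the exact formula are assumptions chosen for illustration. The sketch ignores multiple input files, unsplittable files, block locations, and any rounding tolerance that the actual implementation may apply.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified split computation for one file; not the Flink algorithm. */
final class SplitSketch {
    private SplitSketch() {}

    /** A byte range [start, start + length) within the file. */
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    static List<Range> computeSplits(long fileLength, long blockSize, int minNumSplits, long minSplitSize) {
        if (minNumSplits < 1) {
            throw new IllegalArgumentException("minNumSplits must be at least 1");
        }
        // Largest split size that still yields at least minNumSplits splits (ceiling division).
        long maxSplitSize = (fileLength + minNumSplits - 1) / minNumSplits;
        // One block per split by default, smaller if more splits are requested,
        // but never below the minimal split size (and never below one byte).
        long splitSize = Math.max(Math.max(minSplitSize, 1L), Math.min(maxSplitSize, blockSize));

        List<Range> splits = new ArrayList<>();
        for (long pos = 0; pos < fileLength; pos += splitSize) {
            splits.add(new Range(pos, Math.min(splitSize, fileLength - pos)));
        }
        return splits;
    }
}
```

For a 1000-byte file with 256-byte blocks, requesting 2 splits gives four block-sized splits in this model, the last one 232 bytes long. Requesting 10 splits gives ten 100-byte splits, and the third of them, covering bytes 200 to 299, crosses the block boundary at byte 256.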
-
testForUnsplittable
protected boolean testForUnsplittable(FileStatus pathFile)
-
acceptFile
public boolean acceptFile(FileStatus fileStatus)
A simple hook to filter files and directories from the input. The method may be overridden. Hadoop's FileInputFormat has a similar mechanism and applies the same filters by default.
Parameters:
fileStatus - The file status to check.
Returns:
true, if the given file or directory is accepted
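
As a rough illustration of such a hook, the sketch below assumes the default follows the Hadoop convention of skipping names that start with an underscore or a period (for example marker files and hidden files). That default is an assumption here and should be checked against the implementation. AcceptFileSketch is a hypothetical class that works on plain file names instead of FileStatus objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical name-based filter; assumes Hadoop-style hidden-file conventions. */
final class AcceptFileSketch {
    private AcceptFileSketch() {}

    /** Accepts every name that does not start with '_' or '.'. */
    static boolean acceptFile(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    /** Keeps only the accepted names, preserving their order. */
    static List<String> filter(List<String> fileNames) {
        List<String> accepted = new ArrayList<>();
        for (String name : fileNames) {
            if (acceptFile(name)) {
                accepted.add(name);
            }
        }
        return accepted;
    }
}
```

An overriding implementation could tighten or relax this rule, for instance by additionally rejecting files with a temporary suffix.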
-
open
public void open(FileInputSplit fileSplit) throws IOException
Opens an input stream to the file defined in the input format. The stream is positioned at the beginning of the given split.

The stream is actually opened in an asynchronous thread to make sure any interruptions to the thread working on the input format do not reach the file system.
Parameters:
fileSplit - The split to be opened.
Throws:
IOException - Thrown, if the split could not be opened due to an I/O problem.
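
The pattern of opening a stream in a separate thread, so that an interrupt on the calling thread does not hit the file system call, can be sketched with plain JDK threads. OpenWithTimeoutSketch is a hypothetical class and not FileInputFormat.InputSplitOpenThread. It opens a local file from offset zero, and it treats a timeout of 0 as "wait indefinitely", which is how Thread.join behaves and is not a statement about openTimeout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of opening a stream in a helper thread with a timeout. */
final class OpenWithTimeoutSketch {
    private OpenWithTimeoutSketch() {}

    /** Helper thread that performs the potentially blocking open call. */
    private static final class Opener extends Thread {
        private final Path file;
        private volatile InputStream result;
        private volatile Throwable error;

        Opener(Path file) {
            super("opener-sketch");
            this.file = file;
            setDaemon(true);
        }

        @Override
        public void run() {
            try {
                result = Files.newInputStream(file);
            } catch (Throwable t) {
                error = t;
            }
        }
    }

    /** Opens the file in a helper thread and waits at most timeoutMillis for the result. */
    static InputStream open(Path file, long timeoutMillis) throws IOException {
        Opener opener = new Opener(file);
        opener.start();
        try {
            // Only this wait is interruptible; the helper thread itself is never interrupted.
            opener.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for the stream to open", e);
        }
        if (opener.error != null) {
            throw new IOException("Could not open " + file, opener.error);
        }
        if (opener.result == null) {
            // A complete implementation would also close a stream that arrives after the timeout.
            throw new IOException("Opening " + file + " timed out after " + timeoutMillis + " ms");
        }
        return opener.result;
    }
}
```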
-
decorateInputStream
protected FSDataInputStream decorateInputStream(FSDataInputStream inputStream, FileInputSplit fileSplit) throws Throwable
Allows wrapping or decorating the raw FSDataInputStream for a certain file split, e.g., for decoding. When overriding this method, also consider adapting testForUnsplittable(org.apache.flink.core.fs.FileStatus) if your stream decoration renders the input file unsplittable. Also consider calling existing superclass implementations.
Parameters:
inputStream - the input stream to decorate
fileSplit - the file split for which the input stream shall be decorated
Returns:
the decorated input stream
Throws:
Throwable - if the decoration fails
See Also:
InputStreamFSInputWrapper
-
close
public void close() throws IOException
Closes the file input stream of the input format.
Throws:
IOException - Thrown, if the input could not be closed properly.
-
-