@Public public abstract class FileInputFormat<OT> extends RichInputFormat<OT,FileInputSplit>
RichInputFormat
s that read from files. For specific input types the
InputFormat.nextRecord(Object)
and InputFormat.reachedEnd()
methods need to be implemented.
Additionally, one may override open(FileInputSplit)
and close()
to change the
life cycle behavior.
After the open(FileInputSplit)
method completed, the file input data is available
from the stream
field.
Modifier and Type | Class and Description |
---|---|
static class |
FileInputFormat.FileBaseStatistics
Encapsulation of the basic statistics the optimizer obtains about a file.
|
static class |
FileInputFormat.InputSplitOpenThread
Obtains a DataInputStream in an thread that is not interrupted.
|
Modifier and Type | Field and Description |
---|---|
protected FileInputSplit |
currentSplit
The current split that this parallel instance must consume.
|
static String |
ENUMERATE_NESTED_FILES_FLAG
Deprecated.
|
protected boolean |
enumerateNestedFiles
The flag to specify whether recursive traversal of the input directory structure is enabled.
|
protected Path |
filePath
Deprecated.
Please override
supportsMultiPaths() and use getFilePaths() and setFilePaths(Path...) . |
protected static Map<String,InflaterInputStreamFactory<?>> |
INFLATER_INPUT_STREAM_FACTORIES
A mapping of file extensions to decompression algorithms based on DEFLATE.
|
protected long |
minSplitSize
The minimal split size, set by the configure() method.
|
protected int |
numSplits
The desired number of splits, as set by the configure() method.
|
protected long |
openTimeout
Stream opening timeout.
|
protected static long |
READ_WHOLE_SPLIT_FLAG
The splitLength is set to -1L for reading the whole split.
|
protected long |
splitLength
The length of the split that this parallel instance must consume.
|
protected long |
splitStart
The start of the split that this parallel instance must consume.
|
protected FSDataInputStream |
stream
The input stream reading from the input file.
|
protected boolean |
unsplittable
Some file input formats are not splittable on a block level (deflate) Therefore, the
FileInputFormat can only read whole files.
|
Modifier | Constructor and Description |
---|---|
|
FileInputFormat() |
protected |
FileInputFormat(Path filePath) |
Modifier and Type | Method and Description |
---|---|
boolean |
acceptFile(FileStatus fileStatus)
A simple hook to filter files and directories from the input.
|
void |
close()
Closes the file input stream of the input format.
|
void |
configure(Configuration parameters)
Configures the file input format by reading the file path from the configuration.
|
FileInputSplit[] |
createInputSplits(int minNumSplits)
Computes the input splits for the file.
|
protected FSDataInputStream |
decorateInputStream(FSDataInputStream inputStream,
FileInputSplit fileSplit)
This method allows to wrap/decorate the raw
FSDataInputStream for a certain file
split, e.g., for decoding. |
protected static String |
extractFileExtension(String fileName)
Returns the extension of a file name (!
|
Path |
getFilePath()
Deprecated.
Please use getFilePaths() instead.
|
Path[] |
getFilePaths()
Returns the paths of all files to be read by the FileInputFormat.
|
protected FileInputFormat.FileBaseStatistics |
getFileStats(FileInputFormat.FileBaseStatistics cachedStats,
Path[] filePaths,
ArrayList<FileStatus> files) |
protected FileInputFormat.FileBaseStatistics |
getFileStats(FileInputFormat.FileBaseStatistics cachedStats,
Path filePath,
FileSystem fs,
ArrayList<FileStatus> files) |
protected static InflaterInputStreamFactory<?> |
getInflaterInputStreamFactory(String fileExtension) |
LocatableInputSplitAssigner |
getInputSplitAssigner(FileInputSplit[] splits)
Returns the assigner for the input splits.
|
long |
getMinSplitSize() |
boolean |
getNestedFileEnumeration() |
int |
getNumSplits() |
long |
getOpenTimeout() |
long |
getSplitLength()
Gets the length or remaining length of the current split.
|
long |
getSplitStart()
Gets the start of the current split.
|
FileInputFormat.FileBaseStatistics |
getStatistics(BaseStatistics cachedStats)
Obtains basic file statistics containing only file size.
|
static Set<String> |
getSupportedCompressionFormats() |
void |
open(FileInputSplit fileSplit)
Opens an input stream to the file defined in the input format.
|
static void |
registerInflaterInputStreamFactory(String fileExtension,
InflaterInputStreamFactory<?> factory)
Registers a decompression algorithm through a
InflaterInputStreamFactory with a file extension
for transparent decompression. |
void |
setFilePath(Path filePath)
Sets a single path of a file to be read.
|
void |
setFilePath(String filePath) |
void |
setFilePaths(Path... filePaths)
Sets multiple paths of files to be read.
|
void |
setFilePaths(String... filePaths)
Sets multiple paths of files to be read.
|
void |
setFilesFilter(FilePathFilter filesFilter) |
void |
setMinSplitSize(long minSplitSize) |
void |
setNestedFileEnumeration(boolean enable) |
void |
setNumSplits(int numSplits) |
void |
setOpenTimeout(long openTimeout) |
boolean |
supportsMultiPaths()
Deprecated.
Will be removed for Flink 2.0.
|
protected boolean |
testForUnsplittable(FileStatus pathFile) |
String |
toString() |
closeInputFormat, getRuntimeContext, openInputFormat, setRuntimeContext
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
nextRecord, reachedEnd
protected static final Map<String,InflaterInputStreamFactory<?>> INFLATER_INPUT_STREAM_FACTORIES
protected static final long READ_WHOLE_SPLIT_FLAG
protected transient FSDataInputStream stream
protected transient long splitStart
protected transient long splitLength
protected transient FileInputSplit currentSplit
@Deprecated protected Path filePath
supportsMultiPaths()
and use getFilePaths()
and setFilePaths(Path...)
.protected long minSplitSize
protected int numSplits
protected long openTimeout
protected boolean unsplittable
protected boolean enumerateNestedFiles
@Deprecated public static final String ENUMERATE_NESTED_FILES_FLAG
public FileInputFormat()
protected FileInputFormat(Path filePath)
public static void registerInflaterInputStreamFactory(String fileExtension, InflaterInputStreamFactory<?> factory)
InflaterInputStreamFactory
with a file extension
for transparent decompression.fileExtension
- of the compressed filesfactory
- to create an InflaterInputStream
that handles the
decompression formatprotected static InflaterInputStreamFactory<?> getInflaterInputStreamFactory(String fileExtension)
@VisibleForTesting public static Set<String> getSupportedCompressionFormats()
protected static String extractFileExtension(String fileName)
null
if there is no extension.@Deprecated public Path getFilePath()
public Path[] getFilePaths()
public void setFilePath(String filePath)
public void setFilePath(Path filePath)
filePath
- The path of the file to read.public void setFilePaths(String... filePaths)
filePaths
- The paths of the files to read.public void setFilePaths(Path... filePaths)
filePaths
- The paths of the files to read.public long getMinSplitSize()
public void setMinSplitSize(long minSplitSize)
public int getNumSplits()
public void setNumSplits(int numSplits)
public long getOpenTimeout()
public void setOpenTimeout(long openTimeout)
public void setNestedFileEnumeration(boolean enable)
public boolean getNestedFileEnumeration()
public long getSplitStart()
public long getSplitLength()
public void setFilesFilter(FilePathFilter filesFilter)
public void configure(Configuration parameters)
parameters
- The configuration with all parameters (note: not the Flink config but the
TaskConfig).InputFormat.configure(org.apache.flink.configuration.Configuration)
public FileInputFormat.FileBaseStatistics getStatistics(BaseStatistics cachedStats) throws IOException
cachedStats
- The statistics that were cached. May be null.IOException
InputFormat.getStatistics(org.apache.flink.api.common.io.statistics.BaseStatistics)
protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path[] filePaths, ArrayList<FileStatus> files) throws IOException
IOException
protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path filePath, FileSystem fs, ArrayList<FileStatus> files) throws IOException
IOException
public LocatableInputSplitAssigner getInputSplitAssigner(FileInputSplit[] splits)
InputSplitSource
public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException
minNumSplits
- The minimum desired number of file splits.IOException
InputFormat.createInputSplits(int)
protected boolean testForUnsplittable(FileStatus pathFile)
public boolean acceptFile(FileStatus fileStatus)
fileStatus
- The file status to check.public void open(FileInputSplit fileSplit) throws IOException
The stream is actually opened in an asynchronous thread to make sure any interruptions to the thread working on the input format do not reach the file system.
fileSplit
- The split to be opened.IOException
- Thrown, if the spit could not be opened due to an I/O problem.protected FSDataInputStream decorateInputStream(FSDataInputStream inputStream, FileInputSplit fileSplit) throws Throwable
FSDataInputStream
for a certain file
split, e.g., for decoding. When overriding this method, also consider adapting testForUnsplittable(org.apache.flink.core.fs.FileStatus)
if your stream decoration renders the input file
unsplittable. Also consider calling existing superclass implementations.inputStream
- is the input stream to decoratedfileSplit
- is the file split for which the input stream shall be decoratedThrowable
- if the decoration failsInputStreamFSInputWrapper
public void close() throws IOException
IOException
- Thrown, if the input could not be closed properly.@Deprecated public boolean supportsMultiPaths()
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.