public final class HadoopDataInputStream extends FSDataInputStream

Concrete subclass of FSDataInputStream for Hadoop's input streams. This supports all file systems supported by Hadoop, such as HDFS and S3 (S3a/S3n).

Modifier and Type | Field and Description
---|---
static int | MIN_SKIP_BYTES. Minimum number of bytes to skip forward before we issue a seek instead of discarding read data.
Constructor and Description |
---|
HadoopDataInputStream(org.apache.hadoop.fs.FSDataInputStream fsDataInputStream). Creates a new data input stream from the given Hadoop input stream. |
Modifier and Type | Method and Description
---|---
int | available()
void | close()
void | forceSeek(long seekPos). Positions the stream to the given location.
org.apache.hadoop.fs.FSDataInputStream | getHadoopInputStream(). Gets the wrapped Hadoop input stream.
long | getPos(). Gets the current position in the input stream.
int | read()
int | read(byte[] buffer, int offset, int length)
void | seek(long seekPos). Seek to the given offset from the start of the file.
long | skip(long n)
void | skipFully(long bytes). Skips over a given amount of bytes in the stream.

Methods inherited from class java.io.InputStream: mark, markSupported, read, reset
public static final int MIN_SKIP_BYTES

The current value is just a magic number. In the long run, this value could become configurable, but for now it is a conservative, relatively small value that should bring safe improvements for small skips (e.g., when reading metadata), which would hurt the most with frequent seeks.

The optimal value depends on the DFS implementation and configuration, plus the underlying file system. For now, this number is chosen "big enough" to provide improvements for smaller seeks, and "small enough" to avoid disadvantages over real seeks. While the minimum should be the page size, a true optimum per system would be the number of bytes that can be consumed sequentially within the seek time. Unfortunately, seek time is not constant, and devices, the OS, and the DFS potentially also use read buffers and read-ahead.
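The trade-off above can be illustrated with a minimal, self-contained sketch. This is not Flink's implementation: the class name, the `reposition` helper, and the threshold value are hypothetical, and a plain `InputStream` stands in for the Hadoop stream. The idea is only that a small forward move is served by discarding bytes with `skip()`, while a backward or far-forward move warrants a real seek.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SkipHeuristic {
    // Assumed threshold for illustration; the real constant's value
    // is an implementation detail of HadoopDataInputStream.
    static final int MIN_SKIP_BYTES = 4;

    // Decides between a cheap forward skip and a real seek, and
    // performs the skip when that branch is taken.
    static String reposition(InputStream in, long current, long target)
            throws IOException {
        long delta = target - current;
        if (delta > 0 && delta <= MIN_SKIP_BYTES) {
            long remaining = delta;
            while (remaining > 0) {           // skip() may skip fewer bytes
                long skipped = in.skip(remaining);
                if (skipped <= 0) break;
                remaining -= skipped;
            }
            return "skip";
        }
        return "seek";                        // backward or far forward move
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[64]);
        System.out.println(reposition(in, 0, 3));   // small forward -> skip
        System.out.println(reposition(in, 3, 60));  // far forward  -> seek
    }
}
```

With `MIN_SKIP_BYTES = 4`, moving from position 0 to 3 is handled by discarding bytes, while moving from 3 to 60 falls through to a real seek.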
public HadoopDataInputStream(org.apache.hadoop.fs.FSDataInputStream fsDataInputStream)
Creates a new data input stream from the given Hadoop input stream.
Parameters:
fsDataInputStream - The Hadoop input stream

public void seek(long seekPos) throws IOException
Seek to the given offset from the start of the file.
Specified by:
seek in class FSDataInputStream
Parameters:
seekPos - the desired offset
Throws:
IOException - Thrown if an error occurred while seeking inside the input stream.

public long getPos() throws IOException
Gets the current position in the input stream.
Specified by:
getPos in class FSDataInputStream
Throws:
IOException - Thrown if an I/O error occurred in the underlying stream implementation while accessing the stream's position.

public int read() throws IOException
Specified by:
read in class InputStream
Throws:
IOException

public void close() throws IOException
Specified by:
close in interface Closeable
Specified by:
close in interface AutoCloseable
Overrides:
close in class InputStream
Throws:
IOException
public int read(@Nonnull byte[] buffer, int offset, int length) throws IOException
Overrides:
read in class InputStream
Throws:
IOException

public int available() throws IOException
Overrides:
available in class InputStream
Throws:
IOException

public long skip(long n) throws IOException
Overrides:
skip in class InputStream
Throws:
IOException

public org.apache.hadoop.fs.FSDataInputStream getHadoopInputStream()
Gets the wrapped Hadoop input stream.
Returns:
the wrapped Hadoop input stream
public void forceSeek(long seekPos) throws IOException
Positions the stream to the given location. In contrast to seek(long), this method will always issue a "seek" command to the DFS and will not replace it by skip(long) for small seeks. Notice that the underlying DFS implementation can still decide to do a skip instead of a seek.
Parameters:
seekPos - the position to seek to.
Throws:
IOException
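The contrast between the two repositioning methods can be sketched with a toy class that only counts "real" seek commands. This is a hypothetical illustration, not Flink's code: the class, the counter, and the threshold value are all assumed for the example.

```java
public class ForceSeekSketch {
    // Assumed threshold, mirroring the MIN_SKIP_BYTES idea.
    static final int MIN_SKIP_BYTES = 1024;

    long pos = 0;
    int seekCommands = 0;  // "real" seeks sent to the (imaginary) DFS

    // seek: small forward moves are served by skipping bytes,
    // so no seek command is issued for them.
    void seek(long seekPos) {
        long delta = seekPos - pos;
        if (delta < 0 || delta > MIN_SKIP_BYTES) {
            seekCommands++;        // backward or far forward: real seek
        }
        pos = seekPos;             // small forward moves just discard bytes
    }

    // forceSeek: unconditionally a real seek, never replaced by a skip.
    void forceSeek(long seekPos) {
        seekCommands++;
        pos = seekPos;
    }

    public static void main(String[] args) {
        ForceSeekSketch s = new ForceSeekSketch();
        s.seek(100);               // within threshold: no seek command
        s.forceSeek(200);          // always counts as a seek
        System.out.println(s.seekCommands);  // prints 1
    }
}
```

Under this sketch, `seek(100)` stays below the threshold and issues no seek command, while `forceSeek(200)` always does, even for a short move.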
public void skipFully(long bytes) throws IOException
Skips over a given amount of bytes in the stream.
Parameters:
bytes - the number of bytes to skip.
Throws:
IOException
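A "skip fully" operation is useful because InputStream.skip(n) is allowed to skip fewer than n bytes. The following is a minimal sketch of that behavior under assumed semantics (the class and helper are hypothetical, not Flink's implementation): keep skipping until exactly the requested number of bytes is consumed, or fail at end of stream.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class SkipFullySketch {
    // Skips exactly `bytes` bytes, unlike InputStream.skip(n),
    // which may return early having skipped fewer.
    static void skipFully(InputStream in, long bytes) throws IOException {
        while (bytes > 0) {
            long skipped = in.skip(bytes);
            if (skipped > 0) {
                bytes -= skipped;
            } else if (in.read() >= 0) {
                bytes--;           // skip() made no progress; read one byte
            } else {
                throw new EOFException("stream ended before skipping all bytes");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[10]);
        skipFully(in, 7);
        System.out.println(in.available());  // 3 bytes remain of 10
    }
}
```

The read-one-byte fallback guards against `skip()` implementations that legitimately return 0 even when data remains.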
Copyright © 2014–2021 The Apache Software Foundation. All rights reserved.