public abstract class TableInputFormat<T extends Tuple> extends RichInputFormat<T,TableInputSplit>
InputFormat subclass that wraps the access for HTables.

Modifier and Type | Field and Description |
---|---|
protected org.apache.hadoop.hbase.client.Scan | scan |
protected org.apache.hadoop.hbase.client.HTable | table |

Constructor and Description |
---|
TableInputFormat() |

Modifier and Type | Method and Description |
---|---|
void | close(): Method that marks the end of the life-cycle of an input split. |
void | closeInputFormat(): Closes this InputFormat instance. |
void | configure(Configuration parameters): Creates a Scan object and opens the HTable connection. |
TableInputSplit[] | createInputSplits(int minNumSplits): Creates the different splits of the input that can be processed in parallel. |
InputSplitAssigner | getInputSplitAssigner(TableInputSplit[] inputSplits): Returns the assigner that distributes the input splits among the parallel instances. |
protected abstract org.apache.hadoop.hbase.client.Scan | getScanner(): Returns an instance of Scan that retrieves the required subset of records from the HBase table. |
BaseStatistics | getStatistics(BaseStatistics cachedStatistics): Gets the basic statistics from the input described by this format. |
protected abstract String | getTableName(): Returns the name of the table that is to be read. |
protected boolean | includeRegionInSplit(byte[] startKey, byte[] endKey): Tests whether the given region is to be included in the InputSplit while splitting the regions of a table. |
protected abstract T | mapResultToTuple(org.apache.hadoop.hbase.client.Result r): The output from HBase is always an instance of Result. |
T | nextRecord(T reuse): Reads the next record from the input. |
void | open(TableInputSplit split): Opens a parallel instance of the input format to work on a split. |
boolean | reachedEnd(): Method used to check if the end of the input is reached. |

Methods inherited from class RichInputFormat: getRuntimeContext, openInputFormat, setRuntimeContext

protected transient org.apache.hadoop.hbase.client.HTable table
protected transient org.apache.hadoop.hbase.client.Scan scan
protected abstract org.apache.hadoop.hbase.client.Scan getScanner()
protected abstract String getTableName()
protected abstract T mapResultToTuple(org.apache.hadoop.hbase.client.Result r)
The output from HBase is always an instance of Result. This method copies the data in the Result instance into the required Tuple.
Parameters:
r - The Result instance from HBase that needs to be converted.
Returns:
A Tuple that contains the needed information.

public void configure(Configuration parameters)
Creates a Scan object and opens the HTable connection. These are opened here because they are needed in createInputSplits, which is called before the openInputFormat method. So the connection is opened in configure(Configuration) and closed in closeInputFormat().
Parameters:
parameters - The configuration that is to be used.
See Also:
Configuration
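
As a rough illustration of the copy step that mapResultToTuple performs, the sketch below uses plain-JDK stand-ins rather than the real HBase and Flink types. `FakeResult` (a row key plus a map from qualifier to raw bytes), `UserVisits` (a two-field holder playing the role of a Tuple) and the `visits` qualifier are all hypothetical names invented for this example. It also assumes that the row key was written as a UTF-8 string and the counter as a big-endian 4-byte integer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an HBase Result: a row key plus qualifier -> raw cell bytes.
class FakeResult {
    final byte[] rowKey;
    final Map<String, byte[]> cells = new HashMap<>();

    FakeResult(byte[] rowKey) {
        this.rowKey = rowKey;
    }
}

// Hypothetical stand-in for a two-field Tuple.
class UserVisits {
    String user;
    int visits;
}

class ResultMappingSketch {
    // Copies the needed fields out of the raw row into a typed holder.
    static UserVisits mapResultToTuple(FakeResult r) {
        UserVisits out = new UserVisits();
        out.user = new String(r.rowKey, StandardCharsets.UTF_8);
        byte[] raw = r.cells.get("visits"); // assumed column qualifier
        // Assumption: the counter was stored as a big-endian 4-byte int.
        // A missing cell is mapped to 0 in this sketch.
        out.visits = (raw == null) ? 0 : ByteBuffer.wrap(raw).getInt();
        return out;
    }

    public static void main(String[] args) {
        FakeResult row = new FakeResult("alice".getBytes(StandardCharsets.UTF_8));
        row.cells.put("visits", ByteBuffer.allocate(4).putInt(42).array());
        UserVisits t = mapResultToTuple(row);
        System.out.println(t.user + " " + t.visits);
    }
}
```

A real subclass would read the cells from the Result instance and fill a Flink Tuple instead; only the decode-and-copy idea carries over.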
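
public void open(TableInputSplit split) throws IOException
Description copied from interface: InputFormat
Opens a parallel instance of the input format to work on a split. When this method is called, the input format is guaranteed to be configured.
Parameters:
split - The split to be opened.
Throws:
IOException - Thrown if the split could not be opened due to an I/O problem.

public boolean reachedEnd() throws IOException
Description copied from interface: InputFormat
Method used to check if the end of the input is reached. When this method is called, the input format is guaranteed to be opened.
Throws:
IOException - Thrown if an I/O error occurred.

public T nextRecord(T reuse) throws IOException
Description copied from interface: InputFormat
Reads the next record from the input. When this method is called, the input format is guaranteed to be opened.
Parameters:
reuse - Object that may be reused.
Throws:
IOException - Thrown if an I/O error occurred.

public void close() throws IOException
Description copied from interface: InputFormat
Method that marks the end of the life-cycle of an input split. When this method is called, the input format is guaranteed to be opened.
Throws:
IOException - Thrown if the input could not be closed properly.

public void closeInputFormat()
Description copied from class: RichInputFormat
Closes this InputFormat instance. Resources allocated in RichInputFormat.openInputFormat() should be closed in this method.
Overrides:
closeInputFormat in class RichInputFormat<T extends Tuple,TableInputSplit>
See Also:
InputFormat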
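
The guarantees stated above imply a call order: configure first, then open for each split, then reachedEnd and nextRecord until the split is exhausted, then close, and closeInputFormat once at the very end. The following is a minimal sketch of that order using plain-JDK stand-ins. `LifecycleSketch`, its `List<String>`-based splits and the `calls` log are hypothetical; this is not Flink's driver code, only an illustration of the ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java stand-in that mimics the method names documented above.
// Splits are modelled as lists of strings; a real format would scan HBase regions.
class LifecycleSketch {
    final List<String> calls = new ArrayList<>(); // records the call order
    private boolean connected;                    // stands in for the HTable connection
    private Iterator<String> current;             // stands in for the per-split scanner

    void configure() {
        connected = true; // the connection is opened here ...
        calls.add("configure");
    }

    List<List<String>> createInputSplits() {
        if (!connected) throw new IllegalStateException("configure() must run first");
        calls.add("createInputSplits");
        return Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
    }

    void open(List<String> split) {
        if (!connected) throw new IllegalStateException("configure() must run first");
        current = split.iterator();
        calls.add("open");
    }

    boolean reachedEnd() {
        return !current.hasNext();
    }

    String nextRecord() {
        return current.next();
    }

    void close() {
        current = null; // end of the life-cycle of one split
        calls.add("close");
    }

    void closeInputFormat() {
        connected = false; // ... and closed here
        calls.add("closeInputFormat");
    }

    // Drives one full pass over all splits in the documented order.
    static List<String> readAll(LifecycleSketch format) {
        List<String> records = new ArrayList<>();
        format.configure();
        try {
            for (List<String> split : format.createInputSplits()) {
                format.open(split);
                try {
                    while (!format.reachedEnd()) {
                        records.add(format.nextRecord());
                    }
                } finally {
                    format.close();
                }
            }
        } finally {
            format.closeInputFormat();
        }
        return records;
    }

    public static void main(String[] args) {
        LifecycleSketch format = new LifecycleSketch();
        System.out.println(readAll(format));
        System.out.println(format.calls);
    }
}
```

In a real job the splits are handed to several parallel instances rather than read in one loop; the sketch collapses this into a single thread to show only the ordering.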

public TableInputSplit[] createInputSplits(int minNumSplits) throws IOException
Description copied from interface: InputFormat
Creates the different splits of the input that can be processed in parallel. When this method is called, the input format is guaranteed to be configured.
Parameters:
minNumSplits - The minimum desired number of splits. If fewer are created, some parallel instances may remain idle.
Throws:
IOException - Thrown when the creation of the splits was erroneous.

protected boolean includeRegionInSplit(byte[] startKey, byte[] endKey)
Tests whether the given region is to be included in the InputSplit while splitting the regions of a table.
This optimization is effective when there is a specific reason to exclude an entire region from the M-R job (so that it does not contribute to the InputSplit), given the start and end keys of that region. It is useful when we need to remember the last-processed top record and revisit the [last, current) interval for M-R processing, continuously. In addition to reducing the number of InputSplits, it also reduces the load on the region server, due to the ordering of the keys.
Note: It is possible that endKey.length == 0 for the last (most recent) region.
Override this method if you want to bulk-exclude regions altogether from M-R. By default, no region is excluded (i.e. all regions are included).
Parameters:
startKey - Start key of the region
endKey - End key of the region

public InputSplitAssigner getInputSplitAssigner(TableInputSplit[] inputSplits)
Description copied from interface: InputFormat
Returns the assigner that distributes the input splits among the parallel instances.
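
As an illustration of the override described for includeRegionInSplit, the sketch below skips every region that lies entirely before a remembered last-processed row key. It is written against plain byte arrays so that it runs without HBase. `RegionFilterSketch`, `lastProcessedKey` and `compareKeys` are hypothetical names, and the comparison assumes that row keys are ordered lexicographically as unsigned bytes. An empty end key is treated as "no upper bound", in line with the note about the last region.

```java
import java.nio.charset.StandardCharsets;

class RegionFilterSketch {
    private final byte[] lastProcessedKey; // hypothetical checkpoint remembered by the caller

    RegionFilterSketch(byte[] lastProcessedKey) {
        this.lastProcessedKey = lastProcessedKey;
    }

    // Lexicographic comparison that treats each byte as an unsigned value.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Mirrors the documented signature: returning false drops the whole region.
    boolean includeRegionInSplit(byte[] startKey, byte[] endKey) {
        if (endKey.length == 0) {
            return true; // last region: no upper bound, so it may hold newer rows
        }
        // A region covers [startKey, endKey). It can contain keys at or after the
        // checkpoint only if its exclusive end key is greater than the checkpoint.
        return compareKeys(endKey, lastProcessedKey) > 0;
    }

    private static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        RegionFilterSketch filter = new RegionFilterSketch(key("k"));
        System.out.println(filter.includeRegionInSplit(key(""), key("g")));
        System.out.println(filter.includeRegionInSplit(key("g"), key("p")));
        System.out.println(filter.includeRegionInSplit(key("p"), key("")));
    }
}
```

The start key is unused in this particular rule; a filter that also has an upper bound on the interval of interest would compare it in the same way.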

public BaseStatistics getStatistics(BaseStatistics cachedStatistics)
Description copied from interface: InputFormat
Gets the basic statistics from the input described by this format. When this method is called, the input format is guaranteed to be configured.
Parameters:
cachedStatistics - The statistics that were cached. May be null.

Copyright © 2014–2017 The Apache Software Foundation. All rights reserved.