IN
- Type of the elements emitted by this sinkFileSink
instead.@PublicEvolving @Deprecated public class StreamingFileSink<IN> extends RichSinkFunction<IN> implements CheckpointedFunction, CheckpointListener
FileSystem
files within buckets. This is integrated
with the checkpointing mechanism to provide exactly once semantics.
When creating the sink a basePath
must be specified. The base directory contains one
directory for every bucket. The bucket directories themselves contain several part files, with at
least one for each parallel subtask of the sink which is writing data to that bucket. These part
files contain the actual output data.
The sink uses a BucketAssigner
to determine in which bucket directory each element
should be written to inside the base directory. The BucketAssigner
can, for example, use
time or a property of the element to determine the bucket directory. The default BucketAssigner
is a DateTimeBucketAssigner
which will create one new bucket every hour.
You can specify a custom BucketAssigner
using the setBucketAssigner(bucketAssigner)
method, after calling forRowFormat(Path, Encoder)
or forBulkFormat(Path,
BulkWriter.Factory)
.
The names of the part files could be defined using OutputFileConfig
. This
configuration contains a part prefix and a part suffix that will be used with the parallel
subtask index of the sink and a rolling counter to determine the file names. For example with a
prefix "prefix" and a suffix ".ext", a file named "prefix-1-17.ext"
contains the data
from subtask 1
of the sink and is the 17th
bucket created by that subtask.
Part files roll based on the user-specified RollingPolicy
. By default, a DefaultRollingPolicy
is used for row-encoded sink output; a OnCheckpointRollingPolicy
is
used for bulk-encoded sink output.
In some scenarios, the open buckets are required to change based on time. In these cases, the
user can specify a bucketCheckInterval
(by default 1m) and the sink will check
periodically and roll the part file if the specified rolling policy says so.
Part files can be in one of three states: in-progress
, pending
or finished
. The reason for this is how the sink works together with the checkpointing mechanism to
provide exactly-once semantics and fault-tolerance. The part file that is currently being written
to is in-progress
. Once a part file is closed for writing it becomes pending
.
When a checkpoint is successful the currently pending files will be moved to finished
.
If case of a failure, and in order to guarantee exactly-once semantics, the sink should roll
back to the state it had when that last successful checkpoint occurred. To this end, when
restoring, the restored files in pending
state are transferred into the finished
state while any in-progress
files are rolled back, so that they do not contain data that
arrived after the checkpoint from which we restore.
Modifier and Type | Class and Description |
---|---|
static class |
StreamingFileSink.BucketsBuilder<IN,BucketID,T extends StreamingFileSink.BucketsBuilder<IN,BucketID,T>>
Deprecated.
The base abstract class for the
StreamingFileSink.RowFormatBuilder and StreamingFileSink.BulkFormatBuilder . |
static class |
StreamingFileSink.BulkFormatBuilder<IN,BucketID,T extends StreamingFileSink.BulkFormatBuilder<IN,BucketID,T>>
Deprecated.
A builder for configuring the sink for bulk-encoding formats, e.g.
|
static class |
StreamingFileSink.DefaultBulkFormatBuilder<IN>
Deprecated.
Builder for the vanilla
StreamingFileSink using a bulk format. |
static class |
StreamingFileSink.DefaultRowFormatBuilder<IN>
Deprecated.
Builder for the vanilla
StreamingFileSink using a row format. |
static class |
StreamingFileSink.RowFormatBuilder<IN,BucketID,T extends StreamingFileSink.RowFormatBuilder<IN,BucketID,T>>
Deprecated.
A builder for configuring the sink for row-wise encoding formats.
|
SinkFunction.Context
Modifier | Constructor and Description |
---|---|
protected |
StreamingFileSink(StreamingFileSink.BucketsBuilder<IN,?,? extends StreamingFileSink.BucketsBuilder<IN,?,?>> bucketsBuilder,
long bucketCheckInterval)
Deprecated.
Creates a new
StreamingFileSink that writes files to the given base directory with
the give buckets properties. |
Modifier and Type | Method and Description |
---|---|
void |
close()
Deprecated.
Tear-down method for the user code.
|
static <IN> StreamingFileSink.DefaultBulkFormatBuilder<IN> |
forBulkFormat(Path basePath,
BulkWriter.Factory<IN> writerFactory)
Deprecated.
Creates the builder for a
StreamingFileSink with bulk-encoding format. |
static <IN> StreamingFileSink.DefaultRowFormatBuilder<IN> |
forRowFormat(Path basePath,
Encoder<IN> encoder)
Deprecated.
Creates the builder for a
StreamingFileSink with row-encoding format. |
void |
initializeState(FunctionInitializationContext context)
Deprecated.
This method is called when the parallel function instance is created during distributed
execution.
|
void |
invoke(IN value,
SinkFunction.Context context)
Deprecated.
Writes the given value to the sink.
|
void |
notifyCheckpointAborted(long checkpointId)
Deprecated.
This method is called as a notification once a distributed checkpoint has been aborted.
|
void |
notifyCheckpointComplete(long checkpointId)
Deprecated.
Notifies the listener that the checkpoint with the given
checkpointId completed and
was committed. |
void |
snapshotState(FunctionSnapshotContext context)
Deprecated.
This method is called when a snapshot for a checkpoint is requested.
|
getIterationRuntimeContext, getRuntimeContext, open, setRuntimeContext
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
finish, invoke, writeWatermark
open
protected StreamingFileSink(StreamingFileSink.BucketsBuilder<IN,?,? extends StreamingFileSink.BucketsBuilder<IN,?,?>> bucketsBuilder, long bucketCheckInterval)
StreamingFileSink
that writes files to the given base directory with
the give buckets properties.public static <IN> StreamingFileSink.DefaultRowFormatBuilder<IN> forRowFormat(Path basePath, Encoder<IN> encoder)
StreamingFileSink
with row-encoding format.IN
- the type of incoming elementsbasePath
- the base path where all the buckets are going to be created as
sub-directories.encoder
- the Encoder
to be used when writing elements in the buckets.StreamingFileSink.RowFormatBuilder.build()
after
specifying the desired parameters.public static <IN> StreamingFileSink.DefaultBulkFormatBuilder<IN> forBulkFormat(Path basePath, BulkWriter.Factory<IN> writerFactory)
StreamingFileSink
with bulk-encoding format.IN
- the type of incoming elementsbasePath
- the base path where all the buckets are going to be created as
sub-directories.writerFactory
- the BulkWriter.Factory
to be used when writing elements in the
buckets.StreamingFileSink.BulkFormatBuilder.build()
after specifying the desired parameters.public void initializeState(FunctionInitializationContext context) throws Exception
CheckpointedFunction
initializeState
in interface CheckpointedFunction
context
- the context for initializing the operatorException
- Thrown, if state could not be created ot restored.public void notifyCheckpointComplete(long checkpointId) throws Exception
CheckpointListener
checkpointId
completed and
was committed.
These notifications are "best effort", meaning they can sometimes be skipped. To behave
properly, implementers need to follow the "Checkpoint Subsuming Contract". Please see the
class-level JavaDocs
for details.
Please note that checkpoints may generally overlap, so you cannot assume that the notifyCheckpointComplete()
call is always for the latest prior checkpoint (or snapshot) that
was taken on the function/operator implementing this interface. It might be for a checkpoint
that was triggered earlier. Implementing the "Checkpoint Subsuming Contract" (see above)
properly handles this situation correctly as well.
Please note that throwing exceptions from this method will not cause the completed checkpoint to be revoked. Throwing exceptions will typically cause task/job failure and trigger recovery.
notifyCheckpointComplete
in interface CheckpointListener
checkpointId
- The ID of the checkpoint that has been completed.Exception
- This method can propagate exceptions, which leads to a failure/recovery for
the task. Note that this will NOT lead to the checkpoint being revoked.public void notifyCheckpointAborted(long checkpointId)
CheckpointListener
Important: The fact that a checkpoint has been aborted does NOT mean that the data
and artifacts produced between the previous checkpoint and the aborted checkpoint are to be
discarded. The expected behavior is as if this checkpoint was never triggered in the first
place, and the next successful checkpoint simply covers a longer time span. See the
"Checkpoint Subsuming Contract" in the class-level JavaDocs
for
details.
These notifications are "best effort", meaning they can sometimes be skipped.
This method is very rarely necessary to implement. The "best effort" guarantee, together with the fact that this method should not result in discarding any data (per the "Checkpoint Subsuming Contract") means it is mainly useful for earlier cleanups of auxiliary resources. One example is to pro-actively clear a local per-checkpoint state cache upon checkpoint failure.
notifyCheckpointAborted
in interface CheckpointListener
checkpointId
- The ID of the checkpoint that has been aborted.public void snapshotState(FunctionSnapshotContext context) throws Exception
CheckpointedFunction
FunctionInitializationContext
when the Function was initialized, or offered now by FunctionSnapshotContext
itself.snapshotState
in interface CheckpointedFunction
context
- the context for drawing a snapshot of the operatorException
- Thrown, if state could not be created ot restored.public void invoke(IN value, SinkFunction.Context context) throws Exception
SinkFunction
You have to override this method when implementing a SinkFunction
, this is a
default
method for backward compatibility with the old-style method only.
invoke
in interface SinkFunction<IN>
value
- The input record.context
- Additional context about the input record.Exception
- This method may throw exceptions. Throwing an exception will cause the
operation to fail and may trigger recovery.public void close() throws Exception
RichFunction
This method can be used for clean up work.
close
in interface RichFunction
close
in class AbstractRichFunction
Exception
- Implementations may forward exceptions, which are caught by the runtime.
When the runtime catches an exception, it aborts the task and lets the fail-over logic
decide whether to retry the task execution.Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.