Class FileSink<IN>
- java.lang.Object
- org.apache.flink.connector.file.sink.FileSink<IN>
- Type Parameters:
IN - the type of the elements in the input of the sink, which are also the elements written to its output
- All Implemented Interfaces:
Serializable, SupportsConcurrentExecutionAttempts, Sink<IN>, SupportsCommitter<FileSinkCommittable>, SupportsWriterState<IN,FileWriterBucketState>, SupportsWriterState.WithCompatibleState, SupportsPreCommitTopology<FileSinkCommittable,FileSinkCommittable>
@Experimental public class FileSink<IN> extends Object implements Sink<IN>, SupportsWriterState<IN,FileWriterBucketState>, SupportsCommitter<FileSinkCommittable>, SupportsWriterState.WithCompatibleState, SupportsPreCommitTopology<FileSinkCommittable,FileSinkCommittable>, SupportsConcurrentExecutionAttempts
A unified sink that emits its input elements to FileSystem files within buckets. This sink achieves exactly-once semantics for both BATCH and STREAMING.

When creating the sink, a basePath must be specified. The base directory contains one directory for every bucket. The bucket directories themselves contain several part files, with at least one for each parallel subtask of the sink that is writing data to that bucket. These part files contain the actual output data.

The sink uses a BucketAssigner to determine the bucket directory, inside the base directory, to which each element is written. The BucketAssigner can, for example, roll on every checkpoint or use time or a property of the element to determine the bucket directory. The default BucketAssigner is a DateTimeBucketAssigner, which creates one new bucket every hour. You can specify a custom BucketAssigner using the withBucketAssigner(bucketAssigner) builder method, after calling forRowFormat(Path, Encoder) or forBulkFormat(Path, BulkWriter.Factory).

The names of the part files can be defined using OutputFileConfig. This configuration contains a part prefix and a part suffix, which are combined with a random uid assigned to each subtask of the sink and a rolling counter to determine the file names. For example, with a prefix "prefix" and a suffix ".ext", a file named "prefix-81fc4980-a6af-41c8-9937-9939408a734b-17.ext" contains the data from the subtask with uid 81fc4980-a6af-41c8-9937-9939408a734b of the sink and is the 17th part file created by that subtask.

Part files roll based on the user-specified RollingPolicy. By default, a DefaultRollingPolicy is used for row-encoded sink output; an OnCheckpointRollingPolicy is used for bulk-encoded sink output.

In some scenarios, the open buckets are required to change based on time. In these cases, the user can specify a bucketCheckInterval (1 minute by default); the sink then checks periodically and rolls the part file if the specified rolling policy says so.

Part files can be in one of three states: in-progress, pending or finished. These states are how the sink provides exactly-once semantics and fault tolerance. The part file that is currently being written to is in-progress. Once a part file is closed for writing, it becomes pending. When a checkpoint is successful (for STREAMING) or at the end of the job (for BATCH), the currently pending files are moved to finished.

For STREAMING, in order to guarantee exactly-once semantics in case of a failure, the sink should roll back to the state it had when the last successful checkpoint occurred. To this end, when restoring, the restored files in pending state are transferred into the finished state, while any in-progress files are rolled back so that they do not contain data that arrived after the checkpoint from which we restore.

FileSink also supports compacting small files to speed up access to the resulting files. Compaction can be enabled via enableCompact. Once enabled, compaction may only be disabled by calling disableCompact explicitly; otherwise there might be data loss.
- See Also:
- Serialized Form
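The part-file naming scheme from the class description (prefix, subtask uid, rolling counter, suffix) can be illustrated with a few lines of plain Java. This is only a sketch: `PartFileNames` and `partFileName` are hypothetical names introduced here, not part of Flink, and the real sink builds its names internally from OutputFileConfig.

```java
public class PartFileNames {

    // Hypothetical helper (not a Flink API): joins the pieces in the order
    // given in the class description: prefix, subtask uid, counter, suffix.
    static String partFileName(String prefix, String subtaskUid, long counter, String suffix) {
        return prefix + "-" + subtaskUid + "-" + counter + suffix;
    }

    public static void main(String[] args) {
        // Reproduces the example file name from the class description.
        System.out.println(
                partFileName("prefix", "81fc4980-a6af-41c8-9937-9939408a734b", 17, ".ext"));
    }
}
```

Because the uid is unique per subtask and the counter only grows, two subtasks writing to the same bucket never produce the same file name.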
-
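To make "one new bucket every hour" concrete, the sketch below derives a bucket directory name from an element timestamp using only java.time. The class `HourlyBucketId`, the method `bucketId`, and the pattern `yyyy-MM-dd--HH` are assumptions chosen for illustration; it does not use or reproduce DateTimeBucketAssigner itself.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourlyBucketId {

    // Assumed pattern: truncating to the hour yields one bucket per hour.
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd--HH");

    // Hypothetical helper: maps a timestamp to the name of its bucket directory.
    static String bucketId(Instant timestamp, ZoneId zone) {
        return HOURLY.withZone(zone).format(timestamp);
    }

    public static void main(String[] args) {
        // Two timestamps within the same hour share a bucket; the next hour starts a new one.
        System.out.println(bucketId(Instant.parse("2024-03-05T14:27:00Z"), ZoneOffset.UTC));
        System.out.println(bucketId(Instant.parse("2024-03-05T14:59:59Z"), ZoneOffset.UTC));
        System.out.println(bucketId(Instant.parse("2024-03-05T15:00:00Z"), ZoneOffset.UTC));
    }
}
```

A custom assigner would follow the same shape but derive the name from a property of the element instead of the clock.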
-
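The class description says that the sink periodically asks the rolling policy whether the current part file should be rolled, but it does not list the criteria. The following sketch assumes three illustrative thresholds (maximum size, maximum age, maximum idle time); `RollingSketch`, `Policy` and `shouldRoll` are hypothetical names and not the DefaultRollingPolicy implementation.

```java
import java.time.Duration;

public class RollingSketch {

    /** Hypothetical policy with three assumed thresholds. */
    static final class Policy {
        final long maxSizeBytes;
        final Duration maxAge;
        final Duration maxIdle;

        Policy(long maxSizeBytes, Duration maxAge, Duration maxIdle) {
            this.maxSizeBytes = maxSizeBytes;
            this.maxAge = maxAge;
            this.maxIdle = maxIdle;
        }

        // Returns true if the in-progress part file should be closed.
        // This is the check a periodic timer (the bucket check interval) would run.
        boolean shouldRoll(long sizeBytes, long createdAtMs, long lastWriteAtMs, long nowMs) {
            return sizeBytes >= maxSizeBytes
                    || nowMs - createdAtMs >= maxAge.toMillis()
                    || nowMs - lastWriteAtMs >= maxIdle.toMillis();
        }
    }

    public static void main(String[] args) {
        Policy policy = new Policy(1024, Duration.ofMinutes(15), Duration.ofMinutes(5));
        System.out.println(policy.shouldRoll(100, 0, 0, 60_000));      // small, young, recently written
        System.out.println(policy.shouldRoll(2048, 0, 0, 60_000));     // size threshold exceeded
        System.out.println(policy.shouldRoll(100, 0, 0, 5 * 60_000));  // idle for five minutes
    }
}
```

Without the periodic check, a time-based threshold would only be evaluated when the next element arrives, so an idle bucket could stay in-progress indefinitely.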
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- static class FileSink.BulkFormatBuilder<IN,T extends FileSink.BulkFormatBuilder<IN,T>>
A builder for configuring the sink for bulk-encoding formats, e.g.
- static class FileSink.DefaultBulkFormatBuilder<IN>
Builder for the vanilla FileSink using a bulk format.
- static class FileSink.DefaultRowFormatBuilder<IN>
Builder for the vanilla FileSink using a row format.
- static class FileSink.RowFormatBuilder<IN,T extends FileSink.RowFormatBuilder<IN,T>>
A builder for configuring the sink for row-wise encoding formats.
- Nested classes/interfaces inherited from interface org.apache.flink.api.connector.sink2.SupportsWriterState
SupportsWriterState.WithCompatibleState
-
-
Method Detail
-
createWriter
public FileWriter<IN> createWriter(WriterInitContext context) throws IOException
Description copied from interface: Sink
Creates a SinkWriter.
- Specified by:
createWriter in interface Sink<IN>
- Parameters:
context - the runtime context.
- Returns:
A sink writer.
- Throws:
IOException - for any failure during creation.
-
restoreWriter
public FileWriter<IN> restoreWriter(WriterInitContext context, Collection<FileWriterBucketState> recoveredState) throws IOException
Description copied from interface: SupportsWriterState
Creates a StatefulSinkWriter from a recovered state.
- Specified by:
restoreWriter in interface SupportsWriterState<IN,FileWriterBucketState>
- Parameters:
context - the runtime context.
recoveredState - the state to recover from.
- Returns:
A sink writer.
- Throws:
IOException - for any failure during creation.
-
getWriterStateSerializer
public SimpleVersionedSerializer<FileWriterBucketState> getWriterStateSerializer()
Description copied from interface: SupportsWriterState
Any stateful sink needs to provide this state serializer and implement StatefulSinkWriter.snapshotState(long) properly. The respective state is used in SupportsWriterState.restoreWriter(WriterInitContext, Collection) on recovery.
- Specified by:
getWriterStateSerializer in interface SupportsWriterState<IN,FileWriterBucketState>
- Returns:
the serializer of the writer's state type.
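A versioned serializer lets state written by an older job version be read after an upgrade. The sketch below shows the idea with a local stand-in interface that has the same three-method shape as SimpleVersionedSerializer (getVersion, serialize, deserialize with a version argument). The `BucketState` class and its fields are hypothetical and do not reflect the actual layout of FileWriterBucketState.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VersionedStateSketch {

    /** Local stand-in for a versioned serializer; not the Flink interface itself. */
    interface VersionedSerializer<E> {
        int getVersion();
        byte[] serialize(E value) throws IOException;
        E deserialize(int version, byte[] bytes) throws IOException;
    }

    /** Hypothetical writer state: a bucket id and the valid length of its in-progress file. */
    static final class BucketState {
        final String bucketId;
        final long inProgressLength;

        BucketState(String bucketId, long inProgressLength) {
            this.bucketId = bucketId;
            this.inProgressLength = inProgressLength;
        }
    }

    static final class BucketStateSerializer implements VersionedSerializer<BucketState> {
        @Override
        public int getVersion() {
            return 2; // assumed: version 1 stored only the bucket id
        }

        @Override
        public byte[] serialize(BucketState state) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(state.bucketId);
                out.writeLong(state.inProgressLength);
            }
            return bytes.toByteArray();
        }

        @Override
        public BucketState deserialize(int version, byte[] data) throws IOException {
            if (version < 1 || version > 2) {
                throw new IOException("Unknown state version: " + version);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
                String bucketId = in.readUTF();
                // Older (version 1) state carries no length; fall back to 0.
                long length = version >= 2 ? in.readLong() : 0L;
                return new BucketState(bucketId, length);
            }
        }
    }

    // Serializes and immediately restores a state with the current version.
    static BucketState roundTrip(BucketState state) {
        BucketStateSerializer serializer = new BucketStateSerializer();
        try {
            return serializer.deserialize(serializer.getVersion(), serializer.serialize(state));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BucketState restored = roundTrip(new BucketState("2024-03-05--14", 42L));
        System.out.println(restored.bucketId + " " + restored.inProgressLength);
    }
}
```

The version number is stored next to the bytes by the caller, so deserialize can branch on it and keep old checkpoints readable.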
-
createCommitter
public Committer<FileSinkCommittable> createCommitter(CommitterInitContext context) throws IOException
Description copied from interface: SupportsCommitter
Creates a Committer that permanently makes the previously written data visible through Committer.commit(Collection).
- Specified by:
createCommitter in interface SupportsCommitter<FileSinkCommittable>
- Parameters:
context - The context information for the committer initialization.
- Returns:
A committer for the two-phase commit protocol.
- Throws:
IOException - for any failure during creation.
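The second phase of a two-phase commit, making pending files visible, can be sketched with java.nio.file alone. This is an assumption-laden illustration: `CommitSketch`, `PendingFile` and the hidden ".part-0.pending" naming are hypothetical, and the real committer works on FileSinkCommittable objects through the file system's own recovery mechanisms rather than a plain rename.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class CommitSketch {

    /** Hypothetical committable: a closed (pending) file and the name it becomes visible under. */
    static final class PendingFile {
        final Path pending;
        final Path target;

        PendingFile(Path pending, Path target) {
            this.pending = pending;
            this.target = target;
        }
    }

    // Publishes each pending file. Calling it again after a failure is safe:
    // files that were already moved are skipped.
    static void commit(Collection<PendingFile> committables) throws IOException {
        for (PendingFile c : committables) {
            if (Files.exists(c.pending)) {
                Files.move(c.pending, c.target, StandardCopyOption.ATOMIC_MOVE);
            } else if (!Files.exists(c.target)) {
                throw new IOException("Neither pending nor finished file exists: " + c.target);
            }
            // Otherwise an earlier attempt already committed this file.
        }
    }

    // Runs the flow in a temporary directory and returns the finished file's content.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("bucket");
            Path pending = dir.resolve(".part-0.pending");
            Path target = dir.resolve("part-0");
            Files.write(pending, "a\nb\n".getBytes(StandardCharsets.UTF_8));

            List<PendingFile> committables =
                    Collections.singletonList(new PendingFile(pending, target));
            commit(committables);
            commit(committables); // replay after a simulated failure: no effect
            return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Idempotence matters because a commit may be retried after recovery; replaying it must neither fail nor duplicate data.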
-
getCommittableSerializer
public SimpleVersionedSerializer<FileSinkCommittable> getCommittableSerializer()
Description copied from interface: SupportsCommitter
Returns the serializer of the committable type.
- Specified by:
getCommittableSerializer in interface SupportsCommitter<FileSinkCommittable>
-
getWriteResultSerializer
public SimpleVersionedSerializer<FileSinkCommittable> getWriteResultSerializer()
Description copied from interface: SupportsPreCommitTopology
Returns the serializer of the WriteResult type.
- Specified by:
getWriteResultSerializer in interface SupportsPreCommitTopology<FileSinkCommittable,FileSinkCommittable>
-
getCompatibleWriterStateNames
public Collection<String> getCompatibleWriterStateNames()
Description copied from interface: SupportsWriterState.WithCompatibleState
A collection of state names of sinks from which the state can be restored. For example, the new FileSink can resume from the state of an old StreamingFileSink as a drop-in replacement when resuming from a checkpoint/savepoint.
- Specified by:
getCompatibleWriterStateNames in interface SupportsWriterState.WithCompatibleState
-
forRowFormat
public static <IN> FileSink.DefaultRowFormatBuilder<IN> forRowFormat(Path basePath, Encoder<IN> encoder)
-
forBulkFormat
public static <IN> FileSink.DefaultBulkFormatBuilder<IN> forBulkFormat(Path basePath, BulkWriter.Factory<IN> bulkWriterFactory)
-
addPreCommitTopology
public DataStream<CommittableMessage<FileSinkCommittable>> addPreCommitTopology(DataStream<CommittableMessage<FileSinkCommittable>> committableStream)
Description copied from interface: SupportsPreCommitTopology
Intercepts and modifies the committables sent on checkpoint or at end of input. Implementers must ensure that all CommittableMessages are modified appropriately.
- Specified by:
addPreCommitTopology in interface SupportsPreCommitTopology<FileSinkCommittable,FileSinkCommittable>
- Parameters:
committableStream - the stream of committables.
- Returns:
the custom topology before Committer.
-
-