public class RocksDBKeyedStateBackend<K> extends AbstractKeyedStateBackend<K>
AbstractKeyedStateBackend
that stores its state in RocksDB
and serializes
state to streams provided by a CheckpointStreamFactory
upon checkpointing. This state backend can store very large state that exceeds memory and spills
to disk. Except for the snapshotting, this class should be accessed as if it is not threadsafe.
This class follows the rules for closing/releasing native RocksDB resources as described in + this document.
Modifier and Type | Class and Description |
---|---|
static class |
RocksDBKeyedStateBackend.RocksDbKvStateInfo
Rocks DB specific information about the k/v states.
|
AbstractKeyedStateBackend.PartitionStateFactory
KeyedStateBackend.KeySelectionListener<K>
Modifier and Type | Field and Description |
---|---|
protected org.rocksdb.RocksDB |
db
Our RocksDB database, this is used by the actual subclasses of
AbstractRocksDBState
to store state. |
static String |
MERGE_OPERATOR_NAME
The name of the merge operator in RocksDB.
|
cancelStreamRegistry, keyContext, keyGroupCompressionDecorator, keyGroupRange, keySerializer, kvStateRegistry, latencyTrackingStateConfig, numberOfKeyGroups, ttlTimeProvider, userCodeClassLoader
Constructor and Description |
---|
RocksDBKeyedStateBackend(ClassLoader userCodeClassLoader,
File instanceBasePath,
RocksDBResourceContainer optionsContainer,
Function<String,org.rocksdb.ColumnFamilyOptions> columnFamilyOptionsFactory,
TaskKvStateRegistry kvStateRegistry,
TypeSerializer<K> keySerializer,
ExecutionConfig executionConfig,
TtlTimeProvider ttlTimeProvider,
LatencyTrackingStateConfig latencyTrackingStateConfig,
org.rocksdb.RocksDB db,
LinkedHashMap<String,RocksDBKeyedStateBackend.RocksDbKvStateInfo> kvStateInformation,
Map<String,HeapPriorityQueueSnapshotRestoreWrapper<?>> registeredPQStates,
int keyGroupPrefixBytes,
CloseableRegistry cancelStreamRegistry,
StreamCompressionDecorator keyGroupCompressionDecorator,
ResourceGuard rocksDBResourceGuard,
RocksDBSnapshotStrategyBase<K,?> checkpointSnapshotStrategy,
RocksDBWriteBatchWrapper writeBatchWrapper,
org.rocksdb.ColumnFamilyHandle defaultColumnFamilyHandle,
RocksDBNativeMetricMonitor nativeMetricMonitor,
SerializedCompositeKeyBuilder<K> sharedRocksKeyBuilder,
PriorityQueueSetFactory priorityQueueFactory,
RocksDbTtlCompactFiltersManager ttlCompactFiltersManager,
InternalKeyContext<K> keyContext,
long writeBatchSize) |
Modifier and Type | Method and Description |
---|---|
void |
compactState(StateDescriptor<?,?> stateDesc) |
<T extends HeapPriorityQueueElement & PriorityComparable<? super T> & Keyed<?>> |
create(String stateName,
TypeSerializer<T> byteOrderedElementSerializer)
Creates a
KeyGroupedInternalPriorityQueue . |
<T extends HeapPriorityQueueElement & PriorityComparable<? super T> & Keyed<?>> |
create(String stateName,
TypeSerializer<T> byteOrderedElementSerializer,
boolean allowFutureMetadataUpdates)
Creates a
KeyGroupedInternalPriorityQueue . |
<N,SV,SEV,S extends State,IS extends S> |
createOrUpdateInternalState(TypeSerializer<N> namespaceSerializer,
StateDescriptor<S,SV> stateDesc,
StateSnapshotTransformer.StateSnapshotTransformFactory<SEV> snapshotTransformFactory)
Creates or updates internal state and returns a new
InternalKvState . |
<N,SV,SEV,S extends State,IS extends S> |
createOrUpdateInternalState(TypeSerializer<N> namespaceSerializer,
StateDescriptor<S,SV> stateDesc,
StateSnapshotTransformer.StateSnapshotTransformFactory<SEV> snapshotTransformFactory,
boolean allowFutureMetadataUpdates)
Creates or updates internal state and returns a new
InternalKvState . |
void |
dispose()
Should only be called by one thread, and only after all accesses to the DB happened.
|
int |
getKeyGroupPrefixBytes() |
<N> Stream<K> |
getKeys(String state,
N namespace) |
<N> Stream<Tuple2<K,N>> |
getKeysAndNamespaces(String state) |
org.rocksdb.ReadOptions |
getReadOptions() |
org.rocksdb.WriteOptions |
getWriteOptions() |
boolean |
isSafeToReuseKVState()
Whether it's safe to reuse key-values from the state-backend, e.g for the purpose of
optimization.
|
void |
notifyCheckpointAborted(long checkpointId)
This method is called as a notification once a distributed checkpoint has been aborted.
|
void |
notifyCheckpointComplete(long completedCheckpointId)
Notifies the listener that the checkpoint with the given
checkpointId completed and
was committed. |
int |
numKeyValueStateEntries()
Returns the total number of state entries across all keys/namespaces.
|
boolean |
requiresLegacySynchronousTimerSnapshots(SnapshotType checkpointType) |
SavepointResources<K> |
savepoint()
Returns a
SavepointResources that can be used by SavepointSnapshotStrategy to
write out a savepoint in the common/unified format. |
void |
setCurrentKey(K newKey)
Sets the current key that is used for partitioned state.
|
RunnableFuture<SnapshotResult<KeyedStateHandle>> |
snapshot(long checkpointId,
long timestamp,
CheckpointStreamFactory streamFactory,
CheckpointOptions checkpointOptions)
Triggers an asynchronous snapshot of the keyed state backend from RocksDB.
|
applyToAllKeys, applyToAllKeys, close, deregisterKeySelectionListener, getCurrentKey, getCurrentKeyGroupIndex, getKeyContext, getKeyGroupCompressionDecorator, getKeyGroupRange, getKeySerializer, getLatencyTrackingStateConfig, getNumberOfKeyGroups, getOrCreateKeyedState, getPartitionedState, notifyCheckpointSubsumed, numKeyValueStatesByName, publishQueryableStateIfEnabled, registerKeySelectionListener, setCurrentKeyGroupIndex
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
getDelegatedKeyedStateBackend
isStateImmutableInStateBackend
createOrUpdateInternalState
public static final String MERGE_OPERATOR_NAME
protected final org.rocksdb.RocksDB db
AbstractRocksDBState
to store state. The different k/v states that we have don't each have their own RocksDB
instance. They all write to this instance but to their own column family.public RocksDBKeyedStateBackend(ClassLoader userCodeClassLoader, File instanceBasePath, RocksDBResourceContainer optionsContainer, Function<String,org.rocksdb.ColumnFamilyOptions> columnFamilyOptionsFactory, TaskKvStateRegistry kvStateRegistry, TypeSerializer<K> keySerializer, ExecutionConfig executionConfig, TtlTimeProvider ttlTimeProvider, LatencyTrackingStateConfig latencyTrackingStateConfig, org.rocksdb.RocksDB db, LinkedHashMap<String,RocksDBKeyedStateBackend.RocksDbKvStateInfo> kvStateInformation, Map<String,HeapPriorityQueueSnapshotRestoreWrapper<?>> registeredPQStates, int keyGroupPrefixBytes, CloseableRegistry cancelStreamRegistry, StreamCompressionDecorator keyGroupCompressionDecorator, ResourceGuard rocksDBResourceGuard, RocksDBSnapshotStrategyBase<K,?> checkpointSnapshotStrategy, RocksDBWriteBatchWrapper writeBatchWrapper, org.rocksdb.ColumnFamilyHandle defaultColumnFamilyHandle, RocksDBNativeMetricMonitor nativeMetricMonitor, SerializedCompositeKeyBuilder<K> sharedRocksKeyBuilder, PriorityQueueSetFactory priorityQueueFactory, RocksDbTtlCompactFiltersManager ttlCompactFiltersManager, InternalKeyContext<K> keyContext, @Nonnegative long writeBatchSize)
public <N> Stream<K> getKeys(String state, N namespace)
state
- State variable for which existing keys will be returned.namespace
- Namespace for which existing keys will be returned.public <N> Stream<Tuple2<K,N>> getKeysAndNamespaces(String state)
state
- State variable for which existing keys will be returned.public void setCurrentKey(K newKey)
KeyedStateBackend
setCurrentKey
in interface InternalKeyContext<K>
setCurrentKey
in interface KeyedStateBackend<K>
setCurrentKey
in class AbstractKeyedStateBackend<K>
newKey
- The new current key.KeyedStateBackend
public void dispose()
dispose
in interface KeyedStateBackend<K>
dispose
in interface Disposable
dispose
in class AbstractKeyedStateBackend<K>
@Nonnull public <T extends HeapPriorityQueueElement & PriorityComparable<? super T> & Keyed<?>> KeyGroupedInternalPriorityQueue<T> create(@Nonnull String stateName, @Nonnull TypeSerializer<T> byteOrderedElementSerializer)
PriorityQueueSetFactory
KeyGroupedInternalPriorityQueue
.T
- type of the stored elements.stateName
- unique name for associated with this queue.byteOrderedElementSerializer
- a serializer that with a format that is lexicographically
ordered in alignment with elementPriorityComparator.public <T extends HeapPriorityQueueElement & PriorityComparable<? super T> & Keyed<?>> KeyGroupedInternalPriorityQueue<T> create(@Nonnull String stateName, @Nonnull TypeSerializer<T> byteOrderedElementSerializer, boolean allowFutureMetadataUpdates)
PriorityQueueSetFactory
KeyGroupedInternalPriorityQueue
.T
- type of the stored elements.stateName
- unique name for associated with this queue.byteOrderedElementSerializer
- a serializer that with a format that is lexicographically
ordered in alignment with elementPriorityComparator.allowFutureMetadataUpdates
- whether allow metadata to update in the future or not.public int getKeyGroupPrefixBytes()
public org.rocksdb.WriteOptions getWriteOptions()
public org.rocksdb.ReadOptions getReadOptions()
@Nonnull public RunnableFuture<SnapshotResult<KeyedStateHandle>> snapshot(long checkpointId, long timestamp, @Nonnull CheckpointStreamFactory streamFactory, @Nonnull CheckpointOptions checkpointOptions) throws Exception
dispose()
. For
each backend, this method must always be called by the same thread.checkpointId
- The Id of the checkpoint.timestamp
- The timestamp of the checkpoint.streamFactory
- The factory that we can use for writing our state to streams.checkpointOptions
- Options for how to perform this checkpoint.Exception
- indicating a problem in the synchronous part of the checkpoint.@Nonnull public SavepointResources<K> savepoint() throws Exception
CheckpointableKeyedStateBackend
SavepointResources
that can be used by SavepointSnapshotStrategy
to
write out a savepoint in the common/unified format.Exception
public void notifyCheckpointComplete(long completedCheckpointId) throws Exception
CheckpointListener
checkpointId
completed and
was committed.
These notifications are "best effort", meaning they can sometimes be skipped. To behave
properly, implementers need to follow the "Checkpoint Subsuming Contract". Please see the
class-level JavaDocs
for details.
Please note that checkpoints may generally overlap, so you cannot assume that the notifyCheckpointComplete()
call is always for the latest prior checkpoint (or snapshot) that
was taken on the function/operator implementing this interface. It might be for a checkpoint
that was triggered earlier. Implementing the "Checkpoint Subsuming Contract" (see above)
properly handles this situation correctly as well.
Please note that throwing exceptions from this method will not cause the completed checkpoint to be revoked. Throwing exceptions will typically cause task/job failure and trigger recovery.
completedCheckpointId
- The ID of the checkpoint that has been completed.Exception
- This method can propagate exceptions, which leads to a failure/recovery for
the task. Note that this will NOT lead to the checkpoint being revoked.public void notifyCheckpointAborted(long checkpointId) throws Exception
CheckpointListener
Important: The fact that a checkpoint has been aborted does NOT mean that the data
and artifacts produced between the previous checkpoint and the aborted checkpoint are to be
discarded. The expected behavior is as if this checkpoint was never triggered in the first
place, and the next successful checkpoint simply covers a longer time span. See the
"Checkpoint Subsuming Contract" in the class-level JavaDocs
for
details.
These notifications are "best effort", meaning they can sometimes be skipped.
This method is very rarely necessary to implement. The "best effort" guarantee, together with the fact that this method should not result in discarding any data (per the "Checkpoint Subsuming Contract") means it is mainly useful for earlier cleanups of auxiliary resources. One example is to pro-actively clear a local per-checkpoint state cache upon checkpoint failure.
checkpointId
- The ID of the checkpoint that has been aborted.Exception
- This method can propagate exceptions, which leads to a failure/recovery for
the task or job.@Nonnull public <N,SV,SEV,S extends State,IS extends S> IS createOrUpdateInternalState(@Nonnull TypeSerializer<N> namespaceSerializer, @Nonnull StateDescriptor<S,SV> stateDesc, @Nonnull StateSnapshotTransformer.StateSnapshotTransformFactory<SEV> snapshotTransformFactory) throws Exception
KeyedStateFactory
InternalKvState
.N
- The type of the namespace.SV
- The type of the stored state value.SEV
- The type of the stored state value or entry for collection types (list or map).S
- The type of the public API state.IS
- The type of internal state.namespaceSerializer
- TypeSerializer for the state namespace.stateDesc
- The StateDescriptor
that contains the name of the state.snapshotTransformFactory
- factory of state snapshot transformer.Exception
@Nonnull public <N,SV,SEV,S extends State,IS extends S> IS createOrUpdateInternalState(@Nonnull TypeSerializer<N> namespaceSerializer, @Nonnull StateDescriptor<S,SV> stateDesc, @Nonnull StateSnapshotTransformer.StateSnapshotTransformFactory<SEV> snapshotTransformFactory, boolean allowFutureMetadataUpdates) throws Exception
KeyedStateFactory
InternalKvState
.N
- The type of the namespace.SV
- The type of the stored state value.SEV
- The type of the stored state value or entry for collection types (list or map).S
- The type of the public API state.IS
- The type of internal state.namespaceSerializer
- TypeSerializer for the state namespace.stateDesc
- The StateDescriptor
that contains the name of the state.snapshotTransformFactory
- factory of state snapshot transformer.allowFutureMetadataUpdates
- whether allow metadata to update in the future or not.Exception
@VisibleForTesting public int numKeyValueStateEntries()
TestableKeyedStateBackend
public boolean requiresLegacySynchronousTimerSnapshots(SnapshotType checkpointType)
requiresLegacySynchronousTimerSnapshots
in class AbstractKeyedStateBackend<K>
public boolean isSafeToReuseKVState()
KeyedStateBackend
NOTE: this method should not be used to check for InternalPriorityQueue
, as the
priority queue could be stored on different locations, e.g RocksDB state-backend could store
that on JVM heap if configuring HEAP as the time-service factory.
@VisibleForTesting public void compactState(StateDescriptor<?,?> stateDesc) throws org.rocksdb.RocksDBException
org.rocksdb.RocksDBException
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.