Apache Pulsar Connector #

Flink provides an Apache Pulsar connector for reading and writing data from and to Pulsar topics with exactly-once guarantees.

Dependency #

You can use the connector with Pulsar 2.9.0 or higher. Because the Pulsar connector supports Pulsar transactions, Pulsar 2.10.0 or higher is recommended. Details on Pulsar compatibility can be found in PIP-72.

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-pulsar_2.11</artifactId>
    <version>1.14.4</version>
</dependency>

Flink’s streaming connectors are not part of the binary distribution. See how to link with them for cluster execution here.

Pulsar Source #

This part describes the Pulsar source based on the new data source API.

If you want to use the legacy SourceFunction, or your project runs on Flink 1.13 or lower, use StreamNative’s pulsar-flink connector instead.

Usage #

The Pulsar source provides a builder class for constructing a PulsarSource instance. The code snippet below builds a PulsarSource instance. It consumes messages from the earliest cursor of the topic “persistent://public/default/my-topic” in Exclusive subscription type (my-subscription) and deserializes the raw payload of the messages as strings.

PulsarSource<String> source = PulsarSource.builder()
    .setServiceUrl(serviceUrl)
    .setAdminUrl(adminUrl)
    .setStartCursor(StartCursor.earliest())
    .setTopics("my-topic")
    .setDeserializationSchema(PulsarDeserializationSchema.flinkSchema(new SimpleStringSchema()))
    .setSubscriptionName("my-subscription")
    .setSubscriptionType(SubscriptionType.Exclusive)
    .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "Pulsar Source");

The following properties are required for building a PulsarSource:

Pulsar service URL, configured by setServiceUrl(String)
Pulsar service HTTP URL (also known as admin URL), configured by setAdminUrl(String)
Pulsar subscription name, configured by setSubscriptionName(String)
Topics / partitions to subscribe, see the following topic-partition subscription for more details.
Deserializer to parse Pulsar messages, see the following deserializer for more details.

It is recommended to set the consumer name in Pulsar Source by setConsumerName(String). This sets a unique name for the Flink connector in the Pulsar statistic dashboard. You can use it to monitor the performance of your Flink connector and applications.

Topic-partition Subscription #

The Pulsar source provides two ways of subscribing to topic partitions:

Topic list, subscribing messages from all partitions in a list of topics. For example:

PulsarSource.builder().setTopics("some-topic1", "some-topic2");

// Partition 0 and 2 of topic "topic-a"
PulsarSource.builder().setTopics("topic-a-partition-0", "topic-a-partition-2");

Topic pattern, subscribing messages from all topics whose name matches the provided regular expression. For example:
```
PulsarSource.builder().setTopicPattern("topic-*");
```

Flexible Topic Naming #

Since Pulsar 2.0, all topic names internally are in a form of {persistent|non-persistent}://tenant/namespace/topic. Now, for partitioned topics, you can use short names in many cases (for the sake of simplicity). The flexible naming system stems from the fact that there is now a default topic type, tenant, and namespace in a Pulsar cluster.

Topic property	Default
topic type	`persistent`
tenant	`public`
namespace	`default`

This table lists a mapping relationship between your input topic name and the translated topic name:

Input topic name	Translated topic name
`my-topic`	`persistent://public/default/my-topic`
`my-tenant/my-namespace/my-topic`	`persistent://my-tenant/my-namespace/my-topic`

For non-persistent topics, you need to specify the entire topic name, as the default-based rules do not apply to non-persistent topics. Thus, you cannot use a short name like non-persistent://my-topic and need to use non-persistent://public/default/my-topic instead.
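
The translation rules above can be sketched in plain Java. This is only an illustration of the table and the non-persistent rule, not the connector's or Pulsar's own code; the class name `TopicNameSketch` and the method `toFullName` are hypothetical.

```java
// Illustrative only: mirrors the translation table above, not the connector's code.
final class TopicNameSketch {

    /** Expands a short topic name using the defaults persistent / public / default. */
    static String toFullName(String topic) {
        if (topic.contains("://")) {
            // A topic type is present, so the name must already be complete:
            // {persistent|non-persistent}://tenant/namespace/topic
            String rest = topic.substring(topic.indexOf("://") + 3);
            if (rest.split("/").length != 3) {
                throw new IllegalArgumentException("Incomplete topic name: " + topic);
            }
            return topic;
        }
        String[] parts = topic.split("/");
        if (parts.length == 1) {
            // Only the topic is given: apply the default tenant and namespace.
            return "persistent://public/default/" + parts[0];
        }
        if (parts.length == 3) {
            // tenant/namespace/topic is given: apply the default topic type only.
            return "persistent://" + topic;
        }
        throw new IllegalArgumentException("Unsupported short name: " + topic);
    }

    public static void main(String[] args) {
        System.out.println(toFullName("my-topic"));
        System.out.println(toFullName("my-tenant/my-namespace/my-topic"));
    }
}
```

In this sketch, a short non-persistent name such as non-persistent://my-topic is rejected, which matches the rule that non-persistent topics need the full name.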

Subscribing Pulsar Topic Partition #

Internally, Pulsar divides a partitioned topic into a set of non-partitioned topics, one per partition.

For example, if a simple-string topic with 3 partitions is created under the sample tenant with the flink namespace, the topics on Pulsar would be:

Topic name	Partitioned
`persistent://sample/flink/simple-string`	Y
`persistent://sample/flink/simple-string-partition-0`	N
`persistent://sample/flink/simple-string-partition-1`	N
`persistent://sample/flink/simple-string-partition-2`	N

You can directly consume messages from the topic partitions by using the non-partitioned topic names above. For example, PulsarSource.builder().setTopics("sample/flink/simple-string-partition-1", "sample/flink/simple-string-partition-2") consumes partitions 1 and 2 of the sample/flink/simple-string topic.
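
As a small illustration of the naming scheme in the table above, the following sketch derives the per-partition topic names from a partitioned topic name. The class and method names are hypothetical, and the code only reproduces the `-partition-N` suffix convention shown in the table.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: reproduces the "-partition-N" naming shown in the table above.
final class PartitionNameSketch {

    /** Returns the non-partitioned topic names that back a partitioned topic. */
    static List<String> partitionTopics(String partitionedTopic, int partitions) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            names.add(partitionedTopic + "-partition-" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : partitionTopics("persistent://sample/flink/simple-string", 3)) {
            System.out.println(name);
        }
    }
}
```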

Setting Topic Patterns #

The Pulsar source extracts the topic type (persistent or non-persistent) from the provided topic pattern. For example, you can use PulsarSource.builder().setTopicPattern("non-persistent://my-topic*") to specify a non-persistent topic. By default, the pattern is treated as matching persistent topics if you do not specify the topic type in the regular expression.

You can use setTopicPattern("topic-*", RegexSubscriptionMode.AllTopics) to consume both persistent and non-persistent topics based on the topic pattern. The Pulsar source would filter the available topics by the RegexSubscriptionMode.
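
The filtering step can be pictured with the following sketch. It is an assumption-laden illustration, not the connector's implementation: the local enum `Mode` is a stand-in that only reuses the constant names of Pulsar's RegexSubscriptionMode, and the class and method names are hypothetical. It shows the default mode derived from the pattern's topic type and the filtering of fully qualified topic names by mode; the regular expression matching itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: filters fully qualified topic names by subscription mode.
final class RegexModeSketch {

    /** Local stand-in reusing the constant names of Pulsar's RegexSubscriptionMode. */
    enum Mode { PersistentOnly, NonPersistentOnly, AllTopics }

    /** Derives the default mode from the topic type written in the pattern. */
    static Mode modeFromPattern(String topicPattern) {
        return topicPattern.startsWith("non-persistent://")
                ? Mode.NonPersistentOnly
                : Mode.PersistentOnly;
    }

    /** Keeps only the topics whose type is allowed by the given mode. */
    static List<String> filter(List<String> fullTopicNames, Mode mode) {
        List<String> kept = new ArrayList<>();
        for (String topic : fullTopicNames) {
            boolean persistent = topic.startsWith("persistent://");
            boolean keep = mode == Mode.AllTopics
                    || (mode == Mode.PersistentOnly && persistent)
                    || (mode == Mode.NonPersistentOnly && !persistent);
            if (keep) {
                kept.add(topic);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> topics = List.of(
                "persistent://public/default/topic-a",
                "non-persistent://public/default/topic-b");
        System.out.println(filter(topics, modeFromPattern("topic-*")));
        System.out.println(filter(topics, Mode.AllTopics));
    }
}
```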

Deserializer #

A deserializer (PulsarDeserializationSchema) is for decoding Pulsar messages from bytes. You can configure the deserializer using setDeserializationSchema(PulsarDeserializationSchema). The PulsarDeserializationSchema defines how to deserialize a Pulsar Message<byte[]>.

If only the raw payload of a message (message data in bytes) is needed, you can use the predefined PulsarDeserializationSchema. Pulsar connector provides three implementation methods.

Decode the message by using Pulsar’s Schema.

// Primitive types
PulsarDeserializationSchema.pulsarSchema(Schema);

// Struct types (JSON, Protobuf, Avro, etc.)
PulsarDeserializationSchema.pulsarSchema(Schema, Class);

// KeyValue type
PulsarDeserializationSchema.pulsarSchema(Schema, Class, Class);

Decode the message by using Flink’s DeserializationSchema

PulsarDeserializationSchema.flinkSchema(DeserializationSchema);

Decode the message by using Flink’s TypeInformation

PulsarDeserializationSchema.flinkTypeInfo(TypeInformation, ExecutionConfig);

A Pulsar Message<byte[]> contains some extra properties, such as the message key, message publish time, message time, and application-defined key/value pairs. These properties are accessible through the Message<byte[]> interface.

If you want to deserialize the Pulsar message by these properties, you need to implement PulsarDeserializationSchema. Ensure that the TypeInformation from the PulsarDeserializationSchema.getProducedType() is correct. Flink uses this TypeInformation to pass the messages to downstream operators.

Pulsar Subscriptions #

A Pulsar subscription is a named configuration rule that determines how messages are delivered to Flink readers. The subscription name is required for consuming messages. The Pulsar connector supports four subscription types: Exclusive, Shared, Failover, and Key_Shared.

There is no difference between Exclusive and Failover in the Pulsar connector. When a Flink reader crashes, all (non-acknowledged and subsequent) messages are redelivered to the available Flink readers.

By default, if no subscription type is defined, Pulsar source uses the Shared subscription type.

// Shared subscription with name "my-shared"
PulsarSource.builder().setSubscriptionName("my-shared");

// Exclusive subscription with name "my-exclusive"
PulsarSource.builder().setSubscriptionName("my-exclusive").setSubscriptionType(SubscriptionType.Exclusive);

Ensure that you provide a RangeGenerator implementation if you want to use the Key_Shared subscription type on the Pulsar connector. The RangeGenerator generates a set of key hash ranges so that a respective reader subtask only dispatches messages where the hash of the message key is contained in the specified range.

The Pulsar connector uses UniformRangeGenerator that divides the range by the Flink source parallelism if no RangeGenerator is provided in the Key_Shared subscription type.
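
To make the idea of dividing the key hash range by the source parallelism concrete, here is a minimal sketch. It assumes Pulsar's key hash space of 65536 slots (0 to 65535) and an even split. It is not the code of UniformRangeGenerator, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: an even split of the key hash range, one range per reader subtask.
final class UniformRangeSketch {

    // Assumption: message keys are hashed into 65536 slots (0..65535).
    static final int HASH_RANGE_SIZE = 65536;

    /** Returns one inclusive [start, end] range per reader subtask. */
    static List<int[]> split(int parallelism) {
        if (parallelism < 1 || parallelism > HASH_RANGE_SIZE) {
            throw new IllegalArgumentException("Invalid parallelism: " + parallelism);
        }
        List<int[]> ranges = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            int start = (int) ((long) i * HASH_RANGE_SIZE / parallelism);
            int end = (int) ((long) (i + 1) * HASH_RANGE_SIZE / parallelism) - 1;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    /** Returns the index of the subtask whose range contains the key hash. */
    static int ownerOf(int keyHash, List<int[]> ranges) {
        for (int i = 0; i < ranges.size(); i++) {
            if (keyHash >= ranges.get(i)[0] && keyHash <= ranges.get(i)[1]) {
                return i;
            }
        }
        throw new IllegalArgumentException("Hash out of range: " + keyHash);
    }

    public static void main(String[] args) {
        for (int[] range : split(4)) {
            System.out.println(range[0] + " - " + range[1]);
        }
    }
}
```

Because the ranges do not overlap and cover the whole hash space, every message key maps to exactly one subtask in this sketch.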

Starting Position #

The Pulsar source is able to consume messages starting from different positions by setting the setStartCursor(StartCursor) option. Built-in start cursors include:

Start from the earliest available message in the topic.
```
StartCursor.earliest();
```
Start from the latest available message in the topic.
```
StartCursor.latest();
```
Start from a specified message between the earliest and the latest. The Pulsar connector consumes from the latest available message if the message ID does not exist.

The start message is included in consuming result.
```
StartCursor.fromMessageId(MessageId);
```
Start from a specified message between the earliest and the latest. The Pulsar connector consumes from the latest available message if the message ID doesn’t exist.

Include or exclude the start message by using the second boolean parameter.
```
StartCursor.fromMessageId(MessageId, boolean);
```
Start from the specified message publish time, as returned by Message<byte[]>.getPublishTime(). This method is deprecated because its name is misleading. Use StartCursor.fromPublishTime(long) instead.
```
StartCursor.fromMessageTime(long);
```
Start from the specified message publish time by Message<byte[]>.getPublishTime().
```
StartCursor.fromPublishTime(long);
```

Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID (MessageId) of the message is ordered in that sequence. The MessageId contains some extra information (the ledger, entry, and partition) about how the message is stored. You can create a MessageId by using DefaultImplementation.newMessageId(long ledgerId, long entryId, int partitionIndex).
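
The following sketch illustrates how a message ID made of ledger, entry, and partition can be ordered, and what the inclusive flag of StartCursor.fromMessageId(MessageId, boolean) means for the first consumed message. The `Id` class is a hypothetical stand-in for MessageId, and the comparison order (ledger, then entry, then partition) is an assumption made for the illustration. The fallback to the latest position for an unknown ID follows the description above.

```java
import java.util.List;

// Illustrative only: ordering of message IDs and the inclusive/exclusive start position.
final class StartCursorSketch {

    /** Hypothetical stand-in for a Pulsar MessageId. */
    static final class Id implements Comparable<Id> {
        final long ledgerId;
        final long entryId;
        final int partitionIndex;

        Id(long ledgerId, long entryId, int partitionIndex) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.partitionIndex = partitionIndex;
        }

        @Override
        public int compareTo(Id other) {
            int c = Long.compare(ledgerId, other.ledgerId);
            if (c != 0) {
                return c;
            }
            c = Long.compare(entryId, other.entryId);
            if (c != 0) {
                return c;
            }
            return Integer.compare(partitionIndex, other.partitionIndex);
        }
    }

    /** Returns the index of the first message to consume from an ordered backlog. */
    static int firstIndex(List<Id> ordered, Id start, boolean inclusive) {
        for (int i = 0; i < ordered.size(); i++) {
            if (ordered.get(i).compareTo(start) == 0) {
                return inclusive ? i : i + 1;
            }
        }
        // Unknown message ID: start from the latest position (only new messages).
        return ordered.size();
    }

    public static void main(String[] args) {
        List<Id> backlog = List.of(new Id(1, 0, 0), new Id(1, 1, 0), new Id(2, 0, 0));
        System.out.println(firstIndex(backlog, new Id(1, 1, 0), true));
        System.out.println(firstIndex(backlog, new Id(1, 1, 0), false));
    }
}
```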

Boundedness #

The Pulsar source supports streaming and batch execution mode. By default, the PulsarSource is configured for unbounded data.

For unbounded data, the Pulsar source never stops until the Flink job is stopped or fails. You can use setUnboundedStopCursor(StopCursor) to make the Pulsar source stop at a specific position.

You can use setBoundedStopCursor(StopCursor) to specify a stop position for bounded data.

Built-in stop cursors include:

The Pulsar source never stops consuming messages.
```
StopCursor.never();
```
Stop at the latest available message when the Pulsar source starts consuming messages.
```
StopCursor.latest();
```
Stop when the connector meets a given message, or stop at a message which is produced after this given message.
```
StopCursor.atMessageId(MessageId);
```
Stop but include the given message in the consuming result.
```
StopCursor.afterMessageId(MessageId);
```
Stop at the specified event time by Message<byte[]>.getEventTime(). The message with the given event time won’t be included in the consuming result.
```
StopCursor.atEventTime(long);
```
Stop after the specified event time by Message<byte[]>.getEventTime(). The message with the given event time will be included in the consuming result.
```
StopCursor.afterEventTime(long);
```
Stop at the specified publish time by Message<byte[]>.getPublishTime(). The message with the given publish time won’t be included in the consuming result.
```
StopCursor.atPublishTime(long);
```
Stop after the specified publish time by Message<byte[]>.getPublishTime(). The message with the given publish time will be included in the consuming result.
```
StopCursor.afterPublishTime(long);
```
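
The difference between the "at" and "after" variants is whether the boundary message is part of the result. The sketch below models this with plain publish timestamps. It is an illustration of the semantics described above, not the connector's StopCursor implementation, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: "at" excludes the boundary message, "after" includes it.
final class StopCursorSketch {

    /**
     * Consumes publish times in order and stops at the boundary.
     * inclusive = false models atPublishTime, inclusive = true models afterPublishTime.
     */
    static List<Long> consume(List<Long> publishTimes, long stopTime, boolean inclusive) {
        List<Long> consumed = new ArrayList<>();
        for (long time : publishTimes) {
            boolean reachedStop = inclusive ? time > stopTime : time >= stopTime;
            if (reachedStop) {
                break;
            }
            consumed.add(time);
        }
        return consumed;
    }

    public static void main(String[] args) {
        List<Long> times = List.of(10L, 20L, 30L);
        System.out.println(consume(times, 20L, false)); // boundary message excluded
        System.out.println(consume(times, 20L, true));  // boundary message included
    }
}
```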

Source Configurable Options #

In addition to configuration options described above, you can set arbitrary options for PulsarClient, PulsarAdmin, Pulsar Consumer and PulsarSource by using setConfig(ConfigOption<T>, T), setConfig(Configuration) and setConfig(Properties).
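
As a minimal sketch of the Properties-based variant, the snippet below assembles a java.util.Properties object from option keys listed in the tables of this section. The chosen values are arbitrary examples, and the class name is hypothetical. The resulting object is what you would hand to setConfig(Properties) on the builder.

```java
import java.util.Properties;

// Illustrative only: option keys come from the tables in this section, values are examples.
final class SourceConfigSketch {

    static Properties sourceProperties() {
        Properties props = new Properties();
        // PulsarClient option
        props.setProperty("pulsar.client.operationTimeoutMs", "60000");
        // Pulsar Consumer option
        props.setProperty("pulsar.consumer.receiverQueueSize", "2000");
        // PulsarSource option
        props.setProperty("pulsar.source.partitionDiscoveryIntervalMs", "10000");
        return props;
    }

    public static void main(String[] args) {
        // Pass the result to PulsarSource.builder().setConfig(props) in a real job.
        sourceProperties().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```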

PulsarClient Options #

The Pulsar connector uses the client API to create the Consumer instance. The Pulsar connector extracts most parts of Pulsar’s ClientConfigurationData, which is required for creating a PulsarClient, as Flink configuration options in PulsarOptions.

Key	Default	Type	Description
pulsar.client.authParamMap	(none)	Map	Parameters for the authentication plugin.
pulsar.client.authParams	(none)	String	Parameters for the authentication plugin. Example: `key1:val1,key2:val2`
pulsar.client.authPluginClassName	(none)	String	Name of the authentication plugin.
pulsar.client.concurrentLookupRequest	5000	Integer	The number of concurrent lookup requests allowed on each broker connection, to prevent overloading the broker. Configure a higher value only if you need to produce to or subscribe to thousands of topics using a single `PulsarClient`.
pulsar.client.connectionTimeoutMs	10000	Integer	Duration (in ms) of waiting for a connection to a broker to be established. If the duration passes without a response from a broker, the connection attempt is dropped.
pulsar.client.connectionsPerBroker	1	Integer	The maximum number of connections that the client library will open to a single broker. By default, the connection pool will use a single connection for all the producers and consumers. Increasing this parameter may improve throughput when using many producers over a high latency connection.
pulsar.client.enableBusyWait	false	Boolean	Option to enable busy-wait settings. This option will enable spin-waiting on executors and IO threads in order to reduce latency during context switches. The spinning will consume 100% CPU even when the broker is not doing any work. It is recommended to reduce the number of IO threads and BookKeeper client threads to only have fewer CPU cores busy.
pulsar.client.enableTransaction	false	Boolean	If transaction is enabled, start the `transactionCoordinatorClient` with `PulsarClient`.
pulsar.client.initialBackoffIntervalNanos	100000000	Long	Default duration (in nanoseconds) for a backoff interval.
pulsar.client.keepAliveIntervalSeconds	30	Integer	Interval (in seconds) for keeping connection between the Pulsar client and broker alive.
pulsar.client.listenerName	(none)	String	Configure the `listenerName` that the broker will return the corresponding `advertisedListener`.
pulsar.client.maxBackoffIntervalNanos	60000000000	Long	The maximum duration (in nanoseconds) for a backoff interval.
pulsar.client.maxLookupRedirects	20	Integer	The maximum number of times a lookup request can be redirected to a broker.
pulsar.client.maxLookupRequest	50000	Integer	The maximum number of lookup requests allowed on each broker connection to prevent overload on the broker. It should be greater than `maxConcurrentLookupRequests`. Requests within `maxConcurrentLookupRequests` are sent to the broker immediately, while requests beyond `maxConcurrentLookupRequests` and under `maxLookupRequests` wait in each client connection.
pulsar.client.maxNumberOfRejectedRequestPerConnection	50	Integer	The maximum number of rejected requests of a broker in a certain period (30s) after the current connection is closed and the client creates a new connection to connect to a different broker.
pulsar.client.memoryLimitBytes	0	Long	The limit (in bytes) on the amount of direct memory that will be allocated by this client instance. Note: at this moment this is only limiting the memory for producers. Setting this to `0` will disable the limit.
pulsar.client.numIoThreads	1	Integer	The number of threads used for handling connections to brokers.
pulsar.client.numListenerThreads	1	Integer	The number of threads used for handling message listeners. The listener thread pool is shared across all the consumers and readers that are using a `listener` model to get messages. For a given consumer, the listener is always invoked from the same thread to ensure ordering.
pulsar.client.operationTimeoutMs	30000	Integer	Operation timeout (in ms). Operations such as creating producers, subscribing or unsubscribing topics are retried during this interval. If the operation is not completed during this interval, the operation will be marked as failed.
pulsar.client.proxyProtocol	SNI	Enum	Protocol type to determine the type of proxy routing when a client connects to the proxy using `pulsar.client.proxyServiceUrl`. Possible values: "SNI"
pulsar.client.proxyServiceUrl	(none)	String	Proxy-service URL when a client connects to the broker via the proxy. The client can choose the type of proxy-routing.
pulsar.client.requestTimeoutMs	60000	Integer	Maximum duration (in ms) for completing a request.
pulsar.client.serviceUrl	(none)	String	Service URL provider for Pulsar service. To connect to Pulsar using client libraries, you need to specify a Pulsar protocol URL. You can assign Pulsar protocol URLs to specific clusters and use the `pulsar` scheme. This is an example of `localhost`: `pulsar://localhost:6650`. If you have multiple brokers, the URL is as: `pulsar://localhost:6550,localhost:6651,localhost:6652` A URL for a production Pulsar cluster is as: `pulsar://pulsar.us-west.example.com:6650` If you use TLS authentication, the URL is as `pulsar+ssl://pulsar.us-west.example.com:6651`
pulsar.client.sslProvider	(none)	String	The name of the security provider used for SSL connections. The default value is the default security provider of the JVM.
pulsar.client.statsIntervalSeconds	60	Long	Interval between each stats info. Stats is activated with positive `statsInterval` Set `statsIntervalSeconds` to 1 second at least.
pulsar.client.tlsAllowInsecureConnection	false	Boolean	Whether the Pulsar client accepts untrusted TLS certificate from the broker.
pulsar.client.tlsCiphers		List<String>	A list of cipher suites. This is a named combination of authentication, encryption, MAC and the key exchange algorithm used to negotiate the security settings for a network connection using the TLS or SSL network protocol. By default all the available cipher suites are supported.
pulsar.client.tlsHostnameVerificationEnable	false	Boolean	Whether to enable TLS hostname verification. It allows to validate hostname verification when a client connects to the broker over TLS. It validates incoming x509 certificate and matches provided hostname (CN/SAN) with the expected broker's host name. It follows RFC 2818, 3.1. Server Identity hostname verification.
pulsar.client.tlsProtocols		List<String>	The SSL protocol used to generate the SSLContext. By default, it is set to TLS, which is fine for most cases. Allowed values in recent JVMs are TLS, TLSv1.3, TLSv1.2 and TLSv1.1.
pulsar.client.tlsTrustCertsFilePath	(none)	String	Path to the trusted TLS certificate file.
pulsar.client.tlsTrustStorePassword	(none)	String	The store password for the trust store file.
pulsar.client.tlsTrustStorePath	(none)	String	The location of the trust store file.
pulsar.client.tlsTrustStoreType	"JKS"	String	The file format of the trust store file.
pulsar.client.useKeyStoreTls	false	Boolean	If TLS is enabled, whether to use the KeyStore type as the TLS configuration parameter. If it is set to `false`, the default pem type configuration is used.
pulsar.client.useTcpNoDelay	true	Boolean	Whether to use the TCP no-delay flag on the connection to disable the Nagle algorithm. The no-delay feature ensures that packets are sent out on the network as soon as possible, which is critical for achieving low-latency publishes. On the other hand, sending out a huge number of small packets might limit the overall throughput. Therefore, if latency is not a concern, it is recommended to set the `useTcpNoDelay` flag to `false`. By default, it is set to `true`.

PulsarAdmin Options #

The admin API is used for querying topic metadata and for discovering the desired topics when the Pulsar connector uses topic-pattern subscription. It shares most of its configuration options with the client API. The configuration options listed here are only used in the admin API. They are also defined in PulsarOptions.

Key	Default	Type	Description
pulsar.admin.adminUrl	(none)	String	The Pulsar service HTTP URL for the admin endpoint. For example, `http://my-broker.example.com:8080`, or `https://my-broker.example.com:8443` for TLS.
pulsar.admin.autoCertRefreshTime	300000	Integer	The auto cert refresh time (in ms) if Pulsar admin supports TLS authentication.
pulsar.admin.connectTimeout	60000	Integer	The connection time out (in ms) for the PulsarAdmin client.
pulsar.admin.readTimeout	60000	Integer	The server response read timeout (in ms) for the PulsarAdmin client for any request.
pulsar.admin.requestTimeout	300000	Integer	The server request timeout (in ms) for the PulsarAdmin client for any request.

Pulsar Consumer Options #

In general, Pulsar provides the Reader API and Consumer API for consuming messages in different scenarios. The Pulsar connector uses the Consumer API. It extracts most parts of Pulsar’s ConsumerConfigurationData as Flink configuration options in PulsarSourceOptions.

Key	Default	Type	Description
pulsar.consumer.ackReceiptEnabled	false	Boolean	Acknowledgement will return a receipt but this does not mean that the message will not be resent after getting the receipt.
pulsar.consumer.ackTimeoutMillis	0	Long	The timeout (in ms) for unacknowledged messages, truncated to the nearest millisecond. The timeout needs to be greater than 1 second. By default, the acknowledge timeout is disabled and that means that messages delivered to a consumer will not be re-delivered unless the consumer crashes. When acknowledgement timeout being enabled, if a message is not acknowledged within the specified timeout it will be re-delivered to the consumer (possibly to a different consumer in case of a shared subscription).
pulsar.consumer.acknowledgementsGroupTimeMicros	100000	Long	Group a consumer acknowledgment for a specified time (in μs). By default, a consumer uses a `100ms` grouping time to send out acknowledgments to a broker. If the group time is set to `0`, acknowledgments are sent out immediately. A longer ack group time is more efficient at the expense of a slight increase in message re-deliveries after a failure.
pulsar.consumer.autoAckOldestChunkedMessageOnQueueFull	false	Boolean	Buffering a large number of outstanding uncompleted chunked messages can bring memory pressure and it can be guarded by providing this `pulsar.consumer.maxPendingChunkedMessage` threshold. Once a consumer reaches this threshold, it drops the outstanding unchunked-messages by silently acknowledging if `pulsar.consumer.autoAckOldestChunkedMessageOnQueueFull` is true. Otherwise, it marks them for redelivery.
pulsar.consumer.autoUpdatePartitionsIntervalSeconds	60	Integer	The interval (in seconds) of updating partitions. This only works if autoUpdatePartitions is enabled.
pulsar.consumer.consumerName	(none)	String	The consumer name is informative and it can be used to identify a particular consumer instance from the topic stats.
pulsar.consumer.cryptoFailureAction	FAIL	Enum	The action the consumer takes when it receives a message that cannot be decrypted. `FAIL`: this is the default option to fail messages until crypto succeeds. `DISCARD`: silently acknowledge but do not deliver messages to an application. `CONSUME`: deliver encrypted messages to applications. It is the application's responsibility to decrypt the message. In this mode, decompression of the message fails, and if a message contains batched messages, the client is not able to retrieve the individual messages in the batch. The delivered encrypted message contains an `EncryptionContext`, which holds the encryption and compression information that an application can use to decrypt the consumed message payload. Possible values: "FAIL" "DISCARD" "CONSUME"
pulsar.consumer.deadLetterPolicy.deadLetterTopic	(none)	String	Name of the dead topic where the failed messages are sent.
pulsar.consumer.deadLetterPolicy.maxRedeliverCount	0	Integer	The maximum number of times that a message is redelivered before being sent to the dead letter queue.
pulsar.consumer.deadLetterPolicy.retryLetterTopic	(none)	String	Name of the retry topic where the failed messages are sent.
pulsar.consumer.expireTimeOfIncompleteChunkedMessageMillis	60000	Long	If a producer fails to publish all the chunks of a message, the consumer can expire the incomplete chunks if it cannot receive all chunks within the expiry time (default 1 minute, in ms).
pulsar.consumer.maxPendingChunkedMessage	10	Integer	The consumer buffers chunk messages into memory until it receives all the chunks of the original message. While consuming chunk-messages, chunks from the same message might not be contiguous in the stream and they might be mixed with other messages' chunks. So, the consumer has to maintain multiple buffers to manage chunks coming from different messages. This mainly happens when multiple publishers are publishing messages on the topic concurrently or publishers failed to publish all chunks of the messages. For example, there are M1-C1, M2-C1, M1-C2, M2-C2 messages. Messages M1-C1 and M1-C2 belong to the M1 original message while M2-C1 and M2-C2 belong to the M2 message. Buffering a large number of outstanding uncompleted chunked messages can bring memory pressure and it can be guarded by providing this `pulsar.consumer.maxPendingChunkedMessage` threshold. Once a consumer reaches this threshold, it drops the outstanding unchunked messages by silently acknowledging or asking the broker to redeliver messages later by marking it unacknowledged. This behavior can be controlled by the `pulsar.consumer.autoAckOldestChunkedMessageOnQueueFull` option.
pulsar.consumer.maxTotalReceiverQueueSizeAcrossPartitions	50000	Integer	The maximum total receiver queue size across partitions. This setting reduces the receiver queue size for individual partitions if the total receiver queue size exceeds this value.
pulsar.consumer.negativeAckRedeliveryDelayMicros	60000000	Long	Delay (in μs) to wait before redelivering messages that failed to be processed. When an application uses `Consumer.negativeAcknowledge(Message)`, failed messages are redelivered after a fixed timeout.
pulsar.consumer.poolMessages	false	Boolean	Enable pooling of messages and the underlying data buffers.
pulsar.consumer.priorityLevel	0	Integer	Priority level for a consumer to which a broker gives more priority while dispatching messages in the shared subscription type. The broker follows descending priorities, for example, 0=max-priority, 1, 2,... In shared subscription mode, the broker first dispatches messages to the consumers on the highest priority level if they have permits. Otherwise, the broker considers consumers on the next priority level. Example 1: If a subscription has consumer A with `priorityLevel` 0 and consumer B with `priorityLevel` 1, then the broker only dispatches messages to consumer A until it runs out of permits and then starts dispatching messages to consumer B. Example 2: Given the consumers (name, priority level, permits) C1 (0, 2), C2 (0, 1), C3 (0, 1), C4 (1, 2), and C5 (1, 1), the order in which a broker dispatches messages to consumers is: C1, C2, C3, C1, C4, C5, C4.
pulsar.consumer.properties		Map	A name or value property of this consumer. `properties` is application defined metadata attached to a consumer. When getting a topic stats, associate this metadata with the consumer stats for easier identification.
pulsar.consumer.readCompacted	false	Boolean	If `readCompacted` is enabled, a consumer reads messages from a compacted topic rather than reading the full message backlog of the topic. A consumer only sees the latest value for each key in the compacted topic, up to the point in the topic backlog that has been compacted. Beyond that point, messages are sent as normal. Only enable `readCompacted` on subscriptions to persistent topics that have a single active consumer (like failover or exclusive subscriptions). Attempting to enable it on subscriptions to non-persistent topics or on shared subscriptions leads to a subscription call throwing a `PulsarClientException`.
pulsar.consumer.receiverQueueSize	1000	Integer	Size of a consumer's receiver queue. For example, the number of messages accumulated by a consumer before an application calls `Receive`. A value higher than the default value increases consumer throughput, though at the expense of more memory utilization.
pulsar.consumer.replicateSubscriptionState	false	Boolean	If `replicateSubscriptionState` is enabled, a subscription state is replicated to geo-replicated clusters.
pulsar.consumer.retryEnable	false	Boolean	If enabled, the consumer will automatically retry messages.
pulsar.consumer.subscriptionMode	Durable	Enum	Select the subscription mode to be used when subscribing to the topic. `Durable`: Make the subscription backed by a durable cursor that retains messages and persists the current position. `NonDurable`: Lightweight subscription mode that doesn't have a durable cursor associated. Possible values: "Durable" "NonDurable"
pulsar.consumer.subscriptionName	(none)	String	Specify the subscription name for this consumer. This argument is required when constructing the consumer.
pulsar.consumer.subscriptionType	Shared	Enum	Subscription type. Four subscription types are available: Exclusive Failover Shared Key_Shared Possible values: "Exclusive" "Shared" "Failover" "Key_Shared"
pulsar.consumer.tickDurationMillis	1000	Long	Granularity (in ms) of the ack-timeout redelivery. A greater (for example, 1 hour) `tickDurationMillis` reduces the memory overhead to track messages.
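
The dispatch order given in Example 2 of the pulsar.consumer.priorityLevel description can be reproduced with a small simulation. The sketch assumes that enough messages are available to use up every permit and that the broker round-robins among consumers of the same priority level. It is not the broker's dispatcher, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: round-robin dispatch within a priority level, lowest level number first.
final class PriorityDispatchSketch {

    /** Hypothetical consumer with a priority level and a number of permits. */
    static final class Consumer {
        final String name;
        final int priorityLevel;
        int permits;

        Consumer(String name, int priorityLevel, int permits) {
            this.name = name;
            this.priorityLevel = priorityLevel;
            this.permits = permits;
        }
    }

    /** Returns the consumer names in dispatch order; permits are used up as a side effect. */
    static List<String> dispatchOrder(List<Consumer> consumers) {
        TreeSet<Integer> levels = new TreeSet<>();
        for (Consumer c : consumers) {
            levels.add(c.priorityLevel);
        }
        List<String> order = new ArrayList<>();
        for (int level : levels) {
            boolean dispatched = true;
            // Keep cycling through this level until nobody on it has permits left.
            while (dispatched) {
                dispatched = false;
                for (Consumer c : consumers) {
                    if (c.priorityLevel == level && c.permits > 0) {
                        order.add(c.name);
                        c.permits--;
                        dispatched = true;
                    }
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<Consumer> consumers = List.of(
                new Consumer("C1", 0, 2), new Consumer("C2", 0, 1), new Consumer("C3", 0, 1),
                new Consumer("C4", 1, 2), new Consumer("C5", 1, 1));
        System.out.println(String.join(", ", dispatchOrder(consumers)));
    }
}
```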

PulsarSource Options #

The configuration options below are mainly used for customizing the performance and message acknowledgement behavior. You can ignore them if you do not have any performance issues.

Key	Default	Type	Description
pulsar.source.autoCommitCursorInterval	5000	Long	This option is used only when the user disables the checkpoint and uses Exclusive or Failover subscription. We would automatically commit the cursor using the given period (in ms).
pulsar.source.enableAutoAcknowledgeMessage	false	Boolean	Flink commits the consuming position with pulsar transactions on checkpoint. However, if you have disabled the Flink checkpoint or disabled transaction for your Pulsar cluster, ensure that you have set this option to `true`. The source would use pulsar client's internal mechanism and commit cursor in two ways. For `Key_Shared` and `Shared` subscription, the cursor would be committed once the message is consumed. For `Exclusive` and `Failover` subscription, the cursor would be committed in a given interval.
pulsar.source.maxFetchRecords	100	Integer	The maximum number of records to fetch in a single poll. A larger value increases throughput but also latency. A fetch batch might be finished earlier because of `pulsar.source.maxFetchTime`.
pulsar.source.maxFetchTime	10000	Long	The maximum time (in ms) to wait when fetching records. A longer time increases throughput but also latency. A fetch batch might be finished earlier because of `pulsar.source.maxFetchRecords`.
pulsar.source.partitionDiscoveryIntervalMs	30000	Long	The interval (in ms) for the Pulsar source to discover the new partitions. A non-positive value disables the partition discovery.
pulsar.source.transactionTimeoutMillis	10800000	Long	This option is used in `Shared` or `Key_Shared` subscription. You should configure this option when you do not enable the `pulsar.source.enableAutoAcknowledgeMessage` option. The value (in ms) should be greater than the checkpoint interval.
pulsar.source.verifyInitialOffsets	WARN_ON_MISMATCH	Enum	Upon (re)starting the source, check whether the expected message can be read. If failure is enabled, the application fails. Otherwise, it logs a warning. A possible solution is to adjust the retention settings in Pulsar or ignoring the check result. Possible values: "FAIL_ON_MISMATCH" "WARN_ON_MISMATCH"
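
The interplay between pulsar.source.maxFetchRecords and pulsar.source.maxFetchTime can be pictured as a loop that ends as soon as either limit is hit. The sketch below uses a plain BlockingQueue as a stand-in for the consumer. It is an assumption about the general shape of such a loop, not the connector's reader code, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative only: a fetch batch ends on the record limit or the time limit.
final class FetchLoopSketch {

    static <T> List<T> fetch(BlockingQueue<T> queue, int maxFetchRecords, long maxFetchTimeMs) {
        List<T> batch = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxFetchTimeMs);
        while (batch.size() < maxFetchRecords) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                break; // time limit reached
            }
            try {
                T record = queue.poll(remainingNanos, TimeUnit.NANOSECONDS);
                if (record == null) {
                    break; // nothing arrived before the deadline
                }
                batch.add(record);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(List.of(1, 2, 3, 4, 5));
        System.out.println(fetch(queue, 3, 1000)); // ends on the record limit
        System.out.println(fetch(queue, 100, 50)); // ends on the time limit
    }
}
```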

Dynamic Partition Discovery #

To handle scenarios like topic scaling-out or topic creation without restarting the Flink job, the Pulsar source periodically discovers new partitions under a provided topic-partition subscription pattern. To enable partition discovery, you can set a positive value for the PulsarSourceOptions.PULSAR_PARTITION_DISCOVERY_INTERVAL_MS option:

// Discover new partitions every 10 seconds. The option is of type Long, so the value needs the L suffix.
PulsarSource.builder()
    .setConfig(PulsarSourceOptions.PULSAR_PARTITION_DISCOVERY_INTERVAL_MS, 10000L);

Partition discovery is enabled by default. The Pulsar connector queries the topic metadata every 30 seconds.

To disable partition discovery, set a non-positive partition discovery interval.

Partition discovery is disabled for bounded data even if you set this option to a positive value.
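
The rules in this section can be condensed into a single predicate. The following is a minimal sketch in plain Java, not connector code: the class and method names are hypothetical, and it assumes that zero counts as disabled, as the description of pulsar.source.partitionDiscoveryIntervalMs in the option table states.

```java
/** Hypothetical helper that mirrors the documented rules; not part of the connector. */
final class PartitionDiscoveryRule {

    /**
     * Returns true if the source would periodically look for new partitions.
     * Discovery needs an unbounded source and a positive interval (in ms).
     */
    static boolean isEnabled(long discoveryIntervalMs, boolean bounded) {
        return !bounded && discoveryIntervalMs > 0;
    }

    public static void main(String[] args) {
        System.out.println(isEnabled(30_000L, false)); // default interval, unbounded: true
        System.out.println(isEnabled(-1L, false));     // non-positive interval: false
        System.out.println(isEnabled(10_000L, true));  // bounded data: false
    }
}
```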

Event Time and Watermarks #

By default, the source uses the timestamp embedded in the Pulsar Message<byte[]> as the event time. You can define your own WatermarkStrategy to extract the event time from the message and emit watermarks downstream:

env.fromSource(pulsarSource, new CustomWatermarkStrategy(), "Pulsar Source With Custom Watermark Strategy");

This documentation describes details about how to define a WatermarkStrategy.

Message Acknowledgement #

When a subscription is created, Pulsar retains all messages, even if the consumer is disconnected. The retained messages are discarded only when the connector acknowledges that all these messages are processed successfully. The Pulsar connector supports four subscription types, and the acknowledgement behavior varies among them.

Acknowledgement on Exclusive and Failover Subscription Types #

Exclusive and Failover subscription types support cumulative acknowledgement. In these subscription types, Flink only needs to acknowledge the latest successfully consumed message. All messages before the given message are marked as consumed.

The Pulsar source acknowledges the message currently being consumed when a checkpoint completes, to keep Flink’s checkpoint state consistent with the committed position on the Pulsar brokers.

If checkpointing is disabled, the Pulsar source periodically acknowledges messages. You can use the PulsarSourceOptions.PULSAR_AUTO_COMMIT_CURSOR_INTERVAL option to set the acknowledgement period.

Pulsar source does NOT rely on committed positions for fault tolerance. Acknowledging messages is only for exposing the progress of consumers and monitoring on these two subscription types.
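
To illustrate what cumulative acknowledgement means, here is a toy model in plain Java. It is only a sketch: AckSketch and its methods are hypothetical names, it does not use the Pulsar client API, and message IDs are simplified to increasing long values. An individual acknowledgement is included for contrast.

```java
import java.util.TreeSet;

/** Toy model of a subscription cursor; not the Pulsar client API. */
final class AckSketch {

    // IDs of messages that were delivered but not yet acknowledged, kept in order.
    private final TreeSet<Long> pending = new TreeSet<>();

    void deliver(long messageId) {
        pending.add(messageId);
    }

    /** Cumulative ack: marks the given message and every earlier one as consumed. */
    void acknowledgeCumulative(long messageId) {
        pending.headSet(messageId, true).clear();
    }

    /** Individual ack: marks only the given message as consumed. */
    void acknowledge(long messageId) {
        pending.remove(messageId);
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        AckSketch cumulative = new AckSketch();
        AckSketch individual = new AckSketch();
        for (long id = 1; id <= 5; id++) {
            cumulative.deliver(id);
            individual.deliver(id);
        }
        cumulative.acknowledgeCumulative(4); // messages 1..4 are consumed
        individual.acknowledge(4);           // only message 4 is consumed
        System.out.println(cumulative.pendingCount()); // prints 1
        System.out.println(individual.pendingCount()); // prints 4
    }
}
```

With cumulative acknowledgement a single call per checkpoint is enough to advance the position, which is why only the latest successfully consumed message needs to be acknowledged.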

Acknowledgement on Shared and Key_Shared Subscription Types #

In Shared and Key_Shared subscription types, messages are acknowledged one by one. You can acknowledge a message in a transaction and commit it to Pulsar.

You should enable transactions in the Pulsar broker.conf file when using these two subscription types in the connector:

transactionCoordinatorEnabled=true

The default timeout for Pulsar transactions is 3 hours. Make sure that the timeout is greater than the checkpoint interval plus the maximum recovery time. A shorter checkpoint interval gives better consuming performance. You can use the PulsarSourceOptions.PULSAR_TRANSACTION_TIMEOUT_MILLIS option to change the transaction timeout.
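
The sizing rule for the transaction timeout is simple arithmetic, sketched below in plain Java. The class name, the method name, and the example durations (a 10-minute checkpoint interval and a 30-minute recovery budget) are assumptions chosen for illustration; only the 3-hour default comes from the connector.

```java
import java.time.Duration;

/** Hypothetical sanity check for the transaction timeout; not part of the connector. */
final class TransactionTimeoutCheck {

    /** True if the timeout exceeds checkpoint interval + maximum recovery time. */
    static boolean isSufficient(
            Duration transactionTimeout,
            Duration checkpointInterval,
            Duration maxRecoveryTime) {
        return transactionTimeout.compareTo(checkpointInterval.plus(maxRecoveryTime)) > 0;
    }

    public static void main(String[] args) {
        Duration defaultTimeout = Duration.ofMillis(10_800_000L); // 3 hours, the default

        // Assumed job settings, for illustration only.
        Duration checkpointInterval = Duration.ofMinutes(10);
        Duration maxRecoveryTime = Duration.ofMinutes(30);

        // 3 h > 10 min + 30 min
        System.out.println(isSufficient(defaultTimeout, checkpointInterval, maxRecoveryTime)); // prints true
        // 5 min is not greater than 40 min
        System.out.println(isSufficient(Duration.ofMinutes(5), checkpointInterval, maxRecoveryTime)); // prints false
    }
}
```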

If checkpointing is disabled or you cannot enable transactions on the Pulsar broker, you should set PulsarSourceOptions.PULSAR_ENABLE_AUTO_ACKNOWLEDGE_MESSAGE to true. The message is then acknowledged immediately after it is consumed. No consistency guarantees can be made in this scenario.

All acknowledgements in a transaction are recorded on the Pulsar broker side.

Upgrading to the Latest Connector Version #

The generic upgrade steps are outlined in the upgrading jobs and Flink versions guide. The Pulsar connector does not store any state on the Flink side; it pushes and stores all state on the Pulsar side. For Pulsar, you additionally need to know these limitations:

Do not upgrade the Pulsar connector and Pulsar broker version at the same time.
Always use a newer Pulsar client with the Pulsar connector to consume messages from Pulsar.

Troubleshooting #

If you have a problem with Pulsar when using Flink, keep in mind that Flink only wraps PulsarClient or PulsarAdmin, so your problem might be independent of Flink. It can sometimes be solved by upgrading the Pulsar brokers, reconfiguring the Pulsar brokers, or reconfiguring the Pulsar connector in Flink.

Messages can be delayed on low volume topics #

When the Pulsar source connector reads from a low volume topic, users might observe a 10-second delay between messages. The Pulsar source buffers messages from topics by default. Before records are emitted to downstream operators, the number of buffered records must be equal to or larger than PulsarSourceOptions.PULSAR_MAX_FETCH_RECORDS. If the data volume is low, filling the buffer can take longer than PULSAR_MAX_FETCH_TIME (10 seconds by default). In that case, the messages are emitted only after this time has passed.

To avoid this behaviour, you need to change either the buffered records or the waiting time.
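
The interaction between the two limits can be approximated with a small model. The sketch below is plain Java with hypothetical names and a simplifying assumption that messages arrive at a fixed interval; it is not the connector's actual fetch loop, only an estimate of when a batch would be released.

```java
/** Simplified model of when a fetch batch is released; not connector code. */
final class FetchBatchModel {

    /**
     * Estimates the time (in ms) until a batch is emitted, assuming one message
     * arrives every arrivalIntervalMs. The batch is released when either limit
     * is reached first: the record count or the maximum fetch time.
     */
    static long batchReleaseMs(long arrivalIntervalMs, int maxFetchRecords, long maxFetchTimeMs) {
        long timeToFillMs = arrivalIntervalMs * maxFetchRecords;
        return Math.min(timeToFillMs, maxFetchTimeMs);
    }

    public static void main(String[] args) {
        // Defaults (100 records, 10000 ms) on a topic with one message per second:
        // filling the batch would take 100 s, so the time limit decides.
        System.out.println(batchReleaseMs(1_000L, 100, 10_000L)); // prints 10000

        // Same topic with the record limit lowered to 1: each message is released
        // as soon as it arrives.
        System.out.println(batchReleaseMs(1_000L, 1, 10_000L)); // prints 1000

        // High volume topic (one message every 10 ms): the record limit decides.
        System.out.println(batchReleaseMs(10L, 100, 10_000L)); // prints 1000
    }
}
```

Under this model, lowering either PULSAR_MAX_FETCH_RECORDS or PULSAR_MAX_FETCH_TIME reduces the delay on low volume topics, at the cost of smaller batches.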
