pyflink.dataset package¶
Module contents¶
Important classes of Flink Batch API:
ExecutionEnvironment
: The ExecutionEnvironment is the context in which a batch program is executed.
-
class
pyflink.dataset.
ExecutionEnvironment
(j_execution_environment)[source]¶ Bases:
object
The ExecutionEnvironment is the context in which a program is executed.
The environment provides methods to control the job execution (such as setting the parallelism) and to interact with the outside world (data access).
-
add_default_kryo_serializer
(type_class_name: str, serializer_class_name: str)[source]¶ Adds a new Kryo default serializer to the Runtime.
Example:
>>> env.add_default_kryo_serializer("com.aaa.bbb.TypeClass", "com.aaa.bbb.Serializer")
- Parameters
type_class_name – The full-qualified java class name of the types serialized with the given serializer.
serializer_class_name – The full-qualified java class name of the serializer to use.
-
execute
(job_name: str = None) → pyflink.common.job_execution_result.JobExecutionResult[source]¶ Triggers the program execution. The environment will execute all parts of the program that have resulted in a “sink” operation.
The program execution will be logged and displayed with the given job name.
- Parameters
job_name – Desired name of the job, optional.
- Returns
The result of the job execution, containing elapsed time and accumulators.
-
get_config
() → pyflink.common.execution_config.ExecutionConfig[source]¶ Gets the config object that defines execution parameters.
- Returns
An
ExecutionConfig
object, the environment’s execution configuration.
-
get_default_local_parallelism
() → int[source]¶ Gets the default parallelism that will be used for the local execution environment.
- Returns
The parallelism.
-
static
get_execution_environment
() → pyflink.dataset.execution_environment.ExecutionEnvironment[source]¶ Creates an execution environment that represents the context in which the program is currently executed. If the program is invoked standalone, this method returns a local execution environment. If the program is invoked from within the command line client to be submitted to a cluster, this method returns the execution environment of this cluster.
- Returns
The
ExecutionEnvironment
of the context in which the program is executed.
-
get_execution_plan
() → str[source]¶ Creates the plan with which the system will execute the program, and returns it as a String using a JSON representation of the execution data flow graph. Note that this needs to be called, before the plan is executed.
If the compiler could not be instantiated, or the master could not be contacted to retrieve information relevant to the execution planning, an exception will be thrown.
- Returns
The execution plan of the program, as a JSON String.
-
get_parallelism
() → int[source]¶ Gets the parallelism with which operation are executed by default.
- Returns
The parallelism.
-
get_restart_strategy
() → pyflink.common.restart_strategy.RestartStrategyConfiguration[source]¶ Returns the specified restart strategy configuration.
- Returns
The restart strategy configuration to be used.
-
register_type
(type_class_name: str)[source]¶ Registers the given type with the serialization stack. If the type is eventually serialized as a POJO, then the type is registered with the POJO serializer. If the type ends up being serialized with Kryo, then it will be registered at Kryo to make sure that only tags are written.
Example:
>>> env.register_type("com.aaa.bbb.TypeClass")
- Parameters
type_class_name – The full-qualified java class name of the type to register.
-
register_type_with_kryo_serializer
(type_class_name: str, serializer_class_name: str)[source]¶ Registers the given Serializer via its class as a serializer for the given type at the KryoSerializer.
Example:
>>> env.register_type_with_kryo_serializer("com.aaa.bbb.TypeClass", ... "com.aaa.bbb.Serializer")
- Parameters
type_class_name – The full-qualified java class name of the types serialized with the given serializer.
serializer_class_name – The full-qualified java class name of the serializer to use.
-
set_default_local_parallelism
(parallelism: int)[source]¶ Sets the default parallelism that will be used for the local execution environment.
- Parameters
parallelism – The parallelism.
-
set_parallelism
(parallelism: int)[source]¶ Sets the parallelism for operations executed through this environment. Setting a parallelism of x here will cause all operators to run with x parallel instances.
- Parameters
parallelism – The parallelism.
-
set_restart_strategy
(restart_strategy_configuration: pyflink.common.restart_strategy.RestartStrategyConfiguration)[source]¶ Sets the restart strategy configuration. The configuration specifies which restart strategy will be used for the execution graph in case of a restart.
Example:
>>> env.set_restart_strategy(RestartStrategies.no_restart())
- Parameters
restart_strategy_configuration – Restart strategy configuration to be set.
-