Class DataStream<T>

    • Constructor Detail

      • DataStream

        public DataStream​(StreamExecutionEnvironment environment,
                          Transformation<T> transformation)
        Creates a new DataStream in the given execution environment with partitioning set to forward by default.
        Parameters:
        environment - The StreamExecutionEnvironment
        transformation - The Transformation that produces the elements of this DataStream
    • Method Detail

      • getParallelism

        public int getParallelism()
        Gets the parallelism for this operator.
        Returns:
        The parallelism set for this operator.
      • getMinResources

        @PublicEvolving
        public ResourceSpec getMinResources()
        Gets the minimum resources for this operator.
        Returns:
        The minimum resources set for this operator.
      • getPreferredResources

        @PublicEvolving
        public ResourceSpec getPreferredResources()
        Gets the preferred resources for this operator.
        Returns:
        The preferred resources set for this operator.
      • getType

        public TypeInformation<T> getType()
        Gets the type of the stream.
        Returns:
        The type of the datastream.
      • clean

        protected <F> F clean​(F f)
        Invokes the ClosureCleaner on the given function if closure cleaning is enabled in the ExecutionConfig.
        Returns:
        The cleaned Function
      • union

        @SafeVarargs
        public final DataStream<T> union​(DataStream<T>... streams)
        Creates a new DataStream by merging DataStream outputs of the same type with each other. The DataStreams merged using this operator will be transformed simultaneously.
        Parameters:
        streams - The DataStreams to union output with.
        Returns:
        The DataStream.
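        A minimal usage sketch (illustrative; assumes three existing DataStream<String> instances named first, second, and third):

            DataStream<String> merged = first.union(second, third);
            // Elements from all three streams flow into any operator
            // applied to merged.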
      • connect

        public <R> ConnectedStreams<T,​R> connect​(DataStream<R> dataStream)
        Creates a new ConnectedStreams by connecting DataStream outputs of (possibly) different types with each other. The DataStreams connected using this operator can be used with CoFunctions to apply joint transformations.
        Parameters:
        dataStream - The DataStream with which this stream will be connected.
        Returns:
        The ConnectedStreams.
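        A minimal usage sketch (illustrative; the element types and the CoMapFunction logic are hypothetical):

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> control = env.fromElements("a", "b");
            DataStream<Integer> data = env.fromElements(1, 2, 3);

            ConnectedStreams<String, Integer> connected = control.connect(data);

            // A CoMapFunction applies a separate map function to each input.
            DataStream<String> result = connected.map(
                new CoMapFunction<String, Integer, String>() {
                    @Override
                    public String map1(String value) {
                        return "control: " + value;
                    }

                    @Override
                    public String map2(Integer value) {
                        return "data: " + value;
                    }
                });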
      • keyBy

        public <K> KeyedStream<T,​K> keyBy​(KeySelector<T,​K> key)
        Creates a new KeyedStream that uses the provided key for partitioning its operator states.
        Parameters:
        key - The KeySelector to be used for extracting the key for partitioning
        Returns:
        The DataStream with partitioned state (i.e. KeyedStream)
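        A minimal usage sketch (illustrative; assumes a DataStream<Tuple2<String, Integer>> named pairs):

            // Key by the first tuple field via a KeySelector lambda.
            KeyedStream<Tuple2<String, Integer>, String> byWord = pairs.keyBy(t -> t.f0);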
      • keyBy

        public <K> KeyedStream<T,​K> keyBy​(KeySelector<T,​K> key,
                                                TypeInformation<K> keyType)
        Creates a new KeyedStream that uses the provided key, with explicit type information, for partitioning its operator states.
        Parameters:
        key - The KeySelector to be used for extracting the key for partitioning.
        keyType - The type information describing the key type.
        Returns:
        The DataStream with partitioned state (i.e. KeyedStream)
      • partitionCustom

        public <K> DataStream<T> partitionCustom​(Partitioner<K> partitioner,
                                                 KeySelector<T,​K> keySelector)
        Partitions a DataStream on the key returned by the selector, using a custom partitioner. This method takes the key selector to get the key to partition on, and a partitioner that accepts the key type.

        Note: This method works only on single field keys, i.e. the selector cannot return tuples of fields.

        Parameters:
        partitioner - The partitioner to assign partitions to keys.
        keySelector - The KeySelector with which the DataStream is partitioned.
        Returns:
        The partitioned DataStream.
        See Also:
        KeySelector
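        A minimal usage sketch (illustrative; assumes a DataStream<Tuple2<String, Integer>> named pairs; anonymous classes are used to keep overload resolution unambiguous):

            DataStream<Tuple2<String, Integer>> partitioned = pairs.partitionCustom(
                new Partitioner<String>() {
                    @Override
                    public int partition(String key, int numPartitions) {
                        // Map each key to a partition index in [0, numPartitions).
                        return Math.abs(key.hashCode() % numPartitions);
                    }
                },
                new KeySelector<Tuple2<String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Integer> value) {
                        return value.f0; // single-field key, as required
                    }
                });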
      • broadcast

        public DataStream<T> broadcast()
        Sets the partitioning of the DataStream so that the output elements are broadcasted to every parallel instance of the next operation.
        Returns:
        The DataStream with broadcast partitioning set.
      • shuffle

        @PublicEvolving
        public DataStream<T> shuffle()
        Sets the partitioning of the DataStream so that the output elements are shuffled uniformly at random to the next operation.
        Returns:
        The DataStream with shuffle partitioning set.
      • forward

        public DataStream<T> forward()
        Sets the partitioning of the DataStream so that the output elements are forwarded to the local subtask of the next operation.
        Returns:
        The DataStream with forward partitioning set.
      • rebalance

        public DataStream<T> rebalance()
        Sets the partitioning of the DataStream so that the output elements are distributed evenly to instances of the next operation in a round-robin fashion.
        Returns:
        The DataStream with rebalance partitioning set.
      • rescale

        @PublicEvolving
        public DataStream<T> rescale()
        Sets the partitioning of the DataStream so that the output elements are distributed evenly to a subset of instances of the next operation in a round-robin fashion.

        The subset of downstream operations to which the upstream operation sends elements depends on the degree of parallelism of both the upstream and downstream operation. For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 4, then one upstream operation would distribute elements to two downstream operations while the other upstream operation would distribute to the other two downstream operations. If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism 4, then two upstream operations will distribute to one downstream operation while the other two upstream operations will distribute to the other downstream operation.

        In cases where the different parallelisms are not multiples of each other, one or several downstream operations will have a differing number of inputs from upstream operations.

        Returns:
        The DataStream with rescale partitioning set.
      • global

        @PublicEvolving
        public DataStream<T> global()
        Sets the partitioning of the DataStream so that the output values all go to the first instance of the next processing operator. Use this setting with care since it might cause a serious performance bottleneck in the application.
        Returns:
        The DataStream with global partitioning set.
      • map

        public <R> SingleOutputStreamOperator<R> map​(MapFunction<T,​R> mapper)
        Applies a Map transformation on a DataStream. The transformation calls a MapFunction for each element of the DataStream. Each MapFunction call returns exactly one element. The user can also extend RichMapFunction to gain access to other features provided by the RichFunction interface.
        Type Parameters:
        R - output type
        Parameters:
        mapper - The MapFunction that is called for each element of the DataStream.
        Returns:
        The transformed DataStream.
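        A minimal usage sketch (illustrative; assumes a DataStream<String> named lines):

            // Exactly one output element per input element. If type extraction
            // fails for the lambda, add an explicit hint: .returns(Types.INT)
            DataStream<Integer> lengths = lines.map(s -> s.length());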
      • flatMap

        public <R> SingleOutputStreamOperator<R> flatMap​(FlatMapFunction<T,​R> flatMapper)
        Applies a FlatMap transformation on a DataStream. The transformation calls a FlatMapFunction for each element of the DataStream. Each FlatMapFunction call can return any number of elements including none. The user can also extend RichFlatMapFunction to gain access to other features provided by the RichFunction interface.
        Type Parameters:
        R - output type
        Parameters:
        flatMapper - The FlatMapFunction that is called for each element of the DataStream
        Returns:
        The transformed DataStream.
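        A minimal usage sketch (illustrative; assumes a DataStream<String> named lines):

            // Zero or more output elements per input element. The .returns(...)
            // hint is needed because the Collector's type argument is erased
            // from the lambda.
            DataStream<String> words = lines
                .flatMap((String line, Collector<String> out) -> {
                    for (String word : line.split(" ")) {
                        out.collect(word);
                    }
                })
                .returns(Types.STRING);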
      • process

        @PublicEvolving
        public <R> SingleOutputStreamOperator<R> process​(ProcessFunction<T,​R> processFunction)
        Applies the given ProcessFunction on the input stream, thereby creating a transformed output stream.

        The function will be called for every element in the input streams and can produce zero or more output elements.

        Type Parameters:
        R - The type of elements emitted by the ProcessFunction.
        Parameters:
        processFunction - The ProcessFunction that is called for each element in the stream.
        Returns:
        The transformed DataStream.
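        A minimal usage sketch (illustrative; assumes a DataStream<Integer> named input, and the threshold logic is hypothetical):

            DataStream<String> alerts = input.process(
                new ProcessFunction<Integer, String>() {
                    @Override
                    public void processElement(Integer value, Context ctx, Collector<String> out) {
                        // Zero or more elements may be emitted per input element.
                        if (value > 100) {
                            out.collect("large value: " + value);
                        }
                    }
                });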
      • process

        @Internal
        public <R> SingleOutputStreamOperator<R> process​(ProcessFunction<T,​R> processFunction,
                                                         TypeInformation<R> outputType)
        Applies the given ProcessFunction on the input stream, thereby creating a transformed output stream.

        The function will be called for every element in the input streams and can produce zero or more output elements.

        Type Parameters:
        R - The type of elements emitted by the ProcessFunction.
        Parameters:
        processFunction - The ProcessFunction that is called for each element in the stream.
        outputType - TypeInformation for the result type of the function.
        Returns:
        The transformed DataStream.
      • filter

        public SingleOutputStreamOperator<T> filter​(FilterFunction<T> filter)
        Applies a Filter transformation on a DataStream. The transformation calls a FilterFunction for each element of the DataStream and retains only those elements for which the function returns true. Elements for which the function returns false are filtered out. The user can also extend RichFilterFunction to gain access to other features provided by the RichFunction interface.
        Parameters:
        filter - The FilterFunction that is called for each element of the DataStream.
        Returns:
        The filtered DataStream.
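        A minimal usage sketch (illustrative; assumes a DataStream<String> named lines):

            // Retains only non-empty lines; empty lines are dropped.
            DataStream<String> nonEmpty = lines.filter(s -> !s.isEmpty());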
      • project

        @PublicEvolving
        public <R extends Tuple> SingleOutputStreamOperator<R> project​(int... fieldIndexes)
        Initiates a Project transformation on a Tuple DataStream.
        Note: Only Tuple DataStreams can be projected.

        The transformation projects each Tuple of the DataStream onto a (sub)set of fields.

        Parameters:
        fieldIndexes - The field indexes of the input tuples that are retained. The order of fields in the output tuple corresponds to the order of field indexes.
        Returns:
        The projected DataStream
        See Also:
        Tuple, DataStream
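        A minimal usage sketch (illustrative; assumes a DataStream<Tuple3<String, Integer, Double>> named input):

            // Keep fields 2 and 0, in that order; the output tuple type
            // follows the order of the field indexes.
            DataStream<Tuple2<Double, String>> projected = input.project(2, 0);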
      • join

        public <T2> JoinedStreams<T,​T2> join​(DataStream<T2> otherStream)
        Creates a join operation. See JoinedStreams for an example of how the keys and window can be specified.
        Parameters:
        otherStream - The other DataStream to join with.
        Returns:
        The JoinedStreams.
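        A minimal usage sketch (illustrative; assumes two DataStream<Tuple2<String, Integer>> instances named first and second; the key selectors, window size, and join logic are hypothetical):

            DataStream<String> joined = first
                .join(second)
                .where(t -> t.f0)   // key of the first input
                .equalTo(t -> t.f0) // key of the second input
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
                    @Override
                    public String join(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
                        return a.f0 + ": " + (a.f1 + b.f1);
                    }
                });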
      • countWindowAll

        public AllWindowedStream<T,​GlobalWindow> countWindowAll​(long size)
        Windows this DataStream into tumbling count windows.

        Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance.

        Parameters:
        size - The size of the windows in number of elements.
        Returns:
        The windowed data stream.
      • countWindowAll

        public AllWindowedStream<T,​GlobalWindow> countWindowAll​(long size,
                                                                      long slide)
        Windows this DataStream into sliding count windows.

        Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance.

        Parameters:
        size - The size of the windows in number of elements.
        slide - The slide interval in number of elements.
        Returns:
        The windowed data stream.
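        A minimal usage sketch (illustrative; assumes a DataStream<Long> named numbers):

            // Sliding count windows: a window of 100 elements, evaluated
            // every 10 elements.
            DataStream<Long> sums = numbers
                .countWindowAll(100, 10)
                .reduce((a, b) -> a + b);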
      • windowAll

        @PublicEvolving
        public <W extends Window> AllWindowedStream<T,​W> windowAll​(WindowAssigner<? super T,​W> assigner)
        Windows this data stream into an AllWindowedStream, which evaluates windows over a non-key-grouped stream. Elements are put into windows by a WindowAssigner. The grouping of elements is done by window.

        A Trigger can be defined to specify when windows are evaluated. However, WindowAssigners have a default Trigger that is used if a Trigger is not specified.

        Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance.

        Parameters:
        assigner - The WindowAssigner that assigns elements to windows.
        Returns:
        The trigger windows data stream.
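        A minimal usage sketch (illustrative; assumes a DataStream<Long> named numbers and uses the TumblingProcessingTimeWindows assigner that ships with Flink; depending on the Flink version, the window size is expressed with Time.seconds(10) or Duration.ofSeconds(10)):

            // Non-parallel tumbling windows of 10 seconds over the whole stream.
            DataStream<Long> totals = numbers
                .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .reduce((a, b) -> a + b);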
      • print

        @PublicEvolving
        public DataStreamSink<T> print()
        Writes a DataStream to the standard output stream (stdout).

        For each element of the DataStream the result of Object.toString() is written.

        NOTE: This will print to stdout on the machine where the code is executed, i.e. the Flink worker.

        Returns:
        The closed DataStream.
      • printToErr

        @PublicEvolving
        public DataStreamSink<T> printToErr()
        Writes a DataStream to the standard error stream (stderr).

        For each element of the DataStream the result of Object.toString() is written.

        NOTE: This will print to stderr on the machine where the code is executed, i.e. the Flink worker.

        Returns:
        The closed DataStream.
      • print

        @PublicEvolving
        public DataStreamSink<T> print​(String sinkIdentifier)
        Writes a DataStream to the standard output stream (stdout).

        For each element of the DataStream the result of Object.toString() is written.

        NOTE: This will print to stdout on the machine where the code is executed, i.e. the Flink worker.

        Parameters:
        sinkIdentifier - The string to prefix the output with.
        Returns:
        The closed DataStream.
      • printToErr

        @PublicEvolving
        public DataStreamSink<T> printToErr​(String sinkIdentifier)
        Writes a DataStream to the standard error stream (stderr).

        For each element of the DataStream the result of Object.toString() is written.

        NOTE: This will print to stderr on the machine where the code is executed, i.e. the Flink worker.

        Parameters:
        sinkIdentifier - The string to prefix the output with.
        Returns:
        The closed DataStream.
      • writeToSocket

        @PublicEvolving
        public DataStreamSink<T> writeToSocket​(String hostName,
                                               int port,
                                               SerializationSchema<T> schema)
        Writes the DataStream to a socket as a byte array. The format of the output is specified by a SerializationSchema.
        Parameters:
        hostName - host of the socket
        port - port of the socket
        schema - schema for serialization
        Returns:
        the closed DataStream
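        A minimal usage sketch (illustrative; SimpleStringSchema is Flink's built-in SerializationSchema for strings, and the host and port are placeholders; assumes a DataStream<String> named lines):

            // Serializes each String element and writes it to the socket.
            lines.writeToSocket("localhost", 9999, new SimpleStringSchema());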
      • transform

        @PublicEvolving
        public <R> SingleOutputStreamOperator<R> transform​(String operatorName,
                                                           TypeInformation<R> outTypeInfo,
                                                           OneInputStreamOperatorFactory<T,​R> operatorFactory)
        Method for passing user-defined operators, created by the given factory, along with the type information that will transform the DataStream.

        This method uses the newer operator-factory mechanism and should only be used when custom factories are needed.

        Type Parameters:
        R - type of the return stream
        Parameters:
        operatorName - name of the operator, for logging purposes
        outTypeInfo - the output type of the operator
        operatorFactory - the factory for the operator.
        Returns:
        the data stream constructed.
      • setConnectionType

        protected DataStream<T> setConnectionType​(StreamPartitioner<T> partitioner)
        Internal function for setting the partitioner for the DataStream.
        Parameters:
        partitioner - Partitioner to set.
        Returns:
        The modified DataStream.
      • addSink

        public DataStreamSink<T> addSink​(SinkFunction<T> sinkFunction)
        Adds the given sink to this DataStream. Only streams with sinks added will be executed once the StreamExecutionEnvironment.execute() method is called.
        Parameters:
        sinkFunction - The object containing the sink's invoke function.
        Returns:
        The closed DataStream.
      • sinkTo

        @PublicEvolving
        public DataStreamSink<T> sinkTo​(Sink<T> sink,
                                        CustomSinkOperatorUidHashes customSinkOperatorUidHashes)
        Adds the given Sink to this DataStream. Only streams with sinks added will be executed once the StreamExecutionEnvironment.execute() method is called.

        This method is intended to be used only to recover a snapshot where no uids have been set before taking the snapshot.

        Parameters:
        sink - The user defined sink.
        customSinkOperatorUidHashes - operator hashes to support state binding
        Returns:
        The closed DataStream.
      • executeAndCollect

        public CloseableIterator<T> executeAndCollect()
                                               throws Exception
        Triggers the distributed execution of the streaming dataflow and returns an iterator over the elements of the given DataStream.

        The DataStream application is executed in the regular distributed manner on the target environment, and the events from the stream are polled back to this application process and thread through Flink's REST API.

        IMPORTANT The returned iterator must be closed to free all cluster resources.

        Throws:
        Exception
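        A minimal usage sketch (illustrative; the try-with-resources block closes the iterator and thereby frees the cluster resources, as required above; assumes a DataStream<Integer> named stream):

            try (CloseableIterator<Integer> it = stream.executeAndCollect()) {
                while (it.hasNext()) {
                    System.out.println(it.next());
                }
            }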
      • executeAndCollect

        public CloseableIterator<T> executeAndCollect​(String jobExecutionName)
                                               throws Exception
        Triggers the distributed execution of the streaming dataflow and returns an iterator over the elements of the given DataStream.

        The DataStream application is executed in the regular distributed manner on the target environment, and the events from the stream are polled back to this application process and thread through Flink's REST API.

        IMPORTANT The returned iterator must be closed to free all cluster resources.

        Throws:
        Exception
      • executeAndCollect

        public List<T> executeAndCollect​(int limit)
                                  throws Exception
        Triggers the distributed execution of the streaming dataflow and returns a list containing up to limit elements of the given DataStream.

        The DataStream application is executed in the regular distributed manner on the target environment, and the events from the stream are polled back to this application process and thread through Flink's REST API.

        Throws:
        Exception
      • executeAndCollect

        public List<T> executeAndCollect​(String jobExecutionName,
                                         int limit)
                                  throws Exception
        Triggers the distributed execution of the streaming dataflow and returns a list containing up to limit elements of the given DataStream.

        The DataStream application is executed in the regular distributed manner on the target environment, and the events from the stream are polled back to this application process and thread through Flink's REST API.

        Throws:
        Exception
      • collectAsync

        @Experimental
        public CloseableIterator<T> collectAsync()
        Sets up the collection of the elements in this DataStream, and returns an iterator over the collected elements that can be used to retrieve elements once the job execution has started.

        Caution: When multiple streams are being collected it is recommended to consume all streams in parallel to not back-pressure the job.

        Caution: Closing the returned iterator cancels the job! It is recommended to close all iterators once you are no longer interested in any of the collected streams.

        This method is functionally equivalent to collectAsync(Collector).

        Returns:
        iterator over the contained elements
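        A minimal usage sketch (illustrative; the job name is a placeholder, and the enclosing method is assumed to declare throws Exception; assumes a DataStream<Integer> named stream in environment env):

            try (CloseableIterator<Integer> it = stream.collectAsync()) {
                // The job is started separately, after collection is set up.
                env.executeAsync("collect-example");
                while (it.hasNext()) {
                    System.out.println(it.next());
                }
            } // closing the iterator cancels the job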
      • collectAsync

        @Experimental
        public void collectAsync​(DataStream.Collector<T> collector)
        Sets up the collection of the elements in this DataStream, which can be retrieved later via the given DataStream.Collector.

        Caution: When multiple streams are being collected it is recommended to consume all streams in parallel to not back-pressure the job.

        Caution: Closing the iterator from the collector cancels the job! It is recommended to close all iterators once you are no longer interested in any of the collected streams.

        This method is functionally equivalent to collectAsync().

        This method is meant to support use-cases where the application of a sink is done via a Consumer<DataStream<T>>, where it would not be possible (or would be inconvenient) to return an iterator.

        Parameters:
        collector - a collector that can be used to retrieve the elements
      • fullWindowPartition

        @PublicEvolving
        public PartitionWindowedStream<T> fullWindowPartition()
        Collects the records of each partition into a separate full window. Window emission is triggered at the end of the input. For this non-keyed data stream (each record has no key), a partition contains all records of a subtask.
        Returns:
        The full windowed data stream on partition.