@Public public abstract class DataSet<T> extends Object
T - The type of the DataSet, i.e., the type of the elements of the DataSet.
A DataSet can be transformed into another DataSet by applying a transformation such as map(MapFunction), reduce(ReduceFunction), join(DataSet), or coGroup(DataSet).
Modifier and Type | Field and Description |
---|---|
protected ExecutionEnvironment | context |
Modifier | Constructor and Description |
---|---|
protected | DataSet(ExecutionEnvironment context, TypeInformation<T> typeInfo) |
Modifier and Type | Method and Description |
---|---|
AggregateOperator<T> | aggregate(Aggregations agg, int field) |
protected static void | checkSameExecutionContext(DataSet<?> set1, DataSet<?> set2) |
<F> F | clean(F f) |
<R> CoGroupOperator.CoGroupOperatorSets<T,R> | coGroup(DataSet<R> other): Initiates a CoGroup transformation. |
List<T> | collect(): Convenience method to get the elements of a DataSet as a List. |
<R> GroupCombineOperator<T,R> | combineGroup(GroupCombineFunction<T,R> combiner): Applies a GroupCombineFunction on a non-grouped DataSet. |
long | count(): Convenience method to get the count (number of elements) of a DataSet. |
<R> CrossOperator.DefaultCross<T,R> | cross(DataSet<R> other): Initiates a Cross transformation. |
<R> CrossOperator.DefaultCross<T,R> | crossWithHuge(DataSet<R> other): Initiates a Cross transformation. |
<R> CrossOperator.DefaultCross<T,R> | crossWithTiny(DataSet<R> other): Initiates a Cross transformation. |
DistinctOperator<T> | distinct(): Returns a distinct set of a DataSet. |
DistinctOperator<T> | distinct(int... fields) |
<K> DistinctOperator<T> | distinct(KeySelector<T,K> keyExtractor): Returns a distinct set of a DataSet using a KeySelector function. |
DistinctOperator<T> | distinct(String... fields): Returns a distinct set of a DataSet using expression keys. |
protected void | fillInType(TypeInformation<T> typeInfo): Tries to fill in the type information. |
FilterOperator<T> | filter(FilterFunction<T> filter): Applies a Filter transformation on a DataSet. |
GroupReduceOperator<T,T> | first(int n): Returns a new set containing the first n elements in this DataSet. |
<R> FlatMapOperator<T,R> | flatMap(FlatMapFunction<T,R> flatMapper): Applies a FlatMap transformation on a DataSet. |
<R> JoinOperatorSetsBase<T,R> | fullOuterJoin(DataSet<R> other): Initiates a Full Outer Join transformation. |
<R> JoinOperatorSetsBase<T,R> | fullOuterJoin(DataSet<R> other, JoinOperatorBase.JoinHint strategy): Initiates a Full Outer Join transformation. |
ExecutionEnvironment | getExecutionEnvironment(): Returns the ExecutionEnvironment in which this DataSet is registered. |
TypeInformation<T> | getType(): Returns the TypeInformation for the type of this DataSet. |
UnsortedGrouping<T> | groupBy(int... fields) |
<K> UnsortedGrouping<T> | groupBy(KeySelector<T,K> keyExtractor): Groups a DataSet using a KeySelector function. |
UnsortedGrouping<T> | groupBy(String... fields): Groups a DataSet using field expressions. |
IterativeDataSet<T> | iterate(int maxIterations): Initiates an iterative part of the program that executes multiple times and feeds back data sets. |
<R> DeltaIteration<T,R> | iterateDelta(DataSet<R> workset, int maxIterations, int... keyPositions): Initiates a delta iteration. |
<R> JoinOperator.JoinOperatorSets<T,R> | join(DataSet<R> other): Initiates a Join transformation. |
<R> JoinOperator.JoinOperatorSets<T,R> | join(DataSet<R> other, JoinOperatorBase.JoinHint strategy): Initiates a Join transformation. |
<R> JoinOperator.JoinOperatorSets<T,R> | joinWithHuge(DataSet<R> other): Initiates a Join transformation. |
<R> JoinOperator.JoinOperatorSets<T,R> | joinWithTiny(DataSet<R> other): Initiates a Join transformation. |
<R> JoinOperatorSetsBase<T,R> | leftOuterJoin(DataSet<R> other): Initiates a Left Outer Join transformation. |
<R> JoinOperatorSetsBase<T,R> | leftOuterJoin(DataSet<R> other, JoinOperatorBase.JoinHint strategy): Initiates a Left Outer Join transformation. |
<R> MapOperator<T,R> | map(MapFunction<T,R> mapper): Applies a Map transformation on this DataSet. |
<R> MapPartitionOperator<T,R> | mapPartition(MapPartitionFunction<T,R> mapPartition): Applies a Map-style operation to the entire partition of the data. |
AggregateOperator<T> | max(int field): Syntactic sugar for aggregate(Aggregations, int) using Aggregations.MAX as the aggregation function. |
ReduceOperator<T> | maxBy(int... fields): Selects an element with maximum value. |
AggregateOperator<T> | min(int field): Syntactic sugar for aggregate(Aggregations, int) using Aggregations.MIN as the aggregation function. |
ReduceOperator<T> | minBy(int... fields): Selects an element with minimum value. |
DataSink<T> | output(OutputFormat<T> outputFormat): Emits a DataSet using an OutputFormat. |
PartitionOperator<T> | partitionByHash(int... fields): Hash-partitions a DataSet on the specified key fields. |
<K extends Comparable<K>> PartitionOperator<T> | partitionByHash(KeySelector<T,K> keyExtractor): Partitions a DataSet using the specified KeySelector. |
PartitionOperator<T> | partitionByHash(String... fields): Hash-partitions a DataSet on the specified key fields. |
PartitionOperator<T> | partitionByRange(int... fields): Range-partitions a DataSet on the specified key fields. |
<K extends Comparable<K>> PartitionOperator<T> | partitionByRange(KeySelector<T,K> keyExtractor): Range-partitions a DataSet using the specified KeySelector. |
PartitionOperator<T> | partitionByRange(String... fields): Range-partitions a DataSet on the specified key fields. |
<K> PartitionOperator<T> | partitionCustom(Partitioner<K> partitioner, int field): Partitions a tuple DataSet on the specified key fields using a custom partitioner. |
<K extends Comparable<K>> PartitionOperator<T> | partitionCustom(Partitioner<K> partitioner, KeySelector<T,K> keyExtractor): Partitions a DataSet on the key returned by the selector, using a custom partitioner. |
<K> PartitionOperator<T> | partitionCustom(Partitioner<K> partitioner, String field): Partitions a POJO DataSet on the specified key fields using a custom partitioner. |
void | print(): Prints the elements in a DataSet to the standard output stream System.out of the JVM that calls the print() method. |
DataSink<T> | print(String sinkIdentifier): Deprecated. Use printOnTaskManager(String) instead. |
DataSink<T> | printOnTaskManager(String prefix): Writes a DataSet to the standard output streams (stdout) of the TaskManagers that execute the program (or more specifically, the data sink operators). |
void | printToErr(): Prints the elements in a DataSet to the standard error stream System.err of the JVM that calls the printToErr() method. |
DataSink<T> | printToErr(String sinkIdentifier): Deprecated. Use printOnTaskManager(String) instead, or the PrintingOutputFormat. |
<OUT extends Tuple> ProjectOperator<?,OUT> | project(int... fieldIndexes) |
PartitionOperator<T> | rebalance(): Enforces a re-balancing of the DataSet, i.e., the DataSet is evenly distributed over all parallel instances of the following task. |
ReduceOperator<T> | reduce(ReduceFunction<T> reducer): Applies a Reduce transformation on a non-grouped DataSet. |
<R> GroupReduceOperator<T,R> | reduceGroup(GroupReduceFunction<T,R> reducer): Applies a GroupReduce transformation on a non-grouped DataSet. |
<R> JoinOperatorSetsBase<T,R> | rightOuterJoin(DataSet<R> other): Initiates a Right Outer Join transformation. |
<R> JoinOperatorSetsBase<T,R> | rightOuterJoin(DataSet<R> other, JoinOperatorBase.JoinHint strategy): Initiates a Right Outer Join transformation. |
<X> DataSet<X> | runOperation(CustomUnaryOperation<T,X> operation): Runs a CustomUnaryOperation on the data set. |
SortPartitionOperator<T> | sortPartition(int field, Order order): Locally sorts the partitions of the DataSet on the specified field in the specified order. |
<K> SortPartitionOperator<T> | sortPartition(KeySelector<T,K> keyExtractor, Order order): Locally sorts the partitions of the DataSet on the extracted key in the specified order. |
SortPartitionOperator<T> | sortPartition(String field, Order order): Locally sorts the partitions of the DataSet on the specified field in the specified order. |
AggregateOperator<T> | sum(int field): Syntactic sugar for aggregate(SUM, field). |
UnionOperator<T> | union(DataSet<T> other): Creates a union of this DataSet with another DataSet. |
DataSink<T> | write(FileOutputFormat<T> outputFormat, String filePath): Writes a DataSet using a FileOutputFormat to a specified location. |
DataSink<T> | write(FileOutputFormat<T> outputFormat, String filePath, FileSystem.WriteMode writeMode): Writes a DataSet using a FileOutputFormat to a specified location. |
DataSink<T> | writeAsCsv(String filePath): Writes a Tuple DataSet as CSV file(s) to the specified location. |
DataSink<T> | writeAsCsv(String filePath, FileSystem.WriteMode writeMode): Writes a Tuple DataSet as CSV file(s) to the specified location. |
DataSink<T> | writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter): Writes a Tuple DataSet as CSV file(s) to the specified location with the specified field and line delimiters. |
DataSink<T> | writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter, FileSystem.WriteMode writeMode): Writes a Tuple DataSet as CSV file(s) to the specified location with the specified field and line delimiters. |
DataSink<String> | writeAsFormattedText(String filePath, FileSystem.WriteMode writeMode, TextOutputFormat.TextFormatter<T> formatter): Writes a DataSet as text file(s) to the specified location. |
DataSink<String> | writeAsFormattedText(String filePath, TextOutputFormat.TextFormatter<T> formatter): Writes a DataSet as text file(s) to the specified location. |
DataSink<T> | writeAsText(String filePath): Writes a DataSet as text file(s) to the specified location. |
DataSink<T> | writeAsText(String filePath, FileSystem.WriteMode writeMode): Writes a DataSet as text file(s) to the specified location. |
protected final ExecutionEnvironment context
protected DataSet(ExecutionEnvironment context, TypeInformation<T> typeInfo)
public ExecutionEnvironment getExecutionEnvironment()
Returns the ExecutionEnvironment in which this DataSet is registered.
See also: ExecutionEnvironment
protected void fillInType(TypeInformation<T> typeInfo)
Tries to fill in the type information.
typeInfo - The type information to fill in.
IllegalStateException - Thrown if the type information has been accessed before.
public TypeInformation<T> getType()
Returns the TypeInformation for the type of this DataSet.
See also: TypeInformation
public <F> F clean(F f)
public <R> MapOperator<T,R> map(MapFunction<T,R> mapper)
Applies a Map transformation on this DataSet.
The transformation calls a MapFunction for each element of the DataSet. Each MapFunction call
returns exactly one element.
mapper - The MapFunction that is called for each element of the DataSet.
See also: MapFunction, RichMapFunction, MapOperator
public <R> MapPartitionOperator<T,R> mapPartition(MapPartitionFunction<T,R> mapPartition)
Applies a Map-style operation to the entire partition of the data.
This function is intended for operations that cannot transform individual elements and
require no grouping of elements. To transform individual elements, map() and flatMap()
are preferable.
mapPartition - The MapPartitionFunction that is called for the full DataSet.
See also: MapPartitionFunction, MapPartitionOperator
public <R> FlatMapOperator<T,R> flatMap(FlatMapFunction<T,R> flatMapper)
Applies a FlatMap transformation on a DataSet.
The transformation calls a RichFlatMapFunction for each element of the DataSet.
Each FlatMapFunction call can return any number of elements, including none.
flatMapper - The FlatMapFunction that is called for each element of the DataSet.
See also: RichFlatMapFunction, FlatMapOperator, DataSet
public FilterOperator<T> filter(FilterFunction<T> filter)
Applies a Filter transformation on a DataSet.
The transformation calls a RichFilterFunction for each element of the DataSet and
retains only those elements for which the function returns true. Elements for which the
function returns false are filtered out.
filter - The FilterFunction that is called for each element of the DataSet.
See also: RichFilterFunction, FilterOperator, DataSet
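The per-element contracts of map, flatMap, and filter described above can be pictured without a cluster. The following is a minimal sketch in plain Java using java.util.stream, not the Flink API; the class name ElementWiseSketch, its method names, and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java illustration of the per-element contracts; this is not Flink API.
final class ElementWiseSketch {

    // map: exactly one output element per input element.
    static List<Integer> lengths(List<String> lines) {
        return lines.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: any number of output elements per input element, including none.
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> line.isEmpty()
                        ? Stream.<String>empty()
                        : Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    // filter: keeps only the elements for which the predicate returns true.
    static List<String> nonEmpty(List<String> lines) {
        return lines.stream().filter(line -> !line.isEmpty()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be", "", "or not");
        System.out.println(lengths(lines));
        System.out.println(words(lines));
        System.out.println(nonEmpty(lines));
    }
}
```

The empty line still produces one output under map, produces none under flatMap, and is dropped by filter.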
public <OUT extends Tuple> ProjectOperator<?,OUT> project(int... fieldIndexes)
Applies a Project transformation on a Tuple DataSet.
Note: Only Tuple DataSets can be projected using field indexes.
The transformation projects each Tuple of the DataSet onto a (sub)set of fields.
Additional fields can be added to the projection by calling project(int[]).
Note: With the current implementation, the Project transformation loses type information.
fieldIndexes - The field indexes of the input tuple that are retained. The order of
fields in the output tuple corresponds to the order of field indexes.
See also: Tuple, DataSet, ProjectOperator
public AggregateOperator<T> aggregate(Aggregations agg, int field)
Applies an Aggregate transformation on a non-grouped Tuple DataSet.
Note: Only Tuple DataSets can be aggregated. The transformation applies a built-in
Aggregation on a specified field of a Tuple DataSet. Additional aggregation functions
can be added to the resulting AggregateOperator by calling
AggregateOperator.and(Aggregations, int).
agg - The built-in aggregation function that is computed.
field - The index of the Tuple field on which the aggregation function is applied.
See also: Tuple, Aggregations, AggregateOperator, DataSet
public AggregateOperator<T> sum(int field)
Syntactic sugar for aggregate(SUM, field).
field - The index of the Tuple field on which the aggregation function is applied.
See also: AggregateOperator
public AggregateOperator<T> max(int field)
Syntactic sugar for aggregate(Aggregations, int) using Aggregations.MAX as
the aggregation function.
Note: This operation is not to be confused with maxBy(int...),
which selects one element with the maximum value at the specified field positions.
field - The index of the Tuple field on which the aggregation function is applied.
See also: aggregate(Aggregations, int), maxBy(int...)
public AggregateOperator<T> min(int field)
Syntactic sugar for aggregate(Aggregations, int) using Aggregations.MIN as
the aggregation function.
Note: This operation is not to be confused with minBy(int...),
which selects one element with the minimum value at the specified field positions.
field - The index of the Tuple field on which the aggregation function is applied.
See also: aggregate(Aggregations, int), minBy(int...)
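public long count() throws Exception
Convenience method to get the count (number of elements) of a DataSet.
Throws: Exception
public List<T> collect() throws Exception
Convenience method to get the elements of a DataSet as a List.
Throws: Exception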
public ReduceOperator<T> reduce(ReduceFunction<T> reducer)
Applies a Reduce transformation on a non-grouped DataSet.
The transformation consecutively calls a RichReduceFunction until only a single element
remains, which is the result of the transformation. A ReduceFunction combines two elements
into one new element of the same type.
reducer - The ReduceFunction that is applied on the DataSet.
See also: RichReduceFunction, ReduceOperator, DataSet
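The "combine two elements into one until a single element remains" contract can be sketched as a simple fold. This is plain Java, not the Flink API; ReduceSketch and its reduce helper are hypothetical names. The sketch combines elements left to right, whereas a parallel runtime may combine them in a different order, so the function is assumed to be associative.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;

// Plain-Java illustration of a non-grouped reduce; this is not Flink API.
final class ReduceSketch {

    // Combines elements pairwise until one remains; an empty input yields no result.
    static <T> Optional<T> reduce(List<T> elements, BinaryOperator<T> reducer) {
        Optional<T> result = Optional.empty();
        for (T element : elements) {
            result = result.isPresent()
                    ? Optional.of(reducer.apply(result.get(), element))
                    : Optional.of(element);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(reduce(List.of(1, 2, 3, 4), Integer::sum));
    }
}
```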
public <R> GroupReduceOperator<T,R> reduceGroup(GroupReduceFunction<T,R> reducer)
Applies a GroupReduce transformation on a non-grouped DataSet.
The transformation calls a RichGroupReduceFunction once with the full DataSet.
The GroupReduceFunction can iterate over all elements of the DataSet and emit any number of
output elements, including none.
reducer - The GroupReduceFunction that is applied on the DataSet.
See also: RichGroupReduceFunction, GroupReduceOperator, DataSet
public <R> GroupCombineOperator<T,R> combineGroup(GroupCombineFunction<T,R> combiner)
Applies a GroupCombineFunction on a non-grouped DataSet. A CombineFunction is similar
to a GroupReduceFunction but does not perform a full data exchange. Instead, the
CombineFunction calls the combine method once per partition for combining a group of results.
This operator is suitable for combining values into an intermediate format before doing a
proper groupReduce, where the data is shuffled across the nodes for further reduction. The
GroupReduce operator can also be supplied with a combiner by implementing the RichGroupReduce
function. The combine method of the RichGroupReduce function demands input and output type to
be the same. The CombineFunction, on the other hand, can have an arbitrary output type.
combiner - The GroupCombineFunction that is applied on the DataSet.
public ReduceOperator<T> minBy(int... fields)
Selects an element with minimum value.
The minimum is computed over the specified fields in lexicographical order.
Example 1: Given a data set with elements [0, 1], [1, 0], the results will be:
minBy(0): [0, 1]
minBy(1): [1, 0]
Example 2: Given a data set with elements [0, 0], [0, 1], the results will be:
minBy(0, 1): [0, 0]
If multiple elements have the minimum value at the specified fields, an arbitrary one is picked.
Internally, this operation is implemented as a ReduceFunction.
fields - Field positions to compute the minimum over
Returns: A ReduceOperator representing the minimum
public ReduceOperator<T> maxBy(int... fields)
Selects an element with maximum value.
The maximum is computed over the specified fields in lexicographical order.
Example 1: Given a data set with elements [0, 1], [1, 0], the results will be:
maxBy(0): [1, 0]
maxBy(1): [0, 1]
Example 2: Given a data set with elements [0, 0], [0, 1], the results will be:
maxBy(0, 1): [0, 1]
If multiple elements have the maximum value at the specified fields, an arbitrary one is picked.
Internally, this operation is implemented as a ReduceFunction.
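The lexicographical comparison used by minBy and maxBy can be reproduced with an ordinary comparator. The sketch below is plain Java, not the Flink API: tuples are modelled as int arrays, and MinByMaxBySketch with its helpers is a hypothetical name. It reproduces the two examples given for each method.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration of lexicographical min/max selection; this is not Flink API.
final class MinByMaxBySketch {

    // Compares two tuples on the given field positions, in the order they are listed.
    static Comparator<int[]> byFields(int... fields) {
        return (a, b) -> {
            for (int f : fields) {
                int c = Integer.compare(a[f], b[f]);
                if (c != 0) {
                    return c;
                }
            }
            return 0;
        };
    }

    static int[] minBy(List<int[]> data, int... fields) {
        return data.stream().min(byFields(fields)).orElseThrow(IllegalArgumentException::new);
    }

    static int[] maxBy(List<int[]> data, int... fields) {
        return data.stream().max(byFields(fields)).orElseThrow(IllegalArgumentException::new);
    }

    public static void main(String[] args) {
        List<int[]> first = List.of(new int[] {0, 1}, new int[] {1, 0});
        List<int[]> second = List.of(new int[] {0, 0}, new int[] {0, 1});
        System.out.println(Arrays.toString(minBy(first, 0)));
        System.out.println(Arrays.toString(maxBy(first, 0)));
        System.out.println(Arrays.toString(minBy(second, 0, 1)));
        System.out.println(Arrays.toString(maxBy(second, 0, 1)));
    }
}
```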
fields - Field positions to compute the maximum over
Returns: A ReduceOperator representing the maximum
public GroupReduceOperator<T,T> first(int n)
Returns a new set containing the first n elements in this DataSet.
n - The desired number of elements.
public <K> DistinctOperator<T> distinct(KeySelector<T,K> keyExtractor)
Returns a distinct set of a DataSet using a KeySelector function.
The KeySelector function is called for each element of the DataSet and extracts a single key value on which the decision is made if two items are distinct or not.
keyExtractor - The KeySelector function which extracts the key values from the DataSet
on which the distinction of the DataSet is decided.
public DistinctOperator<T> distinct(int... fields)
Returns a distinct set of a Tuple DataSet using field position keys.
The field position keys specify the fields of Tuples on which the decision is made if two Tuples are distinct or not.
Note: Field position keys can only be specified for Tuple DataSets.
fields - One or more field positions on which the distinction of the DataSet is decided.
public DistinctOperator<T> distinct(String... fields)
Returns a distinct set of a DataSet using expression keys.
The field expression keys specify the fields of a CompositeType (e.g., Tuple or Pojo type)
on which the decision is made if two elements are distinct or not. In case of an AtomicType,
only the wildcard expression ("*") is valid.
fields - One or more field expressions on which the distinction of the DataSet is decided.
public DistinctOperator<T> distinct()
Returns a distinct set of a DataSet.
If the input is a CompositeType (Tuple or Pojo type), distinct is performed on all fields
and each field must be a key type.
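For the KeySelector variant of distinct, two elements count as duplicates when their extracted keys are equal. A minimal plain-Java sketch of that rule follows; it is not the Flink API, DistinctSketch and distinctBy are hypothetical names, and keeping the first element seen per key is an assumption of the sketch, since the text above does not say which duplicate survives.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Plain-Java illustration of key-based distinct; this is not Flink API.
final class DistinctSketch {

    // Keeps one element per extracted key (assumption: the first one encountered).
    static <T, K> List<T> distinctBy(List<T> elements, Function<T, K> keyExtractor) {
        Set<K> seenKeys = new HashSet<>();
        List<T> result = new ArrayList<>();
        for (T element : elements) {
            if (seenKeys.add(keyExtractor.apply(element))) {
                result.add(element);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Key: the first character of each word.
        System.out.println(distinctBy(List.of("apple", "avocado", "banana"), s -> s.charAt(0)));
    }
}
```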
public <K> UnsortedGrouping<T> groupBy(KeySelector<T,K> keyExtractor)
Groups a DataSet using a KeySelector function. The KeySelector function is
called for each element of the DataSet and extracts a single key value on which the DataSet
is grouped.
This method returns an UnsortedGrouping on which one of the following grouping
transformations can be applied:
- UnsortedGrouping.sortGroup(int, org.apache.flink.api.common.operators.Order) to get a SortedGrouping.
- UnsortedGrouping.aggregate(Aggregations, int) to apply an Aggregate transformation.
- UnsortedGrouping.reduce(org.apache.flink.api.common.functions.ReduceFunction) to apply a Reduce transformation.
- UnsortedGrouping.reduceGroup(org.apache.flink.api.common.functions.GroupReduceFunction) to apply a GroupReduce transformation.
keyExtractor - The KeySelector function which extracts the key values from the
DataSet on which it is grouped.
Returns: An UnsortedGrouping on which a transformation needs to be applied to obtain a
transformed DataSet.
See also: KeySelector, UnsortedGrouping, AggregateOperator, ReduceOperator, GroupReduceOperator, DataSet
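Grouping by an extracted key and then applying a Reduce transformation per group can be pictured as follows. This is a plain-Java sketch rather than the Flink API; GroupByReduceSketch and groupByAndReduce are hypothetical names, and the result is returned as a map from key to reduced element purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Plain-Java illustration of groupBy followed by reduce; this is not Flink API.
final class GroupByReduceSketch {

    // Extracts a key per element, then combines all elements that share a key.
    static <T, K> Map<K, T> groupByAndReduce(
            List<T> elements, Function<T, K> keyExtractor, BinaryOperator<T> reducer) {
        Map<K, T> reducedPerKey = new LinkedHashMap<>();
        for (T element : elements) {
            // merge() stores the first element of a group and reduces later ones into it.
            reducedPerKey.merge(keyExtractor.apply(element), element, reducer);
        }
        return reducedPerKey;
    }

    public static void main(String[] args) {
        // Group words by length and concatenate the words of each group.
        System.out.println(
                groupByAndReduce(List.of("a", "bb", "cc", "d"), String::length, (x, y) -> x + y));
    }
}
```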
public UnsortedGrouping<T> groupBy(int... fields)
Groups a Tuple DataSet using field position keys.
Note: Field position keys can only be specified for Tuple DataSets.
The field position keys specify the fields of Tuples on which the DataSet is grouped. This
method returns an UnsortedGrouping on which one of the following grouping
transformations can be applied:
- UnsortedGrouping.sortGroup(int, org.apache.flink.api.common.operators.Order) to get a SortedGrouping.
- UnsortedGrouping.aggregate(Aggregations, int) to apply an Aggregate transformation.
- UnsortedGrouping.reduce(org.apache.flink.api.common.functions.ReduceFunction) to apply a Reduce transformation.
- UnsortedGrouping.reduceGroup(org.apache.flink.api.common.functions.GroupReduceFunction) to apply a GroupReduce transformation.
fields - One or more field positions on which the DataSet will be grouped.
Returns: An UnsortedGrouping on which a transformation needs to be applied to obtain a
transformed DataSet.
See also: Tuple, UnsortedGrouping, AggregateOperator, ReduceOperator, GroupReduceOperator, DataSet
public UnsortedGrouping<T> groupBy(String... fields)
Groups a DataSet using field expressions. A field expression is either the name of a
public field or a getter method with parentheses of the DataSet's underlying type. A
dot can be used to drill down into objects, as in "field1.getInnerField2()". This
method returns an UnsortedGrouping on which one of the following grouping
transformations can be applied:
- UnsortedGrouping.sortGroup(int, org.apache.flink.api.common.operators.Order) to get a SortedGrouping.
- UnsortedGrouping.aggregate(Aggregations, int) to apply an Aggregate transformation.
- UnsortedGrouping.reduce(org.apache.flink.api.common.functions.ReduceFunction) to apply a Reduce transformation.
- UnsortedGrouping.reduceGroup(org.apache.flink.api.common.functions.GroupReduceFunction) to apply a GroupReduce transformation.
fields - One or more field expressions on which the DataSet will be grouped.
Returns: An UnsortedGrouping on which a transformation needs to be applied to obtain a
transformed DataSet.
See also: Tuple, UnsortedGrouping, AggregateOperator, ReduceOperator, GroupReduceOperator, DataSet
public <R> JoinOperator.JoinOperatorSets<T,R> join(DataSet<R> other)
Initiates a Join transformation.
A Join transformation joins the elements of two DataSets on key equality
and provides multiple ways to combine joining elements into one DataSet.
This method returns a JoinOperator.JoinOperatorSets on which one of the where methods
can be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet with which this DataSet is joined.
See also: JoinOperator.JoinOperatorSets, DataSet
public <R> JoinOperator.JoinOperatorSets<T,R> join(DataSet<R> other, JoinOperatorBase.JoinHint strategy)
Initiates a Join transformation.
A Join transformation joins the elements of two DataSets on key equality
and provides multiple ways to combine joining elements into one DataSet.
This method returns a JoinOperator.JoinOperatorSets on which one of the where methods
can be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet with which this DataSet is joined.
strategy - The strategy that should be used to execute the join. If null is given,
then the optimizer will pick the join strategy.
See also: JoinOperator.JoinOperatorSets, DataSet
public <R> JoinOperator.JoinOperatorSets<T,R> joinWithTiny(DataSet<R> other)
Initiates a Join transformation.
A Join transformation joins the elements of two DataSets on key equality
and provides multiple ways to combine joining elements into one DataSet.
This method also gives the hint to the optimizer that the second DataSet to join is much smaller than the first one.
This method returns a JoinOperator.JoinOperatorSets on which JoinOperator.JoinOperatorSets.where(String...)
needs to be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet with which this DataSet is joined.
See also: JoinOperator.JoinOperatorSets, DataSet
public <R> JoinOperator.JoinOperatorSets<T,R> joinWithHuge(DataSet<R> other)
Initiates a Join transformation.
A Join transformation joins the elements of two DataSets on key equality
and provides multiple ways to combine joining elements into one DataSet.
This method also gives the hint to the optimizer that the second DataSet to join is much larger than the first one.
This method returns a JoinOperator.JoinOperatorSets on which one of the where methods
can be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet with which this DataSet is joined.
See also: JoinOperator.JoinOperatorSets, DataSet
public <R> JoinOperatorSetsBase<T,R> leftOuterJoin(DataSet<R> other)
Initiates a Left Outer Join transformation.
An Outer Join transformation joins the elements of two DataSets on key
equality and provides multiple ways to combine joining elements into one DataSet.
Elements of the left DataSet (i.e. this) that do not have a matching
element on the other side are joined with null and emitted to the resulting DataSet.
other - The other DataSet with which this DataSet is joined.
See also: JoinOperatorSetsBase, DataSet
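The "joined with null" behaviour of a left outer join can be sketched in plain Java as below. This is not the Flink API: LeftOuterJoinSketch and its leftOuterJoin helper are hypothetical, pairs are modelled with Map.Entry, and the sketch emits the pair directly instead of handing it to a user-defined join function.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration of a left outer join on key equality; this is not Flink API.
final class LeftOuterJoinSketch {

    static <L, R, K> List<Map.Entry<L, R>> leftOuterJoin(
            List<L> left, List<R> right, Function<L, K> leftKey, Function<R, K> rightKey) {
        // Index the right side by key.
        Map<K, List<R>> rightByKey = new HashMap<>();
        for (R r : right) {
            rightByKey.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        List<Map.Entry<L, R>> joined = new ArrayList<>();
        for (L l : left) {
            List<R> matches = rightByKey.get(leftKey.apply(l));
            if (matches == null) {
                // No partner on the other side: the left element is joined with null.
                joined.add(new AbstractMap.SimpleEntry<L, R>(l, null));
            } else {
                for (R r : matches) {
                    joined.add(new AbstractMap.SimpleEntry<L, R>(l, r));
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Join key: the first character of each string.
        System.out.println(leftOuterJoin(
                List.of("a1", "b2"), List.of("a9"), s -> s.charAt(0), s -> s.charAt(0)));
    }
}
```

A right outer join is the mirror image (unmatched elements of other are paired with null), and a full outer join does both.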
public <R> JoinOperatorSetsBase<T,R> leftOuterJoin(DataSet<R> other, JoinOperatorBase.JoinHint strategy)
Initiates a Left Outer Join transformation.
An Outer Join transformation joins the elements of two DataSets on key
equality and provides multiple ways to combine joining elements into one DataSet.
Elements of the left DataSet (i.e. this) that do not have a matching
element on the other side are joined with null and emitted to the resulting DataSet.
other - The other DataSet with which this DataSet is joined.
strategy - The strategy that should be used to execute the join. If null is given,
then the optimizer will pick the join strategy.
See also: JoinOperatorSetsBase, DataSet
public <R> JoinOperatorSetsBase<T,R> rightOuterJoin(DataSet<R> other)
Initiates a Right Outer Join transformation.
An Outer Join transformation joins the elements of two DataSets on key
equality and provides multiple ways to combine joining elements into one DataSet.
Elements of the right DataSet (i.e. other) that do not have a matching
element on this side are joined with null and emitted to the resulting DataSet.
other - The other DataSet with which this DataSet is joined.
See also: JoinOperatorSetsBase, DataSet
public <R> JoinOperatorSetsBase<T,R> rightOuterJoin(DataSet<R> other, JoinOperatorBase.JoinHint strategy)
Initiates a Right Outer Join transformation.
An Outer Join transformation joins the elements of two DataSets on key
equality and provides multiple ways to combine joining elements into one DataSet.
Elements of the right DataSet (i.e. other) that do not have a matching
element on this side are joined with null and emitted to the resulting DataSet.
other - The other DataSet with which this DataSet is joined.
strategy - The strategy that should be used to execute the join. If null is given,
then the optimizer will pick the join strategy.
See also: JoinOperatorSetsBase, DataSet
public <R> JoinOperatorSetsBase<T,R> fullOuterJoin(DataSet<R> other)
Initiates a Full Outer Join transformation.
An Outer Join transformation joins the elements of two DataSets on key
equality and provides multiple ways to combine joining elements into one DataSet.
Elements of both DataSets that do not have a matching element on the opposing side
are joined with null and emitted to the resulting DataSet.
other - The other DataSet with which this DataSet is joined.
See also: JoinOperatorSetsBase, DataSet
public <R> JoinOperatorSetsBase<T,R> fullOuterJoin(DataSet<R> other, JoinOperatorBase.JoinHint strategy)
Initiates a Full Outer Join transformation.
An Outer Join transformation joins the elements of two DataSets on key
equality and provides multiple ways to combine joining elements into one DataSet.
Elements of both DataSets that do not have a matching element on the opposing side
are joined with null and emitted to the resulting DataSet.
other - The other DataSet with which this DataSet is joined.
strategy - The strategy that should be used to execute the join. If null is given,
then the optimizer will pick the join strategy.
See also: JoinOperatorSetsBase, DataSet
public <R> CoGroupOperator.CoGroupOperatorSets<T,R> coGroup(DataSet<R> other)
Initiates a CoGroup transformation.
A CoGroup transformation combines the elements of two DataSets into one
DataSet. It groups each DataSet individually on a key and passes the groups of both DataSets
with equal keys together to a RichCoGroupFunction.
If a DataSet has a group with no matching key in the other DataSet, the CoGroupFunction is
called with an empty group for the non-existing group.
The CoGroupFunction can iterate over the elements of both groups and return any number of elements, including none.
This method returns a CoGroupOperator.CoGroupOperatorSets on which one of the where
methods can be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet of the CoGroup transformation.
See also: CoGroupOperator.CoGroupOperatorSets, CoGroupOperator, DataSet
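The CoGroup contract (one call per key with both groups, and an empty group when a key is missing on one side) can be sketched in plain Java. This is not the Flink API; CoGroupSketch, coGroup, and the coGrouper parameter are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

// Plain-Java illustration of a CoGroup; this is not Flink API.
final class CoGroupSketch {

    static <A, B, K, O> List<O> coGroup(
            List<A> first, List<B> second,
            Function<A, K> firstKey, Function<B, K> secondKey,
            BiFunction<List<A>, List<B>, List<O>> coGrouper) {
        // Group each input individually on its key.
        Map<K, List<A>> firstGroups = new LinkedHashMap<>();
        for (A a : first) {
            firstGroups.computeIfAbsent(firstKey.apply(a), k -> new ArrayList<>()).add(a);
        }
        Map<K, List<B>> secondGroups = new LinkedHashMap<>();
        for (B b : second) {
            secondGroups.computeIfAbsent(secondKey.apply(b), k -> new ArrayList<>()).add(b);
        }
        // Visit every key that occurs in either input.
        Set<K> keys = new LinkedHashSet<>(firstGroups.keySet());
        keys.addAll(secondGroups.keySet());
        List<O> out = new ArrayList<>();
        for (K key : keys) {
            // A key missing on one side is represented by an empty group.
            List<A> left = firstGroups.getOrDefault(key, Collections.emptyList());
            List<B> right = secondGroups.getOrDefault(key, Collections.emptyList());
            out.addAll(coGrouper.apply(left, right));
        }
        return out;
    }

    public static void main(String[] args) {
        // Emit "<size of first group>:<size of second group>" for every key.
        List<String> sizes = coGroup(
                List.of(1, 2, 2), List.of(2, 3),
                (Integer x) -> x, (Integer y) -> y,
                (a, b) -> List.of(a.size() + ":" + b.size()));
        System.out.println(sizes);
    }
}
```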
public <R> CrossOperator.DefaultCross<T,R> cross(DataSet<R> other)
Initiates a Cross transformation.
A Cross transformation combines the elements of two DataSets into one
DataSet. It builds all pair combinations of elements of both DataSets, i.e., it builds a
Cartesian product.
The resulting CrossOperator.DefaultCross wraps each pair of crossed elements into a Tuple2,
with the element of the first input being the first field of the tuple and the element of
the second input being the second field of the tuple.
Call CrossOperator.DefaultCross.with(org.apache.flink.api.common.functions.CrossFunction)
to define a CrossFunction which is called for each pair of crossed elements. The CrossFunction
returns exactly one element for each pair of input elements.
other - The other DataSet with which this DataSet is crossed.
See also: CrossOperator.DefaultCross, CrossFunction, DataSet, Tuple2
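The default pairing produced by a Cross transformation is a Cartesian product. A minimal plain-Java sketch, not the Flink API, is shown below; CrossSketch is a hypothetical name and Map.Entry stands in for Tuple2, with the key as the first field and the value as the second.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of a Cartesian product; this is not Flink API.
final class CrossSketch {

    // Builds every pair (a, b) with a from the first input and b from the second.
    static <A, B> List<Map.Entry<A, B>> cross(List<A> first, List<B> second) {
        List<Map.Entry<A, B>> pairs = new ArrayList<>();
        for (A a : first) {
            for (B b : second) {
                pairs.add(Map.entry(a, b));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(cross(List.of(1, 2), List.of("x", "y")));
    }
}
```

The output size is the product of the input sizes, which is why the crossWithTiny and crossWithHuge hints about relative input size exist.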
public <R> CrossOperator.DefaultCross<T,R> crossWithTiny(DataSet<R> other)
Initiates a Cross transformation.
A Cross transformation combines the elements of two DataSets into one
DataSet. It builds all pair combinations of elements of both DataSets, i.e., it builds a
Cartesian product. This method also gives the hint to the optimizer that the second DataSet
to cross is much smaller than the first one.
The resulting CrossOperator.DefaultCross wraps each pair of crossed elements into a Tuple2,
with the element of the first input being the first field of the tuple and the element of
the second input being the second field of the tuple.
Call CrossOperator.DefaultCross.with(org.apache.flink.api.common.functions.CrossFunction)
to define a CrossFunction which is called for each pair of crossed elements. The CrossFunction
returns exactly one element for each pair of input elements.
other - The other DataSet with which this DataSet is crossed.
See also: CrossOperator.DefaultCross, CrossFunction, DataSet, Tuple2
public <R> CrossOperator.DefaultCross<T,R> crossWithHuge(DataSet<R> other)
Initiates a Cross transformation.
A Cross transformation combines the elements of two DataSets into one
DataSet. It builds all pair combinations of elements of both DataSets, i.e., it builds a
Cartesian product. This method also gives the hint to the optimizer that the second DataSet
to cross is much larger than the first one.
The resulting CrossOperator.DefaultCross wraps each pair of crossed elements into a Tuple2,
with the element of the first input being the first field of the tuple and the element of
the second input being the second field of the tuple.
Call CrossOperator.DefaultCross.with(org.apache.flink.api.common.functions.CrossFunction)
to define a CrossFunction which is called for each pair of crossed elements. The CrossFunction
returns exactly one element for each pair of input elements.
other - The other DataSet with which this DataSet is crossed.
See also: CrossOperator.DefaultCross, CrossFunction, DataSet, Tuple2
public IterativeDataSet<T> iterate(int maxIterations)
Initiates an iterative part of the program that executes multiple times and feeds back data
sets. The iterative part is closed by calling IterativeDataSet.closeWith(DataSet). The data set
given to the closeWith(DataSet) method is the data set that will be fed back and used as
the input to the next iteration. The return value of the closeWith(DataSet) method is
the resulting data set after the iteration has terminated.
An example of an iterative computation is as follows:
    DataSet<Double> input = ...;
    // iterate(int) returns an IterativeDataSet, which is the type that offers closeWith(DataSet)
    IterativeDataSet<Double> startOfIteration = input.iterate(10);
    DataSet<Double> toBeFedBack = startOfIteration
            .map(new MyMapper())
            .groupBy(...).reduceGroup(new MyReducer());
    DataSet<Double> result = startOfIteration.closeWith(toBeFedBack);
The iteration executes at most maxIterations times. Dynamic termination can be
realized by using a termination criterion (see IterativeDataSet.closeWith(DataSet, DataSet)).
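The feedback loop of a bulk iteration amounts to applying the same step function repeatedly, with each result becoming the next input. The plain-Java sketch below illustrates only the fixed-count case and is not the Flink API; BulkIterationSketch, iterate, and halveAll are hypothetical names.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Plain-Java illustration of a bulk iteration; this is not Flink API.
final class BulkIterationSketch {

    // Applies the step maxIterations times, feeding each result back as the next input.
    static <T> T iterate(T input, int maxIterations, UnaryOperator<T> step) {
        T current = input;
        for (int i = 0; i < maxIterations; i++) {
            current = step.apply(current);
        }
        return current;
    }

    // Example step function: halves every element of the data set.
    static List<Double> halveAll(List<Double> values) {
        return values.stream().map(v -> v / 2).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(iterate(List.of(8.0, 4.0), 2, BulkIterationSketch::halveAll));
    }
}
```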
maxIterations - The maximum number of times that the iteration is executed.
Returns: An IterativeDataSet that needs to be closed by IterativeDataSet.closeWith(DataSet).
See also: IterativeDataSet
public <R> DeltaIteration<T,R> iterateDelta(DataSet<R> workset, int maxIterations, int... keyPositions)
Initiates a delta iteration. A delta iteration is similar to an iteration started by
iterate(int), but maintains state across the individual iteration steps. The
solution set, which represents the current state at the beginning of each iteration, can be
obtained via DeltaIteration.getSolutionSet().
It can be accessed by joining (or CoGrouping) with it. The DataSet that represents the
workset of an iteration can be obtained via DeltaIteration.getWorkset(). The solution set is
updated by producing a delta for it, which is merged into the solution set at the end of each
iteration step.
The delta iteration must be closed by calling DeltaIteration.closeWith(DataSet, DataSet). The two
parameters are the delta for the solution set and the new workset (the data set that will be
fed back). The return value of the closeWith(DataSet, DataSet) method is the
resulting data set after the iteration has terminated. Delta iterations terminate when the
feedback data set (the workset) is empty. In addition, a maximum number of steps is given as
a fallback termination guard.
Elements in the solution set are uniquely identified by a key. When merging the solution set delta, contained elements with the same key are replaced.
NOTE: Delta iterations currently support only tuple valued data types. This restriction will be removed in the future. The key is specified by the tuple position.
A code example for a delta iteration is as follows:
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
    initialState.iterateDelta(initialFeedbackSet, 100, 0);
DataSet<Tuple2<Long, Long>> delta = iteration.getWorkset().groupBy(0).aggregate(Aggregations.AVG, 1)
    .join(iteration.getSolutionSet()).where(0).equalTo(0)
    .flatMap(new ProjectAndFilter());
DataSet<Tuple2<Long, Long>> feedBack = delta.join(someOtherSet).where(...).equalTo(...).with(...);
// close the delta iteration with the solution set delta and the new workset
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(delta, feedBack);
workset
- The initial version of the data set that is fed back to the next iteration
step (the workset).
maxIterations
- The maximum number of iteration steps, as a fallback safeguard.
keyPositions
- The positions of the tuple fields that are used as the key of the solution
set.
DeltaIteration
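The interplay of solution set, workset, and delta can be modeled in plain Java. The sketch below is not Flink code: `DeltaIterationSketch`, `StepResult`, `Step`, and `minLabelExample` are hypothetical names, a `Map` keyed by the first field stands in for the solution set, and `Map.Entry` stands in for a two-field tuple. The example propagates the smallest label through a small graph, which is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of delta iteration semantics (hypothetical, not Flink code). */
final class DeltaIterationSketch {

    /** Output of one step: the solution set delta and the next workset. */
    static final class StepResult {
        final Map<Long, Long> delta;
        final List<Map.Entry<Long, Long>> nextWorkset;

        StepResult(Map<Long, Long> delta, List<Map.Entry<Long, Long>> nextWorkset) {
            this.delta = delta;
            this.nextWorkset = nextWorkset;
        }
    }

    /** Hypothetical step function: reads the workset and the current solution set. */
    interface Step {
        StepResult apply(List<Map.Entry<Long, Long>> workset, Map<Long, Long> solutionSet);
    }

    static Map<Long, Long> iterateDelta(Map<Long, Long> initialSolution,
                                        List<Map.Entry<Long, Long>> initialWorkset,
                                        int maxIterations, Step step) {
        Map<Long, Long> solution = new HashMap<>(initialSolution);
        List<Map.Entry<Long, Long>> workset = initialWorkset;
        // Terminate when the workset is empty; maxIterations is only a fallback guard.
        for (int i = 0; i < maxIterations && !workset.isEmpty(); i++) {
            StepResult r = step.apply(workset, Collections.unmodifiableMap(solution));
            solution.putAll(r.delta); // delta elements replace entries with the same key
            workset = r.nextWorkset;
        }
        return solution;
    }

    /** Example: spread the smallest label along the edges 1-2 and 2-3; vertex 4 is isolated. */
    static Map<Long, Long> minLabelExample() {
        Map<Long, List<Long>> neighbors = Map.of(
                1L, List.of(2L), 2L, List.of(1L, 3L), 3L, List.of(2L), 4L, List.<Long>of());
        Map<Long, Long> labels = new HashMap<>();
        List<Map.Entry<Long, Long>> workset = new ArrayList<>();
        for (Long v : neighbors.keySet()) {
            labels.put(v, v);              // every vertex starts with its own id as label
            workset.add(Map.entry(v, v));
        }
        return iterateDelta(labels, workset, 100, (ws, solution) -> {
            Map<Long, Long> delta = new HashMap<>();
            for (Map.Entry<Long, Long> e : ws) {
                for (Long n : neighbors.get(e.getKey())) {
                    long best = delta.getOrDefault(n, solution.get(n));
                    if (e.getValue() < best) {
                        delta.put(n, e.getValue()); // neighbor improves its label
                    }
                }
            }
            // Only changed vertices are fed back, so the workset shrinks to empty.
            return new StepResult(delta, new ArrayList<>(delta.entrySet()));
        });
    }

    public static void main(String[] args) {
        System.out.println(minLabelExample());
    }
}
```

Because only changed elements re-enter the workset, later steps touch less data than a bulk iteration over the full state would.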
public <X> DataSet<X> runOperation(CustomUnaryOperation<T,X> operation)
CustomUnaryOperation on the data set. Custom operations are typically complex
operators that are composed of multiple steps.
operation
- The operation to run.
public UnionOperator<T> union(DataSet<T> other)
other
- The other DataSet which is unioned with the current DataSet.
public PartitionOperator<T> partitionByHash(int... fields)
Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
fields
- The field indexes on which the DataSet is hash-partitioned.
public PartitionOperator<T> partitionByHash(String... fields)
Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
fields
- The field expressions on which the DataSet is hash-partitioned.
public <K extends Comparable<K>> PartitionOperator<T> partitionByHash(KeySelector<T,K> keyExtractor)
Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
keyExtractor
- The KeyExtractor with which the DataSet is hash-partitioned.
KeySelector
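The idea behind hash partitioning can be sketched in plain Java. This is a simplified model, not Flink code: `HashPartitionSketch` is a hypothetical name, a `Function` stands in for the KeySelector, and using `hashCode()` modulo the partition count is an assumption, since the hash function Flink applies internally is not described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Plain-Java model of hash partitioning (hypothetical, not Flink code). */
final class HashPartitionSketch {

    static <T, K> List<List<T>> partitionByHash(List<T> elements,
                                                Function<T, K> keyExtractor,
                                                int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T element : elements) {
            int hash = keyExtractor.apply(element).hashCode();
            // floorMod keeps the index non-negative for negative hash codes.
            partitions.get(Math.floorMod(hash, numPartitions)).add(element);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Key by word length: equal keys always land in the same partition.
        System.out.println(partitionByHash(List.of("a", "bb", "cc", "ddd"), String::length, 2));
    }
}
```

Every element may move to a different partition than the one it started in, which is the reason the operation shuffles the whole data set.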
public PartitionOperator<T> partitionByRange(int... fields)
Important: This operation requires an extra pass over the DataSet to compute the range boundaries and shuffles the whole DataSet over the network. This can take a significant amount of time.
fields
- The field indexes on which the DataSet is range-partitioned.
public PartitionOperator<T> partitionByRange(String... fields)
Important: This operation requires an extra pass over the DataSet to compute the range boundaries and shuffles the whole DataSet over the network. This can take a significant amount of time.
fields
- The field expressions on which the DataSet is range-partitioned.
public <K extends Comparable<K>> PartitionOperator<T> partitionByRange(KeySelector<T,K> keyExtractor)
Important: This operation requires an extra pass over the DataSet to compute the range boundaries and shuffles the whole DataSet over the network. This can take a significant amount of time.
keyExtractor
- The KeyExtractor with which the DataSet is range-partitioned.
KeySelector
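The two passes mentioned above, one to compute range boundaries and one to route elements, can be illustrated with a plain-Java model. This is a sketch, not Flink code: `RangePartitionSketch` and its methods are hypothetical, and deriving the boundaries from a full sort of the keys is a simplifying assumption; how Flink actually computes them is not described here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Plain-Java model of range partitioning (hypothetical, not Flink code). */
final class RangePartitionSketch {

    /** First pass: pick numPartitions - 1 split points from the sorted keys. */
    static <K extends Comparable<K>> List<K> boundaries(List<K> keys, int numPartitions) {
        List<K> bounds = new ArrayList<>();
        if (keys.isEmpty()) {
            return bounds;
        }
        List<K> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        for (int i = 1; i < numPartitions; i++) {
            bounds.add(sorted.get(i * sorted.size() / numPartitions));
        }
        return bounds;
    }

    /** Second pass: route each element to the range that contains its key. */
    static <T, K extends Comparable<K>> List<List<T>> partitionByRange(
            List<T> elements, Function<T, K> keyExtractor, int numPartitions) {
        List<K> keys = new ArrayList<>();
        for (T e : elements) {
            keys.add(keyExtractor.apply(e));
        }
        List<K> bounds = boundaries(keys, numPartitions);
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T e : elements) {
            K key = keyExtractor.apply(e);
            int p = 0;
            // Advance past every boundary that is less than or equal to the key.
            while (p < bounds.size() && key.compareTo(bounds.get(p)) >= 0) {
                p++;
            }
            partitions.get(p).add(e);
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(RangePartitionSketch.<Integer, Integer>partitionByRange(
                List.of(5, 1, 9, 3, 7, 2), x -> x, 3));
    }
}
```

Unlike hash partitioning, every key in partition i is smaller than every key in partition i + 1, at the cost of the extra boundary pass.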
public <K> PartitionOperator<T> partitionCustom(Partitioner<K> partitioner, int field)
Note: This method works only on single field keys.
partitioner
- The partitioner to assign partitions to keys.
field
- The field index on which the DataSet is to be partitioned.
public <K> PartitionOperator<T> partitionCustom(Partitioner<K> partitioner, String field)
Note: This method works only on single field keys.
partitioner
- The partitioner to assign partitions to keys.
field
- The field index on which the DataSet is to be partitioned.
public <K extends Comparable<K>> PartitionOperator<T> partitionCustom(Partitioner<K> partitioner, KeySelector<T,K> keyExtractor)
Note: This method works only on single field keys, i.e. the selector cannot return tuples of fields.
partitioner
- The partitioner to assign partitions to keys.
keyExtractor
- The KeyExtractor with which the DataSet is partitioned.
KeySelector
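A custom partitioner maps a single key to a partition index. The plain-Java model below shows that contract; it is a sketch and not Flink code. The nested `Partitioner` interface is a local stand-in declared here for illustration (its method shape is an assumption modeled on a key plus a partition count), and `CustomPartitionSketch` and the vowel-based rule are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Plain-Java model of custom partitioning on a single key (hypothetical, not Flink code). */
final class CustomPartitionSketch {

    /** Local stand-in for a partitioner: maps one key to a partition index. */
    interface Partitioner<K> {
        int partition(K key, int numPartitions);
    }

    static <T, K> List<List<T>> partitionCustom(List<T> elements,
                                                Partitioner<K> partitioner,
                                                Function<T, K> keyExtractor,
                                                int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T element : elements) {
            int target = partitioner.partition(keyExtractor.apply(element), numPartitions);
            if (target < 0 || target >= numPartitions) {
                throw new IllegalArgumentException("partition index out of range: " + target);
            }
            partitions.get(target).add(element);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Words starting with a vowel go to partition 0, all others to partition 1.
        System.out.println(CustomPartitionSketch.<String, Character>partitionCustom(
                List.of("apple", "kiwi", "egg", "fig"),
                (c, n) -> "aeiou".indexOf(c) >= 0 ? 0 : 1,
                s -> s.charAt(0),
                2));
    }
}
```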
public PartitionOperator<T> rebalance()
Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
public SortPartitionOperator<T> sortPartition(int field, Order order)
field
- The field index on which the DataSet is sorted.
order
- The order in which the DataSet is sorted.
public SortPartitionOperator<T> sortPartition(String field, Order order)
field
- The field expression referring to the field on which the DataSet is sorted.
order
- The order in which the DataSet is sorted.
public <K> SortPartitionOperator<T> sortPartition(KeySelector<T,K> keyExtractor, Order order)
Note that no additional sort keys can be appended to KeySelector sort keys. To sort the partitions by multiple values using a KeySelector, the KeySelector must return a tuple consisting of the values.
keyExtractor
- The KeySelector function which extracts the key values from the DataSet
on which the DataSet is sorted.
order
- The order in which the DataSet is sorted.
public DataSink<T> writeAsText(String filePath)
For each element of the DataSet the result of Object.toString() is written.
Output files and directories
The output that the writeAsText() method produces depends on the circumstances:
.
└── path1/
├── 1
├── 2
└── ...
Code Example
dataset.writeAsText("file:///path1");
.
└── path1
Code Example
// Parallelism is set for only this particular operation
dataset.writeAsText("file:///path1").setParallelism(1);
// This has the same effect, but note that the parallelism of all operators is set to one
env.setParallelism(1);
...
dataset.writeAsText("file:///path1");
.
└── path1/
└── 1
Code Example
// fs.output.always-create-directory = true
dataset.writeAsText("file:///path1").setParallelism(1);
filePath
- The path pointing to the location where the text file, or the files under the directory,
are written.
TextOutputFormat
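The three output layouts shown above can be summarized as a small decision function. This is a plain-Java sketch of the listed cases only, not Flink code: `TextOutputLayoutSketch` and `outputPaths` are hypothetical names, and the boolean parameter stands in for the `fs.output.always-create-directory` setting mentioned in the last example.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the output layouts listed above (hypothetical, not Flink code). */
final class TextOutputLayoutSketch {

    static List<String> outputPaths(String path, int parallelism, boolean alwaysCreateDirectory) {
        if (parallelism == 1 && !alwaysCreateDirectory) {
            return List.of(path); // a single file named after the path
        }
        // Otherwise the path is a directory with one numbered file per parallel task.
        List<String> files = new ArrayList<>();
        for (int task = 1; task <= parallelism; task++) {
            files.add(path + "/" + task);
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(outputPaths("path1", 3, false)); // directory with files 1, 2, 3
        System.out.println(outputPaths("path1", 1, false)); // single file
        System.out.println(outputPaths("path1", 1, true));  // directory with file 1
    }
}
```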
public DataSink<T> writeAsText(String filePath, FileSystem.WriteMode writeMode)
For each element of the DataSet the result of Object.toString() is written.
filePath
- The path pointing to the location the text file is written to.
writeMode
- Control the behavior for existing files. Options are NO_OVERWRITE and OVERWRITE.
TextOutputFormat, Output files and directories
public DataSink<String> writeAsFormattedText(String filePath, TextOutputFormat.TextFormatter<T> formatter)
For each element of the DataSet the result of TextOutputFormat.TextFormatter.format(Object) is written.
filePath
- The path pointing to the location the text file is written to.
formatter
- The formatter that is applied on every element of the DataSet.
TextOutputFormat, Output files and directories
public DataSink<String> writeAsFormattedText(String filePath, FileSystem.WriteMode writeMode, TextOutputFormat.TextFormatter<T> formatter)
For each element of the DataSet the result of TextOutputFormat.TextFormatter.format(Object) is written.
filePath
- The path pointing to the location the text file is written to.
writeMode
- Control the behavior for existing files. Options are NO_OVERWRITE and OVERWRITE.
formatter
- The formatter that is applied on every element of the DataSet.
TextOutputFormat, Output files and directories
public DataSink<T> writeAsCsv(String filePath)
Tuple DataSet as CSV file(s) to the specified location.
Note: Only a Tuple DataSet can be written as a CSV file.
For each Tuple field the result of Object.toString() is written. Tuple fields are
separated by the default field delimiter "comma" (,).
Tuples are separated by the newline character (\n).
filePath
- The path pointing to the location the CSV file is written to.
Tuple, CsvOutputFormat, Output files and directories
public DataSink<T> writeAsCsv(String filePath, FileSystem.WriteMode writeMode)
Tuple DataSet as CSV file(s) to the specified location.
Note: Only a Tuple DataSet can be written as a CSV file.
For each Tuple field the result of Object.toString() is written. Tuple fields are
separated by the default field delimiter "comma" (,).
Tuples are separated by the newline character (\n).
filePath
- The path pointing to the location the CSV file is written to.
writeMode
- The behavior regarding existing files. Options are NO_OVERWRITE and OVERWRITE.
Tuple, CsvOutputFormat, Output files and directories
public DataSink<T> writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter)
Tuple DataSet as CSV file(s) to the specified location with the specified
field and line delimiters.
Note: Only a Tuple DataSet can be written as a CSV file.
For each Tuple field the result of Object.toString() is written.
filePath
- The path pointing to the location the CSV file is written to.
rowDelimiter
- The row delimiter to separate Tuples.
fieldDelimiter
- The field delimiter to separate Tuple fields.
Tuple, CsvOutputFormat, Output files and directories
public DataSink<T> writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter, FileSystem.WriteMode writeMode)
Tuple DataSet as CSV file(s) to the specified location with the specified
field and line delimiters.
Note: Only a Tuple DataSet can be written as a CSV file. For each Tuple field the
result of Object.toString() is written.
filePath
- The path pointing to the location the CSV file is written to.
rowDelimiter
- The row delimiter to separate Tuples.
fieldDelimiter
- The field delimiter to separate Tuple fields.
writeMode
- The behavior regarding existing files. Options are NO_OVERWRITE and OVERWRITE.
Tuple, CsvOutputFormat, Output files and directories
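The formatting rule described for CSV output, with toString() per field, a field delimiter between fields, and a row delimiter between tuples, can be modeled in plain Java. This is a sketch, not Flink code: `CsvFormatSketch` and `toCsv` are hypothetical names, a `List` stands in for a tuple, and terminating every row (including the last) with the row delimiter is an assumption. No quoting or escaping is performed.

```java
import java.util.List;

/** Plain-Java model of the CSV formatting rule (hypothetical, not Flink code). */
final class CsvFormatSketch {

    static String toCsv(List<? extends List<?>> tuples, String rowDelimiter, String fieldDelimiter) {
        StringBuilder sb = new StringBuilder();
        for (List<?> tuple : tuples) {
            for (int i = 0; i < tuple.size(); i++) {
                if (i > 0) {
                    sb.append(fieldDelimiter);
                }
                sb.append(tuple.get(i).toString()); // each field is written via toString()
            }
            sb.append(rowDelimiter); // assumption: every row ends with the row delimiter
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Default delimiters: newline between tuples, comma between fields.
        System.out.print(toCsv(List.of(List.of(1, "a"), List.of(2, "b")), "\n", ","));
        // Custom delimiters.
        System.out.println(toCsv(List.of(List.of(1, "a"), List.of(2, "b")), ";", "|"));
    }
}
```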
public void print() throws Exception
System.out of the JVM that calls the print() method. For programs that are executed in a cluster, this method needs
to gather the contents of the DataSet back to the client, to print it there.
The string written for each element is defined by the Object.toString() method.
This method immediately triggers the program execution, similar to the collect() and count() methods.
Exception
printToErr(), printOnTaskManager(String)
public void printToErr() throws Exception
System.err of the JVM that calls the printToErr() method. For programs that are executed in a cluster, this method needs
to gather the contents of the DataSet back to the client, to print it there.
The string written for each element is defined by the Object.toString() method.
This method immediately triggers the program execution, similar to the collect() and count() methods.
Exception
print(), printOnTaskManager(String)
public DataSink<T> printOnTaskManager(String prefix)
To print the data to the console or stdout stream of the client process instead, use the print() method.
For each element of the DataSet the result of Object.toString() is written.
prefix
- The string to prefix each line of the output with. This helps identify
outputs from different printing sinks.
print()
@Deprecated @PublicEvolving public DataSink<T> print(String sinkIdentifier)
printOnTaskManager(String) instead.
For each element of the DataSet the result of Object.toString() is written.
sinkIdentifier
- The string to prefix the output with.
@Deprecated @PublicEvolving public DataSink<T> printToErr(String sinkIdentifier)
printOnTaskManager(String) instead, or the PrintingOutputFormat.
For each element of the DataSet the result of Object.toString() is written.
sinkIdentifier
- The string to prefix the output with.
public DataSink<T> write(FileOutputFormat<T> outputFormat, String filePath)
FileOutputFormat to a specified location. This method adds a
data sink to the program.
outputFormat
- The FileOutputFormat to write the DataSet.
filePath
- The path to the location where the DataSet is written.
FileOutputFormat
public DataSink<T> write(FileOutputFormat<T> outputFormat, String filePath, FileSystem.WriteMode writeMode)
FileOutputFormat to a specified location. This method adds a
data sink to the program.
outputFormat
- The FileOutputFormat to write the DataSet.
filePath
- The path to the location where the DataSet is written.
writeMode
- The mode of writing, indicating whether to overwrite existing files.
FileOutputFormat
public DataSink<T> output(OutputFormat<T> outputFormat)
OutputFormat. This method adds a data sink to the program.
Programs may have multiple data sinks. A DataSet may also have multiple consumers (data sinks
or transformations) at the same time.
outputFormat
- The OutputFormat to process the DataSet.
OutputFormat, DataSink
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.