@PublicEvolving public final class DataSetUtils extends Object
Modifier and Type | Method and Description |
---|---|
static <T> Utils.ChecksumHashCode | checksumHashCode(DataSet<T> input) Deprecated. This method will be removed at some point. |
static <T> DataSet<Tuple2<Integer,Long>> | countElementsPerPartition(DataSet<T> input) Goes over all the elements in each partition to retrieve the total number of elements per partition. |
static int | getBitSize(long value) |
static <T> PartitionOperator<T> | partitionByRange(DataSet<T> input, DataDistribution distribution, int... fields) Range-partitions a DataSet on the specified tuple field positions. |
static <T,K extends Comparable<K>> PartitionOperator<T> | partitionByRange(DataSet<T> input, DataDistribution distribution, KeySelector<T,K> keyExtractor) Range-partitions a DataSet using the specified key selector function. |
static <T> PartitionOperator<T> | partitionByRange(DataSet<T> input, DataDistribution distribution, String... fields) Range-partitions a DataSet on the specified fields. |
static <T> MapPartitionOperator<T,T> | sample(DataSet<T> input, boolean withReplacement, double fraction) Generates a sample of the DataSet in which each element is chosen with probability fraction. |
static <T> MapPartitionOperator<T,T> | sample(DataSet<T> input, boolean withReplacement, double fraction, long seed) Generates a sample of the DataSet in which each element is chosen with probability fraction. |
static <T> DataSet<T> | sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples) Generates a sample of the DataSet containing a fixed number of elements. |
static <T> DataSet<T> | sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples, long seed) Generates a sample of the DataSet containing a fixed number of elements. |
static <R extends Tuple,T extends Tuple> R | summarize(DataSet<T> input) Summarizes a DataSet of Tuples by collecting single-pass statistics for all columns. |
static <T> DataSet<Tuple2<Long,T>> | zipWithIndex(DataSet<T> input) Assigns a unique Long value to all elements in the input data set. |
static <T> DataSet<Tuple2<Long,T>> | zipWithUniqueId(DataSet<T> input) Assigns a unique Long value to all elements in the input data set as described below. |
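The summary above lists two sampling families: `sample` (each element chosen with a given probability) and `sampleWithSize` (a fixed number of elements). The plain-Java sketch below illustrates that difference for the without-replacement case only. It does not use Flink and is not how `DataSetUtils` is implemented; the class `SamplingSketch` and the methods `sampleByFraction` and `sampleFixedSize` are hypothetical names introduced for illustration, and reservoir sampling is one common technique assumed here for the fixed-size case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Plain-Java illustration only; none of these names are part of Flink. */
final class SamplingSketch {

    /** Keeps each element independently with probability {@code fraction} (no replacement). */
    static <T> List<T> sampleByFraction(List<T> input, double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1] without replacement");
        }
        Random random = new Random(seed);
        List<T> sample = new ArrayList<>();
        for (T element : input) {
            // nextDouble() is uniform in [0, 1), so this keeps the element with probability fraction.
            if (random.nextDouble() < fraction) {
                sample.add(element);
            }
        }
        return sample;
    }

    /** Reservoir sampling: returns exactly min(numSamples, input.size()) elements. */
    static <T> List<T> sampleFixedSize(List<T> input, int numSamples, long seed) {
        Random random = new Random(seed);
        List<T> reservoir = new ArrayList<>(numSamples);
        int seen = 0;
        for (T element : input) {
            seen++;
            if (reservoir.size() < numSamples) {
                // Fill the reservoir with the first numSamples elements.
                reservoir.add(element);
            } else {
                // Replace a random slot with probability numSamples / seen.
                int slot = random.nextInt(seen);
                if (slot < numSamples) {
                    reservoir.set(slot, element);
                }
            }
        }
        return reservoir;
    }
}
```

With the fraction-based method the size of the result varies from run to run unless the seed is fixed, whereas the fixed-size method always returns the requested number of elements when the input is large enough.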
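public static <T> DataSet<Tuple2<Integer,Long>> countElementsPerPartition(DataSet<T> input)
Goes over all the elements in each partition to retrieve the total number of elements per partition.
Parameters:
input - the DataSet received as input

public static <T> DataSet<Tuple2<Long,T>> zipWithIndex(DataSet<T> input)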
Assigns a unique Long value to all elements in the input data set. The generated values are consecutive.
Parameters:
input - the input data set

public static <T> DataSet<Tuple2<Long,T>> zipWithUniqueId(DataSet<T> input)
Assigns a unique Long value to all elements in the input data set as described below.
Parameters:
input - the input data set

public static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction)
Generates a sample of the DataSet in which each element is chosen with probability fraction.
Parameters:
withReplacement - Whether an element can be selected more than once.
fraction - Probability that each element is chosen. It should be in [0, 1] without replacement and in [0, ∞) with replacement. When fraction is larger than 1, each element is expected to be selected multiple times into the sample on average.

public static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction, long seed)
Generates a sample of the DataSet in which each element is chosen with probability fraction.
Parameters:
withReplacement - Whether an element can be selected more than once.
fraction - Probability that each element is chosen. It should be in [0, 1] without replacement and in [0, ∞) with replacement. When fraction is larger than 1, each element is expected to be selected multiple times into the sample on average.
seed - Random number generator seed.

public static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples)
Generates a sample of the DataSet containing a fixed number of elements.
NOTE: Sampling with a fixed size is not as efficient as sampling with a fraction; use sample with fraction unless you need an exact sample size.
Parameters:
withReplacement - Whether an element can be selected more than once.
numSamples - The expected sample size.

public static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples, long seed)
Generates a sample of the DataSet containing a fixed number of elements.
NOTE: Sampling with a fixed size is not as efficient as sampling with a fraction; use sample with fraction unless you need an exact sample size.
Parameters:
withReplacement - Whether an element can be selected more than once.
numSamples - The expected sample size.
seed - Random number generator seed.

public static <T> PartitionOperator<T> partitionByRange(DataSet<T> input, DataDistribution distribution, int... fields)
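Range-partitions a DataSet on the specified tuple field positions.

public static <T> PartitionOperator<T> partitionByRange(DataSet<T> input, DataDistribution distribution, String... fields)
Range-partitions a DataSet on the specified fields.

public static <T,K extends Comparable<K>> PartitionOperator<T> partitionByRange(DataSet<T> input, DataDistribution distribution, KeySelector<T,K> keyExtractor)
Range-partitions a DataSet using the specified key selector function.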
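As a rough illustration of what range partitioning means, the plain-Java sketch below assigns integer keys to partitions using sorted upper boundaries. It is a sketch under stated assumptions, not Flink's implementation: the class `RangePartitionSketch`, the method `partitionFor`, and the `upperBoundaries` array are hypothetical stand-ins for the boundaries that a `DataDistribution` would supply, and the boundaries are assumed to be distinct and in ascending order.

```java
import java.util.Arrays;

/** Plain-Java illustration only; none of these names are part of Flink. */
final class RangePartitionSketch {

    /**
     * Returns the partition index for {@code key}.
     * upperBoundaries[i] is the largest key that belongs to partition i;
     * keys above the last boundary go to one extra, final partition.
     */
    static int partitionFor(int key, int[] upperBoundaries) {
        int pos = Arrays.binarySearch(upperBoundaries, key);
        // A non-negative result is an exact boundary hit; otherwise binarySearch
        // returns -(insertionPoint) - 1, and the insertion point is the partition index.
        return pos >= 0 ? pos : -(pos + 1);
    }
}
```

For example, with boundaries `{10, 20, 30}` the keys fall into four partitions: up to 10, 11 to 20, 21 to 30, and everything above 30. Because neighbouring keys land in the same partition, the partitions are ordered relative to each other, which is the property that distinguishes range partitioning from hash partitioning.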
public static <R extends Tuple,T extends Tuple> R summarize(DataSet<T> input) throws Exception
Summarizes a DataSet of Tuples by collecting single-pass statistics for all columns.
Example usage:
DataSet<Tuple3<Double, String, Boolean>> input = // [...]
Tuple3<NumericColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = DataSetUtils.summarize(input);
summary.f0.getStandardDeviation();
summary.f1.getMaxLength();
Throws:
Exception
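To show what "single-pass statistics" can look like for a numeric column, here is a minimal plain-Java sketch using Welford's online algorithm. It is an illustration only and makes no claim about how `summarize` or `NumericColumnSummary` compute their values; the class `RunningStats` is a hypothetical name, and the choice of the sample standard deviation (dividing by n - 1) is an assumption of this sketch.

```java
/** Plain-Java illustration only; not part of Flink. */
final class RunningStats {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    /** Folds one value into the statistics without storing it. */
    void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        // Uses the updated mean, which keeps the update numerically stable.
        m2 += delta * (value - mean);
    }

    long getCount() {
        return count;
    }

    double getMean() {
        return mean;
    }

    /** Sample standard deviation; NaN when fewer than two values were seen. */
    double getStandardDeviation() {
        return count < 2 ? Double.NaN : Math.sqrt(m2 / (count - 1));
    }
}
```

Each value is visited once and only three fields are kept, so the memory cost is constant regardless of how many rows the column has.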
@Deprecated public static <T> Utils.ChecksumHashCode checksumHashCode(DataSet<T> input) throws Exception
Deprecated. This method will be removed at some point.
Throws:
Exception
public static int getBitSize(long value)
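This page gives no description for `getBitSize`, and the scheme that `zipWithUniqueId` refers to as "described below" does not appear here. The plain-Java sketch below shows one plausible reading, stated as an assumption: the bit size is the number of bits needed to represent a non-negative value, and such a bit size can be used to combine a per-task counter with a task index into identifiers that never collide across tasks. The class `UniqueIdSketch` and the methods `bitSize` and `uniqueId` are hypothetical and do not call Flink.

```java
/** Plain-Java illustration only; an assumed scheme, not Flink's implementation. */
final class UniqueIdSketch {

    /** Number of bits needed to represent a non-negative value (0 needs 0 bits). */
    static int bitSize(long value) {
        return 64 - Long.numberOfLeadingZeros(value);
    }

    /**
     * Combines a per-task counter with the task index.
     * The low bits hold the task index and the high bits hold the counter,
     * so two different tasks can never produce the same id.
     */
    static long uniqueId(long counter, int taskIndex, int parallelism) {
        int shift = bitSize(parallelism - 1); // enough low bits for any task index
        return (counter << shift) + taskIndex;
    }
}
```

Identifiers built this way are unique but not consecutive, which matches the distinction the page draws between `zipWithUniqueId` and `zipWithIndex`, whose generated values are consecutive.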
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.