This documentation is for an out-of-date version of Apache Flink Machine Learning Library. We recommend you use the latest stable version.

One Hot Encoder #

One-hot encoding maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

OneHotEncoder can transform multiple columns, returning an one-hot-encoded output vector column for each input column.

Input Columns #

Param name	Type	Default	Description
inputCols	Integer	`null`	Label index

Output Columns #

Param name	Type	Default	Description
outputCols	Vector	`null`	Encoded binary vector

Parameters #

Key	Default	Type	Required	Description
inputCols	`null`	String	yes	Input column names.
outputCols	`null`	String	yes	Output column names.
handleInvalid	`HasHandleInvalid.ERROR_INVALID`	String	No	Strategy to handle invalid entries. Supported values: `HasHandleInvalid.ERROR_INVALID`, `HasHandleInvalid.SKIP_INVALID`
dropLast	`true`	Boolean	no	Whether to drop the last category.

Examples #

import org.apache.flink.ml.feature.onehotencoder.OneHotEncoder;
import org.apache.flink.ml.feature.onehotencoder.OneHotEncoderModel;

List<Row> trainData = Arrays.asList(Row.of(0.0), Row.of(1.0), Row.of(2.0), Row.of(0.0));
Table trainTable = tEnv.fromDataStream(env.fromCollection(trainData)).as("input");

List<Row> predictData = Arrays.asList(Row.of(0.0), Row.of(1.0), Row.of(2.0));
Table predictTable = tEnv.fromDataStream(env.fromCollection(predictData)).as("input");

OneHotEncoder estimator = new OneHotEncoder().setInputCols("input").setOutputCols("output");
OneHotEncoderModel model = estimator.fit(trainTable);
Table outputTable = model.transform(predictTable)[0];

outputTable.execute().print();