One Hot Encoder
One-hot encoding maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
OneHotEncoder can transform multiple columns, returning a one-hot-encoded output vector column for each input column.
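
To make the mapping concrete, here is a minimal plain-Java sketch that does not use Flink ML. The class `OneHotSketch` and its `encode` method are hypothetical names chosen for illustration, and the sketch assumes that dropping the last category (the `dropLast` parameter described under Parameters) means the last label index is encoded as an all-zero vector. The library itself emits a vector-typed column rather than a `double[]`.

```java
import java.util.Arrays;

// Illustrative only: this class is not part of Flink ML.
class OneHotSketch {
    /**
     * Encodes a label index as a binary vector.
     * Assumption: with dropLast, the vector has numCategories - 1 slots
     * and the last category is represented by all zeros.
     */
    static double[] encode(int index, int numCategories, boolean dropLast) {
        if (index < 0 || index >= numCategories) {
            throw new IllegalArgumentException("Label index out of range: " + index);
        }
        int size = dropLast ? numCategories - 1 : numCategories;
        double[] vector = new double[size];
        // Only set a one-value when the index has a slot of its own.
        if (index < size) {
            vector[index] = 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Three categories (0, 1, 2) with the last category dropped.
        for (int i = 0; i < 3; i++) {
            System.out.println(i + " -> " + Arrays.toString(encode(i, 3, true)));
        }
    }
}
```

Run as written, `main` prints `0 -> [1.0, 0.0]`, `1 -> [0.0, 1.0]`, and `2 -> [0.0, 0.0]`. The all-zero vector for index 2 is why the description says "at most" a single one-value.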
Input Columns

| Param name | Type | Default | Description |
|---|---|---|---|
| inputCols | Integer | null | Label index |
Output Columns

| Param name | Type | Default | Description |
|---|---|---|---|
| outputCols | Vector | null | Encoded binary vector |
Parameters

| Key | Default | Type | Required | Description |
|---|---|---|---|---|
| inputCols | null | String | yes | Input column names. |
| outputCols | null | String | yes | Output column names. |
| handleInvalid | HasHandleInvalid.ERROR_INVALID | String | no | Strategy to handle invalid entries. Supported values: HasHandleInvalid.ERROR_INVALID, HasHandleInvalid.SKIP_INVALID. |
| dropLast | true | Boolean | no | Whether to drop the last category. |
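
The two `handleInvalid` strategies can be pictured with a plain-Java sketch that does not use Flink ML. It rests on assumptions not stated in this page: that an "invalid" entry is a label index outside the range of known categories, that the error strategy raises an exception, and that the skip strategy drops the offending row. The names `HandleInvalidSketch`, `Strategy`, and `encodeAll` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: this class is not part of Flink ML.
class HandleInvalidSketch {
    // Hypothetical stand-ins for ERROR_INVALID and SKIP_INVALID.
    enum Strategy { ERROR, SKIP }

    static List<double[]> encodeAll(int[] indices, int numCategories, Strategy strategy) {
        List<double[]> encoded = new ArrayList<>();
        for (int index : indices) {
            boolean valid = index >= 0 && index < numCategories;
            if (!valid) {
                if (strategy == Strategy.ERROR) {
                    // Assumed ERROR behavior: fail on the first invalid entry.
                    throw new IllegalArgumentException("Invalid label index: " + index);
                }
                // Assumed SKIP behavior: drop the row and continue.
                continue;
            }
            double[] vector = new double[numCategories];
            vector[index] = 1.0;
            encoded.add(vector);
        }
        return encoded;
    }

    public static void main(String[] args) {
        int[] indices = {0, 5, 2}; // 5 is out of range for three categories
        for (double[] vector : encodeAll(indices, 3, Strategy.SKIP)) {
            System.out.println(Arrays.toString(vector));
        }
    }
}
```

With the skip strategy, the out-of-range index 5 is dropped and only two vectors are printed. Switching to `Strategy.ERROR` makes the same input throw instead.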
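Examples
import org.apache.flink.ml.feature.onehotencoder.OneHotEncoder;
import org.apache.flink.ml.feature.onehotencoder.OneHotEncoderModel;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;
import java.util.Arrays;
import java.util.List;
// The statements below belong inside a method; they first create the environments.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
// Training and prediction tables, each with a single column of label indices named "input".
List<Row> trainData = Arrays.asList(Row.of(0.0), Row.of(1.0), Row.of(2.0), Row.of(0.0));
Table trainTable = tEnv.fromDataStream(env.fromCollection(trainData)).as("input");
List<Row> predictData = Arrays.asList(Row.of(0.0), Row.of(1.0), Row.of(2.0));
Table predictTable = tEnv.fromDataStream(env.fromCollection(predictData)).as("input");
// Fit the encoder on the training table, then encode the prediction table and print the result.
OneHotEncoder estimator = new OneHotEncoder().setInputCols("input").setOutputCols("output");
OneHotEncoderModel model = estimator.fit(trainTable);
Table outputTable = model.transform(predictTable)[0];
outputTable.execute().print();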