This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest stable version.
$$ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\wv}{\mathbf{w}} \newcommand{\av}{\mathbf{\alpha}} \newcommand{\bv}{\mathbf{b}} \newcommand{\N}{\mathbb{N}} \newcommand{\id}{\mathbf{I}} \newcommand{\ind}{\mathbf{1}} \newcommand{\0}{\mathbf{0}} \newcommand{\unit}{\mathbf{e}} \newcommand{\one}{\mathbf{1}} \newcommand{\zero}{\mathbf{0}} \newcommand\rfrac[2]{^{#1}\!/_{#2}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} $$

Stochastic Outlier Selection

Description

An outlier is one or multiple observations that deviates quantitatively from the majority of the data set and may be the subject of further investigation. Stochastic Outlier Selection (SOS) developed by Jeroen Janssens[1] is an unsupervised outlier-selection algorithm that takes as input a set of vectors. The algorithm applies affinity-based outlier selection and outputs for each data point an outlier probability. Intuitively, a data point is considered to be an outlier when the other data points have insufficient affinity with it.

Outlier detection has its application in a number of field, for example, log analysis, fraud detection, noise removal, novelty detection, quality control, sensor monitoring, etc. If a sensor turns faulty, it is likely that it will output values that deviate markedly from the majority.

For more information, please consult the PhD Thesis of Jeroens Janssens on Outlier Selection and One-Class Classification which introduces the algorithm.

Parameters

The stochastic outlier selection algorithm implementation can be controlled by the following parameters:

Parameters Description
Perplexity

Perplexity can be interpreted as the k in k-nearest neighbor algorithms. The difference with SOS being a neighbor is not a binary property, but a probabilistic one, and therefore it a real number. Must be between 1 and n-1, where n is the number of points. A good starting point can be obtained by using the square root of the number of observations. (Default value: 30)

ErrorTolerance

The accepted error tolerance to reduce computational time when approximating the affinity. It will sacrifice accuracy in return for reduced computational time. (Default value: 1e-20)

MaxIterations

The maximum number of iterations to approximate the affinity of the algorithm. (Default value: 10)

Example

val data = env.fromCollection(List(
  LabeledVector(0.0, DenseVector(1.0, 1.0)),
  LabeledVector(1.0, DenseVector(2.0, 1.0)),
  LabeledVector(2.0, DenseVector(1.0, 2.0)),
  LabeledVector(3.0, DenseVector(2.0, 2.0)),
  LabeledVector(4.0, DenseVector(5.0, 8.0)) // The outlier!
))

val sos = new StochasticOutlierSelection().setPerplexity(3)

val outputVector = sos
  .transform(data)
  .collect()

val expectedOutputVector = Map(
  0 -> 0.2790094479202896,
  1 -> 0.25775014551682535,
  2 -> 0.22136130977995766,
  3 -> 0.12707053787018444,
  4 -> 0.9922779902453757 // The outlier!
)

outputVector.foreach(output => expectedOutputVector(output._1) should be(output._2))

References

[1]J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic Outlier Selection. Technical Report TiCC TR 2012-001, Tilburg University, Tilburg, the Netherlands, 2012.