Eliminate Outliers

Description

The handler automatically eliminates outliers and extreme values from data sets. For each field of the source data set, the user sets the criteria for detecting outliers and extreme values by specifying the allowable number of standard deviations or interquartile ranges. Outliers are data values that deviate markedly from the average. Extreme values are values that deviate from typical values so far that they are no longer consistent with the logic of the processes and events under study.

Ports

Input

  •  Input data source: data table.

Output

  •  Output data set: the source table after processing.
  •  Outliers: the table that contains the source table rows in which outliers have been detected.
  •  Extreme values: the table that contains the source table rows in which extreme values have been detected.

Wizard

  • Source data ordered: select this checkbox when the numeric series is known to be ordered, that is, its values are arranged in ascending or descending order (for example, by date or time). The availability of some elimination methods depends on the state of this checkbox. The checkbox state does not affect the processing of logical and string fields.
  • The configuration area for outlier and extreme value elimination methods contains a list of the fields available for processing. A checkbox next to each field enables or disables processing of that field. Once a field is selected, its processing method can be set.
  • Determination of outliers and extreme values: two detection methods are available:
    • Standard deviation: a value is considered anomalous if it deviates from the mean by more than the set number of standard deviations. This parameter can be set separately for outliers and extreme values. This method can be used if the data distribution is known to be close to normal.
    • Interquartile range: the criterion is based on the distance between the first and the third quartiles of the distribution of values. If a value deviates from the median by more than the set number of interquartile ranges, it is considered anomalous. This parameter is set for outliers and for extreme values. This method can also be used when the data distribution differs from normal.
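The two criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the handler's actual implementation: the function names, the default multipliers, the use of the population standard deviation, and the "inclusive" quartile method are all hypothetical choices that the description does not specify.

```python
import statistics


def _label(distance, outlier_limit, extreme_limit):
    # A value beyond the extreme limit is "extreme"; beyond the outlier
    # limit only, it is an "outlier"; otherwise it is "normal".
    if distance > extreme_limit:
        return "extreme"
    if distance > outlier_limit:
        return "outlier"
    return "normal"


def classify_std(values, outlier_k=3.0, extreme_k=5.0):
    """Standard deviation criterion: distance from the mean, in units of sigma.

    Assumption: population standard deviation; default multipliers are made up.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [_label(abs(v - mean), outlier_k * sd, extreme_k * sd) for v in values]


def classify_iqr(values, outlier_k=1.5, extreme_k=3.0):
    """Interquartile range criterion: distance from the median, in units of IQR.

    Assumption: quartiles computed with the "inclusive" method.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [_label(abs(v - median), outlier_k * iqr, extreme_k * iqr) for v in values]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 60]
    print(classify_std(sample, outlier_k=2.0, extreme_k=4.0))
    print(classify_iqr(sample))
```

On skewed samples like the one in the usage lines, the two criteria can disagree, because large values inflate the mean and standard deviation while the median and quartiles stay stable. This is consistent with the note that the interquartile range method also suits non-normal data.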

The following elimination methods are available both for outliers and extreme values:

  • Leave unchanged.
  • Delete records: delete the records with anomalous values from the data set.
  • Replace with average: replace anomalous values with the average column value.
  • Replace with median: replace anomalous values with the median calculated for the column.
  • Replace with most frequent: replace anomalous values with the most frequent column value. The replacement is the average of the values in the most populated histogram bin. The number of bins depends on the sample size: the larger the sample, the more bins.
  • Replace with set value: replace anomalous values with a manually entered value.
  • Limit: replace anomalous values with the boundary value beyond which values are considered anomalous.
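The elimination methods listed above can be illustrated with a short Python sketch. It is a simplified model under assumptions, not the handler's code: all function names are hypothetical, the histogram uses equal-width bins with the bin count passed in by the caller (the rule that ties bin count to sample size is not given here), and the mean or median used as a replacement is computed by the caller.

```python
import statistics


def delete_records(rows, is_anomalous):
    """Delete records: keep only the rows that are not flagged."""
    return [row for row, bad in zip(rows, is_anomalous) if not bad]


def replace_anomalous(values, is_anomalous, replacement):
    """Replace with average / median / set value: substitute flagged values.

    The caller supplies the replacement, e.g. statistics.fmean(values),
    statistics.median(values), or a manually chosen constant.
    """
    return [replacement if bad else v for v, bad in zip(values, is_anomalous)]


def limit_values(values, lower, upper):
    """Limit: clamp each value to the boundaries of the non-anomalous range."""
    return [min(max(v, lower), upper) for v in values]


def most_frequent_value(values, bins):
    """Replace with most frequent: mean of the fullest equal-width histogram bin.

    Assumption: `bins` is chosen by the caller; ties go to the lowest bin.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / bins
    buckets = [[] for _ in range(bins)]
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        buckets[index].append(v)
    return statistics.fmean(max(buckets, key=len))


if __name__ == "__main__":
    column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 60]
    median = statistics.median(column)
    flags = [abs(v - median) > 7.5 for v in column]  # hypothetical detection result
    print(replace_anomalous(column, flags, median))
    print(limit_values(column, median - 7.5, median + 7.5))
    print(most_frequent_value(column, bins=4))
```

Deleting records shortens the table, while the replacement and limit methods keep every row, which matters for ordered series where removing a row would leave a gap in the sequence.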

The set of available methods is determined for each field by three data characteristics together (refer to data):

  • Degree of order
  • Type
  • Kind

Applicability of the methods by these characteristics:

Method | Unordered set: Discrete | Unordered set: Continuous | Ordered set: Discrete | Ordered set: Continuous
Leave unchanged
Delete records
Replace with average
Replace with median
Replace with most frequent
Replace with set value
Limit
