Eliminate Outliers Component
The Eliminate Outliers component automatically corrects outliers and extreme values in your datasets. Define criteria for each field by specifying the allowable standard deviation or interquartile range.
Outliers are values that deviate significantly from the mean. Extreme values deviate so far from typical data that they no longer fit the logic of the processes or phenomena you are studying.
Ports
Input
Input data source: The data table you want to process.
Output
Output dataset: The processed data table.
Outliers: A table containing rows where the component detected outliers.
Extremes: A table containing rows where the component detected extreme values.
Configuration wizard
- Source data ordered: Check this box if your numeric series follows a specific order (e.g., sorted by date or time). This setting affects specific processing methods but does not change how the system handles logical or string fields.
- Input fields: Select the fields to process from this list. Enable the checkbox for a field, then click the field to configure its processing method.
- Determination of outliers and extreme values: Choose one of two identification methods:
  - Standard deviation: Best for data that follows a normal distribution. The system flags values that deviate from the mean by more than your specified number of standard deviations. You can set different thresholds for outliers and extreme values.
  - Interquartile width: This method calculates the interquartile width (also called the interquartile range), which is the distance between the 1st and 3rd quartiles. The system treats values as anomalies if they deviate from the median by more than the multiple of the interquartile width you specify. Use this for non-normal distributions (see Interquartile width).
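The two identification methods can be illustrated with a minimal Python sketch. It is not the component's implementation: the function names, the default thresholds, the use of the population standard deviation, and the quartile interpolation of `statistics.quantiles` are all assumptions made for illustration. Following the description above, the interquartile variant measures distance from the median.

```python
import statistics


def classify(values, center, spread, outlier_k, extreme_k):
    """Label each value by its distance from `center`, in units of `spread`."""
    labels = []
    for v in values:
        distance = abs(v - center)
        if distance > extreme_k * spread:
            labels.append("extreme")
        elif distance > outlier_k * spread:
            labels.append("outlier")
        else:
            labels.append("normal")
    return labels


def classify_by_std(values, outlier_k=3.0, extreme_k=5.0):
    # Assumption: population standard deviation; thresholds are illustrative defaults.
    center = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return classify(values, center, spread, outlier_k, extreme_k)


def classify_by_iqr(values, outlier_k=1.5, extreme_k=3.0):
    # Interquartile width = distance between the 1st and 3rd quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4)
    center = statistics.median(values)
    return classify(values, center, q3 - q1, outlier_k, extreme_k)


readings = [9, 9, 10, 10, 10, 10, 11, 11, 12, 100]
print(classify_by_iqr(readings))
print(classify_by_std(readings, outlier_k=2.0, extreme_k=2.5))
```

In this sketch a single very large value inflates the mean and standard deviation, so it needs lower thresholds to be flagged by `classify_by_std`. The quartile-based variant is less sensitive to that effect, which is consistent with recommending it for non-normal data.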
You can apply any of the following processing methods to outliers and to extreme values:
- Leave unchanged: Keep the values as they are.
- Delete records: Remove rows containing anomalies from the dataset.
- Replace with average: Replace anomalies with the column's mean value.
- Replace with median: Replace anomalies with the column's median value.
- Replace with most frequent: Replace anomalies with the most common value. To find it, the system splits the values into intervals, choosing the number of intervals based on your sample size, and uses the mean of the most probable interval.
- Replace with set value: Enter a specific manual value to replace anomalies.
- Limit: Replace each anomaly with the nearest boundary of the allowed range (the threshold it exceeded).
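As a rough illustration of how these processing methods differ, the sketch below applies them to a plain list of numbers. It is a simplified model, not the component's code: `handle_anomalies`, its method names, and the precomputed `lower` and `upper` bounds are hypothetical, and the replacement mean and median are computed over the whole column, which is one literal reading of the descriptions above. "Replace with most frequent" is left out because the rule for choosing the number of intervals is not specified here.

```python
import statistics


def handle_anomalies(values, lower, upper, method="limit", set_value=None):
    """Return a new list in which values outside [lower, upper] are processed."""

    def is_anomaly(v):
        return v < lower or v > upper

    if method == "leave":
        # Leave unchanged: keep every value as it is.
        return list(values)
    if method == "delete":
        # Delete records: drop the rows that contain anomalies.
        return [v for v in values if not is_anomaly(v)]
    if method == "limit":
        # Limit: clamp each anomaly to the nearest boundary.
        return [min(max(v, lower), upper) for v in values]

    if method == "average":
        # Assumption: the mean is taken over the whole column, anomalies included.
        fill = statistics.fmean(values)
    elif method == "median":
        fill = statistics.median(values)
    elif method == "set_value":
        if set_value is None:
            raise ValueError("set_value is required for method='set_value'")
        fill = set_value
    else:
        raise ValueError(f"unknown method: {method}")

    # Replace only the anomalous values with the chosen fill value.
    return [fill if is_anomaly(v) else v for v in values]


column = [9, 10, 11, 100]
for name in ("leave", "delete", "limit", "average", "median"):
    print(name, handle_anomalies(column, lower=0, upper=20, method=name))
```

Note that only "delete" changes the number of rows; every other method keeps the table the same length.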
The available methods for each field depend on three characteristics (see data):
- Data type
- Ordered/unordered
- Discrete/continuous
The following table shows the applicability of different methods depending on these data properties:
| Method | Unordered dataset | Ordered dataset | ||
|---|---|---|---|---|
| Leave unchanged | ||||
| Delete records | ||||
| Replace with average | ||||
| Replace with median | ||||
| Replace with most frequent | ||||
| Replace with set value | ||||
| Limit | ||||
Read on: Smoothing Component