Duplicates and Inconsistencies
Description
Use this component to detect duplicate and inconsistent records in the source dataset.
Duplicates: Records where all input and output fields match exactly. They create redundancy and inflate dataset size without adding value.
Inconsistencies: Records where all input fields match, but at least one output field differs. They distort analysis and reduce model quality by violating expected data patterns.
The algorithm identifies records with identical input fields and checks whether their output fields match (duplicates) or differ (inconsistencies).
Let's analyze the following dataset for duplicates and inconsistencies. In the input port settings, we set the Input usage type for Field 1 and Field 2, and the Output usage type for Field 3 and Field 4.
Source table:
| Field 1 | Field 2 | Field 3 | Field 4 |
|---|---|---|---|
| 01.01.2019 | 2 | 1000 | 1500 |
| 21.05.2019 | 3 | 1000 | 1500 |
| 21.05.2019 | 3 | 700 | 1500 |
| 21.05.2019 | 3 | 700 | 1500 |
| 01.09.2019 | 4 | 1200 | 1700 |
| 01.09.2019 | 4 | 1200 | 1700 |
Output table:
| Duplicate | Duplicate group | Inconsistency | Inconsistency group | Field 1 | Field 2 | Field 3 | Field 4 |
|---|---|---|---|---|---|---|---|
| false | | false | | 01.01.2019 | 2 | 1 000 | 1 500 |
| false | | true | 1 | 21.05.2019 | 3 | 1 000 | 1 500 |
| true | 1 | true | 1 | 21.05.2019 | 3 | 700 | 1 500 |
| true | 1 | true | 1 | 21.05.2019 | 3 | 700 | 1 500 |
| true | 2 | false | | 01.09.2019 | 4 | 1 200 | 1 700 |
| true | 2 | false | | 01.09.2019 | 4 | 1 200 | 1 700 |
The analysis found two duplicate groups and one inconsistency group.
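The check described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the component's actual implementation: the function name `find_duplicates_and_inconsistencies`, the dictionary-based row format, and the rule that groups are numbered in order of first appearance are all assumptions chosen to match the example tables.

```python
from collections import Counter, defaultdict


def find_duplicates_and_inconsistencies(rows, input_fields, output_fields):
    """Flag duplicate and inconsistent rows (hypothetical helper, not the product API)."""
    input_fields, output_fields = list(input_fields), list(output_fields)
    if not input_fields:
        raise ValueError("At least one Input field is required")

    def key(row, fields):
        return tuple(row[f] for f in fields)

    # How many times each full record (inputs + outputs) occurs.
    full_counts = Counter(key(r, input_fields + output_fields) for r in rows)
    # Distinct output combinations seen for each input combination.
    outputs_by_input = defaultdict(set)
    for r in rows:
        outputs_by_input[key(r, input_fields)].add(key(r, output_fields))

    dup_ids, inc_ids = {}, {}  # group numbers, assigned on first appearance, from 1
    result = []
    for r in rows:
        in_key = key(r, input_fields)
        full_key = key(r, input_fields + output_fields)
        is_dup = full_counts[full_key] > 1
        is_inc = len(outputs_by_input[in_key]) > 1
        dup_group = dup_ids.setdefault(full_key, len(dup_ids) + 1) if is_dup else None
        inc_group = inc_ids.setdefault(in_key, len(inc_ids) + 1) if is_inc else None
        result.append({
            "Duplicate": is_dup,
            "Duplicate group": dup_group,
            "Inconsistency": is_inc,
            "Inconsistency group": inc_group,
            **r,
        })
    return result


# The source table from the example above.
rows = [
    {"Field 1": "01.01.2019", "Field 2": 2, "Field 3": 1000, "Field 4": 1500},
    {"Field 1": "21.05.2019", "Field 2": 3, "Field 3": 1000, "Field 4": 1500},
    {"Field 1": "21.05.2019", "Field 2": 3, "Field 3": 700, "Field 4": 1500},
    {"Field 1": "21.05.2019", "Field 2": 3, "Field 3": 700, "Field 4": 1500},
    {"Field 1": "01.09.2019", "Field 2": 4, "Field 3": 1200, "Field 4": 1700},
    {"Field 1": "01.09.2019", "Field 2": 4, "Field 3": 1200, "Field 4": 1700},
]
result = find_duplicates_and_inconsistencies(
    rows, ["Field 1", "Field 2"], ["Field 3", "Field 4"]
)
for row in result:
    print(row)
```

Note that the second row is flagged as inconsistent but not as a duplicate: its inputs match rows three and four, yet no other record shares its exact output values.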
Ports
Input
Input data source (data table): Accepts a data table. In the port settings, select the Input and Output usage types for the fields that you want to investigate.
Note: You must define at least one Input column.
Output
Output dataset: Returns a table with the following structure:
- Required fields:
  - Duplicate: A logical value. Indicates whether the row is a duplicate.
  - Duplicate group: Groups identical records (where both inputs and outputs match).
  - Inconsistency: A logical value. Indicates whether the row is inconsistent with another row (i.e., their inputs match but outputs differ).
  - Inconsistency group: Groups records with matching inputs but differing outputs.
Note: Group numbering starts at 1.
Read on: Correlation Analysis