Data Quality Visualizer

This visualizer provides a comprehensive assessment of data quality for each field.

Interface

When you enter the visualizer, its toolbar provides the following tabs and buttons:

Summary: This tab provides the dataset overview. When you first enter the visualizer, it will display the fields available for processing. After selecting them and calculating the necessary statistics, data quality indicators and detected problems will be shown here.
Discrete: Click this button to review the data quality indicators calculated for discrete fields.
Continuous: Click this button to obtain a data quality overview for continuous fields.
Parameters: Click this to open a window for the selection and configuration of data quality indicators:
- Selection of indicators: Select the metrics for data quality assessment.
- Configure parameters: This is a tab for setting up outlier and extreme values detection methods. Here you can select one of two methods, change their settings, and set the maximum percentage of missing (null) values.
Absolute or relative values: This button switches between absolute and percent representation of quantitative values.
Export: Use this button to export a data quality report. Please note that only the current screen (e.g., Summary, Discrete, Continuous, etc.) will be exported. This command has a dropdown for choosing one of two available export formats:
- Export to XLS (html)
- Export to HTML The name of the format you select appears on the button.
Detailing: This command (found on the left of the toolbar) is only available for the Discrete and Continuous tabs. It shows the input dataset filtered by the specified metric for the selected field. For example, selecting a Nulls cell will show all records from the input set that have missing values in the selected field.
Histogram: The histogram is only available for the Discrete and Continuous tabs. The button opens a pane on the left with a histogram for the selected field.

Dataset statistics

Displayed on the right-side of the Summary tab. This pane can be hidden by clicking the Show/Hide icon on its left. To open it, click the icon again (it will be on the right side of the main window). The pane displays the following characteristics:

Determination of outliers and extreme values: Shows the identification method set in Configure parameters; the default method is Standard deviation.
Columns: The total number of columns in the dataset.
Rows: The total number of records in the dataset.
Fully filled fields: The percentage of fields that do not contain missing values.
Fully filled records: The percentage of records that do not contain missing values.
Suitable columns: The number of suitable columns. A column is considered unsuitable if:
- the percentage of missing values is higher than the Maximum percentage of null data (as set in Configure parameters)
- the Quality indicator (included in the Selection of indicators) of a discrete field is 0 (the field contains a constant)
EPV index: The ratio of the number of records to the number of columns.

Configuration

On opening the Data Quality visualizer, you will see the Summary tab listing all the dataset fields. Select the fields you need for processing by marking/unmarking their checkboxes. (By default, all fields are marked.) Note that the Discrete and Continuous tabs are initially disabled.

Figure 1. Selecting fields for processing

Available quality metrics:

Go to the Parameters tab to select the necessary metrics:

Metric	Data Type	Description
Quality indicator		Reflects the diversity of values in the field
Histogram		Histogram preview
Extremes		The number of records containing extreme values
Empty		The number of records with empty values
Trailing spaces		The number of records with values that end with a whitespace (this metric does not include records consisting only of whitespace characters)
Range of values		The value range in the format: `first value`...`last value` (for a sorted list of values)
Zeros		The number of records where values are zero
Box plot		Calculates the metrics: Minimum outlier, Lower quartile, Median, Upper quartile, Maximum outlier, Sample maximum
Minimum		Minimum value
Mean		Mean value
Mode		The most frequently occurring value¹
Nulls		The number of records in which a value for the field is missing
Values		Displays a chart showing the diversity of values across rows
Outliers		The number of records in which the value is an outlier
Whitespaces		The number of records with values consisting only of whitespace characters
String lengths		The range of string lengths in the format: minimum length – maximum length
Negative		The number of records with negative values
Infinities		The number of records where values are equal to infinity
Unique		The number of unique values
Maximum		Maximum value
Median		The median for the field
Monotony		Shows the uniformity of field values (Decreasing, Increasing, or Non-monotonic)

¹. If there are multiple mode values in the field, the first of these values is shown, and the multimodality icon is displayed. ↩

Configuring specific indicators

After selecting the necessary indicators, go to the Configure parameters tab of the Parameters window to set up the identification method for outliers and extreme values and the maximum percentage of null values.

Calculating and reviewing statistics

After selecting the fields and metrics for data quality analysis, click the Calculate statistics button.

Now the Discrete and Continuous tabs become available, displaying all the calculated indicators for discrete and continuous fields, respectively.

Next to each indicator, you will find the Options button. The options it provides are sorting fields by the selected metric and removing some calculated indicators from the table (or adding them again).

Figure 3. Sorting and adding/removing columns

When you select a cell and click the Detailing button (on the right), an additional pane opens at the bottom, highlighting all the records from that cell.

When you select a cell and click the Histogram button (on the right), a pane with a detailed histogram opens on the right side of the main window. For discrete fields, you can configure the histogram to show only specific values:

Detected problems

On the Summary tab, the Problems column displays the total count of problematic rows for the given field.

The Kinds of problems column provides some details:

Nulls: The number of records containing missing values.
Empty: The number of records containing empty values.
Zeros: The number of records containing zero values.
Whitespaces: The number of records containing only whitespace characters (space, non-breaking space, tab, etc.).
Infinities: The number of records containing the value infinity.
Constant: Indicates that the field contains only one unique value, although there may be missing values in the field.
Unique: Indicates that every value in this field is unique.
Conversion error: Appears if the field contains values that cannot be converted to the field's data type.
Not calculated: Marks fields that do not contain calculated metrics.
No data: Appears when attempting to analyze an empty dataset.
Extremes: The number of records in which the value is extreme.
Outliers: The number of records in which the value is an outlier.
Negative: The number of records with negative values.
Trailing spaces: The number of records with values that end with a whitespace character (this metric does not include records consisting only of whitespace characters).

Next to each kind of problem in the column, you will find the assessment of rows containing that problem: as a percentage to the total number of rows and as a quantity (in parentheses).

Read on: Cube Visualizer

Data Quality Visualizer

Interface

Dataset statistics

Configuration

Available quality metrics:

Configuring specific indicators

Calculating and reviewing statistics

Detected problems

results matching ""

No results matching ""

Data Quality Visualizer

Interface

Toolbar

Dataset statistics

Configuration

Available quality metrics:

Configuring specific indicators

Calculating and reviewing statistics

Detected problems

results matching ""

No results matching ""