Data Quality Visualizer

This visualizer provides a comprehensive assessment of data quality for each field.

Interface

Toolbar

When you enter the visualizer, its toolbar provides the following tabs and buttons:

  • Summary Summary: This tab provides the dataset overview. When you first enter the visualizer, it will display the fields available for processing. After selecting them and calculating the necessary statistics, data quality indicators and detected problems will be shown here.
  • Discrete Discrete: Click this button to review the data quality indicators calculated for discrete fields.
  • Continuous Continuous: Click this button to obtain a data quality overview for continuous fields.
  • Parameters Parameters: Click this to open a window for the selection and configuration of data quality indicators:
    • Selection of indicators Selection of indicators: Select the metrics for data quality assessment.
    • Configure parameters Configure parameters: This is a tab for setting up outlier and extreme values detection methods. Here you can select one of two methods, change their settings, and set the maximum percentage of missing (null) values.
  • Absolute or relative values Absolute or relative values: This button switches between absolute and percent representation of quantitative values.
  • Export Export: Use this button to export a data quality report. Please note that only the current screen (e.g., Summary, Discrete, Continuous, etc.) will be exported. This command has a dropdown for choosing one of two available export formats:
    • Export to XLS (html)
    • Export to HTML The name of the format you select appears on the button.
  • Drill-down Detailing: This command (found on the left of the toolbar) is only available for the Discrete and Continuous tabs. It shows the input dataset filtered by the specified metric for the selected field. For example, selecting a Nulls cell will show all records from the input set that have missing values in the selected field.
  • Histogram Histogram: The histogram is only available for the Discrete and Continuous tabs. The button opens a pane on the left with a histogram for the selected field.

Dataset statistics

Displayed on the right-side of the Summary tab. This pane can be hidden by clicking the  Show/Hide icon on its left. To open it, click the icon again (it will be on the right side of the main window). The pane displays the following characteristics:

  • Determination of outliers and extreme values: Shows the identification method set in Configure parameters; the default method is Standard deviation.
  • Columns: The total number of columns in the dataset.
  • Rows: The total number of records in the dataset.
  • Fully filled fields: The percentage of fields that do not contain missing values.
  • Fully filled records: The percentage of records that do not contain missing values.
  • Suitable columns: The number of suitable columns. A column is considered unsuitable if:
    • the percentage of missing values is higher than the Maximum percentage of null data (as set in Configure parameters)
    • the Quality indicator (included in the Selection of indicators) of a discrete field is 0 (the field contains a constant)
  • EPV index: The ratio of the number of records to the number of columns.

Configuration

On opening the Data Quality visualizer, you will see the Summary tab listing all the dataset fields. Select the fields you need for processing by marking/unmarking their checkboxes. (By default, all fields are marked.) Note that the Discrete and Continuous tabs are initially disabled.

Selecting fields for processing
Figure 1. Selecting fields for processing

Available quality metrics:

Go to the Parameters tab to select the necessary metrics:

Metric Data Type Description
Quality indicator Reflects the diversity of values in the field
Histogram Histogram preview
Extremes The number of records containing extreme values
Empty The number of records with empty values
Trailing spaces The number of records with values that end with a whitespace (this metric does not include records consisting only of whitespace characters)
Range of values The value range in the format: first value...last value (for a sorted list of values)
Zeros The number of records where values are zero
Box plot Calculates the metrics: Minimum outlier, Lower quartile, Median, Upper quartile, Maximum outlier, Sample maximum
Minimum Minimum value
Mean Mean value
Mode The most frequently occurring value1
Nulls The number of records in which a value for the field is missing
Values Displays a chart showing the diversity of values across rows
Outliers The number of records in which the value is an outlier
Whitespaces The number of records with values consisting only of whitespace characters
String lengths The range of string lengths in the format: minimum length – maximum length
Negative The number of records with negative values
Infinities The number of records where values are equal to infinity
Unique The number of unique values
Maximum Maximum value
Median The median for the field
Monotony Shows the uniformity of field values (Decreasing, Increasing, or Non-monotonic)
1. If there are multiple mode values in the field, the first of these values is shown, and the  multimodality icon is displayed.

Configuring specific indicators

After selecting the necessary indicators, go to the Configure parameters tab of the Parameters window to set up the identification method for outliers and extreme values and the maximum percentage of null values.

Metric settings
Figure 2. Metric settings

Calculating and reviewing statistics

After selecting the fields and metrics for data quality analysis, click the  Calculate statistics button.

Now the Discrete and Continuous tabs become available, displaying all the calculated indicators for discrete and continuous fields, respectively.

Next to each indicator, you will find the  Options button. The options it provides are sorting fields by the selected metric and removing some calculated indicators from the table (or adding them again).

Sorting and adding/removing columns
Figure 3. Sorting and adding/removing columns

When you select a cell and click the Drill-down Detailing button (on the right), an additional pane opens at the bottom, highlighting all the records from that cell.

Cell drill-down
Figure 4. Cell drill-down

When you select a cell and click the Histogram Histogram button (on the right), a pane with a detailed histogram opens on the right side of the main window. For discrete fields, you can configure the histogram to show only specific values:

Possible values
Figure 5. Possible values

Detected problems

On the Summary tab, the Problems column displays the total count of problematic rows for the given field.

The Kinds of problems column provides some details:

  • Nulls: The number of records containing missing values.
  • Empty: The number of records containing empty values.
  • Zeros: The number of records containing zero values.
  • Whitespaces: The number of records containing only whitespace characters (space, non-breaking space, tab, etc.).
  • Infinities: The number of records containing the value infinity.
  • Constant: Indicates that the field contains only one unique value, although there may be missing values in the field.
  • Unique: Indicates that every value in this field is unique.
  • Conversion error: Appears if the field contains values that cannot be converted to the field's data type.
  • Not calculated: Marks fields that do not contain calculated metrics.
  • No data: Appears when attempting to analyze an empty dataset.
  • Extremes: The number of records in which the value is extreme.
  • Outliers: The number of records in which the value is an outlier.
  • Negative: The number of records with negative values.
  • Trailing spaces: The number of records with values that end with a whitespace character (this metric does not include records consisting only of whitespace characters).

Next to each kind of problem in the column, you will find the assessment of rows containing that problem: as a percentage to the total number of rows and as a quantity (in parentheses).

Types of problems
Figure 6. Types of problems

Read on: Cube Visualizer

results matching ""

    No results matching ""