Column

To drill into the column-level statistics, click on the column name in the Datasets entity tree, or click on the linked column name in the Ingested Data Statistics table in the Schema tab.

Data Type

This column type that was automatically inferred by Upsolver during ingestion, and is useful for comparing source and target schemas to ensure data is stored in the correct and expected type.


Overview

The following information is provided for your selected column:

MeasurementDescription

Total Values

Total number of non NULL values. Values written before the METADATA_RETENTION period will not be counted.

Distinct Values

The count of distinct values within the column. Vlaues written before the METADATA_RETENTION period will not be counted.

Density in Rows

The percentage of rows with a non-null values for this column out of the total number of rows written.

Density in Parent

Shows the percentage of non-null child values compared to non-null parent records. Example: If order.item.id appears 10 times within order.item that appears 15 times, the density is 66.7%.

First Seen

The first date and time that a non-null value was seen in this column.

Last Seen

The last date and time that a non-null value was seen in this column.

Min Value

The lowest value in the column in the dataset, available for string, date, and numerical data.

Max Value

The highest value in the column in the dataset, available for string, date, and numerical data.

Written Rows Over Time

The Written Rows Over Time chart will help you become familiar with the volumes of data ingested, enabling you to determine an expected baseline. This visual displays the number of rows ingested over the selected timespan, enabling you to quickly discover spikes or drops in your data to troubleshoot unexpected volumes.

By default, the chart shows written rows from the initial ingestion through to now., however you can change this. Click on the Time View button, which displays Lifetime by default, to change the timespan displayed in the dataset report:


Top Values

The Top Values card displays the top 10K distinct values in the selected column and the percentage of rows for each value that appears within the dataset. This is particularly helpful for columns written to dimensions, as the values here should match the values in your analytics target:

You can download the results in the Top Value card to a column inspections CSV file to further investigate your data. Click on the Download icon in the top right-hand corner of the card to download your file. The file includes the list of distinct column values, the number of times each value appears in the dataset, and the percentage breakdown representing the density of the value within the column.


Distribution

The column data type determines the label of the distribution card, either String Length Distribution for a string type column, or Values Distribution for date and numeric types.

If you discover data that frequently falls outside of your remit, you can use expectations in your job to warn of errant rows, or eliminate them from your target dataset, thereby preventing bad data from polluting downstream analytics data.

String Length Distribution

The Distribution card enables you to easily discover anomalies or incorrect string data, for example if you are expecting a fixed-width character length, or values to be within a range.

The Min and Max stats provide real-time visibility into the strings in your columns. If you have a column whereby the minimum string length should be 13, such as a barcode field, you can use this card to check ingested values.

For example, the following chart displays a visual representation of the string length distribution for the values within an address1 column:

The chart instantly alerts us to any strings that may be too long or too short, indicating an an issue in our data.

Values Distribution

The Values Distribution chart makes it easy to visualize your data and find anomalies and outliers. It is easy to see the spread of values across the range within your dataset and also discover the Min and Max values within the column.

In the following example, we can see that the minimum value for the nettotal column is 1,331.02, and the maximum value in 4,457.55. However, 48.8K rows have a value of 0. As this is the nettotal value, we may need to investigate further to check if this is correct. It might be that we have missing data, or simply that we have a high number of customers in an open shopping session that have not added anything to their basket yet. If we don't want rows with a 0 value for nettotal to reach our data warehouse, we can add an expectation to the job to filter out these rows:

Last updated