Expectations
Learn how expectations can perform in-flight pre-processing to prevent bad data from reaching your target.
Data quality conditions can be added to your job to drop a row or trigger a warning when a column violates a predefined condition. Each expectation name should be unique to the job, and the predicate that determines whether the condition is violated must return a Boolean value. The SQL predicate can be any syntax supported in a WHERE
clause, however, aggregates are not supported.
Each expectation has an action of either DROP
or WARN
. If you define the action as DROP
, any row with a predicate that returns FALSE
, is dropped from the ingestion stream and is not loaded into the target. Rows violating the condition defined in an expectation with a WARN
action are loaded into the target.
To add an expectation when you create a new job, use the following syntax:
You can easily add an expectation to an existing job using the ALTER JOB
command. The syntax below shows how to alter a job you have already created:
To drop an expectation, use the ALTER JOB
command using the following syntax:
When you drop an expectation, the entry is dropped from the system.monitoring.expectations
system table, so if you want to retain a count of warnings or dropped rows, ensure you take a copy of these values prior to dropping the expectation. Furthermore, values are removed from the system.monitoring.jobs
table.
Dropping a job will also drop the associated expectations, so you will no longer see the count of warnings or dropped rows in the system tables. You can take a backup of this data if you want to retain it for reporting purposes.
Monitoring data quality with expectations
system.monitoring.expectations
When a row fails to meet an expectation, the system.monitoring.expectations
table is updated, and you can use the values in this table to report on the quality of your ingested data:
The
triggered_today
value is incremented each time an ingested row violates the condition, and the count shows all rows that violated the expectation since midnight (UTC time).The
total_triggered
value is incremented each time an ingested row violates the condition, and the count shows all rows that violated the expectation since the job started.The
action
column indicates the trigger type, either drop or warn, defined on the expectation.
Run the following query against the expectations system table to return the values for an expectation:
You can also view the counts for all expectations defined in a job using the following query, simply by changing the WHERE
clause:
system.monitoring.jobs
Furthermore, you can view the sum of all expectations triggered since midnight (UTC time) in the system.monitoring.jobs
table, which counts all triggered expectations defined in a job:
drop_expectations_triggered_today
shows the number ofDROP
expectation violations today.warn_expectations_triggered_today
counts the total rows that were dropped from the ingestion stream today and were not loaded into the target because they failed to meet a condition.rows_filtered_by_expectations_today
the number of rows dropped due to expectation violations today.
A row can violate multiple DROP
and WARN
conditions. For example, if a row triggers two warn expectations, and two drop expectations, the drop_expectations_triggered_today column will be incremented by two, and the warn_expectations_triggered_today column will be incremented by two, resulting in four violations. However, the rows_filtered_by_expectations_today column will only be incremented by one.
Run the following statement to return a count of warnings and dropped rows for a job:
Last updated