Expectations

Learn how expectations can perform in-flight pre-processing to prevent bad data from reaching your target.

Data quality conditions can be added to your job to drop a row or trigger a warning when a column violates a predefined condition. Each expectation name should be unique to the job, and the predicate that determines whether the condition is violated must return a Boolean value. The SQL predicate can be any syntax supported in a WHERE clause, however, aggregates are not supported.

Each expectation has an action of either DROP or WARN. If you define the action as DROP, any row with a predicate that returns FALSE, is dropped from the ingestion stream and is not loaded into the target. Rows violating the condition defined in an expectation with a WARN action are loaded into the target.

To add an expectation when you create a new job, use the following syntax:

CREATE JOB <job_name> ...   
WITH EXPECTATION <expectation_name> EXPECT <sql_predicate> 
ON VIOLATION { DROP | WARN }

You can easily add an expectation to an existing job using the ALTER JOB command. The syntax below shows how to alter a job you have already created:

ALTER JOB <job_name> 
ADD EXPECTATION <expectation_name> 
  EXPECT <sql_predicate> ON VIOLATION { DROP | WARN }

To drop an expectation, use the ALTER JOB command using the following syntax:

ALTER JOB <job_name> 
DROP EXPECTATION <expectation_name>

When you drop an expectation, the entry is dropped from the system.monitoring.expectations system table, so if you want to retain a count of warnings or dropped rows, ensure you take a copy of these values prior to dropping the expectation. Furthermore, values are removed from the system.monitoring.jobs table.

Dropping a job will also drop the associated expectations, so you will no longer see the count of warnings or dropped rows in the system tables. You can take a backup of this data if you want to retain it for reporting purposes.

Monitoring data quality with expectations

system.monitoring.expectations

When a row fails to meet an expectation, the system.monitoring.expectations table is updated, and you can use the values in this table to report on the quality of your ingested data:

  • The triggered_today value is incremented each time an ingested row violates the condition, and the count shows all rows that violated the expectation since midnight (UTC time).

  • The total_triggered value is incremented each time an ingested row violates the condition, and the count shows all rows that violated the expectation since the job started.

  • The action column indicates the trigger type, either drop or warn, defined on the expectation.

Run the following query against the expectations system table to return the values for an expectation:

SELECT expectation_name, triggered_today, triggered_total, action 
FROM  system.monitoring.expectations 
WHERE expectation_name = '<expectation_name>';

You can also view the counts for all expectations defined in a job using the following query, simply by changing the WHERE clause:

SELECT expectation_name, triggered_today, triggered_total, action 
FROM  system.monitoring.expectations 
WHERE job_name = '<job_name>';

system.monitoring.jobs

Furthermore, you can view the sum of all expectations triggered since midnight (UTC time) in the system.monitoring.jobs table, which counts all triggered expectations defined in a job:

  • drop_expectations_triggered_today shows the number of DROP expectation violations today.

  • warn_expectations_triggered_today counts the total rows that were dropped from the ingestion stream today and were not loaded into the target because they failed to meet a condition.

  • rows_filtered_by_expectations_today the number of rows dropped due to expectation violations today.

A row can violate multiple DROP and WARN conditions. For example, if a row triggers two warn expectations, and two drop expectations, the drop_expectations_triggered_today column will be incremented by two, and the warn_expectations_triggered_today column will be incremented by two, resulting in four violations. However, the rows_filtered_by_expectations_today column will only be incremented by one.

Run the following statement to return a count of warnings and dropped rows for a job:

SELECT drop_expectations_triggered_today, 
       warn_expectations_triggered_today, 
       rows_filtered_by_expectations_today
FROM system.monitoring.jobs
WHERE job_name = '<job_name>';

Last updated