Links
Comment on page

IS_DUPLICATE

This page describes how to use the IS_DUPLICATE function within your transformation jobs.
The IS_DUPLICATE function returns TRUE if it's not the first time a job sees the input value in the data within the interval specified in the windowSize.
The windowSize interval must be equal to or greater than the RUN_INTERVAL of the job, and be evenly divisible by the RUN_INTERVAL value. For example, if the RUN_INTERVAL is 5 minutes, the windowSize can be set to 5 minutes, 10 minutes, 15 minutes, and so on.
If you want to check for duplicates only within the data processed by the current job execution, set the interval value to RUN_INTERVAL.
The value being checked for duplication can be any data type.

Syntax

IS_DUPLICATE(windowSize, value)

Arguments

Name
Type
Description
windowSize
interval
The deduplication window size in minutes, hours, or the RUN_INTERVAL of the job.
value
any
Value to check for duplicates. This can be any data type.

Returns

Boolean

Examples

Create a job and check for duplicates

The following example creates a job named transform_orders_and_check_duplicates, which copies new events from the default_glue_catalog.upsolver_samples.orders_raw_data staging table into the default_glue_catalog.upsolver_samples.orders_is_duplicate table. The RUN_INTERVAL job option instructs Upsolver to run the job every minute:
CREATE SYNC JOB transform_orders_and_check_duplicates
START_FROM = BEGINNING
ADD_MISSING_COLUMNS = true
RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_is_duplicate MAP_COLUMNS_BY_NAME
SELECT
orderid AS order_id,
MD5(customer.email) AS customer_id,
customer_name,
nettotal AS total,
IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique,
$event_time AS partition_date
FROM default_glue_catalog.upsolver_samples.orders_raw_data
LET customer_name = customer.firstname || ' ' || customer.lastname -- create a computed column
WHERE ordertype = 'SHIPPING'
AND time_filter();
Transformation job to load orders and check for duplicate events arriving within an hourly interval.
The SELECT statement defines the list of columns that will be loaded into the target table. The list includes the IS_DUPLICATE function to check for rows with a duplicate orderid. The INTERVAL has been set to 1 HOUR, so any duplicate events that land within an hour of the first row arriving will return TRUE:
IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique
Last modified 2mo ago