Comment on page
IS_DUPLICATE
This page describes how to use the IS_DUPLICATE function within your transformation jobs.
The
IS_DUPLICATE
function returns TRUE if it's not the first time a job sees the input value
in the data within the interval specified in the windowSize
.The
windowSize
interval must be equal to or greater than the RUN_INTERVAL
of the job, and be evenly divisible by the RUN_INTERVAL
value. For example, if the RUN_INTERVAL
is 5 minutes, the windowSize
can be set to 5 minutes, 10 minutes, 15 minutes, and so on.If you want to check for duplicates only within the data processed by the current job execution, set the interval value to
RUN_INTERVAL
.The
value
being checked for duplication can be any data type.IS_DUPLICATE(windowSize, value)
Name | Type | Description |
---|---|---|
windowSize | interval | The deduplication window size in minutes, hours, or the RUN_INTERVAL of the job. |
value | any | Value to check for duplicates. This can be any data type. |
Boolean
The following example creates a job named transform_orders_and_check_duplicates, which copies new events from the default_glue_catalog.upsolver_samples.orders_raw_data staging table into the default_glue_catalog.upsolver_samples.orders_is_duplicate table. The
RUN_INTERVAL
job option instructs Upsolver to run the job every minute:CREATE SYNC JOB transform_orders_and_check_duplicates
START_FROM = BEGINNING
ADD_MISSING_COLUMNS = true
RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_is_duplicate MAP_COLUMNS_BY_NAME
SELECT
orderid AS order_id,
MD5(customer.email) AS customer_id,
customer_name,
nettotal AS total,
IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique,
$event_time AS partition_date
FROM default_glue_catalog.upsolver_samples.orders_raw_data
LET customer_name = customer.firstname || ' ' || customer.lastname -- create a computed column
WHERE ordertype = 'SHIPPING'
AND time_filter();
Transformation job to load orders and check for duplicate events arriving within an hourly interval.
The
SELECT
statement defines the list of columns that will be loaded into the target table. The list includes the IS_DUPLICATE
function to check for rows with a duplicate orderid. The INTERVAL
has been set to 1 HOUR, so any duplicate events that land within an hour of the first row arriving will return TRUE:IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique
Last modified 2mo ago