IS_DUPLICATE
The IS_DUPLICATE
function returns TRUE if it's not the first time a job sees the input value
in the data within the interval specified in the windowSize
.
The windowSize
interval must be equal to or greater than the RUN_INTERVAL
of the job, and be evenly divisible by the RUN_INTERVAL
value. For example, if the RUN_INTERVAL
is 5 minutes, the windowSize
can be set to 5 minutes, 10 minutes, 15 minutes, and so on.
If you want to check for duplicates only within the data processed by the current job execution, set the interval value to RUN_INTERVAL
.
The value
being checked for duplication can be any data type.
Syntax
IS_DUPLICATE(windowSize, value)
Arguments
windowSize
interval
The deduplication window size in minutes, hours, or the RUN_INTERVAL of the job.
value
any
Value to check for duplicates. This can be any data type.
Returns
Boolean
Example
Create a job and check for duplicates
The following example creates a job named transform_orders_and_check_duplicates, which copies new events from the default_glue_catalog.upsolver_samples.orders_raw_data staging table into the default_glue_catalog.upsolver_samples.orders_is_duplicate table. The RUN_INTERVAL
job option instructs Upsolver to run the job every minute:
CREATE SYNC JOB transform_orders_and_check_duplicates
START_FROM = BEGINNING
ADD_MISSING_COLUMNS = true
RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_is_duplicate
MAP_COLUMNS_BY_NAME
SELECT
orderid AS order_id,
MD5(customer.email) AS customer_id,
customer_name,
nettotal AS total,
IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique,
$event_time AS partition_date
FROM default_glue_catalog.upsolver_samples.orders_raw_data
-- create a computed column
LET customer_name = customer.firstname || ' ' || customer.lastname
WHERE ordertype = 'SHIPPING'
AND time_filter();
The SELECT
statement defines the list of columns that will be loaded into the target table. The list includes the IS_DUPLICATE
function to check for rows with a duplicate orderid. The INTERVAL
has been set to 1 HOUR, so any duplicate events that land within an hour of the first row arriving will return TRUE:
IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique
Last updated