IS_DUPLICATE
This page describes how to use the IS_DUPLICATE function within your transformation jobs.
The IS_DUPLICATE function returns TRUE if the job has already seen the input value in the data within the interval specified by windowSize.
The windowSize interval must be equal to or greater than the RUN_INTERVAL of the job, and be evenly divisible by the RUN_INTERVAL value. For example, if the RUN_INTERVAL is 5 minutes, the windowSize can be set to 5 minutes, 10 minutes, 15 minutes, and so on.
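The windowSize constraint above can be sketched as a small check. This is an illustrative Python model, not part of Upsolver; the function name is_valid_window_size is hypothetical and only restates the two rules (at least as large as RUN_INTERVAL, and a whole multiple of it).

```python
from datetime import timedelta


def is_valid_window_size(window_size: timedelta, run_interval: timedelta) -> bool:
    """Hypothetical helper: True if window_size satisfies both documented rules."""
    if run_interval <= timedelta(0):
        raise ValueError("run_interval must be positive")
    # Rule 1: the window is at least as long as the run interval.
    # Rule 2: the window is a whole multiple of the run interval.
    return window_size >= run_interval and window_size % run_interval == timedelta(0)


run_interval = timedelta(minutes=5)
for minutes in (5, 10, 15, 3, 12):
    window = timedelta(minutes=minutes)
    print(f"{minutes} minutes valid: {is_valid_window_size(window, run_interval)}")
```

With a 5-minute RUN_INTERVAL, the 5, 10, and 15 minute windows pass, while 3 minutes (too short) and 12 minutes (not a multiple) do not.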
If you want to check for duplicates only within the data processed by the current job execution, set the interval value to RUN_INTERVAL.
The value being checked for duplication can be any data type.
Syntax
Arguments
Name | Type | Description |
---|---|---|
windowSize | interval | The deduplication window size in minutes, hours, or the RUN_INTERVAL of the job. |
value | any | Value to check for duplicates. This can be any data type. |
Returns
Boolean
Examples
Create a job and check for duplicates
The following example creates a job named transform_orders_and_check_duplicates, which copies new events from the default_glue_catalog.upsolver_samples.orders_raw_data staging table into the default_glue_catalog.upsolver_samples.orders_is_duplicate table. The RUN_INTERVAL job option instructs Upsolver to run the job every minute:
Transformation job to load orders and check for duplicate events arriving within an hourly interval.
The SELECT statement defines the list of columns that will be loaded into the target table. The list includes the IS_DUPLICATE function to check for rows with a duplicate orderid. The INTERVAL has been set to 1 HOUR, so IS_DUPLICATE returns TRUE for any duplicate event that lands within an hour of the first row arriving.
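To make the one-hour window concrete, the behavior described above can be modeled outside Upsolver. The following Python sketch is a simplified illustration only: the function flag_duplicates, the sample order IDs, and the timestamps are all hypothetical, and it assumes events arrive sorted by time and that a value counts as a duplicate when its previous occurrence is strictly less than windowSize old. Upsolver's actual implementation and boundary handling may differ.

```python
from datetime import datetime, timedelta


def flag_duplicates(events, window_size):
    """Return (timestamp, value, is_duplicate) for events sorted by timestamp.

    Simplified model (assumption): a value is a duplicate if the same value
    was last seen less than window_size before the current event.
    """
    last_seen = {}  # value -> timestamp of its most recent occurrence
    results = []
    for ts, value in events:
        previous = last_seen.get(value)
        is_dup = previous is not None and ts - previous < window_size
        results.append((ts, value, is_dup))
        last_seen[value] = ts
    return results


# Hypothetical order events: (event time, orderid)
events = [
    (datetime(2024, 1, 1, 10, 0), "A"),
    (datetime(2024, 1, 1, 10, 20), "B"),
    (datetime(2024, 1, 1, 10, 45), "A"),  # 45 minutes after the first "A"
    (datetime(2024, 1, 1, 12, 30), "A"),  # more than an hour after any earlier "A"
    (datetime(2024, 1, 1, 12, 40), "B"),  # more than an hour after the first "B"
]

flags = flag_duplicates(events, timedelta(hours=1))
for ts, orderid, is_dup in flags:
    print(ts.strftime("%H:%M"), orderid, is_dup)
```

In this model only the 10:45 event for order A is flagged, because it is the only event whose orderid was seen within the preceding hour.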