IS_DUPLICATE
The IS_DUPLICATE function returns TRUE if the job has already seen the input value in the data within the interval specified by windowSize.
The windowSize interval must be equal to or greater than the RUN_INTERVAL of the job, and must be evenly divisible by the RUN_INTERVAL value. For example, if the RUN_INTERVAL is 5 minutes, the windowSize can be set to 5 minutes, 10 minutes, 15 minutes, and so on.
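The two constraints on windowSize can be sketched as a small check. This is an illustrative Python model, not Upsolver code: the function name is_valid_window_size and its parameters are hypothetical, and it assumes both intervals are expressed as plain durations.

```python
from datetime import timedelta


def is_valid_window_size(window_size: timedelta, run_interval: timedelta) -> bool:
    """Hypothetical helper: check the two windowSize rules described above.

    Rule 1: window_size must be equal to or greater than run_interval.
    Rule 2: window_size must be a whole multiple of run_interval.
    """
    if run_interval <= timedelta(0):
        raise ValueError("run_interval must be a positive duration")
    # timedelta % timedelta yields the remainder as a timedelta.
    is_multiple = window_size % run_interval == timedelta(0)
    return window_size >= run_interval and is_multiple


# With a 5-minute RUN_INTERVAL, windows of 5, 10, and 15 minutes pass,
# while 7 minutes (not a multiple) and 3 minutes (too small) do not.
run_interval = timedelta(minutes=5)
for minutes in (5, 10, 15, 7, 3):
    print(minutes, is_valid_window_size(timedelta(minutes=minutes), run_interval))
```
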
If you want to check for duplicates only within the data processed by the current job execution, set windowSize to RUN_INTERVAL.
The value being checked for duplication can be any data type.
Syntax
Arguments
Name | Type | Description |
---|---|---|
windowSize | interval | The deduplication window size in minutes, hours, or the RUN_INTERVAL of the job. |
value | any | Value to check for duplicates. This can be any data type. |
Returns
Boolean
Example
Create a job and check for duplicates
The following example creates a job named transform_orders_and_check_duplicates, which copies new events from the default_glue_catalog.upsolver_samples.orders_raw_data staging table into the default_glue_catalog.upsolver_samples.orders_is_duplicate table. The RUN_INTERVAL job option instructs Upsolver to run the job every minute.
The SELECT statement defines the list of columns that will be loaded into the target table. The list includes the IS_DUPLICATE function to check for rows with a duplicate orderid. The INTERVAL has been set to 1 HOUR, so the function returns TRUE for any duplicate event that lands within an hour of the first row arriving.
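To make the one-hour window concrete, here is a rough Python model of the behavior described above. It is a sketch under stated assumptions, not Upsolver's implementation: the function flag_duplicates and the sample order IDs are hypothetical, and it assumes a value is a duplicate when the same value first arrived less than windowSize earlier, after which the value is treated as new again.

```python
from datetime import datetime, timedelta


def flag_duplicates(events, window_size):
    """Hypothetical model: return (timestamp, value, is_duplicate) per event.

    Events must be given in arrival order as (timestamp, value) pairs.
    Assumption: a value is a duplicate if the same value was first seen
    less than window_size before the current event.
    """
    first_seen = {}  # value -> timestamp that opened its current window
    flagged = []
    for timestamp, value in events:
        seen_at = first_seen.get(value)
        is_duplicate = seen_at is not None and timestamp - seen_at < window_size
        if not is_duplicate:
            # First sighting, or the earlier window has expired: open a new one.
            first_seen[value] = timestamp
        flagged.append((timestamp, value, is_duplicate))
    return flagged


start = datetime(2024, 1, 1, 12, 0)
events = [
    (start, "order-1"),
    (start + timedelta(minutes=20), "order-2"),
    (start + timedelta(minutes=45), "order-1"),  # 45 minutes after the first row
    (start + timedelta(minutes=90), "order-1"),  # 90 minutes after the first row
]
for timestamp, order_id, is_duplicate in flag_duplicates(events, timedelta(hours=1)):
    print(timestamp.time(), order_id, is_duplicate)
```

In this model only the third event is flagged, because it arrives within an hour of the first order-1 row. The fourth arrives after the window has passed and is treated as a first sighting.
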