IS_DUPLICATE

The IS_DUPLICATE function returns TRUE if it's not the first time a job sees the input value in the data within the interval specified in the windowSize.

The windowSize interval must be equal to or greater than the RUN_INTERVAL of the job, and be evenly divisible by the RUN_INTERVAL value. For example, if the RUN_INTERVAL is 5 minutes, the windowSize can be set to 5 minutes, 10 minutes, 15 minutes, and so on.

If you want to check for duplicates only within the data processed by the current job execution, set the interval value to RUN_INTERVAL.

The value being checked for duplication can be any data type.

Syntax

IS_DUPLICATE(windowSize, value)

Arguments

NameTypeDescription

windowSize

interval

The deduplication window size in minutes, hours, or the RUN_INTERVAL of the job.

value

any

Value to check for duplicates. This can be any data type.

Returns

Boolean

Example

Create a job and check for duplicates

The following example creates a job named transform_orders_and_check_duplicates, which copies new events from the default_glue_catalog.upsolver_samples.orders_raw_data staging table into the default_glue_catalog.upsolver_samples.orders_is_duplicate table. The RUN_INTERVAL job option instructs Upsolver to run the job every minute:

CREATE SYNC JOB transform_orders_and_check_duplicates
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = true	
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_is_duplicate 
  MAP_COLUMNS_BY_NAME
    SELECT 
      orderid AS order_id, 
      MD5(customer.email) AS customer_id, 
      customer_name, 
      nettotal AS total, 
      IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique,
      $event_time AS partition_date 
    FROM default_glue_catalog.upsolver_samples.orders_raw_data
    -- create a computed column
    LET customer_name = customer.firstname || ' ' || customer.lastname 
    WHERE ordertype = 'SHIPPING' 
    AND time_filter();

The SELECT statement defines the list of columns that will be loaded into the target table. The list includes the IS_DUPLICATE function to check for rows with a duplicate orderid. The INTERVAL has been set to 1 HOUR, so any duplicate events that land within an hour of the first row arriving will return TRUE:

IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique

Last updated