IS_DUPLICATE

This page describes how to use the IS_DUPLICATE function within your transformation jobs.

The IS_DUPLICATE function returns TRUE if it's not the first time a job sees the input value in the data within the interval specified in the windowSize.

The windowSize interval must be equal to or greater than the RUN_INTERVAL of the job, and be evenly divisible by the RUN_INTERVAL value. For example, if the RUN_INTERVAL is 5 minutes, the windowSize can be set to 5 minutes, 10 minutes, 15 minutes, and so on.

If you want to check for duplicates only within the data processed by the current job execution, set the interval value to RUN_INTERVAL.

The value being checked for duplication can be any data type.

Syntax

IS_DUPLICATE(windowSize, value)

Arguments

NameTypeDescription

windowSize

interval

The deduplication window size in minutes, hours, or the RUN_INTERVAL of the job.

value

any

Value to check for duplicates. This can be any data type.

Returns

Boolean

Examples

Create a job and check for duplicates

The following example creates a job named transform_orders_and_check_duplicates, which copies new events from the default_glue_catalog.upsolver_samples.orders_raw_data staging table into the default_glue_catalog.upsolver_samples.orders_is_duplicate table. The RUN_INTERVAL job option instructs Upsolver to run the job every minute:

CREATE SYNC JOB transform_orders_and_check_duplicates
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = true	
    RUN_INTERVAL = 1 MINUTE
    AS INSERT INTO default_glue_catalog.upsolver_samples.orders_is_duplicate MAP_COLUMNS_BY_NAME
    SELECT 
      orderid AS order_id, 
      MD5(customer.email) AS customer_id, 
      customer_name, 
      nettotal AS total, 
      IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique,
      $event_time AS partition_date 
    FROM default_glue_catalog.upsolver_samples.orders_raw_data
    LET customer_name = customer.firstname || ' ' || customer.lastname -- create a computed column
    WHERE ordertype = 'SHIPPING' 
    AND time_filter();

Transformation job to load orders and check for duplicate events arriving within an hourly interval.

The SELECT statement defines the list of columns that will be loaded into the target table. The list includes the IS_DUPLICATE function to check for rows with a duplicate orderid. The INTERVAL has been set to 1 HOUR, so any duplicate events that land within an hour of the first row arriving will return TRUE:

IS_DUPLICATE(INTERVAL '1' HOUR, orderid) AS is_unique

Last updated