Upsert Data to a Target Table

This quickstart article describes how to use INSERT and MERGE to update a row or insert a new row in your target table.

Upserting is a term we use to describe row transformation. When you create a job to upsert data, the job performs one of two functions:

Inserts a row in your target table with new data
Updates an existing row with the new data

You can us the INSERT or MERGE functions to upsert data. Examples of both are below.

Upsert using INSERT

To upsert data, your target table should have at least one primary key to match the rows between your source and target tables.

If your target table is an Upsolver-managed table, you can define a primary key as follows:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_upsert_with_insert
    (customer_email string)
PRIMARY KEY customer_email;

You can then insert your data into your target table. To do this, run the following job:

CREATE JOB insert_orders_upsert
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_upsert_with_insert 
    MAP_COLUMNS_BY_NAME

This job assumes you have your source data in a staging table my_catalog.my_schema.source_table. Furthermore, it assumes this table contains a primary key column that matches the primary key column of your target table.

Having a primary key in the target table means that if a new row arrives that shares a primary key value with an existing row, the old row is updated; otherwise, Upsolver inserts the new row into the table.

Finally, use a SELECT statement to apply transformations to your data:

SELECT customer.email AS customer_email, 
       COUNT(DISTINCT orderid) AS number_of_orders,
       SUM(nettotal) AS total_sales,
       MIN(orderdate) AS first_purchase,
       MAX(orderdate) AS last_purchase
FROM default_glue_catalog.upsolver_samples.orders_raw_data
WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
GROUP BY 1
HAVING COUNT(DISTINCT orderid::string) > 1;

Query your data with the following statement.

SELECT * 
FROM default_glue_catalog.upsolver_samples.orders_upsert_with_insert 
LIMIT 10;

Upsert using MERGE

As with the INSERT command, to upsert your data your target table should have at least one primary key to match the rows between your source and target tables.

Also as before, if your target table is an Upsolver-managed table, you can define a primary key as follows:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_upsert_with_merge
    (customer_email string)
PRIMARY KEY customer_email;

You can then merge your data into your target table by running the following job:

CREATE JOB merge_orders_upsert
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
AS MERGE INTO 
    default_glue_catalog.upsolver_samples.orders_upsert_with_merge AS target

Finally, apply transformations to your data using a SELECT statement.

USING (
	SELECT customer.email AS customer_email, 
		COUNT(DISTINCT orderid) AS number_of_orders,
		SUM(nettotal) AS total_sales,
	   	MIN(orderdate) AS first_purchase,
		MAX(orderdate) AS last_purchase
	FROM default_glue_catalog.upsolver_samples.orders_raw_data
	WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
	GROUP BY 1
	HAVING COUNT(DISTINCT orderid::string) > 1
) source
ON (target.customer_email = source.customer_email)
-- Update if primary keys match
WHEN MATCHED THEN REPLACE 
-- Insert if primary key is unique (new record)
WHEN NOT MATCHED THEN INSERT MAP_COLUMNS_BY_NAME;

Query your data with the following statement.

SELECT * 
FROM default_glue_catalog.upsolver_samples.orders_upsert_with_merge 
LIMIT 10;

Last updated 1 year ago