Upserting Data — Amazon S3 to Amazon Athena
This how-to guide shows how you can use the MERGE function to upsert data.
Upserting combines the operations of updating and inserting rows. When you create a job to upsert data, the job performs one of two functions:
The job inserts a row into your target table with new information; or
The job updates an existing row with the new data.
Upsolver templates primarily use the INSERT function. However, another method of upserting is the MERGE function, which merges data from one data stream with another, reference data stream.
To read your data, you must first create an S3 connection. This connection enables you to configure the AWS IAM credentials that Upsolver needs to access the data.
Use your Access Key ID and your Secret Access Key to connect to your specific data lake.
Here’s the code:
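The following is a minimal sketch; the connection name is a placeholder, and the credential values must be replaced with your own:

```sql
-- Create an S3 connection using AWS IAM credentials.
-- The connection name and credential values are placeholders;
-- replace them with your own.
CREATE S3 CONNECTION my_s3_connection
    AWS_ACCESS_KEY_ID = '<your Access Key ID>'
    AWS_SECRET_ACCESS_KEY = '<your Secret Access Key>';
```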
When you deploy Upsolver, the cluster already has an IAM role attached that may or may not have permission to access the S3 path where your data resides. If it lacks those permissions, you can assign a scope-down policy to the S3 connection. Upsolver recommends this to prevent failures, misconfigurations, and breaches.
You must create a job that reads from the source S3 bucket and writes to a staging table.
Upsolver automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.
Here’s the code to create the staging table:
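This sketch assumes the catalog, schema, and table names from Upsolver's sample data; adjust them for your environment:

```sql
-- Create a staging table in the AWS Glue Data Catalog.
-- Leaving the column list empty lets Upsolver infer the schema
-- from the incoming data.
CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data()
    PARTITIONED BY $event_date;
```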
Here's the code to create a job that loads data from your data source into your staging table:
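A minimal sketch, assuming JSON source files under an s3://upsolver-samples/orders/ path and the connection and table names from the previous steps:

```sql
-- Copy raw data from S3 into the staging table.
-- Upsolver converts the files to Apache Parquet as it writes them.
CREATE SYNC JOB load_orders_raw_data_from_s3
    CONTENT_TYPE = JSON
AS COPY FROM S3 my_s3_connection
    LOCATION = 's3://upsolver-samples/orders/'
INTO default_glue_catalog.upsolver_samples.orders_raw_data;
```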
After your data is prepared for transformation, you can take one of two paths to refine it:
INSERT
MERGE
Using INSERT to transform your data involves three steps:
Create a table
Transform and INSERT data into the newly created table
Query the data
To upsert your data, your target table must have at least one primary key. Use this key to match rows between your target and source tables.
If your target table is an Upsolver-managed table, you can define a primary key as follows:
Here's the code:
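A sketch assuming a customer_email column as the primary key; the column name follows Upsolver's sample orders data:

```sql
-- Target table for the INSERT job.
-- The PRIMARY KEY is used to match incoming rows against existing rows,
-- so matching rows are updated rather than duplicated.
CREATE TABLE default_glue_catalog.upsolver_samples.orders_upsert_with_insert(
    customer_email STRING)
    PRIMARY KEY customer_email;
```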
You can now apply transformations to your data and write the results into your target table. To do this, create an INSERT job. In this scenario, the name of your target table is orders_upsert_with_insert; the job aggregates the data by customer and then filters it to include only repeat customers.
Here's the code:
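A sketch of the job, assuming the field names from Upsolver's sample orders data (customer.email, orderid, nettotal, orderdate):

```sql
-- Aggregate orders per customer and upsert the results into the target table.
CREATE SYNC JOB insert_orders_upsert_with_insert
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_upsert_with_insert
    MAP_COLUMNS_BY_NAME
    SELECT customer.email          AS customer_email,   -- primary key
           COUNT(DISTINCT orderid) AS number_of_orders,
           SUM(nettotal)           AS total_sales,
           MIN(orderdate)          AS first_purchase,
           MAX(orderdate)          AS last_purchase
    FROM default_glue_catalog.upsolver_samples.orders_raw_data
    WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
    GROUP BY 1
    HAVING COUNT(DISTINCT orderid) > 1;  -- keep only repeat customers
```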
Let's explain each new column:
number_of_orders: This column counts the distinct orders (orderid values) placed by each customer.
total_sales: This column takes the sum of the values in nettotal from the original data.
first_purchase: This column lists the date of each customer's first purchase, taken from the orderdate column in the original data.
last_purchase: This column lists the date of each customer's most recent purchase, also taken from the orderdate column.
By grouping your orders by 1 (the first selected column, the customer) and then specifying that the order count must be greater than 1, you filter out all non-repeat customers.
The final step is to query your data. Here's the code:
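A simple query to inspect the upserted rows; the LIMIT is illustrative:

```sql
SELECT *
FROM default_glue_catalog.upsolver_samples.orders_upsert_with_insert
LIMIT 10;
```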
Merging data provides flexibility with regard to when and how rows are updated, inserted, or deleted.
Using MERGE to transform your data involves three steps:
Create a table
Apply transformations and MERGE data into the newly created table
Query the data
To store your transformed data, you must create a table to hold it.
To upsert your data, your target table should have at least one primary key that is used to match rows between your target and source tables. If your target table is an Upsolver-managed table, you can define a primary key as follows:
Here's the code:
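As with the INSERT path, this sketch assumes customer_email as the primary key:

```sql
-- Target table for the MERGE job.
CREATE TABLE default_glue_catalog.upsolver_samples.orders_upsert_with_merge(
    customer_email STRING)
    PRIMARY KEY customer_email;
```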
You can now apply transformations to your data and write the results into your target table by creating a MERGE job. In this scenario, the name of your target table is orders_upsert_with_merge; the job aggregates the data by customer and then filters it so that the table includes only repeat customers.
Here's the code:
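A sketch of the MERGE job, under the same assumptions about the sample field names:

```sql
-- Merge aggregated orders into the target table: replace matched rows,
-- insert unmatched ones.
CREATE SYNC JOB merge_orders_upsert_with_merge
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
AS MERGE INTO default_glue_catalog.upsolver_samples.orders_upsert_with_merge AS target
USING (
    SELECT customer.email          AS customer_email,
           COUNT(DISTINCT orderid) AS number_of_orders,
           SUM(nettotal)           AS total_sales,
           MIN(orderdate)          AS first_purchase,
           MAX(orderdate)          AS last_purchase
    FROM default_glue_catalog.upsolver_samples.orders_raw_data
    WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
    GROUP BY 1
    HAVING COUNT(DISTINCT orderid) > 1
) AS source
ON target.customer_email = source.customer_email    -- match on the primary key
WHEN MATCHED THEN REPLACE                           -- update existing rows
WHEN NOT MATCHED THEN INSERT MAP_COLUMNS_BY_NAME;   -- insert new rows
```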
Let's explain each new column.
number_of_orders: This column counts the distinct orders (orderid values) placed by each customer.
total_sales: This column takes the sum of the values in nettotal from the original data.
first_purchase: This column lists the date of each customer's first purchase, taken from the orderdate column in the original data.
last_purchase: This column lists the date of each customer's most recent purchase, also taken from the orderdate column.
By grouping your orders by 1 (the first selected column, the customer) and then specifying that the order count must be greater than 1, you filter out all non-repeat customers from the target table.
The condition in the ON clause tells Upsolver which columns to match between the target and source tables. If you omit the ON clause, a natural join is performed on all of the primary key and partition columns. Otherwise, there must be a column_condition for each primary key of the target table.
The final step is to query your data.
Here's the code:
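As before, a simple inspection query:

```sql
SELECT *
FROM default_glue_catalog.upsolver_samples.orders_upsert_with_merge
LIMIT 10;
```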
This guide explained how to take raw data from Amazon S3 and refine it using both INSERT and MERGE jobs. Using only familiar SQL syntax, you can use Upsolver to create data pipelines and organize your data to easily perform analytics and ML.
As your business needs evolve, so can your data. In the future, you can create additional jobs that use the same staging table as their source, while your pipelines keep your data fresh indefinitely.
To launch the template:
Launch Upsolver by navigating to https://db.upsolver.com.
Sign up by entering your email address and setting a password.
When you have finished clicking through the onboarding panel, a panel displaying a list of templates appears.
Click the Upserting rows (sample data) template.