Upserting data in SQLake — Amazon S3 to Amazon Athena
This how-to guide shows how you can use the MERGE function to 'upsert' data.
Upserting is a term we use to describe row transformation. When you create a job to upsert data, you perform one of two functions:
the job inserts a row in your target table with new information; or
the job updates an existing row with the new data
SQLake templates primarily use the INSERT
function. However, another method of upserting is the MERGE function. In the MERGE
function you merge data from one data stream with another reference data stream.
Before you begin
Create a connection to your data source in S3
To join your data, you must first create an S3 connection. This connection enables you to configure the AWS IAM credentials that SQLake needs to access the data.
Use your Access Key ID and your Secret Access Key to connect to your specific data lake.
Here’s the code:
When you deploy SQLake, the cluster will already have IAM role attached that may or may not have permission to access the S3 path where your data resides. In the event that it lacks those permissions, you can assign a scope-down policy to the S3 connection. Upsolver recommends this to prevent failures, misconfigurations, and breaches.
Create a staging table to ingest your data
You must create a job that reads from the source S3 bucket and writes to a staging table.
SQLake automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.
Here’s the code to create the staging table:
Here's the code to create a job to load data from the your data source into your staging table:
After your data is prepared for transformation, you can take one of two paths to refine it:
INSERT
MERGE
Use INSERT
to transform your data
INSERT
to transform your dataUsing INSERT to transform your data involves 4 steps:
Create a table
Transform and INSERT data into the newly-created table
Query the data
Step 1 - Create a table to hold your data
To upsert your data, your target table must have at least one primary key. Use this key to match rows between your target and source tables.
If your target table is an Upsolver-managed table, you can define a primary key as follows:
Here's the code:
Step 2 - Apply transformations and insert data into your table
You can now apply transformations to your data and write the results into your target table.
To do this, you can create an INSERT
job. In this scenario, the name of your target table is "default_glue_catalog.upsolver_samples.orders_upsert_with_insert
" and we aggregate the data by customer and then filter the data to include only repeat customers.
Here's the code:
Let's explain each new column:
number_of_orders
: This column totals the number of orders made by each distinct customer, ororderid
.total_sales
: This column takes the sum of the values innettotal
from the original data.first_purchase
: This column lists the date of each customer's initial purchase by filtering through theorderdate
column in the original data.last_purchase
: This column lists the date of each customer's last purchase by filtering through theorderdate
column in the original data.
By grouping our orders by 1 and then specifying that the count must be greater than 1, you filter out all non-repeat customers.
Step 3 - Query the data
Here's the code:
Use MERGE
to transform your data
MERGE
to transform your dataMerging data provides flexibility with regard to when and how rows are updated, inserted, or deleted.
Using MERGE to transform your data involves 4 steps:
Create a table
Apply transformations and MERGE data into the newly-created table
Query the data
Step 1 - Create a table to hold your data
To use your data, you must create a table to hold it.
To upsert your data, your target table should have at least one primary key that is used to match rows between your target and source tables. If your target table is an Upsolver-managed table, you can define a primary key as follows:
Here's the code:
Step 2 - Apply transformations and MERGE data into your table
You can now apply transformations on your data and write the results into your target table by creating a MERGE job.
In this scenario, the name of your target table is "default_glue_catalog.upsolver_samples.orders_upsert_with_merge
" and we aggregate the data by customer and then filter the data so that the table includes only repeat customers.
Here's the code:
Let's explain each new column.
number_of_orders
: This column totals the number of orders made by each distinct customer, ororderid
.total_sales
: This column takes the sum of the values innettotal
from the original data.first_purchase
: This column lists the date of each customer's initial purchase by filtering through theorderdate
column in the original data.last_purchase
: This column lists the date of each customer's last purchase by filtering through theorderdate
column in the original data.
By grouping your orders by 1 and then specifying that the count must be greater than 1, you filter out all non-repeat customers from the target table.
The condition in the ON
statement tells SQLake which columns to match between the target and source tables. If you omit the ON
statement, there should be a natural join for all of the primary key and partition columns. Otherwise, there must be a column_condition
for each primary key of the target table.
Step 3 - Query the data
The final step is to query your data.
Here's the code:
Conclusion
This guide explained how to take raw data from S3 and refine it using both INSERT
and MERGE
jobs.
Using only familiar SQL syntax, you can use SQLake to create data pipelines and organize your data to more easily perform analytics and ML.
As your business needs evolve, so can your data. In the future, you can create additional jobs that use the same staging table as the source of creativity and innovation, while your pipelines indefinitely keep your data fresh.
Try it yourself
To launch the template:
Launch SQLake by navigating to https://db.upsolver.com
Sign up by entering your email address and setting a password.
When you have finished clicking through the onboarding panel, a panel with a list of templates displays.
Click the Upserting rows (sample data) template.
Last updated