Joining Data — Amazon S3 to Amazon Athena
This guide explains how to join data from two different tables, enabling you to query your data from Amazon Athena in Upsolver.
As you use Upsolver to transform your data, you may need to combine data from multiple places — for example, data from different tables or even from different sources — to perform a particular analysis.
Using sample orders and sales data, this guide explains how to join two Amazon S3 data sources and output the joined data in an Amazon Athena table.
There are six steps to achieve this:
Create a connection to your data sources in S3
Create staging tables to ingest your data
Load data into your staging tables
Use a materialized view to collect the last sales interaction with the order
Create a target table for your joined data
Join orders with the last store employee who dealt with the order
When you first deploy Upsolver in the customer VPC, you create an Identity and Access Management (IAM) role that gives you access to any AWS resource you might need to build data pipelines. See AWS Role Permissions for more information.
In this guide, the cluster has permission to read and write to the AWS Glue Data Catalog and to S3.
Create an S3 connection. This connection enables you to configure the AWS IAM credentials that Upsolver needs to access the data.
Use your Access Key ID and your Secret Access Key to connect to your specific data lake.
Here’s the code:
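A minimal sketch of such a connection, assuming Upsolver's SQLake syntax; the connection name and credential values below are placeholders you replace with your own:

```sql
-- Sketch: create an S3 connection authenticated with access keys.
-- 'my_s3_connection' and the credential values are placeholders.
CREATE S3 CONNECTION my_s3_connection
  AWS_ACCESS_KEY_ID = '<your access key id>'
  AWS_SECRET_ACCESS_KEY = '<your secret access key>';
```

Once created, the connection is shared across your organization, so other users can build jobs against the same S3 data without re-entering credentials.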
When you deploy Upsolver, the cluster already has an IAM role attached that may or may not have permission to access the S3 path where your data resides. If it lacks those permissions, you can assign a scope-down policy to the S3 connection. Upsolver recommends this to prevent failures and misconfigurations and to limit access to your data.
Create a job that reads from the source S3 bucket and writes to a staging table. Upsolver automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.
In this scenario you create individual staging tables for both data sources.
Here’s the code for the first staging table:
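A sketch of the first staging table, assuming SQLake syntax; the catalog, schema, and table names follow Upsolver's sample environment and are placeholders. The empty column list lets Upsolver infer the schema from the incoming data:

```sql
-- Sketch: staging table for the orders data.
-- Columns are intentionally omitted so Upsolver infers the schema on ingest.
CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data()
  PARTITIONED BY $event_date;
```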
Repeat this step for the second data source.
Here’s the code for the second staging table:
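A matching sketch for the second staging table, under the same assumptions (SQLake syntax, placeholder names, schema inferred on ingest):

```sql
-- Sketch: staging table for the sales interactions data.
CREATE TABLE default_glue_catalog.upsolver_samples.sales_info_raw_data()
  PARTITIONED BY $event_date;
```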
Create a job to load data from the first data source into your staging table:
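A hedged sketch of such an ingestion job in SQLake syntax; the job name, bucket, and prefix are illustrative placeholders, and the source is assumed to be JSON:

```sql
-- Sketch: copy orders data from S3 into the first staging table.
-- Bucket and prefix are placeholders for your own source location.
CREATE SYNC JOB load_orders_raw_data_from_s3
  CONTENT_TYPE = JSON
AS COPY FROM S3 my_s3_connection
  BUCKET = '<your bucket>'
  PREFIX = 'orders/'
INTO default_glue_catalog.upsolver_samples.orders_raw_data;
```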
Create a job to load data from the second data source into your staging table:
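The second ingestion job follows the same pattern; again the names and S3 location are placeholders:

```sql
-- Sketch: copy sales interaction data from S3 into the second staging table.
CREATE SYNC JOB load_sales_info_raw_data_from_s3
  CONTENT_TYPE = JSON
AS COPY FROM S3 my_s3_connection
  BUCKET = '<your bucket>'
  PREFIX = 'sales_info/'
INTO default_glue_catalog.upsolver_samples.sales_info_raw_data;
```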
To validate that both of your tables contain data, use SELECT statements to view a few rows (it may take a few minutes for the data to display at first).
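For example, assuming the staging table names used above:

```sql
-- Sketch: spot-check a few rows in each staging table.
SELECT * FROM default_glue_catalog.upsolver_samples.orders_raw_data LIMIT 10;
SELECT * FROM default_glue_catalog.upsolver_samples.sales_info_raw_data LIMIT 10;
```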
Create a MATERIALIZED VIEW to collect the last sales interaction with the order

To join two tables in Upsolver, you must turn one of them into a materialized view that maps a set of key columns to aggregated values. Later on, when you join your main table with the materialized view, the records are matched based on the lookup table's key column values; you can then add the values from the aggregated columns to the matching record in the main table.
In this example, you turn the sales_info_raw_data table into a materialized view to later join with your main table containing the orders data.
Here's the code:
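A sketch of the materialized view, assuming SQLake syntax. The key column (orderid) and the aggregated column (employeename) are assumptions about the sample schema; the LAST aggregation keeps the most recent value seen for each key:

```sql
-- Sketch: map each orderid to the last employee who handled it.
-- Column names are assumed from the sample data and may differ.
CREATE MATERIALIZED VIEW default_glue_catalog.upsolver_samples.physical_store_orders_mv AS
  SELECT orderid,
         LAST(employeename) AS employeename
  FROM default_glue_catalog.upsolver_samples.sales_info_raw_data
  GROUP BY orderid;
```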
Before you join your data, you must create a table to which to write the data. This table is created in the AWS Glue Data Catalog and is also available to query within Amazon Athena.
Here's the code:
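A sketch of the target table, assuming SQLake syntax; the table name is a placeholder. Defining only the partition column and letting Upsolver add the remaining columns keeps the target flexible as the join output evolves:

```sql
-- Sketch: target table for the joined data, partitioned by date.
-- Remaining columns are added automatically when the job writes to it.
CREATE TABLE default_glue_catalog.upsolver_samples.orders_with_employee(
  partition_date date)
PARTITIONED BY partition_date;
```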
The final step is to join your data. Use an INSERT job to join your orders staging table and your sales lookup table by matching the lookup table's key column to the corresponding column in the orders staging table. The job then inserts the data into the target Athena table you created in Step 5.
Here’s the code:
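A hedged sketch of the join job in SQLake syntax, reusing the placeholder names from the earlier steps. The selected columns are assumptions about the sample schema, and the job options shown (start position, run interval, automatic column handling) are illustrative choices rather than requirements:

```sql
-- Sketch: join each order with the last employee who handled it,
-- then insert the result into the target table.
CREATE SYNC JOB join_orders_with_last_employee
  START_FROM = BEGINNING
  ADD_MISSING_COLUMNS = TRUE
  RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_with_employee MAP_COLUMNS_BY_NAME
  SELECT o.orderid,
         mv.employeename,         -- aggregated value from the materialized view
         $event_date AS partition_date
  FROM default_glue_catalog.upsolver_samples.orders_raw_data AS o
  LEFT JOIN default_glue_catalog.upsolver_samples.physical_store_orders_mv AS mv
    ON mv.orderid = o.orderid    -- match on the lookup table's key column
  WHERE $event_time BETWEEN run_start_time() AND run_end_time();
```

The LEFT JOIN keeps orders that have no matching sales interaction; switch to an inner join if you only want orders with a match.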
Now you have successfully created a table joining physical store orders with the employee who completed the order.
To view the resulting table, you can query your data directly in Upsolver. To do this, run the following code:
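For example, assuming the target table name used above:

```sql
-- Sketch: preview the joined output.
SELECT orderid, employeename
FROM default_glue_catalog.upsolver_samples.orders_with_employee
LIMIT 10;
```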
In this how-to guide, you learned how to take raw data from two sources in S3, join the data, and then transform it as per your business needs.
You can use familiar SQL syntax in Upsolver to create data pipelines and organize your data for analytics and machine learning. In this way, as your business needs evolve, so can your data: in the future, you can create additional jobs that use the same staging tables as their source, while your pipelines keep your data fresh.
To launch the template:
Launch Upsolver by navigating to https://db.upsolver.com
Sign up by entering your email address and setting a password.
After you click through the onboarding panel, a list of templates is displayed.
Click the Join two data sources (sample data) template.
Upsolver only runs in your AWS Virtual Private Cloud (VPC). Learn how to Deploy Upsolver on AWS.