Joining Data — Amazon S3 to Amazon Athena

This guide explains how to join data from two different tables in Upsolver, so that you can query the joined result from Amazon Athena.

As you use Upsolver to transform your data, you may need to combine data from multiple places — for example, data from different tables or even from different sources — to perform a particular analysis.

Using sample orders and sales data, this guide explains how to join two Amazon S3 data sources and output the joined data in an Amazon Athena table.

There are six steps to achieve this:

  1. Create a connection to your data sources in S3

  2. Create staging tables to ingest your data

  3. Load data into your staging tables

  4. Use a materialized view to collect the last sales interaction with the order

  5. Create a target table for your joined data

  6. Join orders with the last store employee who dealt with the order

Before you begin

Upsolver only runs in your AWS Virtual Private Cloud (VPC). Learn how to Start Your Free Trial and Deploy Upsolver on AWS.

When you first deploy Upsolver in the customer VPC, you create an Identity and Access Management (IAM) role that gives Upsolver access to the AWS resources it needs to build data pipelines. See AWS Role Permissions for more information.

In this guide, the cluster has permission to read and write to the AWS Glue Data Catalog and to S3.

Step 1

Create a connection to your data sources in S3

Create an S3 connection. This connection enables you to configure the AWS IAM credentials that Upsolver needs to access the data.

Connect to your data lake by providing credentials that Upsolver can use; in this example, the connection uses an IAM role ARN and external ID rather than an Access Key ID and Secret Access Key.

Here’s the code:

CREATE S3 CONNECTION my_sample_data
       AWS_ROLE = 'arn:aws:iam::111111111111:role/<upsolver-role>'
       EXTERNAL_ID = '12345678'
       READ_ONLY = TRUE;

When you deploy Upsolver, the cluster will already have an IAM role attached that may or may not have permission to access the S3 path where your data resides. If it lacks those permissions, you can assign a scope-down policy to the S3 connection. Upsolver recommends this to prevent failures, misconfigurations, and security breaches.
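As an illustration only, a scope-down policy granting read access to the sample bucket used later in this guide might look like the following; adjust the bucket name and prefixes to your own paths:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSampleData",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::upsolver-samples",
        "arn:aws:s3:::upsolver-samples/*"
      ]
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket ARN itself, while `s3:GetObject` applies to the objects under it, which is why both resource forms appear.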

Step 2

Create staging tables to ingest your data

Create a job that reads from the source S3 bucket and writes to a staging table. Upsolver automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.

In this scenario, you create a separate staging table for each of the two data sources.

Here’s the code for the first staging table:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data;

Repeat this step for the second data source.

Here’s the code for the second staging table:

CREATE TABLE default_glue_catalog.upsolver_samples.sales_info_raw_data;

Step 3

Load data into your staging tables

Create a job to load data from the first data source into your staging table:

CREATE JOB load_raw_data_from_orders
    START_FROM = NOW
    CONTENT_TYPE = JSON
AS COPY FROM S3 upsolver_s3_samples
    BUCKET = 'upsolver-samples'
    PREFIX = 'orders/'
INTO default_glue_catalog.upsolver_samples.orders_raw_data;

Create a job to load data from the second data source into your staging table:

CREATE JOB load_raw_data_from_sales_info
    START_FROM = NOW
    CONTENT_TYPE = JSON
AS COPY FROM S3 upsolver_s3_samples
    BUCKET = 'upsolver-samples'
    PREFIX = 'sales_info/'
INTO default_glue_catalog.upsolver_samples.sales_info_raw_data;

To validate that both of your tables contain data, use SELECT statements to view a few rows. Note that it may take a few minutes for the data to appear after the jobs start.

-- SELECT from your first staging table
SELECT * 
FROM   default_glue_catalog.upsolver_samples.orders_raw_data 
LIMIT  100;

-- SELECT from your second staging table
SELECT * 
FROM   default_glue_catalog.upsolver_samples.sales_info_raw_data 
LIMIT  100;

Step 4

Use a materialized view to collect the last sales interaction with the order

To join two tables in Upsolver, you must turn one of them into a materialized view that maps a set of key columns to aggregated values. Later on, when you join your main table with the materialized view, the records are matched based on the lookup table's key column values; you can then add the values from the aggregated columns to the matching record in the main table.

In this example, you turn the sales_info_raw_data table into a materialized view so that you can later join it with the main table containing the orders data.

Here's the code:

CREATE MATERIALIZED VIEW default_glue_catalog.upsolver_samples.physical_store_orders_materialized_view AS 
    SELECT orderid, 
       LAST(saleinfo.source) AS source,
       LAST(saleinfo.store.location.country) AS country,
       LAST(saleinfo.store.location.name) AS name,
       LAST(saleinfo.store.servicedby.employeeid) AS employeeid,
       LAST(saleinfo.store.servicedby.firstname) AS firstname,
       LAST(saleinfo.store.servicedby.lastname) AS lastname
    FROM default_glue_catalog.upsolver_samples.sales_info_raw_data
    GROUP BY orderid;
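The LAST() aggregation keeps the most recent value seen for each key, so the materialized view behaves like a lookup table from orderid to the latest sales details. As an illustration only (this is plain Python, not Upsolver code, and the sample values are assumptions), the same structure can be sketched as:

```python
# Build a lookup table that maps each orderid to the values from the
# *last* sales record seen for that order, emulating LAST() ... GROUP BY.
sales_records = [
    {"orderid": "A1", "source": "Web",   "employeeid": None},
    {"orderid": "A1", "source": "Store", "employeeid": "E7"},
    {"orderid": "B2", "source": "Store", "employeeid": "E3"},
]

lookup = {}
for rec in sales_records:  # records are processed in event order
    # Later records overwrite earlier ones, so only the last one survives.
    lookup[rec["orderid"]] = {k: v for k, v in rec.items() if k != "orderid"}

print(lookup["A1"]["source"])  # prints "Store": the last record for A1
```

This is why the view must GROUP BY the key you later join on: each key maps to exactly one aggregated row.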

Step 5

Create a target table for your joined data

Before you join your data, you must create a target table to write the joined data into. This table is created in the AWS Glue Data Catalog and can also be queried from Amazon Athena.

Here's the code:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_transformed_data;

Step 6

Join orders with the last store employee who dealt with the order

The final step is to join your data. Use an INSERT job to join your orders staging table and sales lookup table by matching the lookup table's key column to the corresponding column in the orders staging table. The job then inserts the data into the target Athena table you created in Step 5.

Here’s the code:

CREATE JOB join_two_tables_orders_with_last_employee
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_transformed_data
   MAP_COLUMNS_BY_NAME
    /*
        Use the SELECT statement below to choose your columns and perform the
        intended transformations.

        In this example, we join the orders with the sales data to enrich each
        order with the last store employee that managed the order.

        We also filter by the sales source type to only return orders that
        were purchased in a physical store.
    */
    SELECT 
       s.orderid, 
       mv.employeeid AS employeeid, 
       mv.firstname AS firstname, 
       mv.lastname AS lastname
    FROM default_glue_catalog.upsolver_samples.orders_raw_data AS s
    LEFT JOIN default_glue_catalog.upsolver_samples.physical_store_orders_materialized_view AS mv
    ON mv.orderid = s.orderid
    WHERE mv.source = 'Store'
    AND TIME_FILTER();
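The join semantics of this job can be sketched outside of Upsolver: each order row is matched against the lookup table by orderid (a left join), and the WHERE clause keeps only matches whose source is 'Store'. The following plain-Python sketch is illustrative only, with assumed sample values:

```python
orders = [{"orderid": "A1"}, {"orderid": "B2"}, {"orderid": "C3"}]

# Lookup table keyed by orderid, as produced by the materialized view:
# the last source and employee recorded for each order.
lookup = {
    "A1": {"source": "Store", "employeeid": "E7", "firstname": "Ana", "lastname": "Kim"},
    "B2": {"source": "Web",   "employeeid": None, "firstname": None,  "lastname": None},
}

joined = []
for order in orders:
    mv = lookup.get(order["orderid"])  # LEFT JOIN: may find no match
    if mv is not None and mv["source"] == "Store":  # WHERE mv.source = 'Store'
        joined.append({
            "orderid": order["orderid"],
            "employeeid": mv["employeeid"],
            "firstname": mv["firstname"],
            "lastname": mv["lastname"],
        })

print(joined)  # only the physical-store order A1 survives the filter
```

Because the WHERE clause references a column from the joined view, unmatched orders (like C3 above) and non-store orders (like B2) are dropped, so the left join effectively behaves like an inner join here.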

Now you have successfully created a table joining physical store orders with the employee who completed the order.

Query your joined data

To view the resulting table, you can query your data directly in Upsolver. To do this, run the following code:

SELECT * 
FROM default_glue_catalog.upsolver_samples.orders_transformed_data 
LIMIT 100;

Conclusion

In this how-to guide, you learned how to take raw data from two sources in S3, join the data, and then transform it as per your business needs.

You can use familiar SQL syntax in Upsolver to create data pipelines and organize your data for analytics and machine learning. As your business needs evolve, so can your data: in the future, you can create additional jobs that use the same staging tables as their source, while your pipelines keep your data fresh.

Try it yourself

To launch the template:

  1. Launch Upsolver by navigating to https://db.upsolver.com

  2. Sign up by entering your email address and setting a password.

  3. After you click through the onboarding panel, a list of templates is displayed.

  4. Click the Join two data sources (sample data) template.
