Joining data in SQLake — Amazon S3 to Amazon Athena

This guide explains how to join data from two different tables in Upsolver SQLake, enabling you to query the joined data in Amazon Athena.

As you use Upsolver SQLake to transform your data, you may need to combine data from multiple places — for example, data from different tables or even from different sources — to perform a particular analysis.

Using sample orders and sales data, this guide explains how to join two Amazon S3 data sources and output the joined data in an Amazon Athena table.

There are 6 steps:

  1. Create a connection to your data sources in S3

  2. Create staging tables to ingest your data

  3. Load data into your staging tables

  4. Use a materialized view to collect the last sales interaction with the order

  5. Create a target table for your joined data

  6. Join orders with the last store employee who dealt with the order

Before you begin

Currently, SQLake only runs in your AWS Virtual Private Cloud (VPC). Learn how to set up your user account and deploy SQLake on AWS.

When you first deploy SQLake in the customer VPC, you create an Identity and Access Management (IAM) role that gives you access to any AWS resources you might need to build data pipelines. Learn more about these permissions.

In this guide, the cluster has permissions to read and write to the AWS Glue Data Catalog and to S3.

Step 1

Create a connection to your data sources in S3

Create an S3 connection. This connection gives SQLake the AWS IAM credentials it needs to access your data.

In this example, the connection assumes an IAM role; alternatively, you can use your Access Key ID and Secret Access Key to connect to your data lake.

Here’s the code:

CREATE S3 CONNECTION my_sample_data
       AWS_ROLE = 'arn:aws:iam::111111111111:role/<upsolver-role>'
       EXTERNAL_ID = '12345678'
       READ_ONLY = TRUE;
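
If your environment authenticates with access keys rather than an IAM role, the same connection can be declared with key-based credentials. The following is a sketch: the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY options are assumed from Upsolver's connection options, so verify them against the CREATE CONNECTION reference for your SQLake version before use.

CREATE S3 CONNECTION my_sample_data_keys
       -- Assumed key-based options; check your version's CREATE CONNECTION reference
       AWS_ACCESS_KEY_ID = '<your-access-key-id>'
       AWS_SECRET_ACCESS_KEY = '<your-secret-access-key>'
       READ_ONLY = TRUE;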

When you deploy SQLake, the cluster already has an IAM role attached, which may or may not have permission to access the S3 path where your data resides. If it lacks those permissions, you can assign a scope-down policy to the S3 connection. Upsolver recommends this to prevent failures, misconfigurations, and security breaches.

Step 2

Create staging tables to ingest your data

Create a job that reads from the source S3 bucket and writes to a staging table. SQLake automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.

In this scenario, you create a separate staging table for each data source.

Here’s the code for the first staging table:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data;

Repeat this step for the second data source.

Here’s the code for the second staging table:

CREATE TABLE default_glue_catalog.upsolver_samples.sales_info_raw_data;

Step 3

Load data into your staging tables

Create a job to load data from the first data source into your staging table. START_FROM = NOW tells the job to ingest only events that arrive after the job is created, and CONTENT_TYPE declares the format of the source files:

CREATE JOB load_raw_data_from_orders
   START_FROM = NOW
   CONTENT_TYPE = JSON
   AS COPY FROM S3 upsolver_s3_samples
                BUCKET = 'upsolver-samples'
                PREFIX = 'orders/'
   INTO default_glue_catalog.upsolver_samples.orders_raw_data;

Create a job to load data from the second data source into your staging table:

CREATE JOB load_raw_data_from_sales_info
   START_FROM = NOW
   CONTENT_TYPE = JSON
   AS COPY FROM S3 upsolver_s3_samples
                BUCKET = 'upsolver-samples'
                PREFIX = 'sales_info/'
   INTO default_glue_catalog.upsolver_samples.sales_info_raw_data;

To validate that both of your tables contain data, use SELECT statements to view a few rows of each. (It may take a few minutes for the data to appear after the jobs start.)

-- SELECT from your first staging table
SELECT * FROM default_glue_catalog.upsolver_samples.orders_raw_data limit 100;
-- SELECT from your second staging table
SELECT * FROM default_glue_catalog.upsolver_samples.sales_info_raw_data limit 100;
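
For a quicker sanity check, you can also count the rows in each staging table; this assumes standard SQL aggregates, consistent with the SELECT support shown above.

-- Non-zero counts confirm that both ingestion jobs are writing data
SELECT COUNT(*) FROM default_glue_catalog.upsolver_samples.orders_raw_data;
SELECT COUNT(*) FROM default_glue_catalog.upsolver_samples.sales_info_raw_data;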

Step 4

Use a materialized view to collect the last sales interaction with the order

To join two tables in SQLake, you must turn one of them into a materialized view that maps a set of key columns to aggregated values. Later, when you join your main table with the materialized view, records are matched on the lookup table's key column values, and the aggregated values are added to each matching record in the main table.

In this example, turn your sales table, default_glue_catalog.upsolver_samples.sales_info_raw_data, into a materialized view that you later join with your main orders table.

Here's the code:

CREATE MATERIALIZED VIEW default_glue_catalog.upsolver_samples.physical_store_orders_materialized_view AS 
    SELECT orderid, 
       LAST(saleinfo.source) as source,
       LAST(saleinfo.store.location.country) as country,
       LAST(saleinfo.store.location.name) as name,
       LAST(saleinfo.store.servicedby.employeeid) as employeeid,
       LAST(saleinfo.store.servicedby.firstname) as firstname,
       LAST(saleinfo.store.servicedby.lastname) as lastname
    FROM default_glue_catalog.upsolver_samples.sales_info_raw_data
    GROUP BY orderid;
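
Each LAST() aggregation keeps the most recent value observed for its column within each orderid group, so the view always reflects the latest sales interaction recorded for an order.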

Step 5

Create a target table for your joined data

Before you join your data, you must create a table to write the joined data into. This table is created in the AWS Glue Data Catalog and is also available to query in Amazon Athena.

Here's the code:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_transformed_data;

Step 6

Join orders with the last store employee who dealt with the order

The final step is to join your data. Use an INSERT job to join the orders staging table with the sales lookup table by matching the lookup table's key column to the corresponding column in the orders staging table. The job then inserts the joined data into the target Athena table you created in Step 5.

Here’s the code:

CREATE JOB join_two_tables_orders_with_last_employee
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
    AS INSERT INTO default_glue_catalog.upsolver_samples.orders_transformed_data MAP_COLUMNS_BY_NAME
    /*
       Use the SELECT statement below to choose your columns and perform the desired transformations.

       In this example, we join the orders with the sales data to enrich each order with the last store employee who managed the order.
       We also filter by the sales source type to return only orders that were purchased in a physical store.
    */
    SELECT 
       s.orderid, 
       mv.employeeid AS employeeid, 
       mv.firstname AS firstname, 
       mv.lastname AS lastname
    FROM default_glue_catalog.upsolver_samples.orders_raw_data as s
    LEFT JOIN default_glue_catalog.upsolver_samples.physical_store_orders_materialized_view AS mv
    ON mv.orderid = s.orderid
    WHERE mv.source = 'Store'
    AND time_filter();
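
The materialized view also exposes the store's source, country, and name columns defined in Step 4, so a variant of the same SELECT can carry store details through to the target table. This sketch reuses only columns already declared above; because the job sets ADD_MISSING_COLUMNS = TRUE, the extra columns should be added to the target schema automatically.

    -- Variant SELECT for the same job, additionally projecting store details from the view
    SELECT
       s.orderid,
       mv.employeeid AS employeeid,
       mv.firstname AS firstname,
       mv.lastname AS lastname,
       mv.country AS store_country,
       mv.name AS store_name
    FROM default_glue_catalog.upsolver_samples.orders_raw_data as s
    LEFT JOIN default_glue_catalog.upsolver_samples.physical_store_orders_materialized_view AS mv
    ON mv.orderid = s.orderid
    WHERE mv.source = 'Store'
    AND time_filter();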

Now you have successfully created a table joining physical store orders with the employee who completed the order.

Query your joined data

To view the resulting table, you can query your data directly in SQLake. To do this, run the following code:

SELECT * FROM default_glue_catalog.upsolver_samples.orders_transformed_data limit 100;
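
Once the joined table is populated, you can aggregate it like any other table, for example to count how many store orders each employee handled. As with the row-count check earlier, standard SQL aggregates are assumed here.

-- Orders handled per employee, computed from the joined target table
SELECT employeeid, firstname, lastname, COUNT(*) AS orders_handled
FROM default_glue_catalog.upsolver_samples.orders_transformed_data
GROUP BY employeeid, firstname, lastname
limit 100;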

Conclusion

In this how-to guide, you learned how to take raw data from two sources in S3, join the data, and then transform it as per your business needs.

You can use familiar SQL syntax in SQLake to create data pipelines and organize your data to more easily perform analytics and ML. As your business needs evolve, so can your data: you can create additional jobs that reuse the same staging tables as their source, while your pipelines keep your data fresh indefinitely.

Try it yourself

To launch the template:

  1. Launch SQLake by navigating to https://db.upsolver.com

  2. Sign up by entering your email address and setting a password.

  3. When you finish clicking through the onboarding panel, a panel with a list of templates displays.

  4. Click the Join two data sources (sample data) template.
