Aggregating data in SQLake — Amazon S3 to Amazon Athena

This how-to guide shows you how to use SQLake to aggregate data.

Does it matter whether you filter your data in the staging zone or aggregate it instead?

Yes, and here's why: aggregation produces totals and summary statistics rather than individual records. Clients frequently aren't looking for a row-by-row breakdown of their data; they're looking for long-term trends and patterns, which are easiest to discern from summaries and totals.

This guide shows you how to aggregate data in an Amazon S3 to Amazon Athena pipeline. There are five steps:

  1. Create a connection to your data source in S3

  2. Create a staging table to ingest your data

  3. Load data into the newly created staging table

  4. Create an output table for the refined zone

  5. Aggregate and output data into Athena

Step 1

Create a connection to your data source in S3

To access your data, you must first create an S3 connection. This connection enables you to configure the AWS IAM credentials that SQLake needs to access the data.

You can authenticate with an IAM role, as in the example below, or with your Access Key ID and Secret Access Key.

Here’s the code:

CREATE S3 CONNECTION upsolver_s3_samples
    AWS_ROLE = 'arn:aws:iam::949275490180:role/upsolver_samples_role'
    EXTERNAL_ID = 'SAMPLES'
    READ_ONLY = TRUE;

When you deploy SQLake, the cluster already has an IAM role attached, which may or may not have permission to access the S3 path where your data resides. If it lacks those permissions, you can assign a scope-down policy to the S3 connection. Upsolver recommends this to prevent failures, misconfigurations, and breaches.
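
If you prefer key-based authentication over an IAM role, the connection looks roughly like the sketch below. The connection name my_s3_connection is a placeholder, and the exact option names for key-based credentials may differ between SQLake versions, so check the connection reference before using it:

-- Hypothetical key-based variant of the connection above.
-- Replace the placeholders with your own credentials.
CREATE S3 CONNECTION my_s3_connection
    AWS_ACCESS_KEY_ID = '<your access key id>'
    AWS_SECRET_ACCESS_KEY = '<your secret access key>'
    READ_ONLY = TRUE;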

Step 2

Create a staging table to ingest and store your data

You must create a job that reads from the source S3 bucket and writes to a staging table.

SQLake automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.

Here’s the code to create the staging table:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data;

Step 3

Load your data into your staging table

Create a job to load data from your data source into your staging table:

CREATE JOB load_orders_raw_data_from_s3
    START_FROM = NOW
    CONTENT_TYPE = JSON
    AS COPY FROM S3 upsolver_s3_samples LOCATION = 's3://upsolver-samples/orders/'
    INTO default_glue_catalog.upsolver_samples.orders_raw_data;
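
Once the job has had a minute or two to run, you can sanity-check the staging table with a simple query. This mirrors the SELECT used at the end of this guide, only pointed at the staging table:

-- Optional check: preview a few raw rows after the load job has run.
SELECT * FROM default_glue_catalog.upsolver_samples.orders_raw_data LIMIT 10;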

Step 4

Create an output table for the refined zone

Create an output table in the AWS Glue Data Catalog. Later on, you query this output table from Amazon Athena.

Here's the code:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_aggregated_data(
    customer_address_city string)
    PRIMARY KEY customer_address_city;

In this example you define a primary key column. A primary key is not necessary to perform aggregations, but it's useful when you must update the aggregation as new events arrive.
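
For comparison, if you only needed an append-style aggregation with no in-place updates, you could define the table without a key. The sketch below uses a hypothetical table name (orders_aggregated_append); without a primary key, new aggregation rows are appended rather than replacing existing rows:

-- Hypothetical variant without a primary key: rows are appended,
-- not updated in place, as new aggregation results arrive.
CREATE TABLE default_glue_catalog.upsolver_samples.orders_aggregated_append(
    customer_address_city string);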

Step 5

Aggregate and output data into Amazon Athena

Since you've defined a primary key in the target table, any new record that arrives with the same primary key value automatically overwrites the old record.

Here's how to aggregate and output your data:

CREATE JOB group_testing
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
    AS INSERT INTO default_glue_catalog.upsolver_samples.orders_aggregated_data MAP_COLUMNS_BY_NAME
    /*
        Use the SELECT statement below to choose your columns and perform the desired transformations.

        In this example, we aggregate the sample orders data by the customer's city and compute the total revenue per city.
    */
    SELECT customer.address.city AS customer_address_city, 
       SUM(nettotal) AS total_revenue,
       COUNT(DISTINCT orderid) AS num_orders
    FROM default_glue_catalog.upsolver_samples.orders_raw_data
    WHERE time_filter()
    GROUP BY customer.address.city;

In this example, you aggregate the sample orders by customer city, then compute the total revenue and the number of distinct orders per city.

Let's look at each aggregated column:

  • The nettotal column from the original data is summed into the total_revenue column. Since we group by customer.address.city, this gives you the total revenue per city.

  • Each distinct order is counted in the num_orders column. Since we group by customer.address.city, this gives you the number of distinct orders per city. (See the sketch after this list for adding further per-city metrics.)
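
If you later want more summary statistics per city, you can extend the same SELECT with additional aggregate expressions. The sketch below assumes the nettotal field is numeric and adds a hypothetical avg_order_value column; because the job sets ADD_MISSING_COLUMNS = TRUE, the new column would be added to the target table automatically:

-- Hypothetical extension of the SELECT clause in the job above.
SELECT customer.address.city AS customer_address_city,
   SUM(nettotal) AS total_revenue,
   COUNT(DISTINCT orderid) AS num_orders,
   AVG(nettotal) AS avg_order_value
FROM default_glue_catalog.upsolver_samples.orders_raw_data
WHERE time_filter()
GROUP BY customer.address.city;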

The final step is to query your data.

Here's the code:

SELECT * FROM default_glue_catalog.upsolver_samples.orders_aggregated_data LIMIT 10;
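
You can shape this query like any other; for example, to see the highest-revenue cities first (plain SQL against the same table, nothing SQLake-specific assumed):

-- Optional: rank cities by total revenue in the results.
SELECT customer_address_city, total_revenue, num_orders
FROM default_glue_catalog.upsolver_samples.orders_aggregated_data
ORDER BY total_revenue DESC
LIMIT 10;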

Conclusion

In this guide, you learned how to take raw data from S3 and refine it through aggregation before writing it to the refined zone, where you can query it with Amazon Athena.

Because SQLake uses familiar SQL syntax, you can create data pipelines and organize your data to more easily perform analytics and ML.

As your business needs evolve, so can your data. In the future, you can create additional jobs that use the same staging table as their source, while your pipelines keep your data continuously fresh. A sketch of such a job follows.
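
As an illustration of that idea, the statements below reuse the staging table to feed a second output table that tracks the average order value per city. The table and job names (orders_city_avg, aggregate_city_avg) and the avg_order_value metric are hypothetical; the syntax simply mirrors Steps 4 and 5:

-- Hypothetical second pipeline reading from the same staging table.
CREATE TABLE default_glue_catalog.upsolver_samples.orders_city_avg(
    customer_address_city string)
    PRIMARY KEY customer_address_city;

CREATE JOB aggregate_city_avg
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
    AS INSERT INTO default_glue_catalog.upsolver_samples.orders_city_avg MAP_COLUMNS_BY_NAME
    SELECT customer.address.city AS customer_address_city,
       AVG(nettotal) AS avg_order_value
    FROM default_glue_catalog.upsolver_samples.orders_raw_data
    WHERE time_filter()
    GROUP BY customer.address.city;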

Try it yourself

To launch the template:

  1. Launch SQLake by navigating to https://db.upsolver.com

  2. Sign up by entering your email address and setting a password.

  3. After you click through the onboarding panel, a list of templates is displayed.

  4. Click the Aggregating events (sample data) template.
