Aggregating Data — Amazon S3 to Amazon Athena
This how-to guide shows you how to use Upsolver to aggregate data.
Does it matter whether you apply filters to your data in the staging zone or aggregate your data instead? Yes, and here's why: data aggregation applies almost exclusively to totals or summary statistics. This is essential. Frequently, clients may not be looking for a breakdown of their data; instead, they may be looking for long-term trends and patterns they can best discern by looking at summaries and totals.
This guide shows you how to aggregate data in Amazon S3 to an Amazon Athena pipeline. There are five steps:
Create a connection to your data source in S3
Create a staging table to ingest your data
Load data into the newly-created staging table
Create an output table for the refined zone
Aggregate and output data into Athena
Step 1
Create a connection to your data source in S3
To join your data, you must create an S3 connection. This connection enables you to configure the AWS IAM credentials that Upsolver needs to access the data.
Use your Access Key ID and your Secret Access Key to connect to your specific data lake.
Here’s the code:
When you deploy Upsolver, the cluster will already have an IAM role attached that may or may not have permission to access the S3 path where your data resides. In the event that it lacks those permissions, you can assign a scope-down policy to the S3 connections. Upsolver recommends this to prevent failures, misconfigurations, and breaches.
Step 2
Create a staging table to ingest and store your data
You must create a job that reads from the source S3 bucket and writes to a staging table.
Upsolver automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.
Here’s the code to create the staging table:
Step 3
Load your data into your staging table
Create a job to load data from your data source into your staging table:
Step 4
Create an output table for the refined zone
Create an output table in the AWS Glue Data Catalog. Later on you connect this output table to Amazon Athena.
Here's the code:
In this example you define a primary key column. A primary key is not necessary to perform aggregations, but it's useful when you must update the aggregation as new events arrive.
Step 5
Aggregate and output data into Amazon Athena
Since you've defined a primary key in the target table, any new record that arrives with the same primary key value automatically overwrites the old record.
Here's how to aggregate and output your data:
In this particular example, you aggregate your sample orders by the customers' cities and then compute the total revenue per city.
Let's explain each column
The nettotal column from the original data is summed and renamed as the column to total_revenue. Since we group by customer.address.city, this gives you the total revenue per city.
Each distinct order is counted in the column num_orders. Since we group by customer.address.city, that means that the number of distinct orders is per city.
The final step is to query your data.
Here's the code:
Conclusion
In this guide, you learned how to take raw data from S3 and refine it by aggregation before transferring it to the refined zone in Amazon Athena.
By adopting and implementing familiar SQL syntax, you can use Upsolver to create data pipelines and organize your data to easily perform analytics and ML.
As your business needs evolve, so can your data. In the future, you can create additional jobs that use the same staging table as the source of creativity and innovation, while your pipelines indefinitely keep your data fresh.
Try it yourself
To launch the template:
Launch Upsolver by navigating to https://db.upsolver.com.
Sign up by entering your email address and setting a password.
After you finish clicking through the onboarding panel, a panel with a list of templates displays.
Click the Aggregating events (sample data) template.
Last updated