Real-time Data Ingestion — Amazon S3 to Amazon Athena
This guide explores a popular use case to optimize and transform event data from Amazon S3 to Amazon Athena.
Amazon S3 makes it easy and cost-effective to store large amounts of data, whether it be structured, semi-structured, or unstructured. This is why popular analytics engines such as Amazon Athena use S3 to store and access data in data lakes. However, to supplement S3’s many pre-existing benefits, you can use certain practices to optimize the layout, structure, and size of data to improve query performance and decrease cost per query. These practices are not easy to implement, but with proper maintenance, you can keep up with the rapidly changing variety and velocity of data in your data lake.
You can structure data lakes in many different ways. Most commonly, though, data lakes are designed to include three zones: the raw zone, the staging zone, and the refined zone. This is Upsolver's recommended approach. It's simple to get started, scales with your data, and easily extends to support new use cases.
The raw zone stores all data. As the name implies, the data in question is uncleaned, unverified, and unoptimized for analysis. At this point, the data is only collected, stored, and cataloged.
In the staging zone, you expose data for easy discovery and convert it to an optimized file format such as Apache Parquet. The data is also compressed, which reduces storage costs, and automatically compacted, merging small files into larger, more efficient ones. These optimizations improve the hygiene of your data and the performance of query engines, resulting in reliable and fresh query results. While you can query the data at this point, it hasn't yet been transformed to match exact business needs. You define data pipelines that read and transform the data in the staging zone and store the results in the refined zone.
In this guide, we explore a popular use case: optimizing and transforming event data in JSON format. The data, which contains nested fields, typically originates from web applications and service logs, though it can originate from any source. Upsolver moves data from the raw zone of your data lake and transforms and stores it in the refined zone for further analysis by engines such as Amazon Athena.
With Upsolver, you can define business logic transformations using only SQL. Upsolver automatically writes the results to your data lake's refined zone. You can easily insert, update, and delete rows in the data lake with only a few lines of SQL. This results in consistent, reliable, and up-to-date data that analysts, data scientists, and other data practitioners at your company can find and use. No one needs to know how to write, troubleshoot, or optimize code.
The following diagram illustrates the raw, staging, and refined zones in Amazon S3, plus the AWS Glue Data Catalog that stores the schema and partition information of our datasets. Amazon Athena looks up the metadata in the catalog and queries the data that Upsolver has processed and stored in S3.
Upsolver runs only in your AWS Virtual Private Cloud (VPC); see Deploy Upsolver on AWS to learn how. When you first deploy Upsolver in your VPC, you create an Identity and Access Management (IAM) role that gives Upsolver access to the AWS resources you need to build data pipelines. See AWS Role Permissions for more information.
In this guide, our cluster has permission to read and write to the AWS Glue Data Catalog and to S3. After you set up your user account and deploy Upsolver on AWS, there are five simple steps that lead you from connecting to S3 to analysis in Amazon Athena:
Connect to your S3 bucket in the raw zone
Create a staging table
Copy from S3
Create an output table for refined data in Amazon Athena
Prepare your data for the refined zone
To transfer your data from the raw zone, you must create an Upsolver connection. This connection lets you configure the AWS IAM credentials that Upsolver needs to access the data.
When you deploy Upsolver, the cluster will already have an IAM role attached that may not have permission to access the S3 path where your data resides. If the cluster lacks those permissions, you can assign a scope-down policy to the S3 connection. We recommend this approach to prevent failures, misconfigurations, and breaches.
Here's the code:
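A minimal sketch, assuming Upsolver's CREATE S3 CONNECTION syntax; the role ARN below is a placeholder for your own:

    CREATE S3 CONNECTION my_sample_data
        AWS_ROLE = 'arn:aws:iam::111111111111:role/upsolver_role' -- placeholder: replace with your IAM role ARN
        READ_ONLY = TRUE; -- optional: limit the connection to reading data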
For future reference, your AWS_ROLE can be copied from your AWS IAM user page. After you complete this step, you should find the my_sample_data connection in your navigation tree.
Stage the data in the staging zone of your data lake. To do this, create a job that reads from the source S3 bucket and writes to a staging table that stores your data. Upsolver automatically converts your files into the Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.
Here’s the code to create the table:
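A sketch of the staging table definition; the catalog, database, and table names are illustrative, and the empty column list lets Upsolver infer the schema:

    CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data()
        -- no columns declared: Upsolver infers them from the source data
        PARTITIONED BY $event_date; -- partition the staging table by Upsolver's event-date system column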
Create a job that loads your data into your staging table using the COPY FROM command:
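A sketch of the COPY FROM job, assuming JSON source files; the bucket and prefix in LOCATION are placeholders, and exact option names can vary by Upsolver version:

    CREATE JOB load_orders_raw_data_from_s3
        CONTENT_TYPE = JSON -- the sample event data is in JSON format
    AS COPY FROM S3 my_sample_data -- the S3 connection created earlier
        LOCATION = 's3://upsolver-samples/orders/' -- placeholder: your raw-zone bucket and prefix
    INTO default_glue_catalog.upsolver_samples.orders_raw_data; -- the staging table created above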
Let’s understand what this job does:
Automatically infers the schema and populates the column names and types in the Glue Data Catalog.
Converts the source S3 files to Apache Parquet, which stores the data in a columnar format optimized for faster reads. You can learn more about the properties you can pass to the COPY FROM operation here.
Upsolver uses the staging table to load your data and get it ready to transform in the next step.
You’re almost done: your data is now cataloged in the AWS Glue Data Catalog and the file format is optimized Parquet. You can start to query it in Amazon Athena. However, you may want to create a refined table that is partitioned and transformed to better meet a unique business need or use case. That's the next step.
To start transforming your data, choose from two types of job:
INSERT
MERGE
To take full advantage of either of these jobs, we recommend that you explicitly create your output table, which enables you to define partitions, primary keys, and more. Here's an example that creates an output table in the AWS Glue Data Catalog:
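A sketch of the output table, using the same illustrative catalog and database names; here a date column is declared explicitly so the table can be partitioned by it:

    CREATE TABLE default_glue_catalog.upsolver_samples.orders_transformed_data (
        partition_date date -- declared up front so it can serve as the partition column
    )
    PARTITIONED BY partition_date;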
Now that you have defined the output table for your refined zone, create an INSERT job. This job reads data from the staging table, transforms it, and writes the results into the target table inside your refined zone.

You don't need to define all of your output columns, because Upsolver automatically adds any missing columns. If you want to strictly control which columns are added, you can define them in the CREATE TABLE statement and set the ADD_MISSING_COLUMNS job property to FALSE when you CREATE JOB.
To create an INSERT job to read from staging and write to the refined zone:
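A sketch of the transformation job; the source field names (orderid, ordertype, customer.email) are assumptions about the sample JSON, and the job options shown are illustrative:

    CREATE JOB transform_orders_and_insert_into_athena
        START_FROM = BEGINNING -- process the data already in the staging table
        ADD_MISSING_COLUMNS = TRUE -- let Upsolver add output columns that aren't declared in the table
        RUN_INTERVAL = 1 MINUTE -- how often newly arrived data is processed
    AS INSERT INTO default_glue_catalog.upsolver_samples.orders_transformed_data MAP_COLUMNS_BY_NAME
        SELECT
            orderid AS order_id, -- rename a raw field for analysts
            MD5(customer.email) AS customer_id, -- example transformation: hash a nested field into an ID
            ordertype AS order_type,
            $event_date AS partition_date -- populate the partition column
        FROM default_glue_catalog.upsolver_samples.orders_raw_data
        WHERE $event_time BETWEEN run_start_time() AND run_end_time(); -- only the current interval's data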
Finally, to view your pipeline results, query your table using SELECT.
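For example, assuming the illustrative table above, the following returns a sample of refined rows; the same table can be queried directly from Amazon Athena:

    SELECT order_id, order_type, partition_date
    FROM default_glue_catalog.upsolver_samples.orders_transformed_data
    LIMIT 10;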
In this guide, you learned how to take raw data from a source location (S3), stage the data, and then transform it to meet your business needs. The approach was modeled after a common data lake design pattern, in which data is transitioned through raw, staging, and refined zones. Our goal is to make it simple; by adopting and implementing familiar SQL syntax, you can use Upsolver to create data pipelines and organize your data to easily perform analytics and ML.
Furthermore, as your business needs evolve, so can your data. In the future, you can create additional jobs that use the staging table as the source for new transformations and use cases, while your pipelines continuously keep your data fresh.
To try this guide with Upsolver's sample data, launch the template:
Launch Upsolver by navigating to https://db.upsolver.com
Sign up by entering your email address and setting a password.
After you click through the onboarding panel, a panel with a list of templates is displayed.
Select the Amazon S3 to Athena (sample data) template.