Real-time data ingestion in Athena — Amazon S3 to Amazon Athena
This guide explores a popular use case to optimize and transform event data from Amazon S3 to Amazon Athena.
Amazon S3 makes it easy and cost-effective to store large amounts of data, whether it be structured, semi-structured, or unstructured. This is why popular analytics engines such as Amazon Athena use S3 to store and access data in data lakes. However, to supplement S3’s many pre-existing benefits, you can use certain practices to optimize the layout, structure, and size of data to improve query performance and decrease cost per query. These practices are not easy to implement, but with proper maintenance, you can keep up with the rapidly changing variety and velocity of data in your data lake.
You can structure data lakes in many different ways. Most commonly, though, data lakes are designed to include three zones: the raw zone, the staging zone, and the refined zone. This is Upsolver's recommended approach. It's simple to get started, scales with your data, and easily extends to support new use cases.
The raw zone stores all data. As the name implies, the data in question is uncleaned, unverified, and unoptimized for analysis. At this point, the data is only collected, stored, and cataloged.
In the staging zone, you expose data for easy discovery and convert it to an optimized file format such as Apache Parquet. It's also compressed -- which reduces the cost of storage -- and automatically compacted, in which small files merge into larger, efficient ones. These optimizations improve the hygiene of your data and the performance of query engines, resulting in reliable and fresh query results. While you can query the data now, it hasn't yet been transformed to match exact business needs. You define data pipelines that read and transform the data in the staging zone and store the results in the refined zone.
In this guide, we explore a popular use case: optimizing and transforming event data in JSON format. The data, which contains nested fields, typically originates from web applications and service logs, though it can originate from any source. SQLake moves data from the raw zone of your data lake and transforms and stores it in the refined zone for further analysis by engines such as Amazon Athena.
With Upsolver SQLake, you can define business logic transformations using only SQL. SQLake automatically writes the results to your data lake's refined zone. You can easily insert, update, and delete rows in the data lake with only a few lines of SQL. This results in consistent, reliable, and up-to-date data that analysts, data scientists, and other data practitioners at your company can find and use. No one needs to know how to write, troubleshoot, or optimize code.
The following diagram illustrates the raw, staging, and refined zones in Amazon S3, plus the AWS Glue Data Catalog that stores the schema and partition information of our datasets. Amazon Athena looks up the metadata in the catalog and queries the data that Upsolver has processed and stored in S3.
The process of transforming data in data lakes
When you first deploy SQLake in the customer VPC, you create an Identity and Access Management (IAM) role that gives you access to any AWS resource you might need to build data pipelines. Find out more about these permissions. In this guide, our cluster has permissions to read and write to the AWS Glue Data Catalog and to S3.
After you set up your user account and deploy SQLake on AWS, there are 5 simple steps that lead you from connecting to S3 to analysis in Amazon Athena:
- 1.Connect to your S3 bucket in the raw zone
- 2.Create a staging table
- 3.Copy from S3
- 4.Create an output table for refined data in Amazon Athena
- 5.Prepare your data for the refined zone
To transfer your data from the raw zone, you must create a SQLake connection. This connection gives you the ability to configure the AWS IAM credentials that SQLake needs to access the data.
When you deploy SQLake, the cluster will already have an IAM role attached that may not have permission to access the S3 path where your data resides. If the cluster lacks those permissions, you can assign a scope-down policy to the S3 connection. We recommend this approach to prevent failures, misconfigurations, and breaches.
Here's the code:
CREATE S3 CONNECTION my_sample_data
AWS_ROLE = 'arn:aws:iam::111111111111:role/<upsolver-role-*>'
EXTERNAL_ID = '12345678'
READ_ONLY = TRUE;
Stage the data in the staging zone of your data lake. To do this, create a job that reads from the source S3 bucket and writes to a staging table that stores your data. SQLake automatically converts your files into the Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.
Here’s the code to create the table:
CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data;
Create a job that loads your data into your staging table using the
CREATE JOB load_orders_raw_data_from_s3
START_FROM = NOW
CONTENT_TYPE = AUTO
AS COPY FROM S3 upsolver_s3_samples LOCATION = 's3://upsolver-samples/orders/'
Let’s understand what this job does:
- 1.Automatically infers the schema and populates the column names and types in the Glue Data Catalog.
- 2.Converts the source S3 files to Apache Parquet, which stores the data in a columnar format optimized for faster reads. You can learn more about the properties you can pass to the
CREATE JOBoperation here.
SQLake uses the staging table to load your data and get it ready to transform in the next step.
You’re almost done: your data is now cataloged in the Glue Data Catalog and the file format is optimized Parquet. You can start to query it in Amazon Athena. However, you may want to create a refined table that is partitioned and transformed to better meet a unique business need or use case. That's the next step.
To take full advantage of either of these jobs, we recommend you explicitly create your output table, which enables you to define partitions, primary keys, and more. Here’s an example:
To create an output table in AWS Glue Data Catalog:
CREATE TABLE default_glue_catalog.upsolver_samples.orders_transformed_data(
PARTITION BY partition_date;
You don’t need to define all of your output columns because SQLake automatically adds any missing columns. If you want to strictly control which columns are added, you can define them in the
CREATE TABLEstatement. To do this, when you
CREATE JOByou can set the
ADD_MISSING_COLUMNSjob property to
Now that you have defined the output table in the staging zone, create an
INSERTjob. This job reads data from the staging table, transforms it, and writes the results into the target table inside your refined zone.
To create an
INSERTjob to read from staging and write to the refined zone:
CREATE JOB transform_orders_and_insert_into_athena
START_FROM = BEGINNING
ADD_MISSING_COLUMNS = true
RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_transformed_data MAP_COLUMNS_BY_NAME
-- Use the SELECT statement to choose columns from the source and implement your business logic transformations.
orderid AS order_id, -- rename columns
MD5(customer.email) AS customer_id, -- hash or mask columns using built-in functions
customer_name, -- computed field defined later in the query
nettotal AS total,
$commit_time AS partition_date -- populate the partition column with the processing time of the event, automatically cast to DATE type
LET customer_name = customer.firstname || ' ' || customer.lastname -- create a computed column
WHERE ordertype = 'SHIPPING'
AND time_filter(); AND $commit_time BETWEEN execution_start_time() AND execution_end_time();
Finally, to view your pipeline results, query your table using
SELECT * FROM default_glue_catalog.upsolver_samples.orders_transformed_data limit 100;
In this guide, you learned how to take raw data from a source location (S3), stage the data, and then transform it to meet your business needs. The approach was modeled after a common data lake design pattern, in which data is transitioned through raw, staging, and refined zones. Our goal is to make it simple; by adopting and implementing familiar SQL syntax, you can use SQLake to create data pipelines and organize your data to easily perform analytics and ML.
And as your business needs evolve, so can your data. In the future, you can create additional jobs that use the staging table as the source of creativity and innovation, while your pipelines indefinitely keep your data fresh**.**
To launch the template:
- 2.Sign up by entering your email address and setting a password.
- 3.After you click through the onboarding panel, a panel with a list of templates displays.
- 4.Select the Amazon S3 to Athena (sample data) template.