Real-time Data Ingestion — Amazon S3 to Amazon Athena

This guide explores a popular use case to optimize and transform event data from Amazon S3 to Amazon Athena.

Amazon S3 makes it easy and cost-effective to store large amounts of data, whether it be structured, semi-structured, or unstructured. This is why popular analytics engines such as Amazon Athena use S3 to store and access data in data lakes. However, to supplement S3’s many pre-existing benefits, you can use certain practices to optimize the layout, structure, and size of data to improve query performance and decrease cost per query. These practices are not easy to implement, but with proper maintenance, you can keep up with the rapidly changing variety and velocity of data in your data lake.

You can structure data lakes in many different ways. Most commonly, though, data lakes are designed to include three zones: the raw zone, the staging zone, and the refined zone. This is Upsolver's recommended approach. It's simple to get started, scales with your data, and easily extends to support new use cases.

The raw zone stores all data. As the name implies, the data in question is uncleaned, unverified, and unoptimized for analysis. At this point, the data is only collected, stored, and cataloged.

In the staging zone, you expose data for easy discovery and convert it to an optimized file format such as Apache Parquet. It's also compressed -- which reduces the cost of storage -- and automatically compacted, in which small files merge into larger, efficient ones. These optimizations improve the hygiene of your data and the performance of query engines, resulting in reliable and fresh query results. While you can query the data now, it hasn't yet been transformed to match exact business needs. You define data pipelines that read and transform the data in the staging zone and store the results in the refined zone.

In this guide, we explore a popular use case: optimizing and transforming event data in JSON format. The data, which contains nested fields, typically originates from web applications and service logs, though it can originate from any source. Upsolver moves data from the raw zone of your data lake and transforms and stores it in the refined zone for further analysis by engines such as Amazon Athena.

With Upsolver, you can define business logic transformations using only SQL. Upsolver automatically writes the results to your data lake's refined zone. You can easily insert, update, and delete rows in the data lake with only a few lines of SQL. This results in consistent, reliable, and up-to-date data that analysts, data scientists, and other data practitioners at your company can find and use. No one needs to know how to write, troubleshoot, or optimize code.

The following diagram illustrates the raw, staging, and refined zones in Amazon S3, plus the AWS Glue Data Catalog that stores the schema and partition information of our datasets. Amazon Athena looks up the metadata in the catalog and queries the data that Upsolver has processed and stored in S3.

Before you begin

Upsolver only runs in your AWS Virtual Private Cloud (VPC). Learn how to Start Your Free Trial and Deploy Upsolver on AWS.

When you first deploy Upsolver in the customer VPC, you create an Identity and Access Management (IAM) role that gives you access to any AWS resource you might need to build data pipelines. See AWS Role Permissions for more information.

In this guide, our cluster has permission to read and write to the AWS Glue Data Catalog and to S3. After you set up your user account and deploy Upsolver on AWS, there are five simple steps that lead you from connecting to S3 to analysis in Amazon Athena:

Connect to your S3 bucket in the raw zone
Create a staging table
Copy from S3
Create an output table for refined data in Amazon Athena
Prepare your data for the refined zone

Step 1

Connecting to your S3 bucket in the raw zone

To transfer your data from the raw zone, you must create a Upsolver connection. This connection gives you the ability to configure the AWS IAM credentials that Upsolver needs to access the data.

When you deploy Upsolver, the cluster will already have an IAM role attached that may not have permission to access the S3 path where your data resides. If the cluster lacks those permissions, you can assign a scope-down policy to the S3 connection. We recommend this approach to prevent failures, misconfigurations, and breaches.

Here's the code:

CREATE S3 CONNECTION my_sample_data
       AWS_ROLE = 'arn:aws:iam::111111111111:role/<upsolver-role-*>'
       EXTERNAL_ID = '12345678'
       READ_ONLY = TRUE;

For future reference, your AWS_ROLE can be copied from your AWS IAM user page. After you complete this step, you should find the my_sample_data connection in your navigation tree.

Step 2

Creating a staging table

Stage the data in the staging zone of your data lake. To do this, create a job that reads from the source S3 bucket and writes to a staging table that stores your data. Upsolver automatically converts your files into the Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.

Here’s the code to create the table:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data;

Step 3

Copying from S3

Create a job that loads your data into your staging table using the COPY FROM command:

CREATE JOB load_orders_raw_data_from_s3
    START_FROM = NOW
    CONTENT_TYPE = AUTO
AS COPY FROM S3 upsolver_s3_samples 
    LOCATION = 's3://upsolver-samples/orders/' 
INTO default_glue_catalog.upsolver_samples.orders_raw_data;

Let’s understand what this job does:

Automatically infers the schema and populates the column names and types in the Glue Data Catalog.
Converts the source S3 files to Apache Parquet, which stores the data in a columnar format optimized for faster reads. You can learn more about the properties you can pass to the CREATE JOB operation here.

Upsolver uses the staging table to load your data and get it ready to transform in the next step.

You’re almost done: your data is now cataloged in the AWS Glue Data Catalog and the file format is optimized Parquet. You can start to query it in Amazon Athena. However, you may want to create a refined table that is partitioned and transformed to better meet a unique business need or use case. That's the next step.

Step 4

Creating an output table for refined data in Amazon Athena

To start transforming your data, choose from two types of transformation job:

INSERT
MERGE

To take full advantage of either of these jobs, we recommend you explicitly create your output table, which enables you to define partitions, primary keys, and more. Here’s an example:

To create an output table in AWS Glue Data Catalog:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_transformed_data
    (partition_date date)
PARTITION BY partition_date;

You don’t need to define all of your output columns because Upsolver automatically adds any missing columns. If you want to strictly control which columns are added, you can define them in the CREATE TABLE statement. To do this, when you CREATE JOB you can set the ADD_MISSING_COLUMNS job property to FALSE.

Step 5

Prepare your data for the refined zone

Now that you have defined the output table in the staging zone, create an INSERT job. This job reads data from the staging table, transforms it, and writes the results into the target table inside your refined zone.

To create an INSERT job to read from staging and write to the refined zone:

CREATE JOB transform_orders_and_insert_into_athena
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = true	
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_transformed_data    
  MAP_COLUMNS_BY_NAME
    -- Use the SELECT statement to choose columns from the source and 
    -- implement your business logic transformations.
    SELECT 
      orderid AS order_id, -- rename columns
      -- hash or mask columns using built-in functions
      MD5(customer.email) AS customer_id, 
      -- computed field defined later in the query
      customer_name,  
      nettotal AS total, 
      -- populate the partition column with the processing time of the event, 
      -- and automatically cast to DATE type
      $commit_time AS partition_date 
    FROM default_glue_catalog.upsolver_samples.orders_raw_data
    -- create a computed column
    LET customer_name = customer.firstname || ' ' || customer.lastname 
    WHERE ordertype = 'SHIPPING' 
    AND $commit_time BETWEEN run_start_time() AND run_end_time()    
    AND $commit_time BETWEEN execution_start_time() AND execution_end_time();

Read your data from Amazon Athena

Finally, to view your pipeline results, query your table using SELECT.

SELECT * 
FROM   default_glue_catalog.upsolver_samples.orders_transformed_data 
LIMIT  100;

Conclusion

In this guide, you learned how to take raw data from a source location (S3), stage the data, and then transform it to meet your business needs. The approach was modeled after a common data lake design pattern, in which data is transitioned through raw, staging, and refined zones. Our goal is to make it simple; by adopting and implementing familiar SQL syntax, you can use Upsolver to create data pipelines and organize your data to easily perform analytics and ML.

Furthermore, as your business needs evolve, so can your data. In the future, you can create additional jobs that use the staging table as the source of creativity and innovation, while your pipelines indefinitely keep your data fresh.

Try it yourself

To launch the template:

Launch Upsolver by navigating to https://db.upsolver.com
Sign up by entering your email address and setting a password.
After you click through the onboarding panel, a panel with a list of templates is displayed.
Select the Amazon S3 to Athena (sample data) template.

Last updated 1 year ago