AWS S3 to Athena use case

This article maps out a simple Upsolver use case where an S3 data source is turned into an Athena output.

A very typical simple use case is when you have a stream of data you want to query using Amazon Athena, with the events sent to the output as they arrive from the data source.

It is very common to partition the data by the date/time of the event (and not the date/time that the event arrived). This ensures that events that might arrive late are stored in the correct partitions, and it also means that when you query the data, only the relevant folders are queried, improving performance.

For this example, we use a demo Amazon S3 data source called Resource Utilization, which is a simple stream of data that is not nested.

Required Level of Expertise:

  • Fundamental

This use case will demonstrate how to:

Note: This use case only includes fields that do not require transformation and does not cover calculated fields and lookups.

Create an output

1. Go to the Data Sources page and click on the desired S3 data source.

2. Click New Output and select Amazon Athena.

3. Name this output and click Next. Note: The time field is automatically added to the schema.

Add fields

Add the simple fields that either require no data transformation or only data type transformation.

1. In the Schema tab, select the desired fields; for example:

  • In data > tags, select:

    • AWS REGION

    • BUILD_BRANCH

    • BUILD_NUMBER

    • CLOUD

    • CLUSTER_ID

    • CLUSTER_ORGANIZATION_ID.

  • In data > fields, select:

    • n_cpus

    • available_usage

Number fields are automatically mapped to type DOUBLE to preserve accuracy, and some fields with numbers may appear as type STRING.

If a field is comprised of integers, you can change the column type to BIGINT.

2. Update the output column names as required.

If you wish, toggle the view from UI to SQL to edit the output directly in SQL. See: Transform with SQL

Partition the data

The data is stored in folders according to the time field and by default, this is set to Daily. Through the Properties tab, this can be set to Yearly, Monthly, Daily, or Hourly.

2. Select Date and then select the desired function. In this example, we will be usingPARSE_DATE(string date, string format). See: Functions

3. Specify the date format in Java DateTimeFormatter format (e.g. yyyy-MM-dd HH:mm:ss).

4. Select the date field.

5. Name this field.

6. Click Preview to see the date conversion.

If N/A appears it means that it was unable to convert the date and that you need to update the format.

7. Check Add to Output Schema to add it to your output.

8. Click Save.

9. Select this field as the Time Partition field and close the Manage Partitions window.

If you experience an error updating the output, make sure your output is not currently empty.

Your parsed date field should now be added to your output. You can view its code by toggling from UI to SQL or going to the Calculated Fields tab.

Run the output

Click Run to run the output to view the results. See: Run an output

Last updated