Pre-aggregate data for efficiency and performance

Increase storage and processing efficiency by pre-aggregating data before loading to data warehouses or log processors.

Raw data can be very voluminous. Analysts might only need summarized data that's applicable to business. By pre-aggregating the data as it is ingested, you’re able to improve query performance and minimize storage and compute costs.

The objective of this module is to create a summary table in Athena that contains the totals for each item sold for a given day.

Desired Output would be an item_summary table with below schema:

1. Create an Amazon Athena data output

Click on DATA OUTPUT > NEW and SELECT Amazon Athena output. Give the output a NAME > item_summary and choose DATA SOURCES > Orders Click on NEXT to continue.

2. Add aggregation to the data stream

From the Upsolver Transformation Screen, add Itemid to the output by clicking the + sign next to the field. Then click Make Aggregated Click OK if the warning pops-up.

Click on Add Calculated Field. Find and SELECT the TO_DATE function. Enter DATETIME > data.orderDate and NAME > Check the data by clicking on PREVIEW. If everything looks ok, click on SAVE.

Click on Add Aggregation Find and SELECT COUNT(*) and click on SAVE. This will count the number of items for a given period of time window in a stream. Click away the warning if they appear.

Add another aggregation by clicking on Add Aggregation again. This time select SUM from the list and choose data.items[].price from the list. Click on SAVE. Dismiss any warnings that you may encounter.

Add the last aggregation by clicking on Add Aggregation. This time select AVG from the list and choose data.items[].price from the list. Click on SAVE. Dismiss any warnings that you may encounter.

3. Rename the fields to match the output schema

Change the output column names to data_items_itemid > items_id sum_data_items_price > sum_items_price, and avg_data_items_price > avg_items_price Be very careful renaming the fields, we will use them later.

4. Define the partition strategy for writing data to Amazon Athena

Proper partitioning strategy can significant improve query performance with Amazon Athena. Upsolver manages partitions and compactions automatically.

Click on Manage Partitions and choose Time Partition Field > date and Time Partition Size > Daily Click on CLOSE. You will see Partitioned by event time (Daily partitions) shown on the screen instead of processing time.

5. Verify your SQL and data and define run parameter

Switch to the SQL view. Your SQL should look similar to the following. If your SQL looks different, copy and paste it to your environment.

SET "date" = TO_DATE(data.orderDate);
// GENERATED @ 2020-10-12T04:13:03.089790Z
       AGGREGATION_TIME() AS processing_time:TIMESTAMP,
       data.items[].itemId AS items_id:STRING,
       "date" AS _date:TIMESTAMP,
       COUNT(*) AS "count":DOUBLE,
       SUM(data.items[].price) AS sum_items_price:DOUBLE,
       AVG(data.items[].price) AS avg_items_price:DOUBLE
  FROM "Orders"  
    GROUP BY data.items[].itemId,

Click on PREVIEW to verify your data. Click on RUN to define the Amazon Athena. Choose the DATABASE NAME > upsolver and TABLE NAME > item_summary Click on NEXT.

6. Start writing data to Amazon Athena and verify data

Leave everything on the Run page as default and click on DEPLOY.

The output is continuous stream by leaving ENDING AT as Never,

Upsolver starts writing data to Amazon Athena. You can click on PROGRESS to monitor output progress. This may take a minute.

7. Query item_summary table for aggregated data

When Upsolver creates an upserting output to Athena, it creates two objects: a table with the suffice _underlying_data and a view with the output name. When querying, you should query the VIEW, and not the underlying data.

Using Athena you can now query the data set created by Upsolver (it may take a few minutes before the data is available). Some sample queries that can be run are below: Click here to login to Athena in order to run below queries

// The number of each item sold

SELECT items_id,
FROM upsolver.item_summary
GROUP BY  items_id LIMIT 10
// The average price of the items

SELECT "_date",
         sum("sum_items_price")/sum("count") AS avg_price
FROM upsolver.item_summary
GROUP BY  "_date" LIMIT 10

Last updated