Pre-aggregate data for efficiency and performance
Increase storage and processing efficiency by pre-aggregating data before loading to data warehouses or log processors.
Raw data can be very voluminous. Analysts might only need summarized data that's applicable to business. By pre-aggregating the data as it is ingested, you’re able to improve query performance and minimize storage and compute costs.
The objective of this module is to create a summary table in Athena that contains the totals for each item sold for a given day.
Desired Output would be an item_summary table with below schema:
Reduce data volume and increase processing efficiency by summarizing data before writing to destination
Click on DATA OUTPUT > NEW and SELECT Amazon Athena output. Give the output a NAME > item_summary and choose DATA SOURCES > Orders Click on NEXT to continue.
Create a Amazon Athena output called item_summary
From the Upsolver Transformation Screen, add Itemid to the output by clicking the + sign next to the field. Then click Make Aggregated Click OK if the warning pops-up.
Upsolver Transformation Screen - Make Aggregated
Click on Add Calculated Field. Find and SELECT the TO_DATE function. Enter DATETIME > data.orderDate and NAME > root.date Check the data by clicking on PREVIEW. If everything looks ok, click on SAVE.
Transform date format to be more readable
Click on Add Aggregation Find and SELECT COUNT(*) and click on SAVE. This will count the number of items for a given period of time window in a stream. Click away the warning if they appear.
Count the number of items for a given period of time
Add another aggregation by clicking on Add Aggregation again. This time select SUM from the list and choose data.items.price from the list. Click on SAVE. Dismiss any warnings that you may encounter.
Calculate the total price of an item for a given period of time
Add the last aggregation by clicking on Add Aggregation. This time select AVG from the list and choose data.items.price from the list. Click on SAVE. Dismiss any warnings that you may encounter.
Calculate the average price of an item for a given period of time
Change the output column names to data_items_itemid > items_id sum_data_items_price > sum_items_price, and avg_data_items_price > avg_items_price Be very careful renaming the fields, we will use them later.
Rename fields to item_id, sum_items_price and avg_items_price
Proper partitioning strategy can significant improve query performance with Amazon Athena. Upsolver manages partitions and compactions automatically.
Click on Manage Partitions and choose Time Partition Field > date and Time Partition Size > Daily Click on CLOSE. You will see Partitioned by event time (Daily partitions) shown on the screen instead of processing time.
Partition data by day based on event time
Switch to the SQL view. Your SQL should look similar to the following. If your SQL looks different, copy and paste it to your environment.
SET "date" = TO_DATE(data.orderDate);
// GENERATED @ 2020-10-12T04:13:03.089790Z
SELECT PARTITION_TIME("date") AS _date_1:TIMESTAMP,
AGGREGATION_TIME() AS processing_time:TIMESTAMP,
data.items.itemId AS items_id:STRING,
"date" AS _date:TIMESTAMP,
COUNT(*) AS "count":DOUBLE,
SUM(data.items.price) AS sum_items_price:DOUBLE,
AVG(data.items.price) AS avg_items_price:DOUBLE
GROUP BY data.items.itemId,
APPEND ON DUPLICATE
Click on PREVIEW to verify your data. Click on RUN to define the Amazon Athena. Choose the DATABASE NAME > upsolver and TABLE NAME > item_summary Click on NEXT.
Leave everything on the Run page as default and click on DEPLOY.
The output is continuous stream by leaving ENDING AT as Never,
Upsolver starts writing data to Amazon Athena. You can click on PROGRESS to monitor output progress. This may take a minute.
When Upsolver creates an upserting output to Athena, it creates two objects: a table with the suffice
_underlying_dataand a view with the output name. When querying, you should query the VIEW, and not the underlying data.
Using Athena you can now query the data set created by Upsolver (it may take a few minutes before the data is available). Some sample queries that can be run are below: Click here to login to Athena in order to run below queries
// The number of each item sold
GROUP BY items_id LIMIT 10
How many of each item were sold
// The average price of the items
sum("sum_items_price")/sum("count") AS avg_price
GROUP BY "_date" LIMIT 10
Average price of the items