Upsolver
Search…
Pre-aggregate data for efficiency and performance
Increase storage and processing efficiency by pre-aggregating data before loading to data warehouses or log processors.
Raw data can be very voluminous. Analysts might only need summarized data that's applicable to business. By pre-aggregating the data as it is ingested, you’re able to improve query performance and minimize storage and compute costs.
The objective of this module is to create a summary table in Athena that contains the totals for each item sold for a given day.
Desired Output would be an item_summary table with below schema:
Reduce data volume and increase processing efficiency by summarizing data before writing to destination

1. Create an Amazon Athena data output

Click on DATA OUTPUT > NEW and SELECT Amazon Athena output. Give the output a NAME > item_summary and choose DATA SOURCES > Orders Click on NEXT to continue.
Create a Amazon Athena output called item_summary

2. Add aggregation to the data stream

From the Upsolver Transformation Screen, add Itemid to the output by clicking the + sign next to the field. Then click Make Aggregated Click OK if the warning pops-up.
Upsolver Transformation Screen - Make Aggregated
Click on Add Calculated Field. Find and SELECT the TO_DATE function. Enter DATETIME > data.orderDate and NAME > root.date Check the data by clicking on PREVIEW. If everything looks ok, click on SAVE.
Transform date format to be more readable
Click on Add Aggregation Find and SELECT COUNT(*) and click on SAVE. This will count the number of items for a given period of time window in a stream. Click away the warning if they appear.
Count the number of items for a given period of time
Add another aggregation by clicking on Add Aggregation again. This time select SUM from the list and choose data.items[].price from the list. Click on SAVE. Dismiss any warnings that you may encounter.
Calculate the total price of an item for a given period of time
Add the last aggregation by clicking on Add Aggregation. This time select AVG from the list and choose data.items[].price from the list. Click on SAVE. Dismiss any warnings that you may encounter.
Calculate the average price of an item for a given period of time

3. Rename the fields to match the output schema

Change the output column names to data_items_itemid > items_id sum_data_items_price > sum_items_price, and avg_data_items_price > avg_items_price Be very careful renaming the fields, we will use them later.
Rename fields to item_id, sum_items_price and avg_items_price

4. Define the partition strategy for writing data to Amazon Athena

Proper partitioning strategy can significant improve query performance with Amazon Athena. Upsolver manages partitions and compactions automatically.
Click on Manage Partitions and choose Time Partition Field > date and Time Partition Size > Daily Click on CLOSE. You will see Partitioned by event time (Daily partitions) shown on the screen instead of processing time.
Partition data by day based on event time

5. Verify your SQL and data and define run parameter

Switch to the SQL view. Your SQL should look similar to the following. If your SQL looks different, copy and paste it to your environment.
1
SET "date" = TO_DATE(data.orderDate);
2
// GENERATED @ 2020-10-12T04:13:03.089790Z
3
SELECT PARTITION_TIME("date") AS _date_1:TIMESTAMP,
4
AGGREGATION_TIME() AS processing_time:TIMESTAMP,
5
data.items[].itemId AS items_id:STRING,
6
"date" AS _date:TIMESTAMP,
7
COUNT(*) AS "count":DOUBLE,
8
SUM(data.items[].price) AS sum_items_price:DOUBLE,
9
AVG(data.items[].price) AS avg_items_price:DOUBLE
10
FROM "Orders"
11
GROUP BY data.items[].itemId,
12
"date",
13
PARTITION_TIME("date")
14
APPEND ON DUPLICATE
15
Copied!
Click on PREVIEW to verify your data. Click on RUN to define the Amazon Athena. Choose the DATABASE NAME > upsolver and TABLE NAME > item_summary Click on NEXT.

6. Start writing data to Amazon Athena and verify data

Leave everything on the Run page as default and click on DEPLOY.
The output is continuous stream by leaving ENDING AT as Never,
Upsolver starts writing data to Amazon Athena. You can click on PROGRESS to monitor output progress. This may take a minute.

7. Query item_summary table for aggregated data

When Upsolver creates an upserting output to Athena, it creates two objects: a table with the suffice _underlying_data and a view with the output name. When querying, you should query the VIEW, and not the underlying data.
Using Athena you can now query the data set created by Upsolver (it may take a few minutes before the data is available). Some sample queries that can be run are below: Click here to login to Athena in order to run below queries
1
// The number of each item sold
2
3
SELECT items_id,
4
SUM("count")
5
FROM upsolver.item_summary
6
GROUP BY items_id LIMIT 10
Copied!
How many of each item were sold
1
// The average price of the items
2
3
SELECT "_date",
4
sum("sum_items_price")/sum("count") AS avg_price
5
FROM upsolver.item_summary
6
GROUP BY "_date" LIMIT 10
Copied!
Average price of the items