Aggregate and Output Data

This quickstart article shows you how to aggregate data when transforming it in the staging zone.

Using familiar SQL syntax in your transformation job, you can aggregate the data in your staging table and load it into your target table.

Here's the code to create a job that loads aggregated data into your target table:

CREATE JOB group_testing
    START_FROM = BEGINNING
    RUN_INTERVAL = 1 MINUTE
    ADD_MISSING_COLUMNS = TRUE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_aggregated_data 
    MAP_COLUMNS_BY_NAME
       SELECT customer.address.city AS customer_address_city, 
          SUM(nettotal) AS total_revenue,
          COUNT(DISTINCT orderid) AS num_orders
       FROM default_glue_catalog.upsolver_samples.orders_raw_data
       WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
       GROUP BY customer.address.city;

Let's understand what this code does.

In this example, you create a job named group_testing and set orders_aggregated_data as the target table for your data. The job option START_FROM instructs Upsolver to include all historical data, and the RUN_INTERVAL option specifies that the job should execute every minute. The ADD_MISSING_COLUMNS option ensures that any columns added to the source table in the future are also added to the target table.
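
These option values aren't the only choices. Below is a minimal sketch of the same job configured to process only newly arriving data and to run less frequently; the alternative values shown (NOW, 5 MINUTES) are illustrative, so confirm them against the job options reference for your environment.

CREATE JOB group_testing_recent
    START_FROM = NOW              -- assumed alternative: process only data arriving after job creation
    RUN_INTERVAL = 5 MINUTES      -- run every 5 minutes instead of every minute
    ADD_MISSING_COLUMNS = TRUE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_aggregated_data
    MAP_COLUMNS_BY_NAME
       SELECT customer.address.city AS customer_address_city,
          SUM(nettotal) AS total_revenue,
          COUNT(DISTINCT orderid) AS num_orders
       FROM default_glue_catalog.upsolver_samples.orders_raw_data
       WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
       GROUP BY customer.address.city;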

In the SELECT statement, we aggregate our sample orders by customer.address.city, calculating the total revenue and number of orders per city. Let's break it down:

  • SUM(nettotal) AS total_revenue: in this line, we sum the nettotal column and alias it as total_revenue. This gives us the total revenue per city.

  • COUNT(DISTINCT orderid) AS num_orders: we then count the number of distinct orders per city into a new column named num_orders.

  • WHERE $commit_time BETWEEN run_start_time() AND run_end_time(): this filter restricts each job execution to the data committed between that run's start and end times.

  • GROUP BY customer.address.city: this line groups the aggregations by customer.address.city, enabling us to view the total revenue and number of orders for each city. You can extend this pattern with additional aggregate functions, as sketched below.
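
If you need additional metrics per city, you can add further aggregate functions to the same SELECT. The sketch below is illustrative only; it reuses the columns above and assumes that standard aggregates such as AVG and MAX are available in your transformation jobs.

SELECT customer.address.city AS customer_address_city,
   SUM(nettotal) AS total_revenue,
   COUNT(DISTINCT orderid) AS num_orders,
   AVG(nettotal) AS avg_order_value,  -- average order value per city (illustrative)
   MAX(nettotal) AS largest_order     -- largest single order per city (illustrative)
FROM default_glue_catalog.upsolver_samples.orders_raw_data
WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
GROUP BY customer.address.city;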

Once the job is running and data has arrived in the target table, the final step is to query it.

Here's the code for querying your data:

SELECT * 
FROM default_glue_catalog.upsolver_samples.orders_aggregated_data 
LIMIT 10;
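
If you want to see the highest-revenue cities first, you can sort the results before limiting them. This is a minimal sketch assuming the target table supports standard ORDER BY:

SELECT customer_address_city,
       total_revenue,
       num_orders
FROM default_glue_catalog.upsolver_samples.orders_aggregated_data
ORDER BY total_revenue DESC
LIMIT 10;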
