Join multiple data streams for real-time analytics
This article provides a walkthrough on how to join multiple data streams for real-time analytics.
Performing a join on multiple data streams is easy with Upsolver.
This guide provides the instructions on joining:
impressions: primary data source with ad campaign information
clicks: secondary data stream tracking number of clicks on an ad
1. Click on Outputs on the left and then New on the right upper corner.
2. Select Amazon Athena as the data output.
3. Click Add to add as many data sources as you need. Click Next to continue.
1. Select the SQL window from the upper right hand corner.
2. The sample SQL below performs a
LEFT OUTER JOINbetween
Behind the scenes, the
LEFT OUTER JOINis creating a lookup table, enabling users to index data by a set of keys and then retrieve the results in milliseconds.
SELECT data.id AS impression_id,
data.win_timestamp AS impression_time,
data.campaign_id AS campaign_id,
data.exch_user AS user_id,
IF_ELSE(click_data.click_time IS NULL, 0, 1) AS is_click
-- if click_time exists returns 1, else returns 0
FROM Impressions LEFT OUTER JOIN
(SELECT data.id AS imp_id,
LAST(data.click_timestamp) AS click_time
-- last click_timestamp is the click_time
FROM Clicks GROUP BY data.id)
AS click_data WAIT 10 MINUTES ON click_data.imp_id = data.id
-- wait for 10 minutes before performing the join since clicks usually arrive after impressions.
1. Define storage, database, and table information for your Athena environment and click Next.
2. Define the compute cluster that you would like to use and the time range of the data you would like to output.
3. Click Deploy.
1. Check to make sure the output data is up to date by clicking on the Progress tab.
2. Run a query in Athena to make sure you get the correct results.
SUM(is_click)/COUNT(*) AS CTR
-- to get the click through rate:
-- divide the sum of clicks for a campaign with the total number of impressions
GROUP BY campaign_id