Full vs. Partial Inbound Data Records
Q: How should I configure my pipelines if, sometimes, only some of the record elements (columns) are being streamed in the update event, instead of all of the fields.
A: To better understand the answer, below is an example of a simple employee record.
{
"employee": {
"name": "sonoo",
"salary": 56000,
"married": true
}
}
If every new inbound update record in the data source contains all of the data elements, all that's needed is to create a simple data pipeline that moves all the data elements over to the output and replaces the whole record upon arrival:
SELECT
employee.name as employee_name,
employee.salary as salary,
employee.married as married
FROM "employee_source"
REPLACE ON DUPLICATE employee.name
NOTE: If you have any NULLs (or other special cases) in the input, they may require special handling in the output .
However, if the update record is partial, as in the example below, a different approach should be taken.
{
"event":{
"type": "Update",
"time": 16598776000
}
"employee": {
"name": "sonoo",
"married": false
}
}
In this case the the query handling the source stream events should be an aggregation in order to always have the last valid data in the stream. Note that in this scenario the window duration should be closely monitored because the LAST value will be taken from the entire time frame.
SELECT
employee.name as employee_name,
last(employee.salary) as salary,
last(employee.married) as married
FROM "employee_source"
WHERE event.type = 'Update'
GROUP BY employee.name
REPLACE ON DUPLICATE employee.name
Last updated