Working with Arrays
Upsolver supports arrays natively and the data transformations are defined on fields directly.
Understanding how arrays are handled for data transformations will give you better control of your output and assist you in achieving your intended data structure.
Transformations on array fields
In Upsolver, when performing a calculation on inputs that are arrays, the transformation is defined on the field names and the values of those fields are then passed into the calculation at runtime.
Example 1: Calculation without an array
To start with the simplest calculation possible, let's first look at calculations that do not involve arrays.
Suppose you have the following data:
Given the calculation:
Upsolver calculates:
Example 2: Calculation with an array
To see what happens when one input is an array, suppose you have the following data:
Given the calculation:
Upsolver performs the addition operation for each value of the first input with the value of the second input.
As a result, the calculation of is evaluated as:
Example 3: Calculation with two arrays
Now suppose both of your inputs are arrays:
Given the calculation:
In this case, the following calculation is performed:
Since two arrays are being added together, Upsolver will perform a Cartesian product on the arrays before performing the addition operation on the resulting pairs.
By taking the Cartesian product, we combine the two arrays into an array of ordered pairs:
Here we ended up with an array of pairs. In general, Cartesian product calculations result in elements.
Finally, we add the values within each pair together to get our end result:
As Cartesian product calculations are typically undesirable, the next sections describe how to avoid them.
Array context: Define transformations
To demonstrate how Upsolver handles transformations and understand what happens internally, the examples below use hierarchical data and discuss how to flatten this data to a tabular format later on.
Example 1
Assume the following input event:
This sample data includes an array of salaries that include the base rate and a bonus value, and we are interested in calculating the final amount to be paid for every employee.
In other words, we want to get an output that looks like this:
We use SQL syntax to quickly demonstrate the transformation; the same applies for transformations defined using the UI.
We define the following SELECT
statement:
If you preview the result of the above query, you will find that it is not the desired result:
As you can see, the results is two independent arrays: one for employees and one for salaries. Moreover, the array of salaries has 4 items instead of the 2 items we wanted.
Let's examine the query more carefully to pinpoint the source of the issue.
First of all, since the calculated salary
field had 2 fields from within an array as inputs, we selected it as an array salary[]
. However, we chose to write it outside of the context of the input array data.salaries[]
.
In layman's terms, the context of a field is the field's location within the nested structure.
As a result, we actually wrote the array to a field named salary
in the root of our object:
With the calculated field placed outside the array, it has no local context when performing the calculation. Therefore, Upsolver takes all the values available and passes them into the SUM
function.
This resulted in the following calculation:
As mentioned previously, this leads to a Cartesian product of the two arrays before performing the calculation on the resulting pair values:
Thus, our calculated field becomes:
We can see this is why salary
is an array field; and because we lost the context of our calculation, we also don't know which salary is associated with which employee.
Solution
To fix this issue, all we need to do is change the context of the calculation.
Step 1: Change where we save the calculated field
By changing the target field name from salary
to data.salaries[].salary
, we are able to avoid the Cartesian product and compute an array with the 2 salaries we are interested in:
However, we still need to match the salary values to the correct employee.
Step 2: Add context to calculated values
As our query sent the results to the fields toPay.employee
and toPay.salary
, these two fields, while both nested under toPay
, are not within the same array record.
To associate an employee with their corresponding salary, we place the fields within a shared array:
Now that the result fields are within a shared array, the result is as expected:
Example 2: Using the ZIP
function
ZIP
functionIn some cases, the data may arrive in a slightly different format:
In this case, we want the first element of each array to be related to the first element of every other array, and likewise for each following element.
Unlike the first example, we can't achieve the desired calculated salary by simply placing the calculation in the correct context as there is no shared context array.
We can, however, create the required context by zipping together the arrays using the ZIP
function:
The ZIP
function takes in an optional comma-separated list of field names. If the field names aren't provided, it defaults to a list of the array inputs (field1, field2,...
).
The ZIP
function stitches the arrays together on an element-by-element basis: the first element with the first element, the second element with the second element, etc.
Using the ZIP
function here allows access to a calculated field that is in the exact same structure as the first example:
Then, to define the output, you can use the output query that was used initially:
Flattening data
Now that we understand how transformations on arrays work and the role of context and position, we will discuss flattening arrays. This is useful when we want to convert nested objects with arrays into flat tables.
For example, if we have the following event:
The resulting table looks as follows:
1
Oleg
123
2
Oleg
123
3
Oleg
123
As the data was flattened based on the values[]
array that contains three values, the table shows three rows from our one source event.
Flatten multiple arrays
Suppose we have the following data:
If we were to simply SELECT
the arrays as is:
With two independent arrays selected without a shared context, this query results in a Cartesian product:
apple
fruit
NY
fruit
apple
city
NY
city
To avoid this, you should use the ZIP
function to combine the arrays into a single context:
Then the result is as expected:
apple
fruit
NY
city
Last updated