# Working with arrays

Upsolver supports arrays natively and the data transformations are defined on fields directly.

Understanding how arrays are handled for data transformations will give you better control of your output and assist you in achieving the desired data structure.

In Upsolver, when performing a calculation on inputs that are arrays, the transformation is defined on the field names, and the values of those fields are then passed into the calculation at runtime.

To start with the simplest calculation possible, let's first look at calculations that do not involve arrays.

Suppose you have the following data:

{

"data": {

"value1": 1,

"value2": 2

}

}

Given the calculation:

$\text{data.value1} + \text{data.value2}$

Upsolver calculates:

$1 + 2 = 3$

To see what happens when one input is an array, suppose you have the following data:

{

"data": {

"value1": [1,2],

"value2": 2

}

}

Given the calculation:

$\text{data.value1[]} + \text{data.value2}$

Upsolver performs the addition operation for each value of the first input with the value of the second input.

As a result, the calculation of

$[1,2] + 2$

is evaluated as:$[1+2, 2+2] = [3,4]$

Now suppose both of your inputs are arrays:

{

"data": {

"value1": [1,2],

"value2": [20,30]

}

}

Given the calculation:

$\text{data.value1[]} + \text{data.value2[]}$

In this case, the following calculation is performed:

$[1,2] + [20,30]$

Since two arrays are being added together, Upsolver will perform a Cartesian product on the arrays before performing the addition operation on the resulting pairs.

By taking the Cartesian product, we combine the two arrays into an array of ordered pairs:

$[1,2] \times [20,30] = [(1,20), (1,30), (2,20), (2,30)]$

Here we ended up with an array of

$2 \times 2 = 4$

pairs. In general, Cartesian product calculations result in $\text{length(value1[])} \cdot \text{length(value2[])}$

elements. Finally, we add the values within each pair together to get our end result:

$[1+20, 1+30, 2+20, 2+30] = [21,31,22,32]$

As Cartesian product calculations are typically undesirable, the next sections describe how to avoid them.

To demonstrate how Upsolver handles transformations and understand what happens internally, the examples below use hierarchical data and discuss how to flatten this data to a tabular format later on.

Assume the following input event:

{

"data": {

"salaries": [{

"employee": "Jhon",

"baseRate": 100000,

"bonus": 23000

},

{

"employee": "Jacob",

"baseRate": 78000,

"bonus": 12000

}]

}

}

This sample data includes an array of salaries that include the base rate and a bonus value, and we are interested in calculating the final amount to be paid for every employee.

In other words, we want to get an output that looks like this:

{

"toPay": [{

"employee": "Jhon",

"salary": 123000

},

{

"employee": "Jacob",

"salary": 100000

}]

}

We use SQL syntax to quickly demonstrate the transformation; the same applies for transformations defined using the UI.

We define the following

`SELECT`

statement:SELECT data.salaries[].employee as toPay.employee,

salary[] as toPay.salary

FROM my_data_source

LET salary = data.salaries[].baseRate + data.salaries[].bonus

If you preview the result of the above query, you will find that it is not the desired result:

{

"toPay": {

"employee": [

"Jhon",

"Jacob"

],

"salary": [

123000,

112000,

101000,

90000

]

}

}

As you can see, the results is two independent arrays: one for employees and one for salaries. Moreover, the array of salaries has 4 items instead of the 2 items we wanted.

Let's examine the query more carefully to pinpoint the source of the issue.

First of all, since the calculated

`salary`

field had 2 fields from within an array as inputs, we selected it as an array `salary[]`

. However, we chose to **write it outside of the context of the input array****data.salaries[]**

**.**In layman's terms, the context of a field is the field's location within the nested structure.

As a result, we actually

**wrote the array to a field named**`salary`

in the root of our object:{

"data": {

"salaries": [{

"employee": "Jhon",

"baseRate": 100000,

"bonus": 23000

},

{

"employee": "Jacob",

"baseRate": 78000,

"bonus": 12000

}

],

"salary": [

123000,

112000,

101000,

90000

]

}

}

With the calculated field placed outside the array, it has no local context when performing the calculation. Therefore, Upsolver takes all the values available and passes them into the

`SUM`

function. This resulted in the following calculation:

$[10000, 78000] + [23000, 12000]$

As mentioned previously, this leads to a Cartesian product of the two arrays before performing the calculation on the resulting pair values:

$[100000, 78000] \times [23000, 12000]$

Thus, our calculated field becomes:

$[100000 + 23000, 100000 + 12000, 78000 + 23000, 78000 +12000]$

$= [123000, 112000, 101000, 90000]$

We can see this is why

`salary`

is an array field; and because we lost the context of our calculation, we also don't know which salary is associated with which employee.To fix this issue, all we need to do is change the context of the calculation.

SELECT data.salaries[].employee as toPay.employee,

data.salaries[].salary as toPay.salary

FROM salaries

LET data.salaries[].salary = data.salaries[].baseRate + data.salaries[].bonus

By changing the target field name from

`salary`

to `data.salaries[].salary`

, we are able to avoid the Cartesian product and compute an array with the 2 salaries we are interested in:{

"data": {

"toPay": {

"employee": [

"Jhon",

"Jacob"

],

"salary": [

123000,

90000

]

}

}

}

However, we still need to match the salary values to the correct employee.

As our query sent the results to the fields

`toPay.employee`

and `toPay.salary`

, these two fields, while both nested under `toPay`

, are not within the same array record.To associate an employee with their corresponding salary, we place the fields within a shared array:

SELECT data.salaries[].employee as toPay[].employee,

data.salaries[].salary as toPay[].salary

FROM salaries

LET data.salaries[].salary = data.salaries[].baseRate + data.salaries[].bonus;

Now that the result fields are within a shared array, the result is as expected:

{

"toPay": [

{

"employee": "Jhon",

"salary": 123000

},

{

"employee": "Jacob",

"salary": 90000

}

]

}

In some cases, the data may arrive in a slightly different format:

{

"data": {

"employee": ["Jhon", "Jacob"],

"baseRate": [100000, 78000],

"bonus": [23000, 12000]

}

}

In this case, we want the first element of each array to be related to the first element of every other array, and likewise for each following element.

Unlike the first example, we can't achieve the desired calculated salary by simply placing the calculation in the correct context as there is no shared context array.

We can, however, create the required context by zipping together the arrays using the

`ZIP`

function:LET salaries = ZIP('employee,baseRate,bonus',

data.employee[],

data.baseRate[],

data.bonus[]);

The

`ZIP`

function takes in an optional comma-separated list of field names. If the field names aren't provided, it defaults to a list of the array inputs (`field1, field2,...`

). The

`ZIP`

function stitches the arrays together on an element-by-element basis: the first element with the first element, the second element with the second element, etc.Using the

`ZIP`

function here allows access to a calculated field that is in the exact same structure as the first example: {

"data": {

"salaries": [

{

"employee": "Jhon",

"baseRate": 100000,

"bonus": 23000

},

{

"employee": "Jacob",

"baseRate": 78000,

"bonus": 12000

}

]

}

}

Then, to define the output, you can use the output query that was used initially:

SELECT

salaries[].employee AS toPay[].employee,

salaries[].salary AS toPay[].salary

FROM "salaries"

LET salaries = ZIP('employee,baseRate,bonus',

data.employee[],

data.baseRate[],

data.bonus[]),

salaries.salary = salaries[].baseRate + salaries[].bonus;

Now that we understand how transformations on arrays work and the role of context and position, we will discuss flattening arrays. This is useful when we want to convert nested objects with arrays into flat tables.

For example, if we have the following event:

{

"values": [1,2,3],

"name": "Oleg",

"id": 123

}

UNNEST(SELECT data.values[] as value,

data.name as name,

data.id as id

FROM my_data_source)

The resulting table looks as follows:

value | name | id |
---|---|---|

1 | Oleg | 123 |

2 | Oleg | 123 |

3 | Oleg | 123 |

As the data was flattened based on the

`values[]`

array that contains three values, the table shows three rows from our one source event.Suppose we have the following data:

{

"data": {

"values": ["apple", "NY"],

"type": ["fruit", "city"]

}

}

If we were to simply

`SELECT`

the arrays as is:UNNEST(SELECT data.values[] as value,

data.type[] as type

FROM my_data_source)

With two independent arrays selected without a shared context, this query results in a Cartesian product:

value | type |
---|---|

apple | fruit |

NY | fruit |

apple | city |

NY | city |

To avoid this, you should use the

`ZIP`

function to combine the arrays into a single context:UNNEST(SELECT zipped[].value as value

zipped[].type as type

FROM my_data_source

LET zipped = ZIP('type,value', data.type[], data.values[]))

Then the result is as expected:

value | type |
---|---|

apple | fruit |

NY | city |

Last modified 7mo ago