Working with arrays

Upsolver supports arrays natively and the data transformations are defined on fields directly.

Understanding how arrays are handled for data transformations will give you better control of your output and assist you in achieving the desired data structure.

Transformations on array fields

In Upsolver, when performing a calculation on inputs that are arrays, the transformation is defined on the field names, and the values of those fields are then passed into the calculation at runtime.

Example 1: Calculation without an array

To start with the simplest calculation possible, let's first look at calculations that do not involve arrays.

Suppose you have the following data:

{
    "data": {
        "value1": 1, 
        "value2": 2
    }
} 

Given the calculation:

data.value1+data.value2\text{data.value1} + \text{data.value2}

Upsolver calculates:

1+2=31 + 2 = 3

Example 2: Calculation with an array

To see what happens when one input is an array, suppose you have the following data:

{
    "data": {
        "value1": [1,2], 
        "value2": 2
    }
}

Given the calculation:

data.value1[]+data.value2\text{data.value1[]} + \text{data.value2}

Upsolver performs the addition operation for each value of the first input with the value of the second input.

As a result, the calculation of [1,2]+2[1,2] + 2 is evaluated as:

[1+2,2+2]=[3,4][1+2, 2+2] = [3,4]

Example 3: Calculation with two arrays

Now suppose both of your inputs are arrays:

{
    "data": {
        "value1": [1,2], 
        "value2": [20,30]
    }
}

Given the calculation:

data.value1[]+data.value2[]\text{data.value1[]} + \text{data.value2[]}

In this case, the following calculation is performed:

[1,2]+[20,30][1,2] + [20,30]

Since two arrays are being added together, Upsolver will perform a Cartesian product on the arrays before performing the addition operation on the resulting pairs.

By taking the Cartesian product, we combine the two arrays into an array of ordered pairs:

[1,2]×[20,30]=[(1,20),(1,30),(2,20),(2,30)][1,2] \times [20,30] = [(1,20), (1,30), (2,20), (2,30)]

Here we ended up with an array of 2×2=42 \times 2 = 4 pairs. In general, Cartesian product calculations result in length(value1[])length(value2[])\text{length(value1[])} \cdot \text{length(value2[])} elements.

Finally, we add the values within each pair together to get our end result:

[1+20,1+30,2+20,2+30]=[21,31,22,32][1+20, 1+30, 2+20, 2+30] = [21,31,22,32]

As Cartesian product calculations are typically undesirable, the next sections describe how to avoid them.

Array context: Define transformations

To demonstrate how Upsolver handles transformations and understand what happens internally, the examples below use hierarchical data and discuss how to flatten this data to a tabular format later on.

Example 1

Assume the following input event:

{    
    "data": {
        "salaries": [{
            "employee": "Jhon",
            "baseRate": 100000,
            "bonus": 23000
        },
        {
            "employee": "Jacob",
            "baseRate": 78000,
            "bonus": 12000
        }]
    }
}

This sample data includes an array of salaries that include the base rate and a bonus value, and we are interested in calculating the final amount to be paid for every employee.

In other words, we want to get an output that looks like this:

{
    "toPay": [{
        "employee": "Jhon",
        "salary": 123000
    },
    {
        "employee": "Jacob",
        "salary": 100000
    }]
}

We use SQL syntax to quickly demonstrate the transformation; the same applies for transformations defined using the UI.

We define the following SELECT statement:

SELECT data.salaries[].employee as toPay.employee, 
       salary[] as toPay.salary 
    FROM my_data_source
    LET salary = data.salaries[].baseRate + data.salaries[].bonus

If you preview the result of the above query, you will find that it is not the desired result:

{
    "toPay": {
        "employee": [
            "Jhon",
            "Jacob"
        ],
        "salary": [
            123000,
            112000,
            101000,
            90000
        ]
    }
}

As you can see, the results is two independent arrays: one for employees and one for salaries. Moreover, the array of salaries has 4 items instead of the 2 items we wanted.

Let's examine the query more carefully to pinpoint the source of the issue.

First of all, since the calculated salary field had 2 fields from within an array as inputs, we selected it as an array salary[]. However, we chose to write it outside of the context of the input array data.salaries[].

In layman's terms, the context of a field is the field's location within the nested structure.

As a result, we actually wrote the array to a field named salary in the root of our object:

{
    "data": {
        "salaries": [{
            "employee": "Jhon",
            "baseRate": 100000,
            "bonus": 23000
        },
        {
            "employee": "Jacob",
            "baseRate": 78000,
            "bonus": 12000
        }
        ],
        "salary": [
            123000,
            112000,
            101000,
            90000
        ]
    }
}

With the calculated field placed outside the array, it has no local context when performing the calculation. Therefore, Upsolver takes all the values available and passes them into the SUM function.

This resulted in the following calculation:

[10000,78000]+[23000,12000][10000, 78000] + [23000, 12000]

As mentioned previously, this leads to a Cartesian product of the two arrays before performing the calculation on the resulting pair values:

[100000,78000]×[23000,12000][100000, 78000] \times [23000, 12000]

Thus, our calculated field becomes:

[100000+23000,100000+12000,78000+23000,78000+12000][100000 + 23000, 100000 + 12000, 78000 + 23000, 78000 +12000]
=[123000,112000,101000,90000]= [123000, 112000, 101000, 90000]

We can see this is why salary is an array field; and because we lost the context of our calculation, we also don't know which salary is associated with which employee.

Solution

To fix this issue, all we need to do is change the context of the calculation.

Step 1: Change where we save the calculated field

SELECT data.salaries[].employee as toPay.employee, 
       data.salaries[].salary as toPay.salary 
    FROM salaries
    LET data.salaries[].salary = data.salaries[].baseRate + data.salaries[].bonus

By changing the target field name from salary to data.salaries[].salary, we are able to avoid the Cartesian product and compute an array with the 2 salaries we are interested in:

{
    "data": {
        "toPay": {
            "employee": [
                "Jhon",
                "Jacob"
            ],
            "salary": [
                123000,
                90000
            ]
        }
    }
}

However, we still need to match the salary values to the correct employee.

Step 2: Add context to calculated values

As our query sent the results to the fields toPay.employee and toPay.salary, these two fields, while both nested under toPay, are not within the same array record.

To associate an employee with their corresponding salary, we place the fields within a shared array:

SELECT data.salaries[].employee as toPay[].employee,
       data.salaries[].salary as toPay[].salary 
    FROM salaries
    LET data.salaries[].salary = data.salaries[].baseRate + data.salaries[].bonus;

Now that the result fields are within a shared array, the result is as expected:

{
    "toPay": [
        {
            "employee": "Jhon",
            "salary": 123000
        },
        {
            "employee": "Jacob",
            "salary": 90000
        }
    ]
}

Example 2: Using the ZIP function

In some cases, the data may arrive in a slightly different format:

{
    "data": {
        "employee": ["Jhon", "Jacob"],
        "baseRate": [100000, 78000],
        "bonus": [23000, 12000]
    }
}

In this case, we want the first element of each array to be related to the first element of every other array, and likewise for each following element.

Unlike the first example, we can't achieve the desired calculated salary by simply placing the calculation in the correct context as there is no shared context array.

We can, however, create the required context by zipping together the arrays using the ZIP function:

LET salaries = ZIP('employee,baseRate,bonus', 
               data.employee[], 
               data.baseRate[], 
               data.bonus[]);

The ZIP function takes in an optional comma-separated list of field names. If the field names aren't provided, it defaults to a list of the array inputs (field1, field2,...).

The ZIP function stitches the arrays together on an element-by-element basis: the first element with the first element, the second element with the second element, etc.

Using the ZIP function here allows access to a calculated field that is in the exact same structure as the first example:

{
    "data": {
        "salaries": [
            {
                "employee": "Jhon",
                "baseRate": 100000,
                "bonus": 23000
            },
            {
            	"employee": "Jacob",
                "baseRate": 78000,
                "bonus": 12000
            }
        ]
    }
}

Then, to define the output, you can use the output query that was used initially:

SELECT 
       salaries[].employee AS toPay[].employee,
       salaries[].salary AS toPay[].salary
  FROM "salaries"  
  LET salaries = ZIP('employee,baseRate,bonus', 
                     data.employee[], 
                     data.baseRate[], 
                     data.bonus[]),
      salaries.salary = salaries[].baseRate + salaries[].bonus;

Flattening data

Now that we understand how transformations on arrays work and the role of context and position, we will discuss flattening arrays. This is useful when we want to convert nested objects with arrays into flat tables.

For example, if we have the following event:

{
    "values": [1,2,3],
    "name": "Oleg",
    "id": 123
}

You can flatten the array by using the UNNEST operator:

UNNEST(SELECT data.values[] as value,
              data.name as name,
              data.id as id
           FROM my_data_source)

The resulting table looks as follows:

valuenameid

1

Oleg

123

2

Oleg

123

3

Oleg

123

As the data was flattened based on the values[] array that contains three values, the table shows three rows from our one source event.

Flatten multiple arrays

Suppose we have the following data:

{
    "data": {
        "values": ["apple", "NY"],
        "type": ["fruit", "city"]
    }
}

If we were to simply SELECT the arrays as is:

UNNEST(SELECT data.values[] as value,
              data.type[] as type
           FROM my_data_source)

With two independent arrays selected without a shared context, this query results in a Cartesian product:

valuetype

apple

fruit

NY

fruit

apple

city

NY

city

To avoid this, you should use the ZIPfunction to combine the arrays into a single context:

UNNEST(SELECT zipped[].value as value
              zipped[].type as type
           FROM my_data_source
           LET zipped = ZIP('type,value', data.type[], data.values[]))

Then the result is as expected:

valuetype

apple

fruit

NY

city

Last updated