Iceberg Schema Evolution
Understand how Upsolver manages changes to your source schema to keep your jobs running successfully.
Schema evolution is critical for systems that need to handle data that can change over time. The rules outlined below provide a structured approach to managing varying types and dimensions of data for a field, as it evolves from one data type to another.
The guiding principle is to have all data saved to storage without losing precision, enabling users to identify cases where their data types differ for a given field. This approach ensures that data integrity is maintained while accommodating changes in data types and structures, and is essential in environments where the schema is dynamic.
All data is written to the table.
Data doesn't lose precision when it is written to the table.
When data is ingested with conflicting data types, it is written in a way that enables users to identify and fix changes. Essentially, Upsolver provides full flexibility to manage dynamic changes.
Assuming the data contains a field with varying types, Upsolver attempts to define a more suitable type. However, when this fails, Upsolver adds an extra field with a type suffix to identify the evolved type. You can use the table's report to understand these changes. For example:
In this case, the column my_field was initially discovered with numeric values and, later on, string values were found. Therefore, a new field named my_field_string was defined and will be populated with all future values. The original field, my_field, will be populated when the values are numeric or, more specifically, whole numbers.
When this situation arises, you can decide how to handle the varying types. For instance, one solution would be to modify queries on this table to use a COALESCE function:
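A minimal sketch of such a query, assuming the table is exposed as my_table (a hypothetical name) and using the my_field / my_field_string columns from the example above; the exact CAST syntax may vary by query engine:

```sql
-- Prefer the evolved string column when it is populated; otherwise fall back
-- to the original numeric column, cast to a string. Table name is hypothetical.
SELECT
  COALESCE(my_field_string, CAST(my_field AS VARCHAR)) AS my_field
FROM my_table;
```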
Initial Field Creation: When a field is encountered for the first time, the system uses the widest type and largest dimension. For example, if the first data type for field a is a three-dimensional array (a[][][]), this becomes the main representation. Please note that when you create a table in Upsolver, you can define fields as part of the table creation. That definition always takes precedence over any fields Upsolver might automatically detect in the data.
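For illustration, here is a minimal sketch of such an explicit definition; the CREATE TABLE form and all names here are assumptions, not Upsolver's exact syntax:

```sql
-- Declaring my_field as string up front means that numeric values discovered
-- later are converted into this column, rather than triggering an evolved
-- my_field_string column. Catalog, schema, and table names are hypothetical.
CREATE TABLE my_catalog.my_schema.my_table (
  my_field string
);
```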
Handling Narrower Types: If a narrower data type is later encountered for an existing field, it is converted to fit into the existing field. For example, if a::long is discovered when a::string is already defined in the table, the long data is converted to a string. See the Reference table below describing how this works.
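A sketch of this behavior with hypothetical values:

```sql
-- Existing column:  a string
-- Incoming event:   {"a": 42}    -- a::long, narrower than string
-- Stored result:    a = '42'     -- converted in place; no new column is created
```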
Wider or Non-Convertible Types: If a wider or non-convertible type is encountered, a new field is created for this type. For example, if a::boolean already exists and a::date is now added, new fields for a_string::string and a_date::date are created. The reasoning behind these two fields is as follows; a sketch of the resulting table appears after the list:
We do not want to alter the existing field, so it should remain as it is and will be populated when Boolean values are found in the data.
a_string::string is added as a common denominator field into which all types of primitives can be written.
a_date::date is added only for date values, as specific type fields are always added to make it easier to address any potential data cleaning.
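Putting the boolean/date example together, the resulting table could be sketched as the following illustrative DDL (Spark-style types; the table name is hypothetical):

```sql
CREATE TABLE example_table (
  a        boolean, -- original field, unchanged; populated by boolean values
  a_string string,  -- common denominator; receives every primitive, converted to string
  a_date   date     -- type-specific field; populated by date values only
);
```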
Records Vs. Arrays Vs. Primitives: The order of precedence when deciding which type receives the original name is as follows:
Regardless of type, an existing field always retains its name. This means that if one task time of the related job sees data with one type, and a later task time encounters another type, the first type has already been used for the column created in Iceberg. That column remains as is, and the next data type might trigger the creation of an additional field.
Records.
Arrays - max dimension and widest type.
Primitives - widest type (a special case of an array with 0 dimensions).
Field Naming Conventions: The naming convention includes a suffix indicating the type and dimension. For instance, coordinates[][][]::bigint becomes coordinates_array3_long.
Names Must Be Non-Empty: Although the Iceberg spec allows empty names (an empty string), many query engines will fail if you have such fields. Therefore, Upsolver will drop any such fields.
Field Documentation: When a field is added with a type suffix, Upsolver also creates field documentation in the form upsolver_evolved_from:<field_name>. For instance, let's say we first see a field named size, with the value 4. We would define it with a type of bigint, which translates to long in Iceberg. Then, during the next task time, the data contains the same field with a value of 2.3. Upsolver then adds a field named size_double with type double, and the field documentation upsolver_evolved_from:size. Upsolver uses this documentation to distinguish between cases where a type suffix exists because that is how the field was defined, e.g. start_date of type date, and cases such as the above example where it was added due to schema evolution.
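The size example would therefore end up with columns along these lines (illustrative DDL; the table name is hypothetical):

```sql
CREATE TABLE example_table (
  size        bigint, -- first value seen (4) -> bigint, i.e. long in Iceberg
  size_double double  -- evolved field, documented as upsolver_evolved_from:size
);
```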
Routing of Values: Values are routed to fields with the exact type and to all other compatible fields. This means converting to a wider type as well as an array with a greater dimension. So, for example, if we have the following types for the same field:
long
date
string
date[]
string[]
Then all values can, and will, be routed to the last type, string[], as the sketch below illustrates.
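To make the routing concrete, here is a hypothetical record and where its value lands in each of the five fields:

```sql
-- Incoming event: {"a": "2024-01-01"} -- a date value; all values hypothetical
--   a::long     -> NULL            (a date is not compatible with long)
--   a::date     -> 2024-01-01      (exact type match)
--   a::string   -> '2024-01-01'    (converted to the wider type)
--   a::date[]   -> [2024-01-01]    (wrapped into the higher dimension)
--   a::string[] -> ['2024-01-01']  (wider type and higher dimension)
```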
Edge Case #1: Handling Double and Long Types: When a new field is encountered for the first time and has both double and long types, only the double field is used, and no extra field for long values is needed. This is a special exception to the other rules.
Edge Case #2: Handling Timestamp and Long Types: Contrary to the general rule of maintaining precision, long values can be converted to timestamps. In practice, this happens when a field first appears as a timestamp and then as a long. Please note, this only refers to the routing of data, meaning long data can be written into timestamp fields when they exist for the same field. When the two types appear together within the same task time for the same field, you still get two fields, unlike the case of long and double. You also end up with two fields if the timestamp field is found before the long field, but not vice versa. For example, task time t has the field started as a timestamp field. Then task time t+1 has it as a long. A new field named started_long will be defined, but started will also be populated.
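The resulting shape for the started example could be sketched as follows (illustrative DDL; the table name is hypothetical):

```sql
CREATE TABLE example_table (
  started      timestamp, -- populated by timestamp values and by converted long values
  started_long bigint     -- evolved field; populated by the long values only
);
```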
The general principle is that types are converted to wider types so precision is not lost. An exception to this rule is allowing bigint types to be converted into timestamp values, because this conversion is ubiquitous.
Furthermore, for some type combinations, such as date and Boolean, neither of the types can be considered wider than the other. In such cases, a varchar type is used as the common denominator.
*While a long value is not narrower than a timestamp, it can be converted into a timestamp.
Example 1: In the first interval of data, the field a[]::long exists. Then, in a later interval, the data contains the field a::string.
The widest type between long and string is string.
The largest dimension between 0, for the primitive a::string, and 1, for the array a[]::long, is 1.
Therefore two new fields are defined:
a_string::string - this will contain future primitive values. Because string is the widest primitive type, even if the data suddenly contains a::date or a::boolean, for example, the value will be converted into a string and stored in this field.
a_array_string[]::string - all future values that are either primitives or arrays of any primitive type with a dimension of 1 will be converted and stored in this field.
Furthermore, the original field still exists and continues to be populated with values.
Future values that can be converted to long and are contained in an array with a dimension of 1, or primitive values with a type that is convertible to long, will be stored in the original field a[]::long.
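Putting this example together as illustrative DDL (Spark-style types; the table name is hypothetical):

```sql
CREATE TABLE example_table (
  a              array<bigint>, -- original field; receives long-convertible values of dimension 1
  a_string       string,        -- future primitive values, converted to string
  a_array_string array<string>  -- all primitives and 1-dimensional arrays, as strings
);
```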
Example 2: All of the following fields arrive together for the first time:
The following fields will be defined, and the largest dimension acquires the original name:
Explanation: the largest dimension takes the original name, but a field for each discovered type is defined.
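The original field list is not shown on this page, so here is a purely hypothetical instance of the rule: suppose a::long, a[]::long, and a[][]::long arrive together in the first batch. The field names below follow the suffix convention described earlier and are assumptions:

```sql
CREATE TABLE example_table (
  a             array<array<bigint>>, -- the largest dimension takes the original name
  a_long        bigint,               -- type-specific field for the primitive values
  a_array1_long array<bigint>         -- type-specific field for the 1-dimensional values
);
```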
Example 3: The following fields arrive together for the first time:
The following fields will be defined, and the widest type can accept all values:
Explanation: we define a field for each type discovered, except when finding a long and a double, in which case only a double is defined.
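Again, the original field list is not shown here, so this is a hypothetical instance: suppose a::long, a::double, and a::string arrive together. long and double collapse into a single double field, and string, as the widest type, takes the original name per the initial-field-creation rule:

```sql
CREATE TABLE example_table (
  a        string, -- widest type; every value can be routed here
  a_double double  -- covers both double and long values; no separate a_long is created
);
```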
Example 4: Data arrives with the following fields:
The following fields are defined:
Explanation: the largest dimension receives all nested fields. Smaller dimensions then have an extra field specific to where the data appeared originally when ingested by Upsolver.
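The field list is again not shown on this page; here is one purely hypothetical reading with records at two dimensions. Suppose r arrives both as a record, {"r": {"x": 1}}, and as an array of records, {"r": [{"y": 2}]}. All names and types below are assumptions for illustration:

```sql
CREATE TABLE example_table (
  r        array<struct<x: bigint, y: bigint>>, -- largest dimension; receives all nested fields
  r_record struct<x: bigint>                    -- extra field for the dimension where x originally appeared
);
```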
Reference

| Original Type | Wider Than | Narrower Than | Incompatible With |
| --- | --- | --- | --- |
| Boolean | Nothing | Long, Double, String | Date, Timestamp |
| Long (Bigint) | Boolean, Timestamp* | Double, String | Date, Timestamp |
| Double | Boolean, Long | String | Date, Timestamp |
| Date | Nothing | Long, Double, Timestamp, String | Boolean |
| Timestamp | Date | Long, Double, String | Boolean |
| String (Varchar) | Everything else | Nothing | Nothing |