Iceberg Schema Evolution
Understand how Upsolver manages changes to your source schema to keep your jobs running successfully.
Schema evolution is critical for systems that need to handle data that can change over time. The rules outlined below provide a structured approach to managing varying types and dimensions of data for a field, as it evolves from one data type to another.
The guiding principle is to have all data saved to storage without losing precision, enabling users to identify cases where their data types differ for a given field. This approach ensures that data integrity is maintained while accommodating changes in data types and structures, and is essential in environments where the schema is dynamic.
All data is written to the table.
Data doesn't lose precision when it is written to the table.
When data is ingested with conflicting data types, it is written in a way that enables users to identify and fix changes. Essentially, Upsolver provides full flexibility to manage dynamic changes.
Assuming the data contains a field with varying types, Upsolver attempts to define a more suitable type. However, when this fails, Upsolver adds an extra field with a type suffix to identify the evolved type. You can use the table's report to understand these changes. For example:
In this case, the column my_field was initially discovered with numeric values and, later on, string values were found. Therefore, a new field named my_field_string was defined and will be populated with all future values. The original field, my_field, will be populated when the values are numeric or, more specifically, whole numbers.
When this situation arises, you can decide how to handle the varying types. For instance, one solution would be to modify queries on this table to use a COALESCE function:
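A minimal sketch of such a query, assuming the table is exposed as my_table (a hypothetical name) and using the my_field / my_field_string columns from the example above; the exact CAST syntax may vary by query engine:

```sql
-- Prefer the evolved string column when it is populated; otherwise fall back
-- to the original numeric column, cast to a string. Table name is hypothetical.
SELECT
  COALESCE(my_field_string, CAST(my_field AS VARCHAR)) AS my_field
FROM my_table;
```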
Initial Field Creation: When a field is encountered for the first time, the system uses the widest type and largest dimension. For example, if the first data type for field a is a three-dimensional array (a[][][]), this becomes the main representation. Please note that when you create a table in Upsolver, you can define fields as part of the table creation. That definition always takes precedence over any fields Upsolver might automatically detect in the data.
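For illustration, here is a minimal sketch of such an explicit definition; the CREATE TABLE form and all names here are assumptions, not Upsolver's exact syntax:

```sql
-- Declaring my_field as string up front means that numeric values discovered
-- later are converted into this column, rather than triggering an evolved
-- my_field_string column. Catalog, schema, and table names are hypothetical.
CREATE TABLE my_catalog.my_schema.my_table (
  my_field string
);
```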
Handling Narrower Types: If a narrower data type is later encountered for an existing field, it is converted to fit into the existing field. For example, if a::long is discovered when a::string is already defined in the table, the long data is converted to a string. See the Reference table below describing how this works.
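A sketch of this behavior with hypothetical values:

```sql
-- Existing column:  a string
-- Incoming event:   {"a": 42}    -- a::long, narrower than string
-- Stored result:    a = '42'     -- converted in place; no new column is created
```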
Wider or Non-Convertible Types: If a wider or non-convertible type is encountered, a new field is created for this type. For example, if a::boolean already exists and a::date is now added, new fields for a_string::string and a_date::date are created. The reasoning behind these two fields is as follows; a sketch of the resulting table appears after the list:
We do not want to alter the existing field, so it should remain as it is and will be populated when Boolean values are found in the data.
a_string::string is added as a common denominator field into which all types of primitives can be written.
a_date::date is added only for date values, as specific type fields are always added to make it easier to address any potential data cleaning.
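Putting the boolean/date example together, the resulting table could be sketched as the following illustrative DDL (Spark-style types; the table name is hypothetical):

```sql
CREATE TABLE example_table (
  a        boolean, -- original field, unchanged; populated by boolean values
  a_string string,  -- common denominator; receives every primitive, converted to string
  a_date   date     -- type-specific field; populated by date values only
);
```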
Records Vs. Arrays Vs. Primitives: The order of precedence when deciding which type receives the original name is as follows:
Regardless of type, an existing field always retains its name. This means that if one task time of the related job sees data with one type, and a later task time encounters another type, the first type has already been used for the column created in Iceberg. That column remains as is, and the next data type might trigger the creation of an additional field.
Records.
Arrays - max dimension and widest type.
Primitives - widest type (a special case of an array with 0 dimensions).
Field Naming Conventions: The naming convention includes a suffix indicating the type and dimension. For instance, coordinates[][][]::bigint becomes coordinates_array3_long.
Names Must Be Non-Empty: Although the Iceberg spec allows empty names (an empty string), many query engines will fail if you have such fields. Therefore, Upsolver will drop any such fields.
Field Documentation: When a field is added with a type suffix, Upsolver also creates field documentation in the form upsolver_evolved_from:<field_name>. For instance, let's say we first see a field named size, with the value 4. We would define it with a type of bigint, which translates to long in Iceberg. Then, during the next task time, the data contains the same field with a value of 2.3. Upsolver then adds a field named size_double with type double, and the field documentation upsolver_evolved_from:size. Upsolver uses this documentation to distinguish between cases where a type suffix exists because that is how the field was defined, e.g. start_date of type date, and cases such as the above example where it was added due to schema evolution.
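The size example would therefore end up with columns along these lines (illustrative DDL; the table name is hypothetical):

```sql
CREATE TABLE example_table (
  size        bigint, -- first value seen (4) -> bigint, i.e. long in Iceberg
  size_double double  -- evolved field, documented as upsolver_evolved_from:size
);
```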
Routing of Values: Values are routed to fields with the exact type and to all other compatible fields. This means converting to a wider type as well as an array with a greater dimension. So, for example, if we have the following types for the same field:
long
date
string
date[]
string[]
Then all values can, and will, be routed to the last type, string[], as the sketch below illustrates.
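To make the routing concrete, here is a hypothetical record and where its value lands in each of the five fields:

```sql
-- Incoming event: {"a": "2024-01-01"} -- a date value; all values hypothetical
--   a::long     -> NULL            (a date is not compatible with long)
--   a::date     -> 2024-01-01      (exact type match)
--   a::string   -> '2024-01-01'    (converted to the wider type)
--   a::date[]   -> [2024-01-01]    (wrapped into the higher dimension)
--   a::string[] -> ['2024-01-01']  (wider type and higher dimension)
```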
Edge Case #1: Handling Double and Long Types: When a new field is encountered for the first time and has both double and long types, only the double field is used, and no extra field for long values is needed. This is a special exception to the other rules.
Edge Case #2: Handling Timestamp and Long Types: Contrary to the general rule of maintaining precision, long values can be converted to timestamps. In practice, this happens when a field first appears as a timestamp and then as a long. Please note, this only refers to the routing of data, meaning long data can be written into timestamp fields when they exist for the same field. When the two types appear together within the same task time for the same field, you still get two fields, unlike the case of long and double. You also end up with two fields if the timestamp field is found before the long field, but not vice versa. For example, task time t has the field started as a timestamp field. Then task time t+1 has it as a long. A new field named started_long will be defined, but started will also be populated.
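The resulting shape for the started example could be sketched as follows (illustrative DDL; the table name is hypothetical):

```sql
CREATE TABLE example_table (
  started      timestamp, -- populated by timestamp values and by converted long values
  started_long bigint     -- evolved field; populated by the long values only
);
```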
The general principle is that types are converted to wider types so precision is not lost. An exception to this rule is allowing bigint types to be converted into timestamp values, because this conversion is ubiquitous.
Furthermore, for some type combinations, such as date and Boolean, neither of the types can be considered wider than the other. In such cases, a varchar type is used as the common denominator.
*While a long value is not narrower than a timestamp, it can be converted into a timestamp.
Example 1: In the first interval of data, the field a[]::long exists. Then, in a later interval, the data contains the field a::string.
The widest type between long and string is string.
The largest dimension between 0, for the primitive a::string, and 1, for the array a[]::long, is 1.
Therefore two new fields are defined:
a_string::string - this will contain future primitive values. Because string is the widest primitive type, even if the data suddenly contains a::date or a::boolean, for example, the value will be converted into a string and stored in this field.
a_array_string[]::string - all future values that are either primitives or arrays of any primitive type with a dimension of 1 will be converted and stored in this field.
Furthermore, the original field still exists and continues to be populated with values.
Future values that can be converted to long and are contained in an array with a dimension of 1, or primitive values with a type that is convertible to long, will be stored in the original field a[]::long.
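Putting this example together as illustrative DDL (Spark-style types; the table name is hypothetical):

```sql
CREATE TABLE example_table (
  a              array<bigint>, -- original field; receives long-convertible values of dimension 1
  a_string       string,        -- future primitive values, converted to string
  a_array_string array<string>  -- all primitives and 1-dimensional arrays, as strings
);
```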
Example 2: All of the following fields arrive together for the first time:
The following fields will be defined, and the largest dimension acquires the original name:
Explanation: the largest dimension takes the original name, but a field for each discovered type is defined.
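The original field list is not shown on this page, so here is a purely hypothetical instance of the rule: suppose a::long, a[]::long, and a[][]::long arrive together in the first batch. The field names below follow the suffix convention described earlier and are assumptions:

```sql
CREATE TABLE example_table (
  a             array<array<bigint>>, -- the largest dimension takes the original name
  a_long        bigint,               -- type-specific field for the primitive values
  a_array1_long array<bigint>         -- type-specific field for the 1-dimensional values
);
```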
Example 3: The following fields arrive together for the first time:
The following fields will be defined, and the widest type can accept all values:
Explanation: we define a field for each type discovered, except when finding a long and a double, in which case only a double is defined.
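Again, the original field list is not shown here, so this is a hypothetical instance: suppose a::long, a::double, and a::string arrive together. long and double collapse into a single double field, and string, as the widest type, takes the original name per the initial-field-creation rule:

```sql
CREATE TABLE example_table (
  a        string, -- widest type; every value can be routed here
  a_double double  -- covers both double and long values; no separate a_long is created
);
```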
Example 4: Data arrives with the following fields:
The following fields are defined:
Explanation: the largest dimension receives all nested fields. Smaller dimensions then have an extra field specific to where the data appeared originally when ingested by Upsolver.
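The field list is again not shown on this page; here is one purely hypothetical reading with records at two dimensions. Suppose r arrives both as a record, {"r": {"x": 1}}, and as an array of records, {"r": [{"y": 2}]}. All names and types below are assumptions for illustration:

```sql
CREATE TABLE example_table (
  r        array<struct<x: bigint, y: bigint>>, -- largest dimension; receives all nested fields
  r_record struct<x: bigint>                    -- extra field for the dimension where x originally appeared
);
```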
Reference

| Original Type | Wider Than | Narrower Than | Incompatible With |
| --- | --- | --- | --- |
| Boolean | Nothing | Long, Double, String | Date, Timestamp |
| Long (Bigint) | Boolean, Timestamp* | Double, String | Date, Timestamp |
| Double | Boolean, Long | String | Date, Timestamp |
| Date | Nothing | Long, Double, Timestamp, String | Boolean |
| Timestamp | Date | Long, Double, String | Boolean |
| String (Varchar) | Everything else | Nothing | Nothing |