Schema Detection and Evolution

Learn how Upsolver's schema detection and evolution feature automatically handles changes in your source schema.

Schema detection

Data ingestion

When you create a table in Upsolver to stage your data, you can either define column names with specified types or create the table with no columns at all.

In either case, the ingestion job dynamically expands the staging table's column list as new columns are encountered during ingestion, and Upsolver automatically infers each column's type.
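
For example, a staging table can be created with an empty column list and populated by an ingestion job that copies data from Amazon S3. The following is a minimal sketch; the connection, catalog, schema, and bucket names (my_s3_connection, default_glue_catalog, staging, my-bucket) are placeholders for your own objects:

    CREATE TABLE default_glue_catalog.staging.orders_raw()
        PARTITIONED BY $event_date;

    CREATE SYNC JOB load_orders_raw
        CONTENT_TYPE = JSON
        AS COPY FROM S3 my_s3_connection
            LOCATION = 's3://my-bucket/orders/'
        INTO default_glue_catalog.staging.orders_raw;

Because orders_raw is created with no columns, every column in the source data is added dynamically and its type is inferred.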

The primary difference between explicitly defined columns and dynamically added columns lies in how Upsolver treats their data types when new data arrives with an unexpected type.

See Schema evolution.

Note that column names are case-insensitive. Therefore, if you have multiple columns with the same name in different cases (e.g. column_name, Column_Name, COLUMN_NAME), which column's data ends up in your table is non-deterministic.

Data transformation

For Upsolver transformation jobs writing to data lake tables, there is a job option named ADD_MISSING_COLUMNS that allows you to configure whether or not the target table's column list should be dynamically expanded based on the data being written.

By default, this option is FALSE, meaning that any column written by the job must already be defined in the target table; otherwise, the job fails. Furthermore, you cannot use a SELECT * statement in your transformation job. This behavior also applies to jobs writing to Snowflake tables, where ADD_MISSING_COLUMNS is not an available job option.

However, when ADD_MISSING_COLUMNS is TRUE, the target table's column list expands as new columns are encountered, in the same way as during data ingestion.
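
For example, a transformation job writing to a data lake table might enable the option as follows. This is a sketch with placeholder job, catalog, and table names:

    CREATE SYNC JOB transform_orders
        ADD_MISSING_COLUMNS = TRUE
        RUN_INTERVAL = 1 MINUTE
        AS INSERT INTO default_glue_catalog.analytics.orders_transformed
           MAP_COLUMNS_BY_NAME
        SELECT *
        FROM default_glue_catalog.staging.orders_raw
        WHERE $commit_time BETWEEN run_start_time() AND run_end_time();

With ADD_MISSING_COLUMNS = TRUE, any column returned by the SELECT that does not yet exist in orders_transformed is added to the target table automatically, so the job can also use a SELECT * statement.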

Note that column names are case-insensitive. This means that if you have multiple columns with the same name in different cases (e.g. column_name, Column_Name, COLUMN_NAME), which column's data ends up in your table is non-deterministic.

Schema evolution

Data ingestion

Unexpected data type

At times, unexpected values may arrive in your raw data; for example, the values in your staging table's nettotal column have historically been doubles, but the nettotal in a newly arrived event is a string.

If you had explicitly defined the nettotal column type to be a double when creating the staging table, then the column type does not change and the new event's nettotal shows up as null when you query the data.

However, Upsolver stores all of your data in its original type; this means that you are able to recover the null value by creating a transformation job that casts the nettotal column to a string.
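
For example, a transformation job along the following lines could recover those values. This is a sketch with placeholder names (orderid, the catalog, and the tables are illustrative), and it assumes standard CAST syntax for converting the column to a string:

    CREATE SYNC JOB recover_nettotal
        AS INSERT INTO default_glue_catalog.analytics.orders_nettotal_recovered
           MAP_COLUMNS_BY_NAME
        SELECT orderid,
               nettotal,
               CAST(nettotal AS STRING) AS nettotal_str
        FROM default_glue_catalog.staging.orders_raw
        WHERE $commit_time BETWEEN run_start_time() AND run_end_time();

The nettotal_str column contains the original value for events whose nettotal arrived as a string and would otherwise read as null.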

If the nettotal column had not been explicitly defined and was instead added dynamically through Upsolver's schema detection, with its type inferred, then the column is automatically upcast from a double to a string.

The order of Upsolver's upcast is as follows:

Boolean → date → timestamp → bigint → decimal → double → string

Note that a decimal type column must be defined when creating the table; Upsolver does not infer data to be of that type.
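
For example, the column could be declared explicitly when the staging table is created. This is a sketch with placeholder names, and the precision and scale shown are an assumption for illustration:

    CREATE TABLE default_glue_catalog.staging.orders_raw (
        nettotal decimal(18, 2)
    )
    PARTITIONED BY $event_date;

Any other columns in the source data are still added dynamically during ingestion.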

Additionally, if a column contains Boolean values as well as date or timestamp values, then the column is upcast directly to a string as there is no appropriate cast between them.

Column name change

As mentioned earlier, column names in Upsolver are case-insensitive. Therefore, if you have multiple columns with the same name in different cases (e.g. column_name, Column_Name, COLUMN_NAME), which column's data ends up in your table is non-deterministic.

However, this also means that if a column name in your raw data changes only in case (e.g. from column_name to Column_Name), your staging table remains unaffected.

If the change goes beyond case (e.g. from column_name to columnname), the renamed column is added as a new column in your staging table.

Note that you can easily combine these old and new columns by using COALESCE in a transformation job.
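
For example, a transformation job could merge the two columns back into one. This is a sketch with placeholder names, assuming the raw column was renamed from column_name to columnname:

    CREATE SYNC JOB merge_renamed_column
        AS INSERT INTO default_glue_catalog.analytics.events_clean
           MAP_COLUMNS_BY_NAME
        SELECT COALESCE(columnname, column_name) AS column_name
        FROM default_glue_catalog.staging.events_raw
        WHERE $commit_time BETWEEN run_start_time() AND run_end_time();

COALESCE returns the first non-null value, so events written before and after the rename both populate the single column_name column in the target table.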

Data transformation

Unexpected data type

When performing a simple SELECT statement in your transformation job (meaning that a column is selected without any additional transformations applied), data that arrives in an unexpected type is handled in the same way as in staging jobs.

Take for example the case where values in your nettotal column have historically all been doubles, but suddenly the nettotal from a new event arrives as a string.

If you had explicitly defined the nettotal column type to be double when creating the target table, then the column type does not change and the new event's nettotal shows up as null when you query the data.

However, Upsolver stores all of your data in its original type, meaning that you are able to recover the null value by casting the nettotal column to a string in another transformation job.

If the nettotal column had not been explicitly defined and was instead added dynamically through Upsolver's schema detection, with its type inferred, then the column is automatically upcast from a double to a string.

The order of Upsolver's upcast is as follows:

Boolean → date → timestamp → bigint → decimal → double → string

Note that a decimal type column must be defined when creating the table; Upsolver does not infer data to be of that type.

Additionally, if a column contains Boolean values as well as date or timestamp values, then the column is upcast directly to a string as there is no appropriate cast between them.

However, when columns that receive unexpected values have additional transformations applied on top of them, Upsolver's behavior differs slightly, because certain transformations expect specific data types.

Looking again at our nettotal column example, let's say that the transformation job contains ABS(nettotal). Even if the nettotal column type had not been explicitly defined (meaning that Upsolver would typically upcast it to a string due to the new data that arrived), since the ABS function requires a numeric input, Upsolver continues to cast the data as a double in order to perform the transformation.
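
A sketch of such a job, with placeholder catalog and table names:

    CREATE SYNC JOB compute_abs_nettotal
        AS INSERT INTO default_glue_catalog.analytics.orders_abs
           MAP_COLUMNS_BY_NAME
        SELECT ABS(nettotal) AS abs_nettotal
        FROM default_glue_catalog.staging.orders_raw
        WHERE $commit_time BETWEEN run_start_time() AND run_end_time();

Because ABS requires a numeric argument, Upsolver keeps treating nettotal as a double in this job, even though schema detection would otherwise upcast the column to a string.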
