Schema detection and evolution
When you create a table in SQLake to stage your data, you have the option of defining certain column names with specified types, or you can create the table with no columns defined at all.
Regardless, the COPY FROM job dynamically expands the staging table's column list as new columns are encountered during ingestion, and SQLake automatically infers each new column's type. The primary difference between explicitly defined columns and dynamically added ones lies in how SQLake treats their data types when new data arrives in an unexpected type.
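To make this concrete, here is a minimal sketch of both approaches; the catalog, database, connection, bucket, and column names are placeholders rather than anything defined by SQLake:

    -- Staging table with one column defined explicitly; any other columns are
    -- added dynamically as they are encountered during ingestion.
    CREATE TABLE default_glue_catalog.demo_db.orders_staging (
      nettotal double
    );

    -- Staging table with no columns defined at all; every column is detected
    -- and its type inferred by SQLake.
    CREATE TABLE default_glue_catalog.demo_db.orders_staging_dynamic();

    -- Ingestion job that copies raw events into the staging table, expanding
    -- its column list as new fields appear in the data.
    CREATE JOB load_orders_staging
      AS COPY FROM S3 demo_s3_connection
        LOCATION = 's3://demo-bucket/orders/'
      INTO default_glue_catalog.demo_db.orders_staging;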
Note that column names are case-insensitive. This means that if you have multiple columns with the same name in different cases (e.g. column_name, Column_Name, COLUMN_NAME, etc.), it is non-deterministic as to which column's data ends up in your table.

For SQLake transformation jobs writing to data lake tables, there is a job option ADD_MISSING_COLUMNS that allows you to configure whether or not the target table's column list should be dynamically expanded based on the data being written. By default, this option is false, meaning that any column being written by the job must already be defined within the target table; otherwise, the job fails. This also means that you cannot have SELECT * within your transformation. The same applies to jobs writing to Snowflake tables, where ADD_MISSING_COLUMNS is not an available job option. On the other hand, when ADD_MISSING_COLUMNS = TRUE, the target table's column list expands as new columns are encountered, in the same way as during data ingestion.
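As a rough sketch, a transformation job that opts in to this behavior might look like the following; the catalog, database, table, and job names are placeholders, and the time-window filter follows the pattern used in SQLake's examples:

    -- Target table created with no columns; with ADD_MISSING_COLUMNS = TRUE the
    -- job adds columns to it as they appear in the query results.
    CREATE TABLE default_glue_catalog.demo_db.orders_transformed();

    CREATE JOB transform_orders
      ADD_MISSING_COLUMNS = TRUE
      AS INSERT INTO default_glue_catalog.demo_db.orders_transformed
      MAP_COLUMNS_BY_NAME
        -- SELECT * is only possible because the target table can be expanded.
        SELECT *
        FROM default_glue_catalog.demo_db.orders_staging
        WHERE $event_time BETWEEN run_start_time() AND run_end_time();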
There are times when you may have unexpected values arrive in your raw data; for example, the values in your staging table's nettotal column have historically all been doubles, but the nettotal from a new event arrives as a string instead. If you had explicitly defined the nettotal column type to be double when creating the staging table, then the column type does not change and the new event's nettotal shows up as null when you query the data. However, SQLake stores all of your data in its original type; this means that you are able to recover the null value by creating a transformation job that casts the nettotal column to a string.
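A sketch of such a recovery job is shown below; the table, job, and output column names are illustrative, and the CAST expression is an assumption, so substitute whichever string-conversion your SQLake version provides if it differs:

    -- Target table for the recovered values, with the output column defined up front.
    CREATE TABLE default_glue_catalog.demo_db.orders_recovered (
      nettotal_as_string string
    );

    CREATE JOB recover_nettotal
      AS INSERT INTO default_glue_catalog.demo_db.orders_recovered
      MAP_COLUMNS_BY_NAME
        -- Casting to string surfaces the values that appear as null under the
        -- explicitly defined double type.
        SELECT CAST(nettotal AS string) AS nettotal_as_string
        FROM default_glue_catalog.demo_db.orders_staging
        WHERE $event_time BETWEEN run_start_time() AND run_end_time();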
If this nettotal column had not been explicitly defined and was instead dynamically added through SQLake's schema detection, with its data type inferred, then the nettotal column is automatically upcast to be a string instead of a double. The order of SQLake's upcasts is as follows:

boolean → date → timestamp → bigint → decimal → double → string
Note that to have a decimal type column, it must be defined when creating the table; SQLake does not infer data to be of that type. Additionally, if a column contains boolean values as well as date or timestamp values, then the column is upcast directly to a string, as there is no appropriate cast between them.
As mentioned earlier, column names in SQLake are case-insensitive, and it is non-deterministic which column's data ends up in your table when multiple columns share the same name in different cases. However, this also means that for any column name change in your raw data, so long as it is only a case-wise change (e.g. from column_name to Column_Name), your staging table remains unaffected. If the change is beyond a case change (e.g. from column_name to columnname), then it is added as a new column in your staging table. Note that you can easily combine these old and new columns by using COALESCE in a transformation job, as sketched below.
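For example, a minimal transformation expression that merges the two columns into one output column (the column and table names are illustrative):

    -- COALESCE returns the first non-null argument, so whichever of the two
    -- variants an event actually populated ends up in the combined column.
    SELECT COALESCE(column_name, columnname) AS column_name_combined
    FROM default_glue_catalog.demo_db.orders_staging;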
When performing a simple select in your transformation job (meaning that a column is only selected, without any additional transformations applied), data that arrives in an unexpected type is handled the same way as for staging jobs.
Take for example the case where the values in your nettotal column have historically all been doubles, but the nettotal from a new event arrives as a string instead. If you had explicitly defined the nettotal column type to be double when creating the target table, then the column type does not change and the new event's nettotal shows up as null when you query the data. However, SQLake stores all of your data in its original type, meaning that you are able to recover the null value by casting the nettotal column to a string in another transformation job. If this nettotal column had not been explicitly defined and was instead dynamically added through SQLake's schema detection, then the nettotal column is automatically upcast to be a string instead of a double, following the same upcast order and caveats described above.
However, when columns that receive unexpected values have additional transformations applied on top of them, SQLake's behavior differs slightly, because certain transformations expect certain data types.
Looking again at our nettotal column example, let's say that the transformation job contains ABS(nettotal). Even if the nettotal column type had not been explicitly defined (meaning that SQLake would typically upcast it to a string due to the new data that arrived), since the ABS function requires a numeric input, SQLake continues to cast the data as a double in order to perform the transformation.
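A sketch of what such an expression looks like inside the job's query (the table name is a placeholder):

    -- ABS needs a numeric input, so SQLake continues to treat nettotal as a
    -- double for this expression rather than upcasting it to a string.
    SELECT ABS(nettotal) AS nettotal_abs
    FROM default_glue_catalog.demo_db.orders_staging;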