Redshift Schema Evolution
Understand how Upsolver manages changes to your source schema to keep your jobs running successfully.
When writing data from Upsolver to Amazon Redshift, schema evolution is enabled for the following job types:

- **Ingestion job**: writes data directly to a Redshift table using the `COPY FROM` syntax.
- **Transformation job**: uses a `SELECT *` expression mapped to the root of the target table.
- **Replication job**: continuously replicates CDC data to Redshift.
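For instance, a transformation job that maps `SELECT *` to the root of the target might look like the following sketch. This is an illustrative outline only: the connection, schema, and table names (`my_redshift_connection`, `raw`, `orders`) are hypothetical, and the exact job options in your Upsolver environment may differ.

```sql
-- Hypothetical transformation job: SELECT * mapped to the root of the
-- Redshift target, which allows Upsolver to evolve the table schema
-- as new columns appear in the source.
CREATE JOB load_orders_to_redshift
AS INSERT INTO REDSHIFT my_redshift_connection.prod.orders
SELECT *
FROM my_datalake.raw.orders;
```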
Schema evolution in Upsolver means that the system will dynamically create and modify Redshift tables to accommodate changes in the incoming schema. Specifically, Upsolver handles the addition of new columns when new data types are encountered.
Upsolver maps its primitive types to Redshift column types as follows:

| Upsolver Type | Redshift Column Type |
|---|---|
| DOUBLE | FLOAT8 |
| BOOLEAN | BOOLEAN |
| STRING | VARCHAR (max size 65535) |
| TIMESTAMP | TIMESTAMP |
| DATE | DATE |
| BIGINT | BIGINT |

Nested fields and arrays are mapped to the Redshift type `SUPER`.
A key challenge in schema evolution is resolving data type conflicts. For instance, if a field initially ingested as `BIGINT` later arrives as `VARCHAR`, Upsolver must handle the discrepancy.

Upsolver resolves data type conflicts in the following way:

- If the new type is narrower than the original type (e.g., `VARCHAR` to `BIGINT`), the value is inserted into the original column, provided casting is possible.
- If the new type is wider than the original type (e.g., `BIGINT` to `VARCHAR`), a new column of the new type is created with a suffix indicating the new type (e.g., `_varchar` for string data). The new value is written to the new column, and also to the previous column if a cast is possible. For example, when `BIGINT` widens to `VARCHAR` and the incoming value is `'2'`, it is also written as `2` to the original `BIGINT` column.
Initial data: `{"my_field": 1.2}`

- Redshift column: `my_field` of type `FLOAT8`.

New data: `{"my_field": "str"}`

- Redshift columns:
  - `my_field` remains `FLOAT8`, with `NULL` for the new string data.
  - `my_field_varchar` is created to store the value `"str"`.
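When a conflict splits a field across two columns like this, downstream queries can merge the values back into one. The following is a sketch in standard Redshift SQL, using the hypothetical column and table names from the example above:

```sql
-- Merge the original column and the suffixed column created by a
-- type conflict. COALESCE keeps the original value when present and
-- falls back to the string form written to my_field_varchar.
SELECT
    COALESCE(my_field::VARCHAR, my_field_varchar) AS my_field_merged
FROM my_schema.my_table;
```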
Other types will be cast using Redshift's `CAST` expression. The following type pairs will not be cast:

- Float8 - Timestamp
- Float8 - Date
- Float8 - Boolean
- Boolean - Date
- Boolean - Timestamp
| From Type | To Type | Cast Expression |
|---|---|---|
| BigInt | Timestamp | `TIMESTAMP 'epoch' + source_column * INTERVAL '0.001 second'` |
| Timestamp | BigInt | `EXTRACT(EPOCH FROM source_column)::BIGINT * 1000 + EXTRACT(MILLISECOND FROM source_column)` |
| BigInt | Date | `'1970-01-01'::DATE + source_column * INTERVAL '1 day'` |
| Date | BigInt | `(source_column - '1970-01-01'::DATE)::BIGINT` |
| Float8 | BigInt | `CASE WHEN source_column >= ${Long.MinValue} AND source_column <= ${Long.MaxValue} THEN source_column::BIGINT END` |
| Super | Varchar | `JSON_SERIALIZE(source_column)` |
| Boolean | Varchar | `CASE source_column WHEN true THEN 'true' ELSE 'false' END` |
| Varchar | BigInt | `CASE WHEN source_column ~ '^[0-9]+$' THEN source_column::BIGINT ELSE NULL END` |
| Varchar | Float8 | `CASE WHEN source_column ~ '^[+-]?[0-9]*\.?[0-9]+$' THEN source_column::FLOAT8 ELSE NULL END` |
| Varchar | Timestamp | `CASE WHEN source_column ~ '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$' THEN source_column::TIMESTAMP ELSE NULL END` |
| Varchar | Date | `CASE WHEN source_column ~ '^\d{4}-\d{2}-\d{2}$' THEN source_column::DATE ELSE NULL END` |
| Varchar | Boolean | `CASE LOWER(source_column) WHEN 'true' THEN true WHEN 'false' THEN false END` |
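As an illustration of the BigInt/Timestamp expressions above, the BigInt values are treated as epoch milliseconds. The following standalone Redshift SQL (not Upsolver job syntax) applies the same expressions to literal values:

```sql
-- BigInt -> Timestamp: interpret a BIGINT as epoch milliseconds,
-- using the same expression as the cast table above.
SELECT TIMESTAMP 'epoch' + 1700000000000 * INTERVAL '0.001 second' AS ts;

-- Timestamp -> BigInt: convert a TIMESTAMP back to epoch milliseconds.
SELECT EXTRACT(EPOCH FROM TIMESTAMP '2023-11-14 22:13:20')::BIGINT * 1000 AS epoch_ms;
```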