Redshift Schema Evolution

Understand how Upsolver manages changes to your source schema to keep your jobs running successfully.

When integrating Upsolver with Amazon Redshift, schema evolution applies to the following job types:

  1. Ingestion Job: Directly writing data to a Redshift table.

  2. Transformation Job: Using a SELECT * expression mapped to the root, e.g.,

INSERT INTO redshift_connection.schema_name.table_name MAP_COLUMNS_BY_NAME SELECT * FROM source_table

  3. Replication Job: Continuously syncing data between sources.

Schema Evolution Overview

Schema evolution in Upsolver means that the system will dynamically create and modify Redshift tables to accommodate changes in the incoming data schema. Specifically, Upsolver handles the addition of new columns as new data types are encountered.

Type Mapping

Upsolver maps its primitive types to Redshift column types as follows:

| Upsolver Type | Redshift Column Type |
| --- | --- |
| DOUBLE | FLOAT8 |
| BOOLEAN | BOOLEAN |
| STRING | VARCHAR (max size 65535) |
| TIMESTAMP | TIMESTAMP |
| DATE | DATE |
| BIGINT | BIGINT |
Nested fields and arrays are mapped to the Redshift type SUPER.
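For illustration, ingesting a record such as {"id": 1, "price": 9.5, "name": "abc", "tags": ["a", "b"]} would produce a table along these lines (a sketch only; the column names come from the record, and this is not the literal DDL Upsolver issues):

```sql
-- Hypothetical table generated from the record above:
CREATE TABLE schema_name.table_name (
    id    BIGINT,          -- Upsolver BIGINT
    price FLOAT8,          -- Upsolver DOUBLE
    name  VARCHAR(65535),  -- Upsolver STRING
    tags  SUPER            -- arrays and nested fields map to SUPER
);
```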

Handling Conflicts

One of the key challenges in schema evolution is resolving data type conflicts. For instance, if a field initially ingested as BIGINT changes to VARCHAR, Upsolver must handle this discrepancy.

Upsolver resolves data type conflicts in the following way:

  • If the new type is narrower than the original type (e.g., Varchar to Bigint), the value will be inserted into the original column if casting is possible.

  • If the new type is wider than the original type (e.g., Bigint to Varchar), a new column of the new type is created, suffixed with the type name (e.g., _varchar for string data). The value is written to the new column, and also to the original column if it can be cast. For example, for Bigint to Varchar, if the Varchar value is '2', it is also written as 2 to the original Bigint column.

Example

  1. Initial Data: {"my_field": 1.2}

    • Redshift Column: my_field of type FLOAT8.

  2. New Data: {"my_field": "str"}

    • Redshift Columns:

      • my_field remains FLOAT8, with NULL for the new string data.

      • my_field_varchar is created to store the value "str".
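The evolution above corresponds roughly to the following DDL (a sketch of the changes Upsolver applies, not the literal statements it executes):

```sql
-- Step 1: the first record creates the column with the observed type.
ALTER TABLE schema_name.table_name ADD COLUMN my_field FLOAT8;

-- Step 2: a string value arrives; a suffixed column is added for the wider type.
ALTER TABLE schema_name.table_name ADD COLUMN my_field_varchar VARCHAR(65535);
-- "str" is written to my_field_varchar; my_field receives NULL because
-- 'str' cannot be cast to FLOAT8.
```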

Special Cast Handling

| From Type | To Type | Cast Expression |
| --- | --- | --- |
| BigInt | Timestamp | TIMESTAMP 'epoch' + source_column * INTERVAL '0.001 second' |
| Timestamp | BigInt | EXTRACT(EPOCH FROM source_column)::BIGINT * 1000 + EXTRACT(MILLISECOND FROM source_column) |
| BigInt | Date | '1970-01-01'::DATE + source_column * INTERVAL '1 day' |
| Date | BigInt | (source_column - '1970-01-01'::DATE)::BIGINT |
| Float8 | BigInt | CASE WHEN source_column >= ${Long.MinValue} AND source_column <= ${Long.MaxValue} THEN source_column::BIGINT END |
| Super | Varchar | JSON_SERIALIZE(source_column) |
| Boolean | Varchar | CASE source_column WHEN true THEN 'true' ELSE 'false' END |
| Varchar | BigInt | CASE WHEN source_column ~ '^[0-9]+$' THEN source_column::BIGINT ELSE NULL END |
| Varchar | Float8 | CASE WHEN source_column ~ '^[+-]?[0-9]*\.?[0-9]+$' THEN source_column::FLOAT8 ELSE NULL END |
| Varchar | Timestamp | CASE WHEN source_column ~ '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$' THEN source_column::TIMESTAMP ELSE NULL END |
| Varchar | Date | CASE WHEN source_column ~ '^\d{4}-\d{2}-\d{2}$' THEN source_column::DATE ELSE NULL END |
| Varchar | Boolean | CASE LOWER(source_column) WHEN 'true' THEN true WHEN 'false' THEN false END |
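For instance, the BigInt-to-Timestamp expression interprets the source value as epoch milliseconds. Run on its own, it behaves like this (illustrative query):

```sql
-- 1700000000000 ms after the Unix epoch:
SELECT TIMESTAMP 'epoch' + 1700000000000 * INTERVAL '0.001 second';
-- → 2023-11-14 22:13:20
```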

Other types will be cast using Redshift's CAST expression.

The following type pairs will not be cast:

  • Float8 - Timestamp

  • Float8 - Date

  • Float8 - Boolean

  • Boolean - Date

  • Boolean - Timestamp
