CDC Known Limitations

This article provides an in-depth look at the specific limitations associated with using Upsolver for Change Data Capture (CDC) to replicate data into data lakes or Snowflake. Understanding these limitations is crucial for effectively managing data workflows and ensuring data integrity.

Limitations

1. Empty Tables

Tables that are empty, containing no rows, will not be replicated. Ensure tables have data if replication is necessary.

2. Null-Value Columns

Columns populated entirely by null values will not be replicated. This may cause schema discrepancies between the source and target environments.

3. Conflicting Data Types Across Tables

In scenarios where multiple tables have columns with the same name but different data types, conflicts can occur during replication:

If both tables are updated simultaneously, and one column type is date while another is int, the column will be replicated as date in the target environment. This conflict also applies to columns with types long and timestamp.

4. Nested Column Type Limitations

Nested data types(like JSON) come with several limitations:

Missing Fields: JSON fields that are null are omitted during replication.
Nulls in Arrays: Null values within JSON arrays are skipped.
Empty Arrays: Empty arrays or arrays with only nulls will be treated as null.
Type Casting in Arrays: Arrays containing elements of different types will be cast to varchar (e.g., [1, 'str'] becomes ['1', 'str']).

5. TOAST Values

For PostgreSQL sources, fields stored as TOAST require a full replica identity for replication, affecting large data fields.

6. Non-Replication of Default Values

Default values defined in database schemas are not replicated. This can affect how data appears in the target system if defaults are relied upon.

7. Data Type Upcasting

Upsolver converts original data types to a set of supported primitive types:

Integer Types: All are mapped to bigint.
Floating Point and Decimal Types: All are mapped to double.
Decimal Type is mapped to double.

Upsolver supported primitive types are:

String
Bigint
Double
Boolean
Date
Timestamp (milliseconds precision)

8. Unsupported Truncate Events

Truncate operations, which delete all rows in a table, are not supported in CDC replication.

9. Commit Synchronization

There is no mechanism to synchronize commits between target tables, which may result in temporal data discrepancies.

10. Handling Changes in Column Types

A change in a source column's data type results in the creation of a new column in the target. For example, if a column changes from bigint to varchar, a new column with a suffix (e.g., col_string) will be created for the varchar values.

See Schema Evolution for more details.

11. Dropped and Renamed Columns

Dropped Columns: Columns dropped in the source are not dropped in the target, potentially leading to outdated schema representations.
Renamed Columns: Renamed columns will not be renamed in the target; instead, an additional column with the new name will be created.

12. Column Order

The order of the created columns in the target tables is not guaranteed and may not match the order in the source tables.

Conclusion

These limitations highlight the challenges and considerations when using Upsolver for CDC with data lakes or Snowflake. Planning and understanding these constraints is essential for effective data management and integration strategies, ensuring that the replicated data is accurate and consistent with business needs.

Last updated 10 months ago