MySQL

This article describes how to ingest CDC data from your MySQL database.

Prerequisites

Ensure you have a connection to your MySQL database and that it has been enabled for CDC.

Please read the guide to Enable CDC on MySQL for more information.

Create a CDC job that reads from your database

You can create a job to ingest your data from MySQL into a staging table in the data lake.

Jump to

Ingest to the data lake

After completing the prerequisites, you can create your staging tables. The example below creates a table without defining columns or data types, as these will be inferred automatically by Upsolver, though you can define columns if required:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data()
    PARTITIONED BY $event_date;

Upsolver recommends partitioning by the system column $event_date or another date column within the data in order to optimize your query performance.

Next, you can create an ingestion job as follows:

CREATE SYNC JOB replicate_mysql_tables
AS COPY FROM MYSQL upsolver_mysql_samples 
    TABLE_INCLUDE_LIST = ('samples.orders', 'samples.customers')
INTO default_glue_catalog.upsolver_samples.orders_raw_data;

Note that multiple ingestion jobs can write to the same table, resulting in a final table that contains the UNION ALL of all data copied into that table, meaning that any duplicate rows that are written are not removed and the columns list may expand if new columns are detected.

This may not be your intended behavior, so make sure you are writing to the correct table before running your job.

Job options

The example above only uses a small subset of all job options available when reading from S3. Depending on your use case, you may want to configure a different set of options. For instance, if you're reading from a folder partitioned by date, you may want to use the DATE_PATTERN option.

Transformations can be applied to your ingestion job to correct issues, exclude columns, or mask data before it lands in the target. Furthermore, you can use expectations to define data quality rules on your data stream and take appropriate action.

Alter a job that reads from MySQL

Some job options are considered mutable, enabling you to run a SQL command to alter an existing ingestion job rather than create a new job.

For example, take the job we created earlier:

CREATE SYNC JOB replicate_mysql_tables
AS COPY FROM MYSQL upsolver_mysql_samples 
    TABLE_INCLUDE_LIST = ('samples.orders', 'samples.customers')
INTO default_glue_catalog.upsolver_samples.orders_raw_data;

If you want to keep the job as is, but only change the cluster that is running the job, execute the following command:

ALTER JOB replicate_mysql_tables 
    SET COMPUTE_CLUSTER = "my_new_cluster";

Note that some options such as COMPRESSION cannot be altered once the connection has been created.

Drop a job that reads from MySQL

If you no longer need a job, you can easily drop it using the following SQL command:

DROP JOB replicate_mysql_tables;

Learn More

Please see the SQL command reference for MySQL for full details of the available job options and further examples.

Last updated 1 year ago