MySQL
This article describes how to ingest CDC data from your MySQL source database.
Prerequisites for MySQL
Permissions
In order for Upsolver to read the binlog and initial state of the database, the CDC data source requires the following permissions:
SELECT
(on the tables that are to be loaded, usually all tables)REPLICATION CLIENT
REPLICATION SLAVE
RELOAD
SHOW DATABASES
(this is only necessary if the server was started with the --skip-show-database option)
For more information about creating a user with the correct permissions, see this article: https://debezium.io/documentation/reference/connectors/mysql.html#setting-up-mysql
Enabled binlog
The binlog must be enabled in ROW
mode. You can follow these instructions to check if it's enabled and configure it if you are using your own MySQL server: https://debezium.io/documentation/reference/connectors/mysql.html#enable-mysql-binlog
If you are using AWS RDS you will need to:
Ensure that Automated Backups are enabled. If not, the binlog will be disabled regardless of the parameter group settings.
Change the binlog mode to
ROW
in the parameter group settings for the cluster:Create a new parameter group based on the base group that matches your database version.
Change the
BINGLOG_FORMAT
toROW
Change the database cluster to use the new parameter group and apply the changes.
Ensure binlog duration is long enough
Upsolver constantly reads the binlog's latest events, so generally the binlog retention only needs to be set high enough to handle server interruptions or situations where the data source or compute cluster is paused in Upsolver. It is therefore recommended to set the binlog retention to 1 day, and generally not recommended to set it below 3 hours.
Using AWS RDS, this can be achieved using the following command:
call mysql.rds_set_configuration('binlog retention hours', 24);
Please consult the documentation for your versions of MySQL for custom deployments. For more information regarding MySQL connections, see Connect to your MySQL database.
To learn more about how tables work in Upsolver, see SQL tables.
Create a CDC job that reads from your database
You can create a job to ingest your data from MySQL into a staging table in the data lake.
Jump to
Ingest to the data lake
After completing the prerequisites, you can create your staging tables. The example below creates a table without defining columns or data types, as these will be inferred automatically by Upsolver, though you can define columns if required:
Create a staging table to store the data ingested from MySQL.
Upsolver recommends partitioning by the system column $event_date
or another date column within the data in order to optimize your query performance.
Next, you can create an ingestion job as follows:
Create an ingestion job to copy data from MySQL into a staging table in the data lake.
Note that multiple ingestion jobs can write to the same table, resulting in a final table that contains the UNION ALL
of all data copied into that table, meaning that any duplicate rows that are written are not removed and the columns list may expand if new columns are detected.
This may not be your intended behavior, so make sure you are writing to the correct table before running your job.
The example above only uses a small subset of all job options available when reading from S3. Depending on your use case, you may want to configure a different set of options. For instance, if you're reading from a folder partitioned by date, you may want to use the DATE_PATTERN
option.
For the full list of job options with syntax and detailed descriptions, see the databases page for CDC options.
After your data has been ingested into your staging table, you are ready to move on to the next step of building your data pipeline: transforming your data and writing it to your intended target locations.
Job options
Transformations can be applied to your ingestion job to correct issues, exclude columns, or mask data before it lands in the target. Furthermore, you can use expectations to define data quality rules on your data stream and take appropriate action.
For more information, see the Ingestion jobs page, which describes the available job options and includes examples.
Alter a job that reads from MySQL
Some job options are considered mutable, enabling you to run a SQL command to alter an existing ingestion job rather than create a new job.
For example, take the job we created earlier:
Create an ingestion job to copy data from MySQL into a staging table in the data lake.
If you want to keep the job as is, but only change the cluster that is running the job, execute the following command:
Note that some options such as COMPRESSION
cannot be altered once the connection has been created.
To check which job options are mutable, see MySQL.
Drop a job that reads from MySQL
If you no longer need a job, you can easily drop it using the following SQL command:
Last updated