MySQL CDC data source
This article provides an introduction to how Upsolver works with MySQL change data capture data sources.
Last updated
Was this helpful?
This article provides an introduction to how Upsolver works with MySQL change data capture data sources.
Last updated
Was this helpful?
MySQL database versions 5.6+ are supported. This includes managed DB instances hosted by AWS RDS.
In order for Upsolver to read the binlog and the initial state of the database the CDC data source requires the following permissions:
SELECT
(On the tables that are to be loaded, usually all tables)
REPLICATION CLIENT
REPLICATION SLAVE
RELOAD
SHOW DATABASES
(this is only necessary if the server was started with the --skip-show-database option)
For more information about creating such a user see here:
The binlog must be enabled in ROW
mode. You can follow these instructions to check if it's enabled and configure it if you are using your own MySQL server:
If you are using AWS RDS you will need to:
Change the binlog mode to ROW
in the parameter group settings for the cluster:
Create new parameter group based on the base group that matches your DB version.
Change BINGLOG_FORMAT
to ROW
Change the DB Cluster to use the new parameter group and apply the changes.
Upsolver constantly reads the binlog's latest events, so generally the binlog retention only needs to be set high enough to handle server interruptions or situations where the data source or compute cluster is paused in Upsolver. It is therefore recommended to set the binlog retention to 1 day, and generally not recommended to set it below 3 hours.
Using AWS RDS this can be achieved using the following command:
call mysql.rds_set_configuration('binlog retention hours', 24);
For custom deployments please consult the documentation for your versions of MySQL.
1. Click on DATA SOURCES and click on NEW on the upper right hand corner
2. SELECT MySQL from the Change Data Capture group
3. Enter your connection string for MySQL database. Make sure to use the right port, user name and password. Click on TEST CONNECTION and make sure the green check mark appears.
3. (Optional) Choose the schema(s) and table(s) that you want to read from. The list supports regular expression. You may also exclude columns from your tables.
4. (Optional) define destination information. This will replicate the database automatically to the target data lake. TABLE PREFIX will show up as part of your destination table name. For example, if you input staging_
for a table named orders
. The destination table name will be staging_orders
5. Give a name to your data source and data output. This will be the name of the entity in Upsolver. Click on CONTINUE.
You may monitor the ingestion progress by clicking on the CDC Data source > MONITORING > SNAPSHOT STATUS
1. Click on OUTPUTS and NEW on the upper right hand corner.
2. Select the Athena data output.
3. Give the output a name. Choose the CDC data source that you've defined from the previous section. Make sure to choose Hierarchical and click on NEXT.
4. Click on the SQL view and add the data.* as *
to your SQL statement.
5. Click back to the UI view and click on More and select Manage Multi Table
6. Select the column that you want to use as the key for splitting data into multiple tables.Click on CLOSE.
7. Click on RUN on the upper right hand corner
8. Fill out your run parameters and make sure to define TABLE NAME prefix. If the MySQL table is named orders
, it will show up as split_orders
in Athena with the example given below. Click on NEXT.
9. Define how you would like to run this deployment. Keep in mind that leaving ENDING AT
as Never
will continuously read updates from your MySQL source.
10. Wait for the data to be written to Athena and you're done!
Make sure that are enabled. If they are not enabled the binlog will be disabled regardless of the parameter group settings.