Upsolver
Contact Support
  • Welcome to Upsolver
  • Getting Started
    • Start using Upsolver for free
    • Get started as a Upsolver user
      • Upsolver in 5 minutes
        • Upsolver Quickstart in 5 minutes
          • Additional sandbox fun
        • Amazon Athena data output
        • MySQL (AWS RDS) data output
        • Local MySQL data output
      • Upsolver free training
        • Introduction to Upsolver
          • Transform and write data to Amazon Athena
          • Pre-aggregate data for efficiency and performance
          • UPSERT streaming data to Amazon Athena
      • Prerequisites for AWS deployment
      • AWS integration
      • Deploy Upsolver on your AWS account
      • Prerequisites for Azure Deployment
      • Azure Integration
        • Prerequisites for Azure Users
        • Log into Upsolver
        • Log into Azure & Authenticate
        • Set Up and Deploy Azure Resources
        • Delegate Resource Group, and Deploy Upsolver in Azure
        • Integrate Azure with Upsolver
    • Upsolver concepts
      • Deployment models
      • Upsolver components
      • Data ingestion
    • Upsolver Amazon AWS deployment guide
      • Private VPC
      • Upsolver VPC
      • AWS role permissions
      • VPC peering
    • Tutorials and FAQ
      • Tutorials
        • How To Re-process Data
        • Create an Amazon S3 data source
        • Create an Amazon Athena data output
        • Join multiple data streams for real-time analytics
        • Use Upsolver to index less data into Splunk
        • Upsert and delete use case
        • AWS S3 to Athena use case
        • Merge data use case
        • Full vs. Partial Inbound Data Records
      • FAQ
      • Infrastructure
        • What is a dry-run cluster?
    • Glossary
      • Language guide
        • SQL syntax reference
        • Functions
          • Aggregation Functions
            • APPROX_COUNT_DISTINCT
            • APPROX_COUNT_DISTINCT_EACH
            • AVG
            • AVG_EACH
            • AVG_TIME_SERIES
            • COLLECT_SET
            • COLLECT_SET_EACH
            • COUNT
            • COUNT(*)
            • COUNT_DISTINCT
            • COUNT_EACH
            • COUNT_IF
            • DECAYED_SUM
            • DYNAMIC_SESSIONS
            • FIRST
            • FIRST_ARRAY
            • FIRST_EACH
            • FIRST_TIME_SERIES
            • LAST
            • LAST_ARRAY
            • LAST_EACH
            • LAST_K
            • LAST_K_EACH
            • LAST_TIME_SERIES
            • MAX
            • MAX_BY
            • MAX_EACH
            • MAX_TIME_SERIES
            • MIN
            • MIN_BY
            • MIN_EACH
            • MIN_TIME_SERIES
            • SESSION_COUNT
            • STD_DEV
            • STD_DEV_EACH
            • STRING_MAX_EACH
            • STRING_MIN_EACH
            • SUM
            • SUM_EACH
            • SUM_TIME_SERIES
            • WEIGHTED_AVERAGE
          • Calculated functions
            • Aerospike functions
            • Array functions
            • Conditional functions
            • Date functions
            • External API functions
            • Filter functions
            • Numeric functions
            • Spatial functions
            • String functions
            • Structural functions
              • ZIP
            • Type conversion functions
      • Data formats
      • Data types and features
      • Database output options
      • Upsolver shards
      • Permissions list
      • Index
    • Troubleshooting
      • My CloudFormation stack failed to deploy
      • My private API doesn't start or I can't connect to it
        • Elastic IPs limit reached
        • EC2 Spot Instance not running
        • DNS cache
        • Security group not open
      • My compute cluster doesn't start
      • I can't connect to my Kafka cluster
      • I can't create an S3 data source
      • Data doesn't appear in Athena table
      • I get an exception when querying my Athena table
      • Unable to define a JDBC (Postgres) connection
  • Connecting data sources
    • Amazon AWS data sources
      • Amazon S3 data source
        • Quick guide: S3 data source
        • Full guide: S3 data source
      • Amazon Kinesis Stream data source
      • Amazon S3 over SQS data source
      • Amazon AppFlow data source
        • Setup Google Analytics client ID and client secret.
    • Microsoft Azure data sources
      • Azure Blob storage data source
      • Azure Event Hubs data source
    • Kafka data source
    • Google Cloud Storage data source
    • File upload data source
    • CDC data sources (Debezium)
      • MySQL CDC data source
        • Binlog retention in MySQL
      • PostgreSQL CDC database replication
    • JDBC data source
    • HDFS data source
    • Data source UI
    • Data source properties
  • Data outputs and data transformation
    • Data outputs
      • Amazon AWS data outputs
        • Amazon S3 data output
        • Amazon Athena data output
          • Quick guide: Athena data output
          • Full guide: Athena data output
          • Output all data source fields to Amazon Athena
        • Amazon Kinesis data output
        • Amazon Redshift data output
        • Amazon Redshift Spectrum data output
          • Connect Redshift Spectrum to Glue Data Catalog
        • Amazon SageMaker data output
      • Data lake / database data outputs
        • Snowflake data output
          • Upsert data to Snowflake
        • MySQL data output
        • PostgreSQL data output
        • Microsoft SQL Server data output
        • Elasticsearch data output
        • Dremio
        • PrestoDB
      • Upsolver data output
      • HDFS data output
      • Google Storage data output
      • Microsoft Azure Storage data output
      • Qubole data output
      • Lookup table data output
        • Lookup table alias
        • API Playground
        • Serialization of EACH aggregations
      • Kafka data output
    • Data transformation
      • Transform with SQL
        • Mapping data to a desired schema
        • Transforming data with SQL
        • Aggregate streaming data
        • Query hierarchical data
      • Work with Arrays
      • View outputs
      • Create an output
        • Modify an output in SQL
          • Quick guide: SQL data transformation
        • Add calculated fields
        • Add filters
        • Add lookups
          • Add lookups from data sources
          • Add lookups from lookup tables
          • Adding lookups from reference data
        • Output properties
          • General output properties
      • Run an output
      • Edit an output
      • Duplicate an output
      • Stop an output
      • Delete an output
  • Guide for developers
    • Upsolver REST API
      • Create a data source
      • Modify a data source
      • API content formats
    • CI/CD on Upsolver
  • Administration
    • Connections
      • Amazon S3 connection
      • Amazon Kinesis connection
      • Amazon Redshift connection
      • Amazon Athena connection
      • Amazon S3 over SQS connection
      • Google Storage connection
      • Azure Blob storage connection
      • Snowflake connection
      • MySQL connection
      • Elasticsearch connection
      • HDFS connection
      • Qubole connection
      • PostgreSQL connection
      • Microsoft SQL Server connection
      • Spotinst Private VPC connection
      • Kafka connection
    • Clusters
      • Cluster types
        • Compute cluster
        • Query cluster
        • Local API cluster
      • Monitoring clusters
      • Cluster tasks
      • Cluster Elastic IPs
      • Cluster properties
      • Uploading user-provided certificates
    • Python UDF
    • Reference data
    • Workspaces
    • Monitoring
      • Credits
      • Delays In Upsolver pipelines
      • Monitoring reports
        • Monitoring system properties
        • Monitoring metrics
    • Security
      • IAM: Identity and access management
        • Manage users
        • Manage groups
        • Manage policies
      • Git integration
      • Single sign-on with SAML
        • Microsoft Azure AD with SAML sign-on
        • Okta with SAML sign-on
        • OneLogin with SAML sign-on
      • AMI security updates
  • Support
    • Upsolver support portal
  • Change log
  • Legal
Powered by GitBook
On this page
  • Supportability
  • Prerequisites
  • Permissions
  • Enabled binlog
  • Ensure binlog duration is long enough
  • Creating a MySQL CDC Data Source with Amazon Athena Output
  • Monitoring
  • Creating an Amazon Athena Output writing to Multiple Tables (split by key)

Was this helpful?

  1. Connecting data sources
  2. CDC data sources (Debezium)

MySQL CDC data source

This article provides an introduction to how Upsolver works with MySQL change data capture data sources.

PreviousCDC data sources (Debezium)NextBinlog retention in MySQL

Last updated 3 years ago

Was this helpful?

Supportability

MySQL database versions 5.6+ are supported. This includes managed DB instances hosted by AWS RDS.

Prerequisites

Permissions

In order for Upsolver to read the binlog and the initial state of the database the CDC data source requires the following permissions:

  • SELECT (On the tables that are to be loaded, usually all tables)

  • REPLICATION CLIENT

  • REPLICATION SLAVE

  • RELOAD

  • SHOW DATABASES (this is only necessary if the server was started with the --skip-show-database option)

For more information about creating such a user see here:

Enabled binlog

The binlog must be enabled in ROW mode. You can follow these instructions to check if it's enabled and configure it if you are using your own MySQL server:

If you are using AWS RDS you will need to:

  1. Change the binlog mode to ROW in the parameter group settings for the cluster:

    1. Create new parameter group based on the base group that matches your DB version.

    2. Change BINGLOG_FORMAT to ROW

    3. Change the DB Cluster to use the new parameter group and apply the changes.

Ensure binlog duration is long enough

Upsolver constantly reads the binlog's latest events, so generally the binlog retention only needs to be set high enough to handle server interruptions or situations where the data source or compute cluster is paused in Upsolver. It is therefore recommended to set the binlog retention to 1 day, and generally not recommended to set it below 3 hours.

Using AWS RDS this can be achieved using the following command:

call mysql.rds_set_configuration('binlog retention hours', 24);

For custom deployments please consult the documentation for your versions of MySQL.

Creating a MySQL CDC Data Source with Amazon Athena Output

1. Click on DATA SOURCES and click on NEW on the upper right hand corner

2. SELECT MySQL from the Change Data Capture group

3. Enter your connection string for MySQL database. Make sure to use the right port, user name and password. Click on TEST CONNECTION and make sure the green check mark appears.

3. (Optional) Choose the schema(s) and table(s) that you want to read from. The list supports regular expression. You may also exclude columns from your tables.

4. (Optional) define destination information. This will replicate the database automatically to the target data lake. TABLE PREFIX will show up as part of your destination table name. For example, if you input staging_ for a table named orders. The destination table name will be staging_orders

5. Give a name to your data source and data output. This will be the name of the entity in Upsolver. Click on CONTINUE.

Monitoring

You may monitor the ingestion progress by clicking on the CDC Data source > MONITORING > SNAPSHOT STATUS

Creating an Amazon Athena Output writing to Multiple Tables (split by key)

1. Click on OUTPUTS and NEW on the upper right hand corner.

2. Select the Athena data output.

3. Give the output a name. Choose the CDC data source that you've defined from the previous section. Make sure to choose Hierarchical and click on NEXT.

4. Click on the SQL view and add the data.* as * to your SQL statement.

5. Click back to the UI view and click on More and select Manage Multi Table

6. Select the column that you want to use as the key for splitting data into multiple tables.Click on CLOSE.

7. Click on RUN on the upper right hand corner

8. Fill out your run parameters and make sure to define TABLE NAME prefix. If the MySQL table is named orders, it will show up as split_orders in Athena with the example given below. Click on NEXT.

9. Define how you would like to run this deployment. Keep in mind that leaving ENDING AT as Never will continuously read updates from your MySQL source.

10. Wait for the data to be written to Athena and you're done!

Make sure that are enabled. If they are not enabled the binlog will be disabled regardless of the parameter group settings.

https://debezium.io/documentation/reference/connectors/mysql.html#setting-up-mysql
https://debezium.io/documentation/reference/connectors/mysql.html#enable-mysql-binlog
Automated Backups