Microsoft SQL Server

This article describes how to ingest CDC data from your SQL Server database.

Before ingesting your CDC data into Upsolver, please ensure you have configured your database to enable change data capture. This article guides you through the process.

Prerequisites for SQL Server

Upsolver currently supports the following SQL Server set-up:

  • Amazon RDS for SQL Server

  • SQL Server 2019 (15x)

  • SQL Server Standard or Enterprise Edition

  • SQL Server Agent must be running

Please refer to the Amazon RDS article Using change data capture for more information.

Enable CDC on your database

Ensure you are logged in to SQL Server using an account with sysadmin privileges. From your query window, run the following to enable change data capture on the database:

USE Sales
GO
EXEC sys.sp_cdc_enable_db  
GO  

For Amazon RDS for SQL Server, enable change data capture as follows:

// Enable CDC on the Sales database 
EXEC msdb.dbo.rds_cdc_enable_db 'Sales' 
GO  

Enable CDC on your tables

After CDC is enabled for a database, any user with db_owner privileges of the database can enable or disable CDC at the table level.

To check if a table has already been enabled for change data capture, run the following:

// Check if the Customer table is enabled for CDC
SELECT  [name], is_tracked_by_cdc 
FROM    sys.tables
WHERE   [name] = 'Customer'

If is_tracked_by_cdc returns 0, enable change data capture for each table you want to include in the capture:

// Enable CDC on the Customer table
USE Sales
GO  
  
EXEC sys.sp_cdc_enable_table  
  @source_schema = N'dbo',  
  @source_name = N'Customer',  
  @role_name = N'cdc_role',  
  @filegroup_name = N'fg_sales_cdc_data',  
  @supports_net_changes = 1  
GO  

The input parameters should be specified as follows:

ParameterDescription

@source_schema

This is the name of the schema to which the table belongs, for example, dbo.

@source_name

The name of the source table you want to enable for change data capture, e.g. Customer.

@role_name

To control access to the changed data, you can optionally specify an existing fixed server or database role. If the database role does not exist, SQL Server creates it. Users must have SELECT permissions on all captured columns in the source table and be added to the new role if they are not members of the db_owner or sysadmin roles. Alternatively, if you don't want to use a gating role, set this parameter to NULL, e.g. @role_name = NULL.

@filegroup_name

It is best practice to store your change data table separately from the source table. You can optionally specify the name of a pre-existing filegroup to store the CDC data, e.g. fg_sales_cdc_data. If you don't specify an alternative filegroup, CDC data is stored in the default filegroup of the database.

@supports_net_changes

Set this value to 1 if you want a net changes function to be generated for the capture instance. The net changes function will return one change for each distinct row that was changed in the specified interval in the call. For more information, please refer to cdc.fn_cdc_get_net_changes_<capture_instance>. The net changes function requires the source table to include a primary key or unique index. If using the latter, you must include the @index_name parameter to specify the name of the unique index.

Specifying capture columns

By default, when you enable CDC on a table, all columns are identified as captured columns. If you don't need to track all columns or want to exclude specific columns for privacy reasons for example, you can use the @captured_column_list parameter:

// Enable CDC on the Customer table and limit the captured columns
USE Sales
GO  
  
EXEC sys.sp_cdc_enable_table  
  @source_schema = N'dbo',  
  @source_name = N'Customer',  
  @role_name = N'cdc_role',  
  @captured_column_list = 'CustomerID, Title, FirstName, LastName, CompanyName'
  @filegroup_name = N'fg_sales_cdc_data',    
  @supports_net_changes = 1  
GO  

If you have set @support_net_changes to 1 to switch on net change tracking, you must include either the primary key column or the columns defined in the unique index in the @captured_column_list parameter.

For more information on CDC, please refer to the Enable and disable change data capture guide from Microsoft.

Create a CDC job that reads from your database

Connect to SQL Server

Now that change data capture is enabled on your tables, the next step is to create a connection to SQL Server from Upsolver. The example below shows you the syntax to create a persistent connection to your database that is available to other users in your organization.

CREATE MSSQL CONNECTION my_sqlserver_connection
  CONNECTION_STRING = 'jdbc:sqlserver://ms-sqlserver-1.myendpoint;encrypt=false;DatabaseName=orders'
  USER_NAME = '<user_name>'
  PASSWORD = '<password>';

Next, you can create a job to ingest your data from SQL Server into a staging table in the data lake.

Jump to

Ingest to the data lake

After completing the prerequisites, you can create your staging tables. The example below creates a table without defining columns or data types, as these will be inferred automatically by Upsolver, though you can define columns if required:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data()
    PARTITIONED BY $event_date;

Create a staging table to store the data ingested from SQL Server.

Upsolver recommends partitioning by the system column $event_date or another date column within the data in order to optimize your query performance.

Next, create an ingestion job as follows:

CREATE JOB load_raw_data_from_mssql
    COMMENT = 'Ingest from production SQL Server'
AS COPY FROM MSSQL my_mssql_connection
    INTO default_glue_catalog.upsolver_samples.orders_raw_data;

Job options

The example above only uses a small subset of all job options available when reading from SQL Server. Depending on your use case, you may want to configure a different set of options.

Transformations can be applied to your ingestion job, for example, to exclude columns, correct issues, or mask data, before it lands in the target. Furthermore, you can use expectations to define data quality rules on your data stream and take appropriate action.

For more information, see the Ingestion jobs page, which describes the available job options and includes examples.

After your data has been ingested into your staging table, you are ready to move on to the next step of building your data pipeline: transforming your data and writing it to your intended target locations.

Alter a job

You can alter some of the options of an existing job. For example, if you want to keep the job as is, but only change the cluster that is running the job, execute the following command:

ALTER JOB load_orders_raw_data_from_sqlserver 
    SET COMPUTE_CLUSTER = "my_new_cluster";

Note that some options such as COMPRESSION cannot be altered once the job has been created.

To check which job options are mutable, see Microsoft SQL Server.

Drop a job that reads from SQL Server

If you no longer need a job, you can easily drop it using the following SQL command:

DROP JOB load_orders_raw_data_from_sqlserver;

Last updated