Enriching Data - Amazon S3 to ClickHouse

This guide teaches you how to create a job to write transformed data into an existing ClickHouse table.


A common use case for creating a transformation job is the need to enrich an existing analytics table with data from production systems or third parties. In this guide, we will use ClickHouse as our target database, but the process for creating a job is almost identical for other data platforms.

ClickHouse is an open-source OLAP engine built for fast, efficient real-time analytics. Users gain the most benefit when data is loaded into one of its specialized table engines, and ClickHouse Cloud can process, store, and analyze billions of rows in seconds.
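This guide assumes the target table already exists in ClickHouse. For orientation only, a table to hold the enriched order data produced later in this guide might look something like the following sketch; the column names, types, and engine settings are assumptions rather than part of the guide:

-- Hypothetical target table; adjust the columns, types, and ORDER BY key to your data.
CREATE TABLE sales_db.target_tbl
(
    orders      String,
    customer_id String,
    net_total   Float64
)
ENGINE = MergeTree
ORDER BY customer_id;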

While you can use query federation and ClickHouse's native S3 integration to query data in place, you won't benefit from the performance gains of loading the data directly into ClickHouse, which can return queries in milliseconds. An alternative is to use the S3 or S3Cluster table functions, which are easy enough to implement, but this ease is generally offset by the manual tuning, and the in-depth knowledge of the data structure, required to avoid misconfiguration. In addition, users must create and administer a scheduled process to refresh the data.
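For comparison, an ad hoc query over the raw files with the s3 table function looks roughly like this; the bucket URL, path, and file format below are placeholders, not values from this guide:

-- Queries the files in place; every run re-reads S3, and credentials and tuning are up to you.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/raw/orders/*.json', 'JSONEachRow');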

Upsolver removes the manual intervention, the complexity, and the need to "figure things out" by offering a simple command to ingest data from S3 into your ClickHouse table.

What you will learn

In this guide, you will learn how to connect to your source and target systems. Then you will discover how to create a transformation job that loads the data into the destination table in ClickHouse, ready for you to analyze. Once loaded into the table, you will benefit from the performance gains delivered by ClickHouse's table engines.


Prerequisites

Upsolver only runs in your AWS Virtual Private Cloud (VPC). Learn how to deploy Upsolver on AWS and set up the required AWS role permissions.

When you first deploy Upsolver in the customer VPC, you create an Identity and Access Management (IAM) role that gives you access to any AWS resource you might need to build data pipelines. See AWS Role Permissions for more information.

In this guide, the cluster has permission to read and write to the AWS Glue Data Catalog and to S3.

The steps for enriching an existing table in ClickHouse are as follows:

  1. Connect to Amazon S3

  2. Connect to ClickHouse

  3. Create a transformation job to load the data into ClickHouse


Step 1

Create a connection to your data sources in S3

This connection enables you to configure the credentials that Upsolver needs to access the data. Use your Access Key ID and your Secret Access Key to connect to your specific data lake.

Here's the code:

CREATE S3 CONNECTION my_s3_connection
    AWS_ROLE = 'arn:aws:iam::111111111111:role/<upsolver-role-*>'
    EXTERNAL_ID = '12345678'
    READ_ONLY = TRUE;

For future reference, your AWS_ROLE can be copied from your user page. After you complete this step, you should find the my_s3_connection connection in your navigation tree.

Every connection you create is visible and available to all users within your organization, and is always on, meaning you don't need to recreate or reconnect when you create subsequent jobs.

Step 2

Create a connection to ClickHouse

The next step is to create a connection to your target ClickHouse database.

Here's the code:

CREATE CLICKHOUSE CONNECTION my_clickhouse_connection
    CONNECTION_STRING = 'http://x.us-east-1.aws.clickhouse.cloud:8123/sales_db'
    USER_NAME = 'my_username'
    PASSWORD = 'my_password'
    COMMENT = 'Connection to Sales database';

In the above example, a connection named my_clickhouse_connection is created to access the sales_db database. An optional comment is added to inform other users what this connection does, and this is visible in the connections system table.

Step 3

Create a transformation job

Now that you have a connection to both your source and target systems, you can create a job to transform and load the data into ClickHouse.

Here's the code to create a transformation job:

CREATE SYNC JOB transform_and_load_to_clickhouse
  START_FROM = BEGINNING
  RUN_INTERVAL = 1 MINUTE
  RUN_PARALLELISM = 20
AS INSERT INTO CLICKHOUSE my_clickhouse_connection.sales_db.target_tbl 
  MAP_COLUMNS_BY_NAME
    SELECT 
      orders,
      MD5(customer_email) AS customer_id,
      ARRAY_SUM(data.items[].quantity * data.items[].price) AS net_total
    FROM default_glue_catalog.lake_db.raw_order_events
    WHERE TIME_FILTER(); 

In the example above, we create a job named transform_and_load_to_clickhouse, which we then customize using some of the available job options. The first option, START_FROM, tells Upsolver the point in time from which to start ingesting data. We have specified BEGINNING, so all existing data will be loaded into ClickHouse. You can also specify a date, or NOW to ingest only data that arrives after the job starts running.

Next, the RUN_INTERVAL option tells the job to load changes into the target every 1 MINUTE, though this is a highly configurable option, and can be set to your exact requirements.

The RUN_PARALLELISM option controls how many jobs run in parallel to process each minute of data from the source table, and has been set to 20.
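For example, a variation of the same job that skips historical data and loads changes every few minutes could look like the following sketch; the job name and interval here are arbitrary choices, not part of the guide, and should be adjusted to your own requirements:

-- Illustrative variant of the job above: start from now and load in 5-minute batches.
CREATE SYNC JOB load_new_orders_to_clickhouse
  START_FROM = NOW
  RUN_INTERVAL = 5 MINUTES
  RUN_PARALLELISM = 10
AS INSERT INTO CLICKHOUSE my_clickhouse_connection.sales_db.target_tbl
  MAP_COLUMNS_BY_NAME
    SELECT
      orders,
      MD5(customer_email) AS customer_id,
      ARRAY_SUM(data.items[].quantity * data.items[].price) AS net_total
    FROM default_glue_catalog.lake_db.raw_order_events
    WHERE TIME_FILTER();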

The INSERT INTO statement defines where we want to load the data; in our case, we want the target_tbl table in the sales_db database, using the my_clickhouse_connection connection we created in Step 2.

The second half of the query defines the data we want to transform and load. The MAP_COLUMNS_BY_NAME keyword specifies that the columns in the SELECT list must map to the same column names in the target. If we excluded this keyword, columns would be mapped by their ordinal position.

The SELECT statement defines the columns we want to load into the target_tbl table. To protect sensitive data, the customer_email column has been masked using the MD5() hashing function, so the value will not be readable in the target. The ARRAY_SUM() function has been used to derive the net_total value of the order.
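To make the transformation concrete, here is a small worked example of what these two expressions produce for a single order; the input values are invented for illustration:

-- Hypothetical input event:
--   customer_email = 'jane@example.com'
--   data.items[]   = [ { quantity: 2, price: 10.00 }, { quantity: 1, price: 5.50 } ]
--
-- MD5(customer_email) returns a fixed-length hexadecimal hash that becomes customer_id,
-- so the raw email address never reaches ClickHouse.
--
-- ARRAY_SUM(data.items[].quantity * data.items[].price)
--   = (2 * 10.00) + (1 * 5.50)
--   = 25.50, which becomes net_total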

Finally, in the WHERE clause, we include the TIME_FILTER() function to specify the window over which the aggregation is calculated. As we have not passed a value to the function, it takes the default specified in the RUN_INTERVAL option, in our case, 1 MINUTE.

This is a simple example to demonstrate how to create a transformation job to enrich your target table; however, Upsolver's library includes an extensive list of functions and operators that can be applied to your jobs to ensure your data lands in your target exactly how you want it.


Conclusion

In this guide you learned that, in some cases, it is best to load data into ClickHouse to benefit from the super fast query performance offered by a database specifically geared up for high-scale analytics. You learned how to create a connection to your Amazon S3 source, a connection to your target ClickHouse database, and then how to create a transformation job to continuously load the data.
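Once the job is running, you can query the enriched table directly in ClickHouse. A minimal example, assuming the target_tbl columns used in this guide:

-- Run in ClickHouse against the table the job loads.
SELECT
    customer_id,
    count() AS order_count,
    sum(net_total) AS total_revenue
FROM sales_db.target_tbl
GROUP BY customer_id
ORDER BY total_revenue DESC
LIMIT 10;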


Try it yourself

To create a transformation job to load your data into ClickHouse:

  1. Create a connection to your Amazon S3 source

  2. Create a connection to your ClickHouse database

  3. Write a transformation job to copy data from S3 to ClickHouse, using Upsolver's functions and operators to transform your data
