Enriching Data - Amazon S3 to ClickHouse

This guide teaches you how to create a job to write transformed data into an existing ClickHouse table.


A common use case for creating a transformation job is the need to enrich an existing analytics table with data from production systems or third parties. In this guide, we will use ClickHouse as our target database, but the process for creating a job is almost identical for other data platforms.

ClickHouse is an open-source OLAP engine built for fast, efficient real-time analytics. Users gain the most benefit when data is loaded into one of its specialized table engines, and ClickHouse Cloud can process, store, and analyze billions of rows in seconds.
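This guide assumes the target table already exists in ClickHouse. For orientation only, a table to hold the enriched order data produced later in this guide might look something like the following sketch; the column names, types, and engine settings are assumptions rather than part of the guide:

-- Hypothetical target table; adjust the columns, types, and ORDER BY key to your data.
CREATE TABLE sales_db.target_tbl
(
    orders      String,
    customer_id String,
    net_total   Float64
)
ENGINE = MergeTree
ORDER BY customer_id;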

While you can use query federation and ClickHouse's native S3 integration to query data in place, you won't benefit from the performance gains of loading the data directly into ClickHouse, which can return queries in milliseconds. An alternative is to use the S3 or S3Cluster table functions, which are easy enough to implement, but this ease is generally offset by the manual tuning, and the in-depth knowledge of the data structure, required to avoid misconfiguration. In addition, users must create and administer a scheduled process to refresh the data.
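For comparison, an ad hoc query over the raw files with the s3 table function looks roughly like this; the bucket URL, path, and file format below are placeholders, not values from this guide:

-- Queries the files in place; every run re-reads S3, and credentials and tuning are up to you.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/raw/orders/*.json', 'JSONEachRow');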

Upsolver removes the manual intervention, the complexity, and the need to "figure things out" by offering a simple command to ingest data from S3 into your ClickHouse table.

What you will learn

In this guide, you will learn how to connect to your source and target systems. Then you will discover how to create a transformation job that loads the data into the destination table in ClickHouse, ready for you to analyze. Once loaded into the table, you will benefit from the performance gains delivered by ClickHouse's table engines.


Prerequisites

Upsolver only runs in your AWS Virtual Private Cloud (VPC). Learn how to deploy Upsolver on AWS and set up the required AWS role permissions.

When you first deploy Upsolver in the customer VPC, you create an Identity and Access Management (IAM) role that gives you access to any AWS resource you might need to build data pipelines. See AWS Role Permissions for more information.

In this guide, the cluster has permission to read and write to the AWS Glue Data Catalog and to S3.

The steps for enriching an existing table in ClickHouse are as follows:

  1. Connect to Amazon S3

  2. Connect to ClickHouse

  3. Create a transformation job to load the data into ClickHouse


Step 1

Create a connection to your data sources in S3

This connection enables you to configure the credentials that Upsolver needs to access the data. Use your Access Key ID and your Secret Access Key to connect to your specific data lake.

Here's the code:

CREATE S3 CONNECTION my_s3_connection
    AWS_ROLE = 'arn:aws:iam::111111111111:role/<upsolver-role-*>'
    EXTERNAL_ID = '12345678'
    READ_ONLY = TRUE;

For future reference, your AWS_ROLE can be copied from your user page. After you complete this step, you should find the my_s3_connection connection in your navigation tree.

Every connection you create is visible and available to all users within your organization, and is always on, meaning you don't need to recreate or reconnect when you create subsequent jobs.

Step 2

Create a connection to ClickHouse

The next step is to create a connection to your target ClickHouse database.

Here's the code:

CREATE CLICKHOUSE CONNECTION my_clickhouse_connection
    CONNECTION_STRING = 'http://x.us-east-1.aws.clickhouse.cloud:8123/sales_db'
    USER_NAME = 'my_username'
    PASSWORD = 'my_password'
    COMMENT = 'Connection to Sales database';

In the above example, a connection named my_clickhouse_connection is created to access the sales_db database. An optional comment is added to inform other users what this connection does, and this is visible in the connections system table.

Step 3

Create a transformation job

Now that you have a connection to both your source and target systems, you can create a job to transform and load the data into ClickHouse.

Here's the code to create a transformation job:

CREATE SYNC JOB transform_and_load_to_clickhouse
  START_FROM = BEGINNING
  RUN_INTERVAL = 1 MINUTE
  RUN_PARALLELISM = 20
AS INSERT INTO CLICKHOUSE my_clickhouse_connection.sales_db.target_tbl 
  MAP_COLUMNS_BY_NAME
    SELECT 
      orders,
      MD5(customer_email) AS customer_id,
      ARRAY_SUM(data.items[].quantity * data.items[].price) AS net_total
    FROM default_glue_catalog.lake_db.raw_order_events
    WHERE TIME_FILTER(); 

In the example above, we create a job named transform_and_load_to_clickhouse, which we then customize using some of the available job options. The first option, START_FROM, tells Upsolver the point in time from which to start ingesting data. We have specified BEGINNING, so all existing data will be loaded into ClickHouse. You can also specify a date, or NOW to ingest only data that arrives after the job starts running.

Next, the RUN_INTERVAL option tells the job to load changes into the target every 1 MINUTE, though this is a highly configurable option, and can be set to your exact requirements.

The RUN_PARALLELISM option controls how many jobs run in parallel to process each minute of data from the source table, and has been set to 20.
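For example, a variation of the same job that skips historical data and loads changes every few minutes could look like the following sketch; the job name and interval here are arbitrary choices, not part of the guide, and should be adjusted to your own requirements:

-- Illustrative variant of the job above: start from now and load in 5-minute batches.
CREATE SYNC JOB load_new_orders_to_clickhouse
  START_FROM = NOW
  RUN_INTERVAL = 5 MINUTES
  RUN_PARALLELISM = 10
AS INSERT INTO CLICKHOUSE my_clickhouse_connection.sales_db.target_tbl
  MAP_COLUMNS_BY_NAME
    SELECT
      orders,
      MD5(customer_email) AS customer_id,
      ARRAY_SUM(data.items[].quantity * data.items[].price) AS net_total
    FROM default_glue_catalog.lake_db.raw_order_events
    WHERE TIME_FILTER();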

The INSERT INTO statement defines where we want to load the data; in our case, we want the target_tbl table in the sales_db database, using the my_clickhouse_connection connection we created in Step 2.

The second half of the query defines the data we want to transform and load. The MAP_COLUMNS_BY_NAME keyword specifies that the columns in the SELECT list must map to the same column names in the target. If we excluded this keyword, columns would be mapped by their ordinal position.

The SELECT statement defines the columns we want to load into the target_tbl table. To protect sensitive data, the customer_email column has been masked using the MD5() hashing function, so the value will not be readable in the target. The ARRAY_SUM() function has been used to derive the net_total value of the order.
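To make the transformation concrete, here is a small worked example of what these two expressions produce for a single order; the input values are invented for illustration:

-- Hypothetical input event:
--   customer_email = 'jane@example.com'
--   data.items[]   = [ { quantity: 2, price: 10.00 }, { quantity: 1, price: 5.50 } ]
--
-- MD5(customer_email) returns a fixed-length hexadecimal hash that becomes customer_id,
-- so the raw email address never reaches ClickHouse.
--
-- ARRAY_SUM(data.items[].quantity * data.items[].price)
--   = (2 * 10.00) + (1 * 5.50)
--   = 25.50, which becomes net_total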

Finally, in the WHERE clause, we include the TIME_FILTER() function to specify the window over which the aggregation is calculated. As we have not passed a value to the function, it takes the default specified in the RUN_INTERVAL option, in our case, 1 MINUTE.

This is a simple example to demonstrate how to create a transformation job to enrich your target table; however, Upsolver's library includes an extensive list of functions and operators that can be applied to your jobs to ensure your data lands in your target exactly how you want it.


Conclusion

In this guide you learned that, in some cases, it is best to load data into ClickHouse to benefit from the super fast query performance offered by a database specifically geared up for high-scale analytics. You learned how to create a connection to your Amazon S3 source, a connection to your target ClickHouse database, and then how to create a transformation job to continuously load the data.
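Once the job is running, you can query the enriched table directly in ClickHouse. A minimal example, assuming the target_tbl columns used in this guide:

-- Run in ClickHouse against the table the job loads.
SELECT
    customer_id,
    count() AS order_count,
    sum(net_total) AS total_revenue
FROM sales_db.target_tbl
GROUP BY customer_id
ORDER BY total_revenue DESC
LIMIT 10;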


Try it yourself

To create a transformation job to load your data into ClickHouse:

  1. Create a connection to your Amazon S3 source

  2. Create a connection to your ClickHouse database

  3. Write a transformation job to copy data from S3 to ClickHouse, using Upsolver's functions and operators to transform your data
