Build a Data Lakehouse

This guide shows you how to build a data lakehouse with Upsolver and load data into ClickHouse for real-time analysis.

Build an Upsolver-managed data lakehouse

A data lakehouse is the most cost-effective, feature-complete way to build managed, open standards-based shared storage for analytics and AI. Apache Iceberg leads the open table format standards, enabling this shared experience, and is fully supported by Upsolver.

Building on the data lake, a lakehouse continues to provide the cheap storage we now take for granted. This enables companies to store all their data in Apache Iceberg, offering historical and unstructured data for ML and other use cases, while business critical data can be loaded into an OLAP engine such as ClickHouse for faster, real-time analytics.

When you build a data lakehouse with Upsolver, we manage your Iceberg tables by running background compaction and tuning operations based on industry best practices, and at the optimal time. Compacting the data files improves query performance and reduces storage costs.

Upsolver & ClickHouse

Upsolver's integration with ClickHouse enables customers to reliably ingest large volumes of data in near real-time. With support for the main industry data platforms, Upsolver can ingest from databases, streaming, and file sources into ClickHouse Cloud 50-100X faster than other solutions.

Purpose-built for big data, Upsolver easily moves high-volume, streaming data from your operational systems into your analytics target. Whether your source is generating application events at a rate of thousands per second, or your gaming or ad-tech company creates millions of events per second, Upsolver can ingest your stream into your data lakehouse and target analytics systems with ease.

Furthermore, Upsolver automates schema evolution, maintains performance, and ensures strongly ordered, exactly-once delivery of your data.

Upsolver automatically detects changes in the source and streams only changed rows to ClickHouse. Orchestration and task scheduling are handled internally, eliminating the need for external tools like Airflow or AWS Lambda.

What you will learn

In this guide, you will learn how to build a data lakehouse using Upsolver-managed Iceberg tables. You will create a job to load data from Apache Kafka to your Iceberg table to store historical data, benefiting from the cheap storage of your data lakehouse. You will then see how to create a job that loads the same data into a table in your ClickHouse database, ready for you to analyze.


Before you begin, please read the guide How to Configure SSL for Your Apache Kafka Connection to grant Upsolver permission to read the data from your Apache Kafka source.

The steps for building your lakehouse and ingesting data from Apache Kafka to Iceberg and ClickHouse are as follows:

  1. Connect to Apache Kafka

  2. Create an Upsolver-managed Iceberg table

  3. Create a job to ingest your Kafka data to your Iceberg table

  4. Connect to ClickHouse

  5. Create a job to ingest data to your ClickHouse table

Step 1

Create a connection to Apache Kafka

The first step is to connect to your Apache Kafka source. When you create a connection in Upsolver, other users in your organization can see and use it, and it remains always on, so you don't need to reconnect.

Here's the code:

CREATE KAFKA CONNECTION prod_kafka_cluster
    HOSTS = ('')
    CONSUMER_PROPERTIES = '
        security.protocol = SASL_SSL
        sasl.mechanism = PLAIN
        sasl.jaas.config = org.apache.kafka.common.security.plain.PlainLoginModule required username="XXXXXXXX" password="-----------";
    ';

Step 2

Create an Upsolver-managed Iceberg table

The next step is to create an Iceberg table. The code is very similar to creating a standard table in your data lake, but here we simply include the ICEBERG keyword. The following example creates an Iceberg table named orders_ice_tbl, in the sales_db database in the default_glue_catalog catalog.

Here's the code:

CREATE ICEBERG TABLE default_glue_catalog.sales_db.orders_ice_tbl()
    PARTITIONED BY event_date;

Notice that we did not specify any columns. This enables Upsolver to manage the schema, identifying the source column names and types and evolving the schema as changes are detected in the source. All this takes place automatically, so your pipeline doesn't break and no manual intervention is needed when a source column changes or new columns are added. Upsolver will also continuously compact and tune the table in the background, ensuring queries return as fast as possible.
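If you do prefer to pin key columns up front, for example to guarantee the type of the partition column, Upsolver also accepts an explicit column list while still detecting and evolving the remaining columns from the source. A hedged sketch (the column names and types below are illustrative, not taken from the orders topic):

```sql
-- Illustrative variant: pin selected columns explicitly;
-- Upsolver still adds and evolves the remaining source columns.
CREATE ICEBERG TABLE default_glue_catalog.sales_db.orders_ice_tbl (
    order_id string,
    event_date date
)
PARTITIONED BY event_date;
```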

Step 3

Ingest data from Apache Kafka to Iceberg

Now that we have created our Iceberg table, and Upsolver is taking care of schema evolution and compaction, we can begin loading our data. To do that we need to create an ingestion job.

Here's the code to create the job:

CREATE SYNC JOB load_orders_to_iceberg
    START_FROM = NOW
AS COPY FROM KAFKA prod_kafka_cluster
    TOPIC = 'orders'
INTO ICEBERG default_glue_catalog.sales_db.orders_ice_tbl;

As soon as we execute the above code, Upsolver begins the process of ingesting the data, which starts with defining the source schema. Having determined the source structure, Upsolver creates the columns in our orders_ice_tbl table, and the data begins streaming.

In the above example, the START_FROM job option specifies that we want to ingest all events starting from NOW. This instructs Upsolver to ignore previous events, though we could specify BEGINNING to ingest all events, or pass in a date to set the starting point, e.g. the first day of the current year.
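For instance, to backfill from a fixed point in time rather than from NOW or BEGINNING, START_FROM accepts a timestamp. A sketch with an illustrative date (substitute your own starting point):

```sql
-- Illustrative: ingest all events from the start of 2024 onward
CREATE SYNC JOB backfill_orders_to_iceberg
    START_FROM = TIMESTAMP '2024-01-01 00:00:00'
AS COPY FROM KAFKA prod_kafka_cluster
    TOPIC = 'orders'
INTO ICEBERG default_glue_catalog.sales_db.orders_ice_tbl;
```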

Step 4

Create a connection to ClickHouse

The next step is to create a connection to your target ClickHouse database. As with your Apache Kafka connection, this connection is visible to all users within your organization, and is always on.

Here's the code:

CREATE CLICKHOUSE CONNECTION my_clickhouse_connection
    CONNECTION_STRING = 'jdbc:clickhouse://<hostname>:<port>/<database_name>'
    USER_NAME = 'my_username'
    PASSWORD = 'my_password'
    COMMENT = 'Connection to Sales database';

Step 5

Ingest data from Apache Kafka to ClickHouse

Now that we have connected to ClickHouse, the final step is to create a job to load the data from Apache Kafka into ClickHouse. To reduce your ClickHouse storage costs, ensure you configure a short retention time on the target table. Let's ingest our orders topic to ClickHouse.

Here's the code:

CREATE SYNC JOB load_orders_to_clickhouse
    START_FROM = NOW
AS COPY FROM KAFKA prod_kafka_cluster
    TOPIC = 'orders'
INTO CLICKHOUSE my_clickhouse_connection.sales_db.orders_tbl;

The above example creates a job named load_orders_to_clickhouse that will ingest the orders topic from the prod_kafka_cluster connection into the orders_tbl table in the sales_db database in ClickHouse.

As with our job that ingests the data from Kafka to our Iceberg table, we have set the START_FROM option to begin from NOW, so that previous events are ignored. While we can retain large volumes of historical data at low cost in S3, we want to minimize our ClickHouse storage cost. Therefore, we will ingest only new events and, along with the data retention policy on the ClickHouse table, retain the minimum amount of data to fulfill our analytics requirements.
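The retention policy itself is configured on the ClickHouse side, typically with a table TTL. A minimal sketch of what the target table might look like, assuming an event_date column and a 30-day window (both illustrative, not taken from this guide):

```sql
-- ClickHouse DDL (not Upsolver): expire rows 30 days after event_date
CREATE TABLE sales_db.orders_tbl
(
    order_id   String,
    event_date Date
)
ENGINE = MergeTree
ORDER BY (event_date, order_id)
TTL event_date + INTERVAL 30 DAY;
```

With a TTL in place, ClickHouse drops expired rows in the background, so the table holds only the recent window needed for real-time analytics while the full history remains in Iceberg.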


In this guide, you learned how to create a connection to your Apache Kafka source that is shared with all users in your organization and permanently connected. Then you saw how to create an Iceberg table without defining columns and types so that Upsolver can manage the schema evolution and automatically handle changes.

You learned that Upsolver takes care of the table maintenance operations that are essential for optimizing your tables in Iceberg, including compaction to speed up queries and reduce storage costs. Next, you created a simple but powerful job to ingest events from Kafka to the new Iceberg table.

In the latter steps, you learned how to create a connection to your ClickHouse database, and how to create a job to ingest data from Kafka.

Try it yourself

To create your data lakehouse and ingest data from Kafka to ClickHouse:
