Iceberg Partition Clustering

Explore how Upsolver uses partition clustering in Apache Iceberg to manage large datasets.

What is partition clustering?

Partition clustering is a strategy within Upsolver to efficiently organize and manage large datasets that are partitioned on high-cardinality columns. Let's explore how this works within the Upsolver environment.

Imagine we have a dataset partitioned by both time (hour) and country. Each partition corresponds to a specific combination of hour and country. For instance, data for the 13th of May, 2024, in the United States would be stored in a folder named 2024_05_13_01_US. Similarly, data for the same date in the United Kingdom would be stored in a folder named 2024_05_13_01_UK, and so on.

Within each partition folder, multiple data files are generated. As data is continuously streamed into the system, a new file is typically created every minute, reflecting the high frequency of data writes. These files are named according to the minute they represent, for example, 2024_05_13_01_US/01.parquet, 2024_05_13_01_US/02.parquet, and so forth.

However, due to variations in data distribution, some partitions may accumulate significantly more data than others. For instance, the US partition might contain 90% of the data, while other countries have substantially less.

This non-uniform distribution can impact system performance, particularly when dealing with a high volume of small files. The process of reading and writing numerous small files can introduce latency and overhead.

How does partition clustering work?

Partition clustering within Upsolver addresses this issue by optimizing file storage. Instead of storing every small file separately within its respective partition, small files are merged together into a larger, non-partitioned file. The merged file is then sorted based on the partition criteria, ensuring data remains organized and accessible.

Periodically, as part of an offline compaction process, the merged file may be evaluated. If the accumulated data within a merged file is substantial enough to represent a significant portion of a partition, it may be moved from the non-partitioned storage into the relevant partition folder. This ensures data remains efficiently organized while balancing storage considerations.

What are the benefits of using partition clustering?

By reducing the number of small files and optimizing storage organization, partition clustering in Upsolver minimizes latency and improves system performance, enabling faster data access and processing.

Reduced query time - Data is organized more efficiently and therefore queries return faster.
Reduced full table scan duration - By significantly decreasing the total number of files in the table, partition clustering shortens the duration of full table scans.
Near real time data freshness - Partition clustering ensures efficient management of partitions with high cardinality, allowing data to be written and refreshed with minimal latency. By consolidating small files into larger ones and optimizing storage organization, partition clustering minimizes the time required for data refresh processes, enhancing overall system efficiency.
Reduced S3 API costs - With fewer individual files due to the merging of small files, partition clustering helps reduce the number of S3 API calls required for data access, leading to improved performance and cost-effectiveness.

When to use partition clustering?

Partition clustering is particularly beneficial for datasets where data arrives frequently across multiple partitions. It allows users to write to tables with such high cardinality while minimizing the number of files generated, thereby streamlining data management processes. The following are examples of scenarios that benefit from clustering:

Tables frequently filtered by high cardinality columns.
Tables exhibiting significant skew in data distribution.
Tables experiencing rapid growth and requiring maintenance and tuning efforts.
Tables where typical partition keys could result in either an excessive or inadequate number of partitions.

When to avoid partition clustering?

Partition clustering may not be suitable when complete separation of data is crucial. If you must keep data from different partitions in physically separate folders, particularly from a security perspective, avoid using partition clustering.

Which clustering keys should I choose?

When choosing clustering keys, it's advisable to prioritize columns frequently used in query filters. The order of the clustering keys is important, as this impacts the sorting of the non-partitioned file.

For example, if two columns used by the clustering key are related, specify the more general column before the more data specific column, e.g. Country then City.

Limitations

When using partition clustering in Upsolver, please be aware of the following limitations:

Primary keys not supported on tables with partition clustering.
As tables with clustering do not have a primary key defined, you can only append (insert) data to these tables. Updating or deleting existing records with a transformation (merge) job is not supported.
Partition clustering is not relevant for Materialized Views
Data will not be retained for records in the non-partitioned storage.

Syntax

Create table

The following syntax creates an Iceberg table, with partitioning, or partition clustering:

CREATE ICEBERG TABLE <table_identifier>
([ <column_name> <column_type> [, ...]])
   { PARTITIONED | PARTITION | CLUSTERED | CLUSTER } BY <column_name> [, ...]
   PRIMARY KEY <column_name> [, ...]
   -- Specify sort order (default is ASC)
   ORDER BY <column_name> [ASC | DESC] [, ...] -
   [{ table_options }];

In the case where sort fields are included, Upsolver automatically sorts the non-partitions storage by the clustering fields and then by the sort fields.

You can also add transformations to the clustering field, partition keys, or sorting keys using Iceberg functions.

Note

You cannot create a table using both PARTITIONED BY and CLUSTERED BY clauses:

Use PARTITIONED BY for regular partitions
Use CLUSTERED BY for partition clustering

When partitioning is solely based on time or other fields with low cardinality, clustering may not provide significant benefits.

Update table clustering

The following syntax creates partitioned clustering on the table and overrides the existing partitioning:

ALTER TABLE <table_identifier> CLUSTER BY (<column_name> [, ...]);

Remove table clustering

Use the following syntax to remove the table clustering:

ALTER TABLE <table_identifier> CLUSTER BY ();

Last updated 4 days ago