Optimization Processes for Iceberg Tables in Upsolver

Last updated 10 months ago

Upsolver employs several optimization processes to enhance the performance and manageability of Iceberg tables. These processes are designed to maintain efficient storage, ensure high query performance, and reduce operational overhead. Below are the key optimization processes performed by Upsolver:

1. Continuous Compaction

Compaction in Upsolver runs continuously and is specifically optimized for streaming data. The compaction process involves:

  • Monitoring and Selection: Regularly checking for potential compaction opportunities.

  • Optimization Criteria: Selecting compactions that offer the highest predicted query performance gains and cost reduction relative to the cost of performing the compaction.

This approach ensures that the Iceberg tables remain optimized for query performance without incurring unnecessary computational costs.
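The selection criteria above can be pictured as a gain-versus-cost ranking. The sketch below is a minimal illustration of that idea; the class, function names, and scoring formula are assumptions for explanation only, not Upsolver's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class CompactionCandidate:
    partition: str
    small_file_count: int   # files below the target file size
    bytes_to_rewrite: int   # total bytes the compaction would rewrite

def predicted_gain(c: CompactionCandidate) -> float:
    # Assume query-time benefit grows with the number of small files merged.
    return float(c.small_file_count)

def compaction_cost(c: CompactionCandidate) -> float:
    # Assume cost is proportional to the bytes that must be rewritten.
    return c.bytes_to_rewrite / (128 * 1024 * 1024)  # cost in 128 MB units

def pick_best(candidates):
    # Select the candidate with the highest gain-to-cost ratio, if any.
    scored = [(predicted_gain(c) / max(compaction_cost(c), 1e-9), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1] if scored else None

candidates = [
    CompactionCandidate("day=2024-06-01", small_file_count=40,
                        bytes_to_rewrite=256 * 1024 * 1024),
    CompactionCandidate("day=2024-06-02", small_file_count=5,
                        bytes_to_rewrite=512 * 1024 * 1024),
]
best = pick_best(candidates)
print(best.partition)  # → day=2024-06-01 (many small files, low rewrite cost)
```

In this toy scoring, the first partition wins because it merges many small files while rewriting relatively few bytes, which is the kind of trade-off the text describes.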

2. Snapshot Expiration

Iceberg operations generate new snapshots, which are available for user queries, enabling features such as time travel. However, storing these snapshots can lead to increased storage requirements. To manage this, Upsolver automatically cleans up old snapshots.

Users can configure the retention of snapshots using Iceberg table properties as detailed in the Iceberg documentation.

This clean-up process occurs every few hours, ensuring that only necessary snapshots are retained, thereby optimizing storage usage.
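The standard Iceberg table properties for snapshot retention are `history.expire.max-snapshot-age-ms` and `history.expire.min-snapshots-to-keep`. The sketch below illustrates how an expiration pass might apply them; the `expire_snapshots` helper is an illustrative assumption, not Upsolver's or Iceberg's actual API.

```python
# Two standard Iceberg snapshot-retention table properties.
table_properties = {
    "history.expire.max-snapshot-age-ms": 5 * 24 * 60 * 60 * 1000,  # keep 5 days
    "history.expire.min-snapshots-to-keep": 3,  # but never fewer than 3
}

def expire_snapshots(snapshot_ts_ms, props, now_ms):
    """Return the snapshot creation timestamps that survive expiration."""
    max_age = props["history.expire.max-snapshot-age-ms"]
    min_keep = props["history.expire.min-snapshots-to-keep"]
    ordered = sorted(snapshot_ts_ms, reverse=True)  # newest first
    survivors = []
    for i, ts in enumerate(ordered):
        # Keep a snapshot if it is young enough, or needed to satisfy min_keep.
        if now_ms - ts <= max_age or i < min_keep:
            survivors.append(ts)
    return survivors

DAY_MS = 24 * 60 * 60 * 1000
now = 10 * DAY_MS
snaps = [1 * DAY_MS, 4 * DAY_MS, 7 * DAY_MS, 9 * DAY_MS]  # creation times
# Two snapshots are within the 5-day window; a third, older one is kept
# only to satisfy min-snapshots-to-keep.
print(expire_snapshots(snaps, table_properties, now))
```

Note how the minimum-count property can override the age limit: an old snapshot survives if dropping it would leave fewer than the configured minimum.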

3. Dangling File Clean-up

During Iceberg operations, files may sometimes become unreferenced, or "dangling". These files can accumulate, leading to increased storage costs. Upsolver addresses this with a daily clean-up that automatically seeks out and removes dangling files from the table's storage location, keeping storage tidy and cost-effective.
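Conceptually, detecting dangling files reduces to a set difference: any file present in storage but not referenced by the table's metadata is an orphan. The sketch below illustrates that idea with hypothetical names; it is not Upsolver's internal implementation.

```python
def find_dangling(files_in_storage: set, files_referenced: set) -> set:
    # Stored but unreferenced by any snapshot's metadata → dangling.
    return files_in_storage - files_referenced

stored = {"data/a.parquet", "data/b.parquet", "data/tmp-1.parquet"}
referenced = {"data/a.parquet", "data/b.parquet"}  # from table metadata
print(sorted(find_dangling(stored, referenced)))  # → ['data/tmp-1.parquet']
```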

4. Data Retention

Upsolver provides configurable data retention policies, allowing users to define how long data should be retained based on date partitions. This process involves:

  • Retention Configuration: Setting a retention period based on a date partition in the table.

  • Automatic Deletion: Automatically deleting data in partitions that fall outside the retention period.

This ensures that outdated data is removed in a timely manner, helping manage storage and maintain compliance with data governance policies.
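The two retention steps above can be sketched as a cutoff computed from the configured period, followed by deletion of partitions older than that cutoff. The function and parameter names below are illustrative assumptions, not Upsolver's API.

```python
from datetime import date, timedelta

def partitions_to_delete(partition_dates, retention_days, today):
    """Return the date partitions that fall outside the retention period."""
    cutoff = today - timedelta(days=retention_days)
    # Partitions strictly older than the cutoff are eligible for deletion.
    return [d for d in partition_dates if d < cutoff]

parts = [date(2024, 6, 1), date(2024, 6, 10), date(2024, 6, 20)]
# With a 14-day window ending 2024-06-21, only the June 1 partition expires.
print(partitions_to_delete(parts, retention_days=14, today=date(2024, 6, 21)))
```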

By incorporating these optimization processes, Upsolver ensures that Iceberg tables are efficient, performant, and cost-effective, while also providing flexibility and control to customers.

Read the guide on how to Optimize Your Iceberg Tables to learn how to leverage this functionality in Upsolver.
