Optimization Processes for Iceberg Tables in Upsolver

Last updated 10 months ago

Upsolver employs several optimization processes to enhance the performance and manageability of Iceberg tables. These processes are designed to maintain efficient storage, ensure high query performance, and reduce operational overhead. Below are the key optimization processes performed by Upsolver:

1. Continuous Compaction

Compaction in Upsolver runs continuously and is specifically optimized for streaming data. The compaction process involves:

  • Monitoring and Selection: Regularly checking for potential compaction opportunities.

  • Optimization Criteria: Selecting compactions that offer the highest predicted query performance gains and cost reduction relative to the cost of performing the compaction.

This approach ensures that the Iceberg tables remain optimized for query performance without incurring unnecessary computational costs.
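The selection criteria above can be pictured as a gain-versus-cost ranking. The sketch below is a minimal illustration of that idea; the class, function names, and scoring formula are assumptions for explanation only, not Upsolver's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class CompactionCandidate:
    partition: str
    small_file_count: int   # files below the target file size
    bytes_to_rewrite: int   # total bytes the compaction would rewrite

def predicted_gain(c: CompactionCandidate) -> float:
    # Assume query-time benefit grows with the number of small files merged.
    return float(c.small_file_count)

def compaction_cost(c: CompactionCandidate) -> float:
    # Assume cost is proportional to the bytes that must be rewritten.
    return c.bytes_to_rewrite / (128 * 1024 * 1024)  # cost in 128 MB units

def pick_best(candidates):
    # Select the candidate with the highest gain-to-cost ratio, if any.
    scored = [(predicted_gain(c) / max(compaction_cost(c), 1e-9), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1] if scored else None

candidates = [
    CompactionCandidate("day=2024-06-01", small_file_count=40,
                        bytes_to_rewrite=256 * 1024 * 1024),
    CompactionCandidate("day=2024-06-02", small_file_count=5,
                        bytes_to_rewrite=512 * 1024 * 1024),
]
best = pick_best(candidates)
print(best.partition)  # → day=2024-06-01 (many small files, low rewrite cost)
```

In this toy scoring, the first partition wins because it merges many small files while rewriting relatively few bytes, which is the kind of trade-off the text describes.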

2. Snapshot Expiration

Iceberg operations generate new snapshots, which are available for user queries, enabling features such as time travel. However, storing these snapshots can lead to increased storage requirements. To manage this, Upsolver automatically cleans up old snapshots.

Users can configure the retention of snapshots using Iceberg table properties as detailed in the Iceberg documentation.

This clean-up process occurs every few hours, ensuring that only necessary snapshots are retained, thereby optimizing storage usage.
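The standard Iceberg table properties for snapshot retention are `history.expire.max-snapshot-age-ms` and `history.expire.min-snapshots-to-keep`. The sketch below illustrates how an expiration pass might apply them; the `expire_snapshots` helper is an illustrative assumption, not Upsolver's or Iceberg's actual API.

```python
# Two standard Iceberg snapshot-retention table properties.
table_properties = {
    "history.expire.max-snapshot-age-ms": 5 * 24 * 60 * 60 * 1000,  # keep 5 days
    "history.expire.min-snapshots-to-keep": 3,  # but never fewer than 3
}

def expire_snapshots(snapshot_ts_ms, props, now_ms):
    """Return the snapshot creation timestamps that survive expiration."""
    max_age = props["history.expire.max-snapshot-age-ms"]
    min_keep = props["history.expire.min-snapshots-to-keep"]
    ordered = sorted(snapshot_ts_ms, reverse=True)  # newest first
    survivors = []
    for i, ts in enumerate(ordered):
        # Keep a snapshot if it is young enough, or needed to satisfy min_keep.
        if now_ms - ts <= max_age or i < min_keep:
            survivors.append(ts)
    return survivors

DAY_MS = 24 * 60 * 60 * 1000
now = 10 * DAY_MS
snaps = [1 * DAY_MS, 4 * DAY_MS, 7 * DAY_MS, 9 * DAY_MS]  # creation times
# Two snapshots are within the 5-day window; a third, older one is kept
# only to satisfy min-snapshots-to-keep.
print(expire_snapshots(snaps, table_properties, now))
```

Note how the minimum-count property can override the age limit: an old snapshot survives if dropping it would leave fewer than the configured minimum.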

3. Dangling File Clean-up

During Iceberg operations, files may sometimes become unreferenced, or "dangling". These files can accumulate, leading to increased storage costs. Upsolver addresses this with a daily clean-up that automatically seeks out and removes dangling files from the table's storage location, keeping storage tidy and cost-effective.
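Conceptually, detecting dangling files reduces to a set difference: any file present in storage but not referenced by the table's metadata is an orphan. The sketch below illustrates that idea with hypothetical names; it is not Upsolver's internal implementation.

```python
def find_dangling(files_in_storage: set, files_referenced: set) -> set:
    # Stored but unreferenced by any snapshot's metadata → dangling.
    return files_in_storage - files_referenced

stored = {"data/a.parquet", "data/b.parquet", "data/tmp-1.parquet"}
referenced = {"data/a.parquet", "data/b.parquet"}  # from table metadata
print(sorted(find_dangling(stored, referenced)))  # → ['data/tmp-1.parquet']
```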

4. Data Retention

Upsolver provides configurable data retention policies, allowing users to define how long data should be retained based on date partitions. This process involves:

  • Retention Configuration: Setting a retention period based on a date partition in the table.

  • Automatic Deletion: Automatically deleting data in partitions that fall outside the retention period.

This ensures that outdated data is removed in a timely manner, helping manage storage and maintain compliance with data governance policies.
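The two retention steps above can be sketched as a cutoff computed from the configured period, followed by deletion of partitions older than that cutoff. The function and parameter names below are illustrative assumptions, not Upsolver's API.

```python
from datetime import date, timedelta

def partitions_to_delete(partition_dates, retention_days, today):
    """Return the date partitions that fall outside the retention period."""
    cutoff = today - timedelta(days=retention_days)
    # Partitions strictly older than the cutoff are eligible for deletion.
    return [d for d in partition_dates if d < cutoff]

parts = [date(2024, 6, 1), date(2024, 6, 10), date(2024, 6, 20)]
# With a 14-day window ending 2024-06-21, only the June 1 partition expires.
print(partitions_to_delete(parts, retention_days=14, today=date(2024, 6, 21)))
```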

By incorporating these optimization processes, Upsolver ensures that Iceberg tables are efficient, performant, and cost-effective, while also providing flexibility and control to customers.

Read the guide on how to Optimize Your Iceberg Tables to learn how to leverage this functionality in Upsolver.
