Iceberg Cloud Storage Breakdown

This page describes the cloud storage layout of Iceberg tables. It explains which folders exist and what the data within them is used for.

Background

Iceberg tables managed by Upsolver are made up of two types of files:

  1. Iceberg Table Files - These files make up the Iceberg table itself and are read by query engines when querying the table. They include metadata files and data files.

  2. Upsolver Files - While ingesting data into an Iceberg table or running table optimization tasks, Upsolver creates intermediate, state, and statistics files. These are used to keep track of internal processes and to display information in the UI.

The Upsolver files may be inside the Iceberg table root folder or in a dedicated location, depending on the job/table settings.

Below is a breakdown of the folders.

Folder Structure Overview

<iceberg_table_root>/data/

  • Purpose: Stores the Iceberg table's data files.

  • Contents: Includes data files from both the latest snapshot and older snapshots, which are required for Time Travel queries.

  • Retention Control: The retention of historical snapshots can be managed via the following table properties:

    • history.expire.max-snapshot-age-ms: Maximum age of snapshots before they expire.

    • history.expire.min-snapshots-to-keep: Minimum number of snapshots to retain.
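How the two properties interact can be illustrated with a simplified sketch. It assumes a linear snapshot history and that the minimum-count rule takes precedence over the age rule. The names Snapshot and snapshots_to_expire are hypothetical and are not part of any Upsolver or Iceberg API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int  # commit time in epoch milliseconds


def snapshots_to_expire(snapshots, now_ms, max_snapshot_age_ms, min_snapshots_to_keep):
    """Return IDs of snapshots eligible for expiration (illustrative only).

    A snapshot is eligible when it is older than max_snapshot_age_ms AND it is
    not among the min_snapshots_to_keep most recent snapshots.
    """
    # Order newest first so that the rank tells us how recent a snapshot is.
    ordered = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)
    expired = []
    for rank, snap in enumerate(ordered):
        too_old = now_ms - snap.timestamp_ms > max_snapshot_age_ms
        protected = rank < min_snapshots_to_keep
        if too_old and not protected:
            expired.append(snap.snapshot_id)
    return expired
```

Under these assumptions, raising history.expire.min-snapshots-to-keep can keep old data files in the data/ folder even after they exceed the maximum age.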

<iceberg_table_root>/metadata/

  • Purpose: Stores the Iceberg table's metadata files.

  • Contents: As with the data/ folder, this includes metadata from both the latest snapshot and older snapshots, which are required for Time Travel queries.

  • Retention Control: Managed by the same properties used for data retention: history.expire.max-snapshot-age-ms and history.expire.min-snapshots-to-keep.

<upsolver_storage_location>/tables/<table_id>/dangling_files_backup/

  • Purpose: Holds orphaned files found in the data or metadata folders that are not associated with any Iceberg snapshot.

  • Process: Upsolver periodically (about once per day) checks for such files. If found, they are moved to this backup folder from the table's root location.

  • Typical Size: Dangling files are uncommon under normal circumstances, so this folder usually does not contain much data.

  • Retention: By default, files are retained for 7 days before permanent deletion. A file is considered orphaned if it is not used by any snapshot and is at least 3 days old.

  • Recovery: If a file is incorrectly identified as orphaned, it can be restored by moving it back to the table's root folder.

  • Customization: Retention settings can be changed, but reducing durations decreases recovery capability. Contact Upsolver Support for assistance.
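The default thresholds described above can be sketched as two small checks. This is an illustration of the stated rules, not Upsolver's implementation: the function names are hypothetical, and measuring a file's age from its last-modified time is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Defaults taken from the description above.
ORPHAN_MIN_AGE = timedelta(days=3)    # minimum age before a file can be an orphan
BACKUP_RETENTION = timedelta(days=7)  # time kept in dangling_files_backup/


def is_orphan(path, last_modified, referenced_paths, now):
    """True if no snapshot references the file and it is at least 3 days old."""
    unreferenced = path not in referenced_paths
    old_enough = now - last_modified >= ORPHAN_MIN_AGE
    return unreferenced and old_enough


def backup_expired(moved_at, now):
    """True if a backed-up file has passed the 7-day retention window."""
    return now - moved_at >= BACKUP_RETENTION


# Example with hypothetical file names.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
referenced = {"data/a.parquet"}
candidate_is_orphan = is_orphan(
    "data/b.parquet", now - timedelta(days=4), referenced, now
)
```

The minimum age guards against treating files from an in-progress write as orphans before their snapshot is committed.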

<upsolver_storage_location>/tables/<table_id>/expired_files_backup/

  • Purpose: Stores files that are no longer referenced by any Iceberg table snapshots due to snapshot expiration.

  • Process: When expiring snapshots, Upsolver first removes them from the Iceberg table metadata and then moves the files related to those snapshots into this folder.

  • Typical Size: If new data is constantly streaming into the table, new snapshots are constantly created and old ones expired. The size of this folder depends on the volume of data belonging to the expired snapshots.

  • Retention: Files are retained here for 7 days before permanent deletion.

  • Customization: These settings can be changed, but shorter retention periods may reduce recovery flexibility. Contact Upsolver Support for changes.

<upsolver_storage_location>/tables/<table_id>/used_files_index/

  • Purpose: An index of files used by the Iceberg table. This index is maintained by Upsolver to help identify which files are orphaned and which may be safely deleted when snapshots expire.

  • Retention: Indefinite. A retention policy will be added in the future.

<upsolver_storage_location>/tables/<table_id>/compaction_results/

  • Purpose: Stores details and information about completed compaction tasks. Each file in this folder belongs to a single compaction shard.

  • Retention: Indefinite. A retention policy will be added in the future.

Static Files

<upsolver_storage_location>/tables/<table_id>/used_files_groupings.json.gz

  • Purpose: A state file used while building the used_files_index.

<upsolver_storage_location>/tables/<table_id>/iceberg_coordinator.json

  • Purpose: A state file used for planning which compactions to run.

<upsolver_storage_location>/tables/<table_id>/recent_compactions.json

  • Purpose: Contains a list of recent compactions, which is used to display compaction statistics in the frontend and system tables.

<upsolver_storage_location>/tables/<table_id>/statistics.json

  • Purpose: Contains Iceberg table statistics that are periodically collected.

  • Usage: These statistics are displayed in the frontend for monitoring table performance.

<upsolver_storage_location>/inputs/<job_id>/

  • Purpose: Contains files related to the job loading data into a table.

  • Retention: Most of these files are ephemeral and are deleted once the data is loaded and committed to the table. The exception is the metadata folder inside this folder, which is not ephemeral; see below for more details.

  • Folder Size: The size of this folder stabilizes at a level that depends on the volume of data being streamed into the table.

To find out which job a specific job_id refers to, you can query the system.information_schema.jobs table.

<upsolver_storage_location>/inputs/<job_id>/metadata/

  • Purpose: Stores statistics about data written to the table. This metadata is used by the system to discover schema information and by the frontend and system tables to display data statistics.

<upsolver_storage_location> may be the same as <iceberg_table_root> or a dedicated location. This can be controlled via job/table settings such as INTERMEDIATE_STORAGE_LOCATION and INTERMEDIATE_STORAGE_CONNECTION. By default, new tables should not place <upsolver_storage_location> inside the table root. However, older tables/jobs were created this way by default.
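As a quick reference, the layout described on this page can be expressed as a small path builder. This is a minimal sketch: the function name upsolver_paths and the example bucket, table ID, and job ID are hypothetical, and only the folder and file names come from this page.

```python
def upsolver_paths(storage_location, table_id, job_id):
    """Build the Upsolver-managed paths for one table and one job."""
    base = storage_location.rstrip("/")
    table_dir = f"{base}/tables/{table_id}"
    job_dir = f"{base}/inputs/{job_id}"
    return {
        # Folders
        "dangling_files_backup": f"{table_dir}/dangling_files_backup/",
        "expired_files_backup": f"{table_dir}/expired_files_backup/",
        "used_files_index": f"{table_dir}/used_files_index/",
        "compaction_results": f"{table_dir}/compaction_results/",
        # Static files
        "used_files_groupings": f"{table_dir}/used_files_groupings.json.gz",
        "iceberg_coordinator": f"{table_dir}/iceberg_coordinator.json",
        "recent_compactions": f"{table_dir}/recent_compactions.json",
        "statistics": f"{table_dir}/statistics.json",
        # Job inputs
        "job_inputs": f"{job_dir}/",
        "job_metadata": f"{job_dir}/metadata/",
    }


# Hypothetical values for illustration only.
paths = upsolver_paths("s3://example-bucket/upsolver/", "my_table_id", "my_job_id")
```

When <upsolver_storage_location> is the table root, the same builder applies with the table root passed as storage_location.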
