Upsolver's compaction process

This page describes how Upsolver performs file compaction for faster querying.

Upsolver continuously compacts table files in the background to provide fresh data and deal with the problems that arise from having many small files in the data lake. For a detailed explanation of the issue with small files and how Upsolver addresses this challenge, refer to this blog post. The compaction process is vital for improving query performance and optimizing data management.

Overview

Compaction is a gradual background process aimed at reorganizing written data files for faster querying. Although it takes time to complete, the data remains readily available and is included in query results throughout. Upsolver strives to create files of approximately 500MB each; however, compaction is bounded by partition boundaries, so files smaller than 500MB in one partition will not be merged with files from a different partition.
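
The sketch below illustrates this idea under simple assumptions: files within a single partition are greedily grouped into compaction targets of roughly 500MB, and partitions are never mixed. The function and names are hypothetical, not Upsolver's API.

```python
TARGET_FILE_SIZE = 500 * 1024 * 1024  # ~500 MB per compacted file (assumption: bytes)

def plan_compaction_groups(file_sizes_bytes):
    """Greedily pack one partition's files into groups close to the target size.

    Runs per partition: files from different partitions are never combined.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_bytes):
        if current and current_size + size > TARGET_FILE_SIZE:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Example: twenty 32 MB files in one partition are packed into ~500 MB groups.
print(plan_compaction_groups([32 * 1024 * 1024] * 20))
```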

Components

The compaction process consists of two main components that work collaboratively:

  1. Coordinator Task: This task determines how to compact/split files and keeps track of all related files.

  2. Compaction Task: This task rewrites the files.

The compaction process is continuous: the coordinator constantly checks whether there is anything to compact, and if so, the compaction tasks perform the necessary actions. If a compaction task fails, it can continue from approximately the same place, because the coordinator keeps track of the files needing compaction.
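
A minimal sketch of this coordinator/task split is shown below. The class and method names are illustrative assumptions, not Upsolver internals; the point is that the coordinator holds the list of files still needing compaction, so a restarted task can resume from roughly where it stopped.

```python
class Coordinator:
    """Tracks which files in each partition still need to be compacted."""

    def __init__(self):
        self.pending = {}  # partition -> files still awaiting compaction

    def plan(self, partition, files):
        # Record the work so a failed task can resume from the same list.
        self.pending[partition] = list(files)

    def mark_done(self, partition, compacted_files):
        done = set(compacted_files)
        self.pending[partition] = [f for f in self.pending.get(partition, []) if f not in done]


class CompactionTask:
    """Rewrites the files the coordinator has flagged for a partition."""

    def run(self, coordinator, partition):
        files = coordinator.pending.get(partition, [])
        for f in files:
            pass  # rewrite the file into the new compacted location (omitted)
        coordinator.mark_done(partition, files)


coordinator = Coordinator()
coordinator.plan("partition_date=2023-01-01", ["file_a", "file_b"])
CompactionTask().run(coordinator, "partition_date=2023-01-01")
print(coordinator.pending)  # {'partition_date=2023-01-01': []}
```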

File Path

The file path includes compaction_id = <int>; each time files undergo another compaction, the compaction ID increases. To avoid storing the same data multiple times, Upsolver deletes old compactions that are no longer needed once a newer compaction ID is ready. The deletion of old compaction partitions happens within the delete_old_compactions task.
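
The following sketch shows how such a layout can be interpreted: the newest compaction_id is selected, and older compaction prefixes become candidates for deletion once the newer one is ready. The exact path layout is an assumption based on the compaction_id = <int> convention above, and the helper is illustrative only.

```python
import re

paths = [
    "s3://bucket/table/partition_date=2023-01-01/compaction_id=1/file_a.parquet",
    "s3://bucket/table/partition_date=2023-01-01/compaction_id=2/file_b.parquet",
    "s3://bucket/table/partition_date=2023-01-01/compaction_id=2/file_c.parquet",
]

def compaction_id(path):
    match = re.search(r"compaction_id=(\d+)", path)
    return int(match.group(1)) if match else 0

latest = max(compaction_id(p) for p in paths)
# Older compaction locations are the ones removed by the delete_old_compactions task.
obsolete = [p for p in paths if compaction_id(p) < latest]
print(latest, obsolete)
```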

How Does Compaction Work?

  1. Upsolver decides to trigger a compaction for a specific partition.

  2. The compaction task starts compacting the files and writes them to the new compacted partition location.

  3. The output/job continues writing new data to both the old and new partition locations to ensure data consistency during compaction.

  4. After the compaction completes, Upsolver updates the partition location to the new compacted location and stops writing data to the old location.
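
The sketch below illustrates steps 2-4: while compaction is in progress, new data is written to both the old and new partition locations, and once compaction completes, the partition is pointed at the new location only. The class and locations are hypothetical, not Upsolver's implementation.

```python
class PartitionWriter:
    def __init__(self, old_location, new_location):
        self.old_location = old_location
        self.new_location = new_location
        self.compaction_in_progress = False
        self.active_location = old_location

    def write(self, batch):
        # During compaction, write to both locations so queries stay consistent
        # regardless of which location they read from.
        targets = [self.active_location]
        if self.compaction_in_progress:
            targets.append(self.new_location)
        for location in targets:
            print(f"writing {len(batch)} rows to {location}")

    def finish_compaction(self):
        # Step 4: switch the partition to the compacted location and stop
        # writing to the old one.
        self.active_location = self.new_location
        self.compaction_in_progress = False


writer = PartitionWriter("table/part/compaction_id=1", "table/part/compaction_id=2")
writer.compaction_in_progress = True
writer.write(["row1", "row2"])   # written to both locations
writer.finish_compaction()
writer.write(["row3"])           # written only to the new location
```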

Conditions That Trigger Compaction

  1. The partition has 60 files or more, the average file size is less than 32MB, and the last file written to the partition is older than 2 days.

  2. The partition has 1000 files or more, and the average file size is less than 32MB.

  3. The last file written to the partition is older than 35 days, and the average file size is less than 32MB.
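
These three conditions can be read as a single predicate, sketched below. The thresholds come from the list above; the function itself, its parameter names, and the way "older than" is measured are illustrative assumptions.

```python
from datetime import datetime, timedelta

MB = 1024 * 1024

def should_compact(file_count, avg_file_size_bytes, last_write_time, now=None):
    now = now or datetime.now()
    small_files = avg_file_size_bytes < 32 * MB  # shared by all three conditions
    age = now - last_write_time
    return small_files and (
        (file_count >= 60 and age > timedelta(days=2))   # condition 1
        or file_count >= 1000                            # condition 2
        or age > timedelta(days=35)                      # condition 3
    )

# A partition with 120 files averaging 10 MB, last written 3 days ago, triggers compaction.
print(should_compact(120, 10 * MB, datetime.now() - timedelta(days=3)))  # True
```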

Compaction and Upserts

Compaction is crucial for outputs that perform upserts. As new data arrives, Upsolver appends the data changes and uses Merge on Read to return the latest version of each row when queried. The compaction process then removes duplicate data, keeping only the latest rows and dropping deleted rows.
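
As a rough illustration of that deduplication step, the sketch below keeps only the latest version of each key and drops rows marked as deleted. The field names (key, version, deleted) are assumptions for the example, not Upsolver's schema.

```python
def compact_upserts(rows):
    """Keep the latest row per key and drop rows flagged as deleted."""
    latest = {}
    for row in rows:  # each row assumed to carry a monotonically increasing version
        key = row["key"]
        if key not in latest or row["version"] >= latest[key]["version"]:
            latest[key] = row
    return [row for row in latest.values() if not row.get("deleted", False)]


rows = [
    {"key": 1, "version": 1, "value": "a"},
    {"key": 1, "version": 2, "value": "b"},                   # newer version wins
    {"key": 2, "version": 1, "value": "c", "deleted": True},  # removed entirely
]
print(compact_upserts(rows))  # [{'key': 1, 'version': 2, 'value': 'b'}]
```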
