The Basics of Compaction — Bin Packing Your Data for Efficiency

In the first post of this series, we explored how Apache Iceberg tables degrade when left unoptimized. Now it’s time to look at the most foundational optimization technique: compaction.

Compaction is the process of merging small files into larger ones to reduce file system overhead and improve query performance. In Iceberg, this usually takes the form of bin packing — grouping smaller files together so they align with an optimal size target.

Why Bin Packing Matters

Query engines like Dremio, Trino, and Spark operate more efficiently when reading a smaller number of larger files instead of a large number of tiny files. Every file adds cost:

  • It triggers an I/O request
  • It needs to be tracked in metadata
  • It increases planning and scheduling complexity

By merging many small files into fewer large files, compaction directly addresses:

  • The small file problem
  • Metadata bloat in manifests
  • Inefficient scan patterns

How Standard Compaction Works

A typical Iceberg compaction job involves:

  1. Scanning the table to identify small files below a certain threshold.
  2. Reading and coalescing records from multiple small files within a partition.
  3. Writing out new files targeting an optimal size (commonly 128MB–512MB per file).
  4. Creating a new snapshot that references the new files and drops the older ones.

This process can be orchestrated using:

  • Apache Spark with Iceberg’s RewriteDataFiles action (also exposed as the rewrite_data_files SQL procedure)
  • Dremio with its OPTIMIZE command

Example: Spark Action

import org.apache.iceberg.spark.actions.SparkActions

// `table` is an org.apache.iceberg.Table loaded from your catalog
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .option("target-file-size-bytes", (128L * 1024 * 1024).toString) // 128 MB
  .execute()

This will identify and bin-pack small files across partitions, replacing them with larger files.
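
If you would rather drive this from SQL, the same bin-pack rewrite is available through Iceberg’s rewrite_data_files Spark procedure. A minimal sketch, assuming a catalog named my_catalog and a table db.events (both hypothetical names):

// Equivalent bin-pack rewrite via the rewrite_data_files Spark procedure
spark.sql(
  """CALL my_catalog.system.rewrite_data_files(
    |  table => 'db.events',
    |  options => map('target-file-size-bytes', '134217728')
    |)""".stripMargin)

Either entry point commits the same kind of replace snapshot, so you can schedule whichever one fits your existing orchestration.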

Tips for Running Compaction

  • Target file size: Match your engine’s ideal scan size. 128MB or 256MB often works well.

  • Partition scope: You can compact per partition to avoid touching the entire table (see the filter sketch after this list).

  • Job parallelism: Tune how much of the rewrite runs in parallel (for example, how many file groups are rewritten at once) so large volumes compact efficiently.

  • Avoid overlap: If streaming ingestion is running, compaction jobs should avoid writing to the same partitions concurrently (we’ll cover this in Part 3).
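
As referenced in the partition-scope tip above, the rewrite action accepts a filter expression so that only matching partitions are rewritten. A minimal sketch, assuming the table is partitioned by a hypothetical event_date column:

import org.apache.iceberg.expressions.Expressions
import org.apache.iceberg.spark.actions.SparkActions

// Compact only the 2024-06-01 partition rather than the whole table
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .filter(Expressions.equal("event_date", "2024-06-01"))
  .option("target-file-size-bytes", (128L * 1024 * 1024).toString)
  .execute()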

When Should You Run It?

That depends on:

  • Ingestion frequency: Frequent writes = more small files = more frequent compaction

  • Query behavior: If queries touch recently ingested data, compact often

  • Table size and storage costs: The larger the table, the more benefit from compaction

In many cases, a daily or hourly schedule works well. Some platforms support event-driven compaction based on file count or size thresholds.
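
If your platform does not offer built-in triggers, you can approximate event-driven compaction by polling Iceberg’s files metadata table and only launching a rewrite once the small-file count crosses a threshold. A rough sketch, assuming a hypothetical my_catalog.db.events table, a 32 MB cutoff for “small,” and an arbitrary trigger of 100 files:

import org.apache.iceberg.spark.actions.SparkActions

// Count data files under 32 MB using the table's `files` metadata table
val smallFileCount = spark.sql(
  """SELECT count(*) AS n
    |FROM my_catalog.db.events.files
    |WHERE file_size_in_bytes < 32 * 1024 * 1024""".stripMargin)
  .first().getLong(0)

// Only pay for a compaction run when there is enough to merge
if (smallFileCount > 100) {
  SparkActions.get(spark).rewriteDataFiles(table).execute()
}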

Tradeoffs

While compaction boosts performance, it also:

  • Consumes compute and I/O resources

  • Temporarily increases storage (until old files are expired)

  • Can interfere with concurrent writes if not carefully scheduled

That’s why timing and scope matter—a theme we’ll return to later in this series.
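
On the storage point in particular: the files replaced by compaction stay on disk until the snapshots that reference them are expired. A minimal sketch of pairing compaction with snapshot expiration, assuming a seven-day retention window is acceptable for your table:

import java.util.concurrent.TimeUnit
import org.apache.iceberg.spark.actions.SparkActions

// Expire snapshots older than 7 days so files replaced by compaction can actually be deleted
val cutoffMillis = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)
SparkActions.get(spark)
  .expireSnapshots(table)
  .expireOlderThan(cutoffMillis)
  .execute()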

Up Next

Now that you understand standard compaction, the next challenge is applying it without interrupting streaming workloads. In Part 3, we’ll explore techniques to make compaction faster, safer, and more incremental for real-time pipelines.
