The Basics of Compaction — Bin Packing Your Data for Efficiency

In the first post of this series, we explored how Apache Iceberg tables degrade when left unoptimized. Now it’s time to look at the most foundational optimization technique: compaction.

Compaction is the process of merging small files into larger ones to reduce file system overhead and improve query performance. In Iceberg, this usually takes the form of bin packing — grouping smaller files together so they align with an optimal size target.

Why Bin Packing Matters

Query engines like Dremio, Trino, and Spark operate more efficiently when reading a smaller number of larger files instead of a large number of tiny files. Every file adds cost:

  • It triggers an I/O request
  • It needs to be tracked in metadata
  • It increases planning and scheduling complexity

By merging many small files into fewer large files, compaction directly addresses:

  • The small file problem
  • Metadata bloat in manifests
  • Inefficient scan patterns

How Standard Compaction Works

A typical Iceberg compaction job involves:

  1. Scanning the table to identify small files below a certain threshold.
  2. Reading and coalescing records from multiple small files within a partition.
  3. Writing out new files targeting an optimal size (commonly 128MB–512MB per file).
  4. Creating a new snapshot that references the new files and drops the older ones.

This process can be orchestrated using:

  • Apache Spark with Iceberg’s RewriteDataFiles action (also exposed as the rewrite_data_files SQL procedure)
  • Dremio with its OPTIMIZE command

Example: Spark Action

import org.apache.iceberg.spark.actions.SparkActions

// `table` is an org.apache.iceberg.Table loaded from your catalog
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .option("target-file-size-bytes", (128L * 1024 * 1024).toString) // 128 MB
  .execute()

This will identify and bin-pack small files across partitions, replacing them with larger files.
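
If you would rather drive this from SQL, the same bin-pack rewrite is available through Iceberg’s rewrite_data_files Spark procedure. A minimal sketch, assuming a catalog named my_catalog and a table db.events (both hypothetical names):

// Equivalent bin-pack rewrite via the rewrite_data_files Spark procedure
spark.sql(
  """CALL my_catalog.system.rewrite_data_files(
    |  table => 'db.events',
    |  options => map('target-file-size-bytes', '134217728')
    |)""".stripMargin)

Either entry point commits the same kind of replace snapshot, so you can schedule whichever one fits your existing orchestration.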

Tips for Running Compaction

  • Target file size: Match your engine’s ideal scan size. 128MB or 256MB often works well.

  • Partition scope: You can compact per partition to avoid touching the entire table (see the filter sketch after this list).

  • Job parallelism: Tune how much of the rewrite runs in parallel (for example, how many file groups are rewritten at once) so large volumes compact efficiently.

  • Avoid overlap: If streaming ingestion is running, compaction jobs should avoid writing to the same partitions concurrently (we’ll cover this in Part 3).
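
As referenced in the partition-scope tip above, the rewrite action accepts a filter expression so that only matching partitions are rewritten. A minimal sketch, assuming the table is partitioned by a hypothetical event_date column:

import org.apache.iceberg.expressions.Expressions
import org.apache.iceberg.spark.actions.SparkActions

// Compact only the 2024-06-01 partition rather than the whole table
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .filter(Expressions.equal("event_date", "2024-06-01"))
  .option("target-file-size-bytes", (128L * 1024 * 1024).toString)
  .execute()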

When Should You Run It?

That depends on:

  • Ingestion frequency: Frequent writes = more small files = more frequent compaction

  • Query behavior: If queries touch recently ingested data, compact often

  • Table size and storage costs: The larger the table, the more benefit from compaction

In many cases, a daily or hourly schedule works well. Some platforms support event-driven compaction based on file count or size thresholds.
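
If your platform does not offer built-in triggers, you can approximate event-driven compaction by polling Iceberg’s files metadata table and only launching a rewrite once the small-file count crosses a threshold. A rough sketch, assuming a hypothetical my_catalog.db.events table, a 32 MB cutoff for “small,” and an arbitrary trigger of 100 files:

import org.apache.iceberg.spark.actions.SparkActions

// Count data files under 32 MB using the table's `files` metadata table
val smallFileCount = spark.sql(
  """SELECT count(*) AS n
    |FROM my_catalog.db.events.files
    |WHERE file_size_in_bytes < 32 * 1024 * 1024""".stripMargin)
  .first().getLong(0)

// Only pay for a compaction run when there is enough to merge
if (smallFileCount > 100) {
  SparkActions.get(spark).rewriteDataFiles(table).execute()
}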

Tradeoffs

While compaction boosts performance, it also:

  • Consumes compute and I/O resources

  • Temporarily increases storage (until old files are expired)

  • Can interfere with concurrent writes if not carefully scheduled

That’s why timing and scope matter—a theme we’ll return to later in this series.
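
On the storage point in particular: the files replaced by compaction stay on disk until the snapshots that reference them are expired. A minimal sketch of pairing compaction with snapshot expiration, assuming a seven-day retention window is acceptable for your table:

import java.util.concurrent.TimeUnit
import org.apache.iceberg.spark.actions.SparkActions

// Expire snapshots older than 7 days so files replaced by compaction can actually be deleted
val cutoffMillis = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)
SparkActions.get(spark)
  .expireSnapshots(table)
  .expireOlderThan(cutoffMillis)
  .execute()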

Up Next

Now that you understand standard compaction, the next challenge is applying it without interrupting streaming workloads. In Part 3, we’ll explore techniques to make compaction faster, safer, and more incremental for real-time pipelines.
