Lance table format explained simply, stupid

TLDR (but stay for the animations!): Lance is a successor to Iceberg / Delta Lake, more optimized for random reads, and supports adding ad-hoc columns without needing to copy all the data.

It feels like a while since I last kept up with the big data world.

Some big things happened in 2025:

Iceberg V3 spec got released and added cool stuff like VARIANT.
turbopuffer announced vector search over object storage (similar to Quickwit).
Apache Fluss lets Flink manage real-time streams with tiering to object storage.
Datadog bought Quickwit.
Databricks bought Neon.

I’m noticing a theme here. If I write about you in a blog post, someone will buy you…

I don’t know how, but something way bigger flew completely under my radar, most likely because I was pretty busy building at $DAY_JOB (some pretty cool stuff, I must say).

This thing is called Lance. It’s a file format (like Apache Parquet), a table format (like Apache Iceberg), and a catalog spec (like Iceberg’s REST catalog spec).

To quickly get the gist of it, I used Claude to generate an animation comparing the Parquet and Lance file formats. It did a really good job, so after a few hours of iterating, reading more of Lance’s docs, and improving the text, I got something that looks good and will most likely be educational.

The Lance file format is similar to Parquet, but better optimized for random reads (such as WHERE id = 123), while still preserving Parquet’s performance on sequential reads.

Official docs here.

Something interesting to test is how Parquet would behave if we configured it to store each page as 64 KB instead of the default 1 MB 🤔.
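To get a feel for why page size matters for a point lookup, here is a back-of-the-envelope sketch. It is a toy model, not a benchmark: it assumes fixed-width, uncompressed 8-byte values and a reader that has to fetch a whole page to get one value out of it. The function name `read_amplification` is made up for this post.

```python
KB = 1024
MB = 1024 * KB


def read_amplification(value_size, page_size):
    """How many bytes are fetched per byte actually wanted.

    Toy assumption: the page is the smallest unit that can be read,
    so getting one value means fetching the whole page it lives in.
    """
    return page_size // value_size


for page_size in (1 * MB, 64 * KB):
    factor = read_amplification(value_size=8, page_size=page_size)
    print(f"page size {page_size // KB:>4} KB -> {factor}x read amplification")
```

In this model, shrinking pages from 1 MB to 64 KB cuts the bytes fetched per lookup by 16x. Real Parquet files add compression, encodings, and footer metadata on top, and smaller pages typically mean more per-page overhead, so an actual test is still needed.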

The Lance table format is similar to Iceberg, but allows adding columns ad hoc without rewriting all the existing data just to attach the new column’s value to every row, while still preserving Iceberg-style MVCC.
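Here is how I picture that working, as a toy model in plain Python. This is not Lance’s actual API or on-disk layout. It only assumes that a table version is a list of fragments, that each fragment can point to several data files holding different columns, and that every commit produces a new version. All names (`ToyTable`, `Fragment`, `DataFile`) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    columns: tuple  # names of the columns stored in this file


@dataclass(frozen=True)
class Fragment:
    num_rows: int
    files: tuple  # DataFile entries; together they hold the fragment's columns


class ToyTable:
    def __init__(self):
        self.storage = {}    # path -> {column: values}; stands in for object storage
        self.manifests = []  # manifests[i] holds the fragments of version i + 1

    def _commit(self, fragments):
        self.manifests.append(tuple(fragments))
        return len(self.manifests)  # the new version number

    def _current(self):
        return self.manifests[-1] if self.manifests else ()

    def _write(self, columns):
        path = f"data/{len(self.storage)}.file"
        self.storage[path] = {name: list(values) for name, values in columns.items()}
        return DataFile(path, tuple(columns))

    def _read_fragment(self, fragment):
        columns = {}
        for data_file in fragment.files:
            columns.update(self.storage[data_file.path])
        return columns

    def append(self, columns):
        num_rows = len(next(iter(columns.values())))
        fragment = Fragment(num_rows, (self._write(columns),))
        return self._commit(self._current() + (fragment,))

    def add_column(self, name, compute):
        # Write one small new file per fragment; existing files are untouched.
        new_fragments = []
        for fragment in self._current():
            values = compute(self._read_fragment(fragment))
            new_file = self._write({name: values})
            new_fragments.append(Fragment(fragment.num_rows, fragment.files + (new_file,)))
        return self._commit(new_fragments)

    def read(self, version):
        result = {}
        for fragment in self.manifests[version - 1]:
            for name, values in self._read_fragment(fragment).items():
                result.setdefault(name, []).extend(values)
        return result


table = ToyTable()
v1 = table.append({"id": [1, 2], "text": ["a", "bb"]})
v2 = table.append({"id": [3], "text": ["ccc"]})
files_before = set(table.storage)
v3 = table.add_column("text_len", lambda cols: [len(t) for t in cols["text"]])

print(table.read(v3))                      # latest version, with the new column
print(table.read(v2))                      # older version is still readable
print(set(table.storage) - files_before)   # only the new column's files were written
```

The point of the sketch: adding `text_len` writes one new file per fragment and a new manifest, while the files holding `id` and `text` are never rewritten, and readers pinned to an older version keep seeing the table as it was.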

Another great feature of Lance tables is that they also support indexes, such as B-tree, inverted (full-text search), and vector indexes (e.g. HNSW).

Official docs here.

Apparently there’s another open-source file format competing with Parquet, called Vortex, created by SpiralDB, which seems like a direct competitor to LanceDB.

These technologies only came about because of a need for multi-modal data lakes now that AI is so prevalent.

I wonder what other technologies will come from this AI software era.
