
I've been looking at Iceberg for a while, but in the end went with Delta Lake because it doesn't have a dependency on a catalog. It also has good support for reading and writing from it without needing Spark.

Does anyone know if Iceberg has plans to support similar use cases?



Iceberg has the Hadoop (filesystem) catalog, which likewise relies only on directories and files.
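For context, the filesystem catalog works because the table directory itself encodes the current state: a `metadata/version-hint.text` file holds the latest version number N, which points at `vN.metadata.json`. A simplified sketch of that lookup (the real catalog also handles missing hints and listing fallbacks):

```python
import os
import tempfile

def current_metadata_path(table_dir: str) -> str:
    """Resolve the newest metadata file the way Iceberg's filesystem
    catalog does: read metadata/version-hint.text, which holds the
    latest version number N, then point at vN.metadata.json."""
    hint = os.path.join(table_dir, "metadata", "version-hint.text")
    with open(hint) as f:
        version = int(f.read().strip())
    return os.path.join(table_dir, "metadata", f"v{version}.metadata.json")

# Demo against a throwaway directory laid out like an Iceberg table.
with tempfile.TemporaryDirectory() as table_dir:
    os.makedirs(os.path.join(table_dir, "metadata"))
    for v in (1, 2, 3):
        open(os.path.join(table_dir, "metadata", f"v{v}.metadata.json"), "w").close()
    with open(os.path.join(table_dir, "metadata", "version-hint.text"), "w") as f:
        f.write("3")
    print(current_metadata_path(table_dir))  # ends with v3.metadata.json
```

No catalog service is involved: any reader that can list the directory and read the hint file can find the table's current metadata.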

That said, a catalog (which Delta can also have) helps a lot to keep things tidy. For example, I can write a dataset with Spark, transform it with dbt and a query engine (such as Trino), and consume the resulting dataset with any client that supports Iceberg. With a catalog, all of that happens without having to register the dataset's location in each of these components.


Why don't you want a catalog? The SQL or REST catalogs are pretty light to set up. I have my eye on lakekeeper[0], but Polaris (from Snowflake) is a good option too.

PyIceberg is likely the easiest way to write without Spark.

0 - https://github.com/lakekeeper/lakekeeper
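A minimal sketch of the Spark-free PyIceberg write path, assuming a REST catalog already running at an illustrative URI and a `demo` namespace that already exists (names and config are placeholders, and the `schema=` shortcut accepting a PyArrow schema requires a recent PyIceberg):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Connect to a REST catalog; URI is illustrative.
catalog = load_catalog("default", **{
    "type": "rest",
    "uri": "http://localhost:8181",
})

# Build a small Arrow table and append it -- no Spark involved.
df = pa.table({
    "id": pa.array([1, 2, 3], pa.int64()),
    "name": pa.array(["a", "b", "c"]),
})
table = catalog.create_table_if_not_exists("demo.events", schema=df.schema)
table.append(df)

# Read it back with a plain table scan.
print(table.scan().to_arrow().num_rows)
```

This only needs the catalog endpoint and object-store credentials; every engine pointed at the same catalog then sees the table without per-client registration.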


We evaluated various REST catalog options and went with Open Catalog from Snowflake (a Polaris-based managed service that works independently of their data warehousing solution). Lakekeeper is nice - it's one of the few catalogs with fine-grained access control (FGAC) and table maintenance.

https://tower.dev/blog/picking-snowflake-open-catalog-as-a-m...


PyIceberg is nice, but we had to drop it because it lags behind the Java API and it's unclear when it will catch up. Depending on which features you need, it's worth checking the feature support first.


What are you using instead?



