
This is a big deal in the database world, as Delta, Iceberg and Hudi mean that data is being stored in an open source format, often on S3.

It means that the storage and much of the processing are being standardised, so you can move between databases easily, and almost all tools will eventually be able to work with the same set of files in a transactionally sound way.

For instance, Snowflake could be writing to a file, a data scientist could be querying the data live from a Jupyter notebook, and ClickHouse could be serving user facing analytics against the same data with consistency guarantees.

If the business then decides to switch from Snowflake to Databricks, it isn’t such a big deal.

Right now, querying these formats on S3 isn’t quite as fast as a native ingestion would be, but every database vendor will be forced by the market to optimise for performance such that they tend towards the performance of natively ingested data.

It’s a great win for openness and open source and for businesses to have their data in open and portable formats.

Lakehouse has the same implications. Lots of companies have data lakes and data warehouses and end up copying data between the two. To query the same set of data and have just one system to manage is equally impactful.

It’s a very interesting time to be in the data engineering world.



Apache Arrow and Substrait have been working towards making this a reality. I see a future where executing a query can and will send plans to many different engines distributed across the cloud, but also locally on your own machine.


Real-time bidding on query execution? The more I think about it, the more I believe you actually have a viable business model here.


That’s a wildly interesting idea.

It opens up another market too: compatible, scalable storage. Sell shovels in a gold rush, and what better shovel than the substrate infrastructure that those bidding query engines would probably depend on?


If the queries can be executed by any provider, you are talking about a commodity product.

The business model of selling a commodity is wildly unlike the business model tech is in today.


The query execution might be commodity, but the purchasers will still need to store their data somewhere, and this somewhere will need to be able to service the bandwidth and requirements of the query execution providers.


It feels like you could just as well pack the runtime/engine into the job you are requesting? Am I wrong?


The point is more about creating interoperability between systems and, in turn, making them composable.

When there’s a common intermediate representation, you can pass those compute instructions around and execute them anywhere. And when there’s a shared memory format, data can pass from storage to engine without serialization/deserialization.

So it wouldn’t matter if the data is here or there, in this or that format: because the instructions are the same, the specific interface (Snowflake, MySQL, a local Parquet file, etc.) is irrelevant, which removes the need for glue code.


> " every database vendor will be forced by the market to optimise for performance such that they tend towards the performance of natively ingested data."

This assumes that their internal storage format has nothing to do with the decades of engineering infrastructure they built their business model around, and that they would simply give all that up and compete on their compute layer alone. Snowflake might as well shut up shop and return billions to the investors. Locking data into their ecosystem is their whole business model.

Is there a good example of an open standard forcing companies to give up their proprietary tech?


That's the natural evolution of most tech markets. When the tech is young, proprietary companies dominate because they can control the customer experience better and deliver functionality that is simply too complex for open solutions. As the technology matures, customers start demanding interoperability, reliability, better prices, and eventually some employees "defect" from one of the big companies and start the open standards that replace their ex-employer, or an outsider reads a paper and re-implements the technology from scratch.

> Is there a good example of an open standard forcing companies to give up their proprietary tech?

UNIX -> Linux, BSD

Oracle/Sybase -> MySQL/PostgreSQL

Symbolics/Lucid -> Common Lisp

Altair/Apple/Commodore/Atari -> IBM PC & clones

VMWare -> QEMU

Basically every tech that Google pioneered and then missed out on commercializing. Protobufs -> Avro/Parquet, MapReduce -> Hadoop, Flume -> Spark, Chubby -> Zookeeper, Borg -> Kubernetes, etc.


I’ll just point out on the Snowflake side, we’ve been very public saying we want Iceberg/Parquet to be at or as close to parity as possible with our native format. The value add is the platform, not lock-in. That also forces us to be the best on open formats, which IMO is also a good thing for everyone.

Disclaimer: I work at Snowflake literally on this with my team. :)


> we’ve been very public saying we want Iceberg/Parquet to be at or as close to parity as possible with our native format

That’s great to hear. Would this mean that external Iceberg tables would have the same performance as native tables? My impression from the parent comment was that, eventually, there would be no such thing as a ‘native format’. Really interested to see public statements by Snowflake to that effect; I would love to share them with my team.


> Snowflake might as well shut up shop and return billions to the investors.

I mean, we can dream right?

There’s a bunch of companies that I don’t believe deserve their status or valuation and Snowflake is one of them.





