
I've had to test out various networked filesystems this year for a few use cases (satellite/geo) at multi-petabyte scale. Some of my thoughts:

* JuiceFS - Works well, though the high-performance option has limited use cases where privacy concerns matter; the open source version is slower. The metadata backend selection really matters if you are tuning for latency (see the metadata sketch after this list).

* Lustre - Heavily optimised for latency. Gets very expensive if you need more bandwidth, as pricing is tiered and tied to volume sizes. Managed solutions are available pretty much everywhere.

* EFS - Surprisingly good these days, but still insanely expensive. Useful for small amounts of data (a few terabytes).

* FlexFS - An interesting beast. It murders on bandwidth/cost, but slightly loses on latency-sensitive operations. Great if you have petabyte-scale data and need to process it in parallel, but it struggles when your tooling does many small unbuffered writes (see the small-writes sketch after this list).
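
On the metadata point above, a rough sketch of the kind of metadata-heavy microbenchmark I mean: it just creates and stats a pile of tiny files, so per-operation metadata latency dominates the runtime. The mount path and file count are placeholders, not tied to any particular filesystem.

    # Sketch: metadata-heavy microbenchmark. MOUNT and N are placeholders.
    # Creates and stats many tiny files, so per-op metadata latency dominates.
    import os
    import time

    MOUNT = "/mnt/fs-under-test"   # assumption: filesystem mounted here
    N = 10_000

    start = time.monotonic()
    for i in range(N):
        path = os.path.join(MOUNT, f"meta_{i}.txt")
        with open(path, "w") as f:
            f.write("x")           # 1-byte payload: cost is almost all metadata
        os.stat(path)
    elapsed = time.monotonic() - start
    print(f"{N} create+stat ops in {elapsed:.1f}s ({elapsed / N * 1000:.2f} ms/op)")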
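
On the small unbuffered writes point, a minimal sketch contrasting many tiny unbuffered writes with the same bytes pushed through a user-space buffer. The mount path, record count and payload size are placeholders; the point is that on a network filesystem each unbuffered write can become its own round trip.

    # Sketch: many tiny unbuffered writes vs. the same bytes through a buffer.
    # MOUNT, RECORDS and PAYLOAD are placeholders, not tied to any one FS.
    import os
    import time

    MOUNT = "/mnt/fs-under-test"
    RECORDS = 50_000
    PAYLOAD = b"x" * 64            # 64-byte records

    def timed(label, fn):
        t0 = time.monotonic()
        fn()
        print(f"{label}: {time.monotonic() - t0:.2f}s")

    def unbuffered():
        # buffering=0 forces one write() per record; on a network FS that can
        # mean one round trip per 64 bytes.
        with open(os.path.join(MOUNT, "unbuffered.bin"), "wb", buffering=0) as f:
            for _ in range(RECORDS):
                f.write(PAYLOAD)

    def buffered():
        # Default buffering coalesces records into large writes, which is much
        # friendlier to bandwidth-oriented backends.
        with open(os.path.join(MOUNT, "buffered.bin"), "wb") as f:
            for _ in range(RECORDS):
                f.write(PAYLOAD)

    timed("unbuffered 64B writes", unbuffered)
    timed("buffered 64B writes", buffered)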

Did you happen to look into CephFS? CERN (the folks that operate the Large Hadron Collider) use it to store ~30PB of scientific data, and their analysis cluster serves ~30GB/s of reads.

Sure. The use case I have requires elastic storage and elastic compute, so CephFS really isn't a good fit in a cloud environment for that case; it would get prohibitively expensive.

Ceph is more something you build your own cloud with than something you run on someone else's cloud.

Nothing around content addressable storage? Has anyone used something like IPFS / Kubo in production at that kind of scale?

(for those who don't know IPFS, I find the original paper fascinating: https://arxiv.org/pdf/1407.3561)
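
For anyone unfamiliar, the core idea of content addressing is simple: the key for a block is a hash of its bytes, so identical content dedups itself and any fetched block can be verified against its key. A toy sketch in Python (not the real IPFS CID format, just the concept; `store` is a stand-in for a block store):

    # Toy sketch of content addressing: key = hash(bytes). Not the real IPFS
    # CID format, just the idea. `store` stands in for a block store.
    import hashlib

    store = {}

    def put(data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        store[key] = data          # idempotent: same content, same key
        return key

    def get(key: str) -> bytes:
        data = store[key]
        assert hashlib.sha256(data).hexdigest() == key  # self-verifying fetch
        return data

    cid = put(b"some satellite tile")
    print(cid, get(cid) == b"some satellite tile")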


The latency and bandwidth really aren't there for HPC.