Hacker News

With regards to Spark, consider if you could possibly do the processing on a single machine. For many workloads, a single threaded program running on a laptop will beat an entire Spark cluster.

https://www.frankmcsherry.org/assets/COST.pdf
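To make the point concrete, here is a minimal single-threaded sketch of the kind of job people reach for Spark to do: a grouped sum, the plain-Python equivalent of `df.groupBy("key").sum("value")`. The data and function names are hypothetical; the point is just that a loop over a million rows on one core needs no cluster.

```python
# Single-threaded grouped aggregation -- the laptop version of a
# Spark groupBy/sum. Data below is synthetic, for illustration only.
from collections import defaultdict

def group_sum(rows):
    """Sum values per key over an iterable of (key, value) pairs."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

# Synthetic stand-in for a dataset: one million rows, ten keys.
rows = ((i % 10, 1) for i in range(1_000_000))
totals = group_sum(rows)
print(totals[0])  # each of the 10 keys sums to 100_000
```

On datasets in the single-digit-gigabyte range this kind of loop typically finishes in seconds, with none of the cluster's serialization, shuffle, or coordination overhead.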



This was a great point to make at the time, when people thought “my data has exceeded Excel’s row limit, therefore I should set up a Hadoop cluster and run Spark jobs against it.”

Since then … it’s become a bit of a meme, unfortunately. Definitely there still exist workloads assigned to Spark clusters that could run on a laptop, especially if the data happens to be there already. But the space as a whole provides immense value, both enabling jobs that really don’t fit on laptops, and moving the compute for laptop sized jobs to where the data happens to be.


The datasets they test against are 6 GB and 15 GB, and I get that those are the ones one of their references uses, but that’s clearly not multi-node territory. Also, as they point out, graph computation is not trivially parallelized. Spark is more for doing long-running transformations on independent data in a fault-tolerant way.



