Hacker News

With regards to Spark, consider if you could possibly do the processing on a single machine. For many workloads, a single threaded program running on a laptop will beat an entire Spark cluster.

https://www.frankmcsherry.org/assets/COST.pdf
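To make the point concrete, here is a minimal single-threaded sketch of the kind of job people reach for Spark to do: a grouped sum, the plain-Python equivalent of `df.groupBy("key").sum("value")`. The data and function names are hypothetical; the point is just that a loop over a million rows on one core needs no cluster.

```python
# Single-threaded grouped aggregation -- the laptop version of a
# Spark groupBy/sum. Data below is synthetic, for illustration only.
from collections import defaultdict

def group_sum(rows):
    """Sum values per key over an iterable of (key, value) pairs."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

# Synthetic stand-in for a dataset: one million rows, ten keys.
rows = ((i % 10, 1) for i in range(1_000_000))
totals = group_sum(rows)
print(totals[0])  # each of the 10 keys sums to 100_000
```

On datasets in the single-digit-gigabyte range this kind of loop typically finishes in seconds, with none of the cluster's serialization, shuffle, or coordination overhead.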



This was a great point to make at the time, when people thought “my data has exceeded Excel’s row limit, therefore I should set up a Hadoop cluster and run Spark jobs against it.”

Since then … it’s become a bit of a meme, unfortunately. Definitely there still exist workloads assigned to Spark clusters that could run on a laptop, especially if the data happens to be there already. But the space as a whole provides immense value, both enabling jobs that really don’t fit on laptops, and moving the compute for laptop sized jobs to where the data happens to be.


The datasets they test against are 6 GB and 15 GB, and I get that those are the ones one of their references uses, but that’s clearly not multi-node territory. Also, as they point out, graph computation is not trivially parallelized. Spark is more for doing long-running transformations on independent data in a fault-tolerant way.



