> It is very easy to blurt out "well obviously we need twice the processing power!" but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.
I don't understand what you are saying. Are you talking about the time the request is buffered in some queue on average assuming they are arriving randomly? Or something like that?
Here is what I'm thinking. We are operating a hospital which does only one type of surgery, which lasts exactly an hour. (Presumably it is a veterinary practice for spherical cows.) A fully staffed operating room can operate on 24 spherical cows a day. If, due to a spherical cow jamboree, we expect more patients and set up a second operating theatre, we will be able to serve 48 of them a day. But we are still serving them for an hour each (because that is how long the operation takes).
Even if we are talking about latency: when 24 cows show up at the stroke of midnight to the one-operating-room hospital, they will each be served, on average, in 12.5 hours. Same average if 48 cows show up to the two-operating-room hospital.
So what am I thinking wrong here? Is "scale to twice the processing power" not the same as getting a second operating room? I'm not seeing where "we will actually be serving each request in half the time" comes from.
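Just to double-check my own arithmetic, here is a throwaway sketch (assuming all cows arrive at t=0 and every surgery takes exactly one hour):

```python
# Batch of cows arrives at midnight; rooms work through the backlog in order.
def avg_completion_hours(n_cows, n_rooms):
    # Cow i starts after i // n_rooms earlier rounds finish, then takes 1 hour itself.
    finish_times = [(i // n_rooms) + 1 for i in range(n_cows)]
    return sum(finish_times) / n_cows

print(avg_completion_hours(24, 1))  # 12.5 hours with one operating room
print(avg_completion_hours(48, 2))  # 12.5 hours with two operating rooms
```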
in queueing theory, you don't expect "operating rooms" to operate 24 hours per day - spherical patients may arrive with gaps, which leave the room idle for some time, but then a jamboree happens and it averages out
doubling the cow input doesn't mean each "burst" becomes twice as big - some of the new cows simply fall into periods that previously weren't used
thus the second wave of patients doesn't need a whole second copy of all the operating rooms - part of them get gobbled up by the idle timeslots of the resources that already exist
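here's a rough sketch of that meshing effect (made-up rates, Poisson arrivals, each cow occupying a room for an hour) - it asks how many rooms you'd need to cover 99% of moments, and doubling the load raises that number by noticeably less than 2x:

```python
import random

def rooms_for_p99(cows_per_hour, sim_hours=20_000, surgery_hours=1.0):
    """How many simultaneously busy rooms cover 99% of arrival instants,
    if cows arrive as a Poisson stream and each occupies a room for an hour."""
    t, busy_until, samples = 0.0, [], []
    while t < sim_hours:
        t += random.expovariate(cows_per_hour)               # next cow arrives
        busy_until = [end for end in busy_until if end > t]  # rooms still occupied
        busy_until.append(t + surgery_hours)                 # this cow takes a room
        samples.append(len(busy_until))
    samples.sort()
    return samples[int(0.99 * len(samples))]

print(rooms_for_p99(12))  # roughly 20 rooms at 12 cows/hour on average
print(rooms_for_p99(24))  # roughly 35 at double the load -- well short of 40
```

the averages do double, but the headroom above the average doesn't, and that headroom is exactly what the extra rooms are for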
(so it seems putting two stochastic processes on top of each other is not like putting two "solid" things on top of each other, right? intuitively they mesh, I guess their stacking height is their "expected value"?
and their worst case will be the sum of their worst cases, where there's no averaging, right? so again, intuitively, the larger a flow is the more dangerous it is, even if it's smooth, because if it backs up it fills queues/buffers faster, so to plan for extreme cases we still need "linear" thinking)
Kind of, yes. Stacking two stochastic processes simply adds up the expectation values but not the noise/dispersion/volatility. The variation adds as a sum of squares: standard deviations combine in quadrature, so the stacked process is relatively smoother than either part.
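A quick numeric illustration (made-up numbers, normal noise purely for convenience):

```python
import random, statistics

# Two independent "load per hour" processes, each with mean 10 and standard deviation 3.
a = [random.gauss(10, 3) for _ in range(100_000)]
b = [random.gauss(10, 3) for _ in range(100_000)]
stacked = [x + y for x, y in zip(a, b)]

print(statistics.mean(stacked))   # ~20: the expectation values simply add
print(statistics.stdev(stacked))  # ~4.24 = sqrt(3**2 + 3**2), not 6: the noise adds in quadrature
```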
If you push utilisation towards 1, what you essentially do is push the next "free" slot farther and farther into the future. That means you always buy higher utilisation with longer latency (at least in the upper bound). But the good thing is: if the numbers are large enough, the maximum latency grows more slowly with utilisation.
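To put numbers on buying utilisation with latency, assume the textbook single-server M/M/1 model (Poisson arrivals, exponentially distributed surgeries averaging an hour) - the model is an assumption, but the shape of the curve is the point:

```python
# Average time in the hospital (waiting + surgery) for one M/M/1 operating room
# with a 1-hour mean surgery, as utilisation creeps towards 1.
service_hours = 1.0
for utilisation in (0.5, 0.8, 0.9, 0.95, 0.99):
    time_in_system = service_hours / (1 - utilisation)  # W = (1/mu) / (1 - rho)
    print(f"utilisation {utilisation:.2f} -> {time_in_system:6.1f} hours per cow")
```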
> So what am I thinking wrong here? Is "scale to twice the processing power" not the same as getting a second operating room? I'm not seeing where "we will actually be serving each request in half the time" comes from.
Single Core vs Multi Core (ish).
With a single thread you must work twice as fast to handle the increased load, which also means each piece of work is done in half the time.
With multithreading you can shuffle two units of work out at the same time, so it's twice the load but the same per-unit speed.
To go back to the cow analogy: rather than adding more rooms (more threads), you give the surgeon a powersaw and they work twice as fast.
So if you scale up the processing power per thread, the time goes down; if you scale the processing power by adding threads (cores), the time stays the same (ish).
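Here is a small simulation sketch of the three options, assuming Poisson arrivals and exponential surgery times (the rates are made up; only the comparison matters): the original room under the original load, a powersaw room under double load, and two ordinary rooms under double load.

```python
import random

def avg_time_in_system(arrival_rate, service_rate, n_servers, n_jobs=200_000):
    """Average hours a cow spends in the hospital (queue + surgery) for a FIFO
    queue with Poisson arrivals and exponentially distributed surgeries."""
    t, total = 0.0, 0.0
    free_at = [0.0] * n_servers                    # when each room next frees up
    for _ in range(n_jobs):
        t += random.expovariate(arrival_rate)      # next cow arrives
        room = min(range(n_servers), key=lambda i: free_at[i])
        start = max(t, free_at[room])              # wait if that room is busy
        free_at[room] = start + random.expovariate(service_rate)
        total += free_at[room] - t                 # time from arrival to discharge
    return total / n_jobs

print(avg_time_in_system(0.75, 1.0, 1))  # original room, original load:    ~4 hours
print(avg_time_in_system(1.50, 2.0, 1))  # powersaw room, double load:      ~2 hours (half)
print(avg_time_in_system(1.50, 1.0, 2))  # two ordinary rooms, double load: ~2.3 hours
```

The faster single surgeon gives the lowest time per cow. The two-room setup keeps each surgery at an hour, though the shared queue still drops the total time below the original - which is why the "stays the same" is only (ish).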
> To go back to the cow analogy: rather than adding more rooms (more threads), you give the surgeon a powersaw and they work twice as fast.
I was thinking that maybe that is what we are talking about, but I convinced myself otherwise. Surely we don't need this much math to show that if we cut the processing time in half, the processing time will be cut in half. If that is all we are saying, I guess I can accept it as quite trivial.
Cut the processing time in half while doubling the load!
The average cow, pre-jamboree, spends one hour in the hospital, including time waiting for a slot in the OR. Then you give the surgeon a power saw that allows him to complete the job in half the time, but he also gets twice as many cows to work on.
Most people's intuition would tell them the cows would still spend an hour in the hospital (the doubling in work rate canceled by the doubling in work amount), but actually now it takes half an hour -- regardless of how swamped the surgeon was initially.
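In the single-server M/M/1 picture (an assumed model, but it's the standard one behind this claim) the time in the system is W = 1/(mu - lambda), so doubling both the service rate and the arrival rate halves W no matter how swamped you start out:

```python
# Hours in the hospital before and after giving the surgeon a powersaw (2x mu)
# AND doubling the cow arrivals (2x lambda), per the M/M/1 formula W = 1/(mu - lambda).
mu = 1.0                                    # one cow per hour without the saw
for rho in (0.5, 0.8, 0.95, 0.99):          # how swamped the surgeon is initially
    lam = rho * mu
    before = 1 / (mu - lam)
    after = 1 / (2 * mu - 2 * lam)          # = before / 2, always
    print(f"initial utilisation {rho:.2f}: {before:6.1f} h -> {after:6.1f} h")
```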