> The title seems to imply a practical bent, but the book reads more like a collection of ideas (which are important and interesting, but not really what engineers need to know). IMO the #1 skill for distributed computing is to be competent at BOTH programming a single computer and at system administration.
I think this is something where different authors will emphasize different aspects. My view is that understanding how to deal with the evolution of state within a system is crucial. Even systems that are not databases per se still have a dependency on how state is managed, because you want to be able to reason about how some specific answer to a computation was derived and what guarantees it comes with (from strong consistency to some alternative but hopefully precise definition). I figure there will be disagreement on whether this is important, and that's fine. There are other books.
That does bring up an interesting question: which books on distributed systems do you feel exhibit your preferred approach (free or paid)?
Re: the suggested topics:
Clusters of stateless web servers + single master. This is definitely a common setup, but you need very little if any distributed systems research to implement it.
Queues: I find the larger scale implications of queuing to be rather interesting (specifically, how cascading failures can be caused by an inadequate understanding of interactions between queues) but haven't found a good discussion beyond Google's findings that doing duplicate work often pays off as reduced 95th percentile latency.
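The duplicate-work idea mentioned above is often called hedged (or backup) requests: if the first replica hasn't answered within some delay, fire the same request at a second replica and take whichever answer arrives first. A minimal sketch of the pattern, where `fetch`, the replica names, and the 50 ms hedge threshold are all illustrative assumptions rather than anything from a specific system:

```python
import concurrent.futures
import time

def hedged_request(fetch, replicas, hedge_after=0.05):
    """Send the request to the first replica; if no reply arrives within
    `hedge_after` seconds, duplicate it to a second replica and return
    whichever answer comes back first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fetch, replicas[0])]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
        if not done:
            # Primary is slow past the threshold: hedge with a duplicate.
            futures.append(pool.submit(fetch, replicas[1]))
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()

# Illustrative replicas: "a" is pathologically slow, "b" is fast.
def fetch(replica):
    time.sleep(1.0 if replica == "a" else 0.01)
    return f"answer from {replica}"
```

A production version would also cancel the losing request (or the duplicate work it causes cascades into exactly the queueing interactions described above), but that bookkeeping is omitted here for brevity.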
MapReduce: There are many good books covering this topic in much more depth and specificity, so I didn't feel like I had that much to add. MapReduce does use the techniques described: beyond job assignment, the whole system rests on the DFS, which uses block-level replication and a coordination protocol to maintain metadata state.
I kind of assume people have had some exposure to the paradigm at this point and do address MapReduce a bit in the context of the CALM theorem, which notes that a much larger set of relational algebra operations can actually be executed safely without coordination. Another point might be that MapReduce is inefficient in that it provides too much fault tolerance for typical workloads and cluster sizes.
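The CALM intuition (monotone queries can be evaluated without coordination) can be shown with a toy example. The relation and the query here are made up for illustration: a monotone query's output only grows as input grows, so partitions can be merged in whatever order the network delivers them and the final answer is the same.

```python
from itertools import permutations

# Monotone query: selection + projection over (name, age) tuples.
def monotone_query(partition):
    return {name for (name, age) in partition if age >= 18}

# Three input partitions, as if spread across three nodes.
partitions = [
    [("ann", 34), ("bob", 12)],
    [("eve", 19)],
    [("dan", 17), ("fay", 40)],
]

# Simulate every possible network delivery order of the partitions.
results = set()
for order in permutations(partitions):
    out = set()
    for p in order:
        out |= monotone_query(p)
    results.add(frozenset(out))

# Every delivery order yields the same answer: no coordination needed.
assert len(results) == 1
```

A non-monotone operation such as set difference or a final count does not have this property: a late-arriving partition can retract or change the answer, which is exactly where coordination (or a barrier, as in MapReduce's shuffle) becomes necessary.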
Original GFS: the design has been largely superseded both by newer versions of HDFS (e.g. eliminating the single point of failure in the initial design) and Google's (unpublished?) internal equivalents. BTW, the original GFS relies on Chubby, which uses Paxos internally.
Napster, BitTorrent and BitCoin: peer-to-peer systems definitely deserve a more extensive treatment in a later version of the book. The issues here are different in that trust, efficiency and resiliency are more important and I didn't have the bandwidth to handle them in the book as it stands.
Thanks for your comment, and I hope this doesn't sound like a rebuttal - I just wanted to think through the topics you mentioned one by one.
> Clusters of stateless web servers + single master. This is definitely a common setup, but you need very little if any distributed systems research to implement it.
First, define "stateless". I would not characterize such a system as stateless at all. Even if you're not using sticky sessions (with cache servers/load balancers talking to each other for failover using fairly involved protocols), there's still state that's ephemeral (sessions) in your application server, as well as the bulk of the persistent state, which is provided with essentially "faith-based consistency" (consider a typical memcached cluster with clients doing consistent hashing, asynchronously replicated MySQL with failover, etc. -- in the case of a failure, neither availability nor consistency is guaranteed).
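Client-side consistent hashing of the kind memcached clients do can be sketched in a few lines. The class name, the `vnodes` count, and the server names are illustrative assumptions; the essential property is that keys map to the nearest server clockwise on a hash ring, so losing one server only remaps that server's arc of keys:

```python
import bisect
import hashlib

def _h(key):
    # Hash a string to a point on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal client-side consistent hashing: each server owns many
    virtual points on a hash ring; a key belongs to the server owning
    the first point at or after the key's hash (wrapping around)."""
    def __init__(self, servers, vnodes=100):
        self.ring = sorted(
            (_h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

Note what this scheme does not give you, which is the point above: when a node dies, clients silently remap its keys elsewhere, and if the node later returns, stale entries can reappear. Nothing in the protocol guarantees either availability or consistency across the failure.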
At the level of protocol design, the whole idea of stateless protocols (REST) vs. stateful ones (sticky load balancers + SOAP, CORBA, RMI, etc.) is by itself a big distributed systems topic.
A web browser talking to a web server is by definition a distributed system. I am typing this as fast as I can, praying not to get the "your link is invalid" error from HN right now -- this is a real example of a distributed system and of cache coherence/consistency/atomicity. Here's one paper that deals with just these sorts of questions (in the context of NFS): ftp://ftp.cs.berkeley.edu/ucb/sprite/papers/state.ps
> Re: Chubby and GFS
Original GFS doesn't rely on Chubby; BigTable does, however (for metadata). I believe newer versions (Colossus) by extension rely on Chubby, as they rely on BigTable.
F1/Spanner, however, use consensus and transactions far more than the others and are very interesting in this sense.
[Edit: more elaboration on distributed systems issues in a "stateless" cluster of web servers].