In a typical LRU cache every read is also a write, in order to maintain the access order. In a concurrent cache those mutations cause contention: the skewed access distribution means popular entries are reordered most often, so threads serialize on the same atomic operations. Concurrent caches avoid this work by, e.g., sampling requests into lossy ring buffers and replaying those reorderings under a try-lock. This is what Java's Caffeine cache does to reach 940M reads/s at 16 threads (vs 2.3B/s for an unbounded map). At that point other system overhead, like network I/O, will dominate the profile, so trying to rearrange the hash table to dynamically optimize the data layout for hot items seems unnecessary. As you suggest, one would probably be better served by a SwissTable-style approach that optimizes the hash table's data layout and instruction mix rather than mucking with recency-aware structural adjustments.
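A toy sketch of that sampling idea (not Caffeine's actual implementation; the class and sizes here are made up for illustration): reads record keys into a small fixed-size ring buffer and simply drop samples under contention, while a try-lock drain replays them into the recency order so readers never block on reordering work.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: a lossy read buffer feeding an LRU order.
final class SampledLru<K> {
    private static final int SIZE = 16;            // power-of-two ring buffer
    private static final int MASK = SIZE - 1;

    private final Object[] buffer = new Object[SIZE];
    private final AtomicInteger writeIndex = new AtomicInteger();
    private volatile int drained;                  // next slot to replay; mutated only under lock
    private final ReentrantLock lock = new ReentrantLock();
    private final LinkedHashSet<K> order = new LinkedHashSet<>(); // head = LRU victim

    // Called on every cache read: record the key, dropping the sample when the
    // buffer looks full or the CAS races (lossy), rather than making readers wait.
    void recordAccess(K key) {
        int i = writeIndex.get();
        if (i - drained >= SIZE) {
            tryDrain();                            // buffer full: attempt a drain, drop this sample
            return;
        }
        if (writeIndex.compareAndSet(i, i + 1)) {
            buffer[i & MASK] = key;                // a lost CAS race simply drops the sample
        }
    }

    // Replay buffered accesses into the LRU order, but only if the lock is
    // free; the reordering is best-effort and never blocks readers.
    void tryDrain() {
        if (lock.tryLock()) {
            try {
                int w = writeIndex.get();
                for (; drained < w; drained++) {
                    @SuppressWarnings("unchecked")
                    K key = (K) buffer[drained & MASK];
                    if (key != null) {
                        order.remove(key);         // move the key to the MRU end
                        order.add(key);
                    }
                }
            } finally {
                lock.unlock();
            }
        }
    }

    List<K> evictionOrder() {                      // LRU victim first
        return new ArrayList<>(order);
    }
}
```

Because the buffer is lossy, a hot key that gets sampled even occasionally still migrates toward the MRU end, which is why dropping samples under contention is harmless for the recency ordering.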
The fastest retrieval is a cache hit, so once the data structures are no longer the bottleneck the focus should shift to hit rates. That's where the Count-Min sketch, hill climbing, etc. come into play in the Java case. There's also memoization to avoid cache stampedes, efficient expiration (e.g. timing wheels), async reloads, and so on that can become important. Or with a dedicated cache server like memcached, one has to worry about fragmentation, minimizing wasted space (to maximize usable capacity), efficient I/O, etc., because every cache server can saturate the network these days, so the goal shifts towards reducing operational cost with stable tail latencies. What "good performance" means is actually a spectrum, because one should optimize for overall system performance rather than any individual, narrow metric.
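For a sense of why the Count-Min sketch matters for hit rates: it estimates per-key access frequency in sub-linear space, so an admission policy can check whether a new arrival is likely hotter than the eviction candidate. A minimal sketch, with made-up widths and seed constants (not Caffeine's tuned implementation):

```java
// Minimal Count-Min sketch: DEPTH hash rows over WIDTH counters each; the
// estimate is the minimum counter across rows, which can only overcount
// (hash collisions inflate it) and never undercounts.
final class CountMinSketch {
    private static final int DEPTH = 4;
    private static final int WIDTH = 1024;         // power of two, illustrative size
    private static final int[] SEEDS = {0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F};

    private final int[][] table = new int[DEPTH][WIDTH];

    private int index(Object item, int row) {
        int h = item.hashCode() * SEEDS[row];      // cheap per-row rehash
        h ^= h >>> 16;
        return h & (WIDTH - 1);
    }

    void increment(Object item) {
        for (int row = 0; row < DEPTH; row++) {
            table[row][index(item, row)]++;
        }
    }

    int estimate(Object item) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < DEPTH; row++) {
            min = Math.min(min, table[row][index(item, row)]);
        }
        return min;
    }
}
```

An admission check then becomes a comparison of two estimates: admit the candidate only if its estimated frequency exceeds the victim's, which is the TinyLFU idea in a nutshell. (A production version also halves the counters periodically so the frequency history ages out.)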