There's a great piece of software called "molly-guard", which intercepts calls to "poweroff" and "reboot" and similar. It checks if it's being invoked via an SSH session, and if so, it asks you to type the name of the system you're shutting down. That way, you never accidentally shut down a remote server when you meant to shut down your own system (or a different server).
I once accidentally rebooted the reverse proxy for all our production traffic. We had a couple of very quiet minutes while it came back up.
After that we installed molly-guard with a check for the number of active connections. Made it painless to reboot standby proxies and difficult to reboot active ones.
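For anyone curious how that kind of check works: molly-guard runs every script in its hook directory before allowing a shutdown, and a non-zero exit aborts the command. Here's a rough sketch of a connection-count check; the threshold, filename, and prompt wording are all made up:

```shell
#!/bin/sh
# Hypothetical molly-guard hook: /etc/molly-guard/run.d/20-check-connections
# molly-guard runs each script in run.d/ before shutdown/reboot; exiting
# non-zero aborts the command. The threshold here is an arbitrary example.
THRESHOLD=50

# Count established TCP connections (-H suppresses the header line).
count=$(ss -Htn state established | wc -l)

if [ "$count" -gt "$THRESHOLD" ]; then
    printf '%s established connections -- this looks like an ACTIVE host.\n' "$count"
    printf 'Type the hostname (%s) to proceed anyway: ' "$(hostname)"
    read -r reply
    [ "$reply" = "$(hostname)" ] || exit 1   # wrong answer: abort the shutdown
fi
exit 0
```

A standby proxy sits under the threshold and reboots without any prompt; an active one forces you to stop and type the hostname.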
(We also instituted pairing on production proxy maintenance. I'm not a fan of pair programming but pair maintenance is great.)
I like telling junior hires about this incident because it teaches them that (a) anyone can make mistakes, (b) even serious mistakes usually aren't that dangerous, (c) you can learn a lot from mistakes with the right mindset, (d) we cannot prevent mistakes but with the right system design we can reduce their consequences.
> (We also instituted pairing on production proxy maintenance. I'm not a fan of pair programming but pair maintenance is great.)
It's a great opportunity to share knowledge and techniques, and I very much recommend it. It's an important way to get people familiar and comfortable with what the documentation says. And it's far less scary to fail over a database or an archiving cluster while the DBA or an archive admin is on a call with you.
This also reminds me of a genuinely funny culture shock for a new team member, who had previously been on a team with a much worse culture of work and mutual respect. Just 2-3 months after she joined, we had a major outage, and various components and clusters needed to be checked and put back on track. For these things, we use exactly this pilot/copilot structure whenever changes to the system must go right.
Except that during this huge outage, two people were sick, two guys had a sick kid, one guy was on a boat on the North Sea, one guy was in Finland, and it was down to three of the regulars and the junior. Wonderful. So we handed her the documentation for one of the procedures, made her the copilot of her mentor, and got to work, calmly talking through the situation.
Until she said "Wait". And some combined 40-50 years of experience stopped on a dime. There was a bit of confusion about how much weight that word carried in the team, but she had correctly flagged an inaccuracy in the procedure that we had to address, which saved a few minutes of rework.
During Covid I was using my company dev machine remotely via Windows RDP, and I installed GlassWire, which blocks all traffic by default, so I lost access. No one was there to uninstall it, so I continued development on my personal machine.
Another fun one is disabling the network interface on a remote server. An acquaintance did that by mistake on a cloud VM running some core services, and the cloud provider had no virtual console for some reason. Ended up having to write off the VM and restore from backup. Fun day at the office.
Long ago, I once managed to cut off my own SSH access to a remote server after some firewall changes. That, of course, required a long trip to the server for physical access.
However, that was good, because ever since then I have been extra careful with any changes that could affect the firewall in any way. (That is not restricted to changes in the firewall rules: there are systems where the versions of the firewall program and of the kernel must be correlated, so an inconsistent update may make the firewall revert to its default state of denying all connections.)
I previously managed a firewall via scripts which would automatically revert your update in 20 seconds unless interrupted. So if you botched it and lost access, you just had to sit tight for 20 seconds.
Hah, I once did “netplan try” on a prototype production machine. The new config wasn’t quite right (although not catastrophic in any respect) so I told it to roll back. Bye bye new machine.
Fortunately this was an exercise and we had BMC access, so no big deal. Except that we got yet another datapoint suggesting that netplan is not a high quality piece of software.
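For anyone who hasn't used it: `netplan try` is Ubuntu's built-in version of the revert trick described upthread. It applies a candidate configuration and rolls it back automatically unless you confirm within a timeout (120 seconds by default). Usage is roughly:

```shell
# Edit the candidate config first, e.g. /etc/netplan/01-example.yaml
# (filename hypothetical), then test it. netplan reverts automatically
# after the timeout unless you confirm at the prompt.
sudo netplan try --timeout 30
```

As the anecdote above shows, though, the rollback itself is part of the trust boundary: if it misbehaves, you still need out-of-band access like a BMC.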
Last I checked, if you non-forcibly reboot a GCE instance via the console or API and it doesn't shut itself down in a timely manner, there was literally no way to force it to power off or hard-reboot so that your block storage volumes get released. IIRC the last time I ran into this, the process eventually timed out after an absurdly long wait.