Note that GP said -A -- this means the agent gets forwarded, and processes on the malicious server can ask the agent to perform authentication operations.
Touch to auth means the agent (or hardware token) asks the user to confirm they are expecting an authentication request to come in.
This allows you to forward your agent to a host and have slightly more protection against malicious processes on the host using your key.
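To make this concrete, here is a sketch of the relevant OpenSSH configuration. `-A` corresponds to the `ForwardAgent` option, so you can scope forwarding to trusted hosts in `~/.ssh/config` instead of passing the flag ad hoc (host names below are made up):

```
# ~/.ssh/config — forward the agent only where you actually need it
Host build-server.example.com
    ForwardAgent yes

# Everywhere else, keep it off (this is also the default)
Host *
    ForwardAgent no
```

For the touch-to-auth side: `ssh-add -c <keyfile>` loads a key such that the agent asks for confirmation on every use, and recent OpenSSH can enroll hardware-backed keys with `ssh-keygen -t ed25519-sk`, which require a physical touch on the token for each signature.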
Look how fucking horrible the behavior can be even when the organization has a publicly stated stance of holding members accountable, and occasionally actually does so.
Now imagine the fucking horrible behavior that doesn't even manage to get surfaced in an organization that takes a public stance of not holding members or itself accountable.
The military is far far from perfect. The police still manage to be worse. And that's fucking terrifying.
> The military is far far from perfect. The police still manage to be worse. And that's fucking terrifying.
Having been personally subjected to police violence several times in my life as an activist: I only sort of agree, which is why I said look at Jeremy's work and draw your own conclusion.
I can tell you right now: my experiences do not even remotely compare to those of ordinary civilians in Vietnam or Laos during those wars. I went to school with many of the children of that generation, and it would be outright offensive if I tried to compare our experiences; let alone those of the Iraqi or Afghan people, who are living in a literal nuclear wasteland due to the constant bombing and use of depleted uranium munitions on their land while being 'shocked and awed' into submission.
Many of whom, I'm sure, would tell you they were just as oppressed by the Saddam regime and the 'Taliban', but have found themselves in perhaps the worst of all possible situations, a horrible humanitarian crisis, as they were 'liberated' by the US. While the world continues to ignore that.
one is that our network is obscenely open and used in weird ways.
public ips handed out to all the things via dhcp. dynamic hostnames (generated from the dhcp request) on a subdomain of our .gov for all the things. similarly static ips and top level dns records on our .gov are passed out like candy.
the border is heavily firewalled, and all networks are heavily sniffed and monitored, but everyone has a public ip with a .gov hostname. the network users consist of thousands of academics and scientists who use the network in fun and interesting ways, frequently without tls.
changing this culture is likely way more difficult than making config changes on bind and dhcpd
I've slowly learned to stop asking, and just try to keep my sobbing down during calls
A major part of what makes these machines special is their interconnect. Fujitsu is running a 6D torus interconnect with latencies well in the sub-microsecond range. The special sauce is the ability of cores to interact with each other with extreme bandwidth at extremely low latencies.
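To get a feel for why torus topologies are used, you can compute hop counts: in a torus each dimension wraps around, so the distance along a ring is the shorter of the two directions. The dimensions below are made up for illustration, not the actual Tofu interconnect geometry:

```python
# Shortest hop count between two nodes in a multi-dimensional torus.
# Along each dimension you can travel either way around the ring, so
# the per-dimension distance is the smaller of the two directions.
def torus_hops(a, b, dims):
    return sum(min((x - y) % k, (y - x) % k)
               for x, y, k in zip(a, b, dims))

dims = (4, 4, 4, 2, 3, 2)                      # a made-up 6D torus
print(torus_hops((0, 0, 0, 0, 0, 0),
                 (3, 2, 1, 1, 2, 1), dims))    # worst case stays small
```

The payoff is that even distant nodes are only a handful of hops apart, and there is no single switch that all traffic must cross.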
Thank you for this helpful info. For comparison's sake, say that you wanted to make babby's first super computer in your house with 2 laptops. That is to say, each laptop is a single core x86 system with its own motherboard and ram and ssd, and they are connected to each other in some way (ethernet? usb?)
What software would one use to distribute some workload between these two nodes, what would the latency and bandwidth be bottlenecked by (the network connection?) and what other key statistics would be important in measuring exactly how this cheap $400 (used) set up compares to price/watt/flop performance for top 500 computers?
You could use MPI and OpenMP. I got my start building a 10-megabit ethernet cluster of 6 machines for $15K (this would have been back in ~2000). It only scaled about 4X using 6 machines, but that was still good enough/cheap enough to replace a bunch of more expensive SGIs, Suns, and HP machines.
Where the bottleneck is depends entirely on many details of the computations you want to run. In many cases, you can get trivial embarrassing parallelism if you can break your problem into N parts and there doesn't need to be any real communication between the processors running the distinct parts. In that case, memory bandwidth and clock rate are the bottleneck. But if you're running something like ML training with tight coupling, then the throughput and latency of the network can definitely be a problem.
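The embarrassingly parallel pattern is just "split into independent chunks, compute, combine". A minimal sketch using the Python standard library as a stand-in for MPI (each process plays the role of one "node"):

```python
# Embarrassingly parallel sum of squares: split the range into one
# independent chunk per worker, compute partial sums with no
# communication between workers, then combine the results.
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n = 1_000_000
    chunks = [(0, n // 2), (n // 2, n)]   # one chunk per "node"
    with Pool(2) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```

In a real MPI program the structure is the same, but the chunks live on different machines and the final `sum` becomes a reduction across the network.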
The thing to keep in mind about supercomputers is that they are designed for particular applications. Nuclear weapons simulation, biological analysis (can we run simulations and get a vaccine?), cryptanalysis. These applications are usually written in MPI, which is what coordinates communication between nodes.
If you want to play with it at home, connect those laptops to an ethernet network and install MPI on them both--you should be able to find tutorials with a little web searching. Then you could probably run Linpack if you felt like it, but if you wanted to learn a little more about how HPC applications actually work, you could write your own MPI application. I wrote an MPI raytracer in college; it's a relatively quick project and, again, you can probably find a tutorial for it online.
Edit: Your cluster is going to suck terribly in comparison to "real" supercomputers, but scientists frequently do build their own small-scale clusters for application development. The actual big machines like Sequoia are all batch-processing and must be scheduled in advance, so it's a lot easier (and cheaper, supercomputer time costs money) to test your application locally in real-time.
Summit and Sierra, for instance, actually run a fair range of applications fast, though Sierra is probably targeted mainly at weapons simulation-type tasks. A typical HPC system for research, e.g. university or regional, has to be pretty general purpose.
If you want to get experience of working with higher node counts without breaking the bank, people do case kits for Raspberry Pis so you can build your own cluster.
For actual computing, a modern higher-end processor/server will murder it, but it's closer to the real world of clusters than anything else (so much so that there is a company that does 100+ Pi node clusters for supercomputing labs to test on; you obviously can't run real scientific workloads, but it's cheaper than using the real machine as well).
If you want to understand distributed-memory parallel performance you're probably better off with a simulator, like SimGrid. I don't know what bog-standard hardware you'd need to get a typical correct balance between floating point performance, memory, filesystem i/o, and general interconnect performance otherwise. No toy system is going to teach you about running a real HPC system either -- you really don't want the fastest system if it's going to fall over every few hours or basically fall apart after a year.
It looks like this is not an ARM core, but a Fujitsu implementation of the Arm v8-A instruction set and the Fujitsu-developed Scalable Vector Extension. Most likely the latter is doing all the heavy lifting.
>A64FX is the world's first CPU to adopt the Scalable Vector Extension (SVE), an extension of Armv8-A instruction set architecture for supercomputers. Building on over 60 years' worth of Fujitsu-developed microarchitecture, this chip offers peak performance of over 2.7 TFLOPS, demonstrating superior HPC and AI performance.
The text you linked to actually says that the SVE was developed cooperatively by Fujitsu and ARM, without, however, going into details about who did what.
So looking at AnandTech's breakdown, the CPUs are closer to a Knights Landing 'CPU/GPU' than a traditional CPU (currently). They also have a ton of HBM2 right next to the dies, so this should be insanely fast: they can feed those cores very, very quickly regardless of how fast each core is by clock and pipeline. That should massively reduce stalls.
Oh agreed, but honestly what makes this so interesting is how tuned it is. I'm honestly surprised we haven't seen Intel or AMD ship an HPC CPU with on package HBM2 yet.
Besides FLOP/Watt, what's also very interesting here is the FLOP/Byte ratio (memory bandwidth). It has kept the same balance as the K computer, i.e. it is geared at scientific workloads and not just benchmarks (duh, just worth pointing out here as it makes this machine quite special, especially compared to Xeon-based clusters - Intel IMO has dropped the ball on bandwidth for the last 5 years or so).
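The arithmetic behind that balance is simple: machine balance is peak FLOP/s divided by memory bandwidth in bytes/s. The 2.7 TFLOPS peak is from the Fujitsu text quoted above; the ~1 TB/s HBM2 bandwidth figure is an approximation I'm using only to show the calculation:

```python
# Machine balance: how many floating point operations the chip can do
# per byte fetched from memory. Kernels with lower arithmetic
# intensity than this are bandwidth-bound, not compute-bound.
def machine_balance(peak_tflops, bandwidth_tb_per_s):
    return peak_tflops / bandwidth_tb_per_s   # FLOPs per byte

print(round(machine_balance(2.7, 1.0), 2))
```

A memory-streaming kernel like a triad (`a[i] = b[i] + s * c[i]`, 2 flops per 24 bytes of double-precision traffic) has an intensity of under 0.1 flop/byte, so on any realistic chip it is limited by bandwidth, which is why the FLOP/Byte ratio matters so much for scientific codes.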
As an early user of KNL, I don't get the "GPU" bit. KNL runs normal x86_64 code and doesn't look that much different to the AMD Interlagos systems I once used apart from the memory architecture.
It comes from the fact that KNL came from Larrabee which was actually developed as a GPU initially (and even ran games... sort of) but was never actually released. The next revision of that was the Xeon Phi chips you used. So the connection is "Lots of small cores with lots of high bandwidth ram" although these cores are definitely superscalar where Larrabee and derivatives were not really.
(SVE isn't 512-bit SIMD like AVX512.)
I don't know what BLAS they're using, though I know they've long worked on their own, but BLIS has gained SVE support recently, for what it's worth.
Yes, SVE, like the RISC-V vector extension, is a "real" vector ISA, with things like a vector length register (no need for a scalar loop epilogue), scatter/gather memory ops for sparse matrix work, mask registers for if-conversion, and looser alignment requirements (no/less need for loop prologues).
That being said, apart from becoming wider, AVX-NNN has also gotten more "real" vector features with every generation. The difference might not be as huge anymore.
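To illustrate the "no scalar epilogue" point, here is a toy Python model of a vector-length-agnostic loop in the SVE style: a `WHILELT`-like predicate masks off lanes past the end of the array, so the final partial iteration runs through the same vector code path. The lane count is arbitrary here; real SVE hardware picks its own width (128 to 2048 bits):

```python
# Toy model of an SVE-style predicated vector loop. The predicate
# produced by whilelt() disables the lanes past n, so there is no
# separate scalar cleanup loop for the leftover elements.
VL = 8  # lanes per "vector register"; arbitrary for this model

def whilelt(i, n):
    """Predicate: lane is active iff its element index is < n."""
    return [i + lane < n for lane in range(VL)]

def vec_add(dst, a, b, n):
    i = 0
    while i < n:
        pred = whilelt(i, n)
        for lane in range(VL):
            if pred[lane]:                 # only active lanes execute
                dst[i + lane] = a[i + lane] + b[i + lane]
        i += VL
    return dst

a = list(range(13))
b = [10] * 13
print(vec_add([0] * 13, a, b, 13))   # 13 is not a multiple of VL
```

The same source loop works unchanged whatever the hardware vector length is, which is the practical difference from fixed-width SIMD, where the compiler or programmer must emit a tail loop for the remainder.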