Note that GP said -A -- this means the agent gets forwarded, and processes on the malicious server can ask the agent to perform authentication operations.
Touch to auth means the agent (or hardware token) asks the user to confirm they are expecting an authentication request to come in.
This allows you to forward your agent to a host and have slightly more protection against malicious processes on the host using your key.
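To make this concrete, here is a sketch of the relevant OpenSSH configuration. `-A` corresponds to the `ForwardAgent` option, so you can scope forwarding to trusted hosts in `~/.ssh/config` instead of passing the flag ad hoc (host names below are made up):

```
# ~/.ssh/config — forward the agent only where you actually need it
Host build-server.example.com
    ForwardAgent yes

# Everywhere else, keep it off (this is also the default)
Host *
    ForwardAgent no
```

For the touch-to-auth side: `ssh-add -c <keyfile>` loads a key such that the agent asks for confirmation on every use, and recent OpenSSH can enroll hardware-backed keys with `ssh-keygen -t ed25519-sk`, which require a physical touch on the token for each signature.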
Look how fucking horrible the behavior can be even when the organization has a publicly stated stance of holding members accountable, and occasionally actually does so.
Now imagine the fucking horrible behavior that doesn't even manage to get surfaced in an organization that takes a public stance of not holding members or itself accountable.
The military is far far from perfect. The police still manage to be worse. And that's fucking terrifying.
> The military is far far from perfect. The police still manage to be worse. And that's fucking terrifying.
Having been personally subjected to police violence several times in my life as an activist: I only sort of agree, which is why I said look at Jeremy's work and draw your own conclusion.
I can tell you right now: my experiences do not even remotely compare to those of ordinary civilians in Vietnam or Laos during those wars. I went to school with many of the children of that generation, and it would be outright offensive if I tried to compare our experiences; let alone those of the Iraqi or Afghan people, who are living in a literal nuclear wasteland due to the constant bombing and use of depleted uranium munitions on their land while being 'shocked and awed' into submission.
Many of whom, I'm sure, would tell you they were just as oppressed by the Saddam regime and the 'Taliban', but have found themselves in perhaps the worst of all possible situations, a horrible humanitarian crisis, as they were 'liberated' by the US. While the world continues to ignore that.
one is that our network is obscenely open and used in weird ways.
public ips handed out to all the things via dhcp. dynamic hostnames (generated from the dhcp request) on a subdomain of our .gov for all the things. similarly static ips and top level dns records on our .gov are passed out like candy.
the border is heavily firewalled, and all networks are heavily sniffed and monitored, but everyone has a public ip with a .gov hostname. the network users consist of thousands of academics and scientists who use the network in fun and interesting ways, frequently without tls.
changing this culture is likely way more difficult than making config changes on bind and dhcpd
I've slowly learned to stop asking, and just try to keep my sobbing down during calls
A major part of what makes these machines special is their interconnect. Fujitsu is running a 6D torus interconnect with latencies well in the sub-microsecond range. The special sauce is the ability of cores to interact with each other with extreme bandwidth at extremely low latencies.
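To get a feel for why torus topologies are used, you can compute hop counts: in a torus each dimension wraps around, so the distance along a ring is the shorter of the two directions. The dimensions below are made up for illustration, not the actual Tofu interconnect geometry:

```python
# Shortest hop count between two nodes in a multi-dimensional torus.
# Along each dimension you can travel either way around the ring, so
# the per-dimension distance is the smaller of the two directions.
def torus_hops(a, b, dims):
    return sum(min((x - y) % k, (y - x) % k)
               for x, y, k in zip(a, b, dims))

dims = (4, 4, 4, 2, 3, 2)                      # a made-up 6D torus
print(torus_hops((0, 0, 0, 0, 0, 0),
                 (3, 2, 1, 1, 2, 1), dims))    # worst case stays small
```

The payoff is that even distant nodes are only a handful of hops apart, and there is no single switch that all traffic must cross.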
Thank you for this helpful info. For comparison's sake, say that you wanted to make babby's first super computer in your house with 2 laptops. That is to say, each laptop is a single core x86 system with its own motherboard and ram and ssd, and they are connected to each other in some way (ethernet? usb?)
What software would one use to distribute some workload between these two nodes, what would the latency and bandwidth be bottlenecked by (the network connection?) and what other key statistics would be important in measuring exactly how this cheap $400 (used) set up compares to price/watt/flop performance for top 500 computers?
You could use MPI and OpenMP. I got my start building a 10-megabit ethernet cluster of 6 machines for $15K (this would have been back in ~2000). It only scaled about 4X using 6 machines, but that was still good enough/cheap enough to replace a bunch of more expensive SGIs, Suns, and HP machines.
Where the bottleneck is depends entirely on many details of the computations you want to run. In many cases, you can get trivial embarrassing parallelism if you can break your problem into N parts and there doesn't need to be any real communication between the processors running the distinct parts. In that case, memory bandwidth and clock rate are the bottleneck. But if you're running something like ML training with tight coupling, then the throughput and latency of the network can definitely be a problem.
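The embarrassingly parallel pattern is just "split into independent chunks, compute, combine". A minimal sketch using the Python standard library as a stand-in for MPI (each process plays the role of one "node"):

```python
# Embarrassingly parallel sum of squares: split the range into one
# independent chunk per worker, compute partial sums with no
# communication between workers, then combine the results.
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n = 1_000_000
    chunks = [(0, n // 2), (n // 2, n)]   # one chunk per "node"
    with Pool(2) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```

In a real MPI program the structure is the same, but the chunks live on different machines and the final `sum` becomes a reduction across the network.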
The thing to keep in mind about supercomputers is that they are designed for particular applications. Nuclear weapons simulation, biological analysis (can we run simulations and get a vaccine?), cryptanalysis. These applications are usually written in MPI, which is what coordinates communication between nodes.
If you want to play with it at home, connect those laptops to an ethernet network and install MPI on them both--you should be able to find tutorials with a little web searching. Then you could probably run Linpack if you felt like it, but if you wanted to learn a little more about how HPC applications actually work, you could write your own MPI application. I wrote an MPI raytracer in college; it's a relatively quick project and, again, you can probably find a tutorial for it online.
Edit: Your cluster is going to suck terribly in comparison to "real" supercomputers, but scientists frequently do build their own small-scale clusters for application development. The actual big machines like Sequoia are all batch-processing and must be scheduled in advance, so it's a lot easier (and cheaper, supercomputer time costs money) to test your application locally in real-time.
Summit and Sierra, for instance, actually run a fair range of applications fast, though Sierra is probably targeted mainly at weapons simulation-type tasks. A typical HPC system for research, e.g. university or regional, has to be pretty general purpose.
If you want to get experience of working with higher node counts without breaking the bank, people do case kits for Raspberry Pis so you can build your own cluster.
For actual computing, a modern higher-end processor/server will murder it, but it's closer to the real world of clusters than anything else (so much so that there is a company that does 100+ Pi node clusters for supercomputing labs to test on; you obviously can't run real scientific workloads, but it's cheaper than using the real machine as well).
If you want to understand distributed-memory parallel performance you're probably better off with a simulator, like SimGrid. I don't know what bog-standard hardware you'd need to get a typical correct balance between floating point performance, memory, filesystem i/o, and general interconnect performance otherwise. No toy system is going to teach you about running a real HPC system either -- you really don't want the fastest system if it's going to fall over every few hours or basically fall apart after a year.
It looks like this is not an ARM core, but a Fujitsu implementation of the Arm v8-A instruction set and the Fujitsu-developed Scalable Vector Extension. Most likely the latter is doing all the heavy lifting.
>A64FX is the world's first CPU to adopt the Scalable Vector Extension (SVE), an extension of Armv8-A instruction set architecture for supercomputers. Building on over 60 years' worth of Fujitsu-developed microarchitecture, this chip offers peak performance of over 2.7 TFLOPS, demonstrating superior HPC and AI performance.
The text you linked to actually says that the SVE was developed cooperatively by Fujitsu and ARM, without, however, going into details about who did what.
So looking at AnandTech's breakdown, the CPUs are closer to a Knights Landing 'CPU/GPU' than a traditional CPU (currently). They also have a ton of HBM2 right next to the dies, so this should be insanely fast: they can feed those cores very, very quickly regardless of how fast each core is by clock and pipeline. That should massively reduce stalls.
Oh agreed, but honestly what makes this so interesting is how tuned it is. I'm honestly surprised we haven't seen Intel or AMD ship an HPC CPU with on package HBM2 yet.
Besides FLOP/Watt, what's also very interesting here is the FLOP/Byte ratio (memory bandwidth). It has kept the same balance as the K computer, i.e. it is geared at scientific workloads and not just benchmarks (duh, just worth pointing out here as it makes this machine quite special, especially compared to Xeon-based clusters - Intel IMO has dropped the ball on bandwidth for the last 5 years or so).
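The arithmetic behind that balance is simple: machine balance is peak FLOP/s divided by memory bandwidth in bytes/s. The 2.7 TFLOPS peak is from the Fujitsu text quoted above; the ~1 TB/s HBM2 bandwidth figure is an approximation I'm using only to show the calculation:

```python
# Machine balance: how many floating point operations the chip can do
# per byte fetched from memory. Kernels with lower arithmetic
# intensity than this are bandwidth-bound, not compute-bound.
def machine_balance(peak_tflops, bandwidth_tb_per_s):
    return peak_tflops / bandwidth_tb_per_s   # FLOPs per byte

print(round(machine_balance(2.7, 1.0), 2))
```

A memory-streaming kernel like a triad (`a[i] = b[i] + s * c[i]`, 2 flops per 24 bytes of double-precision traffic) has an intensity of under 0.1 flop/byte, so on any realistic chip it is limited by bandwidth, which is why the FLOP/Byte ratio matters so much for scientific codes.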
As an early user of KNL, I don't get the "GPU" bit. KNL runs normal x86_64 code and doesn't look that much different to the AMD Interlagos systems I once used apart from the memory architecture.
It comes from the fact that KNL came from Larrabee which was actually developed as a GPU initially (and even ran games... sort of) but was never actually released. The next revision of that was the Xeon Phi chips you used. So the connection is "Lots of small cores with lots of high bandwidth ram" although these cores are definitely superscalar where Larrabee and derivatives were not really.
(SVE isn't 512-bit SIMD like AVX512.)
I don't know what BLAS they're using, though I know they've long worked on their own, but BLIS has gained SVE support recently, for what it's worth.
Yes, SVE, like the RISC-V vector extension, is a "real" vector ISA, with things like a vector length register (no need for a scalar loop epilogue), scatter/gather memory ops for sparse matrix work, mask registers for if-conversion, and looser alignment requirements (no/less need for loop prologues).
That being said, apart from becoming wider, AVX-NNN has also gotten more "real" vector features with every generation. The difference might not be as huge anymore.
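To illustrate the "no scalar epilogue" point, here is a toy Python model of a vector-length-agnostic loop in the SVE style: a `WHILELT`-like predicate masks off lanes past the end of the array, so the final partial iteration runs through the same vector code path. The lane count is arbitrary here; real SVE hardware picks its own width (128 to 2048 bits):

```python
# Toy model of an SVE-style predicated vector loop. The predicate
# produced by whilelt() disables the lanes past n, so there is no
# separate scalar cleanup loop for the leftover elements.
VL = 8  # lanes per "vector register"; arbitrary for this model

def whilelt(i, n):
    """Predicate: lane is active iff its element index is < n."""
    return [i + lane < n for lane in range(VL)]

def vec_add(dst, a, b, n):
    i = 0
    while i < n:
        pred = whilelt(i, n)
        for lane in range(VL):
            if pred[lane]:                 # only active lanes execute
                dst[i + lane] = a[i + lane] + b[i + lane]
        i += VL
    return dst

a = list(range(13))
b = [10] * 13
print(vec_add([0] * 13, a, b, 13))   # 13 is not a multiple of VL
```

The same source loop works unchanged whatever the hardware vector length is, which is the practical difference from fixed-width SIMD, where the compiler or programmer must emit a tail loop for the remainder.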