Cray, AMD to Extend DOE’s Exascale Frontier (hpcwire.com)
87 points by arcanus on May 7, 2019 | hide | past | favorite | 62 comments


IMO the coolest thing about Summit/Sierra is that the GPUs and CPUs have a fully coherent single address space with all memory available to the GPUs by default, meaning that your stack- and malloc-allocated variables can be used directly from the GPUs.

I wonder if that will be the case on Frontier.


Yes: https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontie...

However, maybe this is just me, but I don't completely trust this to work without losing a certain amount of peak memory performance. I hope they at least leave the option to turn it off, so we can verify the impact it has on a per-application basis.


That's basically the entire state of mobile (minus a few weird SoCs) and many game consoles; it makes for a much more convenient development platform.


I'm not familiar with mobile / APUs, but I was under the impression that it was still necessary to do clCreateBuffer (or similar). I could only find old slides (https://developer.amd.com/wordpress/media/2013/06/1004_final...), though.

Would you mind pointing me in the right direction to learn about how to do this? (something equivalent to slide 2 of https://www.olcf.ornl.gov/wp-content/uploads/2018/12/summit_... but for mobile/APU)?


https://www.khronos.org/registry/OpenGL/extensions/OES/OES_E... is the basic entry point on most of the OpenGL ES platforms; elsewhere you get into platform-specific APIs.

Mostly my point was the unified system/gpu memory is pretty common outside of the desktop space.


Amiga 500 all over again. :)

I feel these things always go back and forth in cycles in the industry.


I think you'd have a hard time making a case that Summit is a cyclical return to any part of the Amiga 500 :)

"fully coherent single address space with all memory available to [all processors]" is a lot easier if there is no virtual memory, no caching, and all processors are on a single shared bus to the same memory chips.


You left out the qualifier "Single-socket nodes". It's much easier when all the components are on the same die.


Which system are you referring to here?


From the article,

> Single-socket nodes will consist of one CPU and four GPUs, connected by AMD’s custom high bandwidth, low latency coherent Infinity fabric.

The article, at least, only describes coherency in the context of single-socket nodes. I suppose it's ambiguous whether Infinity fabric only connects components on the node or connects nodes to each other, but in fact Infinity fabric is the interconnect used with Zen CPUs, so I think it's pretty clear that in this case it only connects and provides memory coherency for 1xCPU and 4xGPUs.

Performant memory coherency across nodes would be quite an achievement, but not one they seem to have achieved. Rather, it seems they've simply revived the PS4 model.


So the comparison to an Amiga 500 maybe wasn't so crazy after all. :)


> In a media briefing ahead of today’s announcement at Oak Ridge, the partners revealed that Frontier will span more than 100 Shasta supercomputer cabinets, each supporting 300 kilowatts of computing.

So 30 megawatts of computing, plus cooling and other supporting services. How do you power something like this? Does ORNL have their own power station (given they have reactor(s) on site)? If power comes from an external station do they coordinate with the station operator when bringing a system like this online?


As has been noted in other comments, we do not have a power station at ORNL. We buy power from TVA at about 5.5 cents per kWh, in part because of the lab's proximity to TVA power plants.

TVA recently completed a 210 MW substation on ORNL's campus to better serve our needs. We do not need to coordinate with them for large runs on the machines.


Nice :-) Back in the day in the UK, the RAE (Royal Aircraft Establishment) at Twinwoods had a direct line to a local power station for their wind tunnels, and used to control the speed from the power station.


With that much gear and those kind of loads do you still have a traditional UPS / transfer switch / genset arrangement for everything in the room? If not, how do you manage short duration power outages?


Yep, we have battery-backed generators for UPS and a transfer switch at the 480-V feed that comes into the room but it is not enough to power the compute nodes. The UPS allows cluster management nodes and the parallel filesystem (which is a small cluster by itself) to ride through full outages and other PQE.


Thanks for taking the time to provide context in thread!


Oak ridge national laboratory was built where it is partly because they could get lots of cheap power from the TVA, so probably from that. (TVA is a regional electricity provider that operates a lot of hydro plants.)


For those who are curious, a typical American home uses on the order of a kilowatt, time-averaged (10,400 kWh per year ≈ 1.2 kW). So 30 MW is roughly the average power usage of a city of 30,000 homes, or 80,000 people, although total capacity will be larger to handle fluctuations.
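A quick back-of-the-envelope check of that arithmetic (the 10,400 kWh/year figure is the parent comment's assumption, and the home count comes from rounding to ~1 kW per home):

```python
# Sanity-check the "30 MW ~ a city of 30,000 homes" comparison.
HOURS_PER_YEAR = 365 * 24                  # 8760

kwh_per_home_year = 10_400                 # assumed typical US home usage
avg_home_kw = kwh_per_home_year / HOURS_PER_YEAR   # ~1.19 kW time-averaged

machine_kw = 30_000                        # 30 MW
homes = machine_kw / avg_home_kw           # ~25,000 at 1.2 kW; ~30,000 at 1 kW

print(f"{avg_home_kw:.2f} kW per home, ~{homes:,.0f} homes")
```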


...and yet even a machine at this scale, or 100 times it, could not come close to matching the human brain in terms of neural simulation.

We are definitely "doing something wrong" when it comes to artificial neural networks. Even though our models are much simpler, it still takes an enormous amount of computing power, both in raw CPU and in electricity, just to simulate things at a small scale. And if we use more accurate models, based on what we know about the brain and neurons, then at best our simulations can only cover, over hours of wall-clock time, what would be fractions of a second of real time.

That our brains can do so much using so little power (wattage), with a number of nodes and interconnections that dwarfs anything we've so far managed to simulate - it's a bit mind-boggling and humbling.

I just wonder where and what the issue actually is.

Why do our current practical models of a neuron, which are vastly simplified, require so much power to run at scale?

Is the issue related to the fact that they are simplified models, and actual neurons with their complexity are able to do things we don't yet understand or know about?

All of this is also related to back-propagation; such a thing doesn't seem to exist in nature (jury is still out on the theory, though) - so how do biological neural nets "learn"?

If we could eliminate or reduce the need for backpropagation, would that lower our power requirements for artificial network implementations?

As someone who has merely dabbled with artificial neural networks, these questions and conundrums fascinate me, and cause me to attempt to think up potential solutions, however far-fetched.

I highly doubt I will be the one to solve the issue, but I do hope to see it solved within my lifetime.


The main difference between the brain and a CPU is that the brain runs at a much lower, variable clock speed (10-100Hz) to reduce switching costs and makes up for this with extensive pipelining and parallelization when possible. The high node count and necessary connectivity is then possible due to the use of directed self assembly.

At any rate, it is quite likely that the neocortex of the brain simply computes a function(s) recursively upon sensory input (see Chomsky's minimalist program for suggestions on what it could be). What is unclear is what this function is, and how it comes about - for this reason the approach by some has been to attempt to simulate an entire brain to see what it does. But without the necessary abstractions, this will be inherently wasteful and generates nothing new other than validating your experimental data.


Connectivity. ICs are more or less 2D structures. That leads to less efficient connectivity and packing.


Current AI is mostly based on FP64/FP32 arithmetic, while a biological neural net is more likely a large network of nodes that each process only one or a few bits. Once we figure out how to build, use, and optimize large networks of such few-bit nodes, the power consumption might just go way down.

Someone might also find that this kind of network can run at kHz rates like human brain cells, instead of MHz or GHz. The power usage would go down even more.


It seems you’re confusing AI and brain simulation.


brains can't run high resolution simulations of brains, but supercomputers can.

neural networks are only a small component of what we use supercomputers for.


They do not have their own power station. They have the Bull Run coal plant and hydro plants in the area. They do coordinate with TVA before a run.


They might coordinate with TVA for transients (i.e., going from an idle machine to a full-machine run), but in normal operations these systems are at least 50% full. In my experience as a user, these machines are above 80% most of the time. I don't have hard numbers on the average utilization over the last week/month/year (these might even be classified), but you don't buy and build such a machine to leave it idle.

I could find the numbers for two German supercomputers I have used in the past. SuperMUC (Phases 1 and 2 combined) had "above the desired 85%" utilization in 2017, and Hazel Hen at HLRS in Stuttgart reported a utilization "between 92% and 98%" in 2017.


> When I took a tour of the Oak Ridge Leadership Computing Facility a few years ago, Buddy Bland, who is Project Director, told me he could tell when he comes to the Lab in the morning if they are running the LINPACK benchmark by looking at how much steam is coming out of the cooling towers.

https://blogs.mathworks.com/cleve/2013/06/24/the-linpack-ben...

Different programs consume dramatically different amounts of power.


That is true. But very few supercomputers track power consumption for different codes and do power-aware scheduling. SuperMUC and IBM actually do a lot of research on this, because it is a rather new field in HPC.


Most supercomputers today have their own power station on site. I know Blue Waters at UIUC had one, which I believe caused a power outage at one point.


So - on a more "applies to ordinary mortals" level - the fact that they are going to use all AMD components is intriguing.

In reference to AI, NVidia has things "locked up" with CUDA, versus 2nd cousin AMD's OpenCL.

From what I understand, it is possible to recompile TensorFlow (for instance - not that ORNL will be using TF) for OpenCL - but I don't know how well it works. Personally, I've only used TF with CUDA.

Does this mean we might see greater/better support for OpenCL in the AI realm? Might we see it become on par with CUDA because of this collaboration on this HPC system?

Or will things stay as-is, at least "down here" in the consumer/business realm of AI hardware and applications? Do things like this trickle down, or are things so customized and/or proprietary for the needs of HPC at ORNL (or elsewhere) that anything to do with AI on this machine will have little to no bearing outside of the lab?

Ultimately, I'd just like to see another choice (a lower cost choice!) for GPU in the world of consumer/enthusiast/hobbyist AI/DL/ML - while today's higher-end GPUs, no matter the manufacturer, tend to be fairly expensive, AMD still has an edge here that make them attractive to users (not to mention the fact that their Linux drivers are open-source, which is also a plus).


I doubt they are going to run much AI on that machine. The national labs mostly run "traditional HPC" workloads such as fluid codes that simulate (magneto)hydrodynamics in one way or another.


Many modern HPC workloads are being ported to include TensorFlow so they can do all sorts of in-process event categorization. https://www.nersc.gov/news-publications/nersc-news/science-n...

ML (not AI) will probably end up driving a lot of supercomputer workloads because distributed ML training resembles modern HPC simulation codes.



The DOE is responsible for the USA's nuclear arsenal so I expect a few simulations of that nature.


This system is mostly for unclassified work; it's not clear just how much stockpile stewardship will occur on it.


Yep, actually all of our user projects are unclassified at the OLCF.


... except for on Summit.

"Exascale Deep Learning for Climate Analytics" won the 2018 Gordon bell using summit.


They have done that, and it got an award because it was new and cool, but that is not where the majority of cycles go. Those are spent on fluid codes such as Raptor, XGC, or Flash, and to some extent kinetic simulations with codes such as VPIC or quantum codes such as QMCPACK.

There should be more information at https://www.olcf.ornl.gov/leadership-science/


I agree that the majority of the cycles go to traditional codes, but there is quite a bit of ML/AI happening on Summit.

From your link:

"New Frontiers for Material Modeling via Machine Learning Techniques"

"Large scale deep neural network optimization for neutrino physics"

"High-Fidelity Simulations of Gas Turbine Stages for Model Development using Machine Learning"

"HPC4mfg – Reinforcement Learning-based Traffic Control to Optimize Energy Usage and Throughput"

"Advances in Machine Learning to Improve Scientific Discovery"


HIP (https://rocm-documentation.readthedocs.io/en/latest/Programm...) is basically 1-1 with CUDA, and can be almost automatically generated from CUDA (https://github.com/ROCm-Developer-Tools/HIP/tree/master/hipi...). The datasheet doesn't mention OpenCL so evidently it isn't a priority any more.


I'd tend to assume, with the amount of resources going into this, that the software will be coded at a lower level here than it would be in a typical dev environment and so NVidia's library advantages will be less salient?


They'll probably use ROCm rather than OpenCL.


But ROCm isn't an API, is it? It's a "platform," whatever that means, and so software still has to use either HIP or OpenCL, right?


"greater than 1.5 exaflops" of performance will likely correspond to greater than 1 Exaflop of sustained performance on HPL (used for the top-500 ranking), making this a likely candidate for the first 'true' exascale computer.


Out of curiosity, what makes this an exascale computer and not, say, an AWS or Azure datacenter? Just the fact that they are open about benchmarking their PFLOPS?


Good question. Technically speaking just running HPL on a cluster is sufficient to get it ranked on top-500.

In reality, most HPC sites use high-performance interconnects, such as InfiniBand or a proprietary high-performance Ethernet. They are also less likely to use virtualization, and every node looks 'bare metal'. The software stacks are very different in everything from the distributed memory model and compilers to the system schedulers and diagnostic software.

Nevertheless, there is a great deal of convergence happening between HPC and hyperscale data centers, particularly as hyperscale uses more machine learning, which has a similar flavor to HPC. Many believe that the FANG companies have exaflop capabilities already, but they just aren't well optimized for scientific workloads.


But will it run Crysis?


Taking this idea more seriously than intended: is this practical? I've heard of CPU clusters being used for network distributed realtime raytracing. Is there a DirectX / OpenGL API implementation for this?


This is just a few miles down the road from me. I can try to see if they will let me run it...


The call for proposals for INCITE (one of the programs we provide cycles for) is open for 2020 but this would be for Summit not Frontier this time around :)

http://www.doeleadershipcomputing.org/proposal/call-for-prop...



What are the advantages of Infinity Fabric over PCIe 4 for CPU/GPU communication?

What interconnects do these sorts of machines use? I assume even 100GbE isn't enough?

Just curious. It's interesting what exists in the "so far beyond my price range as to be ludicrous" category.


PCIe provides communication but isn't intended to provide memory coherency. There's a lot of work that goes on in figuring out which cache(s) have a copy of which cache line and figuring out how to resolve conflicting access needs.


Infinity Fabric / HyperTransport is generally lower level and lower latency than PCIe. It's aimed more for use as a front-side bus than a peripheral interconnect. A better analogue would be Intel's QuickPath Interconnect.


As to the interconnect question: they usually use InfiniBand, but Cray uses their own proprietary interconnect tech.


I wonder if this will help the adoption of ROCm. It seems everything I read settled on CUDA.


1.5 exaflops, 30 megawatts, around 50 GFLOPS per watt? Impressive if true; that's 3x more efficient than the current top of the Green500.
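The arithmetic behind that figure, for reference (both numbers are announced targets, not measurements):

```python
peak_flops = 1.5e18     # "greater than 1.5 exaflops" (announced peak)
power_watts = 30e6      # ~100 cabinets x 300 kW each

gflops_per_watt = peak_flops / power_watts / 1e9
print(gflops_per_watt)  # 50.0
```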


Moore's law might have stopped with the clock gains when Denard scaling gave out but we've still got energy efficiency gains. Koomey's law[1] is holding strong. I don't know if it'll get us all the way to Landauer's Limit[2] but I hope so.

[1]https://en.wikipedia.org/wiki/Koomey%27s_law

[2]https://en.wikipedia.org/wiki/Landauer%27s_principle


It's probably not a great comparison to put Frontier's theoretical numbers against the Green500's achieved numbers. Achieved flops are pretty much always considerably lower than theoretical flops: Titan is a 27 PFLOPS machine that achieves 17.6, Sequoia is a 20 PFLOPS machine that achieves 17, Summit is a 200 PFLOPS machine that achieves 143, ...

~37 GFLOPS/W is probably a better projection if we assume (out of nowhere) that the theoretical/achieved ratio of Frontier is comparable to Summit's (75%). Still very impressive.
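Spelling out that projection (the 75% peak-to-HPL ratio is the parent comment's assumption, a rounding of Summit's 143/200):

```python
peak_ef = 1.5      # Frontier's announced peak, exaflops
ratio = 0.75       # assumed HPL/peak efficiency (Summit's actual: 143/200 ~ 0.715)

sustained_flops = peak_ef * ratio * 1e18       # ~1.125 EF hypothetical sustained
gflops_per_watt = sustained_flops / 30e6 / 1e9
print(round(gflops_per_watt, 1))  # 37.5
```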


Well they don't usually hit their peak power usage either; Summit is rated at 13 MW but used less than 10 during its LINPACK run. But fair point.


GPUs help considerably here, taking a look at previous ORNL machines:

  Jaguar | 2.3 PF XT5 (CPU-only)      | 7 MW for HPL
  Titan  | 27 PF XK7 (1:1 CPU to GPU) | 8.2 MW
  Summit | 186 PF (2:6 CPU to GPU)    | 8.8 MW
Overall: substantial increases in computing power for only a 10%-20% increase in power draw.
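Since 1 PFLOPS per MW works out to exactly 1 GFLOPS/W, the table above translates to efficiency roughly as follows (using the comment's figures, which mix peak ratings with HPL power draw, so treat it as a trend rather than a Green500-style number):

```python
machines = {  # name: (peak PFLOPS, MW during HPL), per the comment above
    "Jaguar": (2.3, 7.0),
    "Titan":  (27.0, 8.2),
    "Summit": (186.0, 8.8),
}

for name, (pflops, mw) in machines.items():
    # 1 PFLOPS / 1 MW = 1e15 / 1e6 = 1e9 FLOPS/W = 1 GFLOPS/W
    print(f"{name}: {pflops / mw:.2f} GFLOPS/W")
```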



