November 14, 2023
Forget zettascale: trouble scaling Exascale supercomputers

In 2021, Intel declared its goal of reaching zettascale supercomputing by 2027, a 1,000-fold increase over today's Exascale computers.

By 2023, attendees at the Supercomputing 2023 conference, being held in Denver, said the real challenge is scaling performance even within the Exaflop range.

The move to CPU-GPU architectures has helped scale performance, but other concerns, such as architectural limitations and sustainability issues, make further scaling difficult, Top500 officials said.

In fact, at the current rate, supercomputers may not reach 10 Exaflops of performance by 2030. Performance growth has slowed in recent years despite new Exascale systems entering the Top500 list.

“If we don’t change the way we approach computing, our growth in the future may be significantly less than it has been in the past,” said Erich Strohmaier, co-founder of Top500, during a press conference.

The end of two fundamental scaling trends – Dennard scaling and Moore's Law – has created challenges in scaling performance.

“The end of Moore’s Law is coming, there’s no doubt about it,” Strohmaier said.

The number of systems submitted to the Top500 has gradually decreased since 2017. The average performance of the systems has also decreased in recent years.

The decline is also related to the inability to grow system sizes due to architectural limitations and sustainability issues.

“Our data centers cannot grow much larger than they are. So we cannot increase the number of … CPU sockets,” Strohmaier said.

Optical I/O has been identified as a technology needed to reach zettascale. However, a US Department of Energy (DoE) official said that optical I/O was not on their roadmap because of the cost and energy required to drive optical links between circuits over short distances at the motherboard level. By comparison, copper is cheap and plentiful.

HPC systems are also staying in service longer. The average age of a Top500 system was around 15 months in 2018-2019 and doubled to 30 months in 2023.

[Figure: SC23 Top500 average system age]

The top seven systems on the November Top500 list have as much combined performance as the remaining 493. Upcoming systems will widen that gap, with an even larger share of total performance coming from the top 10 systems.

Meanwhile, some exciting new Exascale machines will make it to the Top500 list. There may be many lead changes as more supercomputers come online and are optimized to perform faster.

Two new systems – Aurora this year and El Capitan next year – could take the top positions on the Top500 in the coming years. Both systems will scale to two Exaflops.

There was no change at the top of the Top500 supercomputer list issued this week, with Frontier at Oak Ridge National Laboratory retaining its spot. The system delivered 1.1 exaflops of performance and remained the only Exascale system on the list.

“I would say the machine is really stable right now, and it’s performing exceptionally well,” said Lori Diachin, director of the Exascale Computing Project at the US Department of Energy.

But Frontier may soon be displaced by the second-fastest system, Aurora, installed at Argonne National Laboratory. It delivered 585.34 petaflops of performance and has been only partially benchmarked. The system features Intel's 4th Gen Xeon server chips, known as Sapphire Rapids, and Data Center GPU Max chips, known as Ponte Vecchio.

Argonne submitted benchmarks for only half the system, and its performance will only go up once the full machine is benchmarked, Strohmaier said.

“It’s doubtful that Frontier will be the number one system much longer,” Strohmaier said.

Diachin’s team has had limited access to the system since July and is seeing good performance.

“We’re really looking forward to having full access to that system, hopefully later this month,” Diachin said.

The third Exascale supercomputer, El Capitan, will be deployed in mid-to-late 2024 at Lawrence Livermore National Laboratory.

The system is likely to take the top spot on the Top500 when the benchmark is released, but it is not certain when that will happen.

“There will be a short early science period for that machine before it transitions into classified use for the NNSA’s stockpile stewardship mission,” Diachin said.

In addition, there are likely many Top500-class Exaflop systems, especially in cloud facilities, whose vendors have not bothered to submit results. Google’s A3 supercomputer can accommodate up to 26,000 Nvidia H100 GPUs but has not submitted any results.

But one entry, Microsoft’s Azure AI supercomputer called Eagle, unexpectedly landed third in this year’s Top500, and Nvidia’s bare metal Eos was ninth.

A previous contributor, China, has dropped off the map and is no longer submitting results to the Top500. One Gordon Bell award submission ran on a Chinese Exascale system, but no performance results for that system were submitted to the Top500.

Beyond raw horsepower, DoE’s Diachin is also exploring new ways to scale performance within current hardware limitations.

One such idea is to use mixed precision and a broader adoption of accelerated computing. DoE is also looking at incorporating AI into large multiphysics models, wrapped in classical computing, to achieve faster results.

“From our perspective, one of the things we’re really looking forward to is some of these algorithmic improvements and broader incorporation of those kinds of technologies to accelerate applications while keeping the power footprint manageable,” Diachin said.
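The mixed-precision idea mentioned above is commonly realized as iterative refinement: do the expensive solve in low precision, then correct the answer in high precision. A minimal sketch, assuming a standard NumPy setup (this is an illustrative example, not DoE's actual implementation):

```python
import numpy as np

def solve_mixed_precision(A, b, iters=3):
    """Solve Ax = b cheaply in float32, then refine the result in float64."""
    A32 = A.astype(np.float32)                       # low-precision copy
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                # residual in full precision
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d                                       # high-precision correction
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100)) + 100.0 * np.eye(100)  # well-conditioned
b = rng.standard_normal(100)
x = solve_mixed_precision(A, b)
```

The payoff is that the dominant O(n³) factorization runs at the faster, lower-power float32 (or float16 on GPU tensor cores) rate, while the cheap O(n²) refinement steps recover full float64 accuracy for well-conditioned problems.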

Many labs are also revisiting old code written in languages like Fortran 77, rewriting and recompiling it for accelerated computing environments.

This approach “will help future-proof many of these codes by extracting layers specific to different types of hardware and allowing them to be more portable in performance with less work,” Diachin said.
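One common way to get the hardware-specific layering Diachin describes is to write numerical kernels against a generic array backend and bind the backend at the call site. A minimal sketch in Python, assuming a duck-typed array module (NumPy on CPU; a compatible GPU library such as CuPy could be passed instead), purely as an illustration of the pattern:

```python
import numpy as np

def laplacian_1d(u, xp):
    """Periodic second-difference stencil, written once for any array backend.

    `xp` is whichever array module the caller supplies (numpy, cupy, ...),
    so the kernel itself contains no hardware-specific code.
    """
    return xp.roll(u, -1) + xp.roll(u, 1) - 2.0 * u

x = np.linspace(0.0, 1.0, 8)
lap = laplacian_1d(x ** 2, np)   # CPU path; pass cupy for a GPU path
```

The kernel stays portable because all hardware knowledge lives in the backend module, which is the same separation of concerns that makes the rewritten codes "more portable in performance with less work."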

Hardware and algorithmic improvements have yielded speedups mostly in the 200x to 300x range, and in some cases “as much as even several 1,000-fold improvements,” Diachin said.

Labs typically rely on E4S, the Extreme-scale Scientific Software Stack, which includes debugging, runtime, math, visualization, and compression tools. It has more than 115 packages and is distributed to academia, scientific organizations, and other US government agencies.


