The world's hunger for GPU chips from Nvidia, the dominant artificial intelligence supplier, has so far failed to provide a meaningful boost to chip sales at rivals Advanced Micro Devices and Intel. But it can help build a new kind of business model.
“It’s increasingly the case that there is, sort of, an alternative to Nvidia,” said Andrew Feldman, co-founder and CEO of AI computing startup Cerebras Systems, which sells a massive AI computer, the CS-2, that runs the world’s largest chip.
Also: Nvidia boosts its “superchip” Grace-Hopper with faster memory for AI
Feldman and team started selling computers to compete with Nvidia’s GPUs four years ago. A funny thing happened on the way to the market. Feldman increasingly sees his business as a hybrid where there are some sales of individual systems, but much larger sales of massively parallel systems that Cerebras builds over months and then runs on behalf of customers as a dedicated AI cloud service.
The business “has completely changed” for Cerebras, Feldman told ZDNET. “Instead of buying one or two machines, and putting a (computing) job on one machine for a week, customers would rather have it on 16 machines for a few hours” as a cloud service model.
The bottom line for Cerebras is, “For hardware sales, you can do fewer, bigger deals, and you’ll spend a lot of time and effort managing your own cloud.”
On Monday, at a supercomputing conference in Denver called SC23, Feldman and team unveiled the latest feat of the expanding AI cloud.
Also: Cerebras just built a giant computer system with 27 million AI “cores”
The company announced that it has completed construction of a massive AI computer, the Condor Galaxy 1, or “CG-1,” built for client G42, a five-year-old investment firm based in Abu Dhabi, United Arab Emirates.
The Condor Galaxy, announced earlier this year, is named after a spiral galaxy located 212 million light years from Earth. The machine is a collection of 64 of Cerebras' CS-2s. The total value of the CG-1, Feldman said, is slightly less than the cost of an equivalent number of Nvidia GPU chips — on the order of $150 million, based on the price of Nvidia's 8-way "DGX" computer.
“This is a very good deal,” Feldman said of such large ticket sales. “We’re having a monster year” in terms of sales, he said.
Plus: Why Nvidia is teaching robots to twirl pens and how generative AI is helping
The Condor Galaxy machine is not physically located in Abu Dhabi, but rather installed at Santa Clara, Calif.-based Colovore, a hosting provider that competes in the cloud services market with the likes of Equinix.
Cerebras is starting to build the second version of the Condor Galaxy, number two, or "CG-2," which will add another 64 computers and four more "exaFLOPS" of computing power, for a total of 8 exaFLOPS for the Condor Galaxy system. (An exaFLOP is a billion billion, or 10^18, floating-point operations per second.)
The Condor Galaxy system is expected, in its final configuration, to total 36 exaFLOPS, with 576 CS-2 computers, overseen by 654,000 AMD CPU cores.
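The scaling arithmetic above can be checked with a short sketch. The figures are the ones quoted in the article; the per-system number is derived from those totals, not an official Cerebras specification:

```python
# Condor Galaxy scaling arithmetic, using figures quoted in the article.
EXA = 10**18  # an exaFLOP is a billion billion (10^18) floating-point ops/sec

CS2_PER_BUILD = 64        # CS-2 systems per Condor Galaxy installment
EXAFLOPS_PER_BUILD = 4    # quoted compute per 64-system installment

# Derived per-system throughput (not an official spec)
flops_per_cs2 = EXAFLOPS_PER_BUILD * EXA / CS2_PER_BUILD
print(f"Per CS-2: {flops_per_cs2 / 1e15:.1f} petaFLOPS")

# Final configuration: 576 CS-2 systems, i.e. nine 64-system installments
total_systems = 576
total_exaflops = total_systems / CS2_PER_BUILD * EXAFLOPS_PER_BUILD
print(f"Final configuration: {total_exaflops:.0f} exaFLOPS")
```

The result, 62.5 petaFLOPS per CS-2 and 36 exaFLOPS overall, matches the final configuration the article describes.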
In the new hybrid business model, Feldman said, the measure of success is not only system sales but also the percentage of new customers who rent capacity in Cerebras' cloud without any upfront purchases. "Sometimes you'd send them hardware and they'd set it up, and you'd run a trial or prove it on their premises, and now we'll give you a login," Feldman explained of the new sales approach.
Pharmaceutical giant GlaxoSmithKline, an early customer for CS-2 hardware, is also leasing capacity in Cerebras' cloud, Feldman said. "They have our equipment in place, and then, when they want to do giant runs, they come into our cloud," he explained. "And it's a very interesting model."
Also: Glaxo’s biology research with the new Cerebras machine shows that hardware can change how AI is done
“We now have kind of so much supercomputer capacity that other people are using our system in all sorts of creative ways,” Feldman said. “In the AI space, they’re training interesting models, and in the supercomputing space, they’re doing interesting work — and this is simply not the case with anyone else.”
Feldman cited as "incredibly interesting" AI work done on the Condor Galaxy: the development of a large open-source language model, similar to OpenAI's GPT. That program is the best-performing model with 3 billion neural network "parameters" on the Hugging Face machine learning repository, Feldman noted, with more than a billion downloads. The program is small enough to run on a smartphone to perform AI inference, which is the intent, Feldman said.
As an example of scientific work, Feldman cited a research paper by researchers at the King Abdullah University of Science and Technology in Saudi Arabia that was a finalist for the distinguished Gordon Bell Prize presented by the Association for Computing Machinery, organizer of the SC23 event.
“We lent them time on the Condor Galaxy so they could break records for seismic processing,” noted Feldman.
The first version of the Condor Galaxy, CG-1, took 70 days to complete, Feldman said. The CG-2 machine will be ready "early next year." The company is already planning Condor Galaxy 3, or "CG-3," which will add another 64 machines and another 4 exaFLOPS, for a total of 12 exaFLOPS.
One of the key advantages of a machine like the Condor Galaxy, both 1 and 2, Feldman said, is the engineering of the systems. Assembling an equivalent number of GPU chips is incredibly difficult, he told ZDNET. “The number of people who can network a thousand GPUs is very small,” Feldman said. “It’s maybe 25 companies.”
Plus: Qualcomm’s Snapdragon X Elite brings more AI power to your next PC
“It’s very difficult to make efficient use of this much distributed computing, it’s a very, very difficult problem,” Feldman said. “That’s one of the problems we’re basically solving.”
Each CS-2 computer in the Condor Galaxy 1 and 2 contains one of Cerebras' AI chips, the "Wafer Scale Engine," or WSE. These chips, the largest in the world, each contain 850,000 individual "cores" to process AI instructions in parallel, making them the equivalent of multiple GPU chips.
Additionally, the CS-2 computers are complemented by Cerebras' special "fabric" switch, SwarmX, and its dedicated memory hub, MemoryX, which are used to cluster the CS-2s together.