Monday, June 22, 2020

Nvidia Builds a Top-10 AI Supercomputer in Under a Month

Nvidia today took the wraps off its first Ampere-based supercomputing cluster, called Selene. The cluster was built in the month since the announcement of Nvidia’s new Ampere architecture and A100 artificial intelligence (AI) accelerators. “You take the fastest GPU on Earth and you combine it with the fastest network on Earth, you get Selene, the fastest industrial system in the U.S.,” said Paresh Kharya, director of product management and marketing at Nvidia. “It provides up to 1 exaFLOPs of AI, and over 27 petaFLOPs of HPC.” According to the latest Green500, Selene ranks as the No. 2 supercomputer, exceeding 20 gigaFLOPs per watt. Meanwhile, on the supercomputing TOP500, Selene ranked No. 7 at 27.5 petaFLOPs on the Linpack benchmark.

Selene was built using the DGX SuperPOD architecture, which is preconfigured with nearly 10 miles of optical fiber and based on the InfiniBand interconnect technology Nvidia gained through last year’s $6.9 billion acquisition of Mellanox. The SuperPOD functions as a backplane for Nvidia’s DGX A100 supercomputing system. Each DGX A100 system can be configured with eight A100 GPUs, two 64-core AMD Rome CPUs, 1 terabyte of system memory, nine Mellanox ConnectX-6 virtual protocol interconnect (VPI) NICs, six Nvidia NVSwitches for simultaneous communication across the GPUs, and 15 TB of NVMe-based storage. Selene uses 280 of these DGX A100 systems, for a total of 2,240 A100 GPUs and 35,840 processor cores.

Nvidia announced its Ampere-based A100 GPUs last month at GTC. The company claims the A100 chips provide 20 times the performance of its previous-generation V100 chipset. Nvidia plans to use the supercomputer to advance the field of AI and to develop new products.

Nvidia also rolled out its A100 GPU in a more traditional PCIe form factor. The PCIe card is, according to Kharya, “powered by the same Ampere chip that, with 54 billion transistors, is the world’s largest 7-nanometer chip.” The two form factors enable Nvidia to meet the needs of different AI workloads and server designs. “A PCIe configuration, for example, has just 250 watts of [thermal design power]. This configuration is well suited for mainstream, accelerated servers that go into the standard racks that offer lower power per server,” explained Kharya. “Even on PCIe, it provides great performance for applications that scale to one or two GPUs at a time, including AI inference and some HPC applications.”

Meanwhile, Kharya said the 400-watt SXM4 form factor, used in the DGX A100, is ideally suited for workloads that scale to multiple GPUs in a server as well as across servers. “It’s available through HGX A100 server boards that interconnect GPUs with NVLink and NVSwitch in four- or eight-GPU configurations at nearly 10 times the bandwidth of fourth-generation PCIe,” he added.

The chipmaker also announced that more than 50 A100-based servers from industry partners, including Dell Technologies, Hewlett Packard Enterprise (HPE), Inspur, Lenovo, Fujitsu, Cisco, Asus, Atos, Gigabyte, Quanta, One Stop Systems, and Supermicro, will launch by the end of the year. According to Kharya, the first 30 of these servers are expected to arrive this summer, while an additional 20 are slated for the second half of 2020. He added that these servers will feature both the SXM4 form factor announced with the A100 and the newly unveiled PCIe form factor.
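
For readers checking the math, the Selene totals quoted above follow directly from the per-system configuration. Here is a minimal sketch in Python, with constants taken from the figures in the article; the script itself is purely illustrative, not an Nvidia tool:

    # Back-of-the-envelope check of the Selene totals quoted in the article:
    # 280 DGX A100 systems, each with eight A100 GPUs and two 64-core AMD Rome CPUs.
    DGX_SYSTEMS = 280
    GPUS_PER_SYSTEM = 8
    CPUS_PER_SYSTEM = 2
    CORES_PER_CPU = 64

    total_gpus = DGX_SYSTEMS * GPUS_PER_SYSTEM                   # 2,240 A100 GPUs
    total_cores = DGX_SYSTEMS * CPUS_PER_SYSTEM * CORES_PER_CPU  # 35,840 CPU cores

    print(f"A100 GPUs: {total_gpus:,}")   # A100 GPUs: 2,240
    print(f"CPU cores: {total_cores:,}")  # CPU cores: 35,840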
Alongside the Ampere-related announcements, Mellanox announced a new unified fabric manager called UFM Cyber-AI, which is designed to minimize downtime in InfiniBand-equipped data centers. The ultimate goal of the system is to detect outages before they happen, giving operators time to take preventative measures. The platform applies AI to learn the data center’s operational norms and network patterns by drawing on real-time and historic telemetry and workload data. “The UFM Cyber-AI platform determines a data center’s unique vital signs and uses them to identify performance degradation, component failures, and abnormal usage patterns,” said Gilad Shainer, SVP of marketing for Mellanox. “It allows system administrators to quickly detect and respond to potential security threats and address upcoming failures, saving cost, and ensuring consistent service to customers.”
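
Mellanox has not published how UFM Cyber-AI models those “vital signs,” but the baseline-and-deviation idea Shainer describes can be illustrated with a simple rolling statistic over a telemetry series. The sketch below is a generic Python stand-in; the window size, z-score threshold, and sample data are invented for illustration and are not the UFM Cyber-AI algorithm:

    import statistics
    from collections import deque

    def detect_anomalies(samples, window=60, z_threshold=4.0):
        """Flag telemetry samples that deviate sharply from the recent baseline.

        A crude stand-in for 'learn the operational norm, alert on deviation';
        production systems use far richer models and many more signals.
        """
        history = deque(maxlen=window)
        alerts = []
        for i, value in enumerate(samples):
            if len(history) >= 10:  # wait for a minimal baseline
                mean = statistics.fmean(history)
                stdev = statistics.pstdev(history) or 1e-9
                if abs(value - mean) / stdev > z_threshold:
                    alerts.append((i, value))
            history.append(value)
        return alerts

    # Example: steady port-throughput samples with one obvious spike.
    throughput = [100.0 + (i % 5) for i in range(120)]
    throughput[90] = 400.0
    print(detect_anomalies(throughput))  # [(90, 400.0)]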
