Measuring modern high-performance computing for AI and more
Supercomputers are deployed all over the world to solve some of the biggest challenges faced by humankind. These room-sized machines, millions of times more powerful than any laptop, are capable of dizzyingly fast computational feats. These giants were once exclusively at the disposal of organisations like large government laboratories, NASA and the topmost elite of players in vertical sectors like manufacturing, finance, oil and gas and aerospace. But now changes are afoot in the way supercomputers are designed and built, opening them up to a new range of use cases. Benefitting from a new generation of processing power and ultra-fast networking, we are entering a new and perhaps more democratised era of high-performance computing (HPC).
Graphics processing units (GPUs) are increasingly taking over compute-intensive work from central processing units (CPUs), delivering significantly more computational throughput. GPU-based systems have a smaller footprint than legacy HPC systems, operate more efficiently and cost less to run.
But as computing horsepower increases, so does the demand for maximal data throughput. This need for high throughput and very low latency is being met by InfiniBand, a networking standard commonly used in the world of HPC.
A strong supporting ecosystem is another sign of democratisation. With more than 600 HPC applications now taking advantage of GPUs and InfiniBand networking to accelerate performance, adoption continues to be strong in both business and research.
Pioneering the next generation of AI
Another emerging use for this increasingly accessible processing power lies in enabling artificial intelligence. There is a trend towards using massive AI models, and that is changing how AI is built.
Microsoft, for example, is a pioneer in AI and uses both GPUs and InfiniBand at scale. By utilising state-of-the-art supercomputing in its Azure platform to power a new class of large-scale models, Microsoft is enabling a whole new generation of AI. Because they are trained on massive amounts of data, these large-scale models need to be trained only once. The models can then be fine-tuned for different tasks and domains with much smaller datasets and resources.
The importance of measuring performance
As HPC use cases broaden, more supercomputers are being built to faster and more powerful specifications. It remains as important as ever to understand how different HPC machines compare with each other. Hence the significance of the TOP500 project, which ranks and details the 500 most powerful non-distributed computer systems in the world. The project started in 1993 and still publishes an updated list of supercomputers twice a year, now including a far greater range of machines than in its early days.
The value of the TOP500 project lies in providing a reliable basis for tracking and detecting trends in high-performance computing. But let’s consider for a moment the benchmarks that are used to quantify HPC.
Historically the foremost of these has been the long-standing HPL benchmark. HPL is a portable implementation of the High-Performance Linpack Benchmark. It is used as a reference to provide data for the TOP500 and is a key tool in the ranking of supercomputers worldwide. However, it only measures compute power in the form of floating-point operations per second (flops).
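To make the flops figure concrete: HPL times the solution of a large dense system of linear equations and divides a nominal operation count, 2/3·n³ + 2·n² for an n × n system, by the elapsed time. The sketch below illustrates that arithmetic on a tiny problem in plain Python. It is not the HPL code, the function names (`hpl_flop_count`, `solve_dense`, `measure_flops`) are made up for illustration, and the rate it reports says nothing about what tuned HPL would achieve on the same machine.

```python
import random
import time


def hpl_flop_count(n: int) -> float:
    """Nominal operation count credited for solving an n x n dense system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on copies so the caller's data is left untouched.
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def measure_flops(n: int, seed: int = 0) -> float:
    """Time one solve of a random n x n system and return flops per second."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    solve_dense(a, b)
    elapsed = time.perf_counter() - start
    # Guard against a timer too coarse to register a very small solve.
    return hpl_flop_count(n) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    print(f"{measure_flops(200):.3e} flops (illustrative only)")
```

The single number that comes out is exactly the limitation noted above: it captures raw arithmetic throughput on one kind of problem and nothing else.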
The HPCG benchmark (High Performance Conjugate Gradients) was created as an alternative, offering another metric for ranking HPC systems and intended as a complement to HPL. It is not, however, integrated into the TOP500 ranking.
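For readers unfamiliar with the method behind the name, conjugate gradients is an iterative solver for sparse, symmetric positive-definite linear systems. The sketch below shows the basic iteration on a deliberately small stand-in problem, a one-dimensional Laplacian. That problem, the absence of a preconditioner and the function names are all simplifying assumptions made here for illustration, and the code is not taken from the HPCG benchmark itself.

```python
def laplacian_1d_matvec(v):
    """Apply the tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b, where A is only available through matvec(v) = A v."""
    x = [0.0] * len(b)
    r = b[:]  # residual b - A x, which equals b for the zero initial guess
    p = r[:]  # first search direction
    rs_old = dot(r, r)
    if rs_old ** 0.5 < tol:
        return x, 0  # b is (numerically) zero, so x = 0 already solves it
    for iteration in range(1, max_iter + 1):
        ap = matvec(p)
        alpha = rs_old / dot(p, ap)  # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:
            return x, iteration
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]  # next search direction
        rs_old = rs_new
    return x, max_iter


if __name__ == "__main__":
    expected = [1.0] * 50
    rhs = laplacian_1d_matvec(expected)
    solution, iterations = conjugate_gradient(laplacian_1d_matvec, rhs)
    print(f"converged in {iterations} iterations")
```

Each iteration is dominated by a sparse matrix-vector product and a handful of vector updates, a very different workload from the dense factorisation timed by HPL.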
As we have already seen, artificial intelligence is now a key part of the HPC landscape, and so a new and more suitable benchmark is regarded by some as a necessary recognition of this trend.
A new metric for modern day HPC systems
MLPerf is a new type of benchmarking organisation. Right in line with the age of AI supercomputing, its mission is to build fair and useful benchmarks for measuring training and inference performance of machine learning (ML) hardware, software and services. Its growing acceptance is making it a useful tool for researchers, developers, hardware manufacturers, builders of machine learning frameworks, cloud service providers, application providers, and of course end users.
Its goals revolve around accelerating the progress of ML via fair and useful measurement to serve both the commercial and research communities. It also seeks to enable a more equitable basis for the comparison of competing systems, while encouraging innovation. Perhaps the part of its ethos that most sets it apart from other HPC benchmarks is its commitment to keeping benchmarking affordable so all can participate. MLPerf is backed by organisations including Amazon, Baidu, Facebook, Google, Harvard, Intel, Microsoft and Stanford, and it is constantly evolving to remain relevant as AI itself evolves.
The largest HPC and AI systems today are not only tackling traditional HPC workloads in new ways by means of GPUs with InfiniBand networking, but are also enabling a new wave of recommendation systems and conversational AI applications, while others power the quest for personalised and precision medicine. All the while we are moving beyond the traditional CPU-based systems that used to dominate the world of HPC research. Compute at the top level is no longer the preserve of an elite.
By Scot Schultz, Sr. Director, HPC / Technical Computing, NVIDIA networking business unit