Dardel Fastest in Sweden
Gert Svensson, PDC
Dardel is now the fastest high-performance computing (HPC) system in Sweden and fifth on the worldwide Green500 list (www.top500.org/lists/green500/2022/11)! In recent months, Dardel has undergone a number of expansions and upgrades. The most significant is the installation of 56 graphics processing unit (GPU) nodes, which are now in operation and undergoing acceptance tests.
As previously announced, the plan was for Dardel to have 56 GPU nodes, each with four AMD Instinct™ MI250X GPUs; all of those nodes have now been installed in the system. That increased the speed significantly and resulted in a performance of 8.2 petaflops for the GPUs on the High-Performance Linpack (HPL) benchmark. This means that Dardel’s HPL performance has more than tripled.
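As a rough illustration of what those figures imply, the sketch below works out the total number of GPUs and the average HPL performance per GPU. It uses only the numbers quoted above (56 nodes, four GPUs per node, 8.2 petaflops); the function names are hypothetical, and because the HPL figure is rounded, the per-GPU value is only approximate.

```python
# Figures quoted in the article; the HPL value is rounded to one decimal place.
GPU_NODES = 56
GPUS_PER_NODE = 4
HPL_PETAFLOPS = 8.2


def total_gpus(nodes: int, gpus_per_node: int) -> int:
    """Return the number of GPUs across all GPU nodes."""
    return nodes * gpus_per_node


def hpl_teraflops_per_gpu(hpl_petaflops: float, gpus: int) -> float:
    """Return the average HPL performance per GPU in teraflops."""
    return hpl_petaflops * 1000.0 / gpus  # 1 petaflops = 1000 teraflops


if __name__ == "__main__":
    gpus = total_gpus(GPU_NODES, GPUS_PER_NODE)
    per_gpu = hpl_teraflops_per_gpu(HPL_PETAFLOPS, gpus)
    print(f"{gpus} GPUs, roughly {per_gpu:.1f} teraflops each on HPL")
```

This is an average over the whole GPU partition, not a measurement of any single GPU.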
Dardel is now in 68th place on the latest TOP500 list, which was released in November (see www.top500.org/lists/top500/2022/11). Note that the GPU partition of Dardel is referred to as “Dardel GPU” in both the TOP500 and Green500 lists to distinguish it from the earlier CPU-only phase of the system, which was previously listed simply as “Dardel”. The CPU partition of the system is now referred to as “Dardel CPU” and is in 345th place on the TOP500 list. As mentioned earlier, “Dardel GPU” is in fifth place on the Green500 list, and it is interesting to observe that positions two to seven on that list are all occupied by HPE Cray EX systems with AMD Instinct™ MI250X GPUs. This clearly demonstrates that this type of architecture is highly energy efficient, as measured in floating-point operations per second per watt.
When the previous issue of the PDC newsletter was published, most researchers had been migrated from PDC’s previous systems to Dardel, although the Scania partition of Dardel was not fully operational at the time. Thanks to the upgrades and installations, 240 of the Dardel CPU nodes are now serving Scania’s research and development. Recently, 468 nodes were added to the system for SNIC academic researchers, including eight two-terabyte “Giant” nodes. In addition, twelve nodes dedicated to research by the Dept. of Astronomy at Stockholm University have been installed.
Type of node | Memory | CPU nodes: SNIC initial | CPU nodes: Industry/business | CPU nodes: SU Astronomy | CPU nodes: SNIC extra | CPU nodes: Total | GPU nodes |
---|---|---|---|---|---|---|---|
Thin | 256 GB | 488 | 36 | 0 | 212 | 736 | 0 |
Large | 512 GB | 20 | 236 | 12 | 248 | 516 | 56 |
Huge | 1 TB | 8 | 0 | 0 | 0 | 8 | 0 |
Giant | 2 TB | 2 | 0 | 0 | 8 | 10 | 0 |
TOTAL | - | 518 | 272 | 12 | 468 | 1270 | 56 |
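The totals in the table can be cross-checked with a few lines of code. The following is a minimal sketch that transcribes the CPU node counts from the table and recomputes the row, column and grand totals; the dictionary layout and function names are illustrative choices only.

```python
# CPU node counts transcribed from the table above (node type -> allocation -> count).
CPU_NODES = {
    "Thin":  {"SNIC initial": 488, "Industry/business": 36,  "SU Astronomy": 0,  "SNIC extra": 212},
    "Large": {"SNIC initial": 20,  "Industry/business": 236, "SU Astronomy": 12, "SNIC extra": 248},
    "Huge":  {"SNIC initial": 8,   "Industry/business": 0,   "SU Astronomy": 0,  "SNIC extra": 0},
    "Giant": {"SNIC initial": 2,   "Industry/business": 0,   "SU Astronomy": 0,  "SNIC extra": 8},
}

# Totals as printed in the table.
REPORTED_ROW_TOTALS = {"Thin": 736, "Large": 516, "Huge": 8, "Giant": 10}
REPORTED_COLUMN_TOTALS = {
    "SNIC initial": 518, "Industry/business": 272, "SU Astronomy": 12, "SNIC extra": 468,
}
REPORTED_GRAND_TOTAL = 1270


def row_totals(table: dict) -> dict:
    """Sum the CPU nodes of each node type across all allocations."""
    return {node_type: sum(counts.values()) for node_type, counts in table.items()}


def column_totals(table: dict) -> dict:
    """Sum the CPU nodes of each allocation across all node types."""
    totals: dict = {}
    for counts in table.values():
        for allocation, count in counts.items():
            totals[allocation] = totals.get(allocation, 0) + count
    return totals


if __name__ == "__main__":
    print("Row totals match:   ", row_totals(CPU_NODES) == REPORTED_ROW_TOTALS)
    print("Column totals match:", column_totals(CPU_NODES) == REPORTED_COLUMN_TOTALS)
    print("Grand total matches:", sum(row_totals(CPU_NODES).values()) == REPORTED_GRAND_TOTAL)
```

The “SNIC extra” column sums to the 468 recently added nodes mentioned in the text, including the eight Giant nodes.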
The entire software stack of Dardel has been upgraded to a more current release called Strawberry. At the time of writing, this release was being tested on an “island” of the Dardel system. Also, the access mechanism for the Lustre file storage system has been changed so that it will be compatible with future software versions: it was changed from Remote Direct Memory Access (RDMA) technology to the Transmission Control Protocol/Internet Protocol (TCP/IP). This may temporarily decrease the speed of the disks until the disk software is also updated, after which the speed should be similar to or better than before.
PDC has been working with Hewlett Packard Enterprise (HPE) to make all these significant changes to Dardel while minimising effects on the system operation. Most of the work has been possible without affecting the researchers using the system.
There are still some major hardware and software upgrades ahead of us. The interconnect in Dardel will soon be upgraded to the next generation of the Slingshot network, with a speed of 200 gigabits per second. This means replacing all the Slingshot cards and some cables, which will require a week of system downtime.
Other upcoming upgrades include expanding the Lustre disk system by 50% in terms of both size and speed, and upgrading the software in the disk system to a new version. As mentioned earlier, that software upgrade should optimise TCP/IP access so that the access speed is similar to what it was with the previous RDMA access. The Dardel system software will also be upgraded to several new releases that will provide a lot of new functionality, especially for supporting the GPUs.