Dardel Second Phase on the Way
Gert Svensson, PDC
Current Status of Dardel
The first phase of Dardel, the HPE Cray EX system at PDC that is the new flagship system of the Swedish National Infrastructure for Computing (SNIC), was inaugurated at the start of this year. All the researchers who had been using PDC’s previous systems, Beskow and Tegner, have been transferred to Dardel (except for those from Scania) and have been successfully using the system for their scientific and technical simulations. Researchers from Scania will continue using Beskow until later this year, when the Scania partition on Dardel is fully operational. The number of cores (or processing units) per CPU increased dramatically, from 16 in Beskow to 64 in Dardel. Memory bandwidth also increased, though by less than the number of cores, so less memory bandwidth is available per core. Some researchers using Dardel have needed to adapt their code to these changes, and the resulting performance is usually good.
Training for Using Dardel
Together with HPE and AMD, which is the manufacturer of the Dardel CPUs and graphics processing units (GPUs), PDC has arranged programming classes and introductory courses on how to use the system, as well as a GPU Hackathon week where experts from HPE and AMD helped to adapt locally developed code to run well on the AMD Instinct™ MI250X GPUs that will feature in the second phase of Dardel. There will be more courses and hackathons like these coming up, so keep an eye on the PDC Events page: www.pdc.kth.se/about/events .
Technology Readiness Level
HPE used a number of innovative techniques to make the Dardel system highly scalable and resilient to failures. This makes the system quite complex to operate from a system administration point of view, particularly as some of the new technology has not yet reached the technology readiness level needed to facilitate simple and smooth operation of the system. Some of the software tools that are expected for the day-to-day operation of the system are also not ready yet. In addition, although PDC has worked closely with the HPE team to schedule the updates needed to alleviate these issues, the maximum amount of downtime that was guaranteed by HPE has been exceeded due to unforeseen problems. PDC apologises for any inconvenience this has caused for the researchers using Dardel; work is underway with HPE to achieve more stable operation.
GPU Nodes for Second Phase of Dardel
In the previous PDC Newsletter, it was mentioned that the GPUs for the second phase of Dardel were planned to arrive during the spring this year. However, due to factors outside PDC’s control, their arrival has been delayed until the summer. Adding the GPUs to Dardel could result in a significant amount of downtime, so PDC and HPE have spent considerable time investigating the best way to start using the GPUs without disrupting the ongoing operation of the first phase of the system. The first phase of Dardel uses a Slingshot 10 interconnect (rated at 100 Gb/s); this is the network that enables the compute nodes to exchange data at high speeds. The GPU nodes for the second phase, however, use Slingshot 11 (a 200 Gb/s version of the current interconnect). Moving both phases to Slingshot 11 would involve changing all the network interface cards in the first phase, and hence having the whole system down for a considerable amount of time. To avoid this, a separate system will be built for the GPU nodes; it will work with Slingshot 11 and be controlled using a cluster manager known as the HPE Performance Cluster Manager (HPCM). (In contrast, the first phase of Dardel uses Shasta technology.) This solution should provide the same level of functionality for those using the GPUs and will have the advantage of not affecting the ongoing stable operation of the existing Dardel compute nodes. The GPU nodes will access the current Lustre file system through a couple of Lustre router nodes. The plan is for the second phase of the system to be installed during the summer. PDC plans to join the CPU and GPU systems when all of the second phase has been installed and the whole system is operating reliably. That could happen towards the end of this year or early next year.
Upcoming Expansion of Dardel
The Swedish Research Council (VR) and SNIC have decided to invest more in Dardel than was initially planned (to mitigate the risk of a future lack of adequate capacity for Swedish research). This funding, along with the compensation from HPE for the late delivery of the first phase of Dardel, will be used to extend the system with an extra 468 CPU nodes and a 50% increase in disk capacity. At this stage, it is expected that the additional hardware will be delivered in September or October this year. With these extra nodes plus the GPU phase, Dardel will grow by another two compute cabinets and one cooling distribution cabinet (for controlling the water that cools the system).
Type of node | Memory | CPU nodes: SNIC initial | CPU nodes: Industry/business | CPU nodes: SU Astronomy | CPU nodes: SNIC extra | CPU nodes: Total | GPU nodes |
---|---|---|---|---|---|---|---|
Thin | 256 GB | 488 | 36 | 0 | 212 | 736 | 0 |
Large | 512 GB | 20 | 236 | 12 | 248 | 516 | 56 |
Huge | 1 TB | 8 | 0 | 0 | 0 | 8 | 0 |
Giant | 2 TB | 2 | 0 | 0 | 8 | 10 | 0 |
TOTAL | - | 518 | 272 | 12 | 468 | 1270 | 56 |
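
For readers who want to reuse these figures, the row and column totals in the table above can be recomputed with a short script. This is only a sketch: the node counts are copied from the table, while the variable names are chosen here purely for illustration.

```python
# CPU node counts copied from the table above, per node type.
# Column order: SNIC initial, Industry/business, SU Astronomy, SNIC extra.
# (Variable names are illustrative and not taken from any PDC tool.)
cpu_nodes = {
    "Thin": (488, 36, 0, 212),
    "Large": (20, 236, 12, 248),
    "Huge": (8, 0, 0, 0),
    "Giant": (2, 0, 0, 8),
}

# Total per node type (the "Total" column of the table).
row_totals = {name: sum(counts) for name, counts in cpu_nodes.items()}

# Total per allocation (the "TOTAL" row of the table).
column_totals = tuple(sum(column) for column in zip(*cpu_nodes.values()))

# Grand total of CPU nodes after the expansion.
total_cpu_nodes = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(total_cpu_nodes)
```

The 56 GPU nodes are listed separately in the table and are not included in these sums.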