Dardel Status
Gert Svensson, PDC
As reported in the Dardel Inauguration article in this newsletter, Dardel was officially inaugurated on the 23rd of August to celebrate the fact that all the hardware for the system was in place. Work is still being done to get the system functioning optimally; much of the complexity arises from the current cluster manager software, a relatively new product called CSM. Dardel is also still in the process of being updated to a more recent version of the software stack (known as “Raspberry”).
Proposal to Replace CSM
Hewlett Packard Enterprise (HPE), the company that delivered the Dardel system, has suggested replacing the current cluster manager, CSM, with another cluster manager called HPCM. HPCM is a less complex cluster manager that has worked well on other cluster installations for many years. HPE has recently provided PDC with a test rack for HPCM (at no additional cost), and the first tests with this manager are being performed as this article is being written. The test rack makes it possible to test HPCM and any future versions of the entire software stack without affecting the stability of the production system. If everything goes well, the aim is to switch all of Dardel over to HPCM and the latest software stack before the summer of 2024. We hope this will lead to a system that is considerably easier to manage and more stable.
Status of Dardel
A new software stack, Raspberry, is now installed on Dardel in parallel with the previous software stack, which is known as Strawberry. Raspberry has been configured and is undergoing internal testing by PDC to determine whether all or only some applications need to be recompiled. Information about user testing will be sent out when the internal testing has been completed. We have been promised that the annoying Lustre bug (resulting in SIGBUS crashes) has been corrected in Raspberry.
Compensation for Delays
The Dardel project has been considerably delayed by the vendor and, in accordance with the contract, PDC is entitled to compensation for the delay. The compensation can be used to improve and extend the system. For example, it could be used to purchase more GPUs or some newer GPUs for testing purposes, or to adapt the storage system for other types of workloads. Researchers who are using Dardel are invited to contact PDC at support@pdc.kth.se with input and suggestions about how the compensation could be used to upgrade, improve or extend the system. What would be most useful for your research? A minor shortfall in performance in the GPU benchmarks has already resulted in six additional GPU nodes, which will be installed during the spring of next year.