After Russia’s full-scale invasion of Ukraine on 24 February 2022, demand for virtual machines among the user organisations of URAN, the Ukrainian NREN, grew exponentially, driven by the need to prevent data loss from the possible destruction of physical infrastructure. Numerous universities were able to take advantage of the free cloud services offered by URAN, but the increased demand led to an unforeseen crisis caused by disk storage overload in the NREN’s data centre.
In early September, all the virtual machines running in the URAN cloud unexpectedly shut down. eduGAIN (the interfederation service that connects identity federations around the world), eduVPN (a secure and encrypted internet connection service), and the distance learning management systems (LMS) of Simon Kuznets Kharkiv National University of Economics and Odesa State Agrarian University were all blocked. Even domain names registered for URAN users stopped working as a consequence of the shutdown of URAN DNS servers. It took the URAN technical team over 12 hours to identify and eliminate the cause of this problem.
“Diagnostics is the first step in such circumstances. Our main challenge was that all our diagnostic tools showed the absence of any issues, and yet nothing seemed to be working”, says Yevhenii Preobrazhenskyi, URAN executive director.
URAN technical specialists noticed that one of the disks in the cloud storage had reached more than 95% capacity.
Ceph, the fault-tolerant data storage system used in URAN’s cloud storage, issues a warning when one of its disks reaches a certain capacity: a “nearfull” warning at 85%, a “backfillfull” warning at 90%, and a “full” disk warning at 95%.
“The system reacted in a way that we didn’t expect when it reached 95% capacity,” Oleh Yurchenko, URAN System Administrator, comments. “But after going through all the documentation about the Ceph data storage, we discovered a small footnote explaining that when a single disk reaches 95%, the entire cluster switches to ‘read-only’ mode, hence blocking the entire system, in order to secure and save the stored data”.
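The thresholds and the cluster-wide effect described above can be illustrated with a short sketch. This is not Ceph code or URAN's tooling: the function names, the disk labels and the usage figures are hypothetical, and only the three percentages and the "one full disk blocks writes for the whole cluster" behaviour are taken from the article.

```python
from typing import Mapping

# Thresholds as quoted in the article, expressed as fractions of disk capacity.
NEARFULL_RATIO = 0.85
BACKFILLFULL_RATIO = 0.90
FULL_RATIO = 0.95


def disk_state(used_fraction: float) -> str:
    """Map one disk's utilisation (0.0 to 1.0) to the warning level it would raise."""
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0.0 and 1.0")
    if used_fraction >= FULL_RATIO:
        return "full"
    if used_fraction >= BACKFILLFULL_RATIO:
        return "backfillfull"
    if used_fraction >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


def cluster_accepts_writes(disk_usage: Mapping[str, float]) -> bool:
    """A single 'full' disk is enough to stop writes for the whole cluster."""
    return all(disk_state(used) != "full" for used in disk_usage.values())


# Hypothetical figures: two disks are comfortably below every threshold,
# yet the third one alone is enough to block writes cluster-wide.
example_usage = {"disk-a": 0.62, "disk-b": 0.96, "disk-c": 0.58}
print({name: disk_state(used) for name, used in example_usage.items()})
print("cluster accepts writes:", cluster_accepts_writes(example_usage))
```

The sketch also shows why a check based on overall free space can look healthy: in this made-up example most of the capacity is still unused, but the per-disk rule is what decides whether the cluster keeps accepting writes.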
The problem was solved in three steps. The first task was to bring the virtual machines back into operation, as they had been down for 8 hours; the next was to eliminate the cause of the shutdown and prevent its recurrence.
Oleh Yurchenko continues: “We first changed the percentage corresponding to a ‘full disk’ state, then we rebalanced the system. The third step consisted of adding a number of disks to the cluster. We connected a server, previously acquired within the framework of the EaPConnect project, to the system and the cluster finally stabilised. As a result, user organisations received a reliable and stable cloud service and were able to continue offering their services to their users. Students and teachers regained access to the virtual learning space, the university websites were up and running again, and distance learning resumed”.
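For readers who run Ceph themselves, the first two steps could look roughly like the sketch below. It only assembles command lines and does not execute anything. The article does not say which commands or which values URAN used, so the helper name `recovery_plan` and the 0.97 ratio are assumptions made for illustration; `ceph osd set-full-ratio`, `ceph osd reweight-by-utilization` and `ceph osd df` are standard Ceph CLI subcommands. Adding disks, the third step, depends on the hardware and deployment tooling and is left out.

```python
from typing import List


def recovery_plan(current_full_ratio: float, temporary_full_ratio: float) -> List[List[str]]:
    """Build (but do not run) the command lines for an emergency recovery.

    The temporary ratio must be above the current one, otherwise the cluster
    stays blocked, and below 1.0, so the disks are never allowed to fill up.
    """
    if not current_full_ratio < temporary_full_ratio < 1.0:
        raise ValueError("temporary ratio must lie between the current ratio and 1.0")
    return [
        # Step 1: raise the 'full' threshold so the cluster accepts writes again.
        ["ceph", "osd", "set-full-ratio", f"{temporary_full_ratio:.2f}"],
        # Step 2: shift data away from the most heavily used disks.
        ["ceph", "osd", "reweight-by-utilization"],
        # Check per-disk utilisation after the rebalance.
        ["ceph", "osd", "df"],
    ]


# Assumed values: 0.95 is the threshold quoted in the article, 0.97 is illustrative.
for command in recovery_plan(0.95, 0.97):
    print(" ".join(command))
```

Raising the threshold only buys time: the margin it frees is small, so it makes sense to follow it with a rebalance and extra capacity, and to restore the original ratio once the cluster has room again.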
“Moreover, we are now also ready to accept applications for new virtual machines since the demand for this service in war conditions remains extremely high,” said Yevhenii Preobrazhenskyi, URAN executive director.
URAN’s technical team believes that although the addition of 5 disks will satisfy current demand, it will not ensure sustainable future development. Therefore, looking ahead, URAN plans to purchase additional servers within the framework of the EaPConnect project, where the NREN represents the interests of Ukraine’s research and education community.
The team is also working on a technical project to retrofit some old servers with new SSDs, which offer much higher read and write speeds, in order to improve the provision of cloud services and help URAN keep up with the times.
Funded by the European Union, EaPConnect is part of the European initiative EU4Digital. The project aims to unite the research and education communities of the EU and the Eastern partner countries, as well as reduce the digital divide.