Federated HPC Clouds Applied to Radiation Therapy

Andrés Gómez, CESGA

We present an autonomous, fault-tolerant virtual cluster architecture, developed in the European BonFIRE project, that can handle the failure of a cloud site or variability in its performance. The architecture includes an elasticity engine that uses measured application performance to decide how large the cluster must be to meet the user's expectations. When the user specifies a deadline, the elasticity engine starts new machines as needed and adds them to the virtual cluster. When the virtual cluster is deployed across two cloud sites and one site fails, the part of the cluster at the surviving site resizes itself to recover from the failure. These features will be demonstrated, and results will be shown for a real application from the radiation therapy project e-IMRT.
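
The deadline-driven resizing described above can be illustrated with a minimal sketch. It is not the BonFIRE elasticity engine. It assumes that the workload splits into independent tasks, that throughput scales linearly with the number of nodes, and that the engine can measure a per-node task rate. The names nodes_needed, nodes_to_start, rate_per_node, and max_nodes are hypothetical and chosen only for illustration.

```python
import math


def nodes_needed(remaining_tasks: int, rate_per_node: float, seconds_left: float) -> int:
    """Smallest node count that finishes the remaining tasks before the deadline.

    Assumption: throughput scales linearly, so n nodes complete
    n * rate_per_node tasks per second.
    """
    if remaining_tasks <= 0:
        return 0
    if rate_per_node <= 0 or seconds_left <= 0:
        raise ValueError("deadline cannot be met: no measured throughput or no time left")
    return math.ceil(remaining_tasks / (rate_per_node * seconds_left))


def nodes_to_start(alive_nodes: int, remaining_tasks: int, rate_per_node: float,
                   seconds_left: float, max_nodes: int) -> int:
    """Number of new machines to add to the virtual cluster.

    alive_nodes counts only nodes that still respond, so a site failure
    (fewer alive nodes) and a performance drop (lower rate_per_node)
    both lead to a larger request. max_nodes is a hypothetical quota.
    """
    target = min(nodes_needed(remaining_tasks, rate_per_node, seconds_left), max_nodes)
    return max(0, target - alive_nodes)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 tasks left, 0.5 tasks/s per node, 500 s to the deadline.
    print(nodes_to_start(alive_nodes=4, remaining_tasks=1000,
                         rate_per_node=0.5, seconds_left=500, max_nodes=16))
    # Same workload after one of two sites fails and only 2 nodes survive.
    print(nodes_to_start(alive_nodes=2, remaining_tasks=1000,
                         rate_per_node=0.5, seconds_left=500, max_nodes=16))
```

In this sketch a site failure needs no special case: the surviving site sees fewer live nodes, and the same calculation yields the number of machines to start. A real engine would also have to account for machine start-up time and measurement noise, both of which are ignored here.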