ISC HPC Blog
Simplified HPC for Everyone
The complexity and great variety of distributed platforms (clusters, Clouds, supercomputers, multicore workstations, grids) make it hard, and sometimes impossible, for HPC users to work efficiently. These users almost always favor simplicity over raw performance.
When using computing or simulation applications with huge computing and storage requirements, researchers and engineers need to know not only the characteristics of their application (and sometimes even of its parallelization), but also details about the machines they can use. These details can be static (characteristics of the nodes, memory, network, storage, etc.) or dynamic (availability, load). As a consequence, users tend to use the first machine they can access, even if it is not the best suited to their application. On top of that, if several machines are available, users must juggle a separate set of account credentials for each.
For infrastructure administrators, the management of a heterogeneous set of machines (of different architectures and generations, using different resource- and data-managers) can quickly become very complex and time-consuming. It also requires a lot of investment in training on all the different tools and interfaces. In a nutshell, it is impossible to use these platforms optimally and with a reduced cost without having a standard, unified interface.
One of the first features needed on these platforms is the ability to manage the entire set of resources in a distributed way and to schedule resources (computing or storage) as a whole. Managing many resources accessed by many users thus requires management tools that are extensible (to handle the load of growing resources and user requests), efficient (to optimize the platform's throughput for administrators and the latency for users), and, most of all, transparent for users, so that they do not need to explicitly choose the platform best adapted to their needs.
With HPC gaining popularity, increasingly varied applications have appeared. As a result, the scheduler must be able to adapt to those applications (e.g., parametric, simulation, or big-data applications). When combined with the monitoring of relevant information on the platform, this adaptability allows optimal scheduling decisions to be made. Regarding data management, given the volume of communications between the machines on the network, it is crucial to schedule data transfers (using persistence and/or replication when needed) to limit the overhead.
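As a toy illustration of such a monitoring-driven decision, a scheduler might pick the machine with the lowest estimated completion time by combining static characteristics (cores, memory) with dynamic state (load). This is a minimal sketch with hypothetical machine data, not SysFera-DS code:

```python
def pick_machine(machines, job):
    """Return the machine with the lowest estimated completion time
    among those that satisfy the job's memory requirement."""
    candidates = [m for m in machines if m["memory_gb"] >= job["memory_gb"]]
    if not candidates:
        return None

    def eta(m):
        # Effective capacity shrinks as the machine's current load rises.
        free_cores = max(m["cores"] * (1.0 - m["load"]), 1)
        return job["core_hours"] / free_cores

    return min(candidates, key=eta)

# Hypothetical platform state gathered by a monitoring service.
machines = [
    {"name": "cluster-a", "cores": 128, "memory_gb": 256, "load": 0.9},
    {"name": "cloud-b",   "cores": 64,  "memory_gb": 128, "load": 0.1},
]
job = {"core_hours": 512, "memory_gb": 64}
print(pick_machine(machines, job)["name"])  # → cloud-b
```

Note how the lightly loaded Cloud node wins even though the cluster has more cores: without the dynamic load information, the decision would come out the other way.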
For administrators (and most users, too!), security is an essential component of a distributed infrastructure. It must be possible to secure all the platforms and to isolate critical applications, all while keeping the system simple to use and allowing users to start remote applications through firewalls, for example. Users should also obviously not have to maintain a separate account on each machine. Access to the entire platform should be possible through APIs or scripts but, for most users, a simple portal would be key to the adoption of the distributed infrastructure. Such a portal would allow the user to submit input parameters to applications, start simulations and computations, access results and outputs, and possibly even get feedback on the application's execution (performance, resource usage, cost, and so on).
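From a scripting user's point of view, such a portal boils down to submitting an application name plus input parameters and getting back a job identifier to poll. The sketch below is entirely illustrative: the function names and the in-memory job store are assumptions, not the SysFera-DS API.

```python
import itertools

_ids = itertools.count(1)   # simple incrementing job-id generator
_jobs = {}                  # illustrative in-memory job store

def submit(application, parameters):
    """Record a job request and hand back an identifier that the
    user can later poll for status and outputs."""
    job_id = f"job-{next(_ids)}"
    _jobs[job_id] = {"application": application,
                     "parameters": parameters,
                     "status": "queued"}
    return job_id

def status(job_id):
    """Return the current status of a previously submitted job."""
    return _jobs[job_id]["status"]

jid = submit("fluid-sim", {"mesh": "wing.msh", "steps": 1000})
print(jid, status(jid))  # → job-1 queued
```

A real portal would of course route the request through authentication and the scheduler; the point is that the user-facing surface can stay this small.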
Our software suite, SysFera-DS, was developed to address these concerns. It aims to provide simple, well-adapted access to servers managing pre-installed applications (sequential or parallel) on clusters, supercomputers, or even Clouds (built in a company's data centers or rented from providers such as Amazon or Rackspace). This federates distributed resources, simplifying resource management for administrators and hiding complexity from users. The software can then schedule computing tasks, be they sequential, parallel, or workflows with constraints and dependencies.
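Scheduling a workflow with dependencies means releasing a task only once all of its predecessors have completed. One standard way to derive such an order is Kahn's topological sort, sketched here on a hypothetical simulation workflow (this is a generic illustration, not SysFera-DS internals):

```python
from collections import deque

def schedule_order(deps):
    """deps maps each task to the set of tasks it depends on.
    Return an execution order that respects every dependency."""
    remaining = {t: set(d) for t, d in deps.items()}
    ready = deque(sorted(t for t, d in remaining.items() if not d))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # A task becomes ready once its last dependency completes.
        for t, d in remaining.items():
            if task in d:
                d.remove(task)
                if not d:
                    ready.append(t)
    if len(order) != len(deps):
        raise ValueError("cycle in workflow dependencies")
    return order

# Hypothetical workflow: mesh generation feeds two solver runs,
# whose outputs are merged by a post-processing step.
workflow = {
    "mesh": set(),
    "solve_a": {"mesh"},
    "solve_b": {"mesh"},
    "post": {"solve_a", "solve_b"},
}
print(schedule_order(workflow))  # → ['mesh', 'solve_a', 'solve_b', 'post']
```

In practice the two solver runs, having no dependency on each other, could execute in parallel on different machines of the federation.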
About the author
Frédéric Desprez (http://graal.ens-lyon.fr/~desprez/) is a director of research at Inria and holds a position at the LIP laboratory (ENS Lyon, France). He co-founded the Lyon-based company SysFera, where he serves as scientific advisor. He received his PhD in computer science from the Institut National Polytechnique de Grenoble, France, in 1994 and his MS in computer science from ENS Lyon in 1990.