Computer 03

Manycore & Fault Tolerance Techniques

Monday, May 31, 2010, 4:00pm – 6:00pm, Hall 8

Multi- and manycore chips are today's standard architectures, and it is certain that more cores per chip will come. Yet for diverse reasons, full use of every core of a chip by the running application is the exception: memory, I/O, and interconnect bandwidth, cache organization, and the parallelization or load distribution strategy limit the number of cores that actually contribute to application performance. Given this "abundance" of cores, it is natural to think about additional services that cores can provide for applications; Intel's idea of "helper-cores" is a good example. Using redundant cores for fault tolerance measures is a convincing idea presented during this session.

The first contribution, from Carsten Trinitis and Max Walter of TU München, revisits the basic design principles of fault-tolerant systems based on redundancy in the form of additional structures, time, functions, and information, principles that have been applied since the early days of computing. In particular, they will discuss how multiple cores can be used to mask, detect, and recover from transient, intermittent, and permanent faults occurring at both the hardware and software levels.
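To make the redundancy principle concrete, here is a minimal sketch (my illustration, not the speakers' material) of structural redundancy on a multicore host: the same computation runs as three replicas on separate cores, and a majority vote masks a transient fault in any single replica. The workload and the fault injection are invented placeholders.

import multiprocessing as mp
import os
import random
from collections import Counter

def compute(x):
    # The replicated workload; a real system would run application code here.
    result = x * x + 1
    # Illustrative fault injection, seeded per process so replicas diverge.
    rng = random.Random(os.urandom(8))
    if rng.random() < 0.05:
        result ^= 0x1  # flip the low bit to emulate a single event upset
    return result

def vote(x, replicas=3):
    # Structural redundancy: run the same input on `replicas` cores and
    # majority-vote the results, masking a fault in any single replica.
    with mp.Pool(processes=replicas) as pool:
        results = pool.map(compute, [x] * replicas)
    value, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError("no majority: fault detected but not masked")
    return value

if __name__ == "__main__":
    print(vote(42))

Time redundancy follows the same voting pattern, with repeated execution on one core taking the place of parallel replicas on several.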

Herbert Cornelius, Director of the Advanced Computing Center EMEA at Intel, discusses the challenges and opportunities of fault tolerance on manycore architectures in the context of general-purpose processors.

High-performance processing for space applications is presented by Christopher Kühl of Astrium EADS. With the increasing use of remote sensing and earth observation technologies, the large amounts of data collected onboard require high-performance, fast processing hardware. Moreover, the flexibility and processing requirements of regenerative processor payloads are an order of magnitude beyond what classical processors can handle. The processing hardware also has to deal with very harsh conditions, including acceleration during launch and radiation in the space environment. The presentation will give an insight into the requirements for processing elements in space applications and an overview of the current status of processing capabilities.
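To make the radiation concern concrete, here is a minimal sketch (my illustration, not a description of Astrium's actual design) of single-error correction with a Hamming(7,4) code, the textbook basis of the EDAC memory protection commonly used against radiation-induced bit flips in onboard memories:

def hamming74_encode(d):
    # d: list of 4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4].
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    # c: 7-bit codeword, possibly with one flipped bit -> 4 corrected data bits.
    c = list(c)
    # Each syndrome bit re-checks one parity group; together they give the
    # 1-based position of a single flipped bit, or 0 if the word is clean.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1  # correct the upset in place
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
codeword = hamming74_encode(word)
codeword[4] ^= 1  # simulate a single event upset in a stored word
assert hamming74_correct(codeword) == word

Real space-grade memories use wider SEC-DED codes plus periodic scrubbing, but the syndrome mechanism is the same.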

Finally, Andreas Herkersdorf of the Institute for Integrated Systems at TU München will discuss bio-inspired self-organization for embedded manycore resilience. CMOS technology will continue to see capacity growth, enabling the integration of ever more processor cores, memory, and diverse functions on a single chip. On the downside, transistor feature sizes approaching physical limits in the 10 nm range make transistor devices increasingly vulnerable and sensitive to all forms of environmental and manufacturing variability: statistical bit flips originate from ionizing radiation (soft errors), and transient signal timing violations arise from temperature or supply voltage fluctuations. Traditional fault tolerance concepts based on information and spatial redundancy are challenged to cope with these exposures in a cost-efficient manner. This presentation proposes that decentralized self-organization principles, inherent to the collective behavior of swarms in nature, be adopted at different abstraction levels of multi-/manycore processor systems to improve resilience and fault tolerance. We will present recent research results on how CPU data paths can self-correct SEUs and SETs (single event upsets and single event transients) as well as timing errors "on the fly", and how low-overhead reinforcement-based machine learning techniques can be incorporated to optimize the operating parameters of individual processor cores and achieve robust overall application behavior.
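The last point can be illustrated with a rough sketch (my illustration under assumed operating points and a synthetic reward signal, not the presented work): an epsilon-greedy bandit, one of the simplest reinforcement-learning schemes, picks among hypothetical voltage/frequency pairs so that a core gradually settles on the point that balances performance against its observed timing-error rate. The operating points and the reward model are invented placeholders for on-chip sensors and error counters.

import random

OPERATING_POINTS = [(0.9, 1.2), (1.0, 1.5), (1.1, 1.8)]  # (volts, GHz), assumed

def measure_reward(point):
    # Placeholder for real sensor feedback: reward performance, penalize the
    # timing-error rate, which grows as the voltage margin shrinks.
    volts, ghz = point
    error_rate = max(0.0, 0.3 * (ghz / volts - 1.3) + random.gauss(0, 0.02))
    return ghz - 10.0 * error_rate

def tune(steps=1000, epsilon=0.1):
    q = [0.0] * len(OPERATING_POINTS)  # running value estimate per point
    n = [0] * len(OPERATING_POINTS)    # visit count per point
    for _ in range(steps):
        if random.random() < epsilon:                  # explore
            i = random.randrange(len(OPERATING_POINTS))
        else:                                          # exploit best estimate
            i = max(range(len(q)), key=q.__getitem__)
        r = measure_reward(OPERATING_POINTS[i])
        n[i] += 1
        q[i] += (r - q[i]) / n[i]                      # incremental mean update
    return OPERATING_POINTS[max(range(len(q)), key=q.__getitem__)]

print("chosen operating point:", tune())

The incremental-mean update keeps per-core state to two small arrays, which is what makes such schemes plausible as low-overhead on-chip controllers.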