Mehdi Tahoori
Karlsruher Institut für Technologie
|
Keynote Presentation: Cross-Layer Reliability Modeling
and Mitigation
As the minimum feature size continues to shrink, a host of
vulnerabilities threatens the resiliency of VLSI circuits:
increased process variation, radiation-induced soft errors,
and runtime variations due to voltage and temperature
fluctuations, together with transistor and interconnect
aging. For cost-efficient resilient system design,
reliability issues must be addressed at various design
steps. In this talk, I will discuss approaches to model and
mitigate these reliability issues at different stages of the
design cycle, from circuit to architecture, by considering
the interplay of the various reliability phenomena.
Mehdi Tahoori has been Professor and Chair of Dependable
Nano-Computing (CDNC) at the Karlsruhe Institute of
Technology (KIT), Germany, since 2009. Before that, he was
an associate professor of ECE at Northeastern University,
Boston, USA. He received his Ph.D. and M.S. in Electrical
Engineering from Stanford University in 2003 and 2002,
respectively.
respectively. He has been on the organizing and technical
program committee of various design automation, test, and
dependability conferences such as DATE, ICCAD, ITC, ETS,
GLSVLSI, DSN, and IOLTS. He has organized various
workshops, panels, tutorials, and special sessions at major
design and test conferences, such as DATE, ICCAD, and
VTS. He is an associate editor of the ACM Journal on
Emerging Technologies in Computing Systems. He is a
recipient of the National Science Foundation CAREER Award.
|
Giovanni Beltrame
École Polytechnique
|
Accelerating Design Space Exploration for Reliability
with Design Space Pruning
System-level design space exploration (DSE) is an
important process to optimize complex multi-processor
embedded system architectures. During DSE, a system's
configuration can be modified to improve, among other
metrics, the system's expected lifetime, usually based on
the estimation of the Mean-Time-To-Failure (MTTF) of the
system. Typically, the simulation time to evaluate the
MTTF of design points represents a bottleneck for the
whole DSE process. Therefore, the vast design space that
needs to be searched requires effective design space
pruning techniques. We present a set of metrics that
identify similarities among the architecture, mapping, and
wear of a set of configurations, in order to reduce the
number of MTTF evaluations needed during system-level DSE.
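To illustrate the idea (a hypothetical sketch, not the metrics from the talk), similarity-based pruning can reuse a previously simulated MTTF for any configuration close enough to an already-evaluated representative. Here `similarity`, the 0.8 threshold, and the task-to-core mapping format are all illustrative assumptions, and `evaluate_mttf` stands in for the expensive simulation:

```python
def similarity(cfg_a, cfg_b):
    """Hypothetical metric: fraction of tasks mapped to the same core."""
    same = sum(1 for t in cfg_a if cfg_a[t] == cfg_b.get(t))
    return same / len(cfg_a)

def pruned_mttf_evaluation(configs, evaluate_mttf, threshold=0.8):
    """Run the expensive MTTF simulation only for configurations that
    are not sufficiently similar to an evaluated representative."""
    representatives = []  # (config, mttf) pairs already simulated
    results = {}
    for i, cfg in enumerate(configs):
        match = next((m for rep, m in representatives
                      if similarity(cfg, rep) >= threshold), None)
        if match is not None:
            results[i] = match          # reuse the representative's MTTF
        else:
            m = evaluate_mttf(cfg)      # expensive simulation
            representatives.append((cfg, m))
            results[i] = m
    return results, len(representatives)
```

The pruning pays off whenever the number of representatives is much smaller than the number of visited design points.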
Giovanni Beltrame received the M.Sc. degree in electrical
engineering and computer science from the University of
Illinois, Chicago, in 2001, the Laurea degree in computer
engineering from the Politecnico di Milano, Italy, in
2002, the M.S. degree in information technology from
CEFRIEL, Milan, in 2002, and the Ph.D. degree in computer
engineering from the Politecnico di Milano, in 2006. After
his PhD he worked as an engineer at the European Space
Agency on a number of projects spanning from
radiation-tolerant systems to computer-aided design. In
2010 he moved to Montreal, Canada, where he is currently an
Assistant Professor at Polytechnique Montréal. His
research interests include modeling and design of embedded
systems, artificial intelligence, and robotics.
|
Michael Glaß
Friedrich-Alexander-Universität Erlangen-Nürnberg
|
Concurrent Consideration of Transient and Permanent
Faults using Success Trees
Recently, cross-level techniques to build reliable systems
from unreliable components have gained significant
attention in the research community. With faults typically
occurring at the lowest level (e.g. transistor) and
propagating up to the highest level of abstraction
(application), countermeasures can be introduced at almost
all levels of abstraction. Of course, each measure comes
at a certain cost-benefit ratio. To optimally apply
combinations of countermeasures, reliability analysis
techniques at each level of abstraction and the
arbitration between the levels are required. Arbitration
between levels is not the whole story, however: an equally
important aspect is to consider different sources of faults
concurrently. In fact, the significance of individual fault
sources may vary considerably, even during the system's
lifetime, which directly affects which countermeasures to
select and what benefit to expect from them. In this talk,
success trees, a well-known technique for reliability
analysis at different levels of abstraction, are exploited
to perform an automatic concurrent analysis of transient and
permanent faults. This makes it possible to (a) put
transient and permanent faults into perspective with respect
to their impact on system reliability and (b) quantify their
changing impact over the system's lifetime, and hence guides
the designer in applying the right countermeasures for the
targeted mission profile.
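As a toy illustration of the underlying formalism (a generic success-tree evaluation, not the analysis tool from the talk), a success tree combines component reliabilities bottom-up: an AND gate requires all subtrees to function (product of reliabilities), an OR gate requires at least one. With assumed exponential failure rates, the relative impact of the subtrees shifts over the mission time t:

```python
import math

def AND(*children):   # system works only if all subtrees work
    return ('and', children)

def OR(*children):    # system works if at least one subtree works
    return ('or', children)

def comp(rate):       # leaf: component with exponential failure rate
    return ('leaf', rate)

def reliability(node, t):
    """Evaluate the probability that the subtree functions at time t."""
    kind, payload = node
    if kind == 'leaf':
        return math.exp(-payload * t)
    probs = [reliability(child, t) for child in payload]
    if kind == 'and':
        p = 1.0
        for r in probs:
            p *= r
        return p
    # 'or': 1 minus the probability that every subtree has failed
    q = 1.0
    for r in probs:
        q *= (1.0 - r)
    return 1.0 - q
```

For example, `AND(comp(1e-4), OR(comp(1e-3), comp(1e-3)))` models a robust part in series with a duplicated fragile part; evaluating it at several mission times shows how the dominant contributor changes over the lifetime.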
Professor Michael Glaß holds an assistant
professorship for Dependable Embedded Systems and heads
the System-level Design Automation group at
Hardware/Software Co-Design,
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU),
Germany. He received his Diploma degree and Doctorate
degree in computer science from the FAU, Germany, in 2006
and 2011, respectively. Michael is a reviewer for many
international scientific journals and a member of several
technical program committees. His research interests are
dependability engineering for embedded systems and
system-level design automation with particular focus on
formal analysis and design space exploration.
|
Warren Gross
McGill University
|
Energy-efficiency through Faulty Circuits: Opportunities at the Algorithm Level
The delay for a signal to propagate through a digital
circuit varies based on several factors and the magnitude
of this variation is increasing in advanced
technologies. In synchronous circuits, energy efficiency is
governed by the worst-case delay, but some recent work in
the VLSI community has focused on bringing it closer to the
typical case. For logic circuits, one
approach referred to as "voltage over-scaling" or "better
than worst-case design" allows the supply voltage to be
reduced past the critical point, such that some signal
transitions use more than one clock period. The errors
that are introduced can then be handled by additional
circuits, or at the algorithm level. This creates
possibilities for algorithm designers to participate in
improving the energy efficiency of circuit
implementations.
In this talk, we will show how an algorithm that is robust
to computation errors can lead to increased energy
efficiency of a circuit implementation. The robustness can
be achieved by modifying the algorithm to provide fault
tolerance, or by trading off some application performance
(or both). Some examples from the literature will be
provided.
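As a generic illustration of this trade-off (not an example from the talk), self-correcting iterative algorithms such as Newton's iteration tolerate occasional wrong updates: a corrupted iterate is merely a new starting point, and later error-free iterations restore convergence. The error model below, random multiplicative perturbations with an assumed error-free tail, is a crude stand-in for timing errors under voltage over-scaling:

```python
import random

def overscaled_sqrt(a, iters=40, error_prob=0.3, seed=0):
    """Newton's iteration for sqrt(a) with occasionally corrupted
    updates; the last few iterations are assumed error-free, letting
    the self-correcting iteration re-converge."""
    rng = random.Random(seed)
    x = max(a, 1.0)
    for i in range(iters):
        x_new = 0.5 * (x + a / x)          # Newton update for sqrt
        if i < iters - 5 and rng.random() < error_prob:
            x_new *= 1.0 + rng.uniform(-0.05, 0.05)  # corrupted update
        x = x_new
    return x
```

Even with roughly a third of the updates corrupted, the final clean iterations converge quadratically back to the exact root, so accuracy is preserved while the (hypothetical) circuit runs below its worst-case supply voltage.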
Warren J. Gross received the B.A.Sc. degree in electrical
engineering from the University of Waterloo, Waterloo,
Ontario, Canada, in 1996, and the M.A.Sc. and
Ph.D. degrees from the University of Toronto, Toronto,
Ontario, Canada, in 1999 and 2003,
respectively. Currently, he is an Associate Professor with
the Department of Electrical and Computer Engineering,
McGill University, Montréal, Québec, Canada. His research
interests are in the design and implementation of signal
processing systems and custom computer architectures.
Dr. Gross is currently Vice-Chair of the IEEE Signal
Processing Society Technical Committee on Design and
Implementation of Signal Processing Systems. He serves as
Associate Editor for the IEEE Transactions on Signal
Processing. Dr. Gross has served as Technical Program
Co-Chair of the IEEE Workshop on Signal Processing Systems
(SiPS 2012) and as Chair of the IEEE ICC 2012 Workshop on
Emerging Data Storage Technologies. He has served on the
Program Committees of the IEEE Workshop on Signal
Processing Systems, the IEEE Symposium on
Field-Programmable Custom Computing Machines, the
International Conference on Field-Programmable Logic and
Applications and as the General Chair of the 6th Annual
Analog Decoding Workshop. Dr. Gross is a Senior Member of
the IEEE and a licensed Professional Engineer in the
Province of Ontario.
|
Adam Hartman
Carnegie Mellon University
|
Co-optimizing System Lifetime and Time to First
Failure in Embedded Chip Multiprocessors
Nearly all work on task mapping and other system-level
methods for improving lifetime focuses on optimizing time to
system failure (tsys).
These methods assume the system is designed to detect and
recover from one or more permanent component failures
given enough remaining resources to accommodate all
required tasks. Since this design paradigm comes with
increased design and verification difficulty due to the
number of combinations of components that can fail, its
use may not be realistic in systems with many components
or where safety and security are critical. Thus, there
are designs for which it is more important to maximize
time to first failure (tfirst) than
tsys, while other designs may require the
opposite or some co-optimization of the two.
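The distinction can be made concrete with a small Monte Carlo sketch (an illustrative model, not from the talk): for a chip with n cores of which k are needed, tfirst is the time of the first core failure, while tsys is the failure that leaves fewer than k working cores. Under an assumed exponential lifetime model the two differ markedly, which is exactly the gap the co-optimization targets:

```python
import random

def estimate_tfirst_tsys(n_cores=4, needed=2, mean_life=5.0,
                         trials=20000, seed=1):
    """Monte Carlo estimate of expected tfirst and tsys for a k-of-n
    system whose cores have i.i.d. exponential lifetimes."""
    rng = random.Random(seed)
    tfirst = tsys = 0.0
    for _ in range(trials):
        deaths = sorted(rng.expovariate(1.0 / mean_life)
                        for _ in range(n_cores))
        tfirst += deaths[0]                 # first component failure
        tsys += deaths[n_cores - needed]    # system drops below k cores
    return tfirst / trials, tsys / trials
```

A mapping that maximizes tsys spreads wear so spares survive long after the first failure, whereas a design that must never fail at all cares only about pushing tfirst out.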
Adam is a PhD student at Carnegie Mellon University
working with Professor Don Thomas. His research interests
include system-level design and optimization,
lifetime-aware design techniques, and embedded systems.
|
Brett H. Meyer
McGill University
|
Workload Effects on Execution Fingerprinting for
Low-cost Safety-Critical Systems
Execution fingerprinting has emerged as an alternative to
n-modular redundancy for verifying redundant execution
without requiring that all cores execute the same task or
even execute redundant tasks concurrently. Fingerprinting
takes a bit stream characterizing the execution of a task
and compresses it into a single, fixed-width word or
fingerprint. In this talk, we will explore the trade-offs
inherent in fingerprinting subsystem design, including:
(a) determining what application data to compress, as a
function of error detection probability and latency, and
(b) identifying a corresponding fingerprinting circuit
implementation. In this context, we present several case
studies demonstrating how application characteristics
inform fingerprinting subsystem design.
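A minimal sketch of the mechanism (using CRC-32 as the compression circuit, one common choice, and a hypothetical stream of retired register values as the monitored data): redundant executions exchange only the final fixed-width word, and any divergence in the monitored stream almost surely changes it:

```python
import zlib

def fingerprint(values):
    """Fold a stream of 32-bit architectural updates into one
    fixed-width word with a running CRC-32."""
    fp = 0
    for v in values:
        fp = zlib.crc32((v & 0xFFFFFFFF).to_bytes(4, 'little'), fp)
    return fp

trace_ok    = [0x10, 0x25, 0x3F, 0x40]
trace_fault = [0x10, 0x25, 0x3E, 0x40]   # one bit flipped by a fault
assert fingerprint(trace_ok) == fingerprint(trace_ok)
assert fingerprint(trace_ok) != fingerprint(trace_fault)
```

The design questions in the talk map directly onto this sketch: which values feed the stream (detection coverage and latency) and which compression circuit computes the word (area and error-escape probability).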
Brett H. Meyer is a Chwang-Seto Faculty Scholar and
assistant professor in the Department of Electrical and
Computer Engineering at McGill University. He received his
MS and PhD in Electrical and Computer Engineering from
Carnegie Mellon University in 2005 and 2009,
respectively. He received his BS in Electrical
Engineering, Computer Science and Math from the University
of Wisconsin-Madison in 2003. After receiving his PhD,
Meyer worked as a post-doctoral research associate in the
Computer Science Department at the University of
Virginia. He has been on the faculty at McGill since
2011. Meyer's research interests are focused on the design
and architecture of resilient multiprocessor computer
systems.
|
Hiren Patel
University of Waterloo
|
Reliable Computing with Ultra-reduced Instruction Set
Co-processors
We present a combined hardware and software approach for
reliably performing computation in the presence of hard
faults. The hardware proposed implements a
Turing-complete instruction called SUBLEQ in a
co-processor that can, in theory, mimic the semantics of
any other instruction. The software extends LLVM's
back-end with the ability to replace a subset of the MIPS
instructions with SUBLEQ instructions. A MIPS instruction
rendered faulty by a hard fault can then be replaced with a
sequence of SUBLEQ instructions.
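To make the idea concrete, here is a minimal SUBLEQ machine (a generic sketch of the one-instruction computer, not the co-processor from the talk). Each instruction is a triple (a, b, c): subtract mem[a] from mem[b] and branch to c if the result is non-positive. The program below emulates an ADD with three SUBLEQs, in the spirit of how the back-end would substitute for a faulty add:

```python
def run_subleq(mem, pc=0, max_steps=10000):
    """One-instruction machine: mem[b] -= mem[a]; branch to c if the
    result is <= 0; a negative branch target halts."""
    while 0 <= pc and max_steps > 0:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        max_steps -= 1
    return mem

# ADD X, Y emulated via a zeroed scratch cell Z:
#   Z -= X;  Y -= Z (i.e. Y += X);  Z -= Z (clear Z and halt).
X, Y, Z = 9, 10, 11                              # data addresses
program = [X, Z, 3,  Z, Y, 6,  Z, Z, -1,  7, 5, 0]  # X=7, Y=5, Z=0
mem = run_subleq(program)                        # mem[Y] now holds 7+5
```

Three SUBLEQs per ADD illustrates the performance cost of the scheme: correctness is preserved through the co-processor, at the price of a longer instruction sequence.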
Hiren Patel is an assistant professor in the Electrical
and Computer Engineering department at the University of
Waterloo, Canada. His research interests are in
system-level design methodologies, computer architecture
and real-time embedded systems.
|
Mihai Pricopi
National University of Singapore
|
Thermal Reliability Aware Scheduling and Power Management for Heterogeneous Multi-cores
Moore's Law enables continued increase in the number of
cores on chip, but the failure of Dennard scaling is
bringing in the dark silicon era. For reliable operation, a
significant fraction of the cores must be left un-powered,
or dark, at any point in time to meet the thermal design
power (TDP) constraint. This phenomenon is
driving the emergence of asymmetric multi-cores
integrating cores with diverse power-performance
characteristics. We present a comprehensive power
management framework for asymmetric multi-cores—in the
context of mobile embedded platforms—that can provide
satisfactory user experience while minimizing energy
consumption within the TDP budget. Our framework includes
two key components: a power-performance estimation
technique that works across different core types, and a
formal hierarchical control-theoretic approach to
orchestrate various power management knobs towards meeting
the objectives. Results show that the framework can
efficiently exploit the asymmetry of the system by
improving performance while maintaining low power
consumption. Under the TDP constraint, the system keeps chip
power below the TDP through DVFS and, if necessary, graceful
degradation of the tasks' quality of service (QoS).
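As a much-simplified stand-in for such a control loop (the greedy policy, the core names, and the quadratic power model below are all illustrative assumptions, not the formal control-theoretic framework from the talk), a TDP-constrained manager can step down the frequency of the least loaded cores until the chip fits the budget:

```python
def manage_power(core_loads, freq_levels, power_model, tdp):
    """Greedy sketch: start every core at the highest frequency, then
    step down the least loaded cores until total power fits the TDP."""
    levels = {c: len(freq_levels) - 1 for c in core_loads}

    def total_power():
        return sum(power_model(core_loads[c], freq_levels[l])
                   for c, l in levels.items())

    while total_power() > tdp:
        candidates = [c for c in levels if levels[c] > 0]
        if not candidates:
            break  # would require QoS degradation or core gating here
        victim = min(candidates, key=lambda c: core_loads[c])
        levels[victim] -= 1     # DVFS step down on the least loaded core
    return {c: freq_levels[l] for c, l in levels.items()}
```

The asymmetry shows up naturally: lightly loaded (little) cores are throttled first, preserving the performance of the heavily loaded (big) core while the chip stays within the TDP.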
Mihai is a final-year Ph.D. student working on dynamic
heterogeneous computer architectures, asymmetric
architectures, embedded systems, and energy efficiency in
the dark silicon era. His major work is focused on a
heterogeneous processor architecture that allows multi-cores
to adapt dynamically by forming more complex cores. He is
also working on scheduling techniques for asymmetric and
adaptive architectures. Mihai joined the National University
of Singapore in 2009 after receiving his Master's degree in
Computer Engineering. He obtained his Bachelor's degree in
Computer Engineering from the Faculty of Automatic Control
and Computer Engineering in Iasi, Romania.
|
Aviral Shrivastava
Arizona State University
|
Compiler-Microarchitecture Cooperation for Resilience
Against Soft Errors
The focus of our research is to protect computing systems
from the onslaught of rapidly increasing soft errors. It
has become abundantly clear that the effects of soft errors
cannot be contained and mitigated by techniques at a single
level, be it fabrication, microarchitecture, or the system
level. Rather, protection against soft errors has to be
cross-layer: distributed across multiple levels of the
design hierarchy. The main challenge, however, is knowing
whether the protection mechanisms at the different levels
are complementary rather than overlapping. In this talk, I
will present the motivation for cross-layer schemes,
particularly hybrid compiler-microarchitecture techniques,
and then give a sample of how the microarchitecture,
compiler, and system can collaborate to provide effective
yet cost-efficient soft error protection schemes.
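One classic software-level building block in such schemes is compiler-inserted duplication with comparison, in the spirit of EDDI; the sketch below applies it at function granularity for illustration, with a deliberately faulty operation standing in for a soft error (both the decorator and the fault model are illustrative, not the talk's technique):

```python
def duplicated(op):
    """Execute a computation twice and compare the results, flagging a
    mismatch as a detected soft error (EDDI-style duplication, shown
    here at function granularity rather than per instruction)."""
    def checked(*args):
        r1 = op(*args)
        r2 = op(*args)   # shadow copy of the computation
        if r1 != r2:
            raise RuntimeError('soft error detected')
        return r1
    return checked
```

The cross-layer question is then exactly the one the talk raises: if the microarchitecture already protects, say, the register file, the compiler should not pay to duplicate computations whose errors that hardware would catch anyway.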
Aviral Shrivastava is an Associate Professor in the School
of Computing, Informatics, and Decision Systems Engineering
at Arizona State University, where he has established and
heads the Compiler and Microarchitecture Labs (CML). He
received his Ph.D. and Master's degrees in Information and
Computer Science from the University of California, Irvine,
and his Bachelor's degree in Computer Science and
Engineering from the Indian Institute of Technology,
Delhi. He is a 2011 NSF CAREER Award recipient and received
the 2012 Outstanding Junior Researcher award in the School
of Computing, Informatics and Decision Systems Engineering
at ASU. His research focuses on three directions: (1)
manycore architectures and compilers, (2) programmable
accelerators and compilers, and (3) quantitative
resilience. His research is funded by the DOE, the NSF, and
several companies, including Intel, Nvidia, Microsoft,
Raytheon Missile Systems, and Samsung. He serves on the
organizing and program committees of several premier
embedded systems conferences, including ISLPED, CODES+ISSS,
CASES, and LCTES, and on NSF and DOE review panels.
|
Joseph Sloan
University of Texas at Dallas
|
Algorithmic Approaches to Enhancing and Exploiting
Application-Level Error Tolerance
As late-CMOS process scaling leads to increasingly
variable circuits/logic and as most post-CMOS technologies
in sight appear to have largely stochastic
characteristics, hardware reliability has become a first
order design concern. To make matters worse, emerging
computing systems are becoming increasingly power
constrained. Traditional hardware/software approaches are
likely to be impractical for these power constrained
systems due to their heavy reliance on redundant,
worst-case, and conservative designs. The primary goal of
this research has been to investigate how we can leverage
inherent application and algorithm characteristics
(e.g. natural error resilience, spatial and temporal
reuse, and fault containment) to build more efficient
robust systems. In this talk, I will describe algorithmic
approaches that leverage application and
algorithm-awareness for building such systems. These
approaches include (a) application-specific techniques for
low-overhead fault detection, (b) an algorithmic approach to
error correction using localization, and (c) a numerical
optimization-based methodology for converting applications
into a more error-tolerant form. This
research shows that application and algorithm-awareness
can significantly increase the robustness of computing
systems, while also reducing the cost of meeting
reliability targets.
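A small example in the spirit of approaches (a) and (b) (a generic ABFT-style checksum check, not the specific techniques from the talk): for y = Ax, the identity sum(y) = colsum(A) · x gives low-overhead detection, and re-checking individual rows localizes, and here corrects, a faulty entry. The `row_eval` hook is an illustrative device for injecting a faulty evaluation:

```python
def checked_matvec(A, x, row_eval=None):
    """ABFT-style matrix-vector product: verify sum(y) == colsum(A) . x
    to detect an error, then re-check rows to localize and correct it."""
    clean = lambda r: sum(a * xi for a, xi in zip(r, x))
    row = row_eval or clean
    y = [row(r) for r in A]                       # possibly faulty pass
    colsums = [sum(col) for col in zip(*A)]
    expected = sum(c * xi for c, xi in zip(colsums, x))
    if abs(sum(y) - expected) > 1e-9:             # cheap global check
        for i, r in enumerate(A):                 # localization by re-check
            if abs(y[i] - clean(r)) > 1e-9:
                y[i] = clean(r)                   # localized correction
    return y
```

The checksum costs one extra inner product regardless of matrix size, so detection overhead stays low; the expensive row-by-row re-check runs only when an error is actually flagged.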
Joseph Sloan is an Assistant Professor in the Electrical
Engineering Department at the University of Texas at
Dallas. He received a B.S. degree in electrical
engineering and a B.S. degree in computer engineering from
Iowa State University in 2007, and his M.S. and Ph.D
degrees in electrical and computer engineering from the
University of Illinois at Urbana-Champaign (UIUC) in 2011
and 2013, respectively. His research interests include
fault-tolerant computing, high performance and scientific
computing, computer architecture, and low-power
design. Joseph's research has been recognized by the Yi-Min
Wang and Pi-Yu Chung Endowed Research Award, a Best Paper in
Session Award at SRC TECHCON 2011, and a 2012 ECE/Intel
Computer Engineering Fellowship, and has been the subject of
several keynote talks, invited plenary lectures, and invited
articles. His research also forms a core component of a 2010
NSF Expeditions in Computing Award
and has been covered by media sources, including BBC News,
IEEE Spectrum, and HPCWire. When not working on his
research, Joseph and his wife enjoy running, hiking,
climbing outdoors, and spending time together at the
orchestra and theater.
|