CASA 2013 Abstracts and Bios

Mehdi Tahoori
Karlsruher Institut für Technologie

Keynote Presentation: Cross-Layer Reliability Modeling and Mitigation

As the minimum feature size continues to shrink, a host of vulnerabilities influence the resiliency of VLSI circuits, such as increased process variation, radiation-induced soft errors, and runtime variations due to voltage and temperature fluctuations, together with transistor and interconnect aging. For cost-efficient resilient system design, reliability issues must be addressed at various design steps. In this talk, I will discuss approaches to model and mitigate various reliability issues at different stages of the design cycle, from circuit to architecture, by considering the interplay of various reliability phenomena.

Mehdi Tahoori has been Professor and Chair of Dependable Nano-Computing (CDNC) at the Karlsruhe Institute of Technology (KIT), Germany, since 2009. Before that, he was an associate professor of ECE at Northeastern University, Boston, USA. He received his Ph.D. and M.S. in Electrical Engineering from Stanford University in 2003 and 2002, respectively. He has served on the organizing and technical program committees of various design automation, test, and dependability conferences, such as DATE, ICCAD, ITC, ETS, GLSVLSI, DSN, and IOLTS, and has organized various workshops, panels, tutorials, and special sessions at major design and test conferences, such as DATE, ICCAD, and VTS. He is an associate editor of the ACM Journal on Emerging Technologies in Computing Systems. He is a recipient of the National Science Foundation CAREER Award.

Giovanni Beltrame
École Polytechnique de Montréal

Accelerating Design Space Exploration for Reliability with Design Space Pruning

System-level design space exploration (DSE) is an important process for optimizing complex multi-processor embedded system architectures. During DSE, a system's configuration can be modified to improve, among other metrics, the system's expected lifetime, usually based on an estimate of the system's Mean-Time-To-Failure (MTTF). Typically, the simulation time needed to evaluate the MTTF of design points is a bottleneck for the whole DSE process, and the vast design space that must be searched therefore requires effective design space pruning techniques. We present a set of metrics to identify similarities among the architecture, mapping, and wear of a set of configurations, in order to reduce the number of MTTF evaluations needed during system-level DSE.
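
For illustration, a minimal Python sketch of similarity-based pruning follows; the similarity metric, threshold, and MTTF evaluator are hypothetical placeholders, not the metrics presented in the talk.

    # Hypothetical sketch: skip expensive MTTF simulations for configurations that
    # closely resemble an already-evaluated one, reusing the earlier estimate.
    def similarity(cfg_a, cfg_b):
        """Fraction of identical (task -> core) mapping decisions between two configurations."""
        same = sum(1 for t in cfg_a if cfg_a[t] == cfg_b.get(t))
        return same / max(len(cfg_a), 1)

    def evaluate_mttf(cfg):
        """Stand-in for an expensive lifetime simulation (e.g., a Monte Carlo wear model)."""
        return 10.0 + 0.5 * sum(len(core) for core in cfg.values())  # placeholder value

    def pruned_dse(configs, threshold=0.9):
        evaluated = []                       # (configuration, MTTF) pairs already simulated
        results = {}
        for i, cfg in enumerate(configs):
            best = max(evaluated, key=lambda e: similarity(cfg, e[0]), default=None)
            if best and similarity(cfg, best[0]) >= threshold:
                results[i] = best[1]         # pruned: reuse the earlier MTTF estimate
            else:
                mttf = evaluate_mttf(cfg)    # full lifetime simulation
                evaluated.append((cfg, mttf))
                results[i] = mttf
        return results

    configs = [{"t0": "core0", "t1": "core1"},
               {"t0": "core0", "t1": "core1"},
               {"t0": "core1", "t1": "core0"}]
    print(pruned_dse(configs))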

Giovanni Beltrame received the M.Sc. degree in electrical engineering and computer science from the University of Illinois at Chicago in 2001, the Laurea degree in computer engineering from the Politecnico di Milano, Italy, in 2002, the M.S. degree in information technology from CEFRIEL, Milan, in 2002, and the Ph.D. degree in computer engineering from the Politecnico di Milano in 2006. After his Ph.D., he worked as an engineer at the European Space Agency on a number of projects ranging from radiation-tolerant systems to computer-aided design. In 2010 he moved to Montreal, Canada, where he is currently an Assistant Professor at Polytechnique Montréal. His research interests include the modeling and design of embedded systems, artificial intelligence, and robotics.

Michael Glaß
Friedrich-Alexander-Universität, Erlangen-Nürnberg

Concurrent Consideration of Transient and Permanent Faults using Success Trees

Recently, cross-level techniques to build reliable systems from unreliable components have gained significant attention in the research community. With faults typically occurring at the lowest level (e.g., transistor) and propagating up to the highest level of abstraction (application), countermeasures can be introduced at almost all levels of abstraction. Of course, each measure comes at a certain cost-benefit ratio. To optimally apply combinations of countermeasures, reliability analysis techniques at each level of abstraction and arbitration between the levels are required. But arbitration between levels is not the only concern: an important aspect is to consider different sources of faults concurrently. In fact, the relative importance of fault sources may vary considerably, even during the system's lifetime, and this has a direct impact on which countermeasures to select and what benefit to expect from them. In this talk, success trees, a well-known technique for reliability analysis at different levels of abstraction, are exploited to perform an automatic concurrent analysis of transient and permanent faults. This makes it possible to (a) put transient and permanent faults into perspective with respect to their impact on system reliability and (b) quantify their changing impact over the system's lifetime, thereby guiding the designer to apply the right countermeasures for the targeted mission profile.
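
As background on the underlying formalism, a minimal Python sketch of success-tree evaluation follows; the tree structure, fault models, and parameters are illustrative assumptions, not the analysis presented in the talk.

    # Each basic event combines a transient-fault survival term (constant rate) with a
    # permanent-fault survival term (Weibull wear-out); AND/OR gates combine subtrees.
    import math

    def component_reliability(t, soft_error_rate, weibull_scale, weibull_shape):
        r_transient = math.exp(-soft_error_rate * t)                    # e.g., SEU-induced failures
        r_permanent = math.exp(-(t / weibull_scale) ** weibull_shape)   # e.g., wear-out
        return r_transient * r_permanent

    def AND(*children):   # series structure: all subtrees must succeed
        return lambda t: math.prod(c(t) for c in children)

    def OR(*children):    # redundancy: at least one subtree must succeed
        return lambda t: 1.0 - math.prod(1.0 - c(t) for c in children)

    core = lambda t: component_reliability(t, soft_error_rate=1e-5, weibull_scale=8e4, weibull_shape=2.0)
    bus  = lambda t: component_reliability(t, soft_error_rate=1e-6, weibull_scale=1e5, weibull_shape=1.5)

    # The system succeeds if the bus works AND at least one of two redundant cores works.
    system = AND(bus, OR(core, core))
    for hours in (1e3, 1e4, 5e4):
        print(f"R(system, {hours:.0f} h) = {system(hours):.4f}")

Evaluating the tree at several points in time is what exposes the changing balance between transient and permanent faults over the system's lifetime.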

Professor Michael Glaß holds an assistant professorship for Dependable Embedded Systems and heads the System-level Design Automation group at Hardware/Software Co-Design, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany. He received his Diploma and Doctorate degrees in computer science from FAU in 2006 and 2011, respectively. Michael is a reviewer for many international scientific journals as well as a member of several technical program committees. His research interests are dependability engineering for embedded systems and system-level design automation, with a particular focus on formal analysis and design space exploration.

Warren Gross
McGill University

Energy-efficiency through Faulty Circuits: Opportunities at the Algorithm Level

The delay for a signal to propagate through a digital circuit varies based on several factors, and the magnitude of this variation is increasing in advanced technologies. In synchronous circuits, energy efficiency is governed by the worst-case delay, but some recent work in the VLSI community has focused on bringing it closer to the typical case. For logic circuits, one approach, referred to as "voltage over-scaling" or "better than worst-case design", allows the supply voltage to be reduced past the critical point, such that some signal transitions take more than one clock period. The errors that are introduced can then be handled by additional circuits, or at the algorithm level. This creates possibilities for algorithm designers to participate in improving the energy efficiency of circuit implementations. In this talk, we will show how an algorithm that is robust to computation errors can lead to increased energy efficiency of a circuit implementation. The robustness can be achieved by modifying the algorithm to provide fault tolerance, by trading off some application performance, or both. Some examples from the literature will be provided.
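
As a toy illustration of an error-robust algorithm (not an example from the talk), the following Python sketch injects occasional computation errors, of the kind voltage over-scaling might produce, into Newton's method; the error model and rates are assumptions chosen for demonstration.

    import random

    def noisy(x, error_rate=0.05, magnitude=0.2):
        """Model a value occasionally corrupted by an over-scaled datapath."""
        if random.random() < error_rate:
            return x * (1.0 + random.uniform(-magnitude, magnitude))
        return x

    def newton_sqrt(a, iterations=40):
        """Newton's method for sqrt(a); each update may occasionally be computed erroneously."""
        x = a
        for _ in range(iterations):
            x = noisy(0.5 * (x + a / x))   # the iteration is self-correcting, so errors are absorbed
        return x

    random.seed(0)
    errors = sorted(abs(newton_sqrt(2.0) - 2.0 ** 0.5) for _ in range(1000))
    print("median error:", errors[len(errors) // 2])   # small: later iterations repair earlier errors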

Warren J. Gross received the B.A.Sc. degree in electrical engineering from the University of Waterloo, Waterloo, Ontario, Canada, in 1996, and the M.A.Sc. and Ph.D. degrees from the University of Toronto, Toronto, Ontario, Canada, in 1999 and 2003, respectively. Currently, he is an Associate Professor with the Department of Electrical and Computer Engineering, McGill University, Montréal, Québec, Canada. His research interests are in the design and implementation of signal processing systems and custom computer architectures. Dr. Gross is currently Vice-Chair of the IEEE Signal Processing Society Technical Committee on Design and Implementation of Signal Processing Systems. He serves as Associate Editor for the IEEE Transactions on Signal Processing. Dr. Gross has served as Technical Program Co-Chair of the IEEE Workshop on Signal Processing Systems (SiPS 2012) and as Chair of the IEEE ICC 2012 Workshop on Emerging Data Storage Technologies. He has served on the Program Committees of the IEEE Workshop on Signal Processing Systems, the IEEE Symposium on Field-Programmable Custom Computing Machines, the International Conference on Field-Programmable Logic and Applications and as the General Chair of the 6th Annual Analog Decoding Workshop. Dr. Gross is a Senior Member of the IEEE and a licensed Professional Engineer in the Province of Ontario.

Adam Hartman
Carnegie Mellon University

Co-optimizing System Lifetime and Time to First Failure in Embedded Chip Multiprocessors

Nearly all of the work on task mapping and other system-level methods for improving lifetime focuses on optimizing time to system failure (tsys). These methods assume the system is designed to detect and recover from one or more permanent component failures, given enough remaining resources to accommodate all required tasks. Since this design paradigm comes with increased design and verification difficulty due to the number of combinations of components that can fail, its use may not be realistic in systems with many components or where safety and security are critical. Thus, there are designs for which it is more important to maximize time to first failure (tfirst) than tsys, while other designs may require the opposite, or some co-optimization of the two.
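
A minimal Monte Carlo sketch of the two metrics follows (Python; the Weibull failure model and its parameters are illustrative assumptions): tfirst is taken as the time of the first core failure, and tsys as the time at which fewer cores survive than the workload requires.

    import random

    def simulate_lifetimes(n_cores=4, cores_required=3, trials=10000, scale=1e5, shape=2.0):
        t_first_sum = t_sys_sum = 0.0
        for _ in range(trials):
            failures = sorted(random.weibullvariate(scale, shape) for _ in range(n_cores))
            t_first_sum += failures[0]                        # first permanent component failure
            t_sys_sum += failures[n_cores - cores_required]   # system fails once < cores_required remain
        return t_first_sum / trials, t_sys_sum / trials

    mean_t_first, mean_t_sys = simulate_lifetimes()
    print(f"mean tfirst = {mean_t_first:.0f} h, mean tsys = {mean_t_sys:.0f} h")

The gap between the two averages is the space in which co-optimization of tfirst and tsys operates.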

Adam is a PhD student at Carnegie Mellon University working with Professor Don Thomas. His research interests include system-level design and optimization, lifetime-aware design techniques, and embedded systems.

Brett H. Meyer
McGill University

Workload Effects on Execution Fingerprinting for Low-cost Safety-Critical Systems

Execution fingerprinting has emerged as an alternative to n-modular redundancy for verifying redundant execution without requiring that all cores execute the same task or even execute redundant tasks concurrently. Fingerprinting takes a bit stream characterizing the execution of a task and compresses it into a single, fixed-width word, or fingerprint. In this talk, we will explore the trade-offs inherent in fingerprinting subsystem design, including: (a) determining what application data to compress, as a function of error detection probability and latency, and (b) identifying a corresponding fingerprinting circuit implementation. In this context, we present several case studies demonstrating how application characteristics inform fingerprinting subsystem design.
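
As background, a minimal Python sketch of the fingerprinting idea follows; the choice of compressed stream (retired store addresses and values) and the 32-bit CRC are assumptions for illustration, not the design presented in the talk.

    import zlib

    def fingerprint(store_stream):
        """Fold a sequence of (address, value) pairs into a single 32-bit fingerprint."""
        fp = 0
        for addr, value in store_stream:
            fp = zlib.crc32(addr.to_bytes(4, "little") + value.to_bytes(4, "little"), fp)
        return fp

    golden = [(0x1000, 7), (0x1004, 42)]
    faulty = [(0x1000, 7), (0x1004, 43)]                      # one corrupted store value

    print(fingerprint(golden) == fingerprint(list(golden)))   # redundant executions agree
    print(fingerprint(golden) == fingerprint(faulty))         # a mismatch flags the error

Only the fixed-width fingerprints need to be exchanged and compared, which is what allows redundant executions to run on different cores or at different times.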

Brett H. Meyer is a Chwang-Seto Faculty Scholar and assistant professor in the Department of Electrical and Computer Engineering at McGill University. He received his MS and PhD in Electrical and Computer Engineering from Carnegie Mellon University in 2005 and 2009, respectively. He received his BS in Electrical Engineering, Computer Science and Math from the University of Wisconsin-Madison in 2003. After receiving his PhD, Meyer worked as a post-doctoral research associate in the Computer Science Department at the University of Virginia. He has been on the faculty at McGill since 2011. Meyer's research interests are focused on the design and architecture of resilient multiprocessor computer systems.

Hiren Patel
University of Waterloo

Reliable Computing with Ultra-reduced Instruction Set Co-processors

We present a combined hardware and software approach for reliably performing computation in the presence of hard faults. The proposed hardware implements a Turing-complete instruction called SUBLEQ in a co-processor that can, in theory, mimic the semantics of any other instruction. The software extends LLVM's back-end with the ability to replace a subset of the MIPS instructions with SUBLEQ instructions. A MIPS instruction rendered faulty by a hard fault can then be replaced with a sequence of SUBLEQ instructions.
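
A tiny Python sketch of the underlying idea follows: SUBLEQ (subtract and branch if less than or equal to zero) can emulate other instructions, here an ADD. The memory layout and interpreter are assumptions for demonstration, not the actual co-processor or code generator.

    def run_subleq(mem, program):
        """Each instruction (a, b, c): mem[b] -= mem[a]; branch to c if the result is <= 0."""
        pc = 0
        while pc < len(program):
            a, b, c = program[pc]
            mem[b] -= mem[a]
            pc = c if mem[b] <= 0 else pc + 1
        return mem

    # Data memory: mem[0] = A, mem[1] = B, mem[2] = Z (scratch location holding 0).
    mem = [5, 7, 0]

    # B = B + A, i.e. an ADD replaced by three SUBLEQs:
    add_program = [
        (0, 2, 1),   # Z -= A   -> Z = -A (always <= 0, so the branch simply falls through)
        (2, 1, 2),   # B -= Z   -> B = B + A (branch target is the next instruction either way)
        (2, 2, 3),   # Z -= Z   -> clear Z; branching to 3 ends the program
    ]
    print(run_subleq(mem, add_program)[1])   # prints 12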

Hiren Patel is an assistant professor in the Electrical and Computer Engineering department at the University of Waterloo, Canada. His research interests are in system-level design methodologies, computer architecture and real-time embedded systems.

Mihai Pricopi
National University of Singapore

Thermal Reliability Aware Scheduling and Power Management for Heterogeneous Multi-cores

Moore's Law enables a continued increase in the number of cores on a chip, but the failure of Dennard scaling is ushering in the dark silicon era. For reliable operation, a significant fraction of the cores has to be left un-powered, or dark, at any point in time to meet the thermal design power (TDP) constraint. This phenomenon is driving the emergence of asymmetric multi-cores that integrate cores with diverse power-performance characteristics. We present a comprehensive power management framework for asymmetric multi-cores, in the context of mobile embedded platforms, that can provide a satisfactory user experience while minimizing energy consumption within the TDP budget. Our framework includes two key components: a power-performance estimation technique that works across different core types, and a formal hierarchical control-theoretic approach to orchestrate the various power management knobs towards meeting the objectives. Results show that the framework can efficiently exploit the asymmetry of the system by improving performance while maintaining low power consumption. Under the TDP constraint, the system keeps the chip power below the TDP through DVFS and, if necessary, graceful degradation of the quality-of-service (QoS) of the tasks.
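
A highly simplified Python sketch of the control decision follows; the operating points, power numbers, and QoS model are illustrative assumptions, not the actual framework.

    TDP_WATTS = 4.0

    # Illustrative (label, performance in frames/s, power in W) operating points.
    OPERATING_POINTS = [
        ("big @ 1.8 GHz",    60.0, 3.5),
        ("big @ 1.2 GHz",    42.0, 2.2),
        ("LITTLE @ 1.0 GHz", 30.0, 1.1),
        ("LITTLE @ 0.6 GHz", 18.0, 0.6),
    ]

    def choose_operating_point(qos_target_fps, other_power):
        budget = TDP_WATTS - other_power
        # Prefer the cheapest point that meets the QoS target within the power budget.
        feasible = [p for p in OPERATING_POINTS if p[1] >= qos_target_fps and p[2] <= budget]
        if feasible:
            return min(feasible, key=lambda p: p[2]), qos_target_fps
        # Otherwise degrade QoS gracefully: best performance achievable within the budget.
        affordable = [p for p in OPERATING_POINTS if p[2] <= budget]
        if not affordable:
            return None, 0.0                  # suspend the task to stay under the TDP
        best = max(affordable, key=lambda p: p[1])
        return best, min(qos_target_fps, best[1])

    point, granted_qos = choose_operating_point(qos_target_fps=30.0, other_power=3.0)
    print(point, "granted QoS:", granted_qos)   # falls back to a low-power point with reduced QoS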

Mihai is a final-year Ph.D. student working on dynamic heterogeneous computer architectures, asymmetric architectures, embedded systems, and energy efficiency in the dark silicon era. His major work is focused on a heterogeneous processor architecture that allows multi-cores to adapt dynamically by forming more complex cores. He is also working on scheduling techniques for asymmetric and adaptive architectures. Mihai joined the National University of Singapore in 2009 after receiving his Master's degree in Computer Engineering. He obtained his Bachelor's degree in Computer Engineering from the Faculty of Automatic Control and Computer Engineering in Iasi, Romania.

Aviral Shrivastava
Arizona State University

Compiler-Microarchitecture Cooperation for Resilience Against Soft Errors

The focus of our research is to protect computing systems from the onslaught of rapidly increasing soft errors. It has become abundantly clear that the effects of soft errors cannot be contained and mitigated by techniques at one level alone, be it fabrication, microarchitecture, or the system level. Rather, protection against soft errors has to be cross-layer, distributed through multiple levels of the design hierarchy. The main challenge, however, is knowing whether the protections at the different levels are complementary rather than overlapping. In this talk, I will present the motivation for cross-layer schemes, particularly hybrid compiler-microarchitecture techniques, and then give a sample of how the microarchitecture, compiler, and system can collaborate to provide effective, yet cost-efficient, soft error protection schemes.
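
As generic background on the compiler-level ingredient of such schemes (in the spirit of well-known instruction-duplication approaches, not the specific technique of this talk), a small Python sketch follows.

    class SoftErrorDetected(Exception):
        pass

    def duplicated(f):
        """Execute a computation twice and compare the results before they are 'committed'."""
        def checked(*args):
            primary = f(*args)
            shadow = f(*args)          # a real scheme keeps the shadow values in separate registers
            if primary != shadow:
                raise SoftErrorDetected("mismatch before store/branch; trigger recovery")
            return primary
        return checked

    @duplicated
    def saxpy_element(a, x, y):
        return a * x + y               # protected computation

    print(saxpy_element(2, 3, 4))      # 10; a transient flip in one copy would raise a mismatch

In a hybrid scheme, the microarchitecture can cover what software duplication protects only at high cost, which is where the question of complementary, non-overlapping protection arises.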

Aviral Shrivastava is an Associate Professor in the School of Computing, Informatics, and Decision Systems Engineering at Arizona State University, where he established and heads the Compiler and Microarchitecture Labs (CML). He received his Ph.D. and Master's in Information and Computer Science from the University of California, Irvine, and his Bachelor's in Computer Science and Engineering from the Indian Institute of Technology, Delhi. He is the recipient of a 2011 NSF CAREER Award and the 2012 Outstanding Junior Researcher award of the School of Computing, Informatics, and Decision Systems Engineering at ASU. His research focuses on three directions: 1) manycore architectures and compilers, 2) programmable accelerators and compilers, and 3) quantitative resilience. His research is funded by the DOE, the NSF, and several companies, including Intel, Nvidia, Microsoft, Raytheon Missile Systems, and Samsung. He serves on the organizing and program committees of several premier embedded systems conferences, including ISLPED, CODES+ISSS, CASES, and LCTES, and on NSF and DOE review panels.

Joseph Sloan
University of Texas at Dallas

Algorithmic Approaches to Enhancing and Exploiting Application-Level Error Tolerance

As late-CMOS process scaling leads to increasingly variable circuits and logic, and as most post-CMOS technologies in sight appear to have largely stochastic characteristics, hardware reliability has become a first-order design concern. To make matters worse, emerging computing systems are becoming increasingly power constrained. Traditional hardware/software approaches are likely to be impractical for these power-constrained systems due to their heavy reliance on redundant, worst-case, and conservative designs. The primary goal of this research has been to investigate how we can leverage inherent application and algorithm characteristics (e.g., natural error resilience, spatial and temporal reuse, and fault containment) to build more efficient robust systems. In this talk, I will describe algorithmic approaches that leverage application and algorithm awareness for building such systems. These approaches include (a) application-specific techniques for low-overhead fault detection, (b) an algorithmic approach to error correction using localization, and (c) a numerical optimization-based methodology for converting applications into a more error-tolerant form. This research shows that application and algorithm awareness can significantly increase the robustness of computing systems, while also reducing the cost of meeting reliability targets.
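
As background on this style of technique, a minimal Python sketch of algorithm-based fault tolerance (ABFT) for matrix multiplication follows, a classic example of application-specific detection with error localization; it is illustrative background, not the specific method presented in the talk.

    def matmul(A, B):
        n, m, p = len(A), len(B), len(B[0])
        return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)] for i in range(n)]

    def locate_single_error(A, B, C):
        """Return (row, col) of a single corrupted element of C, or None if checksums agree."""
        n, p = len(C), len(C[0])
        # Row/column checksums of the correct product follow from checksums of A and B.
        row_ref = matmul(A, [[sum(B[k][j] for j in range(p))] for k in range(len(B))])
        col_ref = matmul([[sum(A[i][k] for i in range(n)) for k in range(len(B))]], B)
        bad_row = next((i for i in range(n) if sum(C[i]) != row_ref[i][0]), None)
        bad_col = next((j for j in range(p) if sum(C[i][j] for i in range(n)) != col_ref[0][j]), None)
        if bad_row is None and bad_col is None:
            return None
        return bad_row, bad_col

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = matmul(A, B)
    C[1][0] += 9                          # inject a single fault into the result
    print(locate_single_error(A, B, C))   # (1, 0): the mismatching row and column checksums intersect

Once the faulty element is localized, the row-checksum difference also gives the correction, so detection and repair cost far less than full re-execution.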

Joseph Sloan is an Assistant Professor in the Electrical Engineering Department at the University of Texas at Dallas. He received a B.S. degree in electrical engineering and a B.S. degree in computer engineering from Iowa State University in 2007, and his M.S. and Ph.D. degrees in electrical and computer engineering from the University of Illinois at Urbana-Champaign (UIUC) in 2011 and 2013, respectively. His research interests include fault-tolerant computing, high-performance and scientific computing, computer architecture, and low-power design. Joseph's research has been recognized by the Yi-Min Wang and Pi-Yu Chung Endowed Research Award, a Best Paper in Session Award at SRC TECHCON 2011, and a 2012 ECE/Intel Computer Engineering Fellowship, and has been the subject of several keynote talks, invited plenary lectures, and invited articles. His research also forms a core component of a 2010 NSF Expeditions in Computing Award and has been covered by media sources including BBC News, IEEE Spectrum, and HPCWire. When not working on his research, Joseph and his wife enjoy running, hiking, climbing outdoors, and spending time together at the orchestra and theater.