TY - JOUR
T1 - NT-SwiFT
T2 - Software implemented fault tolerance on Windows NT
AU - Liang, Deron
AU - Chung, Pi-Yu Emerald
AU - Huang, Yennun
AU - Kintala, Chandra
AU - Lee, Woei Jyh
AU - Tsai, Timothy K.
AU - Wang, Chung Yih
N1 - Funding Information:
The authors would also like to thank Gaurav Suri and Yi-Min Wang, who implemented the first prototype of watchd and libft in NT-SwiFT, and are grateful to Dave Korn for his help in using UWIN and for his comments on this paper. Deron Liang received a BS degree in electrical engineering from National Taiwan University in 1983, and an MS and a Ph.D. in computer science from the University of Maryland at College Park in 1991 and 1992, respectively. He has been a faculty member of the Computer Science Department, National Taiwan Ocean University, Taiwan, since 2001. He also holds a joint appointment with the Institute of Information Science (IIS), Academia Sinica, Taipei, Taiwan, Republic of China. He was with IIS from 1993 to 2001. His current research interests are in the areas of software fault tolerance, system security, and system reliability analysis. Dr. Liang is a member of ACM and IEEE. Pi-Yu Emerald Chung is a member of Core Engineering at Siebel Systems. She designs and develops data synchronization services for handheld and mobile devices. Her research interests include mobile network architecture, security, fault-tolerant computing and distributed transaction processing. She received a BS in electrical engineering from National Taiwan University and an MS and a Ph.D. in electrical engineering from the University of Illinois at Urbana-Champaign. Chandra M.R. Kintala is Vice President of the Research Realization Center at Avaya Labs in Basking Ridge, NJ. He received a Ph.D. in Computer Science. He holds five patents, has over 40 research publications in software fault tolerance, programming environments and theoretical computer science, and received a Smithsonian medal sponsored by ComputerWorld for his pioneering contributions to software-implemented fault tolerance (SwiFT). Previously, Chandra was Head of Distributed Software Research at Bell Labs in Murray Hill, NJ. He was also an Adjunct Professor of Computer Science at Stevens Institute of Technology.
He edited two special issues of Bell Labs Technical Journal on Software. Yennun Huang is currently the VP of Engineering for PreCache Inc., in charge of the company's R&D. Before he joined PreCache in 2001, he was the Division Manager of the Dependable Distributed Computing Research Department at AT&T Labs. He received his B.S. degree from National Taiwan University in 1982, and an M.S. degree and a Ph.D. from the University of Maryland at College Park in 1986 and 1989, respectively. He has worked on many software tools and techniques to improve the reliability and availability of software systems. He has also helped many projects with reliable system design and implementation. He worked for Bell Labs from 1989 to 1999. He was the only recipient of the Lucent Commemorating Stock Certificate, awarded by Bell Labs Research in 1996 in recognition of his work in inventing and pioneering the development of software fault tolerance. His SwiFT work won the Computerworld Smithsonian Awards in 1998. He has 11 US patents and more than 30 publications. He has served on many program committees, including FTCS, DSN, ICDCS, SRDS, and WWW. His main research interests include fault-tolerant computing, distributed event processing, distributed object technologies, performance evaluation, distributed systems, and cluster computing. Woei-Jyh Lee received his BS degree from the Department of Computer Science and Information Engineering at National Taiwan University in 1993, and his M.S. degree from the Department of Computer Science at New York University in 1998. He worked at Bell Laboratories Research, Lucent Technologies, from 1998 to 2000. He is currently pursuing a Ph.D. degree in the Department of Computer Science at the University of Maryland at College Park. His research interests include systems simulation and performance evaluation, network policies and management, distributed systems, and Internet protocols. Timothy K. Tsai is a member of the technical staff at Avaya Labs, Avaya Inc.
His research interests include computer security, fault-tolerant system design and validation, software engineering, and distributed systems. He received a B.S. in electrical engineering from Brigham Young University and an M.S. and a Ph.D. in electrical engineering from the University of Illinois at Urbana-Champaign. He is a member of IEEE and the IEEE Computer Society. Chung-Yih Wang is currently the Senior Research Engineer for PreCache Inc., in charge of distributed information service. He received his B.S. degree from National Taiwan University in 1993 and an M.S. degree from National Tsing Hua University in 1995. His research interests include distributed systems, software reliability, and data networking.
PY - 2004/4
Y1 - 2004/4
N2 - Today, there are increasing demands to make application software more tolerant of failures. Fault-tolerant applications detect and recover from failures that are not handled by the application's underlying hardware or operating system. In recent years, an increasing number of highly available applications have been implemented on Windows NT. However, the current versions of Windows (NT 4.0, 2000) and their utilities, such as Microsoft Cluster Server (MSCS), do not provide some facilities (such as transparent checkpointing and message logging) that are needed to implement fault-tolerant applications. In this paper, we describe a set of reusable software components, collectively named NT-SwiFT (software implemented fault tolerance), that facilitates building fault-tolerant and highly available applications on Windows NT and 2000. NT-SwiFT provides components for automatic error detection and recovery, checkpointing, event logging and replay, communication error recovery, and incremental data replication. Using NT-SwiFT, we conducted fault injection experiments on three commercial server applications - Apache web server, Microsoft IIS web server, and Microsoft SQL - to study the failure coverage and the overhead of NT-SwiFT components. Preliminary results show that NT-SwiFT can detect and recover from more application failures than MSCS does in all three applications.
AB - Today, there are increasing demands to make application software more tolerant of failures. Fault-tolerant applications detect and recover from failures that are not handled by the application's underlying hardware or operating system. In recent years, an increasing number of highly available applications have been implemented on Windows NT. However, the current versions of Windows (NT 4.0, 2000) and their utilities, such as Microsoft Cluster Server (MSCS), do not provide some facilities (such as transparent checkpointing and message logging) that are needed to implement fault-tolerant applications. In this paper, we describe a set of reusable software components, collectively named NT-SwiFT (software implemented fault tolerance), that facilitates building fault-tolerant and highly available applications on Windows NT and 2000. NT-SwiFT provides components for automatic error detection and recovery, checkpointing, event logging and replay, communication error recovery, and incremental data replication. Using NT-SwiFT, we conducted fault injection experiments on three commercial server applications - Apache web server, Microsoft IIS web server, and Microsoft SQL - to study the failure coverage and the overhead of NT-SwiFT components. Preliminary results show that NT-SwiFT can detect and recover from more application failures than MSCS does in all three applications.
KW - Automatic error detection and recovery
KW - Checkpointing
KW - Event logging and replay
KW - Communication error recovery
KW - Incremental data replication
KW - Microsoft Cluster Server
KW - Software implemented fault tolerance
KW - Windows NT
UR - http://www.scopus.com/inward/record.url?scp=1242342966&partnerID=8YFLogxK
U2 - 10.1016/S0164-1212(02)00154-1
DO - 10.1016/S0164-1212(02)00154-1
M3 - Journal article
AN - SCOPUS:1242342966
SN - 0164-1212
VL - 71
SP - 127
EP - 141
JO - Journal of Systems and Software
JF - Journal of Systems and Software
IS - 1-2
ER -