[Colloquium] Service Reliability and Speed in Distributed Computing Systems with Stochastic Node Failures and Communication Delays

Majeed M. Hayat, Professor, Electrical and Computer Engineering, University of New Mexico

Abstract: The ability to model and optimize reliability and task-execution speed is central to designing survivable distributed computing systems (DCSs) in which servers are prone to fail, possibly permanently and in a spatially correlated manner. Correlated component failures in networks have received attention from government agencies in recent years because of their association with damage from weapons of mass destruction. In this talk we discuss the problem of modeling the service reliability and task-execution speed of a DCS in uncertain topologies, as well as the problem of load balancing in such environments. Service reliability and the mean task-execution time are analytically characterized by means of a novel regeneration-based probabilistic technique. The analysis takes into account the stochastic failure times of servers, the heterogeneity and uncertainty in service times and communication delays, and arbitrary task-reallocation policies. Two models are presented: the first assumes Markovian (exponentially distributed) random communication and service times, and the second relaxes this assumption. The theory is used to optimize certain load-balancing policies for maximal service reliability or minimal task-execution time; the optimization is carried out by an algorithm that scales linearly with the number of nodes in the system. The analytical model is validated using both Monte Carlo simulations and experiments.

Bio: Majeed M. Hayat was born in Kuwait in 1963. He received the B.S. degree (summa cum laude) in electrical engineering from the University of the Pacific, Stockton, CA, in 1985, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Wisconsin-Madison, Madison, in 1988 and 1992, respectively. He is currently a Professor of Electrical and Computer Engineering and a member of the Center for High Technology Materials at the University of New Mexico, Albuquerque. His research contributions cover a broad range of topics in signal/image processing and applied probability. His current areas of interest include image processing and noise reduction in thermal images, algorithms for infrared spectral sensing and recognition, queuing models and strategies for resilient distributed systems and networks, modeling of noise and stochastic carrier dynamics in avalanche photodiodes, and performance characterization of optical receivers and photon counters.