Acknowledgments

First of all, I would like to express my sincere gratitude to all the people who carried out this work together with me, participating intensively throughout these four years of adventure. They were four years of intensive learning, with a mix of excellent experiences and difficulties, but certainly a wonderful period of my life.

My gratitude goes to the directors of LAAS-CNRS, Jean-Claude Laprie and Malik Ghallab, and also to David Powell and Jean Arlat, the successive heads of the Dependable Computing and Fault Tolerance research group (TSF), for their support, which allowed me to work in the best conditions.

Very special thanks go to the two people chiefly responsible for this adventure: Mohamed Kaâniche, for being an advisor always ready to give all his energy, with inestimable determination and a remarkable methodology; and Karama Kanoun, for leading our discussions with her strategic vision, constant help and guidance in the research. I must say that my many intrusions into their office were always met with availability and close attention. Moreover, their remarks and the way they both conducted this work were not only fundamental to this thesis but also of special value to my own training.

My sincere thanks go to the people of the TSF group, with whom I had the pleasure of interacting and collaborating, not only on technical aspects but especially on a personal level. I think the most important lesson I learned is how to do scientific research.

Special thanks go to my friends. For me, you are like brothers and sisters and an incredible source of motivation. Finally, I thank my family for their love and support during these years of study: my parents, Darci and Silvia, two of the most wonderful people in the world; Tiago and Diego, who are fantastic brothers; and my wife Robertinha, who is my inspiration and the most special person in the world. I simply thank you from the bottom of my heart for loving me.

Contents

    Sensitivity to MTTD
    Sensitivity to service rate
    Impact of traffic model
    Sensitivity to traffic burstiness
    Load effects on UA
    Conclusion
Service unavailability due to long response time
    Availability measure definition
    Single server queueing systems
    Modeling unavailability due to long response time
    Conditional response time distribution
    Service availability modeling
    Sensitivity analysis
    Variation of response time
    Effects of K and … on UA
    Finite buffer effects on UA
    Approximation for UA
    Multi-server queueing systems
    Modeling unavailability due to long response time
    Conditional response time distribution
    Service availability modeling
    Sensitivity analysis
    Variation of response time distribution
    Load effects on UA
    Impact of aggregated service rate on UA
    Impact of the number of servers on UA
    Conclusion
Conclusion
Appendix I
Appendix II
Bibliography

List of Figures

3.3 A web cluster with servers available and load balancing
Markov Modulated Poisson Process modeling the request arrival process
Impact of MTTF on UA
Impact of failure detection duration on UA for both recovery strategies
Impact of service rate on UA
MMPP traffic models representing the traffic distribution along the day
Effects of the traffic burstiness on UA for both recovery strategies
Effects of service load on UA
P(R …) variation for single server queueing system
UA for an M/M/1 queue system model as a function of K
UA for an M/M/1 queue system model as a function of …
The effect of finite buffer size on UA
UA as a function of … using equation (4.8) and equation (4.11)
P(R …) variation for multi-server queueing systems

List of Tables

2.1 Examples of functions provided by e-business web sites
TA user scenarios with associated probabilities
Profile of user class A
Profile of user class B
User scenario probabilities (in %)
Scenario categories for user classes A and B
Mapping between functions and services
External service availability
Application and database service availability
Web service availability
Function level availabilities
Numerical values of the model parameters
User availabilities for classes A and B
Closed-form equations for NCT recovery strategy
Closed-form equations for CT recovery strategy
Numerical values of the model parameters
The MMPP models and traffic burstiness
Closed-form equations for single server queueing systems
Effects on UA as … increases for … 0
Closed-form equations for multi-server queueing systems
Configurations for an aggregated service rate of 10 requests/sec
UA for an aggregated service rate of 10 requests/sec
UA for an aggregated service rate of …0 requests/sec
UA in days:hours:minutes per year for … 0


Introduction

Deal with the faults of others as gently as with your own.
Chinese proverb

Over the past years, the Internet has become a huge infrastructure used daily by millions of people around the world. The World Wide Web (WWW or web) is a publishing medium used to disseminate information quickly through this infrastructure. The web has grown rapidly in size and usage, with an extensive development of web sites delivering a large variety of personal, commercial and educational material. Virtual stores on the web allow users to buy books, CDs, computers, and many other products and services. New web applications such as e-commerce, digital libraries, video on demand and distance learning make the issue of dependability evaluation increasingly important, in particular with respect to the service perceived by web users.

Businesses and individuals increasingly depend on web-based services for all sorts of operations. Web-based services¹ connect departments within organizations, multiple companies and the population in general. In addition, the web is often used for critical applications such as online banking, stock trading and booking systems, which require high availability and performance of the service. In those applications, a temporary service unavailability may have unacceptable consequences in terms of financial losses. A period of service unavailability may cost the site millions, depending on the duration and on the importance of the period². It may be very difficult to access a major newspaper or TV site after some important news due to site overload. Those periods are usually the most important to the web service provider because unplanned unavailability is more expensive than planned unavailability [Brewer 2001]. For instance, the eBay web site was unavailable for 22 hours in June 1999, leading to a loss of approximately 5 billion dollars in market capitalization.

Dependability analysis and evaluation methods are useful to understand, analyze, design and operate web-based applications and infrastructures. Quantitative measures need to be evaluated in order to estimate the quality of service and the reliance that can be placed on the provided service. This evaluation may help web site designers to identify the weak parts of the architecture, which can then be improved to enhance the provided service. Dependability evaluation consists in estimating quantitative measures that make it possible to analyze how hardware, software or human-related failures affect the service dependability. It can be carried out by dependability modeling, with the goal of analyzing the various design alternatives in order to choose the final solution that best satisfies the requirements.

Evaluation can be carried out using two complementary approaches: i) measurement and ii) modeling. Measurements provide information for characterizing the current and past behavior of already existing systems. Recently, much research effort has been devoted to the analysis of service availability using measurements based on the monitoring of operational web sites. On the other hand, it is fundamental to anticipate the future behavior of the infrastructure supporting the service. Modeling is useful to guide the development of a web application during its design phase by providing quantitative measures characterizing its dependability. In the context of web-based services, modeling has been mainly used for performance evaluation purposes. However, less attention has been devoted to the dependability modeling and evaluation of web-based services and applications, especially from the user-perceived perspective.

¹ In this thesis, web services refer to the services delivered by a web application. The term does not refer to the standards and protocols proposed by W3C and OASIS (both responsible for the architecture and standardization of web services).
² According to [Patterson et al. 2002], well-managed servers today achieve an availability of 99.9% to 99%, or equivalently between 8 and 80 hours of downtime per year. Each hour can be costly, e.g., $200,000 per hour for Amazon.
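The availability figures cited from [Patterson et al. 2002] can be checked with a few lines of arithmetic. The sketch below converts a steady-state availability into expected downtime per year and, under the simplifying assumption of a constant cost per hour of outage, into a yearly cost. The function names and the 8760-hour year are illustrative choices, not part of the thesis.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year (assumption)


def downtime_hours_per_year(availability: float) -> float:
    """Return the expected downtime in hours per year for a steady-state availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * HOURS_PER_YEAR


def yearly_outage_cost(availability: float, cost_per_hour: float) -> float:
    """Return the yearly downtime cost, assuming a constant cost per hour of outage."""
    return downtime_hours_per_year(availability) * cost_per_hour


if __name__ == "__main__":
    # 99.9% and 99% are the two availability levels quoted in the footnote.
    for availability in (0.999, 0.99):
        hours = downtime_hours_per_year(availability)
        cost = yearly_outage_cost(availability, cost_per_hour=200_000)
        print(f"A = {availability}: {hours:.2f} h/year, about ${cost:,.0f}/year")
```

With these assumptions, 99.9% and 99% availability correspond to about 8.76 and 87.6 hours of downtime per year, which matches the order of magnitude of the "8 to 80 hours" quoted in the footnote.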
The user-perceived availability of web-based services is affected by a variety of factors (e.g., user behaviors and workload characteristics, fault tolerance and recovery strategies). Due to the complexity of web-based services and the difficulty of combining various types of information, a systematic and pragmatic modeling approach is needed to support the construction and processing of dependability models. The contributions presented in this thesis aim at fulfilling these objectives by introducing a pragmatic approach for analyzing the availability of such services from the user point of view. The traditional notion of availability is extended, relying on a performability modeling approach, in order to include some of the main causes of service unavailability. Our modeling approach is based on a combination of Markov reward models and queueing theory, for which we investigate the closed-form equations that can be obtained. Indeed, analysis from a pure performance viewpoint tends to be optimistic because it ignores the failure/repair behavior of the system. On the other hand, pure dependability analysis that does not include the performance levels of service delivery tends to be too conservative. Therefore, a performability-based approach is well suited to capture the various degraded states, measuring not only whether the service is up or down but also the operational degraded states. The causes of service unavailability considered explicitly in our modeling-based approach fall into the following categories: i) hardware and software failures affecting the servers; and ii) performance-related failures, including overload, loss of requests, and long response time.

Thesis outline

The core of this thesis deals with web service availability modeling and evaluation and is structured in four chapters. Figure 1 presents the interdependence of chapters.

Figure 1: Interdependence of chapters (Introduction; Chapter 1, Context and background; Chapter 2, Availability modeling framework; Chapter 3, Web service availability: impact of recovery strategies and traffic models; Chapter 4, Service unavailability due to long response times; Appendix I, Proofs and implementation; Appendix II, Proofs and implementation; Conclusion)
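The performability idea described before the outline, that is, combining failure/repair behavior with performance-related losses, can be illustrated with a deliberately small sketch. It is not the model developed in the thesis. It assumes a single server that alternates between an up and a down state (two-state Markov model with mean time to failure `mttf` and mean time to repair `mttr`), and that behaves as an M/M/1/K queue while up, so that requests arriving to a full system are lost. It further assumes that failures and repairs are much slower than request handling, so the two models can be composed by weighting. All function and parameter names are hypothetical.

```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """Steady-state probability that a two-state (up/down) Markov model is up."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


def mm1k_loss_probability(arrival_rate: float, service_rate: float, capacity: int) -> float:
    """Probability that an arriving request finds an M/M/1/K system full.

    `capacity` is K, the maximum number of requests in the system,
    including the one in service.
    """
    if arrival_rate < 0 or service_rate <= 0 or capacity < 1:
        raise ValueError("invalid queue parameters")
    rho = arrival_rate / service_rate
    if abs(rho - 1.0) < 1e-12:
        # Limit of the general formula when the load equals 1.
        return 1.0 / (capacity + 1)
    return (1.0 - rho) * rho**capacity / (1.0 - rho ** (capacity + 1))


def user_perceived_unavailability(
    mttf: float, mttr: float, arrival_rate: float, service_rate: float, capacity: int
) -> float:
    """Fraction of requests not served: all requests while down, overflow losses while up."""
    availability = steady_state_availability(mttf, mttr)
    served_when_up = 1.0 - mm1k_loss_probability(arrival_rate, service_rate, capacity)
    return 1.0 - availability * served_when_up


if __name__ == "__main__":
    ua = user_perceived_unavailability(
        mttf=999.0, mttr=1.0, arrival_rate=50.0, service_rate=100.0, capacity=10
    )
    print(f"user-perceived unavailability: {ua:.6f}")
```

The result is never smaller than the classical unavailability `1 - mttf / (mttf + mttr)`, which reflects the point made above: a pure up/down view ignores requests lost while the server is nominally up.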

Chapter 1 states the motivation and the context of the work. It briefly presents the theory and techniques for dependability modeling, providing a background for our investigation. We provide a discussion on dependability modeling, starting with the main concepts of dependability. A brief overview of dependability evaluation is presented, with some existing methods useful to build and to solve models. Some approaches in the probabilistic evaluation domain are reported. The related work is also reviewed, presenting prior studies and contributions on web availability evaluation, including measurement and modeling approaches.

The general problem addressed in this thesis is introduced in Chapter 2. This chapter presents the proposed framework using a web-based travel agency as an example, illustrating the main concepts and the feasibility of the framework. The framework is based on the decomposition of the web-based application following a hierarchical description, structured into four levels of abstraction. The highest level describes the dependability of the web application as perceived by the users. Intermediate levels describe the dependability of the functions and services provided to the users. The lowest level describes the dependability of the component systems on which functions and services are implemented. Sensitivity analyses are presented to show the impact of the users' operational profile, the fault coverage and the travel agency architecture on user-perceived unavailability.

Chapter 3 provides a modeling-based approach to the availability of web services supported by web cluster architectures. We are particularly interested in fault-tolerant web architectures. Our interest is justified by the fact that web cluster architectures are the leading architectures for building popular web sites. Web designers need to find an adequate sizing of these architectures to ensure high availability and performance for the delivered services.
Moreover, it is crucial to study the impact of the recovery strategies supported by these architectures on web service availability. Thus, we focus in particular on recovery strategy issues and on the effects of traffic burstiness on web service availability. Web cluster architectures are studied taking into account the number of nodes, the recovery strategy applied after a node failure and the reliability of the nodes. Various causes of request loss are considered explicitly in the web service availability measure: losses due to i) buffer overflow, ii) node failures, or iii) requests arriving during recovery time. Closed-form equations for the request loss probability are derived for both recovery strategies. Two simple traffic models (Markov Modulated Poisson Process (MMPP) and Poisson) are used to analyze the impact of traffic burstiness on web service availability.

From the user perspective, the service is perceived as degraded or even unavailable if the response time is too long compared to what the users expect. The long response time certainly has an impact on the overall service availability. To our knowledge, however, there has been no quantitative evaluation of the effects of long response times on service availability, especially from the web user perspective. Chapter 4 introduces a flexible analytic modeling approach for computing service unavailability due to long response times. The proposed approach relies on Markov reward models and queueing theory. We introduce a mathematical abstraction that is general enough to characterize the unavailability behavior due to long response times. The computation of the service unavailability measure is based on the evaluation of the response time distribution. Closed-form equations are derived for the conditional response time distribution and for the service unavailability due to long response time, considering single and multi-server queueing systems.

The developed models are implemented using tools such as GNU Octave and Maple. Since we specifically focus on small models, aiming to obtain closed-form equations as far as possible, these tools are sufficient for evaluating the obtained equations and for supporting the sensitivity analyses. Appendices I and II present the proofs of the obtained equations as well as the implementation of the models of Chapters 3 and 4.
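As a minimal illustration of unavailability due to long response time, the sketch below uses the textbook result for a stable first-come-first-served M/M/1 queue with infinite buffer: the response time R is exponentially distributed with rate (service rate minus arrival rate), so the fraction of requests whose response time exceeds a threshold is exp(-(service_rate - arrival_rate) * threshold). This is a simplification relative to the finite-buffer and multi-server closed-form equations described for Chapter 4, which are not reproduced here. The function name and the threshold parameter are hypothetical.

```python
import math


def mm1_response_time_tail(arrival_rate: float, service_rate: float, threshold: float) -> float:
    """Return P(R > threshold) for a stable FCFS M/M/1 queue with infinite buffer.

    Rates are in requests per second and the threshold is in seconds.
    """
    if arrival_rate < 0 or service_rate <= 0 or threshold < 0:
        raise ValueError("rates and threshold must be non-negative, service_rate positive")
    if arrival_rate >= service_rate:
        raise ValueError("the queue is unstable when arrival_rate >= service_rate")
    return math.exp(-(service_rate - arrival_rate) * threshold)


if __name__ == "__main__":
    # Sensitivity to load: the fraction of late requests grows as the load increases.
    service_rate = 10.0  # requests per second (illustrative value)
    threshold = 1.0      # users are assumed to tolerate at most 1 second
    for load in (0.5, 0.8, 0.9, 0.95):
        tail = mm1_response_time_tail(load * service_rate, service_rate, threshold)
        print(f"load = {load:.2f}: P(R > {threshold} s) = {tail:.4f}")
```

Treating the fraction of late requests as service unavailability is one possible reading of the measure; the thesis defines its own measure in Chapter 4.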


Chapter 1
Context and background

All models are wrong. Some models are useful.
George E. P. Box

This chapter presents the motivation and the context of our work. We provide a discussion on dependability evaluation, starting with the main concepts of dependability. A brief overview of dependability evaluation is presented through some existing methods useful to build and to solve models. A non-exhaustive selection of approaches used in the probabilistic evaluation domain is reported, illustrating the considerable advances that have extended the capabilities of analytic models. We introduce the modeling process and its phases, presenting some of the main problems and methods associated with each phase. After that, two major problems related to model construction and processing are discussed: largeness and stiffness. We report some approaches useful to build large models. Finally, we review the related studies that form the basis for our investigation on web availability evaluation, including measurement and modeling approaches.

1.1 Context and motivation

The web is an evolving system incorporating new components and services at a very fast rate. A large number of new applications such as e-commerce, digital libraries, video on demand and distance learning have migrated to the web. Many web site projects are built in three or four months because they need to beat competitors and quickly establish a web presence. This requirement of becoming visible online often comes without careful design and testing, leading to dependability and performance problems.

Recently, many high-tech companies providing services on the web have experienced operational failures. Financial web services experienced intermittent outages as the volume of visitors increased. A report presented in [Meehan 2000] showed that online brokerage companies were concerned with system outages and with the inability to accommodate growing numbers of online investors. During those outages, users and investors could not access real-time quotes. Such operational problems have resulted, in many cases, in degraded performance, unsatisfied users and heavy financial losses.

Quantitative methods are needed to understand, analyze, design and operate such large infrastructures. Quantitative measures need to be evaluated in order to estimate the quality of service and the reliance that can be placed on the provided service. This evaluation may help system designers to identify the weak parts of the system that should be improved to support an acceptable dependability level for the provided service. Dependability evaluation consists in estimating quantitative measures that make it possible to analyze how hardware, software or human-related failures affect the system dependability. It can be carried out by dependability modeling, with the goal of analyzing the various design alternatives in order to choose the final solution that best satisfies the requirements.

The rest of this chapter is structured as follows. Section 1.2 presents the main concepts of dependability. Section 1.3 outlines the main approaches that can be used for dependability evaluation. Section 1.4 reports some formalisms and tools used in the analytic modeling process. Section 1.5 describes some of the main problems related to large models and the existing techniques to deal with them. Section 1.6 presents the related work on the evaluation of web-based services.
Finally, Section 1.7 summarizes the chapter.

1.2 Dependability concepts

Dependability is defined as the trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers [Laprie 1995, Laprie et al. 1996, Avizienis et al. 2004]. It is a global concept which includes various notions that can be grouped into three classes: threats, means and attributes. The goal of dependability is to specify, design and investigate systems in which a fault is natural, predictable and tolerable. The threats to dependability are faults, errors and failures; they are undesired (but not in principle unexpected) circumstances causing or resulting from undependability. The means for dependability are fault prevention, fault tolerance, fault removal and fault forecasting or prediction; these are the methods and techniques that enable one
