
Monitoring services in distributed systems: a review of the INCA monitoring system.

Harshad Joshi

Abstract:

Distributed computing is currently one of the most widely exploited computing platforms, with emerging techniques such as cloud computing. It is therefore becoming increasingly important to understand all aspects of distributed computing. From setting up the hardware to running various kinds of software on distributed systems, probably the most important factor is the "monitoring service". In this survey, the Inca service, as implemented on the TeraGrid computing facility, is analyzed in detail.

As the size of a distributed system increases, controlling its operation and maintenance becomes increasingly difficult. For a geographically distributed system (such as TeraGrid, a high-performance scientific computing facility spread across the US), monitoring and control are quite a challenge. The main difficulty arises from the requirement that some operations be real-time or quasi real-time [1]. In this paper, some standard publish/subscribe middleware candidates, specially designed and developed for the Grid, are examined for their architecture and functionality, and their advantages and disadvantages are discussed.

Distributed systems (DS) are becoming very popular, and in the near future a large number of services based on DS concepts will become routine for many operations. Previously, when CPU power and/or memory were limited, the main driving force for DS was to build systems with greater compute power and large amounts of (shared) memory to tackle more compute-intensive tasks (mainly scientific and engineering problems). This also led to supercomputer and cluster architectures. However, with advances in hardware technologies, not only in CPUs (Moore's law is still driving the growth rate of CPU power [2]) but also in other hardware such as memory and GPUs, cluster computing is achieving petascale performance with clusters of modest size [3]. With the internet and other technologies such as mobile computing, new concepts began to materialize in the last decade, such as geographically distributed computing and cloud computing, and this has attracted both academia (which constructed large-scale scientific computing facilities such as TeraGrid [4]) and industry [5-7]. These systems are highly dispersed across different physical locations. On one hand, these advances are very attractive, making computing boundaries invisible and making what is now called the "Internet of Things" a reality [8]. On the other hand, the ever-increasing complexity makes controlling and monitoring these systems, to ensure that everything works as intended, a rather challenging task [9].

Monitoring a DS typically requires producing data that can be collected remotely and updated frequently. This calls for algorithms that work efficiently even over slow connections, accurately yielding real-time (or at least quasi real-time) control over the system of interest. For example, if a compute node of a remote supercomputer has been switched on but does not respond for a long time, it will be considered to be malfunctioning. A real-time system does not need to be very fast, but it should be stable and respond within a reasonable predefined time limit. For a monitoring system to be considered a distributed real-time monitoring system, most of the monitoring data should be received within a reasonable time limit. Traditional monitoring systems are highly centralized and may not scale very well.
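The timeout rule just described can be sketched as follows; the function name and the 300-second limit are illustrative assumptions, not part of any real monitoring system:

```python
import time

# Hypothetical "reasonable predefined time limit" in seconds.
RESPONSE_LIMIT_S = 300

def classify(last_heartbeat, now=None):
    """Flag a node as malfunctioning if its last response is older than the limit."""
    now = time.time() if now is None else now
    return "ok" if now - last_heartbeat <= RESPONSE_LIMIT_S else "malfunctioning"

now = 1_000_000.0
status_recent = classify(now - 10, now)   # node responded 10 seconds ago
status_stale = classify(now - 900, now)   # node silent for 15 minutes
```

The point is not speed but predictability: any data arriving within the predefined window is treated as a valid real-time observation.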

Detailed reviews of how a monitoring service should be designed can be found in the literature (e.g., ref. [11]). Here we restrict the discussion by stating that a publish/subscribe (pub/sub) system seems to be the best solution for monitoring services, due both to its ability to disseminate data many-to-many and to the highly distributed nature of a DS. Publishers publish data, and subscribers receive the data they are interested in. Publishers and subscribers in a pub/sub system are independent and need to know nothing about each other. The middleware not only delivers data to its destination but also provides higher-level functionality such as data discovery, dissemination, filtering, persistence, and reliability. A subscriber can be automatically notified when new data becomes available. Compared to a traditional centralized client/server communication model, a pub/sub system is asynchronous and is usually distributed and scalable.
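The decoupling described above can be sketched with a minimal in-memory broker; the Broker class, topic names, and messages below are invented for illustration, and real Grid pub/sub middleware adds discovery, filtering, persistence, and network transport on top of this pattern:

```python
from collections import defaultdict

class Broker:
    """Decouples publishers from subscribers: neither knows about the other."""
    def __init__(self):
        self._subs = defaultdict(list)  # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        # Subscribers register interest in a topic, not in a publisher.
        self._subs[topic].append(callback)

    def publish(self, topic, message):
        # Deliver to every subscriber interested in this topic (many-to-many).
        for cb in self._subs[topic]:
            cb(message)

broker = Broker()
received = []
broker.subscribe("node.status", received.append)
broker.publish("node.status", {"host": "tg-login1", "state": "up"})
broker.publish("cpu.load", {"host": "tg-login1", "load": 0.7})  # no subscriber: dropped
```

Note that the publisher never names a recipient; adding or removing subscribers requires no change on the publishing side, which is what makes the model scale in a highly distributed system.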

A variety of monitoring and discovery services exist: ZenOSS, VMware vCloud, XCat, MonALISA, and INCA, to name a few [12]. In the following, we review the features of INCA, as successfully implemented on the TeraGrid computing platform [13].

INCA Architecture and Features

Inca was developed at SDSC to create the monitoring system for the TeraGrid portal. Inca is deployed on a wide variety of production Grids such as TeraGrid, GEON, TEAM, the University of California Grid (UC Grid), ARCS, DEISA, NGS, and ZIH. Inca has also been used to monitor Open Source DataTurbine deployments on CREON and GLEON, as well as to execute and collect performance data from IPM-instrumented applications. Inca offers a variety of web status pages, from cumulative summaries to reporter execution details and result histories [see Fig. 1]. While other Grid monitoring tools provide system-level information on the utilization of Grid resources, the Inca system provides user-level Grid monitoring with periodic, automated user-level testing of the software and services required to support Grid operation. Thus, Inca can be used by Grid operators, system administrators, and application users to identify, analyze, and troubleshoot user-level Grid failures, thereby improving Grid stability.

User-level Grid monitoring provides Grid infrastructure testing and performance measurement from a generic, impartial user's perspective. The goal of user-level monitoring is to detect and fix Grid infrastructure problems before users notice them; user complaints should not be the first indication of Grid failures. A successful user-level Grid monitoring system needs to include the following features (cf. Inca technical reports from the Inca website):

• Runs from a standard user account in order to reflect regular user experiences.

• Executes with a standard user GSI credential mapped to a standard user account when tests or performance measurements require authentication to Grid services.

• Emulates a regular user by using tests and performance measurements created and configured based on user documentation, rather than on system administrator knowledge (of hostnames, ports, pathnames, etc.). In cases where documentation and tests are developed simultaneously during pre-production, test development should be closely coordinated with the documentation as it is written.

• Centrally manages the configuration of user-level tests or performance measurements in order to ensure consistent testing across resources.

• Easily updates and maintains user-level tests and performance measurements. This is important because tests and measurements are often updated when Grid infrastructure changes. Also, multiple iterations of test development are often required to determine whether a detected test failure stems from a faulty test, incomplete user documentation, or a failed Grid resource.

• Provides a representative indication of Grid status by testing documented user commands and individual Grid software components.

• Automates the periodic execution of user-level tests or performance measurements to understand Grid behavior over time.

• Executes locally on Grid resources to verify user-accessible Grid access points. Executes from each resource to every other resource (all-to-all) to detect site-to-site configuration errors such as authentication problems.
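The all-to-all cross-site testing in the last requirement amounts to generating every ordered (source, destination) pair of resources; a minimal sketch, with hypothetical resource names:

```python
from itertools import permutations

# Example resource names (hypothetical); a real deployment would read
# these from the Grid's configuration.
resources = ["sdsc", "ncsa", "psc"]

# Ordered pairs: each resource tests its access to every other resource,
# so one-directional misconfigurations (e.g., authentication working from
# A to B but not from B to A) still surface.
cross_site_tests = list(permutations(resources, 2))
```

With n resources this yields n(n-1) directed tests, which is why the document later notes that Inca measures the system impact of tests in order to tune their execution frequency.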

The Inca implementation provides these features to deliver a user-level Grid monitoring system.

Inca Features

Inca is a system that provides user-level monitoring of Grid functionality and performance. It was designed to be general, flexible, scalable, and secure, in addition to being easy to deploy and maintain. Inca benefits Grid operators who oversee the day-to-day operation of a Grid, system administrators who provide and manage resources, and users who run applications on a Grid. The Inca system (taken from the Inca user manual [15]):

1. Collects a wide variety of user-level monitoring results (e.g., from simple test data to more complex performance benchmark output).

2. Captures the context of a test or benchmark as it executes (e.g., executable name, inputs, source host, etc.) so that system administrators have enough information to understand the result and can troubleshoot system problems without having to know the internals of Inca.

3. Eases the process of writing tests or benchmarks and deploying them into Inca installations.

4. Provides means for sharing tests and benchmarks between Inca users.

5. Easily adapts to new resources and monitoring requirements in order to facilitate maintenance of a running Inca deployment.

6. Stores and archives monitoring results (especially any error messages) in order to understand the behavior of a Grid over time. The results are available through a flexible querying interface.

7. Securely manages short-term proxies for testing of Grid services using MyProxy.

8. Measures the system impact of tests and benchmarks executing on the monitored resources in order to tune their execution frequency and reduce the impact on resources as needed.

Figure 1 shows the architecture of Inca, which incorporates three core components (highlighted box): the agent, the depot, and the reporter managers. The agent and reporter managers coordinate the execution of tests and performance measurements on the Grid resources, and the depot stores and archives the results. The inputs to Inca are one or more reporter repositories, which contain user-level tests and benchmarks called reporters, and a configuration file describing how to execute them on the Grid resources. This configuration is normally created using an administration GUI tool called incat (the Inca Administration Tool). The outputs, or results, collected from the resources are queried by a data consumer and displayed to users. The following steps describe how an Inca administrator would deploy user-level tests and/or performance measurements to their resources.

1. The Inca administrator either writes reporters to monitor the user-level functionality and performance of their Grid or uses existing reporters in a published repository.

2. The Inca administrator creates a deployment configuration file that describes the user-level monitoring for their Grid using incat and submits it to the agent.

3. The agent fetches reporters from the reporter repository, creates a reporter manager on each resource, and sends the reporters and instructions for executing them to each reporter manager.

4. Each reporter manager executes reporters according to its schedule and sends data to the depot.

5. Data consumers display collected data by querying the depot.
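The five-step flow can be modeled with a toy in-process sketch; all class and method names here are invented for illustration, whereas the real Inca components are separate servers communicating over SSL:

```python
class Depot:
    """Stores and archives results (step 4 target, step 5 source)."""
    def __init__(self):
        self.results = []
    def store(self, result):
        self.results.append(result)
    def query(self, reporter_name):
        return [r for r in self.results if r["reporter"] == reporter_name]

class ReporterManager:
    """Runs reporters on one resource and forwards results to the depot."""
    def __init__(self, resource, depot):
        self.resource, self.depot, self.reporters = resource, depot, []
    def receive(self, reporters):
        self.reporters = reporters
    def run_all(self):  # step 4 (scheduling omitted in this sketch)
        for name, fn in self.reporters:
            self.depot.store({"reporter": name,
                              "resource": self.resource,
                              "result": fn()})

class Agent:
    """Reads the configuration and stages reporters onto resources."""
    def __init__(self, repository, depot):
        self.repository, self.depot, self.managers = repository, depot, {}
    def deploy(self, config):  # steps 2-3
        for resource, names in config.items():
            mgr = ReporterManager(resource, self.depot)
            mgr.receive([(n, self.repository[n]) for n in names])
            self.managers[resource] = mgr

repository = {"ssh_check": lambda: "pass"}  # step 1: a trivial reporter
depot = Depot()
agent = Agent(repository, depot)
agent.deploy({"sdsc": ["ssh_check"], "ncsa": ["ssh_check"]})
for mgr in agent.managers.values():
    mgr.run_all()
summary = depot.query("ssh_check")  # step 5: data consumer queries the depot
```

The sketch highlights the division of labor: the agent handles configuration and staging, reporter managers handle per-resource execution, and the depot is the single store that data consumers query.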

A reporter is an executable program that tests or measures some aspect of the system or installed software.

A report is the output of a reporter and is an XML document complying with the reporter schema.

A suite specifies a set of reporters to execute on selected resources, their configuration, and their frequency of execution.

A reporter repository contains a collection of reporters and is available via a URL.

A reporter manager is responsible for managing the schedule and execution of reporters on a single resource.

An agent is a server that implements the configuration specified by the Inca administrator.

incat is a GUI used by the Inca administrator to control and configure the Inca deployment on a set of resources.

A depot is a server that is responsible for storing the data produced by reporters.

A data consumer is typically a web-page client that queries a depot for data and displays it in a user-friendly format.

Inca emulates a Grid user by running under a standard user account and executing tests, thus ensuring consistent testing across resources with centralized test configuration. Inca manages and collects a large number of results through a GUI interface (incat). It measures the resource usage of tests and benchmarks to help Inca administrators balance data freshness with system impact.

Data is collected by reporters, executables that measure particular aspects of the system and output the results as XML. Multiple types of data can be collected, since Inca offers a number of prewritten test scripts, called reporters, for monitoring Grid health. Reporter APIs make it easy to create new Inca tests. By storing and archiving complete monitoring results, Inca allows system administrators to debug detected failures using archived execution details. Inca offers a variety of Grid data views, from cumulative summaries to reporter execution details and result histories. Inca components communicate using SSL, making it a secure monitor for DS service testing.
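As a hedged illustration of the reporter idea, the sketch below runs a documented user command and wraps the outcome in XML; the element names are invented for this example and do not follow the actual Inca reporter schema:

```python
import subprocess
import xml.etree.ElementTree as ET

def run_reporter(name, command):
    """Run a documented user command and return an XML report of the outcome."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    root = ET.Element("report")                     # invented element names,
    ET.SubElement(root, "name").text = name         # not the Inca schema
    ET.SubElement(root, "exit_status").text = str(proc.returncode)
    ET.SubElement(root, "body").text = proc.stdout.strip()
    return ET.tostring(root, encoding="unicode")

# A trivial "documented user command" standing in for a real Grid test.
xml_report = run_reporter("echo_check", "echo hello")
```

Because every reporter emits the same structured format, the depot can archive results uniformly and data consumers can query them without knowing what each test actually ran.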