The selection of a software architecture style is an important design-stage decision and has a significant impact on various system quality attributes. To determine the software architecture, after the architectural style is selected, the software functionalities have to be distributed among the components of the software architecture. In this paper, a method based on the clustering of use cases is proposed to identify software components and their responsibilities. To select a proper clustering method, the proposed method is first applied to a number of software systems using different clustering methods, the results are verified against expert opinion, and the best method is recommended. By sensitivity analysis, the effect of features on clustering accuracy is evaluated. Finally, to determine the appropriate number of clusters (i.e., the number of software components), metrics of the interior cohesion of clusters and the coupling among them are used. Advantages of the proposed method include: 1) no need for weighting the features, 2) sensitivity analysis of the effect of features on clustering accuracy, and 3) presentation of a clear method to identify software components and their responsibilities.

1 INTRODUCTION

Software architecture is a fundamental artifact in the software life cycle, with an essential role in supporting the quality attributes of the final software product. Making use of architectural styles is one of the ways to design software systems and guarantee the satisfaction of their quality attributes [1]. After architectural style selection, only the type of software architecture organization is specified; software components and their responsibilities still need to be identified. On the other hand, component-based development (CBD) is nowadays an effective solution for the development and maintenance of information systems [2]. A component is a basic building block that can be designed and, if necessary, combined with other components [3]. Partitioning a software system into components, while affecting later software development stages, has a central role in defining the system architecture.

Component identification is one of the most difficult tasks in the software development process [4]. Indeed, few systematic component identification methods have been presented, there are no automatic or semi-automatic tools to help experts identify components, and component identification is usually done based on expert experience without automatic mechanisms.

relationship between objects, in which the rows show objects and the columns show activities (creating or using objects). The dynamic relationship between objects is determined based on the similarity of the activities that use or create these objects. In [5], software components are determined using the use case model, the object model, and the dynamic model (i.e., collaboration diagrams). To cluster related functions, the functional dependency of use cases is calculated and related use cases are clustered. In [6], the static and dynamic relationships between classes are used to cluster related classes into components. The static relationship measures relationship strength using different weights, and the dynamic relationship measures the frequency of message exchange at runtime. To compute the overall strength of the relationship between classes, the results of the two relationships are combined. In [7], use cases and the business type model are used to identify components. The relationship between classes is the main factor in identifying components; the core class is the center of each cluster, and responsibilities derived from use cases are used to guide the process. In [8], components are identified based on scenarios or use cases and their features. In [9], a framework for identifying stable business components has been suggested.

Disadvantages of most of these methods are: 1) lack of validation of the method on a number of software systems; 2) lack of an approach for determining the number of system components; 3) no sensitivity analysis of the effect of features on clustering accuracy; 4) high dependency of the method on expert opinion; 5) the need for manual weighting of the features used in clustering; and 6) no evaluation of the effect of using different clustering methods.

Since use cases are applied to describe the functionality of a system, in this paper a method for automatic identification of system software components is proposed based on the use case model (in the analysis phase). In this method, features are first extracted from the system analysis model, including the use case model, class diagram, and collaboration diagram. Then, using the proposed method and applying various clustering methods, use cases are clustered into several components. To evaluate the clustering methods, the components resulting from each clustering method are compared with expert opinion, and the method with the most conformity with expert opinion is selected. In most methods, the number of clusters (K) is an input parameter of clustering, but when partitioning a system into components, the number of components is not specified beforehand. Thus, in the proposed method, clustering is repeated for different values of K, and the most appropriate value of K (the number of components) is chosen based on high cohesion of components and low coupling among them. In order to increase clustering accuracy, the effect of features on clustering accuracy is determined using sensitivity analysis on the use case features. Finally, by choosing a proper feature selection method, the minimum set of features achieving the required clustering accuracy is selected.

Next, we present clustering in the second section. The proposed method for determining components and the evaluation of the proposed method are presented in sections 3 and 4, respectively. In section 5, the conclusion is presented and the proposed method is compared with other methods.

2 CLUSTERING

In order to understand new objects and phenomena, their features are described and then compared to other known objects or phenomena, based on similarity or dissimilarity [10]. All clustering methods include three common key steps: 1) determine the object features and collect the data, 2) compute the similarity coefficients of the data set, and 3) execute the clustering method.

Each input data set consists of an object-attribute matrix in which objects are the entities grouped based on their similarities, and attributes are the properties of the objects. A similarity coefficient for a given pair of objects shows the degree of similarity or dissimilarity between the two objects, depending on the way the data are represented. The similarity coefficient can be qualitative or quantitative. A data object is described by a set of features represented as a vector. The features are quantitative or qualitative, continuous or binary, nominal or ordinal; the feature type determines the corresponding measurement mechanisms.

2-1 Similarity and Dissimilarity Measures

To join (separate) the most similar (dissimilar) objects of a data set X into clusters, clustering algorithms apply a function that provides a quantitative measure between vectors. These quantitative measures are arranged in a matrix called the proximity matrix. The two types of quantitative measures are similarity measures and dissimilarity measures. The dissimilarity coefficient $d_{ij}$ is small when objects i and j are alike; otherwise, $d_{ij}$ becomes larger. A dissimilarity measure must satisfy the following conditions:

• $0 \le d_{ij} \le 1$
• $d_{ii} = 0$
• $d_{ij} = d_{ji}$

Typically, distance functions are used to measure continuous features, while similarity measures are more important for qualitative features [10]. The selection of a measure is problem dependent [10]. For binary features, a similarity measure is commonly used. Assume that parameters with two binary indexes are used for counting features in two objects: for example, $n_{00}$ and $n_{11}$ denote the number of features simultaneously absent and present in both objects, respectively, and $n_{01}$ and $n_{10}$ count the features present in only one of the objects. Equations (1) and (2) show two commonly used types of similarity measures for data points. In equation (1), $w = 1$ gives the simple matching coefficient, $w = 2$ the Rogers and Tanimoto measure, and $w = 1/2$ the Gower and Legendre measure. These measures compute the match between two objects directly.

$$S_{ij} = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + w(n_{01} + n_{10})} \qquad (1)$$

Equation (2) focuses on the co-occurrence of features while ignoring the effect of co-absence. In equation (2), $w = 1$ gives the Jaccard coefficient, $w = 2$ the Sokal and Sneath measure, and $w = 1/2$ the Gower and Legendre measure.

$$S_{ij} = \frac{n_{11}}{n_{11} + w(n_{01} + n_{10})} \qquad (2)$$

2-2 Clustering Methods

In this section, some of the main clustering methods are introduced.

A- Hierarchical Clustering (HC). HC algorithms organize data into a hierarchical structure according to the proximity matrix. The results of HC are usually depicted by a binary tree or dendrogram. The root node of the dendrogram represents the whole data set, and each leaf node is regarded as a data object. The intermediate nodes describe the extent to which the objects are proximal to each other, and the height of the dendrogram usually expresses the distance between each pair of objects or clusters, or between an object and a cluster. The ultimate clustering results can be obtained by cutting the dendrogram at different levels. HC algorithms are mainly classified into agglomerative methods and divisive methods [10]. Agglomerative clustering starts with N clusters, each of which includes exactly one object. A series of merge operations then follows that finally leads all objects into the same group. Based on different definitions of the distance between two clusters, there are many agglomerative clustering algorithms. Let $C_i$ and $C_j$ be two clusters, let $|C_i|$ and $|C_j|$ denote the number of objects each one contains, let $d(C_i, C_j)$ denote the dissimilarity measure between clusters $C_i$ and $C_j$, and let $d(i, j)$ denote the dissimilarity measure between two objects i and j, where i is an object of $C_i$ and j is an object of $C_j$. The simplest method is the single linkage (SLINK) technique, in which the distance between two clusters is computed by equation (3). The common problem of classical HC algorithms is a lack of robustness; they are, hence, sensitive to noise and outliers. Once an object is assigned to a cluster, it will not be considered again, which means that HC algorithms are not capable of correcting possible previous misclassifications [10].

$$d(C_i, C_j) = \min_{i \in C_i,\, j \in C_j} d(i, j) \qquad (3)$$
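For concreteness, here is a minimal single-linkage run over a hand-made dissimilarity matrix; SciPy's hierarchy module is assumed, and the 4x4 matrix is invented for illustration.

```python
# A brief sketch of agglomerative single-linkage (SLINK) clustering over a
# dissimilarity matrix, as in equation (3).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Symmetric dissimilarity matrix for four objects (zeros on the diagonal).
D = np.array([[0.0, 0.1, 0.8, 0.9],
              [0.1, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.2],
              [0.9, 0.8, 0.2, 0.0]])

Z = linkage(squareform(D), method="single")      # builds the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into two clusters
print(labels)  # e.g. [1 1 2 2]
```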

B- Squared Error-Based Clustering. Partitional clustering assigns a set of objects into clusters with no hierarchical structure. The optimal partition, based on some specific criterion, can be found by enumerating all possibilities; however, this is impossible in practice due to the expensive computation, so heuristic algorithms have been developed to seek approximate solutions. One of the important factors in partitional clustering is the criterion function, and the sum-of-squared-error function is one of the most widely used criteria [10]. The main problem of partitional methods is the sensitivity of the clustering solution to the randomly selected cluster centers. The K-means algorithm belongs to this category. This method is very simple and can be easily implemented to solve many practical problems, but there is no efficient and universal method for identifying the initial partitions and the number of clusters K. The iterative optimization procedure of K-means cannot guarantee convergence to a global optimum, and K-means is sensitive to outliers and noise; thus, many variants of K-means have appeared to overcome these obstacles. K-way clustering algorithms with repeated bisection (RB, RBR) and direct clustering (DIR) are expansions of this method and are introduced briefly below [11].

RB Clustering Method. In this method, the desired k-way clustering solution is computed by performing a sequence of k−1 repeated bisections. In each step, the cluster selected for further partitioning is the one whose bisection will optimize the value of the overall clustering criterion function. In this method, the criterion function is locally optimized within each bisection. This process continues until the desired number of clusters is found.

RBR Clustering Method. In this method, the desired k-way clustering solution is computed in a fashion similar to the repeated-bisection method, but at the end the overall solution is globally optimized.
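The following is a rough sketch of the repeated-bisection idea only, not CLUTO's implementation: scikit-learn's KMeans performs each two-way split, and total squared error stands in for CLUTO's criterion functions; the data are invented.

```python
# A minimal sketch of repeated bisection (RB): start with one cluster and apply
# k-1 two-way splits, each time bisecting the cluster whose split most reduces
# the squared error.
import numpy as np
from sklearn.cluster import KMeans

def repeated_bisection(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    labels = np.zeros(len(X), dtype=int)
    for new_label in range(1, k):
        best = None
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            if len(idx) < 2:
                continue
            km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[idx])
            # Gain: SSE of the cluster before the bisection minus after it.
            sse_before = ((X[idx] - X[idx].mean(axis=0)) ** 2).sum()
            gain = sse_before - km.inertia_
            if best is None or gain > best[0]:
                best = (gain, idx, km.labels_)
        _, idx, sub = best
        labels[idx[sub == 1]] = new_label  # one half keeps its label, the other gets a new one
    return labels

X = np.vstack([np.random.default_rng(1).normal(m, 0.2, (20, 2)) for m in (0, 3, 6)])
print(np.bincount(repeated_bisection(X, 3)))  # roughly [20 20 20]
```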

C- Graph-Based Clustering Method. The data can be described by means of graphs: nodes of a weighted graph correspond to data points in the pattern space, and edges reflect the proximities between each pair of data points. If the dissimilarity matrix is thresholded at a value $D_0$ (i.e., $D_{ij} = 1$ if $d(x_i, x_j) < D_0$, and 0 otherwise), the graph is simplified to an unweighted threshold graph. Graph theory is used for both hierarchical and non-hierarchical clustering [10].

D- Fuzzy Clustering Method. In this method, an object can belong to all of the clusters with a certain degree of membership. This is mainly useful when the boundaries among the clusters are not well separated and are ambiguous. Moreover, the memberships may help us discover more sophisticated relations between a given object and the disclosed clusters. FCM is one of the most popular fuzzy clustering algorithms [12]. FCM attempts to find a partition (c fuzzy clusters) for a set of data points $x_j \in \mathbb{R}^d$, $j = 1, \ldots, N$, while minimizing a cost function. FCM suffers from the presence of noise and outliers and from the difficulty of identifying the initial partitions.
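A compact, illustrative FCM implementation written from the standard update rules (not from [12]) is sketched below; the fuzzifier m, the toy data, and the fixed iteration count are simplifying assumptions.

```python
# A minimal sketch of fuzzy c-means (FCM) with fuzzifier m > 1.
import numpy as np

def fcm(X: np.ndarray, c: int, m: float = 2.0, iters: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                      # memberships of each point sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))      # inverse-distance membership update ...
        U /= U.sum(axis=0)                  # ... normalized over the c clusters
    return centers, U

X = np.vstack([np.random.default_rng(2).normal(0, 0.3, (30, 2)),
               np.random.default_rng(3).normal(4, 0.3, (30, 2))])
centers, U = fcm(X, c=2)
labels = U.argmax(axis=0)                   # defuzzification: most related cluster
print(centers.round(2), np.bincount(labels))
```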

E- Neural Networks-Based Clustering. In competitive neural networks, active neurons reinforce their neighborhood within certain regions while suppressing the activities of other neurons. A typical example is the self-organizing feature map (SOFM) [10].

2-3 Methods to Determine the Number of Clusters

In most methods, the number of clusters (K) is an input parameter of clustering, but the quality of the resulting clusters largely depends on the estimation of K, so many attempts have been made to estimate an appropriate K. For data points that can be effectively projected onto a two-dimensional Euclidean space, direct observation can provide good insight into the value of K, but this applies only to a small scope of applications.

Most of the proposed approaches rely on formulas that emphasize compactness within clusters and separation between clusters, together with the comprehensive effect of several factors such as the defined squared error, geometric or statistical features of the data, and the number of patterns. Two of them are briefly introduced as follows:

• CH Index [14]. This index is computed by equation (4), where N is the total number of patterns and $Tr(S_B)$ and $Tr(S_W)$ are the traces of the between-class and within-class scatter matrices, respectively. The K that maximizes the value of CH(K) is selected as the optimal one.

$$CH(K) = \frac{Tr(S_B)/(K-1)}{Tr(S_W)/(N-K)} \qquad (4)$$
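As a quick illustration, scikit-learn ships this criterion as calinski_harabasz_score; the blob data and the range of K below are invented.

```python
# A short sketch of picking K with the CH index of equation (4).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.vstack([np.random.default_rng(s).normal(4 * s, 0.5, (25, 2)) for s in (1, 2, 3)])
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)
print(max(scores, key=scores.get))  # the K that maximizes CH(K); expected 3
```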

• Ray and Turi index [15]. In this index, the optimal K value is calculated by equation (5). In this equation, Intra is the average intra-cluster distance measure that we want to minimize; it is computed by equation (6), where N is the number of patterns and $z_i$ is the cluster centre of cluster $C_i$. Inter is the distance between cluster centers, calculated by equation (7); we want to maximize this inter-cluster distance, i.e., the minimum distance between any two cluster centers. The K that minimizes the value of the validity measure is selected as the optimal one in k-means clustering.

$$Validity = \frac{Intra}{Inter} \qquad (5)$$

$$Intra = \frac{1}{N} \sum_{i=1}^{K} \sum_{x \in C_i} \|x - z_i\|^2 \qquad (6)$$

$$Inter = \min\left(\|z_i - z_j\|^2\right), \quad i = 1, 2, \ldots, K-1, \; j = i+1, \ldots, K \qquad (7)$$
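A small sketch of selecting K with this validity measure is given below; scikit-learn's KMeans supplies the centres, and the data are invented.

```python
# A minimal sketch of the Ray and Turi validity of equations (5)-(7): Intra is
# the mean squared distance to the assigned cluster centre, Inter the minimum
# squared distance between any two centres; the K minimizing Intra/Inter wins.
import numpy as np
from sklearn.cluster import KMeans

def ray_turi_validity(X: np.ndarray, labels: np.ndarray, centers: np.ndarray) -> float:
    intra = np.sum((X - centers[labels]) ** 2) / len(X)           # equation (6)
    dists = [np.sum((centers[i] - centers[j]) ** 2)               # equation (7)
             for i in range(len(centers)) for j in range(i + 1, len(centers))]
    return intra / min(dists)                                     # equation (5)

X = np.vstack([np.random.default_rng(s).normal(3 * s, 0.4, (30, 2)) for s in (1, 2, 3)])
validity = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    validity[k] = ray_turi_validity(X, km.labels_, km.cluster_centers_)
print(min(validity, key=validity.get))  # the K with minimum validity; expected 3
```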

3 AUTOMATIC DETERMINATION OF SYSTEM SOFTWARE COMPONENTS

In this section, the proposed method for use case clustering, or in other words, automatic determination of system software components, is presented. Software function clustering is done using the artifacts of the requirements analysis phase, so all features of the use case model, class diagram, and collaboration diagram (if any) are used in clustering.

Each use case indicates a section of system functionality, so use cases are the main way to express the functionality of the system. Each use case is composed of a number of executive scenarios in the system, producing a measurable value for a particular actor. The set of use case descriptions describes the complete functionality of the system. Each actor is a coherent set of roles played by the users during interaction with use cases [16]. Each use case diagram shows the interaction of the system with external entities and the system functionality from the user's viewpoint. Considering the above statements, the software components of the system are identified by relying on the identification of coherent use cases. Thus, the use cases of the system are the drivers of the proposed method for identifying its software components.

The stages of the proposed method are: 1) extraction of use case features, 2) construction of the proximity matrix of use cases, and 3) clustering of system use cases. These are introduced individually below.

3-1 Extraction of Use Case Features

By evaluating the artifacts of the requirements analysis phase, including the use case model, class diagram, and collaboration diagram, the following features can be defined for use case clustering. Features 1 to 4 are binary and the other features are continuous.

1– Actor. Use cases initiated or called by the same actor are more related than other use cases because the actors usually play similar roles in the system. So, each actor is considered as a feature, taking a value of 1 or 0 based on its presence or absence in the use case.

2– Entity classes. Use cases working with the same data are more related than other use cases. So, each entity class is considered as a feature, taking a value of 1 or 0 based on its presence or absence in the use case.

3– Control classes. In each use case, the class or classes responsible for coordinating activities between interface classes and entity classes are known as control classes. Use cases controlled by the same control class are more related than other use cases. Each control class is considered as a feature, taking a value of 1 or 0 based on its presence or absence in the use case.

4– Relationships between use cases. Based on the relationships between use cases, the following features can be extracted:

• If several use cases are related to a use case Ui through an extend relationship, a new feature is added to the existing use case features; its value is 1 for Ui and the related use cases, and 0 for the other use cases.

• If several use cases are specialized from a generalized use case, a new feature is added to the existing use case features; its value is 1 for them and 0 for the other use cases.

• If Ui and Uj are related through an include relationship, the relationships between Uj and use cases other than Ui should be investigated. Uj may also be included by Uk (as shown in Figure 1). In this case, if Ui has a relatively strong relationship with Uk (at least 2 or more shared features), a new feature is added to the existing use case features; its value is 1 for Ui and Uj, and 0 for the other use cases.

Figure 1. Include relationship between use cases

5– Weight of control class. Considering the number of entity classes and interface classes managed by each control class, a weight is assigned to each control class using equation (8), where $Nec_i$ and $Nic_i$ are, respectively, the number of entity and interface classes under the control of control class i, and m and l are the total numbers of entity and interface classes of the system, respectively.

$$wcc_i = \frac{Nec_i + Nic_i}{\sum_{j=1}^{m} Nec_j + \sum_{j=1}^{l} Nic_j} \qquad (8)$$

6– Association weight of use case. This feature is calculated by equation (9), where $Ncc_i$ is the number of control classes of the use case, $Naec_i$ is the number of relationships between entity classes of the use case, and $Nec_i$ is the number of entity classes of the use case (each control class has an association with the entity classes of the use case). The variable u is the number of use cases of the system, and the denominator of the fraction is the total dependency of all use cases of the system.

$$wuca_i = \frac{Ncc_i \times Nec_i + Naec_i}{\sum_{j=1}^{u} \left(Ncc_j \times Nec_j + Naec_j\right)} \qquad (9)$$

7– The similarity rate of each use case with the other use cases. This feature is computed in terms of the binary features (features 1 to 4) using equation (2) with the Jaccard coefficient. In this equation, $n_{11}$ is the number of binary features with a value of 1 in both use cases, $n_{01}$ is the number of binary features with a value of 0 for the first use case and 1 for the other, and the inverse relation holds for $n_{10}$. Since the similarity of each use case with the other (N−1) use cases is calculated, (N−1) features are added to the existing features.

3-2 Constructing the Proximity Matrix of Use Cases

As mentioned in section 2, clustering is done based on either the feature matrix or the proximity matrix (similarity/dissimilarity) of objects. As discussed in the previous step, some of the features are continuous and some are binary. In clustering objects with mixed features (both binary and continuous), we can either map all features into the interval (0, 1) and use distance measures, or transform them into binary features and use similarity functions; the problem with both approaches is information loss [10]. Instead, we can construct a similarity matrix for the binary features (stage 1), construct a dissimilarity (distance) matrix for the continuous features (stage 2), convert the dissimilarity matrix into a similarity matrix (stage 3), and finally use equation (10) to combine the similarity matrices of stages 1 and 3 into a single similarity matrix [17], where $w_1$ and $w_2$ are the importance weights of the binary and continuous similarity matrices, respectively.

$$S_{ij} = \frac{w_1 S^{bin}_{ij} + w_2 S^{cont}_{ij}}{w_1 + w_2} \qquad (10)$$
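The sketch below follows equation (10) as reconstructed above, with equal weights as in section 4-3; the toy data and the 1/(1+d) distance-to-similarity conversion are our own illustrative choices, not prescribed by the paper.

```python
# A hedged sketch of combining a binary (Jaccard) similarity matrix and a
# continuous similarity matrix into one, as in equation (10).
import numpy as np
from scipy.spatial.distance import cdist

def jaccard_matrix(B: np.ndarray) -> np.ndarray:
    inter = B @ B.T                                                  # n11 per pair
    union = inter + np.sum(B[:, None, :] != B[None, :, :], axis=2)   # n11+n01+n10
    # Pairs with an empty union (all-zero rows) default to similarity 1.
    return np.divide(inter, union, out=np.ones_like(inter, float), where=union > 0)

B = np.array([[1, 0, 1], [1, 1, 1], [0, 1, 0]])      # binary use case features
C = np.array([[0.4, 0.2], [0.5, 0.1], [0.1, 0.9]])   # continuous features
S_bin = jaccard_matrix(B)
S_cont = 1.0 / (1.0 + cdist(C, C))                   # distance -> similarity
w1 = w2 = 0.5                                        # equal importance weights
S = (w1 * S_bin + w2 * S_cont) / (w1 + w2)           # equation (10)
print(S.round(2))
```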

3-3 Clustering System Use Cases

In section 3-2, the use case similarity matrix was established. This matrix is the main input of most clustering methods used in this study. For clustering the use cases of the system, the following clustering methods are used: (1) RBR, (2) RB, (3) Agglomerative (Agglo), (4) Direct, (5) Graph-based, (6) FCM, and (7) Competitive Neural Network (CNN). The best clustering method is chosen based on the assessment performed in section 4.

4 EVALUATION OF THE PROPOSED METHOD

In the previous section, the proposed method for determining the software components of a system was described; the method can use several clustering algorithms. In this section, to select the best clustering method, the results of partitioning the functions of several software systems using the introduced methods are first compared with expert opinion, and the method with the most conformity with expert opinion is selected. In addition, using criteria based on high cohesion of clusters and low coupling among them, the suitable number of clusters is determined. Using sensitivity analysis, the effect of each feature on clustering accuracy is determined. Finally, we determine a near-optimal set of features that gives sufficient precision in clustering while being minimal.

In methods (1) to (5), clustering is done based on the similarity matrix using the CLUTO tool and various optimization functions [11, 18, 19]. CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the different clusters. In most of CLUTO's clustering methods, the clustering problem is treated as an optimization process that seeks to maximize or minimize a particular clustering criterion function defined either globally or locally over the entire clustering solution space. CLUTO provides seven different criterion functions (h2, h1, g'1, g1, e1, i2, i1) that can be used in both partitional and agglomerative clustering methods. In addition to these criterion functions, CLUTO provides some of the more traditional local criteria, such as SLINK, that can be used in agglomerative clustering, as well as graph-partitioning-based clustering algorithms. In the FCM and CNN methods, clustering is done based on the feature matrix of use cases using MATLAB [20].

4-1 Evaluation Method

The steps of the evaluation method are as follows:

• Comparison of the Clustering Method Results with Expert Opinion to Select the Best Method. In this step, the function-clustering results for some software systems are compared with the clustering desired by the expert, and the method whose results show the most conformity with the expert's clustering is selected as the best method. In this stage, the number of clusters in each system is determined based on expert opinion.

The error of a clustering method is computed by equation (13), where $CE_j$ and $CT_j$ are the sets of use cases of the j-th component from the expert's and the clustering method's view, respectively, and $\Delta$ denotes the symmetric difference of two sets.

$$Error = \frac{1}{2} \sum_{j=1}^{K} \left|CE_j \,\Delta\, CT_j\right| \qquad (13)$$

The overall performance of the methods in clustering the functions of several software systems is calculated in terms of the number of errors by equation (14). In this equation, $NCE_k$ is the number of errors of the clustering method for the k-th system, $NUC_k$ is the total number of use cases of the k-th system, and NS is the number of systems. As a system grows, the number of use cases increases and the accuracy of clustering decreases; therefore, dividing the clustering errors by the number of use cases of each system and taking the mean of these values yields a criterion that shows the mean error of a clustering method with a specific criterion function. The lower the $QCF_{i,j}$ value, the higher the quality of the i-th clustering method with the j-th criterion.

$$QCF_{i,j} = \frac{1}{NS} \sum_{k=1}^{NS} \frac{NCE_k}{NUC_k} \qquad (14)$$

• Sensitivity Analysis. In this stage, each feature is eliminated in turn and its effect on clustering is examined; features with a negative effect or no effect on clustering are identified and removed.

• Determining the Minimum Feature Set. To select a feature set that is minimal while its accuracy is still sufficient for clustering, the sequential backward selection (SBS) method [21] is used. In the SBS method, we begin with all features and repeatedly eliminate the feature whose removal gives the best clustering performance. This cycle is repeated until no improvement results from reducing the feature set (see the sketch after this list).

• Determination of the Suitable Number of Clusters. In section 2-3, two methods were mentioned for determining the number of clusters. In this stage, these methods are used to determine the number of clusters for the sample software systems, and the suitable method is selected.

4-2 Introduction of the Sample Software Systems

In this section, the proposed method is validated using four software systems of a software development company in Iran. The use case features of the systems are shown in Table 1. The second column shows the number of use cases, and the third column the number of components of each system. The other columns show the number of features, including the number of actors, entity classes, control classes, and different relationships among use cases, as well as the weight of control class, the association weight of use case, and the similarity rate of each use case with the other use cases; the last column is the total number of features. Note that for each use case there is one control class weight feature and one use case association weight feature.

Table 1. Characteristics of sample software systems

| System name | Use cases | Components | Actors | Entity classes | Control classes | Extend | Specialization/ Generalization | Include | Weight of control class | Association weight of use case | Similarity rate of each use case | Total features |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| System 1 | 53 | 6 | 10 | 17 | 24 | 4 | 0 | 0 | 1 | 1 | 52 | 109 |
| System 2 | 23 | 4 | 3 | 6 | 10 | 2 | 0 | 7 | 1 | 1 | 22 | 52 |
| System 3 | 21 | 4 | 5 | 18 | 11 | 0 | 0 | 0 | 1 | 1 | 20 | 56 |
| System 4 | 11 | 3 | 4 | 6 | 7 | 0 | 0 | 0 | 1 | 1 | 10 | 29 |

4-3 Evaluation of Clustering Methods

First, the use case features of the systems introduced in Table 1 are extracted; the similarity matrix of the use cases of each system is formed, and the use cases are clustered using the mentioned clustering methods and different criterion functions. In equation (10), the values of the weights are considered equal.

Since the FCM method determines a degree of membership of each use case to each cluster, a defuzzification process is used to assign each use case to the most related cluster. Table 2 shows the clustering results of the software systems with the different clustering methods. The numbers in the columns related to each system are the error counts of the clustering method with the specified criterion function, based on equation (13). The number of components in each system is determined by expert opinion.

The results of use case clustering by the RBR, RB, Direct, and Graph-based methods reveal that in each of these methods, the average error (QCF) is the same for the criterion functions i1, i2, h1, and h2; thus, only the results for the h2 criterion function are displayed in Table 2. The average error of the RBR, RB, and Direct methods for the other criterion functions is higher (0.141) and is not included in Table 2.

According to the results of Table 2, and based on equation (14), the RBR and Direct methods with criterion functions i1, i2, h1, and h2 have the most conformity with expert opinion. Thus, these methods with the mentioned criterion functions are recommended.

Table 2. Clustering results of system use cases with different clustering methods (cell values are error counts; the number in parentheses is each system's number of use cases)

| Clustering method | Criterion function | System 1 (53) | System 2 (23) | System 3 (21) | System 4 (11) | Average error (QCF[i,j]) |
|---|---|---|---|---|---|---|
| RBR | h2 | 6 | 2 | 0 | 2 | 0.095 |
| RB | h2 | 6 | 3 | 0 | 3 | 0.129 |
| Direct | h2 | 6 | 2 | 0 | 2 | 0.095 |
| Graph-based | h2 | 6 | 7 | 7 | 0 | 0.188 |
| Agglo | i2 | 6 | 3 | 4 | 0 | 0.109 |
| FCM | - | 7 | 1 | 4 | 4 | 0.182 |
| CNN | - | 14 | 6 | 5 | 4 | 0.282 |


4-4 Determining the Appropriate Number of Clusters

As stated in section 2-3, most methods for determining the number of clusters are based on intra-cluster compactness and inter-cluster coupling. To automatically determine the number of clusters, the CH and Ray and Turi indices are used. Table 3 shows the number of components of the four software systems based on expert opinion and on these indices. According to Table 3, the results of the CH index have little conformity with expert opinion, so it is not a suitable method for determining the number of clusters: a system is expected to have a reasonable number of components, and this index, except for system 1, does not lead to a proper estimate of the number of components. The results of the Ray and Turi index are close to the expert opinions, so we accept the results of this index.

Table 3. The number of components in the sample software systems

| System name | Use cases | Expert opinion | CH index | Difference | Ray and Turi index | Difference |
|---|---|---|---|---|---|---|
| System 1 | 53 | 6 | 5 | -1 | 5 | -1 |
| System 2 | 23 | 4 | 10 | +6 | 5 | +1 |
| System 3 | 21 | 4 | 2 | -2 | 4 | 0 |
| System 4 | 11 | 3 | 6 | +3 | 3 | 0 |

4-5 Sensitivity Analysis

For the sensitivity analysis, each feature is eliminated in turn and its effect on clustering accuracy is evaluated; features with a negative effect or no effect on clustering are identified and deleted. Table 4 shows the features and their effect on clustering. The absence of a feature is shown by the "-" symbol.

Table 4. Features and their effect on clustering

[Table 4 marks, for each of the four systems, whether each feature has a positive, negative, or no effect on clustering; "-" marks features absent from a system. The feature rows are: 1) Actor; 2) Entity classes; 3) Control classes; 4) Different relationships among use cases (Extend, Generalization/Specialization, Include); 5) Weight of control class; 6) Association weight of use case; 7) Similarity rate of each use case with other use cases. The individual marks were lost in extraction and are not recoverable here.]


Table 5 shows the quantitative results of the sensitivity analysis in terms of the number of errors resulting from the inclusion or exclusion of features in clustering. Note that the similarity rate of use cases with each other is computed based on the binary features.

Table 5. Quantitative results of the feature sensitivity analysis in terms of the number of clustering errors ("similarity rate" abbreviates "similarity rate of each use case with other use cases")

| System name | All features | Only binary features | Only continuous features | All features w/o actors | All features w/o control classes | All features w/o entity classes | Binary features w/o actors | Binary features w/o control classes | Binary features w/o entity classes | Similarity rate w/o actors | Similarity rate w/o control classes | Similarity rate w/o entity classes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| System 1 | 0 | 0 | 1 | 1 | 5 | 5 | 1 | 3 | 7 | 1 | 3 | 5 |
| System 2 | 0 | 1 | 0 | 6 | 3 | 7 | 5 | 3 | 7 | 5 | 3 | 7 |
| System 3 | 0 | 1 | 0 | 11 | 1 | 0 | 9 | 1 | 0 | 9 | 1 | 0 |
| System 4 | 0 | 0 | 0 | 3 | 0 | 0 | 4 | 0 | 0 | 3 | 0 | 0 |

The results of the sensitivity analysis show that:

1- The number of features in rows 1 to 3 and 7 (Table 4) is high in each system, and their effect on use case clustering is significant.

2- The effect of the features in rows 4 to 6 (Table 4) is negligible compared to the other features. One reason is their small number relative to the other features; moreover, the values of the features in rows 5 and 6 are usually less than 0.3, which makes their effect on clustering negligible.

4-5-1 Sensitivity Analysis of the Weights of the Binary and Continuous Similarity Matrices

In equation (10) of section 3-2, the importance weights used to combine the binary and continuous similarity matrices were considered equal. As the feature "similarity rate of each use case with other use cases" has no effect on the clustering accuracy of system 1 (as shown in Table 4), system 2 was used to assess the effect of changing the weights of these matrices.

4-6 Determining the Minimum Feature Set

In this section, the near-optimal feature set of each system's use cases is determined using the SBS method; the results are listed in Table 6.

Table 6. Minimum system features for function clustering

| Row | System name | Actors (number) | Actors (minimum) | Entity classes (number) | Entity classes (minimum) | Control classes (number) | Control classes (minimum) |
|---|---|---|---|---|---|---|---|
| 1 | System 1 | 10 | 3 | 17 | 2 | 24 | 11 |
| 2 | System 2 | 3 | 2 | 6 | 4 | 10 | 1 |
| 3 | System 3 | 5 | 4 | 18 | 1 | 11 | 0 |
| 4 | System 4 | 4 | 1 | 6 | 1 | 7 | 1 |

4-7 Comparison of Results with the Kim Method

Considering the following points, it is not possible to determine the components of the software systems (systems 1, 2, 3, and 4) using most of the works related to this research, so a comparison of their results with the proposed method is not possible:

1) Most methods require a series of weighting actions, and there are no exact guidelines for the weighting.

2) The steps of the methods were not clearly described, so executing the steps is not possible; in some cases, the features used in clustering were not even defined.

3) The basis of their clustering is different from the proposed method, so it is not feasible to compare their efficiency with the proposed method.

4) In some cases, using the method requires information that is not available from the software systems.

As the method of Kim and his colleague [5] is based on clustering of use cases, the components of the four software systems were determined using this method, assigning the same weight to all features, and the results were compared with the proposed method. As shown in Table 7, the proposed method achieves better results than the Kim method.

Table 7. Comparison of the proposed method's results with the Kim method (cell values are the number of clustering errors for each system)

| Method | System 1 | System 2 | System 3 | System 4 |
|---|---|---|---|---|
| Proposed method | 6 | 0 | 2 | 0 |
| Kim method | 9 | 8 | 4 | 0 |

The advantages of the proposed method in comparison with the related works are as follows:

1- Presentation of a clear method to determine system software components by learning from past experience in software development.

2- Extraction of more features for clustering and sensitivity analysis of the effect of features in order to refine them. The proposed method uses more features than the other related works and determines their effect on clustering through sensitivity analysis.

3- Use of different clustering methods and selection of the best method in terms of the highest conformity with expert opinion.

4- Verification of the results of the clustering methods against expert opinion, ensuring the accuracy of the proposed method.

5- Use of a number of software systems for validating the method.

6- Sensitivity analysis by eliminating every feature and assessing the effect of its elimination on increasing or decreasing the clustering accuracy.

7- Elimination of the need to assign weights to features in clustering.

4-8 Extension

For further research, the pre-conditions and post-conditions of each use case were also considered as a new feature. Use cases with similar pre-conditions/post-conditions are more related than other use cases, so each pre-condition/post-condition is considered a feature taking a value of 1 or 0 based on its presence or absence in the use case. In the sample software systems, only the use cases of system 2 had pre-conditions/post-conditions. Considering the pre-conditions/post-conditions of each use case, the clustering was repeated and the number of clustering errors decreased relative to before: in the RBR, Direct, and RB clustering methods, the clustering errors became 0, 0, and 1, respectively. Thus, this feature can also be used in use case clustering.

5 CONCLUSION

In this paper, a method was proposed to automatically determine system software components based on clustering of use case features. First, the system use case features were extracted and the components were determined by applying clustering methods in the proposed method. Then, the appropriate clustering method was selected by comparing the results of the clustering methods with expert opinion. To determine the appropriate number of clusters, metrics of the interior cohesion of clusters and the coupling among them were used. By sensitivity analysis, the effect of each feature on clustering accuracy was determined, and finally the closest-to-optimal set of features providing the required clustering accuracy was determined using the SBS method. The case studies conducted on four software systems, while validating the method, showed that the RBR and Direct clustering methods, which are extensions of the K-means method, have the most conformity with expert opinion; they were therefore selected and recommended as the most appropriate methods. The innovation of this research is to propose a systematic method to determine system software components with the mentioned specifications.

Related works were introduced in the introduction section. The problems of those methods, in addition to the points already mentioned, are as follows: (1) the results have not been compared with expert opinion; (2) the presented methods have not been validated on a number of software systems; (3) various clustering methods have not been used; (4) the effect of features on clustering accuracy has not been determined using sensitivity analysis; (5) there has been no guideline for determining the number of clusters; and (6) fewer features than in the proposed method are used in clustering. These shortcomings have been addressed in this research.

in 2001, and the B.Sc. degree in software engineering from Ferdowsi University of Mashhad in 1990. His main research interests are software engineering, quantitative evaluation of software architecture, software metrics, and software cost estimation. He is currently working on his Ph.D. thesis on the design and evaluation of software architecture. E-mail: Shahmohamadi@modares.ac.ir.

Saeed Jalili received the Ph.D. degree from Bradford University in 1991 and the M.Sc. degree in computer science from Sharif University of Technology in 1979. Since 1992, he has been an assistant professor at Tarbiat Modares University.