Abstract

Model transformations take as input a source model and generate as output a target model. The source and target models conform to given meta-models. We distinguish between two transformation categories. Exogenous transformations are transformations between models expressed using different languages, and the whole source model is transformed. Endogenous transformations are transformations between models expressed in the same language. For endogenous transformations, two steps are needed: identifying the source model elements to transform and then applying the transformation to them.

In this thesis, we propose three principal contributions. The first contribution aims to automate model transformations. The process is seen as an optimization problem in which different transformation possibilities are evaluated and each possibility is assigned a quality that depends on its conformity with a reference set of examples. This first contribution can be applied to exogenous as well as endogenous transformations (after determining the source model elements to transform). The second contribution concerns the detection of the elements targeted by endogenous transformations. In this context, we present a new technique for design defect detection. The detection is based on the notion that the more code deviates from good practice, the more likely it is to be defective. Taking inspiration from artificial immune systems, we generate a set of detectors that characterize the ways in which code can diverge from good practices. We then use these detectors to determine how far the code in the assessed systems deviates from normality. The third contribution concerns the testing of transformation mechanisms. The proposed oracle function compares target test cases with a base of examples containing good-quality transformation traces, and assigns a risk level based on the dissimilarity between the two. The traces help the tester understand the origin of an error.

The three contributions are evaluated on real software projects, and the results obtained confirm their efficiency.

Keywords: Model-driven engineering, by example, design defects, search-based software engineering, artificial immune system

Chapter 1: Introduction

Research Context

Software engineering is concerned with the development and evolution of large and complex software-intensive systems. It covers theories, methods and tools for the specification, architecture, design, testing, and maintenance of software systems. Today's software systems are large, complex and critical. Such systems cannot be developed and evolved in an economic and timely manner without automation. Automated software engineering applies computation to software engineering activities. The goal is to partially or fully automate these activities, thereby significantly increasing both quality and productivity. This includes the study of techniques for constructing, understanding, adapting and modelling both software artefacts and processes. Automatic and collaborative systems are both important areas of automated software engineering, as are computational models of human software engineering activities. Knowledge representations and artificial intelligence techniques that can be applied in this field are of particular interest; they represent formal and semi-formal techniques that provide or support theoretical foundations.

Automated software engineering approaches have been applied in many areas of software engineering. These include requirements engineering, specification, architecture, design and synthesis, implementation, modelling, testing and quality assurance, verification and validation, maintenance and evolution, reengineering, and visualisation [40], [64]. This thesis is concerned with two important fields of automated software engineering: (1) model-driven engineering and (2) maintenance. The contributions to the first field consist of model transformation automation and testing using different techniques; those to the second field cover two tasks. The first task is the improvement of code quality by automating the detection and correction of bad programming practices. This task can be viewed as a special kind of transformation, called endogenous transformation, where the source and target models are expressed in the same language. The second task is the validation of a transformation mechanism in order to detect potential errors.

Automated Model Transformation

A first distinction concerns the kinds of software artefacts being transformed. If they are programs (i.e., source code, bytecode, or machine code), we use the term program transformation; if they are models, we use the term model transformation (MT). In our view, the latter term encompasses the former, since a model can range from abstract analysis models to very concrete models of source code. Hence, model transformations also include transformations from a more abstract to a more concrete model (e.g., from design to code) and vice versa (e.g., in a reverse engineering context). Model transformations are needed in common tools such as code generators and parsers.

Kleppe et al. [5] provide the following definition of model transformation, as illustrated in Figure 17: a transformation is the automatic generation of a target model from a source model, according to a transformation definition. A transformation definition is a set of transformation rules that together describe how a model in the source language can be transformed into a model in the target language. A transformation rule is a description of how one or more constructs in the source language can be transformed into one or more constructs in the target language.
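The definition above can be made concrete with a minimal sketch in Python. It is only an illustration under stated assumptions: the `Element` and `Rule` types, the `transform` function and the class-to-table rules are hypothetical and are not taken from Kleppe et al. or from any transformation language. A transformation definition is modelled as a list of rules, and each rule maps one source construct to one or more target constructs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Element:
    """A model element: a construct kind (from the meta-model) and a name."""
    kind: str
    name: str


@dataclass(frozen=True)
class Rule:
    """A transformation rule (hypothetical shape): the source construct it
    applies to and a function producing the target constructs."""
    applies_to: str
    produce: Callable[[Element], List[Element]]


def transform(source: List[Element], definition: List[Rule]) -> List[Element]:
    """Generate a target model by applying every matching rule to every
    source element. Elements matched by no rule yield nothing."""
    target: List[Element] = []
    for element in source:
        for rule in definition:
            if rule.applies_to == element.kind:
                target.extend(rule.produce(element))
    return target


# Hypothetical transformation definition: class model -> relational model.
definition = [
    Rule("Class", lambda e: [Element("Table", e.name)]),
    Rule("Attribute", lambda e: [Element("Column", e.name)]),
]

source_model = [
    Element("Class", "Person"),
    Element("Attribute", "age"),
    Element("Association", "owns"),  # no rule covers this construct
]

target_model = transform(source_model, definition)
```

In this sketch the `Association` element is silently dropped because no rule covers it, so the target model is only partial.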

Figure 17 Model Transformation Process

In order to transform models, they need to be expressed in some modeling language (e.g., UML for design models, and programming languages for source code models). The syntax and semantics of the modeling language itself are expressed by a meta-model (e.g., the UML meta-model). Based on the language in which the source and target models of a transformation are expressed, a distinction can be made between endogenous and exogenous transformations. Endogenous transformations are transformations between models expressed in the same language. Exogenous transformations are transformations between models expressed using different languages. This distinction is essentially the same as the one proposed in the Taxonomy of Program Transformation [113], but ported to a model transformation setting. In this taxonomy, the term rephrasing is used for an endogenous transformation, while the term translation is used for an exogenous transformation.

Typical examples of translation (i.e., exogenous transformation) are:

Synthesis of a higher-level, more abstract specification (e.g., an analysis or design model) into a lower-level, more concrete one (e.g., a model of a Java program). A typical example of synthesis is code generation, where the source code is translated into byte-code (that runs on a virtual machine) or executable code, or where the design models are translated into source code.

Reverse engineering, which is the inverse of synthesis and extracts a higher-level specification from a lower-level one.

Migration from a program written in one language to another, while keeping the same level of abstraction.

Typical examples of rephrasing (i.e., endogenous transformation) are:

Optimization, a transformation aimed at improving certain operational qualities (e.g., performance) while preserving the semantics of the software.

Refactoring, a change to the internal structure of software to improve certain software quality characteristics (such as understandability, modifiability, reusability, modularity, adaptability) without changing its observable behaviour.

As shown in Figure 18, there is a principal difference between exogenous and endogenous transformation. In an exogenous transformation, we transform the whole source model into an equivalent target model conforming to a different meta-model. In an endogenous transformation, however, there are two steps: the first consists of identifying the elements to transform in the source model, and the second consists of transforming these elements.

Endogenous transformations are principally related to maintenance activities (refactoring, performance improvement, etc.). In modern software development, maintenance accounts for the majority of the total cost and effort in a software project. Especially burdensome are those tasks which require applying a new technology in order to adapt an application to changed requirements or a different environment. The high cost of software maintenance could be reduced by automatically improving the design of object-oriented programs without altering their behaviour [64]. The potential benefit of automated adaptive maintenance tools is not limited to a single domain, but rather spans a broad spectrum of modern software development.
The primary concern of developers is to produce highly efficient and optimized code, capable of solving intense scientific and engineering problems in a minimal amount of time. One of the important issues in automated software maintenance is to propose automated tools that improve software quality. Indeed, in order to limit costs and improve the quality of their software systems, companies try to enforce good design and development practices and, similarly, to avoid bad ones. The underlying assumption is that good practices will produce good software. As a result, these practices have been studied by professionals and researchers alike, with special attention given to design-level problems. There has been much research on bad design practices, which the literature variously calls defects, antipatterns [61], smells [10], or anomalies [84]. Although bad practices are sometimes unavoidable, in most cases a development team should try to prevent them and remove them from the code base as early as possible. Thus, we define code transformation as the process of modifying the code in order to eliminate detected defects and improve the quality of the software. Hence, many fully automated detection and correction techniques have been proposed [89]. As in model transformation, the vast majority of existing work on design defect detection and correction is rule-based. Different rules identify key features that characterize anomalies using combinations of techniques such as metrics, structural analysis, and/or lexical information.

Figure 18 Automated Model-driven Engineering

Figure 2 summarizes the different tasks to automate in this thesis. We have detailed in this section the first part, about endogenous and exogenous transformations. The second part, about validating a transformation mechanism, is detailed in the next section.

Automated Testing Transformation

Because the specification of an automated model transformation can itself be erroneous, automated ways of verifying the correctness of a model transformation are needed. Indeed, the automated verification of model transformation results represents another important issue in automated model-driven engineering. If a transformation is not correct, it may inject errors into the system design. Thus, it is pertinent to have an upstream validation and verification process in order to detect errors as early as possible, rather than carrying them along through the rest of the process. Verification increases the reliability and the usability of model transformations [40]. Furthermore, automated verification may significantly reduce the duration, and ultimately the total cost, of performing a model transformation.

To validate transformation mechanisms, we distinguish between two main categories: formal verification and testing. For proving the correctness of a system model by formal verification, a large number of semi-automated tools exist, based on model checking or theorem proving [22], [3]. They can typically draw more general conclusions about a model by using theorem provers. However, their use requires a significant amount of mathematical expertise and user interaction (they are not fully automated). Model transformation testing typically consists of synthesizing a large number of different input models as test cases, running the transformation mechanism and verifying the result using an oracle function. In this context, two important issues must be addressed: the efficient generation/selection of test cases and the definition of the oracle function used to analyze the validity of transformed models. Unlike formal methods, testing is an approximate method; this is the main difference between the two categories. The definition of an oracle function for model transformation testing is a challenge [64], [89] and requires addressing many problems, as detailed in the next section.

1.2 Problem Statement

As shown in the previous section, we distinguish between three main problems.
Part 1: Automating model transformation

Problem 1.1: Most of the available work on model transformation is based on the hypothesis that transformation rules exist and that the important issue is how to express them. However, in real problems, the rules may be difficult to define; this is often the case when the source and/or target formalisms are not widely used or are proprietary. Indeed, as for any rule-based system, defining the set of rules is not an obvious task and the following obstacles may hinder the results:

(1) Incompleteness, or missing rules in the rule base. As a result, useful information cannot be derived from the rule base. In the context of model transformation, the result of incompleteness can be viewed as a partial generation of the target model.

(2) Inconsistency, or conflicting rules. Defining individual transformation rules is not a tedious task. However, ensuring coherency between the individual rules is not obvious and can be very difficult, given the dependencies between model elements while applying transformation rules.

(3) Redundancy, or the existence of duplicated (identical) or derivable (subsumed) rules in the rule base. Redundant rules not only increase the size of the rule base but may also cause useless additional inferences.

Problem 1.2: In the case of dynamic models (e.g., sequence diagram to colored Petri nets), the definition of transformation rules is more difficult. In addition to the problems mentioned previously, dynamic models must consider order (time sequencing) while transforming model elements (composition). Furthermore, in the case of dynamic models, the systematic use of rules generates target models that may need to be optimized in terms of size and structure.

Part 2: Endogenous transformation (design defects detection)

The next five problems concern design defect detection in the context of endogenous transformation.

Problem 2.1: There is no exhaustive list of all possible types of design defects. Although there has been significant work to classify defect types, programming practices, paradigms and languages evolve, making it unrealistic for any classification to permanently support the detection of all possible defect types. Furthermore, there might be company-specific or application-specific design practices.

Problem 2.2: For those design defects that are documented, there is no consensual definition of the symptoms and the severity of their impact on the code. Defects are generally described using natural language and their detection relies on the interpretation of the developers. This is a major setback for automation.

Problem 2.3: The majority of detection methods do not provide an efficient manner to guide the manual inspection of the candidate list. Potential defects are generally not listed in an order that helps developers address the most important ones first. There exist few works, such as that of Khomh et al. [89], where probabilities are used to order the results.

Problem 2.4: How to define thresholds when dealing with quantitative information? For example, Blob [8] detection involves information such as class size. Although we can measure the size of a class, an appropriate threshold value is not trivial to define. A class considered large in a given program/community of users could be considered average in another.

Problem 2.5: How to deal with the context? In some contexts, an apparent violation of a design principle is considered a consensual practice. For example, a class Log responsible for maintaining a log of events in a program, used by a large number of classes, is a common and acceptable practice. However, under a strict defect definition, it can be considered a class with abnormally large coupling.

Part 3: Testing model transformation

After defining a transformation mechanism, it is necessary to validate it. However, existing work has some limitations:

Problem 3.1: Current model-driven engineering (MDE) technologies and model repositories store and manipulate models as graphs of objects. Thus, when the expected output model is available, the oracle compares two graphs. In this case, the oracle definition problem has the same complexity as the graph isomorphism problem, for which no polynomial-time algorithm is known [121]. In particular, we can find a test case output and an expected model that look different (contain different model elements), but have the same meaning. So, the complexity of these data structures makes it difficult to provide an efficient and reliable tool for comparison.

Problem 3.2: The majority of existing work is based on constraint verification. The constraints are defined at the meta-model level and conditions are generally expressed in OCL. However, the number of constraints to define can be very large in order to cover all rules and patterns. This is especially the case for contracts related to one-to-many mappings. Moreover, being formal specifications, these constraints are difficult to write in practice [21].

Problem 3.3: Transformation errors can have different causes: the transformation logic (rules) or the source/target meta-models [23]. To be effective, a testing process should allow the identification of the cause of errors.

1.3 Contributions

To overcome the previously identified problems, we propose the following contributions, organized in three major parts:

Part 1: Model transformation by example

Contribution 1.1

We propose an approach for model transformation that does not use or produce transformation rules. We start from the premise that experts give transformation examples more easily than complete and consistent transformation rules. In the absence of rules or an exhaustive set of examples, an alternative solution is to derive a partial target model from the available examples. The generation of such a model consists of finding situations in the examples that best match the model to transform. Thus, we propose the alternative view of MT as an optimization problem where a (partial) target model can be automatically derived from available examples. For this, we introduce a search-based approach to automate MT called MOTOE (model transformation as optimization by examples) [79], [74].
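The view of transformation as an optimization problem over examples can be illustrated with a minimal sketch. It is not the MOTOE algorithm: the feature sets, the Jaccard similarity used as the quality measure and the coordinate-wise hill climbing used as the search are all assumptions chosen to keep the example short. A candidate solution assigns one example to each source element, its fitness is the total similarity between the elements and the source sides of their assigned examples, and the target model is read off the target sides.

```python
from typing import FrozenSet, List, Tuple

Features = FrozenSet[str]
Example = Tuple[Features, str]  # (source-side features, target construct)


def similarity(a: Features, b: Features) -> float:
    """Jaccard similarity between two feature sets (assumed quality measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fitness(model: List[Features], examples: List[Example],
            assignment: List[int]) -> float:
    """Conformity of a candidate: sum of element/example similarities."""
    return sum(similarity(model[i], examples[j][0])
               for i, j in enumerate(assignment))


def search(model: List[Features], examples: List[Example]) -> List[int]:
    """Coordinate-wise hill climbing: change one assignment at a time and
    keep the change only if the fitness strictly improves."""
    assignment = [0] * len(model)
    improved = True
    while improved:
        improved = False
        for i in range(len(model)):
            for j in range(len(examples)):
                candidate = assignment[:i] + [j] + assignment[i + 1:]
                if fitness(model, examples, candidate) > fitness(
                        model, examples, assignment):
                    assignment = candidate
                    improved = True
    return assignment


# Hypothetical base of examples and source model.
examples = [
    (frozenset({"class", "has_attributes"}), "table"),
    (frozenset({"association", "many_to_many"}), "join_table"),
    (frozenset({"attribute", "primitive_type"}), "column"),
]
model = [
    frozenset({"class", "has_attributes", "abstract"}),
    frozenset({"attribute", "primitive_type"}),
    frozenset({"association", "many_to_many"}),
]

best = search(model, examples)
target = [examples[j][1] for j in best]
print(target)  # ['table', 'column', 'join_table']
```

Note that the abstract class is still mapped to a table although no example matches it exactly, which reflects the idea of deriving a (partial) target model from the closest available examples.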

Contribution 1.2

We extend MOTOE to the case of dynamic model transformation, e.g., sequence diagram to colored Petri net (CPN). The primary goal is to add to Contribution 1.1 the constraint of temporal coherence during the transformation process. Another goal is to generate optimal target models (in terms of size) by using good example bases.

Part 2: Design Defects Detection by example

Contribution 2.1

In this effort, we view the detection of design defects as a problem that can be addressed by the detection-identification-response mechanisms of an artificial immune system (AIS), which uses the metaphor of a biological immune system. In both cases, known and unknown problems should be discovered. Instead of trying to find all possible infections, an immune system starts by detecting what is not normal. The more abnormal an element is, the riskier it is considered. This first phase is called discovery. After the risk has been assessed, the next phases consist of identifying whether the risk corresponds to a known category of problems and subsequently producing the proper response. Similarly, our contribution is built on the idea that the higher the dissimilarity between a code fragment and a reference (good) code, the higher the risk that this code constitutes a design defect. The efficiency of our approach is evaluated by studying the relationship between dissimilarity and risk for different open-source projects.

Contribution 2.2

We propose another solution that uses examples of manually found design defects to derive detection rules. Such examples are in general available as documents produced as part of the maintenance activity (version control logs, incident reports, inspection reports, etc.). The use of examples allows the derivation of rules that are specific to a particular company rather than rules that are supposed to be applicable to any context. This includes the definition of thresholds that correspond to the company's best practices.
Learning from examples also aims at reducing the list of detected defect candidates. Our approach finds detection rules automatically, thus relieving the designer from doing so manually. Rules are defined as combinations of metrics/thresholds that best conform to known instances of design defects (defect examples). In our setting, we use a music-inspired algorithm [56] for rule extraction. We evaluate our approach by finding potential defects in three different open-source systems.

Part 3: Testing Transformation by example

Contribution 3

We also adapt the by-example approach based on the immune system metaphor to automate the testing of transformation mechanisms. We propose an oracle function that compares target test cases to the elements of a base of examples containing good-quality transformation traces, and then assigns a risk level to the former, based on the dissimilarity between the two as determined by an AIS algorithm. As a result, one no longer needs to define an expected model for each test case, and the traceability links help the tester understand the origin of an error. Furthermore, the detected faults are ordered by degree of risk to help the tester perform further analysis. For this, a custom tool was developed to visualize the risky fragments found in the test cases with different colors, each related to an obtained risk score. We illustrate and evaluate our approach with different transformation mechanisms.

1.4 Roadmap

The remainder of this dissertation is organized as follows. Chapter 2 reviews related work on model transformation, design defect detection, transformation testing, by-example software engineering and search-based software engineering. Chapter 3 reports our contribution for automating model transformation using examples and search-based techniques. We present our Software and System Modeling journal paper [79], which illustrates our approach for the case of static transformation. For dynamic transformation, our European Conference on Modelling Foundations and Applications paper [74] illustrates the application of our approach to the transformation of sequence diagrams to colored Petri nets. Chapter 4 presents our approach to design defects detection based on an immune system metaphor.
This contribution is illustrated via our Automated Software Engineering conference paper [82]. Chapter 5 details our contribution for design defects rules generation. We present in this chapter our European Conference on Software Maintenance and Reengineering paper [80]. Chapter 6 describes our contribution on testing transformation mechanisms by example [78], which is the subject of a paper accepted in the Journal of Automated Software Engineering. Chapter 7 presents the conclusions of this dissertation and outlines some directions for future research.
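The dissimilarity-to-risk idea behind Contributions 2.1 and 3 can be illustrated with a minimal sketch. It covers only that idea: it does not generate detectors as an artificial immune system would, and the metric vectors, the Euclidean distance and the names `Order` and `GodManager` are hypothetical. Each fragment is described by a vector of normalized metrics, its risk is its distance to the nearest reference (good) example, and fragments are ranked from most to least risky.

```python
import math
from typing import Dict, List, Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two metric vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def risk(fragment: Sequence[float],
         reference: List[Sequence[float]]) -> float:
    """Risk of a fragment: distance to the closest reference example.
    A fragment identical to a reference example has risk 0."""
    return min(distance(fragment, good) for good in reference)


# Hypothetical reference base: normalized (size, coupling, cohesion) vectors
# taken from code considered to follow good practices.
reference = [(0.2, 0.1, 0.3), (0.3, 0.2, 0.2)]

# Hypothetical fragments to assess.
fragments: Dict[str, Sequence[float]] = {
    "Order": (0.25, 0.15, 0.25),
    "GodManager": (0.9, 0.8, 0.1),
}

# Rank fragments so that the riskiest ones are inspected first.
ranked = sorted(fragments,
                key=lambda name: risk(fragments[name], reference),
                reverse=True)
print(ranked)  # ['GodManager', 'Order']
```

Ordering the candidates by risk is what lets a developer or tester address the most suspicious fragments first, without a fixed threshold on any single metric.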

Chapter 2: Related Work

This chapter gives an overview of the main work related to this thesis. The work proposed in this thesis crosscuts four research areas: (1) endogenous and exogenous transformations; (2) correctness of model transformation; (3) by-example software engineering; and (4) search-based software engineering. The chapter provides a survey of existing work in these four areas and identifies the limitations that are addressed by our contributions. The structure of the chapter is as follows. Section 2.1 summarises existing work in model transformation, including endogenous and exogenous transformations. We identify different criteria to classify them and we focus on by-example approaches. Section 2.2 discusses the state of the art in validating transformation mechanisms; Section 2.3 describes work based on the use of examples; Section 2.4 provides a description of leading work in search-based software engineering.

2.1 Model Transformation

Model transformation programs take as input a model conforming to a given source meta-model and produce as output another model conforming to a target meta-model. The transformation program, composed of a set of rules, should itself be considered a model. Consequently, it has a corresponding meta-model that is an abstract definition of the transformation language used. As previously stated, we distinguish between two model transformation categories: (1) exogenous transformations, in which the source and target meta-models are not the same, e.g., transforming a UML class diagram to Java code, and (2) endogenous transformations, in which the source and target meta-models are the same, e.g., refactoring a UML class diagram or code.

Exogenous transformations are used to exploit the constructive nature of models in terms of vertical transformations, thereby changing the level of abstraction and building the bases for code generation, and also to allow horizontal transformation of models that are at the same level of abstraction [13]. Horizontal transformations are of specific interest to realize different integration scenarios such as model translation, e.g., translating a relational schema (RS) model into a UML class model.

In contrast to exogenous transformations, where all source model elements must be transformed to their equivalents in the target model, we distinguish two steps in endogenous transformations. The first step is the identification of the source model elements (only some model fragments) to transform, and the second step is the transformation itself. In most cases, endogenous transformations correspond to model refactoring, where the input and output meta-models are the same. In this case, the first step is the detection of refactoring opportunities (e.g., design defects), and the second one is the application of refactoring operations (transformation).

We now describe existing work according to these two categories: endogenous and exogenous transformation. For endogenous transformation, we focus on refactoring activities.

Exogenous Transformation Classification and Languages

In the following, a classification of exogenous transformation approaches is briefly reported. Then, some of the available exogenous transformation languages are separately described. The classification is mainly based upon [110] and [13].

Several exogenous transformation approaches have been proposed in the literature. In the following, the classifications of model-to-model exogenous transformation approaches discussed by Czarnecki and Helsen [110] are described:

Direct manipulation approach. It offers an internal model representation and some APIs to manipulate it. It is usually implemented as an object-oriented framework, which may also provide some minimal infrastructure. Users have to implement transformation rules, scheduling, tracing and other facilities in a programming language. An example of a tool used in direct manipulation approaches is Builder Object Network (BON), a framework which is relatively easy to use and is still powerful enough for most applications. BON provides a network of C++ objects, together with navigation and update capabilities for models, using C++ for direct manipulation.

Operational approach. It is similar to direct manipulation, but offers more dedicated support for model transformation. A typical solution in this category is to extend the meta-modeling formalism used with facilities for expressing computations. An example would be to extend a query language such as OCL with imperative constructs. Examples of systems in this category are Embedded Constraint Language (ECL) [50], QVT Operational mappings [91], XMF [122], MTL [26] and Kermeta [49].

Relational approach. It groups declarative approaches in which the main concept is mathematical relations. In general, relational approaches can be seen as a form of constraint solving. The basic idea is to specify the relations among source and target element types using constraints that, in general, are non-executable. However, the declarative constraints can be given executable semantics, such as in logic programming, where predicates can describe the relations. All of the relational approaches are side-effect free and, in contrast to the imperative direct manipulation approaches, create target elements implicitly. Relational approaches can naturally support multidirectional rules. They sometimes also provide backtracking.
Most relational approaches require strict separation between source and target models, that is, they do not allow in-place update. Examples of relational approaches are QVT Relations and ATL [36]. Moreover, in [14] the application of logic programming has been explored for this purpose.

Graph-transformation based approaches. They exploit theoretical work on graph transformations and require that the source and target models be given as graphs. Performing model transformation by graph transformation means taking the abstract syntax graph of a model and transforming it according to certain transformation rules. The result is the syntax graph of the target model. More precisely, graph transformation rules have an LHS and an RHS graph pattern. The LHS pattern is matched in the model being transformed and replaced by the RHS pattern in place. In particular, the LHS represents the pre-conditions of the given rule, while the RHS describes the post-conditions. $LHS \cap RHS$ defines a part which has to exist to apply the rule, but which is not changed. $LHS \setminus (LHS \cap RHS)$ defines the part which shall be deleted, and $RHS \setminus (LHS \cap RHS)$ defines the part to be created. The LHS often contains conditions in addition to the LHS pattern, for example, negative conditions. Some additional logic is needed to compute target attribute values such as element names. GReAT [38] and AToM3 [54] are systems directly implementing the theoretical approach to attributed graphs and transformations on such graphs. They have built-in fixed-point scheduling with non-deterministic rule selection and concurrent application to all matching locations.

Mens et al. [13] provide a taxonomy of model transformations. One of the main differences with the previous taxonomy is that Czarnecki and Helsen propose a hierarchical classification based on feature diagrams, while the taxonomy of Mens et al. is essentially multi-dimensional. Another important difference is that Czarnecki and Helsen classify the specification of model transformations, whereas the taxonomy of Mens et al. is more targeted towards tools, techniques and formalisms supporting the activity of model transformation.

For these different categories, many languages and tools have been proposed to specify and execute exogenous transformation programs. In 2002, OMG issued the Query/View/Transformation request for proposal [91] to define a standard transformation language. Even though a final specification was adopted at the end of 2008, the area of model transformation continues to be a subject of intense research. In recent years, in parallel to the OMG effort, a number of model transformation approaches have been proposed by both academia and industry. They can be distinguished by the paradigms, constructs and modeling approaches they use, their tool support, and their suitability for given problems. We briefly describe next some well-known languages and tools.
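As a brief aside before the individual tools, the LHS/RHS decomposition of graph transformation rules described above can be made concrete with a small sketch. It is a simplification under stated assumptions: graph elements are plain string identifiers matched by identity (no pattern matching or graph morphisms, no negative conditions), and the function names and the field-encapsulation rule are hypothetical. The part shared by both patterns is preserved, the rest of the LHS is deleted, and the rest of the RHS is created.

```python
from typing import FrozenSet, Tuple

Graph = FrozenSet[str]  # nodes and edges encoded as identifiers (assumption)


def rule_parts(lhs: Graph, rhs: Graph) -> Tuple[Graph, Graph, Graph]:
    """Split a rule into its preserved, deleted and created parts."""
    preserved = lhs & rhs        # must exist, left unchanged
    deleted = lhs - preserved    # in the LHS only
    created = rhs - preserved    # in the RHS only
    return preserved, deleted, created


def apply_rule(graph: Graph, lhs: Graph, rhs: Graph) -> Graph:
    """Apply the rule in place if the whole LHS occurs in the graph;
    otherwise return the graph unchanged."""
    if not lhs <= graph:
        return graph
    _, deleted, created = rule_parts(lhs, rhs)
    return (graph - deleted) | created


# Hypothetical rule: encapsulate the public field Person.age.
lhs = frozenset({"attr:Person.age", "public:Person.age"})
rhs = frozenset({"attr:Person.age", "private:Person.age",
                 "getter:Person.getAge"})

graph = frozenset({"class:Person", "attr:Person.age", "public:Person.age"})
result = apply_rule(graph, lhs, rhs)
print(sorted(result))
# ['attr:Person.age', 'class:Person', 'getter:Person.getAge', 'private:Person.age']
```

Real graph transformation engines match the LHS as a pattern at every location where it occurs; matching by identity here only keeps the set arithmetic visible.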

ATL (ATLAS Transformation Language) [35] is a hybrid model transformation language that contains a mixture of declarative and imperative constructs. The declarative part handles simple model transformations, while the imperative part helps in coping with transformations of higher complexity. ATL transformations are unidirectional, operating on read-only source models and producing write-only target models. During the execution of a transformation, source models may be navigated, but changes to them are not allowed. Transformation definitions in ATL form modules. A module contains a mandatory header section, an import section, and a number of helpers and transformation rules. There is an associated ATL Development Toolkit available as open source from the GMT Eclipse Modeling Project [28]. A large library of transformations is available at [15], [43].

GReAT [1] (Graph Rewriting and Transformation Language) is a meta-model-based graph transformation language that supports the high-level specification of complex model transformation programs. In this language, one describes the transformations as sequenced graph rewriting rules that operate on the input models and construct an output model. The rules specify complex rewriting operations in the form of a matching pattern and a subgraph to be created as the result of the application of a rule. The rules (1) always operate in a context that is a specific subgraph of the input, and (2) are explicitly sequenced for efficient execution. The rules are specified visually using a graphical model builder tool called GME [2].

AGG is a development environment for attributed graph transformation systems that supports an algebraic approach to graph transformation. It aims at the specification and rapid prototyping of applications with complex, graph-structured data. AGG supports typed graph transformations including type inheritance and multiplicities.
It may be used (implicitly in code) as a general-purpose graph transformation engine in high-level Java applications employing graph transformation methods. The source, target, and common meta-models are represented by type graphs. Graphs may additionally be attributed using Java code. Model transformations are specified by graph rewriting rules that are applied non-deterministically until none of them can be applied anymore. If an explicit application order is required, rules can be grouped in ordered layers. AGG features rules with negative application conditions to specify patterns that prevent rule executions. Finally, AGG offers validation support, namely consistency checking of graphs and graph transformation systems according to graph constraints, critical pair analysis to find conflicts between rules (which could lead to a non-deterministic result), and checking of termination criteria for graph transformation systems. The available tool support provides graphical editors for graphs and rules and an integrated textual editor for Java expressions. Moreover, visual interpretation and validation are supported.

VIATRA2 [118] is an Eclipse-based general-purpose model transformation engineering framework intended to support the entire life-cycle for the specification, design, execution, validation and maintenance of transformations within and between various modelling languages and domains. Its rule specification language is a unidirectional transformation language based mainly on graph transformation techniques. More precisely, the basic concept in defining model transformations within VIATRA2 is the (graph) pattern. A pattern is a collection of model elements arranged into a certain structure fulfilling additional constraints (as defined by attribute conditions or other patterns). Patterns can be matched on certain model instances, and upon successful pattern matching, elementary model manipulation is specified by graph transformation rules. There is no predefined order of execution of the transformation rules. Graph transformation rules are assembled into complex model transformations by abstract state machine rules, which provide a set of commonly used imperative control structures with precise semantics.
VIATRA2 is a hybrid language since the transformation rule language is declarative, but the rules cannot be executed without an execution strategy specified in an imperative manner. Important specification features of VIATRA2 include recursive (graph) patterns, negative patterns with arbitrary depth of negation, and generic and meta-transformations (type parameters, rules manipulating other rules) for providing reuse of transformations [118].

A conclusion to be drawn from studying the existing exogenous transformation approaches, tools and techniques is that they are often based on empirically obtained rules [5]. In fact, the traditional and common approach toward implementing model transformations is to specify the transformation rules and automate the transformation process by using an executable model transformation language. Although most of these languages are already powerful enough to implement large-scale and complex model transformation tasks, they may present challenges to users, particularly to those who are unfamiliar with a specific transformation language. Firstly, even though declarative expressions are supported in most model transformation languages, they may not be at the proper level of abstraction for an end-user, and may result in a steep learning curve and high training cost. Moreover, the transformation rules are usually defined at the meta-model level, which requires a clear and deep understanding of the abstract syntax and semantic interrelationships between the source and target models. In some cases, domain concepts may be hidden in the meta-model and difficult to unveil (e.g., some concepts are hidden in attributes or association ends, rather than being represented as first-class entities). These implicit concepts make writing transformation rules challenging. Thus, the difficulty of specifying transformation rules at the meta-model level and the associated learning curve may prevent some domain experts from building model transformations for which they have extensive domain experience. To address these challenges inherent in using model transformation languages, an innovative approach called Model Transformation By Example (MTBE) has been proposed; it is described in the next section.

Model Transformation by Example

The commonalities of the by-example approaches for transformation can be summarized as follows: All approaches define an example as a triple consisting of an input model and its equivalent output model, and traces between the input and output model elements.
These examples have to be established by the user, preferably in concrete syntax. Then, generalization techniques such as hard-coded reasoning rules, inductive logic, or relational concept analysis are used to derive model transformation rules from the examples, in a deterministic way that is applicable for all possible input models which have a high similarity with the predefined examples.

Varró and Balogh [23] propose a semi-automated process for MTBE using Inductive Logic Programming (ILP). The principle of their approach is to derive transformation rules semi-automatically from an initial prototypical set of interrelated source and target models. Another similar work is that of Wimmer et al. [31], who derive ATL transformation rules from examples of business process models. Both contributions use semantic correspondences between models to derive rules. Their differences include the fact that [31] presents an object-based approach that finally derives ATL rules for model transformation, while [41] derives graph transformation rules. Another difference is that they respectively use abstract versus concrete syntax: Varró uses ILP, whereas Wimmer et al. rely on an ad hoc technique. Both approaches are heavily dependent on the source and target formalisms.

Another similar approach is that of Dolques et al. [123], which aims to alleviate the writing of transformations, and where engineers only need to handle models in their usual (concrete) syntax and to describe the main cases of a transformation, namely the examples. A transformation example includes the source model, the target model and trace links that make explicit how elements from the source model are transformed into elements of the target model. The transformation rules are generated from the transformation traces, using formal concept analysis extended by relations, and they are classified through a lattice that helps navigation and choice. This approach requires the examples to cover all the transformation possibilities and it is only applicable for one-to-one transformations.
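To make the notion of an example as a triple (source model, target model, traces) more concrete, the following Python sketch derives one-to-one, type-level mapping rules from trace links. It is only an illustration under stated assumptions: the generalization step is a plain frequency count over traces, not the inductive logic programming or relational concept analysis used by the cited approaches, and the names `Example`, `derive_rules` and `transform`, as well as the class-diagram-to-relational-schema data, are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Example:
    source: Dict[str, str]          # source element id -> source meta-class
    target: Dict[str, str]          # target element id -> target meta-class
    traces: List[Tuple[str, str]]   # (source element id, target element id)


def derive_rules(examples: List[Example]) -> Dict[str, str]:
    """Map each source meta-class to the target meta-class it is most often traced to."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for ex in examples:
        for src_id, tgt_id in ex.traces:
            votes[ex.source[src_id]][ex.target[tgt_id]] += 1
    return {src_type: counts.most_common(1)[0][0]
            for src_type, counts in votes.items()}


def transform(rules: Dict[str, str], source: Dict[str, str]) -> Dict[str, str]:
    """Apply the derived rules; elements whose type no example covers are skipped."""
    return {eid: rules[etype] for eid, etype in source.items() if etype in rules}


examples = [
    Example(
        source={"c1": "Class", "a1": "Attribute", "a2": "Attribute"},
        target={"t1": "Table", "k1": "Column", "k2": "Column"},
        traces=[("c1", "t1"), ("a1", "k1"), ("a2", "k2")],
    ),
]
rules = derive_rules(examples)
new_model = transform(rules, {"c9": "Class", "a9": "Attribute", "x9": "Association"})
```

The `Association` element of the new model is left untransformed because no example covers it, which mirrors the requirement that the examples cover all the transformation possibilities.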
Recently, a similar approach to MTBE, called Model Transformation by Demonstration (MTBD), was proposed [124]. Instead of the MTBE idea of inferring the rules from a prototypical set of mappings, users are asked to demonstrate how the model transformation (MT) should be done, through direct editing (e.g., add, delete, connect, update) of the source model, so as to simulate the transformation process. A recording and inference engine was developed, as part of a prototype called MT-Scribe, to capture user operations and infer a user's intention during an MT task. A transformation pattern is then generated from the inference, specifying the preconditions of the transformation and the sequence of operations needed to realize the transformation. This pattern can be reused by automatically matching the preconditions in a new model instance and replaying the necessary operations to simulate the MT process. However, this approach needs a large number of simulated patterns to be efficient and it requires a high level of user intervention. In fact, the user must choose the suitable transformation pattern. Finally, the authors do not show how MTBD can be useful to transform an entire source model and only provide examples of transforming model fragments. On the other hand, the MTBD approach, in contrast to the other by-example approaches, is applied to endogenous transformations.

Another very similar by-demonstration approach was proposed by Langer et al. [97]. Whereas Sun et al. use the recorded fragments directly, Langer et al. use them to generate ATL rules. Another difference is that the approach of Langer et al. is related to exogenous transformation.

Brosch et al. [96] introduced a tool for defining composite operations, such as refactorings, for software models in a user-friendly way. This by-example approach spares modelers from having to acquire deep knowledge about the meta-model and dedicated model transformation languages. However, this tool is only able to apply refactoring operations; it does not detect them automatically.

The commonalities of the by-example approaches for exogenous transformation can be summarized as follows: All approaches define an example as a triple consisting of an input model and its equivalent output model, and traces between the input and output model elements. The examples have to be established by the user, preferably in concrete syntax.
Then, generalization techniques such as hard-coded reasoning rules, inductive logic or relational concept analysis are used to derive model transformation rules from the examples, in a deterministic way that is applicable to all possible input models which have a high similarity with the predefined examples.

None of the mentioned approaches claims that the generation of the model transformation rules is correct or complete. In particular, all approaches explicitly state that some complex parts of the transformation involving complex queries, attribute calculations such as aggregation of values, non-deterministic transformations, and counting of elements have to be developed by the user, by changing the generated model transformations. Furthermore, the approaches recommend developing the model transformations using an iterative methodology. This means that, after generating the transformations from initial examples, the examples must be adjusted or the transformation rules changed if the user is not satisfied with the outcome. However, in most cases, deciding that the examples or the transformation rules need changing is not an obvious process for the user.

Traceability-based Model Transformation

Some other meta-model matching works can also be considered as variants of by-example approaches. Garcia-Magarino et al. [46] propose an approach to generate transformation rules between two meta-models that satisfy some constraints introduced manually by the developer. In [47], the authors propose to automatically capture some transformation patterns in order to generate matching rules at the meta-model level. This approach is similar to MTBD, but it is used at the meta-model level.

Most current transformation languages [66], [37], [58] build an internal traceability model that can be interrogated at execution time, for example, to check if a target element was already created for a given source element. This approach is specific to each transformation language and sometimes to the individual transformation specification. The language determines the traceability meta-model and the transformation specification determines the label of the traces (in the case of QVT/Relational, the traceability meta-model is deduced from the transformation specification). The approach taken only provides access to the traces produced within the scope of the current transformation. Marvie describes a transformation composition framework [100] that allows manual creation of linkings (traces).
These linkings can then be accessed by subsequent transformations, although this is limited to searching specific traces by name, introducing tight coupling between sub-transformations.

Endogenous Transformation

In contrast to exogenous transformations, where all source model elements must be transformed to their equivalents in the target model, we distinguish two steps in endogenous transformations. The first step is the identification of the source model elements (only some model fragments) to transform, and the second step is the transformation itself. In most cases, endogenous transformations correspond to model refactoring, where the input and output meta-models are the same. In this case, the first step is the detection of refactoring opportunities (e.g., design defects) and the second one is the application of refactoring operations (transformation).

In this thesis, we focus on program-code transformation, which represents the major part of existing work in endogenous transformation. Code transformation can be performed as model transformation. In fact, a programming language has a defined meta-model (for example, Java) and a program can be considered as an instance of this meta-model. Given that all code transformations can be performed as model transformations, one can classify the source and target models of a transformation in terms of their structure.

Code transformation has applications in many areas of software engineering such as compilation, optimization, refactoring, program synthesis, software renovation, and reverse engineering. The aim of code transformation is to increase programmer productivity by automating programming tasks, thus enabling programming at a higher level of abstraction, and increasing maintainability and reusability. In our work, we are interested in code transformation as the identification and correction of design defects in code using refactoring.
The term refactoring, introduced by Opdyke in his PhD thesis [90], refers to "the process of changing an [object-oriented] software system in such a way that it does not alter the external behaviour of the code, yet improves its internal structure". Refactoring can be regarded as the object-oriented equivalent of restructuring, which is defined by Chikofsky and Cross [31] as "the transformation from one representation form to another at the same relative abstraction level, while preserving the subject system's external behaviour (functionality and semantics). [...] it does not normally involve modifications because of new requirements. However, it may lead to better observations of the subject system that suggest changes to improve aspects of the system."

In other words, the refactoring process consists of a number of activities: (1) identify where the software should be refactored; (2) determine which refactorings should be applied to the identified places; (3) guarantee that the applied refactoring preserves behaviour; (4) apply the refactoring; (5) assess the effect of refactoring on software quality characteristics; (6) maintain consistency between refactored program code and other software artifacts (or vice versa). Each of these activities could be automated to a certain extent.

Several studies have recently focused on the detection and correction (by applying refactorings) of design defects in software using different techniques. These techniques range from fully automatic detection techniques [99], [89] to manual inspection techniques. The existing work can be separated into three broad categories: metric-based approaches, visualization-based techniques, and correction opportunity-based approaches (including graph transformation).

Metric-based Approaches

Marinescu [99] defined a list of rules based on metrics to detect design flaws of OO design at method, class and subsystem levels. However, the choice of the metrics to use and the proper threshold values for those metrics are not addressed explicitly in his research. Erni et al. [31] introduce the concept of multi-metrics, as an n-tuple of metrics expressing a quality criterion (e.g., modularity).
Unfortunately, multi-metrics neither encapsulate metrics in a more abstract construct, nor do they allow a flexible combination of metrics. Alikacem et al. [64] express metrics in a generic manner based on fuzzy logic rules. However, they use their technique only for rule activation to detect a defect and not to estimate the probability of design defect occurrence.

In general, many limitations are related to the use of metrics. The use of specific metrics does not consider context, and the related rules need to be adapted by hand to the context of use. Even for a single system, this task can be costly because of constant evolution. Another issue is the different interpretations of defect definitions by analysts. A final problem is the use of threshold values. Different systems can follow different development practices. Consequently, different thresholds might apply.

These issues were partially addressed by Moha et al. [88] in their framework DECOR. They automatically convert high-level defect specifications into detection algorithms. Theoretically, the exact metrics used for the detection could vary, but this issue has hardly been studied in practice. This explains the high number of false positives they detected. An additional problem is that the detected defects are not ordered. This implies that a maintainer does not have a clear idea of which possible defects should be inspected first. Khomh et al. [89] extended DECOR to support uncertainty in smell detection: they used Bayesian belief networks (BBNs) to implement rules from DECOR. The output of their model is the probability that a class is an occurrence of a design defect. Although the technique allows ranking of the results (by probability), it still suffers from the problem of selecting specific metrics to conduct a detection.

Visualization-based Techniques

Visualization-based defect detection has been proposed to take advantage of the expertise of analysts. Visualization is considered to be a semi-automatic technique since the information is automatically extracted and then presented to an analyst for interpretation. Kothari et al.
[60] present a pattern-based framework for developing tool support to detect software anomalies by representing potential defects with different colors using a specific metaphor. Dhambri et al. [57] propose a visualization-based approach to detect design anomalies for cases where the detection effort already includes the validation of candidates. However, these approaches need a lot of human intervention and expertise. Their results show that by using visualization, instead of directly using metrics, the anomaly detection process suffers from fewer variations between maintainers, but the detection results are the same.

Correction Opportunity-based Approaches

The authors in [73] introduce the concept of considering defect detection as an optimization problem; they use a combination of 12 metrics to measure the improvements achieved when methods are moved between classes. A fitness function (score) is computed by applying the sequence of transformations to the program at hand and by measuring the improvement in the metrics of interest [69]. Indeed, this search-based approach combines the detection and correction steps because a refactoring opportunity is detected if a randomly selected correction improves the design quality. As a result, the order of detected defects is related to the quality of improvements (difference in fitness). Furthermore, the problems mentioned before for metrics still apply to search-based techniques since they use a fitness function that consists of a combination of metrics.

Graph transformations can lead to an underlying theory of refactoring [107] where each refactoring corresponds to a graph production rule, and each refactoring application corresponds to a graph transformation. The theory of graph transformation can be used to reason about applying refactorings in parallel, using theoretical concepts such as confluence and critical pair analysis. These categories of approaches combine the identification of the code to refactor with the choice of refactorings to apply. In them, programs can be expressed as graphs, and refactorings correspond to graph production rules or graph transformations.
Mens et al. [64] use a graph rewriting formalism to prove that refactorings preserve certain kinds of relationships (updates, accesses and invocations) that can be inferred statically from the source code. Bottoni et al. [110] describe refactorings as coordinated graph transformation schemes in order to maintain consistency between a program and its design when either of them evolves by a refactoring. Heckel [102] uses graph transformations to formally prove the claim (and corresponding algorithm) of Roberts [27] that any set of refactoring post-conditions can be translated into an equivalent set of preconditions. Van Eetvelde and Janssens [116] propose a hierarchical graph transformation approach to be able to view and manipulate the software and its refactorings at different levels of detail.

2.2 Correctness of Model Transformation

Correctness of model transformations can be analyzed from different perspectives. Existing works can be classified into two categories: formal verification-based approaches and testing approaches. We start by describing existing work in the first category.

Baleani et al. argue in [86] that correctness of model transformations for industrial tools should be based on formal models in order to ensure correctness by construction. For this purpose, they suggest using a block diagram formalism called the synchronous reactive model of computation. However, a correct interpretation of the model transformation rules does not imply a correct result, that is, one that is a model of the target language. Semantic correctness is discussed by Karsai et al. in [59], where specific behavior properties of the source model shall be reflected in the target model. In [42], semantic correctness is ensured by using the same rules for the model transformation and for the transformation of the operational semantics, which is given by graph rules. By doing this, the behaviour of the source model can be compared with that of the target model by checking mixed confluence. However, that work concentrates on syntactical correctness based on the integrated language generated by the triple rules. Graph transformation rules have also been used to specify the dynamic behavior of systems [9], [118].
For example, [118] presents a meta-level analysis technique where the semantics of a modeling language are defined using graph transformation rules. A transition system is generated for each instance model, which can be verified using a model checker. Furthermore, [9] verifies if a transformation preserves certain dynamic consistency properties by model checking the source and target models for properties p and q, where property p in the source language is transformed into property q in the target language. This transformation requires validation by a human expert.

Especially in the area of graph transformations, some work has been conducted that uses Petri nets to check formal properties of graph production rules. The approach proposed in [105] translates individual graph rules into a Place/Transition net and checks for its termination. Another approach is described in [117], where the operational semantics of a visual language in the domain of production systems are described with graph transformations. Varró presents in [10] a translation of graph transformation rules to transition systems, serving as the mathematical specification formalism of various model checkers to achieve the formal verification of model transformation. Only the dynamic parts of the graph transformation systems are transformed to transition systems in order to reduce the state space.

In [40], a simple error taxonomy for model transformations is presented, which is then used to automatically generate test cases for model transformations. A very similar approach is presented by Darabos et al. in [10], focusing on common errors in graph transformation languages in general, and on errors in the graph pattern matching phase in particular. Both taxonomies are, however, rather general and only describe possible errors in graph transformation specifications.

After studying the existing work in formal verification-based approaches, we can conclude that one of its important problems is that the results of a formal analysis can be invalidated by erroneous model transformations. In fact, the systems' engineers cannot distinguish whether an error is in the design or in the transformation.
Furthermore, existing work requires a significant amount of mathematical expertise and user interaction (not fully automated). In addition, the existing work based on model checking and graph transformation does not combine the syntactic and semantic correctness of model transformations in one approach. For a syntactic correctness analysis, one has to decide whether the result of the transformation is a well-formed model of the target language. In case of semantic correctness analysis, we need to decide if the model transformation preserves (transformation specific) correctness properties.
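As an illustration of what a syntactic correctness analysis has to decide, the following Python sketch checks whether an output model is well-formed with respect to a target meta-model. It is a deliberately simplified sketch under assumptions of our own: the meta-model only lists element types and the type expected at the end of each reference, and the names `well_formedness_errors`, `relational`, `good` and `bad` are hypothetical. Real meta-models also carry multiplicities, attributes and additional constraints that are ignored here.

```python
from typing import Dict, List, Tuple

# Hypothetical meta-model: element type -> {reference name: expected target type}
MetaModel = Dict[str, Dict[str, str]]
# A model: element id -> (element type, {reference name: target element id})
Model = Dict[str, Tuple[str, Dict[str, str]]]


def well_formedness_errors(model: Model, metamodel: MetaModel) -> List[str]:
    """List the violations that make `model` an ill-formed instance of `metamodel`."""
    errors = []
    for eid, (etype, refs) in model.items():
        if etype not in metamodel:
            errors.append(f"{eid}: unknown type {etype}")
            continue
        for ref, target in refs.items():
            expected = metamodel[etype].get(ref)
            if expected is None:
                errors.append(f"{eid}: reference {ref} not allowed on {etype}")
            elif target not in model:
                errors.append(f"{eid}: reference {ref} points to missing element {target}")
            elif model[target][0] != expected:
                errors.append(f"{eid}: reference {ref} must point to a {expected}")
    return errors


# Toy relational meta-model: a Column refers to the Table that owns it.
relational: MetaModel = {"Table": {}, "Column": {"table": "Table"}}

good: Model = {"t1": ("Table", {}), "c1": ("Column", {"table": "t1"})}
bad: Model = {"t1": ("Table", {}),
              "c1": ("Column", {"table": "c1"}),   # points to a Column, not a Table
              "v1": ("View", {})}                  # type absent from the meta-model

good_errors = well_formedness_errors(good, relational)
bad_errors = well_formedness_errors(bad, relational)
```

A check of this kind says nothing about semantic correctness: a well-formed output model may still fail to preserve the transformation-specific properties of the source model.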

Many works exist on model transformation testing [125], [32]. Fleurey et al. [34] and Steel et al. [14] discuss the reasons why testing model transformations is distinct from testing traditional implementations: the input data are models that are complex when compared to simple data types. Both papers describe how to generate test data in MDA by adapting existing techniques, including functional criteria and bacteriologic approaches [14]. Lin et al. [125] propose a testing framework for model transformation, built on their modeling tools and transformation engine, that offers a support tool for test case construction, test execution and test comparison; however, the test models are manually developed in their work.

One of the widely used techniques for test generation is mutation analysis. Mutation analysis is a testing technique that was designed to evaluate the efficiency of a test set. It consists of creating a set of faulty versions, or mutants, of a program with the ultimate goal of designing a test set that distinguishes the program from all its mutants. Mottu et al. [113] have adapted this technique to evaluate the quality of test cases. They introduce some modifications in the transformation rules (program mutants). Then, using the same test cases as input, an oracle function compares the results (target models). If all results are the same, we can assume that the input cases were not sufficient to cover all the transformation possibilities (rules). In contrast to that work, our goal is not to evaluate the quality of a data set but to propose a generic oracle function to detect transformation errors. Our oracle function compares some potential errors (detectors) with the transformation traces to evaluate. In mutation analysis, however, the oracle function compares two target models, one generated by the original mechanism (rules) and another generated after modifying the rules.
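The mutation analysis procedure described above can be sketched in a few lines of Python. The sketch is an assumption-laden toy, not the implementation of Mottu et al.: a transformation is modelled as a type-to-type mapping applied to each element, a mutant is the same mapping with one rule altered or removed, and the names `make_transformation` and `mutation_score` are hypothetical.

```python
from typing import Callable, Dict, List

Model = Dict[str, str]                      # element id -> type name
Transformation = Callable[[Model], Model]


def make_transformation(mapping: Dict[str, str]) -> Transformation:
    """Build a toy transformation that rewrites each element's type via `mapping`."""
    def run(model: Model) -> Model:
        return {eid: mapping[t] for eid, t in model.items() if t in mapping}
    return run


def mutation_score(original: Transformation,
                   mutants: List[Transformation],
                   tests: List[Model]) -> float:
    """Fraction of mutants whose output differs from the original on some test model."""
    killed = sum(
        any(mutant(test) != original(test) for test in tests)
        for mutant in mutants
    )
    return killed / len(mutants)


original = make_transformation({"Class": "Table", "Attribute": "Column"})
mutants = [
    make_transformation({"Class": "Table", "Attribute": "Table"}),  # wrong target type
    make_transformation({"Attribute": "Column"}),                   # Class rule removed
]

weak_tests = [{"c1": "Class"}]                       # never exercises the Attribute rule
strong_tests = [{"c1": "Class", "a1": "Attribute"}]  # exercises both rules

weak_score = mutation_score(original, mutants, weak_tests)
strong_score = mutation_score(original, mutants, strong_tests)
```

With the weak test set, the first mutant survives because no test model contains an attribute, so only half of the mutants are killed. The stronger test set kills both, which is the signal mutation analysis uses to judge the quality of a test set.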
In addition, our technique does not create program variations (rule modifications) but trace variations that differ from good ones. Furthermore, the mutation analysis technique needs to define an expected model for each test case in order to compare it with another target model obtained from the same test case after modifying the rules (mutant).

Some other approaches are specific to test case generation for graph transformation mechanisms. Küster [58] addresses the problem of model transformation validation in a way that is very specific to graph transformation. He focuses on the validation of the rules that define the model transformation with respect to termination and confluence. His approach aims at ensuring that a graph transformation will always produce a unique result. Küster's work is an important step for model transformation validation but it does not aim at validating the functionality of a transformation (i.e., it does not aim at running a transformation to check if it produces a correct result).

Darabos et al. [25] also investigate the testing of graph transformations. They consider graph transformation rules as the specification of the transformation and propose to generate test data from this specification. Their testing technique focuses on testing the pattern matching activity, which is considered the most critical part of a graph transformation process. They propose several fault models that can occur when computing the pattern match as well as a test generation technique that targets those particular faults. However, the approach of Darabos et al. is specific to testing graph transformation mechanisms.

Stürmer et al. [22] propose a technique for generating test cases for code generators. The criterion they propose is based on the coverage of graph transformation rules. Their approach allows the generation of test cases for the coverage of both individual rules and rule interactions, but it requires the code generator under test to be fully specified with graph transformation rules. Sampath et al. [11] propose a similar method for the verification of model processing tools such as simulators and code generators. They use a meta-model-based test case generation method that generates test cases for model processors.

Mottu et al. [32] describe six different oracle functions to evaluate the correctness of an output model. These six functions can be classified in three general categories.
For the first category, current MDE technologies and model repositories store and manipulate models as graphs of objects. Thus, when the expected output model is available, the oracle compares two graphs. In this case, the oracle definition problem has the same complexity as the graph isomorphism problem, for which no polynomial-time algorithm is known [6]. In particular, we can find a test case output and an expected model that look different (contain different model elements) but have the same meaning. So, the complexity of these data structures makes it difficult to provide an efficient and reliable tool for comparison [22]. Still, several studies have proposed simplified versions with a lower computation cost [12]. For example, Alanen et al. [4] present a theoretical framework for performing model differencing. However, they rely on the use of unique element identifiers for the model elements.

To illustrate the specification conformance category, we present two contributions: design by contract and pattern matching [112]. For design by contract, the transformation is specified by pre- and post-conditions, and transformation invariants that must be satisfied. The constraints are defined at the meta-model level and expressed in OCL. For pattern matching, templates are used to specify the expected features of the input and output models with pre- and post-conditions for the transformation. The difference with design by contract approaches is that specific constraints must be defined for each output model. Both oracles are difficult to define. Indeed, the number of constraints to define can be very large to cover all rules and patterns [112]. This is especially the case for contracts related to one-to-many mappings. Moreover, being formal specifications, these constraints are difficult to write in practice. In pattern matching, the constraints are described at the model level, and defining them for each model instance may be a tedious task [112].

More generally, when many test models are necessary, at least as many test cases are created. To reduce the effort and the risk of making an error, each test case should not have its own oracle; instead, an oracle should be reused in different test cases. Such an oracle is generic and not dedicated to a test case, its test model, and its corresponding output model. Oracle functions using patterns or expected models are not suitable since they require at least one oracle datum to be written for each test case. Generic oracle data are preferable since they are written only once, and can be used with their corresponding oracle function in any test case.
In addition, all these approaches to model transformation validation and testing consider a particular technique for model transformation and leverage the specificities of this technique to validate the transformation. This has the advantage of yielding validation techniques that are well suited to the specific faults that can occur in each of these techniques. However, the results of these approaches are difficult to adapt to other transformation techniques (those that are not rule-based).

2.3 By-Example Software Engineering

Examples play a key role in the human learning process. There exist numerous theories on learning styles in which examples are used. For a description of today's popular learning style theories, see [95],[7]. Our work is based on using past transformation examples. Various by-example approaches have been proposed in the software engineering literature. What does by-example really mean? What do all by-example approaches have in common? The main idea, as the name already suggests, is to give the software examples of how things are done or what the user expects, and let it do the work automatically. In fact, this idea is closely related to fields such as machine learning or speech recognition. Common to all by-example approaches is the strong emphasis on user friendliness and a short learning curve. According to [20], the by-example paradigm dates back to Learning Structure Descriptions from Examples [90]. Programming by example [95] is the best-known by-example approach. It is a technique for teaching the computer new behavior by demonstrating actions on concrete examples. The system records user actions and generalizes a program that can be used for new examples. The generalization process is mainly based on user responses to queries about user intentions. Another well-known approach is Query by Example (QBE) [7]. It is a query language for relational databases constructed from sample tables filled with example rows and constraints. QBE is especially suited for queries that are not too complex and can be expressed in terms of a few tables. In web engineering, Lechners et al. [62] present the language TBE (XML transformers by example), which allows defining transformers for WebML schemes by example, i.e., stating what is desired instead of specifying the operations to get it.
Advanced XSLT tools are also capable of generating XSLT scripts using examples from schema-level mappings (like MapForce from Altova) or document (instance) level mappings (such as the pioneering XSLerator from IBM Alphaworks, or the more recent Stylus Studio).

The problems addressed by the above-mentioned approaches differ from ours in both nature and objectives.

2.4 Search-based Software Engineering

Our approach is largely inspired by contributions in Search-Based Software Engineering (SBSE). SBSE is defined as the application of search-based approaches to solving optimization problems in software engineering [72]. Once a software engineering task is framed as a search problem, there are numerous approaches that can be applied to solving that problem, from local searches such as exhaustive search and hill-climbing to meta-heuristic searches such as genetic algorithms (GAs) and ant colony optimisation [70]. Many contributions have been proposed for various problems, mainly in cost estimation, testing, and maintenance [101],[72]. Module clustering, for example, has been addressed using exhaustive search [70], genetic algorithms [72], and simulated annealing (SA) [103]. In those studies that compared search techniques, hill-climbing was, perhaps surprisingly, found to produce better results than meta-heuristic GA searches. Model verification has also been addressed using search-based techniques. Shousha et al. [70] propose an approach to detect deadlocks in UML models. However, the problem most similar to MT by examples is probably the one reported in [103]: the generation of a new quality predictive model from a set of existing ones by using simulated annealing (SA). In that work, the model is also decomposed into fine-grained pieces of expertise that can be combined and adapted to generate a better prediction model. To the best of our knowledge, the idea of treating model transformation as a combinatorial optimization problem to be solved by a search-based approach, inspired among others by the road map paper of Harman [72], had not been studied before our proposal.

2.5 Summary

This chapter has introduced the existing work in different domains related to our work. The closest work to our proposal is model transformation by example (MTBE). The commonalities of the by-example approaches for model transformation can be summarized as follows. All approaches define an example as a triple consisting of an input model, its equivalent output model, and traces between the input and output model elements. These examples have to be established by the user, preferably in concrete syntax. Then, generalization techniques such as hard-coded reasoning rules, inductive logic [23], relational concept analysis, or patterns are used to derive model transformation rules from the examples, in a deterministic way that is applicable to all possible input models that have a high similarity with the predefined examples. One conclusion to be drawn from studying the existing by-example approaches is that they use semi-automated rule generation, with the generated rules further refined by the user. In practice, this may be a lengthy process and require a large number of transformation examples to assure the quality of the inferred rules. In this context, the use of search-based optimization techniques can be a preferable transformation approach since it directly generates the target model from the existing examples, without going through the rule generation step. This also leads to a higher degree of automation than existing by-example approaches. Table 1 summarizes existing transformation by-example approaches according to given criteria. The majority of these approaches are specific to exogenous transformation and based on the use of traceability.

Columns: By-example approaches; Exogenous transformation; Endogenous transformation; Traceability; Rules generation
Varrò et al. [23]: X X X
Wimmer et al. [65]: X X X
Sun et al. [124]: X X
Dolques et al. [123]: X X X
Langler et al. [97]: X X X
Brosch et al. [96]: X X
Table 2 By-example Approaches

As shown in the search-based section, like many other domains of software engineering, MDE is concerned with finding exact solutions to its problems, or solutions that fall within a specified acceptance margin. Search-based optimization techniques are well suited for this purpose. For example, when testing model transformations, the use of deterministic techniques can be infeasible due to the number of possibilities to explore for test case generation in order to cover all source meta-model elements. However, the complex nature of MDE problems sometimes requires the definition of complex fitness functions [73]. Furthermore, the definition is specific to the problem to solve and necessitates expertise in both the search-based and MDE fields. It is thus desirable to define a generic fitness function that evaluates the quality of a solution and can be applied to various MDE problems with low adaptation effort and expertise.

To tackle these challenges, our contribution combines search-based and by-example techniques. The difference with case-based reasoning approaches is that many sub-cases can be combined to derive a solution, not just the most adequate case. In addition, if a large number of combinations have to be investigated, the use of search-based techniques becomes beneficial in terms of the speed with which the best combination is found. In the next chapters, we detail our contribution based on this combination of by-example and search-based techniques.

Part 1: Exogenous Transformation by Example

The first part of this thesis presents our solution for the problem of automating exogenous transformation based on the use of examples. Most of the available work on model transformation is based on the hypothesis that transformation rules exist and that the important issue is how to express them. However, in real problems, the rules may be difficult to define, as is often the case when the source and/or target formalisms are not widely used or are proprietary. Indeed, as for any rule-based system, defining the set of rules is not an obvious task and many difficulties accompany the results [24]. As a solution, we describe MOTOE (Model Transformation as Optimization by Example), a novel approach to automate model transformation (MT) using heuristic search. MOTOE uses a set of transformation examples to derive a target model from a source model. The transformation is seen as an optimization problem where different transformation possibilities are evaluated and, for each possibility, a quality is associated depending on its conformance with the examples at hand. The search space is explored with two methods. In the first one, we use PSO (Particle Swarm Optimization) with transformation solutions generated from the examples at hand as particles. Particles progressively converge toward a good solution by exchanging and adapting individual construct transformation possibilities. In the second method, a partial run of PSO is performed to derive an initial solution. This solution is then refined using a local search with SA (Simulated Annealing). The refinement explores neighboring solutions obtained by trying individual construct transformation possibilities derived from the example base. In both methods, the quality of a solution considers the adequacy of construct transformations as well as their mutual consistency.
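The refinement step of the second method can be illustrated with a minimal sketch. It is not the MOTOE implementation: the function name, the cooling schedule, the block labels, and the toy fitness function are all assumptions made for illustration, and the fitness is assumed to be maximized. The initial solution, which MOTOE obtains from a partial PSO run, is simply hand-picked here.

```python
import math
import random


def simulated_annealing(initial, candidates, fitness, rng,
                        temperature=1.0, cooling=0.95, steps=200):
    """Refine a block assignment by re-assigning one construct at a time.

    initial    -- list with one block label per source construct
    candidates -- candidates[i] lists the blocks allowed for construct i
    fitness    -- function scoring a whole assignment (higher is better)
    """
    current = list(initial)
    current_fit = fitness(current)
    best, best_fit = list(current), current_fit
    for _ in range(steps):
        # Neighbor: change the block of one randomly chosen construct.
        i = rng.randrange(len(current))
        neighbor = list(current)
        neighbor[i] = rng.choice(candidates[i])
        neighbor_fit = fitness(neighbor)
        delta = neighbor_fit - current_fit
        # Improvements are always accepted; degradations are accepted with
        # a probability that shrinks as the temperature decreases.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = neighbor, neighbor_fit
            if current_fit > best_fit:
                best, best_fit = list(current), current_fit
        temperature *= cooling
    return best, best_fit


# Hypothetical data: three constructs, each with a few candidate blocks.
candidates = [["B1", "B4"], ["B2", "B5", "B7"], ["B3", "B6"]]
preferred = ["B4", "B7", "B3"]  # toy stand-in for "conformance with the examples"


def fitness(solution):
    """Toy fitness: number of constructs assigned their preferred block."""
    return sum(a == b for a, b in zip(solution, preferred))


initial = ["B1", "B2", "B6"]  # stands in for the result of a partial PSO run
best, best_fit = simulated_annealing(initial, candidates, fitness, random.Random(0))
```

Because the best solution seen so far is tracked separately from the current one, the returned assignment is never worse than the initial one, even though the search temporarily accepts degradations.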

We distinguish two types of models to transform: dynamic and static. The dynamic model is used to express and model the behaviour of a problem domain or system over time, whereas the static model shows those aspects that do not change over time. UML static models are mainly expressed using a class diagram that shows a collection of classes and their interrelationships, for example generalization/specialization and association. The transformation of dynamic models is more difficult than that of static ones, for two main reasons [29]. First, defining transformation rules for dynamic models can be difficult since the source and target languages have constructs with different semantics; therefore, 1-to-1 mappings are not sufficient to express the semantic equivalence between constructs. Second, in addition to ensuring structural (static) coherence, the transformation should guarantee behavioral coherence in terms of time constraints and weak sequencing. We evaluate our by-example approach on the two kinds of models. For static models, we consider class diagram to relational schema transformation; for dynamic models, we adapt our approach to sequence diagram to colored Petri nets transformation. We detail these two contributions in the next two chapters.

Chapter 3: Static Model Transformation by Example

3.1 Introduction

In this chapter, we describe our solution for the problem of automating static model transformation using examples. This contribution has been accepted for publication in the journal Software and Systems Modeling (SoSyM) [79]. The paper, entitled Search-based Model Transformation by Example, is presented next.

3.2 Class Diagram to Relational Schema Transformation by Example

Search-based Model Transformation by Example

MAROUANE KESSENTINI 1, HOUARI SAHRAOUI 1, MOUNIR BOUKADOUM 2 AND OMAR BEN OMAR 1
1 Département d'informatique et Recherche Opérationnelle, Université de Montréal, CP 6128, succ Centre-Ville, Montréal QC H3C 3J7, Canada {kessentm,
2 Département d'informatique, Université du Québec à Montréal, CP 8888, succ Centre-ville, Montréal QC H3C 3P, Canada

Abstract

Model transformation (MT) has become an important concern in software engineering. In addition to its role in model-driven development, it is useful in many other situations such as measurement, refactoring, and test-case generation. Roughly speaking, MT aims to derive a target model from a source model by following some rules or principles. So far, the contributions in MT have mostly relied on defining languages to express transformation rules. However, the task of defining, expressing, and maintaining these rules can be difficult, especially for some formalisms. In other situations, companies have accumulated examples from past experiences. Our work starts from these observations to view the transformation problem as one to solve with fragmentary knowledge, i.e. with only examples of source-to-target model transformations. Our proposal has two main advantages: 1) for any source model, it always proposes a transformation, even when rule induction is impossible or difficult to achieve; 2) it is independent from source and target formalisms; aside from the examples, no extra information is needed. In this context, we propose an optimization-based approach that consists of finding in the examples combinations of transformation fragments that best cover the source model. To this end, we use two strategies based on two search-based algorithms: Particle Swarm Optimization (PSO) and Simulated Annealing (SA). The results of validating our approach on industrial projects show that the obtained models are accurate.

Keywords: Search-based software engineering, Automated model transformation, Transformation by example.

1 Introduction

In the context of model-driven development (MDD) [37], the creation of models and model transformations is a central task that requires a mature development environment, based on the best practices of software engineering principles. For a comprehensive approach to MDD, models and model transformations must be designed, analyzed, synthesized, tested, maintained and subjected to configuration management to ensure their quality. This makes

model transformation a central concern in the MDD paradigm: beyond its use in forward engineering, it allows concentrating the maintenance effort on models and using transformation mechanisms to generate code. As a result, many transformation languages are emerging. Practical model-to-model transformation languages are of prime importance. Despite the many approaches [33,36,20] that responded to the OMG QVT request for proposals (RFP) [34,35], the MT problem has no universal solution because the majority of existing approaches depend on the source and target metamodels. A popular view attributes the situation to the difficulty of defining or expressing transformation rules, especially for proprietary or non-widely used formalisms. Indeed, most contributions in MT are concerned with defining languages to express transformation rules. Transformation rules can be implemented using: (1) general programming languages such as Java or C#; (2) graph transformation languages like AGG [19] and VIATRA [14]; (3) specific languages such as ATL [18] and QVT [35]. Sometimes, transformations are based on invariants: pre-conditions and post-conditions specified in languages such as OCL [43]. These approaches have been successfully applied to transformation problems where there exists knowledge about the mapping between the source and target models. Still, there exist situations where defining the set of rules is a complex task and many difficulties accompany the results [45] (incompleteness, redundancy, inconsistency, etc.). In particular, the experts may find it difficult to master both the source and target meta-models [15]. On the other hand, it is recognized that experts can more easily give transformation examples than complete and consistent transformation rules [2].
This is particularly true for industrial organizations where a memory of past transformation examples can be found, and it is the main motivation for transformation-by-examples approaches such as the one proposed in [13]. The principle of this approach is to semi-automatically derive transformation rules from an initial set of examples (interrelated source and target models), using inductive logic programming (ILP). However, it is not adaptable to new situations where no examples are available. We can alternatively view MT as an optimization problem where a (partial) target model is to be automatically derived from available examples. In this context, we recently

introduced an optimization-based approach to automate MT called MOTOE (model transformation as optimization by examples) [29]. MOTOE views MT as essentially a combinatorial optimization problem where the transformation of a source model is obtained by finding, for each of its constructs, a similar transformation in an example base. Due to the large number of possible combinations, a heuristic-search strategy is used to build the transformation solution as a set of individual construct transformations. Compared to our previous paper [29], we extend MOTOE with a more sophisticated transformation building process and use a larger-scale validation with industrial data. In particular, we compare two strategies: (1) parallel exploration of different transformation possibilities (call it population-based MT) by means of a global search heuristic implemented with PSO (Particle Swarm Optimization) [22], and (2) initial transformation possibility improvements (call it adaptation-based MT) implemented with a hybrid heuristic search that combines PSO with the local search heuristic SA (Simulated Annealing) [17]. The approach we propose has the advantage over rule-based algorithms that, for any source model, it always proposes a transformation, even when rule induction is impossible or difficult to achieve. Although it can be seen as a form of case-based reasoning (CBR) [8], it actually differs from CBR approaches in that all the existing models are used to derive a solution, not only the most similar one. Another interesting advantage is that our approach is independent from source and target formalisms; aside from the examples, no extra information is needed. In conclusion, our approach is not meant to replace rule-based approaches; instead, it applies to situations where rules are not available, difficult to define, or non-consensual.
In this paper, we illustrate and evaluate our approach on the well-known case of transforming UML class diagrams (CLD) to relational schemas (RS). As will be shown in Section 4, the models obtained using our transformation approach are comparable to those derived by transformation rules. Although transformation rules exist in this case, our choice of CLD-to-RS transformation is motivated by the fact that it is well-known and reasonably complex; this allows us to focus on describing the technical aspects of our approach and comparing its results with a well-known alternative. However, our approach can also be

applied to more complex transformations such as sequence diagrams to colored Petri nets [47]. The remainder of this paper is structured as follows. Section 2 is dedicated to the MT problem statement. In Section 3, we describe the principles of our approach. The details are discussed in Section 4; they include the adaptation of two search algorithms for the MT problem. Section 5 contains the validation results of our approach with industrial projects and a comparison between the global- and adaptation-based strategies. In Section 6, the related work in model transformation is discussed. We conclude and suggest research directions in Section 7.

2 Approach Overview

This section shows how, under some circumstances, MT can be seen as an optimization problem. We also show why the size of the corresponding search space makes heuristic search necessary to explore it. Finally, we give the principles of our approach.

2.1 Problem Statement

Defining transformations for domain-specific or complex languages requires proficiency in high-level programming languages, knowledge of the underlying metamodels, and knowledge of the semantic equivalence between the concepts of the meta-models [37]. Therefore, creating MT rules may become a complex task [30]. On the other hand, it is often easier for experts to show transformation examples than to express complete and consistent transformation rules [15]. This observation has led to a new research direction: model transformation by example (MTBE), where, as in [13], rules are semi-automatically derived from examples. In the absence of rules or an exhaustive set of examples that allows rule extraction, an alternative solution is to derive a partial target model from the available examples. The generation of such models consists of finding, in the examples, some model fragments that

best match the model to transform. To characterize the problem, we start with some definitions.

Definition 3.1 (Model to Transform). A model to transform, M, is a model composed of n constructs expressed in a predefined abstract syntax.

Definition 3.2 (Model Construct). A construct is a source or target model element, for example, a class in a CLD. It may contain properties that describe it, e.g. its name. Complex constructs may contain sub-constructs; for example, a class could have attributes. For graph-based formalisms, constructs are typically nodes and links between nodes. For instance, classes, associations, and generalizations are model constructs in UML class diagrams.

Definition 3.3 (Block). A block defines a previously performed transformation trace between a set of constructs in the source model and a set of constructs in the target model. Constructs that should be transformed together are grouped within the same block. For example, a generalization link g between two classes A and B cannot be mapped independently from the mapping of A and B. In our case, we assume that blocks are manually defined by domain experts when transforming models. Finally, blocks are not general rules since they involve concept instances (e.g., class Student) instead of just concepts (e.g., class concept). In other words, where transformation rules are expressed in terms of metamodels, blocks are expressed in terms of concrete models.

Definition 3.4 (Transformation Example). A transformation example, TE, is a mapping of constructs from a source model to the corresponding target model. Formally, we view a TE as a triple <SMD, TMD, MB>, where SMD denotes the source model, TMD denotes the target model, and MB is a set of mapping blocks that relate sets of constructs in SMD to their equivalents in TMD.
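Definitions 3.2 to 3.4 can be read as a small data model. The following sketch is one possible encoding, not the representation used by MOTOE: the class names, field names, and the `blocks_covering` helper are hypothetical, and the example reuses the generalization link g between classes A and B mentioned in Definition 3.3.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Construct:
    """A source or target model element (Definition 3.2)."""
    kind: str               # e.g. "Class", "Generalization", "Table"
    name: str               # e.g. "A"
    properties: tuple = ()  # optional descriptive properties


@dataclass
class Block:
    """A recorded transformation trace between construct sets (Definition 3.3)."""
    label: str
    source: list            # constructs taken from the source model
    target: list            # constructs taken from the target model


@dataclass
class TransformationExample:
    """The triple <SMD, TMD, MB> of Definition 3.4."""
    smd: list
    tmd: list
    mb: list = field(default_factory=list)

    def blocks_covering(self, kind):
        """Return the blocks whose source side contains a construct of this kind."""
        return [b for b in self.mb if any(c.kind == kind for c in b.source)]


# The generalization g cannot be mapped independently of classes A and B,
# so the three constructs are grouped in a single block.
a = Construct("Class", "A")
b = Construct("Class", "B")
g = Construct("Generalization", "g")
table_a = Construct("Table", "A")
table_b = Construct("Table", "B")
fk = Construct("Column", "IDA", (("table", "B"),))  # hypothetical foreign key

block = Block("B1", [a, b, g], [table_a, table_b, fk])
te = TransformationExample([a, b, g], [table_a, table_b, fk], [block])
```

Note that the block refers to concrete instances (classes A and B), not to the class concept itself, which is what distinguishes a block from a metamodel-level rule.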

For example, the creation of a database schema from a UML class diagram describing student records is a transformation example. The base of examples is a set of transformation examples. Our goal is to combine and adapt transformation blocks, which are fragments coming from one or more model transformations in the base of examples, to generate a new transformed model by similarity. A fragment from an example model is considered similar to one from the source model if it shares the same construct types with similar properties. For instance, in a class diagram, a fragment with an association pays between two classes Client and Bill is similar to a fragment from another diagram containing an association evaluates relating classes ControlExam and Module. The degree of similarity depends on the properties of the classes and associations (attribute types, cardinalities, etc.). In the absence of transformation rules, any combination of blocks is a transformation possibility. For example, when transforming a class diagram into a database schema, any class can be translated into a table, a foreign key in an existing table, two tables, or any other possible combination of target constructs. However, with transformation examples, possibilities are reduced to transformations of similar constructs in these examples. The transformation of a model M with n constructs, using a set of examples that globally define m possibilities (blocks), consists of finding the subset from the m possibilities that best transforms each of the n constructs of M. Best transforms means that each construct can be transformed by one of the selected possibilities and that construct transformations are mutually consistent. In this context, $m^n$ possible combinations have to be explored. This number can quickly become huge. For example, an average UML class diagram with 40 classes and 60 links (generalizations, associations, and aggregations) defines 100 constructs ($40 + 60$).
At the same time, an example base with a reasonable number of examples may contain hundreds of blocks, say 300. In this case, $300^{100}$ possible combinations should be explored. If we limit the possibilities for each construct to only blocks that contain similar constructs, the number of possibilities becomes $m_1 \times m_2 \times m_3 \times \dots \times m_n$, where each $m_i \le m$ represents the number of transformation possibilities for construct $i$. Although the number of possibilities is reduced, it could still be very large for large CLDs. In the same example, assuming that each of the 100 constructs has 8 or more mapping possibilities leads to exploring at least $8^{100}$ combinations. Considering these magnitudes, exploring all the possibilities cannot be done within a reasonable time frame. This calls for alternative approaches such as heuristic search.

2.2 Approach Overview

We propose an approach that uses knowledge from previously solved transformation cases (examples) so that a new MT problem is solved using a combination of the past problem solutions, and the (partial) target model is automatically derived by an optimization process that exploits the available examples. Figure 1 shows the general structure of MOTOE. The approach takes as inputs a base of examples (i.e., a set of transformation examples) and a source model to transform, and generates as output a target model. The generation process can be viewed as the selection of the subset of transformation fragments (blocks) in the example base that best matches the constructs of the source model (using a similarity function). In other words, the transformation is done as an assembly of building blocks. The quality of the produced target model is measured by the conformance of the selected fragments to structural constraints, i.e., by answering the following two questions: (1) did we choose the right blocks? and (2) did they fit together?

Fig 1. MOTOE overview (diagram elements: base of examples, source model, similarities, internal coherence constraints, external coherence constraints, heuristic search, target model)
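The magnitudes quoted in the problem statement of Section 2.1 can be checked with a few lines of arithmetic. The sketch below only reproduces the numbers given in the text (40 classes, 60 links, 300 blocks, at least 8 possibilities per construct); the variable and function names are illustrative.

```python
import math

n_constructs = 40 + 60   # 40 classes + 60 links = 100 constructs
m_blocks = 300           # blocks in the example base

# Unrestricted case: every construct may take any of the m blocks.
unrestricted = m_blocks ** n_constructs

# Restricted case: construct i only has m_i <= m similar blocks.
# The text assumes at least 8 possibilities for each construct.
possibilities = [8] * n_constructs
restricted = math.prod(possibilities)


def order_of_magnitude(x):
    """Return k such that 10**k <= x < 10**(k + 1) for a positive integer x."""
    return len(str(x)) - 1


print(order_of_magnitude(unrestricted))  # exponent of 300**100
print(order_of_magnitude(restricted))    # exponent of 8**100
```

Even the restricted count has more than ninety decimal digits, which is why exhaustive enumeration is ruled out and a heuristic search is used instead.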

Figure 2 illustrates the case of a source model with 6 constructs to transform, represented by dots. A transformation solution consists of assigning to each construct $c_i$ a mapping block, i.e. a transformation possibility from the example base (blocks are represented by rectangles in Figure 2). A possibility is considered to be adequate if the assigned block contains a construct similar to $c_i$ (similarity evaluation is discussed in Section 3.3).

Fig 2. Illustration of the proposed transformation process

As many block assembly schemes are possible, the transformation is a combinatorial optimization problem. In fact, the number of possible solutions becomes very high. Thus, a deterministic search is infeasible and a heuristic search is needed to find an acceptable solution. The dimensions of the solution space are the constructs of the source model to transform. A solution is determined by the assignment of a transformation fragment (block) to each source model construct. The search is guided by the quality of the solution according to internal coherence (inside a block) and external coherence (between blocks). To explore the solution space, we study two different search strategies in this work. The first one uses a global heuristic search by means of the PSO algorithm [22]. The second one first uses a global search to reduce the search space and find a first transformation solution; then it uses a local heuristic search, based on the SA algorithm [17], to refine the first solution. To illustrate our example-based transformation mechanism, consider the case of model transformation between UML class diagrams (CLD) and relational schemas (RS). Figure 3 shows a simplified metamodel of the UML class diagram, containing concepts like class, attribute, relationship between classes, etc. Figure 4 shows a partial view of the relational schema metamodel, composed of table, column, attribute, etc. The transformation

mechanism, based on rules, will then specify how the persistent classes, their attributes and their associations should be transformed into tables, columns and keys.

Fig 3. Class diagram metamodel

Fig 4. Relational schema metamodel

The choice of this particular example is only motivated by considerations of clarity. As MOTOE does not depend on the source and target metamodels, it is independent from the nature of the transformation problem and applicable to any kind of formalism for which prior examples of successful transformations are available. A transformation example of a CLD to an RS is presented in Figures 5 and 6. The CLD is the source model (a) and the RS is the target one (b). The CLD contains 12 constructs that represent 7 classes (including 2 association classes), 3 associations, and 2 generalization links. The five non-associative classes are mapped to tables, with the class attributes mapped to columns of the tables. The associations between Student and Module, and between Teacher and Module, are respectively translated into tables Register and Intervene with, as columns, the attributes of the associative classes. Each of these tables also contains two foreign keys to their related tables. Association evaluate becomes a foreign key in table ControlExam. Finally, the generalization links are mapped as foreign keys in the tables corresponding to the subclasses.

The decisions made in this transformation example are not the only alternatives. For instance, we can find many rules (points of view) to transform a generalization link. One of them maps abstract class Person as a duplication of its attributes in the tables that correspond to classes Student and Teacher. Following Definition 3.4 of Section 2.1, SMD corresponds to the CLD, TMD represents the corresponding RS, and MB is the set of mapping blocks between the two models. For example, a block describes the mapping of the association evaluate and classes Module and ControlExam in Figure 5. This block respectively assigns tables Module and ControlExam to the two classes, and foreign key IDModule to the association (Figure 5). As mentioned earlier, the transformations of the three constructs are grouped within the same block since they are interdependent.

Fig 5. Example of a CLD source model

Fig 6. Equivalent RS target model to the CLD source model of Figure 5

To ease the manipulation of the source and target models and their transformation, the models are described using a set of predicates that correspond to the included constructs. Each construct is represented by one or more predicates. For example, class Teacher in Figure 5 is described as follows:

Class(Teacher).
Attribute(Level, Teacher, _).

The first predicate indicates that Teacher is a class. The second states that Level is an attribute of that class and that its value is not unique (_ instead of unique).
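The predicate notation maps naturally onto simple data structures. The following Python sketch shows one possible encoding; it is not the representation used by MOTOE. The helpers `pred`, `render` and `of_type`, as well as the tuple layout, are assumptions introduced only for illustration.

```python
from typing import List, Tuple

# Hypothetical encoding (an assumption): a predicate is a (name, arguments) pair.
Predicate = Tuple[str, Tuple[str, ...]]


def pred(name: str, *args: str) -> Predicate:
    """Build a predicate such as Class(Teacher)."""
    return (name, tuple(args))


def render(p: Predicate) -> str:
    """Print a predicate in the textual notation used in this section."""
    name, args = p
    return f"{name}({', '.join(args)})."


def of_type(model: List[Predicate], name: str) -> List[Predicate]:
    """Select all predicates of one construct type, e.g. all attributes."""
    return [p for p in model if p[0] == name]


# Class Teacher of Figure 5; "_" marks a non-unique attribute.
teacher = [
    pred("Class", "Teacher"),
    pred("Attribute", "Level", "Teacher", "_"),
]

for p in teacher:
    print(render(p))
```

Keeping the construct type as the first element makes it cheap to look up all constructs of a given type, which is the kind of comparison the adequacy check described later relies on.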

Fig 7. Base of transformation examples and blocks generation in source model of TE4

The mapping blocks relate the predicates in the source model to their equivalent constructs in the target model. In Figure 7, for instance, block B37¹, which contains the generalization link and the two classes Teacher and Person, is described as follows:

Begin b37
Class(Person) : Table(Person).
Attribute(IDPerson, Person, unique) : Column(IDPerson, Person, pk).
Attribute(Name, Person, _) : Column(Name, Person, _).
Attribute(FirstName, Person, _) : Column(FirstName, Person, _).
Attribute(Address, Person, _) : Column(Address, Person, _).
Class(Teacher) : Table(Teacher).
Attribute(Level, Teacher, _) : Column(Level, Teacher, _).
Generalization(Person, Teacher) : Column(IDPerson, Teacher, pfk).
End b37

¹ For ease of traceability, blocks are sequentially numbered, starting from the first transformation example in the example base. For instance, the 9 blocks of example TE1 are labeled B1 to B9, the 13 blocks of TE2 are labeled B10 to B22, and so on. When a solution is produced, it is relatively easy to determine which examples contributed to it.

Mappings are expressed with the : character. Thus, the mapping between predicates Attribute(IDPerson, Person, unique) and Column(IDPerson, Person, pk) indicates that the unique attribute IDPerson in class Person is transformed into the column IDPerson in table Person with the status of primary key. Similarly, the mapping between Generalization(Person, Teacher) and Column(IDPerson, Teacher, pfk) indicates that the generalization link is represented by the primary-foreign key (pfk) IDPerson in table Teacher.

A model $M_i$ to transform is characterized only by its description $SMD_i$, i.e., a set of constructs expressed by predicates. A construct can be transformed in many ways, each having a degree of relevance. This degree depends on three factors: (1) the adequacy of the individual construct transformations; (2) the internal consistency of the individual transformations inside the blocks; (3) the (external) coherence of the transformation between related blocks, i.e., blocks sharing the same constructs. For example, consider a model to transform that has two classes, Dog and Animal, related by a generalization link g. g could become a table, many tables, a column, a foreign key, or any other possibility. A possibility is considered adequate if there exists a block in the example base that maps a generalization link. For instance, the mapping of block B37 (Figure 7(b)) is adequate because it also involves a generalization link. It is also internally consistent since it maps a similar pair of classes. Finally, it is externally coherent if Dog and Animal are only mapped to tables in the other blocks that contain them. The transformation quality of a source model is the sum of the transformation qualities of its constructs.
Consequently, finding a good transformation is equivalent to finding the combination of construct transformations that maximizes the global quality. But since the

number of combinations may be very large because of multiple mapping possibilities, it may become difficult, if not impossible, to evaluate them exhaustively. As stated previously, heuristic search offers a good alternative in this case. The search space dimensions are the constructs, and the possible coordinates in these dimensions are the block numbers. A solution then consists of choosing a block number for each construct. The exploration of the search space using heuristic algorithms is presented next.

3 Transformation using Search-Based Methods

We describe in this section the adaptation of PSO and SA to automate MT. To apply them to a specific problem, one must specify the encoding of solutions, the operators that allow movement in the search space so that new solutions are obtained, and the fitness function used to evaluate a solution's quality. These three elements are detailed in subsections 3.1, 3.2, and 3.3, respectively. Their use by PSO and SA to solve the MT problem is presented in subsections 3.4 and 3.5.

3.1 Representing Transformation Solutions

One key issue when applying a search-based technique is finding a suitable mapping between the problem to solve and the techniques to use, i.e., in our case, encoding a transformation between a source and a target model. As stated in Section 2, we view the set of potential solutions as points in an n-dimensional space where each dimension corresponds to one of the n constructs of the model to transform. Each construct can be mapped according to a finite set of blocks, which means that each dimension takes its values in a discrete set $b = \{i \mid 1 \le i \le m\}$, where m is the number of blocks extracted from the transformation examples. For instance, the transformation of the model shown in Figure 8 generates a 7-dimensional space that accounts for the four classes and the three relationships. To define a particular solution, we associate with each dimension (construct) a block number that contains a transformation possibility. 
Each block number defines a coordinate in the corresponding dimension, and the resulting n-tuple of block numbers then defines a

vector position in the n-dimensional space. For instance, the solution shown in Table 1 suggests that construct 1 (class Command) be transformed according to block 28, construct 2 (class Bill) according to block 3, etc. Concretely, a solution is implemented as a vector whose elements are the constructs of the model to transform and whose element values are the block numbers that refer to transformation possibilities from the example base.

Fig 8. Example of source model (UML class diagram)

Table 1. Solution Representation

Dimension   Construct                 Block number
1           Class(Command)            28
2           Class(Bill)               3
3           Class(Article)            21
4           Class(Seller)             13
5           Aggregation               9
6           Association(payable_by)   42
7           Association(pays)         5
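The encoding of Table 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the thesis implementation: the names `CONSTRUCTS`, `TABLE_1`, `random_solution` and `describe` are hypothetical, and block numbers are assumed to range from 1 to the number of available blocks.

```python
import random
from typing import Dict, List

# Constructs of the model in Figure 8, in the dimension order of Table 1.
CONSTRUCTS = [
    "Class(Command)", "Class(Bill)", "Class(Article)", "Class(Seller)",
    "Aggregation", "Association(payable_by)", "Association(pays)",
]

# The solution of Table 1: one block number per dimension.
TABLE_1 = [28, 3, 21, 13, 9, 42, 5]


def random_solution(n_constructs: int, n_blocks: int,
                    rng: random.Random) -> List[int]:
    """Assign a random block number (assumed range 1..n_blocks) to each construct."""
    return [rng.randint(1, n_blocks) for _ in range(n_constructs)]


def describe(constructs: List[str], solution: List[int]) -> Dict[str, int]:
    """Pair each construct with the block chosen to transform it."""
    if len(constructs) != len(solution):
        raise ValueError("a solution needs exactly one block per construct")
    return dict(zip(constructs, solution))


if __name__ == "__main__":
    print(describe(CONSTRUCTS, TABLE_1))
    print(random_solution(len(CONSTRUCTS), 50, random.Random(0)))
```

Because a solution is a plain vector of integers, both heuristics can share it: PSO treats it as a position, and SA perturbs a few of its entries.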

The proposed coding is valid for both heuristics. In the case of PSO, we create, as an initial population, k solution vectors with a random assignment of blocks. SA, in contrast, starts from a solution vector produced by PSO.

3.2 Deriving A Transformation Solution

A change operator is a modification brought to a solution in order to produce a new one. In our case, it modifies a transformation of the source model in order to produce a new one. This is done by changing the blocks for some constructs, which is equivalent to changing the coordinates of the solution in the search space. Unlike solution encoding, change operators are implemented differently by the PSO and SA heuristics. PSO changes blocks as a result of movement in the search space driven by a velocity function; SA performs the change randomly.

In the case of PSO, a translation (velocity) vector is regularly updated and added to a position vector to define new solutions (see Section 3.4, Equations 3 and 4 for details). For example, the solution shown in Table 1 may lead to the new solution shown at the bottom of Figure 9. The velocity vector V assigns a real-valued translation to each element of the position vector. After adding the two vectors, the elements of the result are each rounded to the nearest integer to represent block numbers (the allowable values are bounded by 1 and the maximum number of available blocks). As shown in Figure 9, the new solution updates the block numbers of the constructs. Thus, block 42 replaces block 19, block 7 remains unchanged, block 49 replaces block 51, etc.

Fig 9. Change Operator in PSO

For SA, the change operator involves randomly choosing l dimensions (l < n) and replacing their assigned blocks by randomly selected ones from the example base. For

instance, Figure 10 shows a new solution derived from the one of Table 1. Constructs 1, 5 and 6 are selected for change. They are assigned respectively blocks 52, 24, and 11 in place of 19, 16, and 83. The other constructs keep their transformation blocks. The number of blocks to change is a parameter of the SA algorithm (three in this example).

Fig 10. Change Operator in SA

In summary, regardless of the search heuristic, a change consists of assigning new block numbers to one or more dimensions. In other words, a new transformation solution $X_{i+1}$ is derived from the previous one $X_i$.

3.3 Evaluating Transformation Solutions

The fitness function quantifies the quality of a transformation solution, which basically is a 1-to-1 assignment of blocks from the example base to the constructs of the source model. As discussed in Section 2, the fitness function must consider the three following aspects for a construct j to transform:

- Adequacy of the assigned block to the construct j ($a_j$).
- Internal coherence of the individual construct transformation ($ic_j$).
- External coherence with the other construct transformations ($ec_j$).

In this context, we define the fitness function of a solution as the sum of the qualities of the n individual construct transformations. Formally,

$$f = \sum_{j=1}^{n} a_j \, (ic_j + ec_j) \qquad (1)$$

In this equation, $a_j$ represents the adequacy factor, with value 1 if the associated block contains at least one construct of the same type as the j-th construct, and value 0 otherwise. This factor penalizes the assignment of blocks that do not contain constructs of the same type as the construct to transform (by giving them a zero value). This is a way to reduce the search space.
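Equation 1, together with the parameter-matching ratio that defines the internal coherence in the next paragraph, can be sketched as follows. This is a minimal illustration, not the MOTOE implementation. It assumes that parameters are compared position by position and that class-name positions always count as matched, which is how the payable_by example discussed below is scored. The function names and the `name_positions` argument are hypothetical.

```python
from typing import List, Sequence, Tuple


def internal_coherence(params: Sequence[str], block_params: Sequence[str],
                       name_positions: Sequence[int]) -> float:
    """Share of matched parameters between two predicates of the same type.

    Assumption: positions listed in name_positions hold class names and
    are counted as matched regardless of the actual names.
    """
    matched = sum(
        1 for i, (p, q) in enumerate(zip(params, block_params))
        if i in name_positions or p == q
    )
    return matched / len(params)


def fitness(scores: List[Tuple[int, float, float]]) -> float:
    """Equation 1: sum of a_j * (ic_j + ec_j) over all constructs."""
    return sum(a * (ic + ec) for a, ic, ec in scores)


# Association payable_by versus the association found in block 42.
payable_by = ("1", "n", "1", "1", "_", "Command", "Bill")
in_block_42 = ("1", "n", "0", "n", "_", "Client", "Reservation")
ic_6 = internal_coherence(payable_by, in_block_42, name_positions=(5, 6))

if __name__ == "__main__":
    print(round(ic_6, 2))
    # One adequate construct (ec assumed to be 1) and one inadequate construct.
    print(round(fitness([(1, ic_6, 1.0), (0, 1.0, 1.0)]), 2))
```

An inadequate block contributes nothing to the sum whatever its coherence values, which is how the adequacy factor prunes the search space.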

The internal-coherence factor $ic_j$ measures the similarity, in terms of properties, between the construct to transform and the construct in the assigned block that has the same type. As shown in Section 3.1, the properties of the constructs are represented by the parameters of the predicates. Formally:

$$ic_j = \frac{\text{number of matched parameters in the predicates of the } j\text{-th construct}}{\text{total number of parameters in the predicates of the } j\text{-th construct}}$$

In general, a block assigned to a construct j contains more constructs than the one that matches j. The external-coherence factor $ec_j$ evaluates to what extent these constructs match the constructs that are linked to j in the source model. $ec_j$ is defined as

$$ec_j = \frac{\text{number of matched constructs related to the } j\text{-th construct}}{\text{total number of constructs related to the } j\text{-th construct}}$$

To illustrate the fitness calculation, consider again the example of Figure 8. The association payable_by (6th dimension) is defined by the predicate Association(1, n, 1, 1, _, Command, Bill), where the first four parameters indicate the multiplicities (1..n and 1..1), the fifth the name of the associative class if it exists, and the last two the source and target classes (Command and Bill). Consider a solution $s_1$ that assigns block 42 to this association:

Begin b42
Class(Client) : Table(Client).
Attribute(NClient, Client, unique) : Column(NClient, Client, pk).
Attribute(ClientName, Client, _) : Column(ClientName, Client, _).
Attribute(Address, Client, _) : Column(Address, Client, _).
Attribute(Tel, Client, _) : Column(Tel, Client, _).
Class(Reservation) : Table(Reservation).
Attribute(NReservation, Reservation, unique) : Column(NReservation, Reservation, pk).
Attribute(StartDate, Reservation, _) : Column(StartDate, Reservation, _).
Attribute(EndDate, Reservation, _) : Column(EndDate, Reservation, _).
Attribute(Region, Reservation, _) : Column(Region, Reservation, _).
Association(1, n, 0, n, _, Client, Reservation) : Column(N_Client, Reservation, fk).
End b42

In this case, $a_6$ (adequacy for the 6th construct) is equal to 1 because block 42 contains a predicate Association that relates classes Client and Reservation. Five of the seven parameters of this association predicate match those of payable_by (1, n, x, x, _, and the origin and destination class names). As a result, we have $ic_6 = 5/7 = 0.71$. Moreover, according to block 42, to be consistent with the transformation of payable_by, classes Bill and Command have to be mapped to tables. On the other hand, $s_1$ also assigns blocks 28 and 3 to classes Command (dimension 1) and Bill (dimension 2), respectively. These two blocks are defined as follows.

Begin b28
Class(Position) : Table(Position).
...
Class(Employee) : Table(Employee).
Association(0,1,,n,_, Position, Employee) : Column(IDPosition, Employee, fk).
End b28

Begin b3
Class(Manager) : Table(Manager).
Class(Employee) : Table(Employee).
Generalization(Employee, Manager) : Column(IDEmployee, Manager, fk).
End b3

In both blocks, classes are transformed into tables. Since this does not conflict with block 42, we have for the two related constructs $ec_6 = 2/2 = 1$. The contribution of the 6th construct to the fitness is therefore $a_6 \, (ic_6 + ec_6) = 1 \times (0.71 + 1) = 1.71$.

The fitness function also evaluates the completeness of a transformation indirectly. A solution that does not transform a subset of constructs will be penalized. Those constructs will have null values ($a_j$ being always equal to 0).

Finally, to make the values comparable across models with different numbers of constructs, a normalized version of the fitness function is used. For a particular construct, the fitness varies between 0 and 2 ($ic_j$ and $ec_j$ can both be equal to 1). Considering the n constructs, we normalize the fitness function as follows:

$$f_{nor} = \frac{f}{2n} \qquad (2)$$

We used this normalized fitness function for both PSO and SA.

3.4 Global Search (Particle Swarm Optimization)

PSO Principles

PSO is a parallel population-based computation technique [22]. It was originally inspired by the flocking behavior of birds, which emerges from very simple individual behaviors. Many variations of the algorithm have been proposed over the years, but they all share a common basis. First, an initial population (named swarm) of random solutions (named particles) is created. Then, each particle flies in the n-dimensional problem space with a

velocity that is regularly adjusted according to the composite flying experience of the particle and some, or all, of the other particles. All particles have fitness values that are evaluated by the objective function to be optimized. Every particle in the swarm is described by its position and velocity. A particle position represents a possible solution to the optimization problem, and its velocity represents the search distances and directions that guide the particle's flight. In this paper, we use the basic velocity and position update rules defined by [22]:

$$V_{id+1} = W \, V_{id} + C_1 \, var_1 \, (P_{id} - X_{id}) + C_2 \, var_2 \, (P_{gd} - X_{id}) \qquad (3)$$

$$X_{id+1} = X_{id} + V_{id} \qquad (4)$$

At each time step (iteration), $V_{id}$ represents the particle velocity and $X_{id}$ its position in the search space. $P_{id}$ (also called pbest, for local best solution) represents the i-th particle's best previous position, and $P_{gd}$ (also called gbest, for global best solution) represents the best position among all particles in the population. $W$ is an inertia term; it sets a balance between the global and local exploration abilities in the swarm. Constants $C_1$ and $C_2$ represent cognitive and social weights associated with the individual and global behavior, respectively. There are also two variables $var_1$ and $var_2$ (normally uniform in the interval [0, 1]) that represent stochastic acceleration during the attempt to pull each particle toward the pbest and gbest positions. For an n-dimensional search space, the i-th particle in the swarm is represented by an n-dimensional vector, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$. The velocity of the particle, pbest and gbest are also represented by n-dimensional vectors. Algorithm 1 summarizes the generic PSO procedure.
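The update rules of Equations 3 and 4 can be written compactly in Python. The sketch below is illustrative only: the function names are hypothetical, and one pair of random draws per dimension is an assumption (the text does not say whether $var_1$ and $var_2$ are drawn per particle or per dimension).

```python
import random
from typing import List


def update_velocity(v: List[float], x: List[float], pbest: List[float],
                    gbest: List[float], w: float, c1: float, c2: float,
                    rng: random.Random) -> List[float]:
    """Equation 3, applied dimension by dimension."""
    new_v = []
    for vd, xd, pd, gd in zip(v, x, pbest, gbest):
        # Assumption: fresh uniform draws in [0, 1) for every dimension.
        var1, var2 = rng.random(), rng.random()
        new_v.append(w * vd + c1 * var1 * (pd - xd) + c2 * var2 * (gd - xd))
    return new_v


def update_position(x: List[float], v: List[float]) -> List[float]:
    """Equation 4: translate the position by the velocity vector."""
    return [xd + vd for xd, vd in zip(x, v)]


if __name__ == "__main__":
    rng = random.Random(42)
    x = [28.0, 3.0, 21.0]
    v = update_velocity([0.0, 0.0, 0.0], x, pbest=[30.0, 3.0, 10.0],
                        gbest=[12.0, 7.0, 21.0], w=1.3, c1=1.0, c2=1.0, rng=rng)
    print(update_position(x, v))
```

When a particle already sits on both pbest and gbest, the two attraction terms vanish and only the inertia term moves it, which is why the value of W matters for continued exploration.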

    Create the initial population of particles (initialization)
    while termination criterion not met do
        for each particle do
            Evaluate fitness
            Update local/global best (if necessary)
            Update velocity and position
        end for
    end while
    Return the solution corresponding to the global best

Algorithm 1. PSO algorithm

PSO for Model Transformation

The PSO swarm is represented as a set of K particles, each defined by a position vector corresponding to the n constructs of the model to transform. For a particle position, the values of the vector elements are the mapping blocks selected for each construct. Our version of PSO starts by randomly generating the particle positions and velocities in the swarm. This is done by randomly assigning a block number to each of the n constructs (dimensions). Thus, the initial particle population represents K different possibilities (solutions) to transform the source model by combining blocks from the transformation examples. The fitness of each particle is measured by the fitness function defined by Equations 1 and 2. The particle with the highest fitness is memorized as the global best solution during the search process. At each iteration, the algorithm compares the fitness of each particle with those of the other particles in the population to determine the gbest position used to update the swarm. Then, for each particle, it compares its current position with pbest, and updates the latter if an improvement is found. The new positions affect the velocity of each

particle according to Equation 3. The algorithm iterates until the particles converge towards a good transformation solution of the source model. In our case, we define a maximum number of iterations after which we select the gbest as the transformation solution. The algorithm stops earlier if all the particles converge to the same solution.

The parameters in Equation 3 have an important effect on the search efficiency of the PSO algorithm. Acceleration constants $C_1$ and $C_2$ adjust the amount of tension in the system. Low values allow particles to roam far from target regions before being tugged back, while high values result in abrupt movement toward, or past, target regions [40]. Based on past research experience, we set both constants to 1.

Equations 3 and 4 may lead to large absolute values for $V_{id}$ and $X_{id}$, so that a particle may overshoot the problem space. Therefore, $V_{id}$ and $X_{id}$ should be confined to a maximum velocity $V_{max}$ and a maximum position $X_{max}$, such that

$$X_{max} = N; \qquad X_{id} = \min(\max(0, X_{id} + V_{id}), X_{max}) \qquad (5)$$

$V_{max}$ serves as a constraint to control the global exploration ability of a particle swarm. It should take values in the interval [-m, m], m being the number of blocks in the existing transformation examples. $X_{id}$ represents the block number assigned to a construct; it must be a positive integer. Hence, a real value for $X_{id}$ is rounded to the closest block number by dropping the sign and the fractional part.

The inertia weight ($W$) is another important parameter of the PSO search. A proper value for $W$ provides a balance between global and local exploration, and results in fewer iterations to find a solution on average. In practice, it is often linearly decreased through the course of the PSO, so that the PSO has more global search ability at the beginning of the run and more local search ability near the end. 
For the validation experiment in this paper, the parameter was set as follows [40]:

$$W = W_{max} - \frac{W_{max} - W_{min}}{iter_{max}} \times iter \qquad (6)$$

where $W_{max}$ is the initial value of the weighting coefficient, $W_{min}$ is the minimal value of the weighting coefficient, $iter_{max}$ is the maximum number of iterations, and $iter$ is the current iteration.
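The bounding of Equation 5 and the linear inertia schedule of Equation 6 can be sketched as follows. This is a sketch under assumptions, not the thesis code: the function names are hypothetical, the default values 1.3 and 0.3 are the ones reported later in the experimental setting, and the conversion to a block number follows the "drop the fractional part" rule stated above.

```python
def inertia(iteration: int, max_iter: int,
            w_max: float = 1.3, w_min: float = 0.3) -> float:
    """Equation 6: inertia weight decreasing linearly from w_max to w_min."""
    return w_max - (w_max - w_min) / max_iter * iteration


def clamp_velocity(v: float, m: int) -> float:
    """Keep a velocity component inside the interval [-m, m]."""
    return max(-float(m), min(float(m), v))


def to_block_number(x: float, v: float, x_max: int) -> int:
    """Equation 5 followed by truncation to an integer block number."""
    bounded = min(max(0.0, x + v), float(x_max))
    return int(bounded)  # the value is non-negative here, so int() drops the fraction


if __name__ == "__main__":
    for it in (0, 40, 80):
        print(it, round(inertia(it, 80), 3))
    print(to_block_number(250.0, clamp_velocity(400.0, 257), 257))
```

With 80 iterations the weight loses 0.0125 per iteration, so the first half of the run keeps W above 0.8 and favors exploration.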

3.5 Local Search (Simulated Annealing)

SA Principles

In the case of a quick run of PSO (only a few iterations), the best transformation solution can be improved by using another search heuristic. We propose in this work to use SA in combination with PSO. SA [17] is a search algorithm that gradually transforms a solution following the annealing principle used in metallurgy. The generic behavior of this heuristic is shown by Algorithm 2. After defining an initial solution, the algorithm iterates on the following three steps:

1. Determine a new neighboring solution.
2. Evaluate the fitness of the new solution.
3. Decide whether to accept the new solution in place of the current one, based on the fitness gain or loss.

    current_solution ← initial_solution
    current_cost ← evaluate(current_solution)
    T ← T_initial
    while (T > T_final) do
        for i = 1 to iterations(T) do
            new_solution ← move(current_solution)
            new_cost ← evaluate(new_solution)
            Δcost ← new_cost - current_cost
            if (Δcost ≤ 0 OR e^(-Δcost/T) > random()) then
                current_solution ← new_solution
                current_cost ← new_cost
            end if
        end for
        T ← next_temp(T)
    end while

Algorithm 2. SA algorithm

When Δcost < 0, the new solution has a lower cost than the current solution and it is accepted. For Δcost > 0, the new solution has a higher cost. In this case, the new solution is accepted with probability e^(-Δcost/T). The introduction of a stochastic element in the decision process avoids being trapped in a local minimum. Parameter T, called temperature, controls the probability of accepting a worse solution. T begins with a high value, for a high probability of accepting a solution during the early iterations. Then, it decreases gradually (cooling phase) to lower the acceptance probability as we advance in the iteration sequence. For each temperature value, the three steps are repeated for a fixed number of iterations.

One attractive feature of the simulated annealing algorithm is that it is problem-independent and can be applied to most combinatorial optimization problems [42, 12]. However, SA is usually slow to converge to a solution.

SA for Model Transformation

To obtain a more robust optimization technique, it is common to combine different search strategies in an attempt to compensate for the deficiencies of individual algorithms [12]. In our context, the search for a solution is done in two steps. First, a global search is quickly performed to locate the portion of the search space where good solutions are likely to be found. This is performed by PSO and results in a near-optimal solution. In the second step, the obtained solution is refined by the SA algorithm.

As described in Section 3.1, solutions are coded by assigning a block number to each construct to form a vector. The SA algorithm starts with an initial solution generated by a quick run of PSO. As for PSO, the fitness function presented in Section 3.3 measures the quality of the solution at the end of each iteration. A neighboring solution is generated by randomly changing a number of dimensions with new randomly selected blocks.
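The generic SA loop and the neighborhood move just described can be sketched in Python as follows. Several points are assumptions rather than details taken from the text: the cost is taken to be any function to minimize (for instance one minus the normalized fitness), the temperature is reduced geometrically by a factor `alpha`, block numbers range from 1 to `n_blocks`, and the best solution seen so far is tracked in addition to the current one. All function and parameter names are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple


def neighbor(solution: List[int], n_blocks: int, n_changes: int,
             rng: random.Random) -> List[int]:
    """Replace the blocks of n_changes randomly chosen dimensions."""
    new = list(solution)
    for d in rng.sample(range(len(new)), n_changes):
        new[d] = rng.randint(1, n_blocks)
    return new


def accept(delta_cost: float, temperature: float, rng: random.Random) -> bool:
    """Accept improvements always, degradations with probability exp(-delta/T)."""
    return delta_cost <= 0 or rng.random() < math.exp(-delta_cost / temperature)


def anneal(initial: List[int], cost: Callable[[List[int]], float],
           n_blocks: int, t_initial: float, t_final: float, alpha: float,
           iters_per_temp: int, n_changes: int,
           rng: random.Random) -> Tuple[List[int], float]:
    current, current_cost = list(initial), cost(initial)
    best, best_cost = list(current), current_cost  # addition to Algorithm 2
    t = t_initial
    while t > t_final:
        for _ in range(iters_per_temp):
            candidate = neighbor(current, n_blocks, n_changes, rng)
            candidate_cost = cost(candidate)
            if accept(candidate_cost - current_cost, t, rng):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = list(current), current_cost
        t *= alpha  # assumed geometric cooling
    return best, best_cost


if __name__ == "__main__":
    target = [3, 3, 3, 3]  # toy problem: count positions that differ from target
    toy_cost = lambda s: float(sum(a != b for a, b in zip(s, target)))
    print(anneal([1, 2, 3, 4], toy_cost, 5, 1.0, 0.1, 0.9, 20, 2, random.Random(1)))
```

Returning the best solution encountered rather than the last one is a common safeguard, since a late uphill move could otherwise discard a better transformation.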

The way in which we decrement the temperature is critical to the success of the algorithm. Theory states that we should allow enough iterations at each temperature for the system to stabilize at that temperature. Unfortunately, theory also states that the number of iterations needed at each temperature to achieve this might be exponential in the problem size. As this is impractical, we need to compromise. We can do a large number of iterations at a few temperatures, a small number of iterations at many temperatures, or strike a balance between the two. One way to decrement the temperature is to use a geometric cooling schedule [17]. The temperature is reduced using:

$$T_{i+1} = \alpha \, T_i \qquad (7)$$

where α is a constant less than 1. Experience has shown that α should be between 0.8 and 0.99, with better results being found in the higher end of the range. Of course, the higher the value of α, the longer it will take to decrement the temperature to the stopping criterion.

4 Evaluation and comparison

To evaluate the feasibility of our approach, we conducted an experiment on industrial data. We start by presenting our experimental setting. Then, we describe and discuss the obtained results. We compare in particular the results of PSO with those of the PSO-SA combination. Finally, we evaluate the impact of the example base size on transformation quality.

4.1 Setting

We used 12 examples of class-diagram to relational-schema transformations to build an example base $EB = \{\langle CLD_i, RS_i \rangle \mid 1 \le i \le 12\}$. The examples were provided by an industrial partner. As shown in Table 2, the size of the CLDs varied from 28 to 92 constructs, with an average of 58. Altogether, the 12 examples defined 257 mapping blocks. Because our industrial partner uses Rational Rose to derive relational schemas from UML class models, we did not have transformation blocks defined by experts during the transformation. 
For the needs of the experiment, we automatically extracted the transformation traces from XMI files produced by Rational Rose. Then, we manually partitioned the traces into blocks.

To evaluate the quality of transformations produced by MOTOE, we used a 12-fold cross validation procedure. For each fold, one class diagram $CLD_j$ is transformed by using the remaining 11 transformation examples ($EB_j = \{\langle CLD_i, RS_i \rangle \mid i \ne j\}$). Then the transformation result of each fold is checked for correctness. The correctness of a transformation $tcld_j$ was measured by two methods: automatic correctness (AC) and manual correctness (MC). Automatic correctness consists of comparing the derived transformation $tcld_j$ to the known $RS_j$, construct by construct. This method has the advantage of being automatic and objective. However, since a given $CLD_j$ may have different transformation possibilities, AC could reject a valid construct transformation because it yields a different RS from the one provided. To account for those situations, we also use MC, which manually evaluates $tcld_j$, here again construct by construct. In both cases, the correctness of a transformation is the proportion of constructs that are correctly transformed.

To set the parameters of PSO for the global search strategy, we started with values commonly found in the literature [6, 7] and adapted some of them to the particularities of the transformation problem. The final parameter values were set as follows:

- The swarm is composed of 40 particles. We found this number to provide a good balance between population diversity and the quantity of available examples.
- The inertia weight W is initially set to 1.3 and gradually decreased after each iteration (according to Equation 6) until it reaches 0.3.
- C_1 and C_2 are both equal to 1 to give equal importance to local and global search.
- The maximum number of iterations is set to twice the size of the population, i.e., 80. This is a generally accepted heuristic [40].

Since two different executions of a search heuristic may produce different results for the same model, we decided, for each of the 12 folds, to take the best result from 5 executions. 
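The evaluation protocol can be summarized by the following Python sketch. It only illustrates the fold construction and the construct-by-construct comparison behind AC. The names `leave_one_out` and `correctness` are hypothetical, and the dictionary encoding of a transformation (construct name mapped to its target element) is an assumption made for the example.

```python
from typing import Dict, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def leave_one_out(examples: List[T]) -> Iterator[Tuple[T, List[T]]]:
    """Yield (held-out example, remaining examples) for each fold."""
    for j, held_out in enumerate(examples):
        yield held_out, examples[:j] + examples[j + 1:]


def correctness(derived: Dict[str, str], reference: Dict[str, str]) -> float:
    """Proportion of constructs transformed exactly as in the reference."""
    if not reference:
        raise ValueError("the reference transformation is empty")
    correct = sum(1 for construct, target in reference.items()
                  if derived.get(construct) == target)
    return correct / len(reference)


if __name__ == "__main__":
    reference = {"Class(Person)": "Table(Person)",
                 "Class(Teacher)": "Table(Teacher)",
                 "Generalization(Person, Teacher)": "Column(IDPerson, Teacher, pfk)",
                 "Attribute(Level, Teacher, _)": "Column(Level, Teacher, _)"}
    derived = dict(reference)
    derived["Generalization(Person, Teacher)"] = "Table(Person_Teacher)"
    print(correctness(derived, reference))
    print(len(list(leave_one_out(list(range(12))))))
```

A construct left untransformed simply counts as incorrect here, which mirrors the way the measure is described as a proportion over all constructs.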
As mentioned previously, the initial particle positions are randomly generated. The range of values for each particle coordinate (construct) is defined as [0, MaxBlocks] where MaxBlocks is the total number of blocks extracted from the 11 examples of the fold. In our case, MaxBlocks is 257 minus the number of blocks of the fold example. For the hybrid search strategy, the SA algorithm was applied using the following parameters: The initial temperature of the process is randomly selected in the range [0, 1]

The geometric cooling coefficient α is The iteration interval for temperature update is (to account for SA's slowness). The number of dimensions to change for generating a neighboring solution is set to 2. This value offers a good balance with the large number of iterations. The stopping criterion (temperature threshold) is 0.1.

To quickly generate an initial solution for SA, we limited the number of particles to 10 and the number of iterations to 20 for PSO. With these parameter values, the transformation of the largest diagrams took only a few seconds of run time. We also tried other parameters and obtained similar results each time.

4.2 Results and Discussion

Results

Tables 2 and 3 respectively show the obtained correctness for each of the 12 folds when using global and hybrid search. Both automatic and manual correctness values were high and, as expected, manual evaluation yielded better correctness since it considered all correct transformations and not only the specific alternatives chosen by our industrial partner. We consider the correctness values (74% and 94% for the automatic and the manual validation, respectively) as relatively high given the absence of transformation rules and the limited number of examples used.

Table 2 shows the correctness of the generated transformations using the PSO heuristic. The automatic correctness measure had an average value of 73.3%, with most of the models transformed with at least 70% precision. The manual correctness measure was much greater, with an average value of 93.2%; this indicates that the proposed transformations were almost as correct as the ones given by experts. The worst model (SM9) had an acceptable MC of 87% and four models obtained an MC greater than 95%, with a value of 98.1% for SM8.

Table 2. 12-fold cross validation with PSO (columns: source model, number of constructs, fitness, AC, MC; one row per source model, followed by the average)

The hybrid search gave slightly better correctness results, as shown in Table 3. Both automatic and manual correctness were slightly better on average (93.4% for AC and 94.8% for MC). With regard to MC, the quality of 8 model transformations was improved while that of 4 was slightly degraded. For instance, MC for the worst transformed model (SM9) improved from 87% to 93.1%, while that for the best transformed model (SM5) decreased from 95.2% to 93%.

Table 3. 12-fold cross validation with PSO-SA (columns: source model, number of constructs, fitness, AC, MC; one row per source model, followed by the average)

Discussion

One observation to be made from the results in Tables 2 and 3 is that, with the exception of model SM7, hybrid search yielded better results than global search for the models with the highest numbers of constructs. This may be due to the fact that, when the number of dimensions is high, the search space is very large and the use of PSO leads to particle movement steps that can only approximate the location of the target solution. A more focused search consisting of global search followed by local exploration produces better results in this case. In contrast, for a smaller search space (fewer dimensions), area coverage by the particles is easier, and a global search appears to be more efficient at zeroing in on the solution.

Fig 11. Fitness improvement with SA after PSO initial pass

To better analyze the performance of the hybrid strategy, Figure 11 shows, for all models, the average final values of the fitness function after the quick global search with PSO and their corresponding values after the refinement made by SA. As one can see, substantial fitness improvement occurred (more than 50% in many cases) in each fold of the 12-fold cross validation. It appears then that the hybrid strategy offers a good compromise between correctness and execution time. Indeed, it improves the transformation correctness with only a slight increase in the execution time. The obtained results also show that our fitness function is a good estimator of transformation correctness.

An important consideration is the impact of the example base size on transformation quality. Drawn for SM7, the results of Figures 12 and 13 show that our approach also proposes transformation solutions in situations where only a few examples are available. When using the global search strategy, AC seems to grow steadily and linearly with the number of examples. For the hybrid strategy, the correctness seems to follow an exponential curve; it rapidly grows to acceptable values and then slows down. Indeed, AC improved from roughly 30% to 65% as the example base size went from 1 to 4 examples. Then, it grew only by an additional 15% as the size varied from 6 to 11 examples.

Fig 12. Example-size variation with PSO

Fig 13. Example-size variation with PSO-SA

When manually analyzing the results, we noticed that some of the 12 models had constructs not present in the other models. Those constructs were generally not transformed, as no adequate block could be found for them. However, some others were transformed by adapting the transformation of constructs of the same nature. This was the case, for instance, for an association with multiplicity (1..N, 1..N). Since the multiplicity elements are considered as parameters of the construct, the transformation of an association (0..N, N) was applied with a penalty on the fitness function. Although these few cases of adaptation improved the global correctness scores, we did not specifically address the issue at the current stage of our research.

Fig 14. Execution time (time in seconds versus number of constructs; series: PSO, SA)

Finally, since we viewed the transformation problem as a combinatorial problem addressed with heuristic search, it is important to contrast the correctness results with the execution time. We executed our algorithm on a standard desktop computer (Pentium CPU running at 2 GHz with 1 GB of RAM). The execution time is shown in Figure 14. As suggested by the curve shape, there were no performance problems when transforming models of up to 100 elements, which corresponds to small and medium models. It should be noted, however, that for small-dimensional problems the execution times may be longer than those obtained with rule-based tools. In any case, our approach is meant to apply to situations where rule-based solutions are normally not readily available.

5 Related Work

The work proposed in this paper can be related to three research areas in software engineering, of which the most relevant one is MT in the context of MDD. Some links can also be found with by-example and search-based software engineering, but our concerns are different, as will be discussed below. As a result, only a comparison to alternatives in the first area is warranted.

5.1 Model Transformation

Several MT approaches can be found in the literature (see, for example, the classifications given in [24, 44]). Czarnecki and Helsen [24] distinguish between two main types: model-to-model and model-to-code. They describe five categories of the former: graph-transformation-based [25], relational [11], structure-driven [21], direct-manipulation and hybrid. They use various criteria to analyze them, such as the consideration of Model Driven Architecture (MDA) as a basis for transformation, the complexity and scalability of the transformation mechanism, the use or not of annotations, the level of automation, and the languages and implementation techniques used.

In general, the reported approaches are based on empirically obtained rules [2, 3], in contradistinction to block transformation in MOTOE. In rule-based approaches, the rules are defined in terms of metamodels, while our blocks relate to specific problems, with a structure that varies from one problem to another. In existing transformation approaches, such as graph transformation [25, 58, 59], a transformation rule consists of two parts: a left-hand side (LHS) and a right-hand side (RHS). The LHS accesses the source model, whereas the RHS expands in the target model. By comparison, each block in MOTOE contains a transformation of source elements (LHS) to their equivalent target elements (RHS). However, in a graph-transformation approach, potential conflicts between transformation elements are verified with pre- and postconditions.
In our case, pre- and postconditions are replaced by the fitness function, which ensures the coherence of the transformations. In rule-based approaches, a rule is applied to a specific location within its source scope. Since there may be more than one match for a rule within a given source scope, an application strategy is needed. The strategy could be deterministic, non-deterministic or even interactive [60]. For example, a deterministic strategy could exploit some standard traversal strategy (such as depth-first) over the containment hierarchy in the source. In our work, the transformation possibilities (blocks) are randomly chosen, with no strategy for rule application (rule scheduling, etc.).

Transformation rules are usually designed to have a functional character: given some input in the source model, they produce a concrete result in the target model [18]. A declarative rule (i.e., one that only uses declarative logic and/or patterns) can often be applied in the inverse direction. However, since different inputs may lead to the same output, the inverse of a rule may not be a function. We have the same problem in our approach, since blocks are only defined in one direction (from CLD to RS, for example). To ensure a bidirectional transformation property, we need to apply our methodology to examples from the other direction.

If we define cognitive complexity as the level of difficulty of designing a model transformation, we believe that collecting and recording transformation examples may be less difficult than producing and maintaining consistent transformation rule sets. This is consistent with a recent trend in industry, where we find several tools to record transformations and automatically generate transformation traceability records [61].

The traditional approach for implementing model transformations is to specify transformation rules and automate the transformation process by using a model transformation language [23]. Most of these languages are powerful enough to implement large-scale and complex model transformation tasks. However, the transformation rules are usually defined at the metamodel level, which requires a clear and deep understanding of the abstract syntax and semantic interrelationships between the source and target models.
In some cases, domain concepts may be hidden in the metamodel and difficult to unveil [2, 3] (e.g., some concepts are hidden in attributes or association ends, rather than being represented as first-class entities). These implicit concepts may make writing transformation rules difficult.

To help address the previous challenges, an alternative approach called Model Transformation By Example (MTBE) was proposed in [13, 15]. In it, users are asked to build a prototypical set of interrelated mappings between the source and target model instances, and the metamodel-level transformation rules are then semi-automatically generated. Because users configure the mappings at the instance level, without knowing any details about the metamodel definition or the hidden concepts, and because the rules are generated for them, specifying model transformations becomes simpler.

Varró and Balogh [13, 15] propose a semi-automated process for MTBE using Inductive Logic Programming (ILP). The principle of their approach is to derive transformation rules semi-automatically from an initial prototypical set of interrelated source and target models. Another similar work is that of Wimmer et al. [31], who derive ATL transformation rules from examples of business process models. Both works use semantic correspondences between models to derive rules. Their differences include the fact that [31] presents an object-based approach that finally derives ATL rules for model transformation, while [13] derives graph transformation rules. Another difference is that they respectively use abstract versus concrete syntax: Varró uses ILP, whereas Wimmer relies on an ad hoc technique. Both approaches provide a semi-automatic generation of model transformation rules that needs further refinement by the user. Also, since both approaches are based on semantic mappings, they are more appropriate in the context of exogenous model transformations between different metamodels. Unfortunately, the generation of rules to transform attributes is not well supported in most MTBE implementations.

Our model is different from both previous approaches to MTBE. We do not create transformation rules to transform a source model; we use the examples directly instead. As a result, our approach is independent of any source or target formalism.

Recently, a similar approach to MTBE, called Model Transformation By Demonstration (MTBD), was proposed [62].
Instead of the MTBE idea of inferring the rules from a prototypical set of mappings, users are asked to demonstrate how the model transformation should be done, through direct editing (e.g., add, delete, connect, update) of the source model, so as to simulate the transformation process. A recording and inference engine has been developed, as part of a prototype called MT-Scribe, to capture all user operations and infer a user's intention during a model transformation task. A transformation pattern is generated from the inference, specifying the preconditions of the transformation and the sequence of operations needed to realize the transformation. This pattern can be reused by automatically matching the preconditions in a new model instance and replaying the necessary operations to simulate the model transformation process. However, this approach needs a large number of simulated patterns to be efficient, and it requires a high level of user intervention. In fact, the user must choose the suitable transformation pattern. Finally, the authors do not show how MTBD can be useful to transform an entire source model and only provide examples of transforming model fragments.

Some other metamodel matching works can also be considered as variants of by-example approaches. Garcia-Magarino et al. [66] propose an approach that generates transformation rules between two metamodels that satisfy some constraints introduced manually by the developer. In [65], the authors propose to automatically capture some transformation patterns in order to generate matching rules at the metamodel level. This approach is similar to MTBD, but it is used at the metamodel level. The difference between this category of approaches and our proposal is that we do not generate transformation rules, and MOTOE does not need the source and target metamodels to be specified as input.

To conclude, the previous problems limit the applicability of MTBE/MTBD for some transformation problems. In such situations, MOTOE may lead to more relevant solutions.

In our approach, the definition of transformation examples is based on the use of traceability information [61]. Traceability usually allows tracing artifacts within a set of chained operations, where the operations may be performed manually (e.g., crafting a software design for a set of software requirements) or with automated assistance (e.g., generating code from a set of abstract descriptions). For example, Triple Graph Grammars (TGG) [63] explicitly maintain the correspondence of two graphs by means of correspondence links.
These correspondence links play the role of traceability links that map elements of one graph to elements of the other graph and vice versa. With TGG, one has to explicitly describe the correspondence between the source and target models, which is difficult if the transformation is complex and intermediate models are required during the transformation. In [52], a traceability framework for Kermeta is discussed. This framework supports the creation of traces throughout a transformation chain. Marvie describes a transformation composition framework [64] that allows manual creation of linkings (traces). However, this framework does not support the automatic generation of traces.

In conclusion, a large part of the work on traceability in MDE uses it for detecting model inconsistency and for fault localization in transformations. In MOTOE, this is not the goal, as the purpose is to use trace information as input to automate the transformation process. The trace information (model correspondence) between a source and a target model defines a transformation example that is decomposed into independent blocks, as explained before.

Our approach is different from case-based reasoning methods, where the level of granularity must be the example as a whole, i.e., a transformation example [8]; in our case, we do not select the most similar example and adapt its transformation; rather, we aggregate the best transformation possibilities coming from different examples.

5.2 By-Example Software Engineering

The approach proposed in this paper is based on using past transformation examples. Various such by-example approaches have been proposed in the software engineering literature. However, the problems they address differ from ours in both nature and objectives. The closest work to ours is program transformation by demonstration [1, 5], in which a user manually changes program examples while a monitoring plug-in to the development environment records the changes. Then, the recorded data are analyzed to create general transformations that can be reused in subsequent programs. However, the overall process is not automated and requires frequent interaction with the user, and the generated transformation patterns are found via a different algorithm than the one used by MOTOE.

5.3 Search-Based Software Engineering

Our approach is inspired by contributions in Search-Based Software Engineering (SBSE) [26, 28]. As the name indicates, SBSE uses a search-based approach to solve optimization problems in software engineering.
Once a software engineering task is framed as a search problem, by defining it in terms of solution representation, fitness function and solution change operators, there are many search algorithms that can be applied to solve that problem. To the best of our knowledge, inspired among others by the road map paper of Harman [28], the idea of treating model transformation as a combinatorial optimization problem to be solved by a search-based approach was not studied before our proposal in [29]. For this reason, we cannot compare our approach to existing works in SBSE, because the application domain is very different.

6 Summary and Conclusion

In summary, we described MOTOE, a novel approach to automate MT using heuristic-based search. MOTOE uses a set of transformation examples to derive a target model from a source model. The transformation is seen as an optimization problem where different transformation possibilities are evaluated and, for each possibility, a quality is associated depending on its conformance with the examples at hand. The search space is explored with two methods. In the first one, we use PSO with transformation solutions generated from the examples at hand as particles. Particles progressively converge toward a good solution by exchanging and adapting individual construct transformation possibilities. In the second method, a partial run of PSO is performed to derive an initial solution. This solution is then refined using a local search with SA. The refinement explores neighboring solutions obtained by trying individual construct transformation possibilities derived from the example base. In both methods, the quality of a solution considers the adequacy of construct transformations as well as their mutual consistency.

We illustrated MOTOE with the transformation of UML class diagrams to relational schemas. In this context, we conducted a validation with real industrial models. The experimental results clearly indicate that the derived models are comparable to those proposed by experts (correctness of more than 90% with manual evaluation). They also revealed that some constructs were correctly transformed although no transformation examples were available for them.
This was possible because the approach uses syntactic similarity between construct types to adapt their transformations. We also showed that the two methods used for the space search produced comparable results when properly applied, and that PSO alone is enough with small-to-medium models, while the combination PSO-SA is more suitable when the size of the models to transform is larger. For both methods, our transformation process derives a good quality transformation in an acceptable execution time. Finally, the validation study showed that the quality of MT improves with the number of examples. However, it reaches a stable score after as few as nine examples. Also, there were no performance problems when transforming models of up to 100 elements, which corresponds to small and medium models.

Our proposed method also has limitations. First, MOTOE's performance depends on the availability of transformation examples, which could be difficult to collect. Second, the generation of blocks from the examples is done manually in our present work; we could partially automate this task using decomposition heuristics. Third, due to the nature of our solution, i.e., an optimization technique, the transformation process could be time consuming for large models. Finally, as we use heuristic algorithms, different executions for the same source model could lead to different target models. Nevertheless, we showed in our validation that solutions that have high fitness values also have good correctness. Moreover, this is close to what happens in the real world, where different experts could propose different target models.

From the applicability point of view, our approach can theoretically be applied to the transformation of any pair of formalisms. To practically assess this claim, we are currently experimenting with other formalisms, such as sequence diagrams to Petri nets. We also plan to work on adapting our approach to other transformation problems such as code generation (model-to-code), refactoring (code-to-code), and reverse engineering (code-to-model). The refactoring problem also has the advantage of exploring endogenous transformations, where source and target models conform to the same metamodel.

Regarding the quality evaluation of transformations, the fitness function we used could be improved. In this work, we gave equal importance to all constructs.
In the real world, some construct types may be more important than others.

References

1. A. Cypher (ed.). Watch What I Do: Programming by Demonstration. The MIT Press (1993).
2. A. Egyed. Automated abstraction of class diagrams. ACM Trans. Softw. Eng. Methodol. 11(4): (2002).

Chapter 4: Dynamic Model Transformation by Example

Introduction

After presenting the case of static models in Chapter 3, we describe in this chapter our solution to the problem of automating dynamic model transformation using examples. We adapt our approach MOTOE to the transformation of sequence diagrams to colored Petri nets. This contribution was accepted for publication in the Sixth European Conference on Modelling Foundations and Applications (ECMFA 2010) [74]. The paper, entitled "Example-based Sequence Diagrams to Colored Petri Nets Transformation using Heuristic Search", is presented next.

4.2 Sequence Diagrams to Colored Petri Nets Transformation by Example

Example-based Sequence Diagrams to Colored Petri Nets Transformation using Heuristic Search

Marouane Kessentini 1, Arbi Bouchoucha 1, Houari Sahraoui 1 and Mounir Boukadoum 2

1 DIRO, Université de Montréal, {kessentm, bouchoua,
2 DI, Université du Québec à Montréal

Abstract. Dynamic UML models like sequence diagrams (SD) lack sufficient formal semantics, making it difficult to build automated tools for their analysis, simulation and validation. A common approach to circumvent the problem is to map these models to more formal representations. In this context, many works propose a rule-based approach to automatically translate SD into colored Petri nets (CPN). However, finding the rules for such SD-to-CPN transformations may be difficult, as the transformation rules are sometimes difficult to define and the produced CPN may be subject to state explosion. We propose a solution that starts from the hypothesis that examples of good transformation traces of SD-to-CPN can be useful to generate the target model. To this end, we describe an automated SD-to-CPN transformation method which finds the combination of transformation fragments that best covers the SD model, using heuristic search in a base of examples. To achieve our goal, we combine two algorithms for global and local search, namely Particle Swarm Optimization (PSO) and Simulated Annealing (SA). Our empirical results show that the new approach allows deriving the sought CPNs with at least equal performance, in terms of size and correctness, to that obtained by a transformation rule-based tool.

Keywords: Model transformation, Petri nets, Sequence diagrams, Search-based software engineering

1 Introduction

Model Transformation plays an important role in Model Driven Engineering (MDE) [1]. The research efforts by the MDE community have produced various languages and tools, such as ATL [2], KERMETA [3] and VIATRA [4], for automating transformations between different formalisms.
One major challenge is to automate these transformations while preserving the quality of the produced models [1, 6].

Many transformation contributions target UML models [1, 6]. From a transformation perspective, UML models can be divided into two major categories: static models, such as class diagrams, and dynamic models, such as activity and state diagrams [7]. Models of the second category are generally transformed for validation and simulation purposes. This is because UML dynamic models, such as sequence diagrams (SDs) [7], lack sufficient formal semantics [8]. This limitation makes it difficult to build automated tools for the analysis, simulation, and validation of those models [9]. A widely accepted approach to circumvent the problem uses concomitant formal representations to specify the relevant behavior [11]; Petri Nets (PNs) [10] are well suited for the task. PNs can model, among others, the behavior of discrete and concurrent systems. Unlike SDs, PNs can derive new information about the structure and behavior of a system via analysis. They can be validated, verified, and simulated [11]. Moreover, they are suitable for visualization (graphical formalism) [11]. These reasons motivate the work to transform UML SDs to PNs.

SD-to-PN transformation may not be straightforward to realize, for two main reasons [29]. First, defining transformation rules can be difficult since the source and target languages have constructs with different semantics; therefore, 1-to-1 mappings are not sufficient to express the semantic equivalence between constructs. The second problem is the risk of a state explosion [11]. Indeed, when transformation rules are available for mapping dynamic UML models to PNs, systematically applying them generally results in large PNs [11]. This could compromise the subsequent analysis tasks, which are generally limited by the number of PN states. Obtaining large PNs is not usually related to the size of the source models [29]. In fact, small sequence diagrams containing complex structures like references, negative traces or critical regions can produce large PNs. To address this problem, some work has been done to produce reduction rules [35].

In this paper, we explore a solution based on the hypothesis that traces of valid transformations of SD-to-PN (performed manually, for instance), called transformation examples, can be used by similarity to derive a PN from a particular SD. In this context, our approach, inspired by the Model-Transformation-by-Examples (MTBE) school [12, 13, 14], helps define transformations without applying rules. Because it reuses existing valid model transformation fragments, it also limits the size of the generated models. More concretely, to automate SD-to-PN transformations, we propose to adapt the MOTOE approach [14, 15].
MOTOE views a model transformation as an optimization problem where solutions are combinations of transformation fragments obtained from an example base. However, the application of MOTOE to the SD-to-PN transformation problem is not straightforward. MOTOE was designed for and tested with static-diagram transformations such as class-diagrams-to-relational schemas [14, 15]. The transformation of a dynamic diagram is more difficult [8] because, in addition to ensuring structural (static) coherence, it should guarantee behavioral coherence in terms of time constraints and weak sequencing.

For instance, the transformation of a SD message depends on the order (sequence) inside the diagram and the events within different operands (parallel merge between the behaviors of the operands, choice between possible behaviors, etc.).

This paper adapts and extends MOTOE to support SD-to-CPN transformation. The new version, dmotoe, preserves behavioral coherence. We empirically show that the new approach derives the correct models, and that the obtained CPNs have a significantly lower size than those obtained with a rule-based tool [16] taken for comparison.

The remainder of this paper is structured as follows. In Section 2, we provide an overview of the proposed approach for automating SD-to-CPN transformations and discuss its rationale in terms of problem complexity. Section 3 describes the transformation algorithm based on the combined PSO and SA search heuristics. An evaluation of the algorithm is explained and its results are discussed in Section 4. Section 5 is dedicated to the related work. Finally, concluding remarks and future work are provided in Section 6.

2 SD-to-CPN Transformation Overview

A model transformation takes a model to transform as input, the source model, and produces another model as output, the target model. In our case, the source model is a UML sequence diagram and the target model is a colored Petri net. First, we describe the principles of our approach and discuss its rationale given the complexity of the transformation problem.

2.1 Overview

dmotoe takes a SD to transform and a set of transformation examples from an example base as inputs, and generates an equivalent CPN as output. The generation process can be viewed as selecting the subset of the transformation fragments (mapping traces) in the example base that best matches the constructs of the SD according to a similarity function. The outcome is a CPN consisting of an assembly of building blocks (formally defined below).
The quality of the produced target model is measured by the level of conformance of the selected fragments to structural and temporal constraints, i.e., by answering the following three questions: 1) Did we choose the right blocks? 2) Did they fit together? 3) Did we perform the assembly in the right order?

As many block assembly schemes are possible, the transformation process is a combinatorial optimization problem where the dimensions of the search space are the constructs of the SD to transform. A solution is determined by the assignment of a transformation fragment (block) to each SD construct. The search is guided by the quality of the solution in terms of its internal coherence (individual construct vs. associated blocks), external coherence (between blocks) and temporal coherence (message sequence).

To explore the solution space, the search is performed in two steps. First, we use a global heuristic search by means of the PSO algorithm [18] to reduce the search space size and select a first transformation solution. Then, a local heuristic search is done using the SA algorithm [19] to refine this solution.

In order to provide the details of our approach, we define some terms. A construct is a source or target model element; for example, messages or objects in a SD. An element may contain properties that describe it, such as its name. Complex constructs may contain sub-constructs. For example, a message could have a guard that conditions its execution.

A transformation example (TE) is a mapping of constructs from a particular SD to a CPN. Formally, we view a TE as a triple <SMD, TMD, MB> where SMD denotes the source model (SD in our case), TMD denotes the target model (optimal CPN in our case), and MB is a set of mapping blocks that relate subsets of constructs in SMD to their equivalents in TMD. The base of examples is a set of transformation examples. The transformation examples can be collected from different experts or by automated approaches.

Each TE is viewed as a set of blocks. A block defines a transformation trace between a subset of constructs in the source model and a subset of constructs in the target model.
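The definitions above can be made concrete with a small data-structure sketch. It is only an illustration under stated assumptions: the class names (`Construct`, `Block`, `TransformationExample`), the helper `candidate_blocks`, and the use of "same construct kind" as the similarity test are all hypothetical simplifications, not the representation or similarity function used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Construct:
    kind: str  # construct type, e.g. "Message" or "Place" (hypothetical encoding)
    name: str


@dataclass(frozen=True)
class Block:
    block_id: str
    source: tuple  # source-side constructs covered by this mapping trace
    target: tuple  # their target-side equivalents


@dataclass(frozen=True)
class TransformationExample:
    smd: tuple  # source model description
    tmd: tuple  # target model description
    mb: tuple   # mapping blocks relating parts of smd to parts of tmd


def candidate_blocks(construct, example_base):
    """Blocks whose source side holds a construct of the same kind.

    Kind equality stands in for the similarity function (an assumption).
    """
    return [b for te in example_base for b in te.mb
            if any(c.kind == construct.kind for c in b.source)]


# A toy example base with a single TE made of two blocks.
msg, act = Construct("Message", "m"), Construct("Activity", "a")
trans, place = Construct("Transition", "t"), Construct("Place", "p")
b1 = Block("B1", (msg,), (trans,))
b2 = Block("B2", (act,), (place,))
example_base = [TransformationExample((msg, act), (trans, place), (b1, b2))]

# A solution assigns one block to each construct of the model to transform.
new_message = Construct("Message", "request")
candidates = candidate_blocks(new_message, example_base)
solution = {new_message: candidates[0]}
```

In this sketch a solution is simply a mapping from each source construct to one chosen block, which mirrors the idea that each construct is one dimension of the search space.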
Constructs that should be transformed together are grouped into the same block. For example, a message m that is sent from an object A to an object B cannot be mapped independently from the mapping of A to B. In our examples, blocks correspond to concrete traces left by experts when transforming models. They are not general rules, as they involve concept instances (e.g., a message m) instead of concepts (e.g., message concept). In other words, where transformation rules are expressed in terms of meta-models, blocks are expressed in terms of concrete models.

Fig. 1. (a) Example of SD (source model) and (b) its equivalent CPN (target model).

In a SD-to-CPN transformation, blocks correspond to transformation traces of loops (loop), alternatives (alt), concurrent interactions (par), activation boxes, and messages (see the UML 2.0 SD specification for more details about these constructs [7]). In the case where the constructs are embedded, a single block is created for the higher-level construct. Blocks can be derived automatically from the transformation trace of the whole model.

An example of a SD-to-CPN transformation is presented in Figure 1. For legibility reasons, we present an example containing only one complex fragment, a loop. In the validation section, we will use more complex SDs that involve different CPN constructs. The SD in Figure 1.a contains 10 constructs that represent 3 objects, 3 messages, 1 loop and 3

activation boxes. Three blocks are defined (see footnote 2): B51 for message Arrival of a new Order and activation box Wait; B52 for the loop with guard [Busy], message Start order treatment, and activation box Treatment in progress; and B53 for message Send and activation box Storage. Notice that a single block, B52, is defined for the loop because the activation box is inside the loop. In block B51, for example, Arrival of a new Order was transformed by an expert into the transition New order, and Wait into the place Wait() (Figure 1.b).

Footnote 2. For traceability purposes, blocks are sequentially numbered. For instance, the 3 blocks of this example TE i are B51 to B53. Those of TE i+1 are B54 to Bxx, and so on. When a solution is produced, it is easy to determine which examples contributed to it.

To manipulate them more conveniently, the models (source or target) are described by sets of predicates, each corresponding to a construct (or a sub-construct) type. The order in which predicates are written is important in the case of a dynamic model. The predicate types for SDs are:

Object (ObjetName, ObjetType);
Message (MessageType, Sender, Receiver, MessageName, ActivityName);
Activity (ActivityName, ObjectName, Duration, MessageNumber);
Loop (StartMessageName, EndMessageName, ConditionValue);
Par (StartMessageName, EndMessageName, ConditionValue, ConditionType);

Similarly, those of CPN are:

Place (PlaceName);
Transition (TransitionName);
Input (TransitionName, PlaceSourceName)
Output (TransitionName, PlaceDestinationName)

For example, the message Arrival of a new Order in Figure 1.a can be described by

Message (Synchronic, _, Order, ArrivalOfNewOrder, Wait);

The predicate indicates that Arrival of a new Order is a synchronic message sent to Order (with _ meaning no sender) and connected to activation box Wait. Mapping traces are also expressed using predicate correspondences with the symbol ":". In Figure 1.b, for instance, block B51 is defined as follows:

Begin B51
Message (Synchronic, _, Order, ArrivalOfNewOrder, Wait). : Transition (NewOrder, Coulor1), Input(NewOrder, _), Output(NewOrder, Wait).
Activity (Wait, Order, 10, 2). : Place (Wait).
End B51

In the absence of transformation rules, a construct can be transformed in many ways, each having a degree of relevance. A SD $M_i$ to transform is characterized by its description $SMD_i$, i.e., a set of predicates. Figure 2 shows a source model with 6 constructs to transform, represented by circles. A transformation solution consists of assigning to each construct a mapping block, i.e., a transformation possibility from the example base (blocks are represented by rectangles in Figure 2). A possibility is considered to be adequate if the block maps a similar construct.

Fig. 2. Transformation solution as blocks-to-constructs mapping

2.2 Transformation Complexity

Our approach is similar to case-based reasoning [21], with the difference that we do not select and adapt the whole transformation of a similar SD. Instead, we combine and adapt fragments of transformations coming from the transformations of several SDs.

The transformation of a SD $M_i$ with $n$ constructs, using a set of examples that globally define $m$ possibilities (blocks), consists of finding the subset of the $m$ possibilities that best transforms each of the $n$ constructs of $M_i$. In this context, $m^n$ possible combinations have to be explored. This value can quickly become huge. If we limit the possibilities for each construct to only blocks that contain similar constructs, the number of possibilities becomes $m_1 \times m_2 \times m_3 \times \dots \times m_n$, where each $m_i \le m$ represents the number of blocks containing constructs similar to construct $i$. Although the number of possibilities is reduced, it could still be very large for big SDs. A sequence diagram with 50 constructs, each having 8 or more mapping possibilities, necessitates exploring at least $8^{50}$ combinations. Considering these magnitudes, an exhaustive search cannot be used within a

reasonable time frame. This motivates the use of a heuristic search when a more formal approach is either not available or hard to deploy.

3 Heuristic-based Transformation

We describe in this section the adaptation of two heuristics, PSO [18] and SA [19], to automate SD-to-CPN transformation. These methods each follow a generic strategy to explore the search space. When applied to a given problem, they must be specialized by defining: (1) the coding of solutions, (2) the operators that allow moving in the search space, and (3) the fitness function that measures the quality of a solution. In the remainder of this section we start by giving the principles of PSO and SA. Then, we describe the three above-mentioned heuristic components.

3.1 Principle

To obtain a more robust optimization technique, it is common to combine different search strategies in an attempt to compensate for deficiencies of the individual algorithms [20]. In our context the search for a solution is done in two steps. First, a global search with PSO is quickly performed to locate the portion of the search space where good solutions are likely to be found. In the second step, the obtained solution is refined with a local search performed by SA.

PSO, Particle Swarm Optimization, is a parallel population-based computation technique proposed by Kennedy and Eberhart [18]. The PSO swarm (population) is represented by a set of K particles (possible solutions to the problem). A particle i is defined by a position coordinate vector $X_i$ in the solution space. Particles improve themselves by changing positions according to a velocity function that produces a translation vector. The improvement is assessed by a fitness function. The particle with the highest fitness is memorized as the global best solution (gbest) during the search process. Also, each particle stores its own best position (pbest) among all the positions reached during the search process.
At each iteration, all particles are moved according to their velocities (Equation 1). The velocity $V_i$ of a particle i depends on three factors: its inertia corresponding to the previous velocity, its pbest, and the gbest. Factors

are weighted respectively by $W$, $C_1$, and $C_2$. The importance of the local and global position factors varies and is set at each iteration by a random function. The weight of inertia decreases during the search process. The derivation of $V_i$ is given by Equation 2. After each iteration, the individual pbests and the gbest are updated if the new positions bring higher qualities than the ones before.

$$X_i' = X_i + V_i' \qquad (1)$$

$$V_i' = W\,V_i + C_1\,rand()\,(pbest_i - X_i) + C_2\,rand()\,(gbest - X_i) \qquad (2)$$

Here the primes denote the updated position and velocity. The algorithm iterates until the particles converge towards a unique position that determines the solution to the problem.

Simulated Annealing (SA) [19] is a local search algorithm that gradually transforms a solution following the annealing principle used in metallurgy. Starting from an initial solution, SA uses a pseudo-cooling process where a pseudo temperature is gradually decreased. For each temperature, the following three steps are repeated for a fixed number of iterations: (1) determine a new neighboring solution; (2) evaluate the fitness of the new solution; (3) decide on whether to accept the new solution in place of the current one based on the fitness function and the temperature value. Solutions are accepted if they improve quality. When the quality is degraded, they can still be accepted, but with a certain probability. The probability is high when the temperature is high and the quality degradation is low. As a consequence, quality-degrading solutions are easily accepted at the beginning of the process when the temperatures are high, but with more difficulty as the temperature decreases. This mechanism helps the search avoid becoming trapped in a local optimum.

3.2 Adaptation

To adapt PSO and SA to the SD-to-CPN transformation problem, we must define the following: a solution coding suitable for the transformation problem, a neighborhood function to derive new solutions, and a fitness function to evaluate these solutions.
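The two mechanisms described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pso_step` and `sa_accept` are hypothetical, the particle is moved with its freshly updated velocity (the usual PSO ordering), and the acceptance probability uses the Metropolis form exp(-degradation / temperature), which is the common choice for SA but is not spelled out in the text.

```python
import math
import random


def pso_step(x, v, pbest, gbest, w, c1, c2, rng=random):
    """One PSO update of a single particle (Equations 1 and 2).

    x, v, pbest and gbest are equal-length lists of floats.
    Returns the new position and the new velocity.
    """
    # The local and global factors are re-weighted randomly at each iteration.
    r1, r2 = rng.random(), rng.random()
    new_v = [
        w * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
        for xi, vi, pb, gb in zip(x, v, pbest, gbest)
    ]
    # Assumption: the particle moves with the updated velocity.
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v


def sa_accept(current_fitness, new_fitness, temperature, rng=random):
    """Decide whether SA replaces the current solution by the new one.

    Improving solutions are always accepted. Degrading ones are accepted
    with a probability that grows with the temperature and shrinks with
    the degradation (assumed Metropolis criterion).
    """
    if new_fitness >= current_fitness:
        return True
    degradation = current_fitness - new_fitness
    return rng.random() < math.exp(-degradation / temperature)
```

With this acceptance rule, a small degradation at a high temperature is almost always accepted, while a large degradation at a low temperature is almost always rejected, which matches the behaviour described for the cooling process.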
As stated in Section 2, we model the search space as an n-dimensional space where each dimension corresponds to one of the n constructs of the SD to transform. A solution is then a point in that space, defined by a coordinate vector whose elements are the block numbers from the example base assigned to the n constructs. For instance, the transformation of the

SD model shown in Figure 3 will generate a 7-dimensional space that accounts for the two objects, three messages and two activities. One solution in this space, shown in Table 1, suggests that message CheckDriver should be transformed according to block B19, activity Positioning according to block B7, etc. Concretely, a solution is implemented as a vector where constructs are the dimensions (the elements) and block numbers are the element values. The association between a construct and a block does not necessarily mean that a transformation is possible, i.e., that the block perfectly matches the context of the construct. This is determined by the fitness function described under Fitness Function below.

The proposed coding is valid for both heuristics. In the case of PSO, as an initial population, we create K solution vectors with a random assignment of blocks. SA, in contrast, starts from the solution vector produced by PSO.

Fig. 3. Example of source model

Table 1. Solution representation

Dimension   Construct                Block number
1           Message(CheckDriver)     B19
2           Activity(Positioning)    B7
3           Message(GetStarted)      B51
4           Activity(Treatment)      B105
5           Message(Confirmation)    B16
6           Object(Driver)           B83
7           Object(Car)              B33

Change Operators. Modifying solutions to produce new ones is the second important aspect of heuristic search. Unlike coding, change is implemented differently by the PSO and SA heuristics. While PSO sees change as movement in the search space driven by a velocity function, SA sees it as random coordinate modifications.

In the case of PSO, a translation (velocity) vector is derived according to Equation 2 and added to the position vector. For example, the solution of Table 1 may produce the new solution shown in Figure 4. The velocity vector V has a translation value for each element (real values). When summed with the block numbers, the results are rounded to integers. They are also bounded by 1 and the maximum number of available blocks.

[Figure: position vector X plus velocity vector V gives the new position vector X']

Fig. 4. Change Operator in PSO
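As a rough sketch of the coding and of the PSO change operator described above, the snippet below stores the solution from the solution-representation table as a list of block numbers and applies a velocity vector with rounding and bounding. The helper names, the velocity values, and the upper bound of 224 blocks (the total reported for the whole example base in the validation) are illustrative assumptions; the values of the original Figure 4 are not reproduced here.

```python
import random

# One block number per construct, in the order of the table's dimensions.
CONSTRUCTS = [
    "Message(CheckDriver)", "Activity(Positioning)", "Message(GetStarted)",
    "Activity(Treatment)", "Message(Confirmation)", "Object(Driver)",
    "Object(Car)",
]
solution = [19, 7, 51, 105, 16, 83, 33]

NUM_BLOCKS = 224  # assumed size of the example base, for illustration only


def random_solution(n_constructs, num_blocks, rng=random):
    """Initial PSO particle: a random block number for every construct."""
    return [rng.randint(1, num_blocks) for _ in range(n_constructs)]


def move(position, velocity, num_blocks):
    """Add a real-valued velocity, round to integers, and keep every
    coordinate between 1 and the number of available blocks."""
    return [
        min(num_blocks, max(1, round(x + v)))
        for x, v in zip(position, velocity)
    ]


# Hypothetical velocity vector (real values, one per dimension).
velocity = [3.4, -9.8, 180.2, -2.6, 0.3, 11.7, -40.0]
new_solution = move(solution, velocity, NUM_BLOCKS)
print(dict(zip(CONSTRUCTS, new_solution)))
```

Coordinates that would leave the valid range are simply clamped: in this example the second and last constructs fall back to block 1 and the third is capped at the highest block number.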

Fig. 5. Change Operator in SA

For SA, the change operator randomly chooses l dimensions (l < n) and replaces their assigned blocks by randomly selected ones from the example base. For instance, Figure 5 shows a new solution derived from the one of Table 1. Constructs 1, 5 and 6 are selected to be changed. They are assigned respectively blocks 52, 24, and 11 instead of 19, 16, and 83. The other constructs remain unchanged. The number of blocks to change is a parameter of SA (three in this example). In our validation, we set it to 4 considering that the average number of constructs per SD is 36.

Fitness Function. The fitness function quantifies the quality of a transformation solution. As explained in the previous paragraph, solutions are derived by random assignment of new blocks to some constructs. The quality of a transformation solution is then the sum of the individual transformation qualities of the n constructs of the SD. To evaluate whether an assigned block $B_i$ is a good transformation possibility for construct $C_j$, the fitness function first evaluates the adequacy, i.e., does $B_i$ contain a construct $C_k$ of the same type as $C_j$? If the answer is no, the assigned block is unsuitable. Otherwise, the fitness function checks the three following coherence aspects: (1) internal coherence (what is the degree of similarity between $C_j$ and $C_k$ in terms of properties?), (2) external coherence (to what extent does the transformation proposed by $B_i$ contradict the transformations of constructs related to $C_j$?), and (3) temporal coherence (to what extent does the transformation proposed by $B_i$ preserve the temporal constraints of message sequences in the SD?). The fitness function is formally defined as follows:

$$f = \sum_{j=1}^{n} a_j\,(ic_j + ec_j + tc_j) \qquad (3)$$

where $a_j$ is the adequacy of assigning $B_i$ to $C_j$ (1 if $B_i$ is adequate, 0 otherwise), and $ic_j$, $ec_j$, and $tc_j$ are respectively the internal, external, and temporal coherences of the assignment.

$ic_j$ is defined as the ratio between the number of parameters of the predicate $P_j$ representing $C_j$ that match the parameters of the associated construct in block $B_i$ and the total number of parameters of $P_j$. Consider the SD example shown in Figure 3. Message GetStarted is defined by predicate Message(Synchronic, Driver, Car, GetStarted, Positioning). This predicate indicates that the message GetStarted, which is synchronic, is sent by object Driver to Car from the activity Positioning. The solution in Table 1 assigns the block B51 to this message. Block B51 is described in section 2.1 as follows:

Begin B51
Message (Synchronic, _, Order, ArrivalOfNewOrder, Wait). : Transition (NewOrder, Coulor1), Input(NewOrder, _), Output(NewOrder, Wait).
Activity (Wait, Order, 10, 2). : Place (Wait).
End B51

The adequacy $a_3$ of the transformation of GetStarted (3rd construct) is equal to 1 because block B51 also contains a Message predicate (ArrivalOfNewOrder). The parameters of the two messages are similar except for the sender, which is not an object in the case of ArrivalOfNewOrder. As a result, internal coherence $ic_3 = 4/5 = 0.8$ (four parameters that match out of 5).

For external coherence $ec_j$, let $RCons_j$ be the set of constructs related to $C_j$ and $RConsM_{ij}$ the subset of constructs in $RCons_j$ whose transformations are consistent with that of $C_j$, i.e., we compare the transformation proposed by the block assigned to $C_j$ with the ones suggested by the blocks assigned to the related constructs. $ec_j$ is calculated as the ratio between the sizes of $RConsM_{ij}$ and $RCons_j$. In our example, GetStarted involves three constructs (sender, receiver, and activity). According to B51, only the Positioning activity is related (has a predicate) and should be

transformed into a place similarly to the Wait activity. In the solution of Table 1, the construct Positioning is assigned the block B7 (dimension 2 of the solution vector). This block is defined as follows:

Begin B7
Message (Asynchronic, User, Printer, NewPrint, Progress). : Transition (NewPrint, Coulor7), Input(NewPrint, _), Output(NewPrint, Progress).
Activity (Progress, Printer, 8, 1). : Place (Progress).
End B7

According to B7, Positioning should also be mapped to a place. Thus there is no conflict between B51 and B7, and $ec_3 = 1$ (1/1).

$tc_j$ represents the temporal coherence. It reflects the time constraint specific to dynamic models. To preserve the temporal coherence, we ensure that the transformations of the elements that are contiguous to $C_j$ preserve the temporal semantics. To this end, we first consider the block $B_{inc}$ that includes $C_j$ and the blocks $B_{pre}$ and $B_{fol}$ that respectively precede and follow $B_{inc}$. Although the model to transform is not in the example base, we identify blocks with only the source part according to the rules given in Section 2.1. Then we consider the block $B_i$, assigned to $C_j$ by the evaluated solution, and the two blocks $B_{pre\_i}$ and $B_{fol\_i}$ preceding and following $B_i$. $tc_j$ is obtained by comparing $B_{pre}$ to $B_{pre\_i}$, $B_{inc}$ to $B_i$, and $B_{fol}$ to $B_{fol\_i}$. For example, let $P_{pre}(k)$ be the predicate having the k-th position in $B_{pre}$ and $P_{pre\_i}(k)$ be the predicate having the k-th position in $B_{pre\_i}$. The number of pairs of predicates $PMatch(B_{pre}, B_{pre\_i})$ that match in the two blocks is defined as

$$PMatch(B_{pre}, B_{pre\_i}) = \left|\{\, (P_{pre}(k), P_{pre\_i}(k)) \mid P_{pre}(k) = P_{pre\_i}(k) \,\}\right| \qquad (4)$$

$tc_j$ is then defined as follows:

$$tc_j = \frac{PMatch(B_{pre}, B_{pre\_i}) + PMatch(B_{inc}, B_i) + PMatch(B_{fol}, B_{fol\_i})}{\max(|B_{pre}|, |B_{pre\_i}|) + \max(|B_{inc}|, |B_i|) + \max(|B_{fol}|, |B_{fol\_i}|)} \qquad (5)$$

Figure 6 shows an example of the calculation of $tc_j$. Going back to the example of message GetStarted, to derive $tc_3$, we identify in the SD to transform two blocks: $B_s$, which contains GetStarted, and $B_{s-1}$, which precedes $B_s$. Consequently, block B51 will be compared to $B_s$. $B_s$ contains a message followed by an activity and another message. B51 contains a message followed by an activity. Then, two pairs of predicates match and the max size between the two blocks is 3. As B51 has no preceding block, we consider that no match exists with $B_{s-1}$, and the corresponding max size is that of $B_{s-1}$, i.e., 2 for the message and the activity. Finally, as $B_s$ has no following block, no match exists with B52, which follows B51. We then take as max size the size of B52 (3, corresponding to the loop, the message, and the activity). According to Equation 5, $tc_3 = (0+2+0)/(2+3+3) = 0.25$. This temporal coherence factor is standard and works with any combined fragments of SDs.

Fig. 6. Temporal coherence

The fitness function does not need considerable effort to be adapted for other transformations (e.g., state machines to PNs). However, the block definition must be adapted to the semantics of the new transformation.
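The GetStarted example can be replayed with a small Python sketch of the internal and temporal coherence terms and of the sum in Equation 3. Two simplifications are assumptions made here and are not stated in the text: a parameter is taken to "match" when it is filled in both predicates or absent ("_") in both, and two predicates at the same position are taken to match when they have the same predicate type. All function names are hypothetical.

```python
def internal_coherence(params, block_params, absent="_"):
    """ic_j: share of the parameters of P_j that match the block's predicate.
    Assumed rule: both filled, or both absent, counts as a match."""
    matches = sum(
        1 for p, q in zip(params, block_params)
        if (p == absent) == (q == absent)
    )
    return matches / len(params)


def pmatch(block_a, block_b):
    """Equation 4: number of positions k holding the same predicate type."""
    return sum(1 for a, b in zip(block_a, block_b) if a == b)


def temporal_coherence(pre, pre_i, inc, b_i, fol, fol_i):
    """Equation 5, with blocks given as ordered lists of predicate types.
    A missing neighbouring block is passed as an empty list."""
    matched = pmatch(pre, pre_i) + pmatch(inc, b_i) + pmatch(fol, fol_i)
    size = (max(len(pre), len(pre_i)) + max(len(inc), len(b_i))
            + max(len(fol), len(fol_i)))
    return matched / size if size else 0.0


def fitness(terms):
    """Equation 3: terms is a list of (a_j, ic_j, ec_j, tc_j) tuples."""
    return sum(a * (ic + ec + tc) for a, ic, ec, tc in terms)


# GetStarted (construct 3) against the Message predicate of block B51.
ic3 = internal_coherence(
    ("Synchronic", "Driver", "Car", "GetStarted", "Positioning"),
    ("Synchronic", "_", "Order", "ArrivalOfNewOrder", "Wait"),
)
# B_{s-1} vs. no block before B51, B_s vs. B51, no block after B_s vs. B52.
tc3 = temporal_coherence(
    pre=["Message", "Activity"], pre_i=[],
    inc=["Message", "Activity", "Message"], b_i=["Message", "Activity"],
    fol=[], fol_i=["Loop", "Message", "Activity"],
)
print(ic3, tc3, fitness([(1, ic3, 1.0, tc3)]))
```

Under these assumptions the sketch gives the same values as the worked example (0.8 for the internal coherence and 0.25 for the temporal coherence), so the contribution of GetStarted to the fitness is 0.8 + 1 + 0.25.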

4 Validation

To evaluate the feasibility of our approach, we conducted an experiment on the transformation of 10 UML sequence diagrams³. We collected the transformations of these 10 sequence diagrams from the Internet and textbooks and used them to build an example base $EB = \{\langle SD_i, CPN_i \rangle \mid i = 1, 2, \dots, 10\}$. We ensured by manual inspection that all the transformations are valid. The size of the SDs varied from 16 to 57 constructs, with an average of 36. Altogether, the 10 examples defined 224 mapping blocks. The 10 sequence diagrams contained many complex fragments: loop, alt, opt, par, region, neg and ref.

To evaluate the correctness of our transformation method, we used a 10-fold cross validation procedure. For each fold, one sequence diagram $SD_j$ is transformed using the remaining 9 transformation examples. Then, the transformation result for each fold is checked for correctness using two methods: automatic correctness (AC) and manual correctness (MC). Automatic correctness consists of comparing the derived CPN to the known CPN, construct by construct. This measure has the advantage of being automatic and objective. However, since a given $SD_j$ may have different transformation possibilities, AC could reject a valid construct transformation because it yields a different CPN from the one provided. To prevent this situation, we also perform a manual evaluation of the obtained CPN. In both cases, the correctness is the proportion of constructs that are correctly transformed. In addition to correctness, we compare the size of the obtained CPNs with the ones obtained by using the rule-based tool WebSPN for mapping UML diagrams to CPN [16]. The size of a CPN is defined by the number of constructs.

Figure 7 shows the correctness for the 10 folds. Both automatic and manual correctness had values greater than 90% on average (92.5% for AC and 95.8% for MC).
Although few examples were used (9 for each transformation), all the SDs had a transformation correctness greater than 90%, with 3 of them perfectly transformed.

³ The reader can find in this link all the materials used in our experiments
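The evaluation protocol described above (each diagram transformed from the nine others, then scored construct by construct) can be sketched as follows. The representation of a transformation as a dictionary from source constructs to target constructs, and both function names, are assumptions made for illustration only.

```python
def leave_one_out(example_base):
    """Yield (held_out, training) pairs: each example is transformed
    using only the remaining examples."""
    for j, held_out in enumerate(example_base):
        yield held_out, example_base[:j] + example_base[j + 1:]


def correctness(derived, reference):
    """Proportion of constructs whose derived transformation equals the
    reference one (the automatic-correctness idea, construct by construct)."""
    if not reference:
        return 0.0
    correct = sum(
        1 for construct, target in reference.items()
        if derived.get(construct) == target
    )
    return correct / len(reference)
```

As noted in the text, a score computed this way can penalize a valid alternative transformation, which is why a manual check complements it.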

Figure 7 also shows that, in general, the best transformations are obtained with smaller SDs. After 36 constructs, the quality degrades slightly but steadily. This may indicate that the transformation correctness of complex SDs necessitates more examples in general. However, the largest and most complex SD (57 constructs and 19 complex fragments) has an MC value of 96%.

[Figure: Correctness vs. diagram size and complexity; axes: diagram size, correctness, number of complex fragments; series: AC-dMOTOE, MC-dMOTOE, MC-WebSPN]

Fig. 7. Correctness of the transformations

In addition, our results show that the correctness of our transformations is equivalent to that of WebSPN. Another interesting finding during the evaluation is that, in some cases, a higher fitness value does not necessarily imply higher transformation correctness. This was the case for the transformations of SD 3 (fitness of 82% and MC = 98%) and SD 5 (fitness of 92% and MC = 93%). This is probably due to the fact that we assign the same weight to simple constructs such as messages and complex constructs such as loops in the fitness function. Indeed, temporal coherence is more difficult to assess for complex constructs.

Manual inspection of the generated CPNs showed that the different transformation errors are easy to fix. They do not require considerable effort and the majority of them are related to the composition of complex fragments. For example, as we did not have an example that

mapped two alts situated in a loop, the optimization technique used one that contained only one alt in a loop. Almost the same errors were made by WebSPN, including the case of two alts in a loop.

When developing our approach, we conjectured that the example-based transformation produces CPNs smaller than those obtained by systematic rule application. Table 2 compares the CPN sizes obtained by using dMOTOE and WebSPN for the 10 transformations. In all cases, a reduction in size occurs when using dMOTOE, with an average reduction of 28.3% in comparison to WebSPN. Although the highest reduction corresponded to the smallest SD, the reductions for larger diagrams were important as well (e.g., 39% for 36 constructs, 38% for 39 constructs, and 29% for 76 constructs). These reductions should be viewed in the context of the correctness equivalence between our approach and WebSPN.

Table 2. CPN size comparison

[Table: columns Size(WebSPN), Size(dMOTOE), Variation (%); the ten data rows are not recoverable]

Average Reduction: 28.3%

The obtained results confirm our assumption that systematic application of rules results in CPNs larger than needed and that reusing valid transformed examples attenuates the state explosion problem. As for execution time, we ran our algorithm on a standard desktop computer (Pentium CPU running at 2 GHz with 2 GB of RAM). The execution time was of the order of a few seconds and increased linearly with the number of constructs, indicating that our approach is scalable from the performance standpoint.

5 Related Work

The work proposed in this paper crosscuts two main research areas: model transformation and traceability in the context of MDD. In [5], five categories of transformation approaches were identified: graph-based [22], relational [23], structure-driven [24], direct-manipulation, and hybrid. One conclusion to be drawn from studying the existing MT approaches is that they are often based on empirically obtained rules [25]. Recently, traceability gained popularity in model transformation [26]. Usually, trace information is generated manually and stored as models [27]. For example, Marvie et al. [28] propose a transformation composition framework that allows the manual creation of links (traces). In the studied approaches and frameworks based on traceability, trace information is used in general for detecting model inconsistency and fault localization in transformations. On the other hand, dMOTOE uses traces to automate the transformation process.

More specifically, in the case of SD-to-PN, several approaches were proposed in addition to WebSPN. In [29], the authors describe a meta-model for the SD-to-PN mapping. It defines rules involving concepts of the meta-models representing respectively sequence diagrams and Petri nets. One of the limitations of this approach is that temporal coherence is not addressed explicitly. Additionally, the meta-model representing the rules tends to generate large PNs, as noticed by the authors.
In [11], a set of rules to transform UML 2.0 SDs into PNs is proposed. The goal is to animate SDs using the operational semantics of PNs. In our case, we can generate the structure of the targeted CPN in an XMI file that can

be used as input for some simulation tools like CPN tools [38]. Other UML dynamic diagrams are also considered for the transformation to PNs. For example, use case constructs are mapped to PN using a multi-layer technique [8]. There are other research contributions that concentrate on supporting validation and analysis of UML statecharts by mapping them to Petri nets of various types [36, 37]. Unlike our approach, this work uses information extracted from different UML diagrams to produce the Petri nets. A general conclusion on the transformation of dynamic models to PNs is that, in addition to the fact that no consensual transformation rules are used, a second step is usually required to reduce the size of the obtained PNs.

dMOTOE uses the by-example principle to transform models, but what we propose is completely different from other contributions to model transformation by example (MTBE). Varro and Balogh [12, 13] propose a semi-automated process for MTBE using Inductive Logic Programming (ILP). The principle of their approach is to derive transformation rules semi-automatically from an initial prototypical set of interrelated source and target models. Wimmer et al. [30] derive ATL transformation rules from examples of business process models. Both works use semantic correspondences between models to derive rules, and only static models are considered. Moreover, in practice, a large number of transformation learning examples may be required to ensure that the generated rules are complete and correct. Both approaches provide a semi-automatic generation of model transformation rules that needs further refinement by the user. Also, since both approaches are based on semantic mappings, they are more appropriate in the context of exogenous model transformations between different metamodels. Unfortunately, the generation of rules to transform attributes is not well supported in most MTBE implementations. Our model is different from both previous approaches to MTBE.
We do not create transformation rules to transform a source model; instead, we use the examples directly. As a result, our approach is independent from any source or target metamodels. Recently, a similar approach to MTBE, called Model Transformation By Demonstration (MTBD), was proposed [34]. Instead of the MTBE idea of inferring the rules from a prototypical set of mappings, users are asked to demonstrate how the model transformation should be done by directly editing (e.g., add, delete, connect, update) the

model instance to simulate the model transformation process step by step. This approach needs a large number of simulated patterns to give good results and, for instance, MTBD cannot be used to transform an entire source model.

6 Conclusion

In this paper, we propose the approach dMOTOE to automate SD-to-CPN transformation using heuristic search. dMOTOE uses a set of existing transformation examples to derive a colored Petri net from a sequence diagram. The transformation is seen as an optimization problem where different transformation possibilities are evaluated and, for each possibility, a quality is associated depending on its conformance with the examples at hand. The approach we propose has the advantage that, for any source model, it can be used when rules generation is difficult. Another interesting advantage is that our approach is independent from source and target formalisms; aside from the examples, no extra information is needed. Moreover, as we reuse existing transformations, the obtained CPNs are smaller than those obtained by transformation rules.

We have evaluated our approach on ten sequence diagrams. The experimental results indicate that the derived CPNs are comparable to those defined by experts in terms of correctness (average value of 96%). Our results also reveal that the generated CPNs are smaller than the ones generated by the tool WebSPN [16].

Although the obtained results are very encouraging, many aspects of our approach could be improved. Our approach currently suffers from the following limitations: 1) in the case of SD-to-PNs transformation, it provides less clean semantics than a rules-based approach; 2) coverage of complex fragments examples is needed for completeness and to ensure consistently good results; 3) the base of examples is difficult to collect, especially for complex and not widely used formalisms; 4) the fitness function could weight complex constructs more heavily when evaluating a solution.
In addition, a validation on a larger example base is planned to better assess the adaptation capability of the approach, and we can compare the sizes of the reachability graphs of the CPNs produced by dMOTOE and WebSPN in order to treat the richer behaviors (in fact, a bigger net is not necessarily worse in some cases). In a broader perspective, we plan to experiment and extend dMOTOE to

Part 2: Endogenous Transformation by Example

In part 1 of this thesis, we described our contributions to exogenous transformation. In this part, we focus on endogenous transformations. In an endogenous transformation, the source and target meta-models are the same. Furthermore, such transformations require two steps: 1) identify the elements to transform in the source model; 2) transform the identified elements.

Endogenous transformations are principally related to the following maintenance activities: 1) Optimization (a transformation aimed at improving certain operational qualities, e.g., performance, while preserving the semantics of the software); 2) Refactoring (a change to the internal structure of software that improves certain quality characteristics, such as understandability, modifiability, reusability, modularity, and adaptability, without changing its observable behaviour).

In this thesis, we focus on code transformation in order to improve quality. We distinguish two steps for this task: 1) detecting refactoring opportunities that correspond to design defects; 2) applying some refactoring methods (move method, add super class, etc.) to modify the defective classes. The second step is out of the scope of this work and we will only address the first one.

The first step, detecting design defects, is important. In fact, detecting and fixing defects is a difficult, time-consuming, and to some extent manual process. The number of outstanding software defects typically exceeds the resources available to address them. In many cases, mature software projects are forced to ship with both known and unknown defects for lack of the development resources to deal with every one. For example, one Mozilla developer claimed that every day, almost 300 bugs and defects appear... far too much for only the Mozilla programmers to handle [53]. To help cope with this magnitude of activity, several automated detection techniques have been proposed [27].

Although there is a consensus that it is necessary to detect design anomalies, our experience with industrial partners has shown that there are many open issues that need to be addressed when developing a detection tool. Design anomalies have definitions at different levels of abstraction. Some of them are defined in terms of code structure, others in terms of developer/designer intentions, or in terms of code evolution. These definitions are in many cases ambiguous and incomplete. However, they have to be mapped into rigorous and deterministic rules to make the detection effective. In the following, we discuss some of the open issues related to the detection.

How to decide if a defect candidate is an actual defect? Unlike software bugs, there is no general consensus on how to decide if a particular design violates a quality heuristic. There is a difference between detecting symptoms and asserting that the detected situation is an actual defect.

Are long lists of defect candidates really useful? Detecting dozens of defect occurrences in a system is not always helpful. In addition to the presence of false positives that may create a rejection reaction from development teams, the process of using the detected lists, understanding the defect candidates, selecting the true positives, and correcting them is long, expensive, and not always profitable.

What are the boundaries? There is a general agreement on extreme manifestations of design defects. For example, consider an OO program with a hundred classes, one of which implements all the behavior while all the others are only classes with attributes and accessors. There is no doubt that we are in the presence of a Blob. Unfortunately, in real industrial systems, we can find many large classes, each one using some data classes and some regular classes. Deciding which ones are Blob candidates depends heavily on the interpretation of each analyst.

How to define thresholds when dealing with quantitative information? For example, Blob detection involves information such as class size. Although we can measure the size of a class, an appropriate threshold value is not trivial to define. A class considered large in a given program/community of users could be considered average in another.

How to deal with the context? In some contexts, an apparent violation of a design principle is considered a consensual practice. For example, a class Log responsible for maintaining a log of events in a program, used by a large number of classes, is a common and acceptable practice. However, from a strict defect definition, it can be considered as a class with abnormally large coupling.

In addition to these issues, the process of defining rules manually is complex, time-consuming and error-prone. Indeed, the list of all possible defect types can be very large and each type requires specific rules.

To address or circumvent the above-mentioned issues, we propose two automated detection approaches that are completely different from the state of the art. For the first one, instead of characterizing each symptom of each possible defect type, we apply the principle of negative selection, the process used by biological immune systems to identify antigens. An immune system does not try to detect specific bacteria and viruses. Rather, it starts by detecting what is abnormal, i.e., what is different from the healthy cells of the body. The more something is different, the more it is considered risky. For the second approach, we propose a solution that uses design defects found in previously, manually inspected projects, called defect examples, to generate new detection rules based on combinations of software quality metrics. In short, the detection rules are automatically derived by an optimization process, based on the Harmony search algorithm [56], that exploits the available examples.
In the next chapter, we provide the details of our proposal for automating design defects detection based on the immune system metaphor.

Chapter 5: An Immune-Inspired Approach for Design Defects Detection

Introduction

We describe our solution to the problem of design defects detection. This problem is considered as an endogenous transformation problem where, as mentioned earlier, two steps are needed. We focus on the first step, which consists of finding the elements to transform. Our solution is based on the use of well-designed code examples and on considering each deviation from these examples as risky. This mechanism corresponds to the immune system process where foreign substances are detected via deviations from normal cell behaviour. This contribution was accepted for publication in the 25th IEEE/ACM International Conference on Automated Software Engineering (ASE 2010) [82]. The paper, entitled "Deviance from Perfection is a Better Criterion than Closeness to Evil when Identifying Risky Code", is presented in the next section.

5.2 Design Defects Detection by Example: An Immune System Metaphor
