Aside from getting used to the new APIs, I've run into new challenges on the 2.x stream.
My main concern is the scalability of the current matching algorithm. I was surprised to see that test cases that used to take 15 seconds now need 20 minutes to finish.

I've debugged the code and compared it to 1.x, and found some differences that look to me like possible culprits for this poor performance. Apparently, the scope of comparison has increased heavily in the new match engine. Inspecting ProximityIndex.findTheClosest(Comparison, EObject, Side, Side, boolean) shows that the engine now tries to match each element against every other possible element of the same type in the model (ByTypeIndex takes care of classifying by type). You can check how the "storageToSearchFor" list contains all elements in the model with the same type as the object to match.
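To make that pattern concrete, here is a simplified, self-contained sketch of the by-type candidate scan described above. The names ByTypeIndex, findTheClosest and storageToSearchFor come from the real code, but the Element class and the toy distance function are stand-ins for illustration only, not EMF Compare APIs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ByTypeScanSketch {

    // Stand-in for an EObject: just a type name and some payload.
    public static class Element {
        public final String typeName;
        public final int payload;
        public Element(String typeName, int payload) {
            this.typeName = typeName;
            this.payload = payload;
        }
    }

    // Mimics what ByTypeIndex does: classify every element by its type.
    public static Map<String, List<Element>> indexByType(List<Element> all) {
        Map<String, List<Element>> index = new HashMap<>();
        for (Element e : all) {
            index.computeIfAbsent(e.typeName, k -> new ArrayList<>()).add(e);
        }
        return index;
    }

    // Toy distance; the real engine computes an edit distance between EObjects.
    static int distance(Element a, Element b) {
        return Math.abs(a.payload - b.payload);
    }

    // findTheClosest scans every same-typed element ("storageToSearchFor"),
    // so each lookup is O(n) and matching the whole model is O(n^2).
    public static Element findTheClosest(Element toMatch, Map<String, List<Element>> index) {
        List<Element> storageToSearchFor = index.getOrDefault(toMatch.typeName, List.of());
        Element best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Element candidate : storageToSearchFor) {
            int d = distance(toMatch, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

Nothing limits the inner loop here, which is the scalability concern being raised: every element of the same type is a candidate for every lookup.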

Opposed to that, 1.x used content-matching strategies to narrow the scope of the comparison. This means we would only compare elements that are around the same point of the model's structure, by doing a structural/content match. There was even a search window parameter to narrow or broaden the match scope further when necessary. Although this could lead to some false positives, it was far more efficient.
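The 1.x-style narrowing can be sketched like this: assuming elements are kept in a structural traversal order, only candidates within a window around the element's own position are considered. SEARCH_WINDOW is the real 1.x option name; the helper below is purely illustrative:

```java
import java.util.List;

public class SearchWindowSketch {
    // Only consider candidates whose position in the (structurally ordered)
    // opposite model lies within +/- window of the element's own position.
    // Cheap, but an element moved further than the window away is missed,
    // hence the false positives mentioned above.
    public static <T> List<T> candidatesInWindow(List<T> ordered, int position, int window) {
        int from = Math.max(0, position - window);
        int to = Math.min(ordered.size(), position + window + 1);
        return ordered.subList(from, to);
    }
}
```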

For big models, the 2.x behaviour ends up performing almost 2 * n * n comparisons in the worst case, where n is the number of elements, with a potential factor of 2 due to the double-check mechanism. Obviously, the new approach is safer in the sense that nothing is left out of the match calculations, but it appears not to scale on big models.

I was surprised to realise this, since scalability has always been a concern for the project.

So, the questions here are:
- Is what I've found true, or am I missing something?
- Why is the model flattened and compared, and why was the content match of 1.x discarded?
- Any recommended way to avoid this exponential number of comparisons?
- Are there any possible optimizations to know?

Yes, it's pretty much true, except that the current implementation tries to avoid "comparing too much" by looking for perfectly identical objects with a cheap comparison before going further. As you said, this approach allows us to have "better" results but can end up being more "explosive" from a computational point of view. So far our efforts have been focused on merge consistency and better scalability through scope reduction; the content match got rewritten by reusing the Diff logic itself and assumes most of the model elements are not changing.
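The "cheap check before the expensive distance" idea reads roughly like this. This is an illustrative sketch only, not the actual implementation: strings stand in for model elements, String.equals plays the role of the cheap identity check, and a naive Levenshtein distance plays the role of the costly "count the differences" pass:

```java
public class ShortCircuitDistanceSketch {

    // Cheap comparison first; fall back to the costly edit distance
    // only when the objects are not perfectly identical.
    public static int distance(String a, String b) {
        if (a.equals(b)) {
            return 0; // perfectly identical: skip the expensive pass entirely
        }
        return expensiveEditDistance(a, b);
    }

    // Naive O(n*m) Levenshtein distance, standing in for the expensive
    // difference counting done by the real engine.
    static int expensiveEditDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }
}
```

The short-circuit only pays off when many elements really are identical, which is exactly the assumption being discussed here.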

Longer term, we intend to get a proper distance out of this and then have the ability to sort storageToSearchFor; that would give us the best of both worlds and let us quickly find the closest element.
This approach has not been implemented yet for lack of time, but I would gladly review the work of anybody wanting to speed things up.

"Why is the model flattened and compared, and why was the content match of 1.x discarded?"

One of the key requirements of 2.x is the ability to compute differences operating on arbitrary scopes, without relying on the containment hierarchy. Furthermore, we need to be sure to find the "best match" whenever that's possible.

- Any recommended way to avoid this exponential number of comparisons?

The figures you give make me think our "assumptions" are not valid at all in your case and you end up in the worst case.
Is your model "completely different"? Are you using FeatureMaps?
ProximityIndex has a ProximityMatchStats object keeping track of the computation made during the process; what is it giving you?

- Are there any possible optimizations to know?

Many come to mind: we could keep the objects in their insertion order and introduce a SEARCH_WINDOW again, reducing the "explosion"; we could make sure the match process does not compare the references of the EObject, avoiding the LCS computation during the match; or we could re-implement a match engine mimicking the 1.x behavior... In any case it would be interesting for us to understand what exactly is going on with your models to cause this.

If I find something useful to share with you guys, I'll gladly do so. So far, the only thing I could come up with is some sort of heuristic to reduce the size of storageToSearchFor by taking only elements with the same parent as the object subject to be matched. Not working perfectly anyway, but reduced the computation time.
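That same-parent heuristic can be sketched as a simple pre-filter over storageToSearchFor. The Element class and its parentId field are hypothetical stand-ins; real code would compare EObject.eContainer() instead:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class SameParentFilterSketch {

    // Stand-in for an EObject that knows its container.
    public static class Element {
        public final String parentId;
        public final String name;
        public Element(String parentId, String name) {
            this.parentId = parentId;
            this.name = name;
        }
    }

    // Keep only candidates sharing the parent of the object to match.
    // This shrinks the candidate list drastically, at the cost of missing
    // elements that were moved to another container (hence "not perfect").
    public static List<Element> sameParentOnly(List<Element> storageToSearchFor, Element toMatch) {
        List<Element> narrowed = new ArrayList<>();
        for (Element candidate : storageToSearchFor) {
            if (Objects.equals(candidate.parentId, toMatch.parentId)) {
                narrowed.add(candidate);
            }
        }
        return narrowed;
    }
}
```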

Quote:

- Any recommended way to avoid this exponential number of comparisons?
The figures you give make me think our "assumptions" are not valid at all in your case and you end up in the worst case.
Is your model "completely different"? Are you using FeatureMaps?
ProximityIndex has a ProximityMatchStats object keeping track of the computation made during the process; what is it giving you?

We are using the OMG's KDM metamodel. The test-case model I'm using is not that big (a 17 MB XMI file), and there are not many differences. Neither are we using FeatureMaps.
The objects are stored in a CDO server instance before comparing. I used to have problems with CDO in the 1.x stream (regarding both scalability and reliability), but in 2.x both XMI and CDO take the same amount of time to compute the differences. However, I'm yet to determine whether the fact that the model is stored in CDO may introduce false positives (which happened quite a lot in the 1.x stream).

I took a look at the stats variable. As you know, it's segmented by type (due to ByTypeIndex), but these are some of the most "conflictive" elements:

I've managed to cut down the comparison time by ignoring some references in the model. The thing is, I already had a custom FeatureFilter installed in my own DiffEngine,
but apparently it didn't come into play, because org.eclipse.emf.compare.match.eobject.EditionDistance uses EditionDistance.CountingDiffEngine instead of my extended DiffEngine.
It looks like EditionDistance (used for the identical match) ignores any customizations introduced at the DiffEngine level. Is this expected?

Using the same feature filter for both the Match and the Diff would be a better default IMO, as long as one can provide different ones when needed.

Regarding scalability, I'm currently experimenting with containing the computation by using a SEARCH_WINDOW just like EMF Compare 1.x. Doing so changes quite a few things in the way big models are matched; I'm not sure yet whether these changes are "normal" or problematic, and I'm going through them right now. In the meantime, if you want to give it a try, here is the modified ProximityMatcher.

It cuts my comparison times on decent to big models; could you give it a try and tell me how it goes for you?

I just checked out git HEAD, and the new org.eclipse.emf.compare.match.eobject.internal.CachingDistance seems to be missing from the repository :S I made sure the commit I'm synchronized with is the same one you shared (which happens to be HEAD right now). Maybe you forgot to commit it too?

I was going back to our initial discussion: "[..] heuristic to reduce the size of storageToSearchFor by taking only elements with the same parent as the object subject to be matched. Not working perfectly anyway, but reduced the computation time."

By the way, I'm gonna have to bring back chocolate to the office because of this "last minute commit before leaving".

Good one! Is that strategy under the EPL? I'd like to implement it here at the office too.

Quote:

I was going back to our initial discussion : "[..] heuristic to reduce the size of storageToSearchFor by taking only elements with the same parent as the object subject to be matched. Not working perfectly anyway, but reduced the computation time."

Would you mind sharing it with us through a patch or something?

I lost everything once I checked out from the git repository. Anyway, the idea behind that naive change I made is some sort of Search Window strategy.

Anyway, given the latest results with our big KDM models, and given our time constraints, we will keep using our previous code with EMF Compare 1.3, and the newer code will use the 2.x stream, which compares small models (and seems to work flawlessly, no false positives so far). I'll try to spend some time figuring out what's wrong with our KDM models that makes 2.x not scale.

I assume you guys want to keep Java 1.5 compatibility. The bundle's execution environment declares J2SE-1.5... so I assume that constructor shouldn't be used.

What version are you using? I couldn't find any reference to that String constructor in our code base, and there doesn't seem to be any hit in the history of that class either.

Other than that, any chance you could provide us with the models you are seeing these performance issues with? As Cédric mentioned, we eventually plan on having something more flexible for the content match, but for now profiling would be a lot more helpful to track down what goes wrong in your case with our current implementation.

Yup, that was a mistake on my side (I forgot to "git pull" before checking out... I came back from holidays yesterday).

The Java 6 usage is a mistake by another member of the team, which was fixed yesterday. It had been introduced post-M7, so only master was affected. As Cédric said in a previous comment, someone will have to bring chocolate to the office; that's the rule here since 2007.

Glad to see that fixed.

Quote:

Other than that, any chance you could provide us with the models you are seeing these performance issues with? As Cédric mentioned, we eventually plan on having something more flexible for the content match, but for now profiling would be a lot more helpful to track down what goes wrong in your case with our current implementation.

I wish I could, but these models represent client code, and I don't think I have permission to share clients' code.

I wish we had a model obfuscator somewhere. Randomly shuffling all primitive attributes would probably suffice to produce an incomprehensible model that is still representative of a "real world" use case showing these issues.
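Such an obfuscator could be as simple as collecting all string attribute values and shuffling them across the model. The Element class below is a hypothetical stand-in for EObjects with string-typed EAttributes; real code would iterate the resource contents and the EAttributes of each EClass:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ObfuscatorSketch {

    // Stand-in for a model element carrying one string attribute value.
    public static class Element {
        public String name;
        public Element(String name) { this.name = name; }
    }

    // Shuffle the attribute values among the elements: the structure and
    // the multiset of values are preserved, so match/diff behaviour stays
    // representative, but the readable content is destroyed.
    public static void shuffleValues(List<Element> elements, Random rnd) {
        List<String> values = new ArrayList<>();
        for (Element e : elements) {
            values.add(e.name);
        }
        Collections.shuffle(values, rnd);
        for (int i = 0; i < elements.size(); i++) {
            elements.get(i).name = values.get(i);
        }
    }
}
```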

A long time ago we developed theoretical algorithms for such a thing at Open Canarias, but never implemented them. Obfuscation sure is an interesting topic in the modeling area.

I actually used it during my analysis, but couldn't figure out anything. IIRC, distance calculations, executed too many times, led to long computation times on our models.
I'll give it a try and send you guys a snapshot.

java.lang.OutOfMemoryError: Java heap space
at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:424)
at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:406)
at java.util.HashMap.put(HashMap.java:385)
at java.util.HashSet.add(HashSet.java:200)
at org.eclipse.emf.compare.match.eobject.EditionDistance$CountingDiffProcessor.attributeChange(EditionDistance.java:307)
at org.eclipse.emf.compare.diff.DefaultDiffEngine.computeSingleValuedAttributeDifferences(DefaultDiffEngine.java:801)
at org.eclipse.emf.compare.diff.DefaultDiffEngine.computeDifferences(DefaultDiffEngine.java:497)
at org.eclipse.emf.compare.match.eobject.EditionDistance$CountingDiffEngine.computeDifferences(EditionDistance.java:382)
at org.eclipse.emf.compare.diff.DefaultDiffEngine.checkForDifferences(DefaultDiffEngine.java:130)
at org.eclipse.emf.compare.match.eobject.EditionDistance$CountingDiffEngine.measureDifferences(EditionDistance.java:427)
at org.eclipse.emf.compare.match.eobject.EditionDistance.distance(EditionDistance.java:155)
at org.eclipse.emf.compare.match.eobject.CachingDistance.distance(CachingDistance.java:57)
at org.eclipse.emf.compare.match.eobject.internal.ProximityIndex.findTheClosest(ProximityIndex.java:182)
at org.eclipse.emf.compare.match.eobject.internal.ProximityIndex.findClosests(ProximityIndex.java:110)
at org.eclipse.emf.compare.match.eobject.internal.ByTypeIndex.findClosests(ByTypeIndex.java:86)
at org.eclipse.emf.compare.match.eobject.ProximityEObjectMatcher.tryToMatch(ProximityEObjectMatcher.java:276)
at org.eclipse.emf.compare.match.eobject.ProximityEObjectMatcher.matchList(ProximityEObjectMatcher.java:233)
at org.eclipse.emf.compare.match.eobject.ProximityEObjectMatcher.matchAheadOfTime(ProximityEObjectMatcher.java:147)
at org.eclipse.emf.compare.match.eobject.ProximityEObjectMatcher.createMatches(ProximityEObjectMatcher.java:115)
at org.eclipse.emf.compare.match.eobject.IdentifierEObjectMatcher.doDelegation(IdentifierEObjectMatcher.java:153)
at org.eclipse.emf.compare.match.eobject.IdentifierEObjectMatcher.createMatches(IdentifierEObjectMatcher.java:116)
at org.eclipse.emf.compare.match.DefaultMatchEngine.match(DefaultMatchEngine.java:312)
at org.eclipse.emf.compare.match.DefaultMatchEngine.match(DefaultMatchEngine.java:121)
at org.eclipse.emf.compare.match.DefaultMatchEngine.match(DefaultMatchEngine.java:91)
at org.eclipse.emf.compare.EMFCompare.compare(EMFCompare.java:169)
at org.eclipse.emf.compare.EMFCompare.compare(EMFCompare.java:152)

I tested the above with a 1024 MB heap and a 2048 MB heap; both failed with the same exception.

I'll have a look at the memory consumption; I have not worked on it for the content match, and there is no doubt there's room for enhancement. One "quick try" you could do would be to stop using the caching distance: it has no eviction strategy and is probably growing quite a lot.
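If the unbounded cache is indeed the culprit, one mitigation (a sketch, not what CachingDistance currently does) would be to bound it with an LRU eviction policy, for example using LinkedHashMap's access-order mode. The real CachingDistance keys entries by pairs of EObjects; here the key and value types are left generic:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded LRU cache: once maxEntries is exceeded, the least
// recently accessed entry is evicted instead of growing forever.
public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each put; returning true evicts
        // the eldest (least recently accessed) entry.
        return size() > maxEntries;
    }
}
```

This trades some recomputation of evicted distances for a bounded memory footprint, which seems like the right trade-off when the full cache no longer fits in the heap.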