The activity information that needs to be captured for conversation-based activity events needs to be indexed on two concepts:

(a) business context

This is termed 'business' context for now because it uses business-related information from the contents of the messages in a conversation to link them into a conversation instance - in BPEL terms, this represents the correlation set.

Note: to make this mechanism more general purpose, it should possibly not be called 'business' context - perhaps just 'event context'? But for the purpose of discussion we will just call it 'business' context.

If we consider a BPEL process, it is possible that a business context may change through the life of a process - so a process may start by being identified by an Order Id (for example), and later be associated with a Supplier Id. The only requirement is that at some point during the process, a message must contain both the Order Id and the Supplier Id, as a means of linking the two ids to the same process id.

The same concept works with conversation instances - an initial id will be associated with the conversation, but not all messages associated with the conversation need to carry the same id, as long as at some point during the conversation instance the subsequent ids are defined in a message that also contains an id already associated with the conversation.

The main point is that each of the id values associated with a conversation instance is unique to it - so the same id value cannot identify two different conversation instances.
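As a rough illustration of the linking rule described above, here is a minimal in-memory sketch. The class name `ConversationIndex`, the method `record_message` and the (name, value) tuple representation are all hypothetical, chosen only for this example - they are not part of any existing schema or API. It assumes each message has already been reduced to the list of ids it carries.

```python
class ConversationIndex:
    """Hypothetical index: maps (name, value) id pairs to a conversation instance number."""

    def __init__(self):
        self._instance_of = {}    # (name, value) -> instance number
        self._next_instance = 1

    def record_message(self, ids):
        """Register the ids carried by one message and return its instance number."""
        ids = list(ids)
        # Instances already known for any of the ids in this message.
        known = {self._instance_of[i] for i in ids if i in self._instance_of}
        if len(known) > 1:
            # The same id value must not identify two different instances.
            raise ValueError(f"ids {ids} span several instances: {sorted(known)}")
        if known:
            instance = known.pop()
        else:
            instance = self._next_instance
            self._next_instance += 1
        # Any new ids in the message become linked to the same instance.
        for i in ids:
            self._instance_of[i] = instance
        return instance
```

A message carrying only the Supplier Id can then be resolved to the right instance, provided an earlier message carried both the Order Id and that Supplier Id.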

The other way in which business contexts may be related is as parent and child conversation instances. For example, if a parent conversation is associated with an Order Id, then it may have child conversation instances associated with a composite 'business context' of Order Id:Supplier Id - so for the same Order Id, there may be multiple sub (or child) conversation instances that are associated with the order and a supplier id.

So from a db query perspective, we may want to select all of the activity events within a particular conversation instance keyed on a 'business' context. If the context is the parent context, then it should return all of the events for the parent and sub-conversations. If the query context is the child context, then possibly just return the activity events for the child?

(b) common properties

For example, if we have conversation instances associated with a trading choreography, we may want to be able to identify all of the conversation instances associated with a particular trader.

In this case, the activity events for the conversation would be pre-processed to extract relevant properties from the message content (the trader name in this case), and those properties would be associated with each activity event.

So on storage in the database, it would be convenient if the property name and value could be used as an index into all of the conversation instances that have at least one activity event with that property. So rather than being linked to the activity events, it would be nice if they could just be linked to the conversation instances in some way - but if that is not efficient, then references to the activity events would be the next best solution, as long as the conversation instance could be derived from the activity event.

We need to examine the current activity event schema against these requirements of being able to query based on business context or property name/value, and ensure that the appropriate set of activity events (in the case of business context) or conversation ids (in the case of properties) is retrieved efficiently.

As mentioned in the previous para - possibly the best way would be to introduce the concept of a conversation id. So the 'business context' values reference a conversation id (so more than one business context can be associated with the same id), and the conversation id links to the set of activity events. Not sure yet how this would work with the sub-conversations.
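To make the preferred option concrete (properties linked directly to conversation instances rather than to individual events), here is a small sketch. `PropertyIndex`, `add_event` and `instances_with` are hypothetical names used only for illustration, and the in-memory dictionary stands in for whatever database index would actually be used.

```python
from collections import defaultdict


class PropertyIndex:
    """Hypothetical index from (property name, value) to conversation instance ids."""

    def __init__(self):
        self._instances = defaultdict(set)

    def add_event(self, instance_id, properties):
        """Record the properties extracted from one activity event of an instance."""
        for name, value in properties.items():
            self._instances[(name, value)].add(instance_id)

    def instances_with(self, name, value):
        """Return the instances that have at least one event with this property."""
        return sorted(self._instances.get((name, value), ()))
```

Because the set is keyed on the instance, several events from the same conversation carrying the same trader name only produce one entry.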

I think it is better to introduce a conversation ID type for the "business context", for query performance.

Below is the db schema that I have in mind for the Conversation ID:

ID, NAME, VALUE, PARENT

The ID is for internal use. Name and value are very straightforward. Here I think we can introduce the PARENT column for the ID association. For example, the first time, the 'order id' is the business context; then, in its child conversation, or in a subsequent conversation that changes the business context (to the supplier id, for instance), we can use the PARENT column to indicate that the 'supplier id' is a derived (or child) conversation ID.

For the conversation query, if users just want a sub-conversation, we just use the sub-conversation id, while for the whole conversation we also list the activities related through the PARENT column.
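A possible shape for this table and the two kinds of query is sketched below using SQLite. The table names `conversation_id` and `activity_event`, the `description` column and the helper functions are assumptions made for the example, not the existing activity event schema; the recursive query is just one way to follow the PARENT column.

```python
import sqlite3


def build_store():
    """Create an in-memory database with the hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE conversation_id (
            id     INTEGER PRIMARY KEY,
            name   TEXT NOT NULL,
            value  TEXT NOT NULL,
            parent INTEGER REFERENCES conversation_id(id)
        );
        CREATE TABLE activity_event (
            id              INTEGER PRIMARY KEY,
            conversation_id INTEGER REFERENCES conversation_id(id),
            description     TEXT
        );
    """)
    return conn


def events_for(conn, conversation_id, include_children=True):
    """Return event descriptions for an id, optionally including child ids."""
    if include_children:
        # Walk down the PARENT links starting from the requested id.
        sql = """
            WITH RECURSIVE tree(id) AS (
                SELECT ?
                UNION
                SELECT c.id FROM conversation_id c JOIN tree t ON c.parent = t.id
            )
            SELECT e.description FROM activity_event e
            JOIN tree ON e.conversation_id = tree.id
            ORDER BY e.id
        """
    else:
        sql = ("SELECT description FROM activity_event "
               "WHERE conversation_id = ? ORDER BY id")
    return [row[0] for row in conn.execute(sql, (conversation_id,))]
```

Querying with the parent id returns the events of the parent and all sub-conversations, while querying with a sub-conversation id returns only that sub-conversation's events.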

2) common properties

For these properties, I think our 'Context' schema data type should be enough.

I think your idea for the db schema would be good - however it could result in confusion about what the conversation instance is, so for this reason, and also because the mechanism is also applicable for processes (not just conversations), I think it will be better if we refer to this as a 'correlation id', and 'correlation' (instead of business context).

So that does not change your approach - but I think makes the mechanism relate to the more general concept of correlation without worrying about the boundaries between conversations, sub-conversations and processes. Although sub-conversations are technically separate instances, a conversation that transitions from Order id to Supplier id, is technically still the same conversation - so that is why I think calling these internal ids 'correlation id' means we don't cause any confusion.

Two questions:

1) I assume it won't be a problem to infer 'child' correlation ids, so that we can (for example) start with the initial correlation id of a conversation and work through a tree of ids to identify all the ids that relate to the actual conversation instance?

2) An individual activity event (for example the message that contains both the Order Id and Supplier Id), may therefore be associated with more than one correlation id. I assume this is not a problem? It is also quite possible that some activity events will not be associated with any correlation, so it should not necessarily be mandatory - it just means these other events may be grouped on some other basis - so the correlation mechanism is just one approach to identifying sub-groups of activity events.
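Both questions can be illustrated with a small in-memory sketch. The function names and the two dictionaries (`parent_of`, mapping a child correlation id to its parent, and `event_links`, mapping an event to a possibly empty list of correlation ids) are hypothetical structures invented for this example.

```python
from collections import deque


def correlation_tree(root, parent_of):
    """Return root plus every correlation id below it, in breadth-first order."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    seen, queue = [root], deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def events_in_conversation(root, parent_of, event_links):
    """Return events linked to at least one correlation id in the tree under root."""
    ids = set(correlation_tree(root, parent_of))
    # An event may list several correlation ids, or none at all.
    return [event for event, linked in event_links.items() if ids & set(linked)]
```

Events with no correlation id are simply never returned by this lookup, so they would need to be grouped on some other basis.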

Jeff Yu wrote:

2) common properties

These properties I think our 'Context' schema data type should be enough.

Yes and no - I think the current schema is not explicit enough - currently the Context element can be used for correlation information as well as these properties, whereas they are different. Correlation information must be unique for the conversation/process instance, whereas properties don't need to be unique - they are just identifying common aspects across conversation/process instances.

So we could either have distinct schema components for these (instead of Context), or add something to the Context element to indicate whether the information is intended to be unique for the conversation/process instance?

It may be clearer to have distinct components - e.g. correlation as one, and context or properties for the other?

I'd like to use 'correlation id' here, but I think we can simplify the design as follows:

1. The database schema changes a bit:

ID, VALUE, PARENT

We won't have a name column here; instead we will stringify the correlation id(s). Say we have the following correlation id:

<correlationID>

<orderID>1</orderID>

</correlationID>

We will save this as orderID=1 in the 'VALUE' column, with the PARENT column as 'null'. Then we may have a sub-conversation correlation id, like:

<correlationID>

<orderID>1</orderID>

<supplierID>1</supplierID>

</correlationID>

Then we save this as orderID=1,supplierID=1 in the 'VALUE' column, with the PARENT column as '1', assuming the first correlation id above has been assigned '1' in the ID column.

With this approach, the backend database storage is very simple, and this field is searchable. One limitation is that a correlation id cannot be longer than 255 characters.

In this way, an individual activity event will have at most one correlationID, but it is fine for it to have none.

I think this approach is simpler than the earlier one.
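The stringification could look roughly like the sketch below. The function name `stringify_correlation` and the constant `MAX_VALUE_LENGTH` are made up for illustration; the code assumes the correlation id is a flat XML element whose children are simple name/value elements, as in the examples above.

```python
import xml.etree.ElementTree as ET

MAX_VALUE_LENGTH = 255  # assumed width of the VALUE column


def stringify_correlation(xml_text):
    """Turn a <correlationID> element into a 'name=value,name=value' string."""
    root = ET.fromstring(xml_text)
    # Child elements are emitted in document order.
    value = ",".join(
        f"{child.tag}={(child.text or '').strip()}" for child in root
    )
    if len(value) > MAX_VALUE_LENGTH:
        raise ValueError(f"correlation id exceeds {MAX_VALUE_LENGTH} characters")
    return value
```

Since the string keeps document order, two messages that list the same fields in a different order would produce different VALUE strings, so a fixed ordering would probably need to be agreed if the column is used for lookups.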

Any problems with this approach?

Gary Brown wrote:

Jeff Yu wrote:

2) common properties

These properties I think our 'Context' schema data type should be enough.

Yes and no - I think the current schema is not explicit enough - currently the Context element can be used for correlation information as well as these properties, whereas they are different. Correlation information must be unique for the conversation/process instance, whereas properties don't need to be unique - they are just identifying common aspects across conversation/process instances.

So we could either have distinct schema components for these (instead of Context), or add something to the Context element to indicate whether the information is intended to be unique for the conversation/process instance?

May be clearer to have distinct components - e.g. correlation as one, and context or properties for the other?

Agree on the distinct components - they should just be treated as two different ways to group 'things' - the correlation mechanism groups events into correlated groups, and the properties/contexts group the correlated groups into related groups.

I think 'properties' is too generic, as we may want to just have a set of properties for an event to record ad hoc information. So we need to come up with an appropriate term to represent this "grouping of correlated groups" information - but that is not urgent.

In terms of the schema structure you presented, the problem is that an individual event may have more than one independent correlation field. Different subsequent events may contain either of the fields, and still need to be linked back to the same overall correlated group.

So I think, to ensure we can cover all use cases, we need to be able to assign an ID to a correlation key, where that correlation key (value) could represent the type of structure you described above - because it could be a simple or composite correlation key. Some use cases:

In terms of the context information - I think this needs to be based on the activity events, rather than the correlation groups, to cater for situations where the correlation information is not available. We would still want to know which activity events were associated with a particular context.

Once the set of activity events has been derived, an additional query would need to be performed to determine the correlation ids for those activities.
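The two-step lookup might be sketched as follows. The event representation (a dict with `context` and `correlation_ids` keys) and both function names are assumptions for the purpose of the example only.

```python
def events_with_context(events, name, value):
    """Step 1: select activity events carrying the given context value."""
    return [e for e in events if e["context"].get(name) == value]


def correlation_ids_for(events):
    """Step 2: collect the correlation ids referenced by the selected events."""
    ids = set()
    for event in events:
        # Events without correlation information contribute nothing.
        ids.update(event.get("correlation_ids", ()))
    return sorted(ids)
```

An event with the context but no correlation information still appears in the result of the first step, which is the situation this approach is meant to cater for.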