Allow means force a Repository to synchronize with the cluster

Details

Description

Based on the thread on the user mailing list I'm logging this to propose adding a sync() method to force cluster synchronization using the JackrabbitRepository extension API.

The purpose of the method is such that in a distributed clustered environment sometime cluster synchronization does or has not occurred such that certain repositories are in a stale state. This method would provide a means to force a repository to update pull in possible changes made by other Jackrabbit repositories.

Activity

I forgot to include the link to the mailing list thread in the original bug description[1].

The attachment contains patches for the 1.4 branch of the jackrabbit-core and jackrabbit-api projects. It adds the method forceClusterSync() to the JackrabbitRepository interface and implements the method in the three implementations of the interface.

The patches stray from Jukka's suggestion of naming the method sync() as the use case I was needing solved involved a clustered environment. sync() seems general enough that it might indicate the repository implementation would pick up any changes made to the data store. I know on the mailing list altering the database directly instead of through JCR/Jackrabbit API is discouraged so perhaps that isn't a valid use case.

I'm also not completely familiar with all use cases and therefore the implementations of the method only work in a clustered environment. I assume that multiple repositories hitting the same database in an unclustered environment could be considered invalid. However if this assumption is incorrect then the code will need to be changed to handle that.

Micah Whitacre
added a comment - 22/Sep/08 19:58 I forgot to include the link to the mailing list thread in the original bug description [1] .
The attachment contains patches for the 1.4 branch of the jackrabbit-core and jackrabbit-api projects. It adds the method forceClusterSync() to the JackrabbitRepository interface and implements the method in the three implementations of the interface.
The patches stray from Jukka's suggestion of naming the method sync() as the use case I was needing solved involved a clustered environment. sync() seems general enough that it might indicate the repository implementation would pick up any changes made to the data store. I know on the mailing list altering the database directly instead of through JCR/Jackrabbit API is discouraged so perhaps that isn't a valid use case.
I'm also not completely familiar with all use cases and therefore the implementations of the method only work in a clustered environment. I assume that multiple repositories hitting the same database in an unclustered environment could be considered invalid. However if this assumption is incorrect then the code will need to be changed to handle that.
[1] - http://www.nabble.com/Forcing-a-cluster-synch-td19578255.html

Hmm, there's a somewhat releated long discussion ([1] and [2]) about JackrabbitRepository.shutdown() that we had earlier this year. This method is not nearly as intrusive as the shutdown() method, but there might be valid reasons why it shouldn't be exposed to just any client.

Jukka Zitting
added a comment - 23/Sep/08 17:00 Hmm, there's a somewhat releated long discussion ( [1] and [2] ) about JackrabbitRepository.shutdown() that we had earlier this year. This method is not nearly as intrusive as the shutdown() method, but there might be valid reasons why it shouldn't be exposed to just any client.
[1] http://markmail.org/message/fqyypq5x6bma4ike
[2] http://markmail.org/message/dsqyv3rafo4j5xea

I'm also wondering if this is really needed. IIUC the cluster works like an eventually consistent system (see [1] for an introduction). If the application can guarantee 'session consistency' (i.e. an application session always uses a given cluster node and only switches to another cluster node when it goes down) the consequences of the inconsistency window should be very limited.

Micah, can you please elaborate why you need a cluster node to be in sync? Even if you can sync a cluster node, it might get out of sync again right away by a modification of another cluster node.

Marcel Reutegger
added a comment - 24/Sep/08 07:50 I have the same concerns regarding the visibility of the method.
I'm also wondering if this is really needed. IIUC the cluster works like an eventually consistent system (see [1] for an introduction). If the application can guarantee 'session consistency' (i.e. an application session always uses a given cluster node and only switches to another cluster node when it goes down) the consequences of the inconsistency window should be very limited.
Micah, can you please elaborate why you need a cluster node to be in sync? Even if you can sync a cluster node, it might get out of sync again right away by a modification of another cluster node.
[1] http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

My use case is described in the mailing list link included in the first comment. I read the article you linked to and it was very interesting. In my situation we are shooting for session consistency however there is not means to guarantee the stickiness of the session on the server side. The setup I have is operations are routed between 3 different JVMs and each JVM is read/writing to the JCR repository. So the use case I'm shooting for is client 1 performs writes which get routed to JVM1. The same client then performs a read on that write however the operation is routed to JVM2. In this situation I know that the write operation has occurred but when retrieving from the repository get a PathNotFoundException. So in that case I'd like to sync to update JVM2. After I sync I attempt to read and either get the value I'm looking for or I don't. If i don't then I know that a concurrent modification has occurred and report the appropriate response back to the client.

Micah Whitacre
added a comment - 24/Sep/08 13:41 My use case is described in the mailing list link included in the first comment. I read the article you linked to and it was very interesting. In my situation we are shooting for session consistency however there is not means to guarantee the stickiness of the session on the server side. The setup I have is operations are routed between 3 different JVMs and each JVM is read/writing to the JCR repository. So the use case I'm shooting for is client 1 performs writes which get routed to JVM1. The same client then performs a read on that write however the operation is routed to JVM2. In this situation I know that the write operation has occurred but when retrieving from the repository get a PathNotFoundException. So in that case I'd like to sync to update JVM2. After I sync I attempt to read and either get the value I'm looking for or I don't. If i don't then I know that a concurrent modification has occurred and report the appropriate response back to the client.

How about if we made Repository.login() and Session.refresh() automatically trigger a cluster sync? I think that would be well in line with both the semantic and performance expectations a typical client would have.

Jukka Zitting
added a comment - 08/Oct/08 12:12 How about if we made Repository.login() and Session.refresh() automatically trigger a cluster sync? I think that would be well in line with both the semantic and performance expectations a typical client would have.

I'm a bit concerned about the number of sync() calls that might pile up in the end: at the moment, the sync() operation is guarded by a Mutex, roughly like this:

mutex.acquire();
try

{
doSync();
}

finally

{
mutex.release();
}

a lot of threads might potentially end up handling Repository.login() and Session.refresh() and would therefore all in turn wait for each other, only to start the sync() operation again. I'd therefore change the semantics of the method described above to:

(1) either return immediately if a thread is already inside doSync() is already in progress
(2) or wait for the processing thread in doSync() to finish and then return

Optionally, the wait in (2) might again be limited by a timeout in order to avoid locking the complete system.

Dominique Pfister
added a comment - 08/Oct/08 13:44 I'm a bit concerned about the number of sync() calls that might pile up in the end: at the moment, the sync() operation is guarded by a Mutex, roughly like this:
mutex.acquire();
try
{
doSync();
}
finally
{
mutex.release();
}
a lot of threads might potentially end up handling Repository.login() and Session.refresh() and would therefore all in turn wait for each other, only to start the sync() operation again. I'd therefore change the semantics of the method described above to:
(1) either return immediately if a thread is already inside doSync() is already in progress
(2) or wait for the processing thread in doSync() to finish and then return
Optionally, the wait in (2) might again be limited by a timeout in order to avoid locking the complete system.
WDYT?

I'm not sure that there will be that many syncs, especially if we don't trigger the sync from Repository.login() calls. Additionally, if the node already is in sync with the cluster (which would be the usual case), then the sync amounts to just retrieving the journal revision number and comparing it with the local state.

If the retrieval of the journal revision number is expensive, we could add an additional local guard that just ensures that at least one cluster sync has been done between a client calling Session.refresh() (or Repository.login()) and the call returning. Whether it's that thread or some other thread doing the sync doesn't matter. Something like this:

Jukka Zitting
added a comment - 09/Oct/08 09:23 I'm not sure that there will be that many syncs, especially if we don't trigger the sync from Repository.login() calls. Additionally, if the node already is in sync with the cluster (which would be the usual case), then the sync amounts to just retrieving the journal revision number and comparing it with the local state.
If the retrieval of the journal revision number is expensive, we could add an additional local guard that just ensures that at least one cluster sync has been done between a client calling Session.refresh() (or Repository.login()) and the call returning. Whether it's that thread or some other thread doing the sync doesn't matter. Something like this:
private volatile int syncCount = 0;
int count = syncCount;
mutex.acquire();
try {
if (count == syncCount)
{
doSync();
syncCount++;
}
} finally
{
mutex.release();
}

Patch looks good to me, apart from the missing "h" in method name "syncWitCluster"

I'm still uneasy with the idea of having journal.sync() being potentially called with every Repository.login() and Session.refresh(), thinking of applications that e.g. test the repository availability by intermittently calling Repository.login(). What about specifying a lower bound on the delay (e.g. 1 second) that should pass before another journal sync actually takes place?

Dominique Pfister
added a comment - 09/Oct/08 10:53 Patch looks good to me, apart from the missing "h" in method name "syncWitCluster"
I'm still uneasy with the idea of having journal.sync() being potentially called with every Repository.login() and Session.refresh(), thinking of applications that e.g. test the repository availability by intermittently calling Repository.login(). What about specifying a lower bound on the delay (e.g. 1 second) that should pass before another journal sync actually takes place?

> I'm still uneasy with the idea of having journal.sync() being potentially called with every
> Repository.login() and Session.refresh(), thinking of applications that e.g. test the repository
> availability by intermittently calling Repository.login().

The performance hit is a single SELECT statement (for DatabaseJournal) or a directory listing (for FileJournal). I don't think that's too much, but you're right in that there are cases (one example is a web site without a session pool) where avoiding any extra perfromance hit on login() would be beneficial.

I think it would be OK if we didn't trigger the cluster sync on login(). A client could still use refresh() to ensure causal consistency (Micah's use case).

> What about specifying a lower bound on the delay (e.g. 1 second) that should pass before
> another journal sync actually takes place?

That would kind of defeat the purpose as the client would then have no way (apart from explicitly waiting for that one second and retrying the sync) to ensure consistent access to the repository.

Jukka Zitting
added a comment - 09/Oct/08 12:12 > missing "h" in method name "syncWitCluster"
Good catch, thanks!
> I'm still uneasy with the idea of having journal.sync() being potentially called with every
> Repository.login() and Session.refresh(), thinking of applications that e.g. test the repository
> availability by intermittently calling Repository.login().
The performance hit is a single SELECT statement (for DatabaseJournal) or a directory listing (for FileJournal). I don't think that's too much, but you're right in that there are cases (one example is a web site without a session pool) where avoiding any extra perfromance hit on login() would be beneficial.
I think it would be OK if we didn't trigger the cluster sync on login(). A client could still use refresh() to ensure causal consistency (Micah's use case).
> What about specifying a lower bound on the delay (e.g. 1 second) that should pass before
> another journal sync actually takes place?
That would kind of defeat the purpose as the client would then have no way (apart from explicitly waiting for that one second and retrying the sync) to ensure consistent access to the repository.

We need a configurable lower bound on the delay. Some applications may call refresh() a lot. If there was a change, the cluster mechanism will call PersistenceManager.onExternalUpdate(), which is potentially expensive. Without a lower bound on the delay, the system may get unusably slow. For example, if there is a background thread that calls Session.refresh() once a second, and if PersistenceManager.onExternalUpdate() takes one second.

Thomas Mueller
added a comment - 09/Oct/08 12:30 We need a configurable lower bound on the delay. Some applications may call refresh() a lot. If there was a change, the cluster mechanism will call PersistenceManager.onExternalUpdate(), which is potentially expensive. Without a lower bound on the delay, the system may get unusably slow. For example, if there is a background thread that calls Session.refresh() once a second, and if PersistenceManager.onExternalUpdate() takes one second.

> For example, if there is a background thread that calls Session.refresh() once a second,
> and if PersistenceManager.onExternalUpdate() takes one second.

Why is that a problem? With the syncCount patch any later refresh() methods will just wait for the first invocation to finish before just returning without invoking another sync. Also, the usual case is that there are no changes to report, so the amortized cost of the refresh() method is still pretty small.

The sync mutex already implements a lower bound delay that prevents any two syncs from executing concurrently. Adding more delay is IMHO stepping to the client territory. The client may have a good reason to want to sync more frequently (Micah's round robin case is an excellent example) and I see no reason why the repository should explicitly try to degrade the performance below what the hardware is capable of.

Jukka Zitting
added a comment - 09/Oct/08 13:18 > For example, if there is a background thread that calls Session.refresh() once a second,
> and if PersistenceManager.onExternalUpdate() takes one second.
Why is that a problem? With the syncCount patch any later refresh() methods will just wait for the first invocation to finish before just returning without invoking another sync. Also, the usual case is that there are no changes to report, so the amortized cost of the refresh() method is still pretty small.
The sync mutex already implements a lower bound delay that prevents any two syncs from executing concurrently. Adding more delay is IMHO stepping to the client territory. The client may have a good reason to want to sync more frequently (Micah's round robin case is an excellent example) and I see no reason why the repository should explicitly try to degrade the performance below what the hardware is capable of.

With this method the client needs to call session.refresh(...) to force a cluster sync.

I've also included a static (and public) flag variable, SessionImpl.clusterSyncOnRefresh, that an application can set to false if it wants to disable this feature (some applications may have come to expect refresh to be a nearly instantaneous operation, which no longer is true with this feature).

Jukka Zitting
added a comment - 24/Nov/08 14:18 Attached the alternative patch ( JCR-1753 -sync-in-refresh.patch).
With this method the client needs to call session.refresh(...) to force a cluster sync.
I've also included a static (and public) flag variable, SessionImpl.clusterSyncOnRefresh, that an application can set to false if it wants to disable this feature (some applications may have come to expect refresh to be a nearly instantaneous operation, which no longer is true with this feature).

Jukka Zitting
added a comment - 28/Nov/08 11:33 Dropping the jackrabbit-api component as the latest proposals are based on only the standard JCR API.
Note that this improvement will not make it in Jackrabbit 1.5. Let's target the 1.6 version for this.

Committed a slightly modified version of the patch in revision 792211. Merged to the 1.x branch in revision 792212.

Cluster synchronization is now enabled by default for Session.refresh(boolean). The synch can be selectively disabled per session by setting the "org.apache.jackrabbit.disableClusterSyncOnRefresh" attribute to any non-null value. Alternatively a SessionImpl subclass can disable (or enable) this feature for all refresh() calls by overriding the protected clusterSyncOnRefresh() method.

Jukka Zitting
added a comment - 08/Jul/09 16:25 Committed a slightly modified version of the patch in revision 792211. Merged to the 1.x branch in revision 792212.
Cluster synchronization is now enabled by default for Session.refresh(boolean). The synch can be selectively disabled per session by setting the "org.apache.jackrabbit.disableClusterSyncOnRefresh" attribute to any non-null value. Alternatively a SessionImpl subclass can disable (or enable) this feature for all refresh() calls by overriding the protected clusterSyncOnRefresh() method.