The server manages 2 "queue" databases used for communication between processes.

One of the databases has 2 writers and 1 reader; the other has a single writer and a single reader. The message size going into the queue is not large: around 4k at most. Each read/write operation is done via a new EntityManager, which is closed straight afterwards.
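As a rough sketch, the access pattern looks like this (class name, connection URL and credentials are illustrative, not the actual application code; this assumes the ObjectDB JPA provider is on the classpath):

```java
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

// Hypothetical sketch of the per-operation EntityManager pattern described above.
public class QueueDao {
    private final EntityManagerFactory emf = Persistence.createEntityManagerFactory(
            "objectdb://localhost:6136/queue1.odb;user=admin;password=admin");

    public void write(Object msg) {
        EntityManager em = emf.createEntityManager();  // fresh EM per operation
        try {
            em.getTransaction().begin();
            em.persist(msg);
            em.getTransaction().commit();
        } finally {
            if (em.getTransaction().isActive())
                em.getTransaction().rollback();        // don't leak a failed tx
            em.close();                                // closed straight afterwards
        }
    }
}
```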

The server is running with settings -Xms512m -Xmx1g. I have scope to increase these limits but would like to get an understanding of where the memory is being used. In each exception case the server has typically been running for 3-4 days without issue.

Unfortunately the environments showing the problem are production so we have limited access.

I've tried to recreate this in a test environment but have had no luck as yet; I'll update as soon as I have something. I've included the stack trace from the latest incident below in case it sheds any light on the problem.

[2011-08-02 18:39:59 #18 server]
[ObjectDB 2.2.8] Unexpected exception (Error 990)
Generated by Java HotSpot(TM) 64-Bit Server VM 1.5.0_21 (on Windows 2003 5.2).
Please report this error on http://www.objectdb.com/database/issue/new
com.objectdb.o.InternalException: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
at com.objectdb.o.PAG.<init>(PAG.java:169)
at com.objectdb.o.PGC.x(PGC.java:274)
at com.objectdb.o.SSS.V(SSS.java:441)
at com.objectdb.o.SSS.W(SSS.java:470)
at com.objectdb.o.PGW.az(PGW.java:527)
at com.objectdb.o.SPW.aM(SPW.java:350)
at com.objectdb.o.SPW.aJ(SPW.java:266)
at com.objectdb.o.SPW.ar(SPW.java:231)
at com.objectdb.o.PGW.am(PGW.java:201)
at com.objectdb.o.UPT.B(UPT.java:134)
at com.objectdb.o.UPT.l(UPT.java:110)
at com.objectdb.o.TSK.i(TSK.java:146)
at com.objectdb.o.TSK.f(TSK.java:95)
at com.objectdb.o.UPT.r(UPT.java:157)
at com.objectdb.o.PGT.q(PGT.java:109)
at com.objectdb.o.UPT.B(UPT.java:121)
at com.objectdb.o.URT.l(URT.java:171)
at com.objectdb.o.TSK.i(TSK.java:146)
at com.objectdb.o.TSK.f(TSK.java:95)
at com.objectdb.o.TSM.e(TSM.java:87)
at com.objectdb.o.UTT.z(UTT.java:362)
at com.objectdb.o.UTT.l(UTT.java:218)
at com.objectdb.o.TSK.i(TSK.java:146)
at com.objectdb.o.TSK.f(TSK.java:95)
at com.objectdb.o.TSM.e(TSM.java:87)
at com.objectdb.o.MST.UY(MST.java:1174)
at com.objectdb.o.WRA.UY(WRA.java:365)
at com.objectdb.o.WSM.UY(WSM.java:174)
at com.objectdb.o.WRA.UY(WRA.java:365)
at com.objectdb.o.WSN.UY(WSN.java:650)
at com.objectdb.o.STC.v(STC.java:485)
at com.objectdb.o.SHN.al(SHN.java:542)

Basically I'm creating 2 server-based queues, each with 5 writers and 1 reader. The msg size being saved should be around 2mb. The server is running with -Xms256m -Xmx256m, so lower than my real environment, but I think this should still run OK: at most we should have 12 clients each loading a 2mb msg, so around 24mb of raw data. (Let me know if I've got this wrong though!) The test dies after a few minutes of running.

I've attached the code, the multithreaded unit test library I'm using, and the server logfile.

Could you please explain how this test uses threads? It seems that every few seconds all the existing threads are ended and then more threads are started (i.e. > 60 threads after about 2 minutes). Each thread manages its own EntityManager, which may consume a lot of memory.

The aim of the test is to replicate application startup and shutdown cycles. Each cycle of the outer loop should create 10 writer threads (5 for each database) and 2 reader threads (1 for each database). Each thread should read or write 10 msgs and then exit. When all threads have exited (i.e. the application has shut down) the loop restarts.
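That lifecycle can be sketched in plain Java (class and variable names are hypothetical; the real test uses the attached multithreaded unit test library, and the workers below just count operations instead of touching the database):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class CycleSketch {
    // One "application run": start 10 writers and 2 readers, then join them
    // all before the next cycle begins, so at most 12 worker threads exist
    // at any one time.
    static void runCycle(Runnable writer, Runnable reader) throws InterruptedException {
        List<Thread> threads = new ArrayList<Thread>();
        for (int i = 0; i < 10; i++) threads.add(new Thread(writer, "writer-" + i));
        for (int i = 0; i < 2; i++)  threads.add(new Thread(reader, "reader-" + i));
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();   // "application shutdown"
    }

    public static void main(String[] args) throws InterruptedException {
        final AtomicInteger ops = new AtomicInteger();
        Runnable tenMsgs = new Runnable() {
            public void run() {
                for (int i = 0; i < 10; i++) ops.incrementAndGet(); // one msg each
            }
        };
        for (int cycle = 0; cycle < 3; cycle++) runCycle(tenMsgs, tenMsgs);
        System.out.println(ops.get()); // 3 cycles * 12 threads * 10 msgs = 360
    }
}
```

If more than 12 threads show up concurrently, the join step above is the part to check.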

There shouldn't be more than 12 threads accessing the server at any one time; if you're seeing more then it sounds like the threads are not exiting and this is a problem with the test. I'll check this...

Are there additional cleanup steps I should be taking to dispose of EntityManagers, then? I actually thought it would be creating more than one EntityManager per thread, as the createEntityManager call is made each time an object is read or saved.

To be honest this test came about by fishing for the problem rather than from a structured replication. As the real problem only presents itself after a few days, the test is deliberately more extreme, to try and force the issue out. The application does similar things with ObjectDB in embedded mode without any issues, so it looks to be related to something that is different in server mode; connection management seemed like a good place to start.

I'm working on it but even when there are problems it is tough to get this kind of thing through :(

On a brighter note, I think I might have actually replicated it now...

The attached test (testObjectDbPipe) has 10 threads which write msgs to database "test1" and 5 threads which read msgs from database "test1" and write them to database "test2". The msgs are approx 2mb in size, so at one msg per thread there should be around 30mb of data in play at any one time.

I've run the test with -Xms256m -Xmx512m and the server with -Xms256m -Xmx256m.

After running for about 15mins I get out-of-memory exceptions.

The test fails faster (in about 2 mins) if I enable another 5 threads to read from test1 and write to test2 - I've left this commented out in the attachment.

If I remove the save into test2 from the reader/writer threads (so each just reads the msg and discards it) then the test runs fine.

I ran the new test and took heap dump snapshots using the Java VisualVM. After some activity the test became inactive, maybe because of the VisualVM interference, but the heap dump reveals a general picture.

Memory usage on the client side was stable. Memory usage on the server side became higher, mainly due to increasing page cache size. I see that the test uses 4 different databases (2 large and 2 small). The default size of each page cache (per database) is 64MB, so the caches of the two large databases consume at least 128MB (probably a little more). Other caches (the page caches of the 2 other databases, query result caches, etc.) consume additional memory, so a server-side OutOfMemoryError when the heap size is limited to 256MB may be reasonable.

Try increasing the maximum heap size and reducing the cache sizes (not just the page cache) in the configuration.
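For example, the relevant settings in objectdb.conf look roughly like this (the values below are illustrative, not recommendations; check the ObjectDB configuration documentation for the exact schema and defaults):

```xml
<!-- Excerpt from objectdb.conf (illustrative values, not a full file) -->
<database>
    <!-- page cache per open database file; the default is 64mb -->
    <processing cache="16mb" max-threads="10" />
    <!-- query result / compiled program caches -->
    <query-cache results="8mb" programs="50" />
</database>
```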

I've rerun the test after setting the cache (page & results) sizes to 0mb. At first it looked OK, but when I bumped the msg size up to 5mb and left it running for 25mins I got another out-of-memory exception on the server. I haven't been able to reproduce it consistently though, as it worked OK for an hour on a subsequent rerun.

[ObjectDB 2.2.8] javax.persistence.RollbackException
Failed to commit transaction: Attempt to commit a transaction after failure in writing to database file 'F:\calculationservice\objectdb\dbs\admin\WorkflowHistory.odb' (error 613)
at com.objectdb.jpa.EMImpl.commit(EMImpl.java:277)
at rbccm.felix.gridservice.admin.dao.SimpleWorkflowInstanceDao.saveWorkflow(Unknown Source)
at rbccm.felix.gridservice.admin.dao.SimpleWorkflowInstanceDao.SaveWorkflow(Unknown Source)
at rbccm.felix.gridservice.admin.AdminMessageService.process(Unknown Source)
at rbccm.felix.framework.service.ServiceRunner.run(Unknown Source)
at java.lang.Thread.run(Thread.java:595)
Caused by: javax.persistence.PersistenceException: com.objectdb.o.UserException: Attempt to commit a transaction after failure in writing to database file 'F:\calculationservice\objectdb\dbs\admin\WorkflowHistory.odb'
at com.objectdb.o._PersistenceException.b(_PersistenceException.java:47)
at com.objectdb.o.JPE.g(JPE.java:140)
at com.objectdb.o.JPE.g(JPE.java:78)
... 8 more
Caused by: com.objectdb.o.UserException: Attempt to commit a transaction after failure in writing to database file 'F:\calculationservice\objectdb\dbs\admin\WorkflowHistory.odb'
at com.objectdb.o.MSG.d(MSG.java:74)
at com.objectdb.o.SFL.ad(SFL.java:822)
at com.objectdb.o.MST.ad(MST.java:1249)
at com.objectdb.o.MST.UY(MST.java:1164)
at com.objectdb.o.WRA.UY(WRA.java:365)
at com.objectdb.o.WSM.UY(WSM.java:174)
at com.objectdb.o.OBM.bI(OBM.java:904)
at com.objectdb.o.OBM.bG(OBM.java:809)
at com.objectdb.o.OBM.bE(OBM.java:715)
at com.objectdb.jpa.EMImpl.commit(EMImpl.java:274)
... 5 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at com.objectdb.o.HTB.f(HTB.java:86)
at com.objectdb.o.HTB.q(HTB.java:399)
at com.objectdb.o.PGC.B(PGC.java:402)
at com.objectdb.o.SFL.ah(SFL.java:990)
at com.objectdb.o.SFL.run(SFL.java:898)
... 1 more

In this case it is a normal object database (i.e. no queue behaviour). The app has been running for at least a couple of weeks and the db is approx 1.2gb. It only seems to have impacted the ObjectDB operations (not sure why or how); other functionality in the app seems to be working OK.

(Apologies for sending another stack trace without much detail - I'm working on it...)

Fortunately, I have a dump from the most recent incident (we already have OOM dumps switched on for the app itself). Unfortunately the file is 2gb, so I have no way of getting it to you. Some of the info (from Eclipse Memory Analyser) is below and attached. Please let me know if there is anything specific I should look for.

Problem Suspect 1:

One instance of "com.objectdb.o.MST" loaded by "sun.misc.Launcher$AppClassLoader @ 0x853f0440" occupies 375,301,672 (17.92%) bytes. The memory is accumulated in one instance of "com.objectdb.o.MST" loaded by "sun.misc.Launcher$AppClassLoader @ 0x853f0440".

This heap dump indicates a problem with the page caches. With a cache size of 64MB (assuming you use the default) and a page size of 2KB (the default), you should only have 32K PAG instances per open database file.

How many open databases do you use concurrently? Following the thread dump I count about 9 open databases.

For 10 open databases you should have about 320K PAG instances. Additional PAG instances are created for operations, and because your objects are large, operations manage many pages.

Still, 598K PAG instances seems suspicious and it might be an ObjectDB bug.
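A quick back-of-the-envelope check of the numbers above (plain Java, nothing ObjectDB-specific):

```java
public class PageCountCheck {
    public static void main(String[] args) {
        long cacheBytes = 64L * 1024 * 1024; // default page cache per database
        long pageBytes  = 2L * 1024;         // default page size
        long pagesPerDb = cacheBytes / pageBytes;
        long tenDbs     = 10 * pagesPerDb;   // cache-bound estimate for 10 open dbs
        System.out.println(pagesPerDb);      // expected PAG instances per database
        System.out.println(tenDbs);          // vs ~598K observed in the heap dump
    }
}
```

The dump shows nearly twice the cache-bound estimate, which is why the extra instances look suspicious.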

You can upload the full heap dump (in a zip version) to ftp objectdb.com, user=public, password=public.

I'm unable to access the FTP site as the firewall here is blocking it; instead I've uploaded a multi-part rar file (14 parts) via a few support tickets. I hope that's OK, and apologies for the inconvenience.

The app is connecting to quite a few databases, so it may be that the problem boils down to the cache/config settings after all. Is there any way to specify cache sizes at a per-database level? Most of the "queue" databases we use probably don't need any cache, but it would be nice to keep it for the proper queryable databases. (If you think this is the root of the problem then I can switch it off across the board.)

To answer your questions: yes, we are using the default settings. The databases are as follows, with each "queue" database translating into 2 physical databases (one small for queryable metadata, one larger for msg data):

3x standard object databases (embedded)

3x queue databases (embedded)

3x queue databases (server mode)

So in total we have 9 embedded databases and 6 server-based databases.

Thanks for your help and patience on this; I know it's been frustrating with the lack of information!

OK, I've also attached a heap dump for the local test issue I was seeing yesterday, in case it's a separate problem. This involved restarting the test readers & writers from #9 above a few times whilst leaving the server running in the background. The server was throttled to 256mb but the cache sizes were set to 0mb.

(I'll post the dump via support tickets again as this page complains about the size; only 5 files this time, but still 130mb.)

I haven't been able to recreate the problem locally with the new build (although my local test was never that scientific anyway).

The new build was deployed last weekend to the environments where we've seen the problem, and as yet there has been no recurrence. The issue took just over a week to occur last time, so it is a little early to say that it is resolved completely, but it looks promising.