Design Specification: Performance and Concurrency

Document History

Project overview

This project groups several smaller performance-related bug fixes and enhancements into a single unit. Its goal is to improve the performance, concurrency, and scalability of the product.

Concepts

Performance is concerned with reducing CPU usage and finding more efficient ways to process operations.

Concurrency is concerned with reducing contention and improving multi-threaded and multi-CPU performance.

Scalability is concerned with clustering and with handling large workloads and data sets.

Requirements

The goal of this project is to ensure that our product remains the leading high-performance persistence solution. Areas of improvement are determined through performance comparisons with other persistence products and through benchmarking.

Design Constraints

The goal of the project is to improve performance of common usage patterns. Fringe features and usage patterns will not be specifically targeted unless found to be highly deficient.

Any optimization must also be weighed against its impact on usability and spec compliance. Optimizations that have a large negative impact on usability may need to be enabled only through specific configuration.

Functionality

Each specific performance improvement is discussed separately below.

Building objects from ResultSets

There is currently an old prototype for building objects directly from ResultSets. The goal of this feature is to allow "simple" objects and queries to bypass the intermediate DatabaseRow objects built from JDBC that are normally used to build objects, as well as to avoid many of the checks for non-core features and simplify the object-building process.

This will introduce a second path for queries and object building; these optimized queries should avoid much of the general overhead required to support advanced features. The feature will be enabled on a query, or perhaps a class, through configuration, or when the class/query is determined to be "simple".

Initially, "simple" will include only direct mappings, but will hopefully be expanded to include single-primary-key relationships, and perhaps composite primary keys. It will not include inheritance, events, complex queries, fetch groups, etc.
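For illustration only, the "simple" criteria above might be captured by a check along the following lines. The names (SimpleQueryCheck, MappingKind) and the feature flags are hypothetical, not part of the actual codebase.

```java
import java.util.List;

public class SimpleQueryCheck {
    enum MappingKind { DIRECT, ONE_TO_ONE, ONE_TO_MANY }

    // A class/query qualifies for the optimized path only if every mapping is
    // a direct column mapping and no advanced feature applies to it.
    static boolean isSimple(List<MappingKind> mappings, boolean hasInheritance,
                            boolean hasEvents, boolean hasFetchGroups) {
        if (hasInheritance || hasEvents || hasFetchGroups) {
            return false;
        }
        return mappings.stream().allMatch(m -> m == MappingKind.DIRECT);
    }
}
```

Under this sketch, adding single-primary-key relationships later would simply widen the set of allowed MappingKind values.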

Singleton cache keys and cache refactoring

Currently cache access can be an expensive operation. This will be improved by simplifying the CacheKey.
For singleton primary key objects, the Id value (Integer, Long, String) will instead be used as the cache key.
This will have a very broad impact, as it changes the type used for the primary key from Vector to Object.
A new CacheId object will be used for composite or complex primary keys. The CacheId will be a basic wrapper for an Object array, adding equals and hashCode implementations.
A CacheKey will still be used as the cache value.
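A minimal sketch of the CacheId wrapper described above, assuming only what the spec states (an Object array with equals and hashCode); the implementation details are illustrative:

```java
import java.util.Arrays;

public final class CacheId {
    private final Object[] primaryKey;

    public CacheId(Object[] primaryKey) {
        this.primaryKey = primaryKey;
    }

    // Value-based equality over the key components, so a composite primary
    // key can be used directly as a hash map key.
    @Override
    public boolean equals(Object other) {
        return other instanceof CacheId
                && Arrays.equals(this.primaryKey, ((CacheId) other).primaryKey);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(primaryKey);
    }
}
```

The cache map could then be keyed by Object: the raw Id value (Integer, Long, String) for singleton keys, which already have correct equals and hashCode, or a CacheId for composite keys.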

The cacheKeyType will be configurable on a ClassDescriptor or through the existing @PrimaryKey annotation.

This would affect a lot of internal API, as well as some external API that is currently typed to Vector. The public API taking Vector could still be supported, but the API returning Vector would either need to be changed, or new methods added and the old ones deprecated.

The purpose of this change is performance. It also has the benefit of removing our usage of the legacy Vector API. For JPA classes that use a single simple Id value, it has the further benefit of using the JPA Id value as the cache key. For JPA IdClass or EmbeddedId the cache key will not match the JPA Id, but the cache key is mainly an internal value and should reflect what is optimal for cache usage.

This work removes the casting of the primary key to Vector, so it would make it easy to support using the JPA IdClass as the cache key if desired (as a separate feature unrelated to performance). Extreme caution should be used in doing so, however, as it requires that users implement equals() and hashCode() correctly in their IdClass, which is quite easy to get wrong. It would also have a negative performance impact: building the IdClass for our internal cache usage would be much less efficient than using the CacheId, and the user's equals() and hashCode() implementations are most likely not optimal.

The existing API on IdentityMapAccessor, ReadObjectQuery and ReportQuery currently uses Vector for the primary key. This API will still be supported, but deprecated. New API will be added that takes Object for the primary key.
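The deprecation pattern might look like the following sketch. The method name echoes the spec, but the signatures and bodies are illustrative; a plain HashMap stands in for the real identity map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

public class IdentityMapAccessorSketch {
    private final Map<Object, Object> cache = new HashMap<>();

    public void putInIdentityMap(Object primaryKey, Object entity) {
        cache.put(primaryKey, entity);
    }

    // New API: the primary key is an Object (the Id value itself for
    // singleton keys, or a CacheId for composite keys).
    public Object getFromIdentityMap(Object primaryKey) {
        return cache.get(primaryKey);
    }

    /** @deprecated use {@link #getFromIdentityMap(Object)} */
    @Deprecated
    public Object getFromIdentityMap(Vector<?> primaryKey) {
        // A singleton Vector unwraps to the Id value it contains; composite
        // keys would wrap in the proposed CacheId (omitted in this sketch).
        Object key = (primaryKey.size() == 1) ? primaryKey.get(0)
                                              : primaryKey.toArray();
        return getFromIdentityMap(key);
    }
}
```

The deprecated Vector overload delegates to the new Object-based API, so existing callers keep working while new code uses the simpler key type.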

The JPA Cache interface will be extended in the same pattern as our JpaEntityManager to expose our additional cache API using the JPA Id. This will make our internal cache key type transparent to JPA users.

Batch reading using EXISTS and IN

Currently batch reading uses a join of the batch query to the source query.

This join has some issues:

For an m-1 or m-m mapping, the join causes duplicate rows to be selected (a DISTINCT is used to filter rows for m-1, but duplicates are returned for m-m, as the join table information is needed).

DISTINCT does not work with LOBs.

DISTINCT may be less efficient on some databases than alternatives.

The join does not work well with cursors.

Whether the join works with pagination needs verification.

One alternative is to use an EXISTS with a sub-select instead of a JOIN. This should not result in duplicate rows, avoiding the issues with DISTINCT.

Another option is to load the target objects using an IN clause containing the source objects' primary keys. This would also work with cursors, but it has the limitation of requiring custom SQL support for composite primary keys, and it produces a large dynamic SQL query.

A new BatchFetchType enum will be defined, and the usesBatchReading flag will be enhanced to setBatchFetch, allowing JOIN, EXISTS or IN. This option will also be added to ObjectLevelReadQuery, rolling up the current four batch-reading properties into a new BatchFetchPolicy and moving them up from ReadAllQuery so that ReadObjectQuery can also specify nested batched attributes. A new BatchFetch annotation and query hint will be added.
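A minimal sketch of the proposed enum; the three values come from the spec, while the comments summarize the behavior described in this section:

```java
public enum BatchFetchType {
    JOIN,    // join the batch query to the source query (current behavior)
    EXISTS,  // same join condition, placed inside an EXISTS sub-select
    IN       // query target objects by an IN clause of source key values
}
```

A query would then opt in through the setBatchFetch API or the new query hint, e.g. hypothetically setBatchFetch(BatchFetchType.EXISTS).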

The EXISTS option will operate very similarly to the existing JOIN batching, simply putting the same join condition inside a sub-select. Although it should have been straightforward, I hit several issues with our current sub-select support that I had to debug and fix. This also led to discovering some issues with our JPQL support, which also needed to be fixed. EXISTS supports all of the same mappings as JOIN, but does not require a DISTINCT. m-1 will not use a DISTINCT by default for EXISTS (even though it would avoid duplicates, as the reason for using EXISTS is normally to avoid the DISTINCT), but the DISTINCT can be enabled on the originating query if desired. EXISTS will still select duplicates for m-m, and will not work well with cursors.

The IN option will query a batch of the target objects using an SQL IN clause containing the key values. For a 1-1, the foreign keys from the source rows will be extracted; if these are the primary keys of the target, they will first be filtered by checking the cache for each object. The remaining keys will be split into batches of a query-configurable size, defaulting to 256. For composite keys, the multi-array SQL syntax ((key1, key2), (...)) IN ((:key1, :key2), (...)) will be used. Only some databases support this syntax, so composite primary keys will only be supported on some databases. For 1-m or m-m, the source rows do not contain the primary keys of the target objects. The IN option will still be supported, but will be based on joining to the source query as in JOIN (or perhaps EXISTS) for the set of keys. For cursors, the IN size will default to the cursor pageSize, and each cursor page will be processed separately.
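The key-batching step described above could be sketched as follows. Only the default size of 256 comes from the spec; the class and method names are illustrative, and cache filtering and SQL generation are omitted.

```java
import java.util.ArrayList;
import java.util.List;

public class InBatchPartitioner {
    static final int DEFAULT_BATCH_SIZE = 256;

    // Split the keys remaining after the cache check into IN-clause-sized
    // batches; each batch becomes one SELECT ... WHERE pk IN (...) query.
    static <K> List<List<K>> partition(List<K> keys, int batchSize) {
        List<List<K>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            batches.add(keys.subList(i, Math.min(i + batchSize, keys.size())));
        }
        return batches;
    }
}
```

For cursors, the same partitioning would run with batchSize set to the cursor pageSize, one page at a time.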

Testing

Both the existing performance and concurrency tests and public benchmarks will be used to monitor and evaluate performance improvements.