User story #1: Datastore caching

Datastore caching should make it possible to transparently wrap any
other Datastore (i.e. the source Datastore) and speed up related
subsequent queries.

The Datastore cache acts as a buffer between the client and the actual
source Datastore. It may persist across client sessions.

The maximum capacity can be set by the user.

General purpose framework

Can we propose a general framework for caching arbitrary data?
The raster and vector data model APIs are not unified in GeoTools.

Data to be stored and queries

One issue is the way features are retrieved from a Datastore: a
Datastore usually receives a Query, and features are not retrieved by
id. Slightly different queries may yield a largely common set of features.

What we will store are features, rather than query results.

But we need to track queries and maintain some kind of relationship
with their result features. How do we want to remember previous
queries? Which relationships with features do we keep track of?

First, we suggest keeping track of the spatial bounds of queries and
maintaining a spatial index of the cached features.
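As a minimal sketch of this idea, the tracker below remembers the bounds of answered queries and reports whether a new bbox query is already fully covered by a cached one. The `Envelope` class and all names here are simplified stand-ins, not GeoTools API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: remember the bounds of answered queries; a new bbox query can
// be served from cache only if some previously cached bbox contains it.
// Envelope is a minimal stand-in for a real spatial envelope class.
public class QueryBoundsTracker {
    static class Envelope {
        final double minX, minY, maxX, maxY;
        Envelope(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY;
            this.maxX = maxX; this.maxY = maxY;
        }
        boolean contains(Envelope o) {
            return minX <= o.minX && minY <= o.minY
                && maxX >= o.maxX && maxY >= o.maxY;
        }
    }

    private final List<Envelope> cachedBounds = new ArrayList<>();

    // Record the bounds of a query whose results are now cached.
    void register(Envelope queryBounds) { cachedBounds.add(queryBounds); }

    // True if the query can be answered entirely from the cache.
    boolean isCovered(Envelope queryBounds) {
        for (Envelope e : cachedBounds) {
            if (e.contains(queryBounds)) return true;
        }
        return false;
    }
}
```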

What types of queries should the cache handle?

BBox Filter

FID Filter

Other queries may require keeping additional indices of features. This
would also make it possible to implement specialized caches for
performance-sensitive operations, an extension mechanism we should take
care to provide.

The cache should delegate unhandled queries to the source Datastore. A
two-step query of the datasource may be used:

1. get fid list from datasource
2. get features by fid that the cache doesn't already hold.

We will have to check how this performs, but it should at least save
bandwidth when working with a remote datastore.
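The two-step scheme above could look like this sketch, with a hypothetical `Source` interface and string-valued features standing in for the real DataStore and Feature types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the two-step query: fetch the fid list from the source,
// then fetch only the features the cache does not already hold.
public class TwoStepQuery {
    // Minimal stand-ins for the real DataStore/Feature types.
    interface Source {
        List<String> getFids(String query);                // step 1: ids only
        Map<String, String> getFeatures(Set<String> fids); // step 2: bulk fetch
    }

    private final Map<String, String> cache = new HashMap<>();
    private final Source source;

    TwoStepQuery(Source source) { this.source = source; }

    List<String> query(String q) {
        List<String> fids = source.getFids(q);             // step 1
        Set<String> missing = new HashSet<>();
        for (String fid : fids) {
            if (!cache.containsKey(fid)) missing.add(fid);
        }
        if (!missing.isEmpty()) {
            cache.putAll(source.getFeatures(missing));     // step 2: misses only
        }
        List<String> result = new ArrayList<>();
        for (String fid : fids) result.add(cache.get(fid));
        return result;
    }
}
```

Only the features missing from the cache cross the network in step 2, which is where the bandwidth saving comes from.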

Data persistence

Feature is not Serializable.

We will have to find a way to serialize the data that features represent.

What if the Datastore schema changes?

What if the cache gets out of sync with the source datastore? How do we detect this?

Eviction strategy

When the cache is full, it must make room for new data. We will have
to define an eviction strategy for feature cache and index (or
indices).
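One simple baseline eviction policy is LRU with a fixed capacity, which Java's `LinkedHashMap` supports directly via access order. This sketch ignores the spatial index, which would also need updating on each eviction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a capacity-bounded, access-ordered map as a baseline LRU
// eviction policy for cached features. Evicted entries would also have
// to be removed from the spatial index (not shown here).
public class LruFeatureCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruFeatureCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, i.e. LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least recently used entry
    }
}
```

The maximum capacity requested in the user story maps directly to `maxEntries` here.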

Multisource datastore

In the low-bandwidth scenario, data may best be copied from the source server and distributed on CD-ROM or DVD-ROM. Clients would appreciate downloading only updates from the server, so they can still display up-to-date data.

We would like to set up a datastore that can connect to two (or more?) source datastores which store the same features, but possibly in different versions. For example, one of the datastores would be local storage, with fast access, but holding data frozen at time t. The other datastore would represent a remote server.

The multisource datastore would then retrieve features:

from the local store, if the feature has not been modified

from the remote store, if the feature has been modified.

It should be able to ask the remote store which features have been modified, and update the local store accordingly.
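The dispatch rule above could be sketched as follows, assuming the set of modified feature ids has been obtained from the remote store; the `Store` interface and all names are hypothetical stand-ins.

```java
import java.util.Set;

// Sketch of the multisource dispatch rule: features unchanged since the
// local snapshot come from the local store, modified ones from the
// remote store. Store and fid/feature types are hypothetical stand-ins.
public class MultiSourceStore {
    interface Store { String getFeature(String fid); }

    private final Store local;
    private final Store remote;
    private final Set<String> modifiedSinceSnapshot; // reported by the remote

    MultiSourceStore(Store local, Store remote, Set<String> modified) {
        this.local = local;
        this.remote = remote;
        this.modifiedSinceSnapshot = modified;
    }

    String getFeature(String fid) {
        return modifiedSinceSnapshot.contains(fid)
                ? remote.getFeature(fid)  // changed since time t
                : local.getFeature(fid);  // frozen local copy is still current
    }
}
```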

Since we have to know whether the original data has changed since time t, this service does not seem possible without some kind of data versioning system.
This could involve a sequential checkout (based on spatial extent) from a versioned remote datastore.

This is the scenario where a client has a local copy of a data source and wants to download updates from the original source. The local copy is used to initialize the cache; the client then asks the remote source for updates and updates the cache (or local copy).

Asynchronous updates

In the low- or no-bandwidth scenario, users may want to edit the data locally and commit their modifications when possible.

A data cache using persistent storage will allow clients to carry on working while disconnected.

At commit time, the original data may have been modified by another client. This is the well-known issue of concurrent modifications.

Scope of intended work, roadmap

The initial goal is to allow a client application to cache data when querying remote sources, and to make a disconnected mode possible. Handling asynchronous updates is a plus.

In this section, I try to detail the deliverables and the scope of the intended work.

1. in-memory data cache, read-only

The simplest scenario.

I will propose an extensible framework for DataCache.
I will propose a test framework to assess DataCache functionalities.
I will test different implementations of DataCache against MemoryDataStore, the reference implementation of DataStore. The implementations will differ, among other things, in the type of spatial index used.

2. client demo using cache

A demo would be nice to appreciate the actual benefits of DataCache.

First, I can think of a very simple client: use JMapPane with two panes (cached and control) to visually compare rendering times.

Second, I can try a demo using uDig (programmatically adding a cached DataStore to the catalog).

3. persistent data cache (disk storage), read-only

Extension of 1 to disk-based data cache.

I will consider using the HSQL DataStore as the cache backend, but I don't know whether the plugin is mature enough to be used (ask Jody?). Other solutions should be considered too.

Most of the cache behavior should be under the control of the query tracker, which knows which parts of the universe are cached and may record how often each part of the universe is accessed.
We can't remember every query, but we can idealize or normalize them by snapping each to a grid. Typically, a tile will be a quadrant of a quadtree, and we approximate a query by the minimum quadrant containing it.
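As an illustration of this normalization, the sketch below computes the minimum containing quadrant of a bbox inside a [0,1] x [0,1] universe, returning its level and cell coordinates at that level. The unit universe and all names are assumptions made for the example.

```java
// Sketch: normalize a bbox query to the minimum quadtree quadrant that
// contains it, inside a [0,1]x[0,1] universe. The quadrant is identified
// by its level and its (x, y) cell coordinates at that level.
public class MinQuadrant {
    public static int[] find(double minX, double minY,
                             double maxX, double maxY, int maxLevel) {
        int level = 0, qx = 0, qy = 0;
        double x0 = 0, y0 = 0, size = 1.0;
        while (level < maxLevel) {
            double half = size / 2;
            int cx = minX < x0 + half ? 0 : 1; // candidate child column
            int cy = minY < y0 + half ? 0 : 1; // candidate child row
            double nx0 = x0 + cx * half, ny0 = y0 + cy * half;
            // stop if the bbox does not fit in that single child quadrant
            if (maxX > nx0 + half || maxY > ny0 + half) break;
            x0 = nx0; y0 = ny0; size = half;   // descend into the child
            qx = qx * 2 + cx; qy = qy * 2 + cy;
            level++;
        }
        return new int[] { level, qx, qy };
    }
}
```

A query straddling the center of the universe stays at level 0, which is why a maximum fetch granularity per level (discussed below for look-up) matters.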

We can attach to each quadrant:

access statistics

possibly a list of features

On eviction, we pick a leaf quadrant (i.e. a quadrant with no children) in the tree and mark it invalid. Parent quadrants should be marked invalid or partial. When all four child quadrants of a quadrant are invalid, they should be removed from it, and it reverts to a leaf. A leaf is always marked partial, unless it is at the lowest allowed level.

In order to find the leaf to evict, we traverse the tree by level, looking for a quadrant that matches the criterion (for example, least recently used). If one is found, we iterate from that quadrant. If not, i.e. all quadrants at the given level are equal according to the chosen criterion, we do an exhaustive lookup of the next lower level, and so on. If we have to exhaustively traverse many levels this will be time-consuming, so we may fall back to random eviction after a given number of levels has been traversed with no match.

On look-up, given an input bbox query, we first search for the minimum containing quadrant. Then we test whether that quadrant is valid and entire. If it is not valid, we will have to fetch it from the source store. If it is not entire, we search for invalid inner quadrants. To keep fetches manageable, each look-up should yield at most a maximum number of quadrants to fetch, possibly depending on the level of the quadrant.
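A sketch of this look-up step, assuming each quadrant carries a valid/partial/invalid state as described above; the `Quadrant` node type and its names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the look-up step: starting from the minimum containing
// quadrant, collect the invalid descendant quadrants that must be
// fetched from the source store. Quadrant/State are stand-in types.
public class QuadrantLookup {
    enum State { VALID, PARTIAL, INVALID }

    static class Quadrant {
        State state;
        List<Quadrant> children = new ArrayList<>(); // empty for a leaf
        Quadrant(State s) { state = s; }
    }

    // Returns the quadrants that have to be fetched to answer the query.
    static List<Quadrant> toFetch(Quadrant q) {
        List<Quadrant> missing = new ArrayList<>();
        collect(q, missing);
        return missing;
    }

    private static void collect(Quadrant q, List<Quadrant> missing) {
        switch (q.state) {
            case VALID:
                return;                 // fully cached, nothing to fetch
            case INVALID:
                missing.add(q);         // fetch the whole quadrant
                return;
            case PARTIAL:
                for (Quadrant child : q.children) collect(child, missing);
                break;                  // only the invalid parts are fetched
        }
    }
}
```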

Development notes and progress

Some notes on the work being done during this Summer of Code project, and a report on progress.

Week 6

working on QueryTracker

Week 5

change in design: the feature cache should be an extension of FeatureStore rather than DataStore, so I designed a new interface, FeatureCache, that extends FeatureStore. A DataCache will then be no more than a collection of FeatureCaches, one per FeatureType.

wrote FilterSplitter: a FilterVisitor that splits a filter into two filters, isolating spatial restrictions from other attribute restrictions.

tried another R-tree framework: http://research.att.com/~marioh/spatialindex. This framework provides visitors and query strategies to query the tree, which is great, but the tree seems to be slower than org.geotools.index.RTree.

committed a first implementation of FeatureCache: InMemoryFeatureCache, using LRU eviction. It works, but is not that fast.

submitted a presentation proposal for FOSS4G

planned to continue work on a demo using JMapPane, but the task is delayed until I have a better implementation of FeatureCache

@task

move eviction mechanisms to QueryTracker

try to better understand how transactions work. I will need this when moving from a read-only cache to a write-through cache.

update UML diagram to reflect design changes

Weeks 3 & 4

updated wiki page

looked at JMapPane and tried a first demo; well, it was an attempt

off to a seminar, so not much work done.

Week 2

worked on : in-memory read-only data cache (continued)

created module unsupported/caching under trunk, and configured my project build

@task

think of an eviction strategy for flushing part of a spatial index. The idea would be to walk the tree to find suitable nodes we can get rid of, but the org.geotools.index.RTree implementation does not make this easy. The tree could instead be traversed by recursively splitting the top bounding box into quadrants.

talk with Andrea about versioning and his idea of storing features by quadrants, i.e. tiles.

post project javadoc somewhere

Week 1

worked on : in-memory read-only data cache

readings : caching, spatial indexing

draft implementation of classes - see attached UML diagram

created wiki page with user stories

@task

implement cache size limit

better handle multiple types in a DataStore; for now I have assumed there was one and only one FeatureType in the DataStore