Is there a way to programmatically invalidate the cache using the cache API?

I'll explain why I'm asking:

Our UI allows users to join somewhat arbitrarily across different objects. Each object is represented in Teiid as a continuous query backed by a reusable-execution translator. The intent is to allow the user to see data change in the UI as the translators call dataAvailable(). The problem is that Teiid waits for all executions to indicate dataAvailable() before the engine re-executes the command. Our desired behavior is that the command is re-executed using new data from any execution indicating dataAvailable(), while using previous data from any execution that is not ready.

I believe we can achieve this using delegating translators that appropriately intercept and broadcast dataAvailable() across all executions in a command and, somehow, cache previous data. I would prefer to use Teiid's caching API, but it will only work for us if we can programmatically invalidate the cache.

Since the entries are going into the result set cache, you just need to invalidate the result set cache. Note that this will invalidate all the entries for that VDB; there is no fine-grained approach available. Look at the "clearCache" method on the Admin API.

I believe that if the data is available in the cache and it is valid, then the engine will not even send the execute request to the translator. It looks like you want to synchronize data availability across all your translators; I am not sure Teiid-based caching can be used for your purpose, as interception occurs much earlier in the execution cycle.

Suppose I execute a query in continuous mode: SELECT * FROM t1, t2. At some point, t1 indicates dataAvailable() and later, t2 indicates dataAvailable(). At that point, the client will receive the results of the query. Upon executing the query the second time, t1 and t2 will again throw DataNotAvailable and indicate dataAvailable() at different times. Under current Teiid behavior, the client will have to wait until both data sources have indicated dataAvailable() before it receives a result.

I'm proposing a new result set cache scope named REUSE_EXECUTION. When an execution returns this cache scope in a continuous query, Teiid will reuse the cached result set until the cache is invalidated *and* the execution calls ExecutionContext.dataAvailable(). In the example, if both t1 and t2 have this cache scope, the client will receive an unbroken stream of results:

1. Initially, the client will receive a result set.

2. Before t1 or t2 throws DataNotAvailable, the client will receive a series of result sets, each identical to #1. Both t1's and t2's results are served from the cache.

3. When t1 throws DataNotAvailable, the client will receive a series of result sets, each identical to #1. Both t1's and t2's results are served from the cache.

4. When t1 invalidates the cache and indicates dataAvailable(), t1's cache entry is dropped and data is served from t1's execution.
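The intent of the proposal can be sketched with a toy model (all names here are hypothetical; this models the proposed behavior, not Teiid internals): each source keeps its own cache entry, re-execution reuses cached rows, and only a source that has signaled dataAvailable() is re-read.

```java
import java.util.*;

// Hypothetical model of the proposed REUSE_EXECUTION behavior: the engine
// re-executes the join from each source's cached rows, and only a source
// that has invalidated its cache and signaled dataAvailable() is re-read.
public class ReuseExecutionModel {
    // Per-source cache entry: absence means "invalidated, consult the execution".
    private final Map<String, List<String>> cache = new HashMap<>();
    // Stand-in for the real executions backing each source.
    private final Map<String, List<String>> sources = new HashMap<>();

    public void setSourceRows(String source, List<String> rows) {
        sources.put(source, rows);
    }

    // Models ExecutionContext.dataAvailable() combined with invalidation.
    public void dataAvailable(String source) {
        cache.remove(source);
    }

    // Re-execution: cached rows are reused; invalidated sources are re-read.
    public List<String> execute(String... sourceNames) {
        List<String> result = new ArrayList<>();
        for (String s : sourceNames) {
            List<String> rows = cache.get(s);
            if (rows == null) {          // cache miss or invalidated
                rows = sources.get(s);   // "consult the execution"
                cache.put(s, rows);
            }
            result.addAll(rows);
        }
        return result;
    }
}
```

In this model, t1 publishing new rows without calling dataAvailable() changes nothing for the client; only the invalidate-and-signal step causes a refresh, and t2 keeps being served from its cache.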

In my mind, there may be some value in automatically invalidating the cache on dataAvailable(), or perhaps in overloading dataAvailable() to do this.

You can emulate this behavior by initially using a CacheDirective with a cache scope of USER (or whatever is appropriate) and then, when you want to invalidate the result, returning a CacheDirective with an initial scope of NONE. Then switch the scope to USER again before the results are finished. This is not quite convenient, though, since the getCacheDirective method is on the ExecutionFactory and you would be caching at a higher scope than you want.
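A rough sketch of that workaround, using a stand-in Scope enum (the real one is on org.teiid.translator.CacheDirective) and reducing the directive logic to its two phases:

```java
// Stand-in model of the workaround described above: return an initial scope
// of NONE to drop/skip the cached entry, then switch back to USER before the
// results are finished so the fresh results are cached again.
public class ScopeToggle {
    public enum Scope { NONE, USER }  // hypothetical subset of CacheDirective scopes

    private boolean invalidateRequested;

    // The translator flags that the next execution should bust the cache.
    public void requestInvalidation() {
        invalidateRequested = true;
    }

    // Scope handed out when the execution starts.
    public Scope initialScope() {
        return invalidateRequested ? Scope.NONE : Scope.USER;
    }

    // Scope switched back before the results are finished, so the fresh
    // results get cached under USER again.
    public Scope scopeBeforeResultsFinished() {
        invalidateRequested = false;
        return Scope.USER;
    }
}
```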

A COMMAND cache scope is possible, which would just hook into the result-sharing logic and not remove the entries with each re-execution.

In general, though, this description seems like it may lead to redundant processing of cached results (case 2 above). Is a user-level query result cache also needed in this scenario?

> In general, though, this description seems like it may lead to redundant processing of cached results (case 2 above). Is a user-level query result cache also needed in this scenario?

I'm not sure what you mean. The second execution may or may not be delayed depending on exactly when both t1 and t2 throw DataNotAvailable. Typically, in our translators, we throw DNA during the execute() method, so I don't think there would be any redundant processing.

*edit* Oh, I think I see what you're talking about. My example is a bit unrealistic when I say that t1 and t2 will throw DNA at different times. In fact, they are likely to throw DNA at the same (logical) time: in their execute(). The real key to my example is that they call dataAvailable() at different times.

Sorry, I meant comment 3 and the later discussion. Specifically, the discussion around a COMMAND scope and the desire to move from an *all* data available model to an (edit) *any* data available model for the engine re-executing a query.

As you indicate, with TEIID-2301 and the USER/SESSION cache, we could shoehorn in the behavior that we want, but it would be kind of ugly.

> Before t1 or t2 throws DataNotAvailable, the client will receive a set of result sets each one identical to #1. Both t1's and t2's results are served from the cache.

This is new behavior to support caching of results across each execution, somewhere between a command and a session scope. If you are performing higher-level joins or other expensive operations, then that's where the concern of redundant processing comes in. A higher-level (user result) cache entry would serve duplicate results without the need for intermediate processing - however, that does not currently fit well with the notion of continuous results and would rather fit in with a more general async processing approach.

> When t1 throws DataNotAvailable, the client will receive a set of result sets each one identical to #1. Both t1's and t2's results are served from the cache.

It seems like you would not throw DataNotAvailable in this case at all. Rather, the engine would continue to reprocess the query from the cached results, which implies this is the same as #2.

> When t1 invalidates the cache and indicates dataAvailable(), t1's cache is invalidated, data is served from t1's execution

The key here would seem to be the invalidation. Once the cached results are no longer available, then the execution will be consulted.

So the questions are:

Is a new scope needed - are the results truly specific to each reusable query or is session level sufficient?

What is the invalidation criteria - is it the full command or just the tables that contributed to the results? If it's the latter, then the existing data modification event logic would be sufficient.

> Before t1 or t2 throws DataNotAvailable, the client will receive a set of result sets each one identical to #1. Both t1's and t2's results are served from the cache.

> This is new behavior to support caching of results across each execution, somewhere between a command and a session scope. If you are performing higher-level joins or other expensive operations, then that's where the concern of redundant processing comes in. A higher-level (user result) cache entry would serve duplicate results without the need for intermediate processing - however, that does not currently fit well with the notion of continuous results and would rather fit in with a more general async processing approach.

I hadn't thought about the duplicate processing. In this light, here's a slightly different way of stating the request:

- It would be nice if the engine cached results from as high in the plan tree as possible to avoid redundant processing.

- I need a way for individual executions to indicate that previous results are invalidated, and thus any dependent results (higher in the plan tree) are also invalidated.

- I want the engine to return results to the client only after there's a chance that the results have changed (ie, at least one of the executions has indicated the previous results are invalidated).

Two possible caching strategies come to mind:

- Cache results from all plan nodes (this could overload the caching system?)

- Cache results from only expensive operations

For now, I'm ignoring the question of caching strategy (although it's an interesting one: allow the client to decide on the strategy? can the application supply a new caching strategy? can the application provide information on how to compute expensive operations?)

> When t1 throws DataNotAvailable, the client will receive a set of result sets each one identical to #1. Both t1's and t2's results are served from the cache.

> It seems like you would not throw DataNotAvailable in this case at all. Rather, the engine would continue to reprocess the query from the cached results, which implies this is the same as #2.

Agreed. I'm not sure what I was thinking.

> When t1 invalidates the cache and indicates dataAvailable(), t1's cache is invalidated, data is served from t1's execution

> The key here would seem to be the invalidation. Once the cached results are no longer available, then the execution will be consulted.

Yes

> Is a new scope needed - are the results truly specific to each reusable query or is session level sufficient?

Yes, I think a new scope is needed. If a client issues the same query at two different times from the same connection, I would not expect the second query to receive cached results.

> What is the invalidation criteria - is it the full command or just the tables that contributed to the results? If it's the latter, then the existing data modification event logic would be sufficient.

Dunno. A few considerations come to mind:

How are stored procs handled in the existing data modification event logic? I would want stored procs to be treated the same as tables.

Does TEIID-2139 (common table expressions) affect the decision?

Should translators be allowed to invalidate results based on filter criteria? E.g., invalidate one of the following executions but not the other:

SELECT a FROM t WHERE b=1

SELECT a FROM t WHERE b=2

Are there source-specific considerations?

My gut says that invalidating based on the command is the more flexible approach but, having said that, our immediate use cases are served by invalidating on tables and stored procedures. If there's a substantial difference in implementation cost, I'd say start with tables & stored procs and add full-command invalidation later.

While on the subject of APIs, when I was first thinking about this, I had assumed that dataAvailable would both:

1. invalidate any previously cached results, and

2. indicate the data source has new results for the engine.

I don't see how it would be useful to indicate dataAvailable without first invalidating the previous results, nor how it would be very useful to invalidate previous results without having new results to offer (except to pause the engine, which shouldn't be necessary under the new behavior). If this assumption plays out in practice, I could see evolving a convenience API over the existing primitives.
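If that assumption holds, the convenience layering might look like this toy sketch (names hypothetical; the two primitives stand in for cache invalidation and the engine notification):

```java
// Hypothetical convenience layering: a single dataAvailable() that both
// invalidates previously cached results and signals new data, built on the
// two separate primitives discussed above.
public class SourceSignal {
    private boolean cacheValid = true;
    private boolean newDataSignaled = false;

    public void invalidatePrevious() { cacheValid = false; }     // primitive 1
    public void signalNewData()      { newDataSignaled = true; } // primitive 2

    // Convenience method: assumes the two primitives are always paired.
    public void dataAvailable() {
        invalidatePrevious();
        signalNewData();
    }

    public boolean isCacheValid() { return cacheValid; }
    public boolean hasNewData()  { return newDataSignaled; }
}
```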

> Cache results from all plan nodes (this could overload the caching system?)

Yes, that is fundamentally not practical.

> Cache results from only expensive operations

This is really part of the idea around what hints could aid TEIID-2139: the user should be able to direct, at some level, what is considered to be a common subexpression for processing purposes (whether that's common within a single execution or across multiple).

> Yes, I think a new scope is needed. If a client issues the same query at two different times from the same connection, I would not expect the second query to receive cached results.

Is there a reason for this? If you are actively invalidating, then won't the cache always reflect your expectations? Or do you have a need more specific to continuous queries?

> How are stored procs handled in the existing data modification event logic? I would want stored procs to be treated the same as tables

Yes, they are. Cached results are tracked by which tables/procedures contributed to them. If a data event marks a contributing procedure as modified, then existing entries will be invalidated.

> Does TEIID-2139 (common table expressions) affect the decision?

The interaction between the existing TEIID-2139 logic and caching is that if the CacheDirective scope is NONE or cached results are returned, then the intra-command result caching is disabled.

> Should translators be allowed to invalidate results based on filter criteria? E.g., invalidate one of the following executions but not the other:

Possibly, as you could track down to the row level - but there is always a trade-off between the complexity of the system and the rapidity of change. For typical read-mostly situations you want as little overhead as possible for validating cache entries (for example, the H2 caching system, last I checked, just used a database-wide increment such that any update would invalidate all older entries).
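The database-wide increment mentioned for H2 is easy to sketch (a simplification for illustration, not H2's actual code): one counter, bumped by any update, against which every entry's creation version is compared.

```java
import java.util.*;

// Coarse-grained invalidation in the style described for H2: a single
// database-wide counter; any update bumps it, and every cache entry created
// under an older counter value is considered stale.
public class VersionedCache {
    private long dbVersion = 0;
    private final Map<String, Long> entryVersions = new HashMap<>();
    private final Map<String, String> values = new HashMap<>();

    public void put(String key, String value) {
        values.put(key, value);
        entryVersions.put(key, dbVersion);
    }

    // Cheap validity check: one comparison, no per-table bookkeeping.
    public String get(String key) {
        Long v = entryVersions.get(key);
        if (v == null || v < dbVersion) {
            return null; // missing or stale
        }
        return values.get(key);
    }

    // Any write to any table invalidates all older entries at once.
    public void anyUpdate() {
        dbVersion++;
    }
}
```

The appeal is exactly the trade-off described: validation cost is a single comparison, at the price of invalidating everything on any change.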

> While on the subject of APIs, when I was first thinking about this, I had assumed that dataAvailable would both

It does 2. It does not do 1 - that would only make sense if you are thinking in terms of invalidating a single command result; otherwise there's no notion of what has changed.

Assuming that session-level caching with group/procedure invalidation is not sufficient, and ignoring other concerns about what/when to cache, what is quickly doable from an implementation perspective:

- Adding a user command scope to the cache directive is straightforward (there is a great deal of overlap with the concept of intra-query command caching).

- Allow dataAvailable to invalidate the user command cache entry if one was created.

> This is really part of the idea around what hints could aid TEIID-2139: the user should be able to direct, at some level, what is considered to be a common subexpression for processing purposes (whether that's common within a single execution or across multiple).

Makes sense. It bothers me a little (but only a little) that caching concerns leak up into the user query. I'd prefer that this information (somehow) be supplied in the metadata and let the planner do its job. However, I wouldn't want this to get in the way of progress.

> Should translators be allowed to invalidate results based on filter criteria? E.g., invalidate one of the following executions but not the other:

> Possibly, as you could track down to the row level - but there is always a trade-off between the complexity of the system and the rapidity of change. For typical read-mostly situations you want as little overhead as possible for validating cache entries (for example, the H2 caching system, last I checked, just used a database-wide increment such that any update would invalidate all older entries).

Absolutely agreed. My concern was around allowing for the option or not. That said, table-level invalidation is good enough to start with. It does bring to mind: if we find a need for finer control, can we use views to partition the table and invalidate the cache based on view name?

> Yes, I think a new scope is needed. If a client issues the same query at two different times from the same connection, I would not expect the second query to receive cached results.

> Is there a reason for this? If you are actively invalidating, then won't the cache always reflect your expectations? Or do you have a need more specific to continuous queries?

I see your point, but I have a concern: when does the translator have the opportunity to invalidate the cache? One possibility is that the execution factory would maintain a connection to its data source and invalidate entries entirely independently of queries. This approach works in the case of database triggers - the trigger fires every time a table is modified, the translator receives the trigger and invalidates the cache. It doesn't work so well when the data source doesn't provide asynchronous notification facilities. I'd like some opportunity to invalidate the cache synchronously with query processing. For example, getCacheDirective() could be invoked in all cases so the translator has a chance to interrogate the data source and, possibly, invalidate the cache. I imagine it would work something like this:

1. Client issues "SELECT c FROM t"
2. Engine calls ExecutionFactory.getCacheDirective()
3. getCacheDirective() returns SESSION scope
4. Cache is empty, so an execution is created and data is cached & returned to the client
5. Client issues "SELECT c FROM t"
6. Engine calls getCacheDirective()
7. getCacheDirective() interrogates the data source and determines that the cache is valid
8. getCacheDirective() returns SESSION scope
9. Engine returns data from the cache
10. Client issues "SELECT c FROM t"
11. getCacheDirective() interrogates the data source and determines that the data has changed
12. getCacheDirective() invalidates the cache
13. getCacheDirective() returns SESSION scope
14. Cache is empty/invalid, so an execution is created and data is cached & returned to the client

In the simple case, I don't see a performance hit, as getCacheDirective() would normally be a very cheap operation.
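The interrogation sequence above can be modeled in miniature (hypothetical names; the version token stands in for whatever cheap change probe the source offers):

```java
import java.util.*;

// Models the proposed flow: before each execution, the translator probes the
// source for a cheap change token and invalidates the session-scoped entry
// when the token has moved; otherwise results are served from the cache.
public class FreshnessCheckingDirective {
    private long sourceToken = 0;        // stand-in for the source's change counter
    private long cachedToken = -1;       // token the cached results were built from
    private List<String> cachedResults;  // session-scoped cache entry
    private List<String> sourceRows = new ArrayList<>();

    public void sourceChanged(List<String> newRows) {
        sourceRows = newRows;
        sourceToken++;
    }

    // Models the engine consulting getCacheDirective() before each execution.
    public List<String> executeQuery() {
        if (cachedResults != null && cachedToken != sourceToken) {
            cachedResults = null;        // translator invalidates the cache
        }
        if (cachedResults == null) {     // cache empty/invalid: create an execution
            cachedResults = new ArrayList<>(sourceRows);
            cachedToken = sourceToken;
        }
        return cachedResults;            // otherwise served from the cache
    }
}
```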

Thinking about this a bit further: if calling getCacheDirective() before every execution doesn't make sense (e.g., I'm not sure what it means to change the cache scope while something is cached), maybe it makes sense to introduce a new method on the execution factory that is invoked before createExecution - public void preExecution(Command, ExecutionContext, RuntimeMetadata).

The default implementation would do nothing but, obviously, execution factories would override it to 'prep' in some way for the upcoming query, such as invalidating the cache. I'm not sure what other useful things this method could do.

The functionality is based upon modification timestamps on all of the metadata objects. When a query is planned/executed, the objects that are used are tracked with the cache entries that are created. Based upon the entry creation timestamp, we'll invalidate any access to that entry if there are any modifications that happened after its creation.

The API isn't directly exposed at the translator level; rather, we just assume a lookup will be performed by whatever needs to asynchronously notify Teiid of changes. In embedded mode, the EmbeddedServer itself is the EventDistributorFactory.
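A toy model of the timestamp mechanism just described (hypothetical names; the real tracking lives inside the Teiid engine):

```java
import java.util.*;

// Models the described mechanism: each cache entry records its creation time
// and the metadata objects (tables/procedures) that contributed to it; an
// event marking a contributing object modified after that time invalidates
// access to the entry.
public class TimestampInvalidation {
    private long clock = 0;                                  // logical time
    private final Map<String, Long> lastModified = new HashMap<>();

    public static class Entry {
        final long created;
        final Set<String> contributors;
        final List<String> results;
        Entry(long created, Set<String> contributors, List<String> results) {
            this.created = created;
            this.contributors = contributors;
            this.results = results;
        }
    }

    public Entry createEntry(Set<String> contributors, List<String> results) {
        return new Entry(++clock, contributors, results);
    }

    // Models a data-modification event (e.g. delivered via the event distributor).
    public void markModified(String object) {
        lastModified.put(object, ++clock);
    }

    // Access is invalid if any contributor changed after the entry was created.
    public boolean isValid(Entry e) {
        for (String c : e.contributors) {
            Long m = lastModified.get(c);
            if (m != null && m > e.created) {
                return false;
            }
        }
        return true;
    }
}
```

Note how modifying an object that did not contribute to the entry leaves the entry valid, which is exactly the table/procedure-level granularity discussed earlier in the thread.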

> The default implementation would do nothing but, obviously, execution factories would override it to 'prep' in some way for the upcoming query such as invalidating the cache. I'm not sure what other useful things this method could do.

It's probably best to clarify which problem you're trying to address: caching a specific source query at a "user command" level, or taking a new approach to the existing session-level (or above) caching.