We should clone the SolrInputDocument before adding locally and then send that clone to replicas.

Description

If we don't do this, the behavior is a little unexpected. You cannot avoid having other processors hit documents twice unless we support using multiple update chains. We have another issue open that should improve this, but I'd like to do this sooner than that. We are going to have to clone anyway when we want to offer the ability to not wait for the local add before sending to replicas.

Cloning with the current SolrInputDocument and SolrInputField APIs is a little scary - there is an Object value to contend with - but it seems we can pretty much count on that being a primitive that we don't have to clone?
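
The cloning concern can be sketched with plain collections. This is an illustrative sketch only, not the real SolrInputDocument/SolrInputField API: the document is modeled as a Map, the field map and any multi-valued collections are deep-copied, and leaf values are copied by reference on the assumption (discussed above) that they are immutable primitives.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: models the cloning concern with plain collections,
// not the real SolrInputDocument/SolrInputField API.
public class DocCloneSketch {

    // Deep-copies the field map and any multi-valued collections, but
    // copies leaf values by reference. That is safe only if the values
    // are immutable (String, Integer, ...), which is exactly the
    // assumption being questioned above.
    public static Map<String, Object> deepClone(Map<String, Object> doc) {
        Map<String, Object> copy = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            Object v = e.getValue();
            if (v instanceof Collection) {
                copy.put(e.getKey(), new ArrayList<>((Collection<?>) v));
            } else {
                copy.put(e.getKey(), v); // assumed immutable leaf value
            }
        }
        return copy;
    }
}
```

If a processor ever stores a mutable, non-cloneable object as a field value, a copy like this silently shares it between the local add and the replica requests, which is the "scary" part.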

Yonik Seeley
added a comment - 14/Mar/12 01:52
Not sure I understand the problem this is solving. version is added to the SolrInputDocument and then that can be indexed locally and sent to replicas. DistributedUpdateProcessor should come right before RunUpdateProcessor (or are you assuming we might support random update processors in-between? Are there use cases for this?)

Mark Miller
added a comment - 14/Mar/12 02:00
A couple of use cases:
1. You don't want documents to have every processor applied to them twice. Intuitively, you expect that processors that run after the distrib processor will not touch the document before it is sent to replicas - but they will. Barring future features (like the JIRA around update chains and SolrCloud), there is currently no way to have an update processor hit a document only once on its way to a replica.
2. When we want to offer the option of adding locally and sending to replicas at the same time, we will still need this, and we don't want to get people used to the current odd behavior - you couldn't even easily duplicate it (not that I think you'd want to) if you added locally and sent to replicas in parallel.

Hoss Man
added a comment - 21/May/12 19:21
> DistributedUpdateProcessor should come right before RunUpdateProcessor (or are you assuming we might support random update processors in-between? Are there use cases for this?)
The main scenario I've seen/heard mentioned is the idea of processors that are computationally cheap but increase the size of the document significantly (e.g. clone a big text field and strip the HTML from the clone), so you want that to happen after distrib (on every replica) to minimize the amount of data sent over the wire.
> Intuitively, you expect that processors that run after the distrib processor will not hit the document sent to replicas before the docs are sent to replicas - but it will.
To clarify (because I kept not understanding what the crux of the issue was, so if I post this comment I'll remember next time without needing to ask Mark on IRC again): if we do NOT clone the doc, there is a race condition where local processors executing after the distrib processor may modify the document before it is serialized and forwarded to one or more shards.
One way to avoid this would be to stop treating the "local" replica as special, and instead have distrib forward back to localhost (via HTTP) just like every other replica, and abort the current request.

Mark Miller
added a comment - 21/May/12 19:32
> Are there use cases for this?
What about the signature update processor, or others that do something like modify the updateTerm on the update command? That type of thing does not get distributed, because we turn update commands into update requests and the mapping doesn't cover updateTerm.

Hoss Man
added a comment - 21/May/12 19:49
> What about the sig update proc or others that do something like modify the updateTerm on the update command - this type of thing does not get distributed as we turn update commands into update requests, and the mapping doesn't cover updateTerm.
That seems like a completely distinct problem ... as I mentioned in SOLR-3473, the distrib update proc really needs to be aware of all properties of the AddUpdateCommand and serialize/deserialize that info when forwarding to the leader (and all of the replicas). In the case of "updateTerm" it's not even enough to make sure the shard leader and all replicas know about it: other docs in other shards might also need to be deleted.

Mark Miller
added a comment - 21/May/12 20:32
> one way to avoid this would be to stop treating the "local" replica as special
Just to point out explicitly: currently we have to offer this (local is special) as well - it's a requirement for our current 'super safe' mode that we add locally first. So unless we also address some other issues, we'd have to allow things to happen both ways.

Hoss Man
added a comment - 22/May/12 23:39
> Cloning with the current SolrInputDocument, SolrInputField APIs is a little scary - there is an Object to contend with - but it seems we can pretty much count on that being a primitive that we don't have to clone?
> ... currently we have to offer this (local is special) as well though - it's a requirement for our current 'super safe' mode that we add locally first.
Strawman suggestion: instead of using a simple SolrInputDocument.clone(), with the scariness Miller mentioned about some processor maybe creating a field value that isn't Cloneable, what if instead we:
- use the JavaBinCodec to serialize the SolrInputDocument to a byte[]
- hang onto that byte[] while doing the local add
- then (re)use that byte[] in all of the requests to the remote replicas
Not sure how easy the reuse of the byte[] would be given the existing SolrJ API, but even if some field values aren't Cloneable primitives, they have to be serializable using the codec, or we already have a bigger bug to worry about than the risk of concurrent mods to the object.
Bonus: we only pay the cost of serializing the SolrInputDocument once on the leader, not N times for N replicas.
The only "downside" I can think of is that the leader has to buffer the whole doc in memory as a byte[] instead of streaming it, but if we assume most shards will have more than one replica, the trade-off seems worth it - to me, anyway (optimize to use more RAM and save time/CPU by not serializing redundantly).

Hoss Man
added a comment - 23/May/12 01:48
> Are there use cases for this?
FWIW: the other use case that just occurred to me today is trying to use any update processors that deal with multiple field values (either existing processors people may have written, or any of the new ones added by SOLR-2802 or SOLR-2599) in conjunction with field updating (SOLR-139), which is implemented as part of the DistributedUpdateProcessor.
I.e.: if you want to have a multivalued "all_prices" field and a single-valued "lowest_price" field that you populate using a combination of CloneFieldUpdateProcessorFactory + MinFieldValueUpdateProcessorFactory, that will need to happen after the DistributedUpdateProcessor in order to ensure all the values are available if someone does a field update to "add" a value to "all_prices".
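
A chain for this use case might look roughly like the following solrconfig.xml fragment. The chain name and field names are illustrative; the key point is that the clone/min processors sit between the distrib and run processors so they see the fully merged document on every replica:

```xml
<updateRequestProcessorChain name="prices">
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <!-- runs after distrib, so atomic "add"s to all_prices are already merged -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">all_prices</str>
    <str name="dest">lowest_price</str>
  </processor>
  <processor class="solr.MinFieldValueUpdateProcessorFactory">
    <str name="fieldName">lowest_price</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Without cloning, the CloneField/MinFieldValue steps on the leader would also mutate the document that is about to be forwarded to replicas, which is exactly the race this issue is about.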

Hoss Man
added a comment - 06/Jul/12 01:38
I think this is still pretty damn important:
1) nothing stops people from putting other processors between DistributedUpdateProcessorFactory and RunUpdateProcessorFactory
2) several use cases identified above are simpler with (or require) putting processors between DistributedUpdateProcessorFactory and RunUpdateProcessorFactory
3) any processor put between DistributedUpdateProcessorFactory and RunUpdateProcessorFactory is a time bomb of unpredictability without some sort of cloning.

Mark Miller
added a comment - 07/Jul/12 16:25
> nothing stops people from putting other processors between DistributeUpdateProcessorFactory and RunUpdateProcessorFactory
Yonik has argued that we shouldn't support this, rather than cloning.

Lance Norskog
added a comment - 17/Jul/12 03:15 - edited
Can I suggest the concept of a "preprocessed" or "flattened" or "read-only" document?
Each processor would have the responsibility to say "I only change un-flattened documents". After all, update processors can also do things like logging; it would not be OK to skip the logger just because it is above the DistributedProcessor stage.
If you want, you can add a marker interface for DistributedProcessor that means "I alter documents". The update processor chain handler can skip those processors.
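
The marker-interface idea could be sketched like this. All names here are hypothetical, not existing Solr API: mutating processors carry a marker, and the chain runner skips them for documents that are already flattened, while side-effect-only processors such as loggers still run.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch of the marker-interface proposal. A "document" is just a
// String here to keep the example self-contained.
public class MarkerChainSketch {

    /** Hypothetical marker: a processor that alters document content. */
    public interface AltersDocument {}

    /** Hypothetical processor: takes a document, returns a document. */
    public interface Processor extends UnaryOperator<String> {}

    /** Mutates the document, so it declares the marker. */
    public static class UppercaseProcessor implements Processor, AltersDocument {
        public String apply(String doc) { return doc.toUpperCase(); }
    }

    /** Side effect only (logging), so no marker: it must still run. */
    public static class LogProcessor implements Processor {
        public String apply(String doc) {
            System.out.println("saw: " + doc);
            return doc;
        }
    }

    // Chain runner for an already-flattened document: skip any processor
    // marked AltersDocument, but still run side-effect-only processors.
    public static String runFlattened(String doc, List<Processor> chain) {
        for (Processor p : chain) {
            if (p instanceof AltersDocument) continue;
            doc = p.apply(doc);
        }
        return doc;
    }
}
```

On a replica the same chain could then run with skipping enabled, so the document forwarded by the leader is never re-mutated, while loggers and other observers still see every update.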