
Latest revision as of 14:57, 7 October 2009

Connectivity Consolidation/Resequencer Buffer (CCB, CRB)

There was an idea to handle this case in the connectivity directly with the help of a buffer:

Each incoming PR is buffered for a period of time X. X is at least as long as the longest processing path takes for any given record. Initially this value would be chosen manually, but by evaluating Performance Counters it should be possible to determine or adjust X automatically.

While a PR is in the buffer, additional PRs for the same resource are consolidated, retaining only the latest, to reduce load.
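The two steps above can be sketched roughly as follows. This is only an illustration, not the Connectivity implementation: the names `ConsolidatingBuffer`, `put`, `release_due` and `estimate_delay` are hypothetical, the list of measured durations stands in for whatever the Performance Counters would report, and the sketch assumes that the wait of X is counted from the first PR seen for a resource (the text does not say whether a later PR restarts the timer).

```python
import time
from collections import OrderedDict


def estimate_delay(durations, safety_factor=1.5, floor=0.0):
    """Derive X from measured processing durations (hypothetical helper).

    X is the longest observed duration times a safety factor, never below
    `floor`, which plays the role of the manually chosen start value.
    """
    if not durations:
        return floor
    return max(floor, max(durations) * safety_factor)


class ConsolidatingBuffer:
    """Holds at most one pending PR per resource for `delay` seconds."""

    def __init__(self, delay, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        # resource_id -> (time of first PR, latest PR); oldest entry first
        self._pending = OrderedDict()

    def put(self, resource_id, pr):
        """Buffer a PR; a newer PR for the same resource replaces the older."""
        if resource_id in self._pending:
            first_seen, _ = self._pending[resource_id]
            # Assumption: keep the original arrival time, only swap the PR.
            self._pending[resource_id] = (first_seen, pr)
        else:
            self._pending[resource_id] = (self.clock(), pr)

    def release_due(self):
        """Remove and return (resource_id, pr) pairs that have waited X."""
        now = self.clock()
        due = []
        while self._pending:
            resource_id, (first_seen, pr) = next(iter(self._pending.items()))
            if now - first_seen < self.delay:
                break  # everything behind this entry arrived later
            del self._pending[resource_id]
            due.append((resource_id, pr))
        return due
```

A caller would invoke `put` for every incoming PR and poll `release_due` periodically, passing the returned PRs on for routing. Keeping the first arrival time bounds the wait for a resource that changes constantly; restarting the timer on every PR would instead hold such a resource back until it stops changing.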

CON

lag: all PRs will have a lag of ~2 times X before the index is updated. For mass crawling this might be acceptable, but an application using agents usually tries to minimize the period between the resource change and the update of the index.

Igor.novakovic.empolis.com: This is not a problem at all! Nobody wants to find something that is a work in progress (constantly changing). For example: even if we update our index instantly, the user will still have some delay between inspecting the search results and viewing a specific document. If the document constantly changes, then on viewing it the user may still see a different version than the one we indexed.

no guarantee that X is sufficient: delaying processing will reduce the chances of mishaps, but there is no guarantee that it prevents them. The simplest way to void the mechanism, even in simple scenarios, is for the system to be, for whatever reason, under a higher load than usual. This applies even more when the processing chain is more complex, such as in a cluster setup that spreads the processing load over several nodes. In such a scenario we also need to take into account that some nodes may be down temporarily while retaining the records that were assigned to them.

Connectivity may have to store a very large number of items before it can route them, and these have to be persisted on shutdown etc. as well.

Igor.novakovic.empolis.com: The Buffer component (which would be part of the Connectivity module) would have its own queue. In this queue only the record ID and the timestamp should be stored. The document's metadata and content would then be fetched from an agent when the buffer decides to send an operation to the router.
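The lightweight queue proposed in the comment above might look roughly like this. It is a sketch under stated assumptions, not an existing Connectivity API: `IdQueueBuffer`, `notify` and `flush` are hypothetical names, `fetch` stands in for asking an agent for the current document, `route` stands in for handing the operation to the router, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class IdQueueBuffer:
    """Queues only (record ID, timestamp); content is fetched on release."""

    def __init__(self, delay, fetch, route):
        self.delay = delay    # the period X
        self.fetch = fetch    # hypothetical: record_id -> current document
        self.route = route    # hypothetical: (record_id, document) -> None
        self._queue = deque() # (record_id, timestamp), oldest first
        self._queued = set()  # IDs currently waiting, for consolidation

    def notify(self, record_id, timestamp):
        """Register a change; repeated changes to a queued ID are merged."""
        if record_id not in self._queued:
            self._queued.add(record_id)
            self._queue.append((record_id, timestamp))

    def flush(self, now):
        """Route every record that has waited at least `delay`."""
        sent = 0
        while self._queue and now - self._queue[0][1] >= self.delay:
            record_id, _ = self._queue.popleft()
            self._queued.discard(record_id)
            # The document is fetched only now, so the latest state is sent.
            self.route(record_id, self.fetch(record_id))
            sent += 1
        return sent
```

Because only IDs and timestamps are held, the state to persist on shutdown stays small, and consolidation comes for free: whatever the agent returns at release time already reflects all changes made while the record was waiting.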

PRO

simple to implement and has no effect on the API or other logic

Igor.novakovic.empolis.com: This solution scales because the execution order of operations on _one_ particular record _does not_ matter.