Description

My 389-ds-base-1.3.2.8-1.fc20.x86_64 is crashing at random somewhere in the send_ldap_search_entry_ext() function. I have captured a coredump and stack traces, and I will attach them to this ticket.

You can analyze the coredump in the Brno lab, on vm-123.

I have a feeling that it is related to connection handling or connection termination, but I have not been able to find a solid reproducer. The crash occurs randomly, but I have seen it in this same function several times.

@richm: looking at backtrace.log, I see only one search:
rawbase = 0x7f6224000e40 "cn=dns-100x100, dc=ipa,dc=test"
fstr = 0x7f6224001420 "(|(objectClass=idnsConfigObject)(objectClass=idnsZone)(objectClass=idnsForwardZone)(objectClass=idnsRecord))"

which is a client search request, so I don't see the relation to syncrepl.

@pspacek: what makes you think it is related to connection management? Syncrepl in persistent mode keeps the connection and the operation, so maybe there is a side effect of freeing some parts of the conn.
Could you collect a few more core dumps from the next crashes?

Thread 42 is in sync_send_results() referencing connection conn = 0x7f624c313410. This is the same connection that is doing the search - Thread 32 - #7 0x00007f6260db024b in flush_ber (pb=pb@entry=0x7f622f7f5ae0, conn=conn@entry=0x7f624c313410

It looks like the code in sync_send_results() has mostly been copied from ps_send_results(), with the following exceptions:

ps_send_results has some extra pblock cleanup, starting at psearch.c:401

ps_send_results calls connection_remove_operation_ext(), which is required in the new no-malloc style op stack

there is this special code in connection_threadmain():

/* If this op isn't a persistent search, remove it */
if ( pb->pb_op->o_flags & OP_FLAG_PS ) {
    PR_Lock( conn->c_mutex );
    connection_release_nolock (conn); /* psearch acquires ref to conn - release this one now */
    PR_Unlock( conn->c_mutex );
    /* ps_add makes a shallow copy of the pb - so we
     * can't free it or init it here - just memset it to 0
     * ps_send_results will call connection_remove_operation_ext to free it
     */
    memset(pb, 0, sizeof(*pb));

If the intention is that sync_send_results() "owns" the operation pblock and is responsible for its lifecycle management, then somewhere in the syncrepl code OP_FLAG_PS needs to be set in pb->pb_op->o_flags, so that connection_threadmain knows not to touch this pblock.
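The ownership handoff described above can be modelled in isolation. The following is only a minimal sketch: every type and function name in it (demo_op, demo_pblock, DEMO_OP_FLAG_PS, demo_take_ownership, demo_worker_finish, demo_owner_finish) is a hypothetical stand-in, not part of the 389-ds code, and it assumes the long-lived handler keeps its own pointer to the operation.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the server's operation and pblock types. */
#define DEMO_OP_FLAG_PS 0x1u /* "a long-lived handler owns this operation" */

typedef struct demo_op {
    unsigned flags;
} demo_op;

typedef struct demo_pblock {
    demo_op *op;
} demo_pblock;

/* Called by the long-lived handler when it takes over the operation. */
static void demo_take_ownership(demo_pblock *pb)
{
    pb->op->flags |= DEMO_OP_FLAG_PS;
}

/* Models the end-of-operation step in the worker thread.
 * Returns 1 if the worker freed the operation, 0 if it left it to the owner. */
static int demo_worker_finish(demo_pblock *pb)
{
    if (pb->op->flags & DEMO_OP_FLAG_PS) {
        pb->op = NULL; /* drop our reference only; the owner frees it later */
        return 0;
    }
    free(pb->op);
    pb->op = NULL;
    return 1;
}

/* Called by the long-lived handler when it terminates. */
static void demo_owner_finish(demo_op *op)
{
    free(op);
}
```

Without the call to demo_take_ownership(), the worker frees the operation while the long-lived handler still holds a pointer to it, which is the kind of use-after-free this comment is concerned about.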

I understand the crash I'm seeing in the test scenario. The intention of the refresh and persist implementation is to

send all requested initial entries (all or depending on a cookie)

send all modified entries using a separate thread running sync_send_results

and sending modified entries should only start when the initial refresh is completed.
This works with cookies, but without a cookie sync_send_results starts too early and interleaves with sending the regular entries. The problem is that the persistent thread uses the same operation as the original search thread (no conflict if it starts after the refresh is complete), and both set and get SLAPI_SEARCH_RESULT_ENTRY and free it when done, so in some cases one thread tries to use an entry which was just freed by the other.
A fix would be to correctly delay the start of sync_send_results until the refresh is complete.
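One way to picture the proposed delay is a gate that the persist thread waits on until the search thread signals that the refresh is done. This is a sketch only and not the attached patch: it uses POSIX threads rather than the NSPR primitives the server uses, and all names (refresh_gate, gate_open, gate_wait, demo_run) are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical gate; not taken from the sync plugin. */
typedef struct refresh_gate {
    pthread_mutex_t lock;
    pthread_cond_t done_cv;
    bool refresh_done;
} refresh_gate;

static void gate_init(refresh_gate *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->done_cv, NULL);
    g->refresh_done = false;
}

/* Search thread: call once, after the last refresh entry has been sent. */
static void gate_open(refresh_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->refresh_done = true;
    pthread_cond_broadcast(&g->done_cv);
    pthread_mutex_unlock(&g->lock);
}

/* Persist thread: block until the refresh phase is complete. */
static void gate_wait(refresh_gate *g)
{
    pthread_mutex_lock(&g->lock);
    while (!g->refresh_done)
        pthread_cond_wait(&g->done_cv, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

typedef struct demo_state {
    refresh_gate gate;
    char log[8]; /* 'R' = refresh entry sent, 'P' = persist update sent */
    int len;
} demo_state;

static void *persist_thread(void *arg)
{
    demo_state *s = arg;
    gate_wait(&s->gate); /* must not touch shared state before this returns */
    s->log[s->len++] = 'P';
    return NULL;
}

/* Runs the demo and copies the NUL-terminated event order into out.
 * Returns 0 on success, -1 if the thread could not be created. */
static int demo_run(char out[8])
{
    demo_state s;
    pthread_t tid;

    gate_init(&s.gate);
    memset(s.log, 0, sizeof s.log);
    s.len = 0;
    if (pthread_create(&tid, NULL, persist_thread, &s) != 0)
        return -1;
    for (int i = 0; i < 3; i++)
        s.log[s.len++] = 'R'; /* refresh phase: three initial entries */
    gate_open(&s.gate);
    pthread_join(tid, NULL);
    memcpy(out, s.log, sizeof s.log);
    pthread_cond_destroy(&s.gate.done_cv);
    pthread_mutex_destroy(&s.gate.lock);
    return 0;
}
```

Because the persist thread touches the shared state only after gate_wait() returns, every 'R' event is recorded before the 'P' event, regardless of how the threads are scheduled.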

The other "crash" doesn't seem to be a crash. When running the test under gdb, gdb halts when the connection is terminated while writing in libc_send and logs: Program received signal SIGPIPE. After continue, the process carries on and the error is handled in flush_ber, which calls do_disconnect_server.
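The SIGPIPE behaviour can be reproduced outside the server. The sketch below assumes the process ignores SIGPIPE, which the observed behaviour suggests but this ticket does not state; write_to_closed_pipe is a hypothetical helper, and a pipe stands in for the client socket.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Writes one byte to a pipe whose read end has already been closed.
 * Returns the errno of the failed write, 0 if the write unexpectedly
 * succeeded, or -1 if the pipe could not be created. */
static int write_to_closed_pipe(void)
{
    int fds[2];
    char byte = 'x';
    int err = 0;

    signal(SIGPIPE, SIG_IGN); /* assumption: SIGPIPE is ignored */
    if (pipe(fds) != 0)
        return -1;
    close(fds[0]); /* the peer goes away */
    if (write(fds[1], &byte, 1) == -1)
        err = errno; /* the failure is reported to the caller instead */
    close(fds[1]);
    return err;
}
```

With the signal ignored, the write fails with EPIPE and the caller can disconnect cleanly. gdb still stops on the signal by default; `handle SIGPIPE nostop noprint pass` tells it not to.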

With the attached patch, the regular crashes in the test environment provided by pspacek are no longer reproducible. However, the client machine ran out of memory after some time, so a long-duration test was not yet possible.

I have a question not related to that fix. sync_persist_terminate may call sync_remove_request to remove a request from sync_request_list.
In sync_remove_request I have not seen any free of the request, only its removal from the list. How is the request retrieved so that it can be freed?