History

I have read through the code of the FSM implementation. There is still a lot of stuff that needs to be completed, especially the handover stuff. I am not entirely sure if this FSM is breaking existing handover features. That would be good to know for sure first.

Also the there are a lot of signals sent from random code locations. Thats probably intentional, but maybe it would be better to call a function inside bsc_subscr_conn_fsm.c that then sends the signal like we did it with the MGCP FSMs. This would allow us to run some checks and so some assertions.

The FSM had a small bug that prevented it from working probably when making calls. I fixed that and it seems to run really well!

I have tested the internal handover with the FSM. It works fine, so the the FSM did not break anything. Also the FSM now cleans up/frees correctly. The synchronization between the MGCP FSM is still problematic. We can not run them separately, they need to be synchronized because the port/ip information for the RTP streams becomes known at different times.

While integrating the FSM I ran into some trouble, nothing that couldn't be fixed, but after discussion with Harald we decided to go a bit further and clean up the FSM that controls the MGCP side. The vision is to have a generalized FSM that can be used in osmo-bsc, osmo-msc and in any other project that somehow needs to interact with an MGW. The FSM works as a child FSM of some parent (in the bsc case this is the GSCON FSM).

For now the FSM has two states ST_READY and ST_WAIT. In ST_READY we wait for an action to do e.g. EV_MDCX. Then we call the mgcp_client, the mgcp_client calls the callback and the callback sends back an EV_MDCX_RESP. The code in ST_WAIT gets executed and a signal is sent back to the parant FSM. This works the dams for DLCX and CRCX. CRCX is an exception. When the function to execute the CRCX is calld, then the FSM is created and the EV_CRCX automatically issued. This is the only way to perform an CRCX, so once the FSM came up successfully we can be sure that the connection exists.

In case of error the FSM shall unlink from its parent, inform the parent via the term event and then safely destroy itsself. This allows us to tear down the SCCP connection quickly while the MGCP cleanups are still in progress. Its also much simpler for the parent FSM since the parent does not have to synchronize.

The API is just a set of three function, one for CRCX (sets up the FSM and does the actual CRCX), one for MDCX and one for DLCX (also takes care of cleaning up everything)

The signals that are sent back to the parent fsm will always contain a pointer to the private data of the FSM, from there the parent can get relevant information (ports, IP-Address, Endpoint identifier etc.)

I am directly implementing the FSM in osmo-mgw, so it won't be in osmo-bsc, osmo-bsc will just use it and we may also decide to use it in osmo-msc as well once we are confident that it works.

I would make sure to introduce more states. In the initial state, until the CRCX is acknowledged,no MDCX is permitted. MDCX is only permitted after a successful CRCX response has been received,so you need to differentiate thie state from the other?

The API is just a set of three function, one for CRCX (sets up the FSM and does the actual CRCX), one for MDCX and one for DLCX (also takes care of cleaning up everything)

I think the API should be more future-proof and flexible. Let's make sure that at leaset the addr/portparameters are passed in via some struct pointer. This way we can extend it later on without breakingthe ABI.

The client will soon / should for example provide the SDP related bits such as which codec shall be used for that connection, ... - and once we start doing this, we don't want to break the API again and again.

It might make sense to have a struct for the local (mgw) side of the connection and another (or a copy of the saeme struct?) for the remote (bts/core network) side. If only the "local" is passed in while the "remote" struct is NULL, then the CRCX only does the "bind" operation. If both local+remote are specified, it performs bind+connect. This should map to the concepts of MGCP (local / remote connection options).

I am directly implementing the FSM in osmo-mgw, so it won't be in osmo-bsc, osmo-bsc will just use it and we may also decide to use it in osmo-msc as well once we are confident that it works.

I see, I have now moved more of the logic into separate states. In the end I think it also makes the code easier to understand. We have now:

ST_CRCX,
ST_CRCX_RESP,
ST_READY,
ST_MDCX_RESP,
ST_DLCX_RESP,

ST_CRCX is entered immediately on startup. It sends the CRCX message to the MGW and enters ST_CRCX_RESP. When the response is received the FSM fires an EV_CRCX_RESP to tell the parent that the CRCX was successful. Then it waits in ST_READY for further operations. For the other operations, the mechanism is nearly the same. When e.g. an EV_MDCX is received, the a message is generated and sent to the MGW. We enter ST_MDCX_RESP and wait for the MGW response. When the response is received, we fire the matching event and return back to ST_READY.

Let's make sure that at leaset the addr/port

The parameters that relate on the connection are now encapsulated in struct mgcp_conn_info (see below). The same struct is used to return results with in the event pointer. This allows us to have the struct with the context information entirely private. I have removed it from the header file now, so that it is inaccessable from outside.

...If both local+remote are specified...

On the CRCX we may or may not pass in an struct mgcp_conn_info *conn_info. If it is not passed, then the we know that the user only wants to bind. If it is there it we know that the user wants to do bind+connect. The result is returned with the event pointer. (I do not really get what you mean. Why would we want to pass remote and local to the create function? To tell it a memory location where it can write to? I have reserved fixed memory in the contect struct now. I think this is more convenient, since we can return the pointer with the signal and the callee does not have to take care for the memory.)

So far I have now the CRCX and the error handling (timeout, cleanup...) working. For debugging I created myself a small test FSM that acts as parent. There is also the question what should happen if someone tries to execute an operation while the FSM is budy. Under normal conditions this should never happen, since Child and parent are synchroinzed through events. But for DLCX we need something. I have solved this with a flag. If someone tries to trigger a DLCX while the FSM is busy, it just sets the flag and does nothing. When the FSM is done with the pending operation it checks the flag. If it finds the flag set. It continues with the DLCX operation.

However, I noticed that I still get a segfault when the parent FSM frees. Otherwise I think it should be fine by now and we are ready to try it out in real life.

I am a bit unhappy with struct mgcp_conn_info We have the two members endpoint and call_id there. These are misplaced when doing an MDCX. On CRCX they are fine. We need to specify the call_id and an endpoint (with wildcard). On MDCX we do not specify no endpoint and no call_id, but the members exist. So if someone specifies the endpoint it is at least checked. The call_id is an unsigned int. On MDCX it is ignored since I can not detect if it is correctly specified or not. This is kind of a cosmetic issue. The Alternative would be more different struct types which would also be not nice either.

I am now using the client FSM with bsc_subscr_conn_fsm. Voice calls work without crashing, also the endpoints seem to be released without problems. However, the whole thing is not ready yet. I still need some error handling and return code checking.

I have merged the changes I made with the latest state of laforge/fsm and pushed the changes to pmaier/fsm. This was a bit bumpy since laforge/fsm changed here and there while I worked out my changes, but now its in sync again. I hope I did not break anything again.

I am currently working on the problem with the rogue callers of osmo_bsc_sigtran_send(). We must ensure that osmo_bsc_sigtran_send() is called from nowhere except from bsc_subscr_conn_fsm.c and even there it may only be called when an sccp connection is open.

I have identified the following locations:

./src/osmo-bsc/osmo_bsc_api.c:52: osmo_bsc_sigtran_send(conn, resp);
./src/osmo-bsc/osmo_bsc_bssap.c:679: osmo_bsc_sigtran_send(conn, resp);
calles from bssmap_handle_cipher_mode() on bad cipher mode.
./src/osmo-bsc/osmo_bsc_bssap.c:852: osmo_bsc_sigtran_send(conn, resp); called from bssmap_handle_assignm_req() on bad assignment => ok!
./src/osmo-bsc/osmo_bsc_api.c:478: osmo_bsc_sigtran_send(conn, resp); CLEAR REQUEST
This is also problematic, if the bsc_api can issue uncontrolled clear reqests
at any time. Should be fixed now, since now it uses a signal to the FSM.
./src/osmo-bsc/osmo_bsc_api.c:158: queue_msg_or_return(resp); SAPI n REJECT
./src/osmo-bsc/osmo_bsc_api.c:169: queue_msg_or_return(resp); CIPHER MODE COMPLETE
./src/osmo-bsc/osmo_bsc_api.c:462: queue_msg_or_return(resp); ASSIGNMENT FAIL
This is a bug, the assignment failure message shoud not be generated here, instead
we should dispatch GSCON_EV_RR_ASS_FAIL to the FSM so that the FSM can take care
of this properly.
./src/osmo-bsc/osmo_bsc_api.c:491: queue_msg_or_return(resp); CLASSMARK UPDATE

I need some advise for the handling of the MGW. Since we now have the FSM based interface for the MGW its a lot easier to handle the MGW connections. But even now we end up with three extra states. ST_WAIT_CRCX_BTS, ST_WAIT_MDCX_BTS, and ST_WAIT_CRCX_MSC. Between ST_WAIT_CRCX_BTS and ST_WAIT_MDCX_BTS also sits ST_WAIT_ASS_CMP. Over the whole MGW handling and assignment period we must pass DTAP traffic and be also ready to handle other SCCP traffic. I think it would look a lot nicer if the MGCP handling stuff would have its own FSM, however, then we gain even more complexity and also this separate FSM then would be responsible for a lot of GSCON related stuff. I am not sure if this is so helpful. But however. Once we add Handover again, GSCON will gain another two states for the MDCX procedure (or maybe not if we do a smart reordering of the states). I think it would be good if we could discuss this issue tomorrow. There are also some open questions regarding the FSM.

The assignment phase will be split up in lchan allocation and mode modify phase. At the moment we just call gsm0808_assign_req() which does not offer the flexibility we need here.

Last week I ran a few TTCN3 tests against the current state and found a few crashes. Those are fixed now. It looks pretty stable now. Also its now impossible to send connection oriented sigtran messages when the FSM is in an "unconnected" state.

Current status: I am currently working on distinguishing the assignments with voice channels from the assignments which only request as signaling channel (Since we do not support CSD, we should reject any requests for data channels).

Note: the current state (rebased to master) can be found on pmaier/fsm3. However, the rebase was not all to smoothly, I had a lot of merge conflicts. I did a quick manual test and it seems to work fine, also the VTY triggered handover. But I am not sure about the recently handover related stuff. We need to check on that.

BSC_Tests.TC_ho_int seems to be indeed a regression. All other tests look fine when compared against the current jenkins result.

I see TC_ho_int ending "inconclusive" and see the same on jenkins. What do you see when using the branch?

My biggest hindrance for testing the branch is that the osmo-bsc.git test suite doesn't build and hence 'make install' doesn't work. I'd need to work around that, by omitting the tests, but I'd prefer if we got those working instead. Obviously some code needs to move around to get the tests to build, still, and no point in testing thoroughly if we know it needs refactoring anyway.

Not too much progress during the last week. The internal handover code had a bug which got caught thanks to the TC_ho_int test. The current status is on pmaier/fsm4.

@neels: I can get TC_ho_int to pass, but it only works sometimes. I have the feeling that there are race conditions in the TTCN3 code. The behavior I observe is not specific to pmaier/fsm4, the current master exhibits the same behavior.

@neels: I think we are almost there. When you can manage to get the handover unit-tests to pass again we can squash all patches into one and submit to review.

The GSCON related commits are now squased + tested using TTCN3 and a manual test. The current state can be found on pmaier/fsm5. The patch is up for review in gerrit now: https://gerrit.osmocom.org/7142