Hi,
Currently, if you call osmo_fsm_dispatch from a function that happens to have been triggered by another call to dispatch, things can get a bit messy (depending on where that callback sits in the chain of events, it may or may not change the state of the FSM).
So what I would propose is that any event dispatched to an FSM is processed only after the processing of any previous event has fully completed (in the order the events were generated). I have a prototype implementation for review on gerrit:
https://gerrit.osmocom.org/c/libosmocore/+/12983
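To illustrate the idea, here is a rough sketch of the queueing semantics (this is not the code on gerrit; fi->processing, fi->event_queue, struct queued_event and fsm_dispatch_one() are made-up names for illustration only):

	struct queued_event {
		struct llist_head entry;
		uint32_t event;
		void *data;
	};

	int osmo_fsm_inst_dispatch(struct osmo_fsm_inst *fi, uint32_t event, void *data)
	{
		struct queued_event *qe;

		if (fi->processing) {
			/* Re-entrant dispatch: queue the event until the
			 * current one has been fully processed. */
			qe = talloc_zero(fi, struct queued_event);
			qe->event = event;
			qe->data = data;
			llist_add_tail(&qe->entry, &fi->event_queue);
			return 0; /* a "not permitted" error can no longer be reported */
		}

		fi->processing = true;
		fsm_dispatch_one(fi, event, data); /* the pre-existing dispatch logic */
		/* Drain whatever got queued while we were busy, in order. */
		while (!llist_empty(&fi->event_queue)) {
			qe = llist_first_entry(&fi->event_queue, struct queued_event, entry);
			llist_del(&qe->entry);
			fsm_dispatch_one(fi, qe->event, qe->data);
			talloc_free(qe);
		}
		fi->processing = false;
		return 0;
	}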
Now, this does break some tests, but looking at the breakage, it looks like simple message re-ordering at first glance. I guess running the whole range of tests against this would be useful.
Another consequence is that it is now impossible to return an error code if the event can't be handled in the current state (previously, dispatch would return -1). But since the actual dispatch can be deferred ... well, there's not much we can do. Looking at the codebase, that error code is ignored most of the time. When it is returned, it seems to always end up ignored at the end of the chain (realistically there isn't much you can do about it anyway; it's a programming bug if it happens).
Thoughts / Comments ? Useful ? Crazy idea ? ...
Cheers,
Sylvain
Hi Sylvain,
On Wed, Feb 20, 2019 at 08:50:01PM +0100, Sylvain Munaut wrote:
Currently, if you call osmo_fsm_dispatch from a function that happens to have been triggered by another call to dispatch, things can get a bit messy (depending on where that callback sits in the chain of events, it may or may not change the state of the FSM).
Yes, it's not particularly elegant in all cases, but it fits our synchronous architecture, where we are used to processing one packet (that may become an event) in one go.
So what I would propose is that any event dispatched to an FSM will be processed only after the processing of any previous event is fully completed (in the order they were generated). I have a prototype implementation for review on gerrit :
This will break all use cases where the "data" pointer of an event is something allocated on the stack by the caller, or where it is a msgb or other object that simply may no longer exist at the time the deferred event is being processed.
Or am I missing something?
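To spell out the failure mode with a made-up example (the function, struct and event names here are hypothetical):

	/* Fine with immediate dispatch; broken with deferred dispatch,
	 * because 'res' lives on the caller's stack: */
	void on_assignment_complete(struct osmo_fsm_inst *fi)
	{
		struct assignment_result res = { .success = true };
		osmo_fsm_inst_dispatch(fi, EV_ASSIGNMENT_DONE, &res);
		/* If the dispatch were deferred, the handler would later
		 * read a dangling pointer, after this function returned. */
	}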
So while I trust there are use cases where your proposed behavior is useful, we must not change the existing semantics. We could introduce a new "dispatch_event_deferred()" or something like that which exhibits the new properties, while the existing API semantics remain unmodified. Would that help you?
Regards, Harald
Hi,
So yeah, it definitely breaks stuff, both because those pointers become invalid and because some code seems to rely on the ordering. I pushed another version of the patch which just introduces a new function with the "queue" semantics, but that might not be the right approach to solve the initial problem.
So the initial issue is that once a RAN connection is first established, it goes to the RAN_CONN_S_ACCEPTED state and starts a timeout. Then, when some messages are actually exchanged, "something" eventually calls ran_conn_communicating(), which transitions the FSM to the RAN_CONN_S_COMMUNICATING state (by sending a RAN_CONN_E_COMMUNICATING event) and removes the timeout.
Now, for a silent call, there might not be any message exchanged (since we really have nothing to say), so the options are:
* The brute-force approach is to hard-code a bypass of the RAN_CONN_S_ACCEPTED state, going directly to RAN_CONN_S_COMMUNICATING:
	case RAN_CONN_E_ACCEPTED:
		evaluate_acceptance_outcome(fi, true);
-		osmo_fsm_inst_state_chg(fi, RAN_CONN_S_ACCEPTED, RAN_CONN_TIMEOUT, 0);
+		if (conn->silent_call)
+			osmo_fsm_inst_state_chg(fi, RAN_CONN_S_COMMUNICATING, 0, 0);
+		else
+			osmo_fsm_inst_state_chg(fi, RAN_CONN_S_ACCEPTED, RAN_CONN_TIMEOUT, 0);
Personally I'm not a big fan, since you 'distribute' knowledge of the silent call into other parts of the code.
* Make it transition from ACCEPTED to COMMUNICATING from silent_call.c. To achieve this, what I first tried was to simply call ran_conn_communicating() from the paging callback. Unfortunately that doesn't work, because if you look at the current code:
	case RAN_CONN_E_ACCEPTED:
		evaluate_acceptance_outcome(fi, true);
		osmo_fsm_inst_state_chg(fi, RAN_CONN_S_ACCEPTED, RAN_CONN_TIMEOUT, 0);
the paging callback is called as part of evaluate_acceptance_outcome(), so it runs before the state is changed to ACCEPTED. Deferring the dispatch of the event by changing the FSM behavior was an attempt to fix this (so the new RAN_CONN_E_COMMUNICATING event I generate would only be processed later).
But another option that also works is simply to swap the two calls and change the FSM state first:
	case RAN_CONN_E_ACCEPTED:
		osmo_fsm_inst_state_chg(fi, RAN_CONN_S_ACCEPTED, RAN_CONN_TIMEOUT, 0);
		evaluate_acceptance_outcome(fi, true);
I checked and this also works fine.
Cheers,
Sylvain
On Thu, Feb 21, 2019 at 03:54:10PM +0100, Sylvain Munaut wrote:
	case RAN_CONN_E_ACCEPTED:
		osmo_fsm_inst_state_chg(fi, RAN_CONN_S_ACCEPTED, RAN_CONN_TIMEOUT, 0);
		evaluate_acceptance_outcome(fi, true);
This should be perfectly fine, and the exact same approach has fixed other sequence-of-events problems earlier: transition the state first, and then take actions.
Re deferring events:
We can't globally change events to deferred, exactly because of the reasons mentioned:
- passing along a data pointer to a local variable.
- returning an error code if the transition is not allowed (evaluated in rare cases).
Also, in general, it is safer to use the "more deterministic" way of immediate event triggering. If events get queued, some other message that gets rx'd in between (maybe it was already in the select() queue before we added the events to be triggered) may get handled in some intermediate state, between action and effect; that opens up variations in code paths that would otherwise be guaranteed to always happen in exactly the same sequence.
It could still be useful to have some defer() API.
I recently faced the need for a deferred action. My particular scenario is: from its FSM cleanup implementation, a child object tells its parent that it is gone, and the parent deallocates itself, which also free()s the child.
In pseudo code:
	struct main_obj {
		struct child_obj *child;
	};

	struct child_obj {
		struct osmo_fsm_inst *fi;
		struct main_obj *parent;
	};

	child_obj_alloc(struct main_obj *parent)
	{
		/* our common "FSM instance owns object" scheme: */
		fi = osmo_fsm_inst_alloc(talloc_ctx = parent);
		child = talloc_zero(talloc_ctx = fi, struct child_obj);
		child->fi = fi;
		parent->child = child;
	}

	child_obj_fsm_cleanup(struct child_obj *child)
	{
		main_obj_remove_child(child->parent);
	}

	main_obj_remove_child(struct main_obj *main)
	{
		/* Without children, I have no purpose. Bye. */
		talloc_free(main);
	}
--> heap use after free in osmo_fsm_inst_term(child->fi), after calling child->fi->cleanup().
The point is that the child is allocated as a talloc child of the main object, and the main object is wired to deallocate itself when no children are left. When I call osmo_fsm_inst_term() on the child, its cleanup function tells the parent that it is gone, and the parent completely deallocates itself, still within the cleanup code path. But the osmo_fsm_inst code then continues to talloc_free() the child->fi after it called the cleanup function -- though the entire object was already freed in the cleanup function.
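Paraphrasing the relevant part of libosmocore's osmo_fsm_inst_term() (simplified here, not the verbatim source):

	void osmo_fsm_inst_term(struct osmo_fsm_inst *fi, ...)
	{
		/* ... */
		if (fi->fsm->cleanup)
			fi->fsm->cleanup(fi, cause);
		/* In the scenario above, cleanup() ends up talloc_free()ing
		 * main_obj; since fi is a talloc descendant of main_obj,
		 * fi itself is freed here as well. */
		/* ... */
		osmo_fsm_inst_free(fi); /* <-- heap use after free */
	}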
(The actual case in osmo-msc:neels/ho: the msc subscriber struct having MSC-{A,I,T} roles, and as soon as the MSC-A role is gone, or there is neither an MSC-I nor MSC-T role, the entire subscriber must be released.)
The solution I picked in the end was to also add an FSM instance to the main_obj, because if the child FSM is an osmo_fsm_inst_alloc_child(), it will trigger an event to the parent *after* all other handling is done. So the entire main_obj FSM instance only exists to receive the parent_term event after the child FSM is deallocated, instead of being notified of a released child already in the cleanup function.
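In (hypothetical) code, using the real osmo_fsm_inst_alloc_child() API but FSM and event names of my choosing:

	/* main_obj now owns an FSM instance of its own: */
	main_obj->fi = osmo_fsm_inst_alloc(&main_obj_fsm, talloc_ctx, main_obj,
					   LOGL_DEBUG, NULL);
	/* The child FSM is allocated as a child of it, so that
	 * MAIN_OBJ_EV_CHILD_TERM is dispatched to main_obj->fi only
	 * after the child's fi has been fully terminated and freed: */
	child->fi = osmo_fsm_inst_alloc_child(&child_obj_fsm, main_obj->fi,
					      MAIN_OBJ_EV_CHILD_TERM);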
I considered solving like this instead:
	child_obj_fsm_cleanup(struct child_obj *child)
	{
		defer(main_obj_remove_child, child->parent);
	}
where defer could be implemented as an osmo_timer with zero delay.
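A minimal sketch of that (the defer() helper and struct deferred_call are made up here; osmo_timer_setup() and osmo_timer_schedule() are the existing libosmocore API):

	#include <osmocom/core/timer.h>
	#include <osmocom/core/talloc.h>

	struct deferred_call {
		struct osmo_timer_list timer;
		void (*func)(void *data);
		void *data;
	};

	static void deferred_call_cb(void *d)
	{
		struct deferred_call *dc = d;
		dc->func(dc->data);
		talloc_free(dc);
	}

	void defer(void (*func)(void *data), void *data)
	{
		struct deferred_call *dc = talloc_zero(NULL, struct deferred_call);
		dc->func = func;
		dc->data = data;
		osmo_timer_setup(&dc->timer, deferred_call_cb, dc);
		/* zero delay: fires on the next select() iteration */
		osmo_timer_schedule(&dc->timer, 0, 0);
	}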
Triggering events could work in the same way, i.e. the defer implementation does not have to be osmo_fsm specific.
But also, in this instance, it is quite possible that something else happens in between the child deallocation and the deferred event handling. If some msg gets handled while the parent still points at the child, that would pose a use-after-free risk. Of course we could set parent->child = NULL in the cleanup callback and defer only the deallocation ... but the point is that deferring actions opens up non-determinisms that can be non-trivial to handle well. Having a separate full FSM instance might be a bit of overkill, but it has the benefit of well-defined allocation/deallocation and event logging.
~N