fsm: Delaying dispatch of events while we're already inside a dispatch callback.

historical

On Thu, Feb 21, 2019 at 03:54:10PM +0100, Sylvain Munaut wrote:
>         case RAN_CONN_E_ACCEPTED:
>                osmo_fsm_inst_state_chg(fi, RAN_CONN_S_ACCEPTED,
> RAN_CONN_TIMEOUT, 0);
>                 evaluate_acceptance_outcome(fi, true);

This should be perfectly fine, and the exact same approach has fixed other
sequence-of-events problems earlier: transition the state first, and then
take actions.

Re deferring events:

We can't globally change events to deferred, exactly because of the
reasons mentioned:
- passing along a data pointer to a local variable.
- returning error code if transition is not allowed (evaluated in rare
  cases).

Also, in general, it is safer to use the "more deterministic" way of
immediate event triggering. If events get stored in sequence, maybe some
other message that gets rx'd in-between (maybe it already was in the
select() queue before we added events to be triggered) gets handled in
some intermediate state in-between action and effect; that opens up
variations in code paths that would otherwise be guaranteed to always
happen in exactly identical sequence.

It could still be useful to have some defer() API.

I recently faced the need for a deferred action. My particular scenario
is: a child object tells its parent that it is gone from its FSM cleanup
implementation, and the parent deallocates; which also free()s the child.

In pseudo code:

  struct main_obj {
    struct child_obj *child;
  };

  struct child_obj {
    struct osmo_fsm_inst *fi;
    struct main_obj *parent;
  };

  child_obj_alloc(struct main_obj *parent)
  {
    /* our common "FSM instance owns object" scheme: */
    fi = osmo_fsm_inst_alloc(talloc_ctx = parent);
    child = talloc_zero(talloc_ctx = fi, struct child_obj);
    child->fi = fi;

    parent->child = child;
  }

  child_obj_fsm_cleanup(struct child_obj *child)
  {
    main_obj_remove_child(child)
  }

  main_obj_remove_child(main)
  {
    /* Without children, I have no purpose. Bye. */
    talloc_free(main);
  }

  --> heap use after free in osmo_fsm_inst_term(child->fi),
      after calling child->fi->cleanup().

The point is that the child is allocated as a talloc child of the main
object, and the main object is wired to deallocate itself when no children
are left. When I call osmo_fsm_inst_term() on the child, its cleanup
function tells the parent that it is gone, and the parent completely
deallocates itself, still within the cleanup code path.  But the
osmo_fsm_inst code then continues to talloc_free() the child->fi after it
called the cleanup function -- though the entire object was already freed
in the cleanup function.

(The actual case in osmo-msc:neels/ho: the msc subscriber struct having
MSC-{A,I,T} roles, and as soon as the MSC-A role is gone, or there is
neither an MSC-I nor MSC-T role, the entire subscriber must be released.)

The solution I picked in the end was to also add an FSM instance to the
main_obj, because if the child FSM is an osmo_fsm_inst_alloc_child(), it
will trigger an event to the parent *after* all other handling is done.
So the entire main_obj FSM instance only exists to receive the parent_term
event after the child FSM is deallocated, instead of being notified of a
released child already in the cleanup function.

I considered solving like this instead:

  child_obj_fsm_cleanup(struct child_obj *child)
  {
    defer( main_obj_remove_child, child->parent );
  }

where defer could be implemented as an osmo_timer with zero delay.

Triggering events could work in the same way, i.e. the defer
implementation does not have to be osmo_fsm specific.

But also, in this instance, it is quite possible that something else
happens in-between the child deallocation and the deferred event handling.
If some msg gets handled while the parent still points at the child, that
would pose a use-after-free risk. Of course we could set parent->child =
NULL in the cleanup callback and defer the deallocation ... but the point
being that deferring actions opens up non-determinisms that can be
non-trivial to handle well. Having a separate full FSM instance might be a
bit of overkill, but has the benefit of allocation/deallocation and event
logging.

~N
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.osmocom.org/pipermail/openbsc/attachments/20190221/96e17f28/attachment.bin>