RFC: talloc contexts / automatic free at select loop

This is merely a historical archive of years 2008-2021, before the migration to mailman3.

A maintained and still updated list archive can be found at https://lists.osmocom.org/hyperkitty/list/OpenBSC@lists.osmocom.org/.

Neels Hofmeyr nhofmeyr at sysmocom.de
Sat Mar 23 23:43:21 UTC 2019


On Thu, Mar 21, 2019 at 09:30:28AM +0100, Harald Welte wrote:
> Hi Neels,
> 
> On Thu, Mar 21, 2019 at 02:24:30AM +0100, Neels Hofmeyr wrote:
> > Talking of features, another idea comes to mind: freeing of FSM instances. I
> > more often than not have the problem that a child FSM is freeing, telling the
> > parent about it, then the parent decides to free because the child's result
> > signalled some end, and then we have a double free because the child was
> > allocated with the parent as talloc context, the parent has just deallocated,
> > but after the child's cleanup is through, the fsm code wants to also deallocate
> > the child.
> 
> That just sounds like a bug and it should be fixed?  The question is whether
> we want to mandate all child contexts to be allocated from within a parent.  If
> so, then whatever mechanism to prevent this could / should become part
> of the fsm core.
> 
> > In these situations I so far need to introduce checks that prevent
> > the parent from freeing until the children are completely finished with their
> > actions -- for example, avoid signalling end from the cleanup function or event
> > handling yet, instead completely rely on the parent_term_event, which is
> > dispatched only after the child was freed. 
> 
> I wonder if those checks could be made part of the core.  Do you have a
> concrete code example to look at?

Just after writing the above, I have introduced another deallocation failure
mode in the cleanup among RTP-stream/call leg/MGW endpoint/MSC subscriber FSMs,
and am fed up with solving these ad hoc as they show up.

So now I am writing an explicit FSM deallocation test, to figure out how to
properly solve various cleanup scenarios -- so that I can hopefully apply one
and the same mechanism to all FSM implementations.


(This might get a bit complex to follow, a proposed solution is in the bottom.)


Here is the test: branch neels/fsm_dealloc_test of libosmocore
http://git.osmocom.org/libosmocore/commit/?h=neels/fsm_dealloc_test&id=31ed1036cee1ed4e0ccf12e83e2bc0918af8db47

The "scene" struct shows a number of objects, but have reduced to only three
being in use.

  branch0 <-parent/child-> twig0a
        \                   /
         ----- other -------

So both a parent and child FSM reference some "other" object which they depend
upon.

(A real world equivalent is that both a call_leg and rtp_stream FSM depend on
the MGW endpoint -- in which situation also all of those are children of an MSC
subscriber FSM, just saying complexity is easily added in real life)

The underlying problem here is:

 - when the child has an error and goes away, the parent should go away as
   well. So there is an event from the child to the parent saying "I am gone,
   you too go now": EV_CHILD_GONE.

 - if now I tell branch0 to deallocate regularly, it first tells the child to
   deallocate. The child then also says EV_CHILD_GONE.

 - The parent receives EV_CHILD_GONE and again triggers its own deallocation.
   Now there are two osmo_fsm_inst_term() called on the same FSM instance.

A way to fix this is to introduce a ST_DESTROYING state. It acts as a flag to
prevent re-triggering an osmo_fsm_inst_term(). So if as soon as once in
ST_DESTROYING, it will no longer issue osmo_fsm_inst_term() on itself.

My test models this structure.

In the test code try in turns to deallocate with a EV_DESTROY event, and then a
direct osmo_fsm_inst_term(). Both of them should ideally be safe to do,
because on weird errors anything should be able to just term an FSM instance.

Dispatching an EV_DESTROY actually works out fine.
But calling osmo_fsm_inst_term() directly fails.

The difference is that in the EV_DESTROY case, the branch0 FSM has changed to
state ST_DESTROYING, and if it receives events that peer objects are gone, it
ignores them. In the term case, the FSM instance stays in the "ALIVE" state, in
which case a term triggered at the child twig0a in turn dispatches an
EV_CHILD_GONE, which the parent branch0 acts upon by transitioning to the
ST_DESTROYING state, which launches a *second* osmo_fsm_inst_term() on itself.

The output looks like this -- first the successful run, then the failing one:


▶ ./fsm_dealloc_test 
DLGLOBAL DEBUG scene_alloc()
DLGLOBAL DEBUG test(branch0){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: is child of test(branch0)
DLGLOBAL DEBUG test(other){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: branch0.other[0] = other
DLGLOBAL DEBUG test(other){alive}: other.other[0] = branch0
DLGLOBAL DEBUG test(twig0a){alive}: twig0a.other[0] = other
DLGLOBAL DEBUG test(other){alive}: other.other[1] = twig0a
DLGLOBAL DEBUG scene_alloc()
DLGLOBAL DEBUG test(branch0){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: is child of test(branch0)
DLGLOBAL DEBUG test(other){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: branch0.other[0] = other
DLGLOBAL DEBUG test(other){alive}: other.other[0] = branch0
DLGLOBAL DEBUG test(twig0a){alive}: twig0a.other[0] = other
DLGLOBAL DEBUG test(other){alive}: other.other[1] = twig0a
DLGLOBAL DEBUG scene_alloc()
DLGLOBAL DEBUG test(branch0){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: is child of test(branch0)
DLGLOBAL DEBUG test(other){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: branch0.other[0] = other
DLGLOBAL DEBUG test(other){alive}: other.other[0] = branch0
DLGLOBAL DEBUG test(twig0a){alive}: twig0a.other[0] = other
DLGLOBAL DEBUG test(other){alive}: other.other[1] = twig0a
DLGLOBAL DEBUG ---------------------- test_destroy(branch0)
DLGLOBAL DEBUG --- before destroy cascade, got:
branch0
twig0a
other
DLGLOBAL DEBUG ---
DLGLOBAL DEBUG test(branch0){alive}: Received Event EV_DESTROY
DLGLOBAL DEBUG 1 (branch0.alive())
DLGLOBAL DEBUG test(branch0){alive}: alive(EV_DESTROY)
DLGLOBAL DEBUG test(branch0){alive}: state_chg to destroying
DLGLOBAL DEBUG 2 (branch0.alive(),branch0.destroying_onenter())
DLGLOBAL DEBUG test(branch0){destroying}: destroying_onenter() from alive
DLGLOBAL DEBUG test(branch0){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR)
DLGLOBAL DEBUG test(branch0){destroying}: pre_term()
DLGLOBAL DEBUG test(twig0a){alive}: Terminating (cause = OSMO_FSM_TERM_PARENT)
DLGLOBAL DEBUG test(twig0a){alive}: pre_term()
DLGLOBAL DEBUG test(twig0a){alive}: Removing from parent test(branch0)
DLGLOBAL DEBUG 3 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup())
DLGLOBAL DEBUG test(twig0a){alive}: cleanup()
DLGLOBAL DEBUG test(twig0a){alive}: scene forgets twig0a
DLGLOBAL DEBUG test(twig0a){alive}: removing reference twig0a.other[0] -> other
DLGLOBAL DEBUG test(other){alive}: Received Event EV_OTHER_GONE
DLGLOBAL DEBUG 4 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive())
DLGLOBAL DEBUG test(other){alive}: alive(EV_OTHER_GONE)
DLGLOBAL DEBUG 5 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.other_gone())
DLGLOBAL DEBUG test(other){alive}: Dropped reference other.other[1] = twig0a
DLGLOBAL DEBUG 4 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive())
DLGLOBAL DEBUG test(other){alive}: state_chg to destroying
DLGLOBAL DEBUG 5 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter())
DLGLOBAL DEBUG test(other){destroying}: destroying_onenter() from alive
DLGLOBAL DEBUG test(other){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR)
DLGLOBAL DEBUG test(other){destroying}: pre_term()
DLGLOBAL DEBUG 6 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup())
DLGLOBAL DEBUG test(other){destroying}: cleanup()
DLGLOBAL DEBUG test(other){destroying}: scene forgets other
DLGLOBAL DEBUG test(other){destroying}: removing reference other.other[0] -> branch0
DLGLOBAL DEBUG test(branch0){destroying}: Received Event EV_OTHER_GONE
DLGLOBAL DEBUG 7 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branc
DLGLOBAL DEBUG test(branch0){destroying}: destroying(EV_OTHER_GONE)
DLGLOBAL DEBUG 8 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branc
DLGLOBAL DEBUG test(branch0){destroying}: Dropped reference branch0.other[0] = other
DLGLOBAL DEBUG 7 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branc
DLGLOBAL DEBUG 6 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup())
DLGLOBAL DEBUG test(other){destroying}: cleanup() done
DLGLOBAL DEBUG 5 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter())
DLGLOBAL DEBUG test(other){destroying}: Freeing instance
DLGLOBAL DEBUG test(other){destroying}: Deallocated
DLGLOBAL DEBUG 4 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive())
DLGLOBAL DEBUG 3 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup())
DLGLOBAL DEBUG test(twig0a){alive}: cleanup() done
DLGLOBAL DEBUG 2 (branch0.alive(),branch0.destroying_onenter())
DLGLOBAL DEBUG test(twig0a){alive}: Freeing instance
DLGLOBAL DEBUG test(twig0a){alive}: Deallocated
DLGLOBAL DEBUG 3 (branch0.alive(),branch0.destroying_onenter(),branch0.cleanup())
DLGLOBAL DEBUG test(branch0){destroying}: cleanup()
DLGLOBAL DEBUG test(branch0){destroying}: scene forgets branch0
DLGLOBAL DEBUG test(branch0){destroying}: cleanup() done
DLGLOBAL DEBUG 2 (branch0.alive(),branch0.destroying_onenter())
DLGLOBAL DEBUG test(branch0){destroying}: Freeing instance
DLGLOBAL DEBUG test(branch0){destroying}: Deallocated
DLGLOBAL DEBUG 1 (branch0.alive())
DLGLOBAL DEBUG 0 (-)
DLGLOBAL DEBUG --- after destroy cascade, still got:                 <------ SUCCESS!
DLGLOBAL DEBUG --- cleaning up
DLGLOBAL DEBUG scene_alloc()
DLGLOBAL DEBUG test(branch0){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: is child of test(branch0)
DLGLOBAL DEBUG test(other){alive}: Allocated
DLGLOBAL DEBUG test(branch0){alive}: branch0.other[0] = other
DLGLOBAL DEBUG test(other){alive}: other.other[0] = branch0
DLGLOBAL DEBUG test(twig0a){alive}: twig0a.other[0] = other
DLGLOBAL DEBUG test(other){alive}: other.other[1] = twig0a
DLGLOBAL DEBUG ---------------------- test_term(branch0)
DLGLOBAL DEBUG --- before term cascade, got:
branch0
twig0a
other
DLGLOBAL DEBUG ---
DLGLOBAL DEBUG test(branch0){alive}: Terminating (cause = OSMO_FSM_TERM_REGULAR)         <------------ branch0 terminating
DLGLOBAL DEBUG test(branch0){alive}: pre_term()
DLGLOBAL DEBUG test(twig0a){alive}: Terminating (cause = OSMO_FSM_TERM_PARENT)
DLGLOBAL DEBUG test(twig0a){alive}: pre_term()
DLGLOBAL DEBUG test(twig0a){alive}: Removing from parent test(branch0)
DLGLOBAL DEBUG 1 (twig0a.cleanup())
DLGLOBAL DEBUG test(twig0a){alive}: cleanup()
DLGLOBAL DEBUG test(twig0a){alive}: scene forgets twig0a
DLGLOBAL DEBUG test(twig0a){alive}: removing reference twig0a.other[0] -> other
DLGLOBAL DEBUG test(other){alive}: Received Event EV_OTHER_GONE
DLGLOBAL DEBUG 2 (twig0a.cleanup(),other.alive())
DLGLOBAL DEBUG test(other){alive}: alive(EV_OTHER_GONE)
DLGLOBAL DEBUG 3 (twig0a.cleanup(),other.alive(),other.other_gone())
DLGLOBAL DEBUG test(other){alive}: Dropped reference other.other[1] = twig0a
DLGLOBAL DEBUG 2 (twig0a.cleanup(),other.alive())
DLGLOBAL DEBUG test(other){alive}: state_chg to destroying
DLGLOBAL DEBUG 3 (twig0a.cleanup(),other.alive(),other.destroying_onenter())
DLGLOBAL DEBUG test(other){destroying}: destroying_onenter() from alive
DLGLOBAL DEBUG test(other){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR)
DLGLOBAL DEBUG test(other){destroying}: pre_term()
DLGLOBAL DEBUG 4 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup())
DLGLOBAL DEBUG test(other){destroying}: cleanup()
DLGLOBAL DEBUG test(other){destroying}: scene forgets other
DLGLOBAL DEBUG test(other){destroying}: removing reference other.other[0] -> branch0
DLGLOBAL DEBUG test(branch0){alive}: Received Event EV_OTHER_GONE
DLGLOBAL DEBUG 5 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive())
DLGLOBAL DEBUG test(branch0){alive}: alive(EV_OTHER_GONE)
DLGLOBAL DEBUG 6 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive(),branch0.other_gone())
DLGLOBAL DEBUG test(branch0){alive}: Dropped reference branch0.other[0] = other
DLGLOBAL DEBUG 5 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive())
DLGLOBAL DEBUG test(branch0){alive}: state_chg to destroying
DLGLOBAL DEBUG 6 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive(),branch0.destroying_onenter())
DLGLOBAL DEBUG test(branch0){destroying}: destroying_onenter() from alive
DLGLOBAL DEBUG test(branch0){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR)              <-------- branch0 terminating AGAIN!?!?
DLGLOBAL DEBUG test(branch0){destroying}: pre_term()
DLGLOBAL DEBUG 7 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive(),branch0.destroying_onenter(),branc
DLGLOBAL DEBUG test(branch0){destroying}: cleanup()
DLGLOBAL DEBUG test(branch0){destroying}: scene forgets branch0
DLGLOBAL DEBUG test(branch0){destroying}: cleanup() done
DLGLOBAL DEBUG 6 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive(),branch0.destroying_onenter())
DLGLOBAL DEBUG test(branch0){destroying}: Freeing instance
DLGLOBAL DEBUG test(branch0){destroying}: Deallocated
DLGLOBAL DEBUG 5 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive())
DLGLOBAL DEBUG 4 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup())
DLGLOBAL DEBUG test(other){destroying}: cleanup() done
DLGLOBAL DEBUG 3 (twig0a.cleanup(),other.alive(),other.destroying_onenter())
DLGLOBAL DEBUG test(other){destroying}: Freeing instance
DLGLOBAL DEBUG test(other){destroying}: Deallocated
=================================================================
==9774==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000001128 at pc 0x7f74d8b8f249 bp 0x7ffebeffe490 sp 0x7ffebeffe488
WRITE of size 8 at 0x612000001128 thread T0
    #0 0x7f74d8b8f248 in __llist_del ../../../src/libosmocore/include/osmocom/core/linuxlist.h:114
    #1 0x7f74d8b8f380 in llist_del ../../../src/libosmocore/include/osmocom/core/linuxlist.h:126
    #2 0x7f74d8b93eaa in osmo_fsm_inst_free ../../../src/libosmocore/src/fsm.c:404
    #3 0x7f74d8b9ba9c in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:738
    #4 0x563b17db94b5 in destroying_onenter ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:173
    #5 0x7f74d8b987be in state_chg ../../../src/libosmocore/src/fsm.c:521
    #6 0x7f74d8b98870 in _osmo_fsm_inst_state_chg ../../../src/libosmocore/src/fsm.c:577
    #7 0x563b17db8db8 in alive ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:147
    #8 0x7f74d8b99e2f in _osmo_fsm_inst_dispatch ../../../src/libosmocore/src/fsm.c:685
    #9 0x563b17dbabc0 in cleanup ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:251
    #10 0x7f74d8b9b292 in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:733
    #11 0x7f74d8b9c1b1 in _osmo_fsm_inst_term_children ../../../src/libosmocore/src/fsm.c:784
    #12 0x7f74d8b9a85e in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:720
    #13 0x563b17dbd09d in obj_term ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:392
    #14 0x563b17dbd8fb in test_term ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:425
    #15 0x563b17dbdb62 in main ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:453
    #16 0x7f74d7db909a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a)
    #17 0x563b17db5319 in _start (/n/s/dev/make/libosmocore/tests/fsm/fsm_dealloc_test+0x10319)


Fixing this particular run can be done by manually setting this fi->state =
ST_DESTROYING (without osmo_fsm_inst_state_chg(), since that would run the
onenter() function and trigger an osmo_fsm_inst_term()).

So I did:

 void pre_term(struct osmo_fsm_inst *fi, enum osmo_fsm_term_cause cause)
 {
+       /* sneak into "destroying" state without triggering events */
+       fi->state = ST_DESTROYING;
        LOGPFSML(fi, LOGL_DEBUG, "%s()\n", __func__);
 }

That's not so nice, changing states in a cheating way.

A different solution could be some fsm.c internal "already terminating" flag,
so I can call osmo_fsm_inst_term() any number of times, and only the first call
will have an effect.



One might think then the problem is solved?
But then I still face this problem:

- If the child is sent an EV_DESTROY, it goes into osmo_fsm_inst_term().
- As part of its cleanup, it terminates its reference to the "other" object.
- Out of its own free will, the other object then decides that this means some
  problem appeared and decides to deallocate as well.
- But the parent object also has a reference to "other". Since "other" is
  deallocating, it notifies the parent branch0 that it is now gone.
- branch0 decides that if "other" has a problem, it is no longer useful and
  deallocates.
- twig0a is *not* signalled to terminate from the parent branch0, because it
  has already removed itself from the parent at the start of its termination.
- But branch0 calls talloc_free() on itself. Since twig0a was a talloc child of
  branch0, its memory storage is now gone.
- All of the above happened while telling "other" that the twig0a is about to
  go. Now this code path wants to continue after the twig0a.cleanup(). Alas,
  its memory is no longer valid and the fsm.c termination code causes a
  use-after-free.

Looks like this:

DLGLOBAL DEBUG ---------------------- test_destroy(twig0a)
DLGLOBAL DEBUG --- before destroy cascade, got:
branch0
twig0a
other
DLGLOBAL DEBUG ---
DLGLOBAL DEBUG test(twig0a){alive}: Received Event EV_DESTROY
DLGLOBAL DEBUG 1 (twig0a.alive())
DLGLOBAL DEBUG test(twig0a){alive}: alive(EV_DESTROY)
DLGLOBAL DEBUG test(twig0a){alive}: state_chg to destroying
DLGLOBAL DEBUG 2 (twig0a.alive(),twig0a.destroying_onenter())
DLGLOBAL DEBUG test(twig0a){destroying}: destroying_onenter() from alive
DLGLOBAL DEBUG test(twig0a){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR)
DLGLOBAL DEBUG test(twig0a){destroying}: pre_term()
DLGLOBAL DEBUG test(twig0a){destroying}: Removing from parent test(branch0)
DLGLOBAL DEBUG 3 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup())
DLGLOBAL DEBUG test(twig0a){destroying}: cleanup()
DLGLOBAL DEBUG test(twig0a){destroying}: scene forgets twig0a
DLGLOBAL DEBUG test(twig0a){destroying}: removing reference twig0a.other[0] -> other             <----- twig0a tells "other" that it is going
DLGLOBAL DEBUG test(other){alive}: Received Event EV_OTHER_GONE
DLGLOBAL DEBUG 4 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive())
DLGLOBAL DEBUG test(other){alive}: alive(EV_OTHER_GONE)
DLGLOBAL DEBUG 5 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.other_gone())
DLGLOBAL DEBUG test(other){alive}: Dropped reference other.other[1] = twig0a
DLGLOBAL DEBUG 4 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive())
DLGLOBAL DEBUG test(other){alive}: state_chg to destroying
DLGLOBAL DEBUG 5 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter())
DLGLOBAL DEBUG test(other){destroying}: destroying_onenter() from alive
DLGLOBAL DEBUG test(other){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR)
DLGLOBAL DEBUG test(other){destroying}: pre_term()
DLGLOBAL DEBUG 6 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup())
DLGLOBAL DEBUG test(other){destroying}: cleanup()
DLGLOBAL DEBUG test(other){destroying}: scene forgets other
DLGLOBAL DEBUG test(other){destroying}: removing reference other.other[0] -> branch0
DLGLOBAL DEBUG test(branch0){alive}: Received Event EV_OTHER_GONE                                <----- branch0 gets told "other" is gone
DLGLOBAL DEBUG 7 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0
DLGLOBAL DEBUG test(branch0){alive}: alive(EV_OTHER_GONE)
DLGLOBAL DEBUG 8 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0
DLGLOBAL DEBUG test(branch0){alive}: Dropped reference branch0.other[0] = other
DLGLOBAL DEBUG 7 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0
DLGLOBAL DEBUG test(branch0){alive}: state_chg to destroying
DLGLOBAL DEBUG 8 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0
DLGLOBAL DEBUG test(branch0){destroying}: destroying_onenter() from alive
DLGLOBAL DEBUG test(branch0){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR)
DLGLOBAL DEBUG test(branch0){destroying}: pre_term()
DLGLOBAL DEBUG 9 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0
DLGLOBAL DEBUG test(branch0){destroying}: cleanup()
DLGLOBAL DEBUG test(branch0){destroying}: scene forgets branch0
DLGLOBAL DEBUG test(branch0){destroying}: cleanup() done
DLGLOBAL DEBUG 8 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0
DLGLOBAL DEBUG test(branch0){destroying}: Freeing instance
DLGLOBAL DEBUG test(branch0){destroying}: Deallocated                                            <----- branch0 calls talloc_free()
DLGLOBAL DEBUG 7 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0
DLGLOBAL DEBUG 6 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup())
DLGLOBAL DEBUG test(other){destroying}: cleanup() done
DLGLOBAL DEBUG 5 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter())
DLGLOBAL DEBUG test(other){destroying}: Freeing instance
DLGLOBAL DEBUG test(other){destroying}: Deallocated                      <---- twig0a.cleanup() is done, fsm.c wants to continue termination...
=================================================================
==16426==ERROR: AddressSanitizer: heap-use-after-free on address 0x6120000015a8 at pc 0x7faea5cb4149 bp 0x7fff936ffed0 sp 0x7fff936ffec8
WRITE of size 8 at 0x6120000015a8 thread T0
    #0 0x7faea5cb4148 in __llist_del ../../../src/libosmocore/include/osmocom/core/linuxlist.h:114
    #1 0x7faea5cb4280 in llist_del ../../../src/libosmocore/include/osmocom/core/linuxlist.h:126
    #2 0x7faea5cb8daa in osmo_fsm_inst_free ../../../src/libosmocore/src/fsm.c:404
    #3 0x7faea5cc099c in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:738
    #4 0x56367132a4c5 in destroying_onenter ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:173
    #5 0x7faea5cbd6be in state_chg ../../../src/libosmocore/src/fsm.c:521
    #6 0x7faea5cbd770 in _osmo_fsm_inst_state_chg ../../../src/libosmocore/src/fsm.c:577
    #7 0x563671329dc8 in alive ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:147
    #8 0x7faea5cbed2f in _osmo_fsm_inst_dispatch ../../../src/libosmocore/src/fsm.c:685
    #9 0x56367132bbd0 in cleanup ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:251
    #10 0x7faea5cc0192 in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:733
    #11 0x56367132a4c5 in destroying_onenter ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:173
    #12 0x7faea5cbd6be in state_chg ../../../src/libosmocore/src/fsm.c:521
    #13 0x7faea5cbd770 in _osmo_fsm_inst_state_chg ../../../src/libosmocore/src/fsm.c:577
    #14 0x563671329e3d in alive ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:159
    #15 0x7faea5cbed2f in _osmo_fsm_inst_dispatch ../../../src/libosmocore/src/fsm.c:685
    #16 0x56367132e0be in obj_destroy ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:389
    #17 0x56367132e519 in test_destroy ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:408
    #18 0x56367132ebf4 in main ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:454
    #19 0x7faea4edf09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a)
    #20 0x563671326329 in _start (/n/s/dev/make/libosmocore/tests/fsm/fsm_dealloc_test+0x10329)



So, even though twig0a sends the EV_CHILD_GONE to its parent FSM only *after*
all terminating is done, some events crossing over via "other" objects can
still trigger the parent to also deallocate at the same time.

Here the problem merely is that the child is allocated from the parent's talloc
context. Had it been allocated from an independent talloc context, then there
would not be any use-after-free. So let's try that:

   fi = osmo_fsm_inst_alloc_child(&test_fsm, parent->fi, EV_CHILD_GONE);
   OSMO_ASSERT(fi);
+  talloc_steal(s, fi);


Now the next problem arises: when the child is done deallocating, fsm.c still
has the parent FSM pointer to which it must still dispatch the parent_term
event.

When there is only a parent and a child involved, using the parent_term event
actually solves this problem: if we were to dispatch EV_CHILD_GONE from the
child's cleanup() function, then the parent would deallocate and free the child
along with its own talloc context. By using the parent_term event, we wait
until after cleanup() and is done and until after talloc_free() on the child is
done, and only then dispatch EV_CHILD_GONE to the parent.

But because a third "other" object is involved, the deallocation triger
ricochets across that other object. It is out of scope to tell that other
object to care about which two callers are using it and why they must not be
told to deallocate in certain situations. It is certainly possible to do that,
but it requires a lot of careful thinking through all possible variations,
which is very easy to get wrong.


The vision for fixing these kinds of situations is to avoid a use-after-free
completely by keeping the talloc contexts around; Also all event dispatch and
osmo_fsm_inst_term() calls must be stopped when a fi is marked to be released.

- add flag osmo_fsm_inst.deallocating. If true, fsm.c denies all action
  triggers on the FSM instance (no osmo_fsm_inst_term(), no
  osmo_fsm_inst_state_chg(), no osmo_fsm_inst_dispatch()).

- add a talloc bucket into which to reparent FSM instances instead of
  deallocating directly. Then each parent/child/"other" object that wants to
  trigger things on it will be able to see the "deallocating" flag, and still
  even use memory like the fi->id for logging.  This talloc bucket could be a
  volatile select context, or it could even be a global context that exists
  once in fsm.c, the first osmo_fsm_inst_term() creates it, other callers add
  and remove use counts, and as soon as the last use count is removed, the
  entire context gets freed.

I think this would solve all of the above problems.

~N
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.osmocom.org/pipermail/openbsc/attachments/20190324/2c27a58b/attachment.bin>


More information about the OpenBSC mailing list