This is merely a historical archive of years 2008-2021, before the migration to mailman3.
A maintained and still updated list archive can be found at https://lists.osmocom.org/hyperkitty/list/OpenBSC@lists.osmocom.org/.
Neels Hofmeyr nhofmeyr at sysmocom.deOn Thu, Mar 21, 2019 at 09:30:28AM +0100, Harald Welte wrote: > Hi Neels, > > On Thu, Mar 21, 2019 at 02:24:30AM +0100, Neels Hofmeyr wrote: > > Talking of features, another idea comes to mind: freeing of FSM instances. I > > more often than not have the problem that a child FSM is freeing, telling the > > parent about it, then the parent decides to free because the child's result > > signalled some end, and then we have a double free because the child was > > allocated with the parent as talloc context, the parent has just deallocated, > > but after the child's cleanup is through, the fsm code wants to also deallocate > > the child. > > That just sounds like a bug and it should be fixed? The question is whether > we want to mandate all child contexts to be allocated from within a parent. If > so, then whatever mechanism to prevent this could / should become part > of the fsm core. > > > In these situations I so far need to introduce checks that prevent > > the parent from freeing until the children are completely finished with their > > actions -- for example, avoid signalling end from the cleanup function or event > > handling yet, instead completely rely on the parent_term_event, which is > > dispatched only after the child was freed. > > I wonder if those checks could be made part of the core. Do you have a > concrete code example to look at? Just after writing the above, I have introduced another deallocation failure mode in the cleanup among RTP-stream/call leg/MGW endpoint/MSC subscriber FSMs, and am fed up with solving these ad hoc as they show up. So now I am writing an explicit FSM deallocation test, to figure out how to properly solve various cleanup scenarios -- so that I can hopefully apply one and the same mechanism to all FSM implementations. (This might get a bit complex to follow, a proposed solution is in the bottom.) Here is the test: branch neels/fsm_dealloc_test of libosmocore http://git.osmocom.org/libosmocore/commit/?h=neels/fsm_dealloc_test&id=31ed1036cee1ed4e0ccf12e83e2bc0918af8db47 The "scene" struct shows a number of objects, but have reduced to only three being in use. branch0 <-parent/child-> twig0a \ / ----- other ------- So both a parent and child FSM reference some "other" object which they depend upon. (A real world equivalent is that both a call_leg and rtp_stream FSM depend on the MGW endpoint -- in which situation also all of those are children of an MSC subscriber FSM, just saying complexity is easily added in real life) The underlying problem here is: - when the child has an error and goes away, the parent should go away as well. So there is an event from the child to the parent saying "I am gone, you too go now": EV_CHILD_GONE. - if now I tell branch0 to deallocate regularly, it first tells the child to deallocate. The child then also says EV_CHILD_GONE. - The parent receives EV_CHILD_GONE and again triggers its own deallocation. Now there are two osmo_fsm_inst_term() called on the same FSM instance. A way to fix this is to introduce a ST_DESTROYING state. It acts as a flag to prevent re-triggering an osmo_fsm_inst_term(). So if as soon as once in ST_DESTROYING, it will no longer issue osmo_fsm_inst_term() on itself. My test models this structure. In the test code try in turns to deallocate with a EV_DESTROY event, and then a direct osmo_fsm_inst_term(). Both of them should ideally be safe to do, because on weird errors anything should be able to just term an FSM instance. Dispatching an EV_DESTROY actually works out fine. But calling osmo_fsm_inst_term() directly fails. The difference is that in the EV_DESTROY case, the branch0 FSM has changed to state ST_DESTROYING, and if it receives events that peer objects are gone, it ignores them. In the term case, the FSM instance stays in the "ALIVE" state, in which case a term triggered at the child twig0a in turn dispatches an EV_CHILD_GONE, which the parent branch0 acts upon by transitioning to the ST_DESTROYING state, which launches a *second* osmo_fsm_inst_term() on itself. The output looks like this -- first the successful run, then the failing one: ▶ ./fsm_dealloc_test DLGLOBAL DEBUG scene_alloc() DLGLOBAL DEBUG test(branch0){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: is child of test(branch0) DLGLOBAL DEBUG test(other){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: branch0.other[0] = other DLGLOBAL DEBUG test(other){alive}: other.other[0] = branch0 DLGLOBAL DEBUG test(twig0a){alive}: twig0a.other[0] = other DLGLOBAL DEBUG test(other){alive}: other.other[1] = twig0a DLGLOBAL DEBUG scene_alloc() DLGLOBAL DEBUG test(branch0){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: is child of test(branch0) DLGLOBAL DEBUG test(other){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: branch0.other[0] = other DLGLOBAL DEBUG test(other){alive}: other.other[0] = branch0 DLGLOBAL DEBUG test(twig0a){alive}: twig0a.other[0] = other DLGLOBAL DEBUG test(other){alive}: other.other[1] = twig0a DLGLOBAL DEBUG scene_alloc() DLGLOBAL DEBUG test(branch0){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: is child of test(branch0) DLGLOBAL DEBUG test(other){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: branch0.other[0] = other DLGLOBAL DEBUG test(other){alive}: other.other[0] = branch0 DLGLOBAL DEBUG test(twig0a){alive}: twig0a.other[0] = other DLGLOBAL DEBUG test(other){alive}: other.other[1] = twig0a DLGLOBAL DEBUG ---------------------- test_destroy(branch0) DLGLOBAL DEBUG --- before destroy cascade, got: branch0 twig0a other DLGLOBAL DEBUG --- DLGLOBAL DEBUG test(branch0){alive}: Received Event EV_DESTROY DLGLOBAL DEBUG 1 (branch0.alive()) DLGLOBAL DEBUG test(branch0){alive}: alive(EV_DESTROY) DLGLOBAL DEBUG test(branch0){alive}: state_chg to destroying DLGLOBAL DEBUG 2 (branch0.alive(),branch0.destroying_onenter()) DLGLOBAL DEBUG test(branch0){destroying}: destroying_onenter() from alive DLGLOBAL DEBUG test(branch0){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR) DLGLOBAL DEBUG test(branch0){destroying}: pre_term() DLGLOBAL DEBUG test(twig0a){alive}: Terminating (cause = OSMO_FSM_TERM_PARENT) DLGLOBAL DEBUG test(twig0a){alive}: pre_term() DLGLOBAL DEBUG test(twig0a){alive}: Removing from parent test(branch0) DLGLOBAL DEBUG 3 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup()) DLGLOBAL DEBUG test(twig0a){alive}: cleanup() DLGLOBAL DEBUG test(twig0a){alive}: scene forgets twig0a DLGLOBAL DEBUG test(twig0a){alive}: removing reference twig0a.other[0] -> other DLGLOBAL DEBUG test(other){alive}: Received Event EV_OTHER_GONE DLGLOBAL DEBUG 4 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive()) DLGLOBAL DEBUG test(other){alive}: alive(EV_OTHER_GONE) DLGLOBAL DEBUG 5 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.other_gone()) DLGLOBAL DEBUG test(other){alive}: Dropped reference other.other[1] = twig0a DLGLOBAL DEBUG 4 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive()) DLGLOBAL DEBUG test(other){alive}: state_chg to destroying DLGLOBAL DEBUG 5 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter()) DLGLOBAL DEBUG test(other){destroying}: destroying_onenter() from alive DLGLOBAL DEBUG test(other){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR) DLGLOBAL DEBUG test(other){destroying}: pre_term() DLGLOBAL DEBUG 6 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup()) DLGLOBAL DEBUG test(other){destroying}: cleanup() DLGLOBAL DEBUG test(other){destroying}: scene forgets other DLGLOBAL DEBUG test(other){destroying}: removing reference other.other[0] -> branch0 DLGLOBAL DEBUG test(branch0){destroying}: Received Event EV_OTHER_GONE DLGLOBAL DEBUG 7 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branc DLGLOBAL DEBUG test(branch0){destroying}: destroying(EV_OTHER_GONE) DLGLOBAL DEBUG 8 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branc DLGLOBAL DEBUG test(branch0){destroying}: Dropped reference branch0.other[0] = other DLGLOBAL DEBUG 7 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branc DLGLOBAL DEBUG 6 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup()) DLGLOBAL DEBUG test(other){destroying}: cleanup() done DLGLOBAL DEBUG 5 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter()) DLGLOBAL DEBUG test(other){destroying}: Freeing instance DLGLOBAL DEBUG test(other){destroying}: Deallocated DLGLOBAL DEBUG 4 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup(),other.alive()) DLGLOBAL DEBUG 3 (branch0.alive(),branch0.destroying_onenter(),twig0a.cleanup()) DLGLOBAL DEBUG test(twig0a){alive}: cleanup() done DLGLOBAL DEBUG 2 (branch0.alive(),branch0.destroying_onenter()) DLGLOBAL DEBUG test(twig0a){alive}: Freeing instance DLGLOBAL DEBUG test(twig0a){alive}: Deallocated DLGLOBAL DEBUG 3 (branch0.alive(),branch0.destroying_onenter(),branch0.cleanup()) DLGLOBAL DEBUG test(branch0){destroying}: cleanup() DLGLOBAL DEBUG test(branch0){destroying}: scene forgets branch0 DLGLOBAL DEBUG test(branch0){destroying}: cleanup() done DLGLOBAL DEBUG 2 (branch0.alive(),branch0.destroying_onenter()) DLGLOBAL DEBUG test(branch0){destroying}: Freeing instance DLGLOBAL DEBUG test(branch0){destroying}: Deallocated DLGLOBAL DEBUG 1 (branch0.alive()) DLGLOBAL DEBUG 0 (-) DLGLOBAL DEBUG --- after destroy cascade, still got: <------ SUCCESS! DLGLOBAL DEBUG --- cleaning up DLGLOBAL DEBUG scene_alloc() DLGLOBAL DEBUG test(branch0){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: is child of test(branch0) DLGLOBAL DEBUG test(other){alive}: Allocated DLGLOBAL DEBUG test(branch0){alive}: branch0.other[0] = other DLGLOBAL DEBUG test(other){alive}: other.other[0] = branch0 DLGLOBAL DEBUG test(twig0a){alive}: twig0a.other[0] = other DLGLOBAL DEBUG test(other){alive}: other.other[1] = twig0a DLGLOBAL DEBUG ---------------------- test_term(branch0) DLGLOBAL DEBUG --- before term cascade, got: branch0 twig0a other DLGLOBAL DEBUG --- DLGLOBAL DEBUG test(branch0){alive}: Terminating (cause = OSMO_FSM_TERM_REGULAR) <------------ branch0 terminating DLGLOBAL DEBUG test(branch0){alive}: pre_term() DLGLOBAL DEBUG test(twig0a){alive}: Terminating (cause = OSMO_FSM_TERM_PARENT) DLGLOBAL DEBUG test(twig0a){alive}: pre_term() DLGLOBAL DEBUG test(twig0a){alive}: Removing from parent test(branch0) DLGLOBAL DEBUG 1 (twig0a.cleanup()) DLGLOBAL DEBUG test(twig0a){alive}: cleanup() DLGLOBAL DEBUG test(twig0a){alive}: scene forgets twig0a DLGLOBAL DEBUG test(twig0a){alive}: removing reference twig0a.other[0] -> other DLGLOBAL DEBUG test(other){alive}: Received Event EV_OTHER_GONE DLGLOBAL DEBUG 2 (twig0a.cleanup(),other.alive()) DLGLOBAL DEBUG test(other){alive}: alive(EV_OTHER_GONE) DLGLOBAL DEBUG 3 (twig0a.cleanup(),other.alive(),other.other_gone()) DLGLOBAL DEBUG test(other){alive}: Dropped reference other.other[1] = twig0a DLGLOBAL DEBUG 2 (twig0a.cleanup(),other.alive()) DLGLOBAL DEBUG test(other){alive}: state_chg to destroying DLGLOBAL DEBUG 3 (twig0a.cleanup(),other.alive(),other.destroying_onenter()) DLGLOBAL DEBUG test(other){destroying}: destroying_onenter() from alive DLGLOBAL DEBUG test(other){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR) DLGLOBAL DEBUG test(other){destroying}: pre_term() DLGLOBAL DEBUG 4 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup()) DLGLOBAL DEBUG test(other){destroying}: cleanup() DLGLOBAL DEBUG test(other){destroying}: scene forgets other DLGLOBAL DEBUG test(other){destroying}: removing reference other.other[0] -> branch0 DLGLOBAL DEBUG test(branch0){alive}: Received Event EV_OTHER_GONE DLGLOBAL DEBUG 5 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive()) DLGLOBAL DEBUG test(branch0){alive}: alive(EV_OTHER_GONE) DLGLOBAL DEBUG 6 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive(),branch0.other_gone()) DLGLOBAL DEBUG test(branch0){alive}: Dropped reference branch0.other[0] = other DLGLOBAL DEBUG 5 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive()) DLGLOBAL DEBUG test(branch0){alive}: state_chg to destroying DLGLOBAL DEBUG 6 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive(),branch0.destroying_onenter()) DLGLOBAL DEBUG test(branch0){destroying}: destroying_onenter() from alive DLGLOBAL DEBUG test(branch0){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR) <-------- branch0 terminating AGAIN!?!? DLGLOBAL DEBUG test(branch0){destroying}: pre_term() DLGLOBAL DEBUG 7 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive(),branch0.destroying_onenter(),branc DLGLOBAL DEBUG test(branch0){destroying}: cleanup() DLGLOBAL DEBUG test(branch0){destroying}: scene forgets branch0 DLGLOBAL DEBUG test(branch0){destroying}: cleanup() done DLGLOBAL DEBUG 6 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive(),branch0.destroying_onenter()) DLGLOBAL DEBUG test(branch0){destroying}: Freeing instance DLGLOBAL DEBUG test(branch0){destroying}: Deallocated DLGLOBAL DEBUG 5 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0.alive()) DLGLOBAL DEBUG 4 (twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup()) DLGLOBAL DEBUG test(other){destroying}: cleanup() done DLGLOBAL DEBUG 3 (twig0a.cleanup(),other.alive(),other.destroying_onenter()) DLGLOBAL DEBUG test(other){destroying}: Freeing instance DLGLOBAL DEBUG test(other){destroying}: Deallocated ================================================================= ==9774==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000001128 at pc 0x7f74d8b8f249 bp 0x7ffebeffe490 sp 0x7ffebeffe488 WRITE of size 8 at 0x612000001128 thread T0 #0 0x7f74d8b8f248 in __llist_del ../../../src/libosmocore/include/osmocom/core/linuxlist.h:114 #1 0x7f74d8b8f380 in llist_del ../../../src/libosmocore/include/osmocom/core/linuxlist.h:126 #2 0x7f74d8b93eaa in osmo_fsm_inst_free ../../../src/libosmocore/src/fsm.c:404 #3 0x7f74d8b9ba9c in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:738 #4 0x563b17db94b5 in destroying_onenter ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:173 #5 0x7f74d8b987be in state_chg ../../../src/libosmocore/src/fsm.c:521 #6 0x7f74d8b98870 in _osmo_fsm_inst_state_chg ../../../src/libosmocore/src/fsm.c:577 #7 0x563b17db8db8 in alive ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:147 #8 0x7f74d8b99e2f in _osmo_fsm_inst_dispatch ../../../src/libosmocore/src/fsm.c:685 #9 0x563b17dbabc0 in cleanup ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:251 #10 0x7f74d8b9b292 in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:733 #11 0x7f74d8b9c1b1 in _osmo_fsm_inst_term_children ../../../src/libosmocore/src/fsm.c:784 #12 0x7f74d8b9a85e in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:720 #13 0x563b17dbd09d in obj_term ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:392 #14 0x563b17dbd8fb in test_term ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:425 #15 0x563b17dbdb62 in main ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:453 #16 0x7f74d7db909a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) #17 0x563b17db5319 in _start (/n/s/dev/make/libosmocore/tests/fsm/fsm_dealloc_test+0x10319) Fixing this particular run can be done by manually setting this fi->state = ST_DESTROYING (without osmo_fsm_inst_state_chg(), since that would run the onenter() function and trigger an osmo_fsm_inst_term()). So I did: void pre_term(struct osmo_fsm_inst *fi, enum osmo_fsm_term_cause cause) { + /* sneak into "destroying" state without triggering events */ + fi->state = ST_DESTROYING; LOGPFSML(fi, LOGL_DEBUG, "%s()\n", __func__); } That's not so nice, changing states in a cheating way. A different solution could be some fsm.c internal "already terminating" flag, so I can call osmo_fsm_inst_term() any number of times, and only the first call will have an effect. One might think then the problem is solved? But then I still face this problem: - If the child is sent an EV_DESTROY, it goes into osmo_fsm_inst_term(). - As part of its cleanup, it terminates its reference to the "other" object. - Out of its own free will, the other object then decides that this means some problem appeared and decides to deallocate as well. - But the parent object also has a reference to "other". Since "other" is deallocating, it notifies the parent branch0 that it is now gone. - branch0 decides that if "other" has a problem, it is no longer useful and deallocates. - twig0a is *not* signalled to terminate from the parent branch0, because it has already removed itself from the parent at the start of its termination. - But branch0 calls talloc_free() on itself. Since twig0a was a talloc child of branch0, its memory storage is now gone. - All of the above happened while telling "other" that the twig0a is about to go. Now this code path wants to continue after the twig0a.cleanup(). Alas, its memory is no longer valid and the fsm.c termination code causes a use-after-free. Looks like this: DLGLOBAL DEBUG ---------------------- test_destroy(twig0a) DLGLOBAL DEBUG --- before destroy cascade, got: branch0 twig0a other DLGLOBAL DEBUG --- DLGLOBAL DEBUG test(twig0a){alive}: Received Event EV_DESTROY DLGLOBAL DEBUG 1 (twig0a.alive()) DLGLOBAL DEBUG test(twig0a){alive}: alive(EV_DESTROY) DLGLOBAL DEBUG test(twig0a){alive}: state_chg to destroying DLGLOBAL DEBUG 2 (twig0a.alive(),twig0a.destroying_onenter()) DLGLOBAL DEBUG test(twig0a){destroying}: destroying_onenter() from alive DLGLOBAL DEBUG test(twig0a){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR) DLGLOBAL DEBUG test(twig0a){destroying}: pre_term() DLGLOBAL DEBUG test(twig0a){destroying}: Removing from parent test(branch0) DLGLOBAL DEBUG 3 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup()) DLGLOBAL DEBUG test(twig0a){destroying}: cleanup() DLGLOBAL DEBUG test(twig0a){destroying}: scene forgets twig0a DLGLOBAL DEBUG test(twig0a){destroying}: removing reference twig0a.other[0] -> other <----- twig0a tells "other" that it is going DLGLOBAL DEBUG test(other){alive}: Received Event EV_OTHER_GONE DLGLOBAL DEBUG 4 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive()) DLGLOBAL DEBUG test(other){alive}: alive(EV_OTHER_GONE) DLGLOBAL DEBUG 5 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.other_gone()) DLGLOBAL DEBUG test(other){alive}: Dropped reference other.other[1] = twig0a DLGLOBAL DEBUG 4 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive()) DLGLOBAL DEBUG test(other){alive}: state_chg to destroying DLGLOBAL DEBUG 5 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter()) DLGLOBAL DEBUG test(other){destroying}: destroying_onenter() from alive DLGLOBAL DEBUG test(other){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR) DLGLOBAL DEBUG test(other){destroying}: pre_term() DLGLOBAL DEBUG 6 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup()) DLGLOBAL DEBUG test(other){destroying}: cleanup() DLGLOBAL DEBUG test(other){destroying}: scene forgets other DLGLOBAL DEBUG test(other){destroying}: removing reference other.other[0] -> branch0 DLGLOBAL DEBUG test(branch0){alive}: Received Event EV_OTHER_GONE <----- branch0 gets told "other" is gone DLGLOBAL DEBUG 7 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0 DLGLOBAL DEBUG test(branch0){alive}: alive(EV_OTHER_GONE) DLGLOBAL DEBUG 8 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0 DLGLOBAL DEBUG test(branch0){alive}: Dropped reference branch0.other[0] = other DLGLOBAL DEBUG 7 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0 DLGLOBAL DEBUG test(branch0){alive}: state_chg to destroying DLGLOBAL DEBUG 8 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0 DLGLOBAL DEBUG test(branch0){destroying}: destroying_onenter() from alive DLGLOBAL DEBUG test(branch0){destroying}: Terminating (cause = OSMO_FSM_TERM_REGULAR) DLGLOBAL DEBUG test(branch0){destroying}: pre_term() DLGLOBAL DEBUG 9 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0 DLGLOBAL DEBUG test(branch0){destroying}: cleanup() DLGLOBAL DEBUG test(branch0){destroying}: scene forgets branch0 DLGLOBAL DEBUG test(branch0){destroying}: cleanup() done DLGLOBAL DEBUG 8 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0 DLGLOBAL DEBUG test(branch0){destroying}: Freeing instance DLGLOBAL DEBUG test(branch0){destroying}: Deallocated <----- branch0 calls talloc_free() DLGLOBAL DEBUG 7 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup(),branch0 DLGLOBAL DEBUG 6 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter(),other.cleanup()) DLGLOBAL DEBUG test(other){destroying}: cleanup() done DLGLOBAL DEBUG 5 (twig0a.alive(),twig0a.destroying_onenter(),twig0a.cleanup(),other.alive(),other.destroying_onenter()) DLGLOBAL DEBUG test(other){destroying}: Freeing instance DLGLOBAL DEBUG test(other){destroying}: Deallocated <---- twig0a.cleanup() is done, fsm.c wants to continue termination... ================================================================= ==16426==ERROR: AddressSanitizer: heap-use-after-free on address 0x6120000015a8 at pc 0x7faea5cb4149 bp 0x7fff936ffed0 sp 0x7fff936ffec8 WRITE of size 8 at 0x6120000015a8 thread T0 #0 0x7faea5cb4148 in __llist_del ../../../src/libosmocore/include/osmocom/core/linuxlist.h:114 #1 0x7faea5cb4280 in llist_del ../../../src/libosmocore/include/osmocom/core/linuxlist.h:126 #2 0x7faea5cb8daa in osmo_fsm_inst_free ../../../src/libosmocore/src/fsm.c:404 #3 0x7faea5cc099c in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:738 #4 0x56367132a4c5 in destroying_onenter ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:173 #5 0x7faea5cbd6be in state_chg ../../../src/libosmocore/src/fsm.c:521 #6 0x7faea5cbd770 in _osmo_fsm_inst_state_chg ../../../src/libosmocore/src/fsm.c:577 #7 0x563671329dc8 in alive ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:147 #8 0x7faea5cbed2f in _osmo_fsm_inst_dispatch ../../../src/libosmocore/src/fsm.c:685 #9 0x56367132bbd0 in cleanup ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:251 #10 0x7faea5cc0192 in _osmo_fsm_inst_term ../../../src/libosmocore/src/fsm.c:733 #11 0x56367132a4c5 in destroying_onenter ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:173 #12 0x7faea5cbd6be in state_chg ../../../src/libosmocore/src/fsm.c:521 #13 0x7faea5cbd770 in _osmo_fsm_inst_state_chg ../../../src/libosmocore/src/fsm.c:577 #14 0x563671329e3d in alive ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:159 #15 0x7faea5cbed2f in _osmo_fsm_inst_dispatch ../../../src/libosmocore/src/fsm.c:685 #16 0x56367132e0be in obj_destroy ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:389 #17 0x56367132e519 in test_destroy ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:408 #18 0x56367132ebf4 in main ../../../src/libosmocore/tests/fsm/fsm_dealloc_test.c:454 #19 0x7faea4edf09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) #20 0x563671326329 in _start (/n/s/dev/make/libosmocore/tests/fsm/fsm_dealloc_test+0x10329) So, even though twig0a sends the EV_CHILD_GONE to its parent FSM only *after* all terminating is done, some events crossing over via "other" objects can still trigger the parent to also deallocate at the same time. Here the problem merely is that the child is allocated from the parent's talloc context. Had it been allocated from an independent talloc context, then there would not be any use-after-free. So let's try that: fi = osmo_fsm_inst_alloc_child(&test_fsm, parent->fi, EV_CHILD_GONE); OSMO_ASSERT(fi); + talloc_steal(s, fi); Now the next problem arises: when the child is done deallocating, fsm.c still has the parent FSM pointer to which it must still dispatch the parent_term event. When there is only a parent and a child involved, using the parent_term event actually solves this problem: if we were to dispatch EV_CHILD_GONE from the child's cleanup() function, then the parent would deallocate and free the child along with its own talloc context. By using the parent_term event, we wait until after cleanup() and is done and until after talloc_free() on the child is done, and only then dispatch EV_CHILD_GONE to the parent. But because a third "other" object is involved, the deallocation triger ricochets across that other object. It is out of scope to tell that other object to care about which two callers are using it and why they must not be told to deallocate in certain situations. It is certainly possible to do that, but it requires a lot of careful thinking through all possible variations, which is very easy to get wrong. The vision for fixing these kinds of situations is to avoid a use-after-free completely by keeping the talloc contexts around; Also all event dispatch and osmo_fsm_inst_term() calls must be stopped when a fi is marked to be released. - add flag osmo_fsm_inst.deallocating. If true, fsm.c denies all action triggers on the FSM instance (no osmo_fsm_inst_term(), no osmo_fsm_inst_state_chg(), no osmo_fsm_inst_dispatch()). - add a talloc bucket into which to reparent FSM instances instead of deallocating directly. Then each parent/child/"other" object that wants to trigger things on it will be able to see the "deallocating" flag, and still even use memory like the fi->id for logging. This talloc bucket could be a volatile select context, or it could even be a global context that exists once in fsm.c, the first osmo_fsm_inst_term() creates it, other callers add and remove use counts, and as soon as the last use count is removed, the entire context gets freed. I think this would solve all of the above problems. ~N -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: not available URL: <http://lists.osmocom.org/pipermail/openbsc/attachments/20190324/2c27a58b/attachment.bin>