easier way to avoid msgb leaks in libosmo-sccp callers?

historical

Hi Neels,

On Sun, Mar 10, 2019 at 04:31:05AM +0100, Neels Hofmeyr wrote:
> Avoiding msgb leaks is easiest if the caller retains ownership of the msgb.

ACK.

> However, if the caller passes ownership down to the msgb consumer, things
> become nightmarishly complex:

ACK.

> If any link in this call chain fails to be aware of the importance to return a
> failed RC or to free a msgb if the chain is broken, we have a hidden msgb leak.

Indeed.

This is why I have been arguing in favor of something like a "packet dispatch
talloc context" for some time.  Unfortuantely I didn't have any time to work
on this so far.

The proposal is along those lines:
* we create a talloc context every time we drop out of the select() statement
* any msgb's allocated from routines that read from sockets / etc. are allocated
  from that context
* just before we return into osmo_select_main() [or in the start of that function],
  we free the entire talloc context, which will free any msgb's in it
* if anyone wants to take ownership of a msgb for a longer time (e.g. enqueue it
  somewhere for delayed processing, ...) they need to re-parent the object away
  from the volatile "packet dispathc talloc context" and move it into another
  context.  Talloc provides talloc_steal() or better talloc_move() for it.

I would think this is an elegant solution.  That particular volatile talloc context
would not be restricted to msgb's but could be used for any kind of 'temporary'
allocations.  We could lazily allocate whatever we wanted (like temporary string buffers,
etc.) and know that it will be safely released once the current select dispatch
ceases to exist.  Sure, whenever we can we should still use the stack, as stack
allocations are cheaper than heap allocations.

> This is the case with osmo_sccp_user_sap_down(). In new osmo-msc, passing data
> through various FSM instances, there is high potential for leak/double-free
> bugs. A very large brain is required to track down every msgb path.

Let me know what you think of the above-mentioned proposal.

We could have something like msgb_alloc2() which takes an additional talloc
context as argument.  Pluse creating that [gloal variable] talloc context
in the beginning of osmo_fd_disp_fds() and talloc_free()ing it at the end
of osmo_fd_disp_fds() should do the trick.

One might evene be a able to do some trickery with talloc_set_destructor()
to print a BIG FAT WARNING/ERROR message in case anyone ever tries to free
memory allocated from this context manually. That destructor would simpl have
to be removed or disabled when the select loop handling takes care of it.

> Isn't it possible to provide osmo_sccp_user_sap_down() in the caller-owns
> paradigm? Thinking about an osmo_sccp_user_sap_down2() that simply doesn't
> msgb_free().

It's possible.  But I think there are use cases for both behaviors.

> Passing ownership to the consumer is imperative if a msg queue is involved that
> might send out asynchronously. (A workaround could be to copy the message
> before passing into the wqueue.) 

The proposed approach would allow for "re-parenting" without having to copy the msgb
or any of the metadata.  Even the address of the msgb stays the same if you move/steal
it.

> I think I'll just try it, but if anyone knows a definite reason why this will
> not work, please let me know.

Not off my head, but I guess nobody is able to make guarantees on this without
a thorough code audit and/or extensive testing.  THat's why I like the idea of
this volatile context.

However, I realize only now that you're referring not to the case of received message
buffers, but about locally-originated msgbs.  But for them, too, the rule could
be simply something like this:
* allocate memory from the volatile context
* pass it into any library function
* if the library function transmits it right away, no need to free/steal
* if the library function enqueues it somewhere and doesn't immediately send
  it over a socket or the like, it needs to move/steal it.

Thinking of the above, I think the 'being queued' case is more or less the normal
one, as we can never make the assumption that the underlying transport layer
is able to accept transmission of any data in a non-blocking way.  So there will
always be some kind of write queue wherever we transmit.

The only situation where we can "normally" assume that the msgb of any outbound message
can be free'd immediately is:
* datagram sockets like for RTP
* local primitives that are simply sent to another layer in the local stack,
  and which are released immediately after that stack has received them

Regards,
	Harald
-- 
- Harald Welte <laforge at gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)