osmo-sip-connector in bad state


OMAR RAMADAN

25 Jan 2017

6:06 p.m.

If the SIP server dies in the middle of a call, osmo-sip-connector is in a bad state and generates a never ending stream of error messages:

58): zero length packettport(0xb7e9f258): zero length packettport(0xb7e9f258): zero length packettport(0xb7e9f258): zero length packet [... the same message repeats without line breaks ...]

It looks like the messages are generated by sofia-sip. I have managed to suppress them by setting the environment variable SU_DEBUG to a value below 3: http://sofia-sip.sourcearchive.com/documentation/1.12.7/tport__internal_8h_e...

However, osmo-sip-connector is clearly in a bad state when this happens, and we need a way to detect and release these ghosted calls.


Holger Freyther

25 Jan

9:06 p.m.

...

On 25 Jan 2017, at 18:06, OMAR RAMADAN omar.ramadan@berkeley.edu wrote:

If the SIP server dies in the middle of a call, osmo-sip-connector is in a bad state and generates a never ending stream of error messages:

Can you reliably reproduce it? It seems sofia-sip is struggling with some input and goes crazy after that. I lack a stable way to reproduce it. The lack of \n in that message is annoying too. :(

holger

OMAR RAMADAN

26 Jan

12:31 a.m.

I've seen it a few times in production already and it filled the disk. You should be able to reproduce it by killing an active RTP stream. I have been using FreeSWITCH, but I don't imagine it is limited to this SIP server. It looks like sofia-sip is driven to continue receiving media and gets nothing back, while the call should be terminated.


Alexander Chemeris

5:30 a.m.

Omar,

Just curious - is there any reason you're running RTP through the osmo-sip-connector instead of directly to FreeSWITCH?

Please excuse typos. Written with a touchscreen keyboard.

--
Regards,
Alexander Chemeris
CEO Fairwaves, Inc.
https://fairwaves.co


OMAR RAMADAN

3:15 p.m.

Good point Alexander, I just realized that the connector is only handling the signaling. What is probably happening is that a call is being established to a dead SIP server. But we should still terminate the call cleanly.


Holger Freyther

3:18 p.m.

...

On 26 Jan 2017, at 00:31, OMAR RAMADAN omar.ramadan@berkeley.edu wrote:

I've seen it a few times in production already and it filled the disk. You should be able to reproduce it by killing an active RTP stream. I have been using freeswitch, but I don't imagine it is limited to this SIP server. It looks like sofia-sip is driven to continue to receiving media and gets nothing back while the call should be terminated.

That sucks. I intend to look at early media in the connector and try to reproduce the issue as well. I have seen it once when a NAT was involved, too.

holger

Holger Freyther

12 Feb

4:23 a.m.

...

On 26 Jan 2017, at 21:18, Holger Freyther holger@freyther.de wrote:

...

that sucks. I intend to look at early media in the connector and try to reproduce the issue as well, I have seen it once when a NAT was involved too..

I am traveling in South East Asia right now and my test setup is quite limited, but I read a bit of the sofia-sip code, the glib integration and our event loop integration. While I couldn't reproduce it, there are things to improve in our event loop code. If you are building from source you could pull this[1] to give it a try. I intend to submit the changes on Tuesday morning (giving Europe/US time to review on a business day). Nightly packages should be available soon.

thank you holger

[1] git pull https://gerrit.osmocom.org/osmo-sip-connector refs/changes/97/1797/1

Holger Freyther

3 Mar

2:30 a.m.

...

On 12 Feb 2017, at 11:23, Holger Freyther holger@freyther.de wrote:

Dear Omar,

...

I am traveling in south east asia right now and my test setup is quite limited but I read a bit of sofia-sip code, the glib integration and our event loop integration and while I couldn't reproduce it, there are things to improve in our eventloop code. If you are building from source you could pull this[1] to give it a try. I intend to submit the changes on Tuesday morning (giving europe/us time to review on a business day). Nightly packages should be available soon.

did you have time to see if the updated eventloop code is improving the situation?

thank you holger

Holger Freyther

6 Mar

3:28 p.m.

...

On 3 Mar 2017, at 02:30, Holger Freyther holger@freyther.de wrote:

did you have time to see if the updated eventloop code is improving the situation?

I was working on a manual test case for DTMF handling, and I am able to reproduce this now. I will analyze it and then return to DTMF.

holger

Holger Freyther

9:51 p.m.

...

On 6 Mar 2017, at 15:28, Holger Freyther holger@freyther.de wrote:

Hi!

...

I was working on a manual testcase for DTMF handling.. and I am able to reproduce this now. Will analyze it now and then return to DTMF.

I don't have a fix yet, but I have a workaround. One can patch sofia-sip to not use the IP_RECVERR sockopt: either patch it out or configure with an ac_... var to not enable the feature.

Sofia SIP is using the IP_RECVERR socket option to learn whether an error occurred on send (or afterwards, e.g. ICMP unreachable). In my case I tried to send the INVITE to 127.0.0.2:5060; the kernel knows that no one is there and enqueues an error. The error can only be read with recvmsg+MSG_ERRQUEUE. This happens in sofia-sip in tport_udp_error, which is called by the tport_error_event function.
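As an illustration of the mechanism described above (not sofia-sip's actual code), a minimal Linux-only sketch of enabling IP_RECVERR on a UDP socket and draining one entry from the error queue with recvmsg and MSG_ERRQUEUE might look like this. The helper names enable_recverr and read_queued_error are hypothetical and exist only for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

/* Hypothetical helper: ask the kernel to queue asynchronous errors
 * (e.g. ICMP unreachable) for this UDP socket.
 * Returns 0 on success, -1 on failure. */
int enable_recverr(int fd)
{
	int on = 1;

	assert(fd >= 0);
	return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
}

/* Hypothetical helper: dequeue one pending entry from the error queue.
 * Returns the queued errno (> 0), 0 if no IP error is available,
 * or -1 if recvmsg() failed for another reason. */
int read_queued_error(int fd)
{
	char payload[512];
	union {
		char buf[256];
		struct cmsghdr align; /* forces suitable alignment */
	} control;
	struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control.buf;
	msg.msg_controllen = sizeof(control.buf);

	/* MSG_ERRQUEUE reads from the error queue instead of the data queue. */
	if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
		return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == IPPROTO_IP && cmsg->cmsg_type == IP_RECVERR) {
			struct sock_extended_err ee;

			memcpy(&ee, CMSG_DATA(cmsg), sizeof(ee));
			return (int)ee.ee_errno;
		}
	}
	return 0;
}
```

If the error queue is never drained this way, the pending error stays attached to the socket, which would fit the endless wakeups seen in the log.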

Now to the issue. There are three ways to use sofia-sip:

* Just call it to poll every X units of time (like LCR)
* Implement the complex vtable for event loop integration
* Integrate with glib to have the vtable thing work

I had ruled out polling early (it wastes energy and has higher latency), and picked glib as it seemed easier to integrate than the abstraction of sofia-sip.

After sofia-sip enables the IP_RECVERR option it is doing:

events |= SU_WAIT_ERR;

This is then registered with g_source_add_poll, e.g. like:

  0xb7fd7573 in su_source_register (self=0x807d564, root=0x807d858, wait=0xbfffef2c, callback=0xb7f45fdd <tport_wakeup_pri>, arg=0x8081548, priority=0) at su_source.c:650
  650 g_source_add_poll(self->sup_source, (GPollFD*)&self->sup_waits[n]);
  (gdb) p *(su_wait_t *) 0x8081860
  $2 = {fd = 5, events = 9, revents = 0}

So SU_WAIT_ERR is set, but when glib is calling us:

  (gdb) p fds[2]
  $23 = {fd = 5, events = 1, revents = 0}

I am currently figuring out where it is mapped and lost and will then try to find a conclusion for this.

holger

Holger Freyther

10:08 p.m.

...

On 6 Mar 2017, at 21:51, Holger Freyther holger@freyther.de wrote:

I am currently figuring out where it is mapped and lost and will then try to find a conclusion for this.

glib/gmain.c:g_main_context_query

  /* In direct contradiction to the Unix98 spec, IRIX runs into
   * difficulty if you pass in POLLERR, POLLHUP or POLLNVAL
   * flags in the events field of the pollfd while it should
   * just ignoring them. So we mask them out here.
   */
  events = pollrec->fd->events & ~(G_IO_ERR|G_IO_HUP|G_IO_NVAL);

in sofia-sip:

#define SU_WAIT_ERR POLLERR
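The effect of that mask on the values from the gdb output can be sketched as follows. This is only an illustration: it assumes the G_IO_* flags carry the same values as the poll(2) constants (as they do on Linux, where POLLIN is 0x001 and POLLERR is 0x008), and the function names are hypothetical.

```c
#include <assert.h>
#include <poll.h>

/* Mimics the masking quoted from g_main_context_query():
 * error-type flags are stripped from the requested events. */
short mask_requested_events(short events)
{
	return (short)(events & ~(POLLERR | POLLHUP | POLLNVAL));
}

/* Returns 1 if a request for POLLERR got lost in the masking, else 0. */
int pollerr_request_lost(short requested)
{
	return (requested & POLLERR) &&
	       !(mask_requested_events(requested) & POLLERR);
}
```

With these Linux values, the registered events = 9 (POLLIN | POLLERR) turns into the events = 1 (POLLIN) that shows up when glib calls back.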

We should probably map this to POLLPRI, but I am not sure how to fix this up. Maybe copy the glib code and tweak it, or start implementing a direct osmocom version... I am not sure how to progress and whether people are willing to patch sofia-sip or not. :(

holger

Holger Freyther

7 Mar

3:16 p.m.

...

On 6 Mar 2017, at 22:08, Holger Freyther holger@freyther.de wrote:

Hi!

...

we should probably map this POLLPRI.. not sure how to fix this up. Maybe copy the glib code and tweak it, start implementing a direct osmocom version... Not sure how to progress and if people are willing to patch sofia-sip or not. :(

So there is a simple workaround to always signal POLLERR, and two changes to map select to poll instead of poll to select. They seem to work but are bigger than the simple workaround.

All can be seen in the osmo-sip-connector project on gerrit.osmocom.org. If you could try either of the two, that would be nice.

holger

Harald Welte

4:03 p.m.

Hi Holger,

On Tue, Mar 07, 2017 at 03:16:01PM +0100, Holger Freyther wrote:

...

so there is a simple workaround to always signal POLLERR and two changes to map select to poll instead of poll to select. They seem to work but are bigger than the simple workaround.

What's the problem / disadvantage with the simple POLLERR hack/workaround? Yes, it's not elegant, but it should come without any performance overhead, as poll will only return when actual errors are pending in the queue.

--
- Harald Welte laforge@gnumonks.org http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
(ETSI EN 300 175-7 Ch. A6)

Holger Freyther

4:40 p.m.

...

On 7 Mar 2017, at 16:03, Harald Welte laforge@gnumonks.org wrote:

Hi Holger,

Hey LaForge,

...

what's the problem / disadvantage with the simple POLLERR hack/workaround? Yes, it's not elegant, but it should come without any performance overhead, as poll will only return when actual errors are pending in the queue.

The logic is FD_ISSET(fd, &readset) => POLLIN | POLLERR. We speculate that there is an error for every readable fd and force another recvmsg call.
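A minimal sketch of that mapping, assuming a select-based loop that hands poll-style results to its consumer, could look like this. The function name readset_to_revents is hypothetical and this is not the code from the gerrit changes.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>

/* Translate a select() read set into poll-style revents.
 * Every readable fd is also flagged with POLLERR, speculating that
 * an error may be queued so the consumer issues another recvmsg call.
 * Returns the number of entries that were flagged. */
int readset_to_revents(const fd_set *readset, struct pollfd *fds, size_t nfds)
{
	int ready = 0;
	size_t i;

	assert(readset != NULL && fds != NULL);
	for (i = 0; i < nfds; i++) {
		fds[i].revents = 0;
		/* fds outside the fd_set range cannot be tested safely */
		if (fds[i].fd < 0 || fds[i].fd >= FD_SETSIZE)
			continue;
		if (FD_ISSET(fds[i].fd, readset)) {
			fds[i].revents = POLLIN | POLLERR;
			ready++;
		}
	}
	return ready;
}
```

The cost of the speculation is one extra error-queue read per readable event, which only happens when there is work to do anyway.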

Not an act of beauty, but it is only done when there is work anyway and is not performance critical. At the same time, emulating select with poll seems to be working(tm), so we might use that.

holger

Harald Welte

4:06 p.m.

Hi Holger,

On Mon, Mar 06, 2017 at 10:08:57PM +0100, Holger Freyther wrote:

...

glib/gmain.c:g_main_context_query

  /* In direct contradiction to the Unix98 spec, IRIX runs into
   * difficulty if you pass in POLLERR, POLLHUP or POLLNVAL
   * flags in the events field of the pollfd while it should
   * just ignoring them. So we mask them out here.
   */
  events = pollrec->fd->events & ~(G_IO_ERR|G_IO_HUP|G_IO_NVAL);

I think we should try to get this fixed upstream. If there's an IRIX work-around, then it should be a compile-time decision and only enabled on IRIX, right?

It won't help the problem in the short term, as a fixed/updated glib would first have to dissipate through their next release, get picked up by distributions, etc. - but sooner or later somebody else will run into the same trap, with glib disabling POLLERR on Linux and thus destroying quite a bit of capability the operating system offers.

Holger Freyther

4:47 p.m.

...

On 7 Mar 2017, at 16:06, Harald Welte laforge@gnumonks.org wrote:

Hi Holger,

Hey!

...

I think we should try to get this fixed upstream. If there's an IRIX work-around, then it should be a compile-time decision and only enabled on IRIX, right?

It won't help the problem in the short term, as a fixed/updated glib would first have to dissipate through their next release, get picked up by distributions, etc. - but sooner or later somebody else will run into the same trap, with glib disabling POLLERR on Linux and thus destroying quite a bit of capability the operating system offers.

I didn't know how poll works (hehe, used select all my life). Putting these into pollfd.events has no effect; only the kernel will set them in revents. For us they would indicate the intention of sofia-sip, but then the RECVERR will not be signaled in the exception set unless we set another socket option (for a socket we can't access directly).
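The poll behaviour described above can be checked with a small sketch that does not involve sockets at all. It relies on the Linux behaviour that the write end of a pipe reports POLLERR once the read end is closed; the function name is hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Polls the write end of a pipe whose read end is already closed,
 * requesting no events at all. Returns the revents reported by the
 * kernel, or -1 on failure. */
int revents_with_empty_request(void)
{
	int p[2];
	struct pollfd pfd;
	int rc;

	if (pipe(p) < 0)
		return -1;
	close(p[0]);        /* no reader left: the writer is in an error state */

	pfd.fd = p[1];
	pfd.events = 0;     /* POLLERR is deliberately not requested */
	pfd.revents = 0;
	rc = poll(&pfd, 1, 0);
	close(p[1]);

	return rc < 0 ? -1 : pfd.revents;
}
```

On Linux the returned value has POLLERR set even though events was 0, so masking POLLERR out of the requested events does not by itself stop poll from reporting errors.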

So the joke is on me and I didn't know what I was doing when implementing poll with select. So...

a.) We do POLLIN | POLLERR when a socket becomes readable
b.) We merge the code to use ppoll (added in Linux 2.6.16)

cheers holger


openbsc@lists.osmocom.org
