If the SIP server dies in the middle of a call, osmo-sip-connector is in a bad state and generates a never ending stream of error messages:
tport(0xb7e9f258): zero length packettport(0xb7e9f258): zero length packettport(0xb7e9f258): zero length packet [... the same message repeats without any line break ...]
The messages appear to be generated by sofia-sip, and I have managed to suppress them by setting the environment variable SU_DEBUG to a value below 3: http://sofia-sip.sourcearchive.com/documentation/1.12.7/tport__internal_8h_e...
However, osmo-sip-connector is clearly in a bad state when this happens, and we need a way to detect and release these ghosted calls.
On 25 Jan 2017, at 18:06, OMAR RAMADAN omar.ramadan@berkeley.edu wrote:
If the SIP server dies in the middle of a call, osmo-sip-connector is in a bad state and generates a never ending stream of error messages:
Can you reliably reproduce it? It seems sofia-sip is struggling with some input to it and goes crazy after that. I lack a stable way to reproduce it. The lack of \n in that message is annoying too. :(
holger
I've seen it a few times in production already, and it filled the disk. You should be able to reproduce it by killing an active RTP stream. I have been using FreeSWITCH, but I don't imagine it is limited to this SIP server. It looks like sofia-sip keeps trying to receive media and gets nothing back, when the call should be terminated.
Omar,
Just curious - is there any reason you're running RTP through the osmo-sip-connector instead of directly to FreeSWITCH?
Please excuse typos. Written with a touchscreen keyboard.
-- Regards, Alexander Chemeris CEO Fairwaves, Inc. https://fairwaves.co
Good point Alexander, I just realized that the connector is only handling the signaling. What is probably happening is that a call is trying to be established to a dead SIP server. But we should still terminate the call cleanly.
On 26 Jan 2017, at 00:31, OMAR RAMADAN omar.ramadan@berkeley.edu wrote:
I've seen it a few times in production already and it filled the disk. You should be able to reproduce it by killing an active RTP stream. I have been using freeswitch, but I don't imagine it is limited to this SIP server. It looks like sofia-sip is driven to continue to receiving media and gets nothing back while the call should be terminated.
That sucks. I intend to look at early media in the connector and try to reproduce the issue as well. I have seen it once when a NAT was involved, too.
holger
On 26 Jan 2017, at 21:18, Holger Freyther holger@freyther.de wrote:
Hi
that sucks. I intend to look at early media in the connector and try to reproduce the issue as well, I have seen it once when a NAT was involved too..
I am traveling in south east asia right now and my test setup is quite limited but I read a bit of sofia-sip code, the glib integration and our event loop integration and while I couldn't reproduce it, there are things to improve in our eventloop code. If you are building from source you could pull this[1] to give it a try. I intend to submit the changes on Tuesday morning (giving europe/us time to review on a business day). Nightly packages should be available soon.
thank you holger
[1] git pull https://gerrit.osmocom.org/osmo-sip-connector refs/changes/97/1797/1
On 12 Feb 2017, at 11:23, Holger Freyther holger@freyther.de wrote:
Dear Omar,
I am traveling in south east asia right now and my test setup is quite limited but I read a bit of sofia-sip code, the glib integration and our event loop integration and while I couldn't reproduce it, there are things to improve in our eventloop code. If you are building from source you could pull this[1] to give it a try. I intend to submit the changes on Tuesday morning (giving europe/us time to review on a business day). Nightly packages should be available soon.
did you have time to see if the updated eventloop code is improving the situation?
thank you holger
On 3 Mar 2017, at 02:30, Holger Freyther holger@freyther.de wrote:
did you have time to see if the updated eventloop code is improving the situation?
I was working on a manual testcase for DTMF handling.. and I am able to reproduce this now. Will analyze it now and then return to DTMF.
holger
On 6 Mar 2017, at 15:28, Holger Freyther holger@freyther.de wrote:
Hi!
I was working on a manual testcase for DTMF handling.. and I am able to reproduce this now. Will analyze it now and then return to DTMF.
I don't have a fix yet but a workaround. One can patch sofia-sip to not use the IP_RECVERR sockopt. Either patch it out or configure with an ac_... var to not enable the feature.
Sofia SIP uses the IP_RECVERR socket option to learn whether an error occurred on send (or afterwards, e.g. ICMP unreachable). In my case I tried to send the INVITE to 127.0.0.2:5060; the kernel knows that no one is there and enqueues an error. The error can only be read with recvmsg+MSG_ERRQUEUE. This happens in sofia-sip in tport_udp_error, which is called by the tport_error_event function.
Now to the issue. There are three ways to use sofia-sip:
* Just call it to poll every X units of time (like LCR)
* Implement the complex vtable for event loop integration
* Integrate with glib to have the vtable thing work
I had ruled out polling early (it wastes energy and has higher latency), and picked glib as it seemed easier to integrate than the abstraction of sofia-sip.
After sofia-sip enables the IP_RECVERR option it is doing:
events |= SU_WAIT_ERR;
This is then registered with g_source_add_poll, e.g. like:
0xb7fd7573 in su_source_register (self=0x807d564, root=0x807d858, wait=0xbfffef2c, callback=0xb7f45fdd <tport_wakeup_pri>, arg=0x8081548, priority=0) at su_source.c:650
650 g_source_add_poll(self->sup_source, (GPollFD*)&self->sup_waits[n]);
(gdb) p *(su_wait_t *) 0x8081860
$2 = {fd = 5, events = 9, revents = 0}
so SU_WAIT_ERR is set.. but when glib is calling us:
(gdb) p fds[2]
$23 = {fd = 5, events = 1, revents = 0}
I am currently figuring out where it is mapped and lost and will then try to find a conclusion for this.
holger
On 6 Mar 2017, at 21:51, Holger Freyther holger@freyther.de wrote:
On 6 Mar 2017, at 15:28, Holger Freyther holger@freyther.de wrote:
Hi!
I am currently figuring out where it is mapped and lost and will then try to find a conclusion for this.
glib/gmain.c:g_main_context_query
/* In direct contradiction to the Unix98 spec, IRIX runs into
 * difficulty if you pass in POLLERR, POLLHUP or POLLNVAL
 * flags in the events field of the pollfd while it should
 * just ignoring them. So we mask them out here.
 */
events = pollrec->fd->events & ~(G_IO_ERR|G_IO_HUP|G_IO_NVAL);
in sofia-sip:
#define SU_WAIT_ERR POLLERR
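Assuming the G_IO_* flags have the same values as the corresponding POLL* flags (which should be the case on Linux), the effect of this mask on the events registered by sofia-sip can be reproduced in isolation. The function name below is made up for this sketch; it is not glib code.

```c
#include <poll.h>

/* Hypothetical stand-in for the masking step quoted from
 * g_main_context_query; POLLERR/POLLHUP/POLLNVAL are assumed to equal
 * G_IO_ERR/G_IO_HUP/G_IO_NVAL. */
short mask_events_like_glib(short events)
{
    return (short)(events & ~(POLLERR | POLLHUP | POLLNVAL));
}
```

With the Linux values POLLIN = 1 and POLLERR = 8, the registered events = 9 comes out as events = 1, which matches the two gdb printouts shown earlier.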
We should probably map this to POLLPRI, but I am not sure how to fix this up. Maybe copy the glib code and tweak it, or start implementing a direct osmocom version... Not sure how to progress and whether people are willing to patch sofia-sip or not. :(
holger
On 6 Mar 2017, at 22:08, Holger Freyther holger@freyther.de wrote:
Hi!
We should probably map this to POLLPRI, but I am not sure how to fix this up. Maybe copy the glib code and tweak it, or start implementing a direct osmocom version... Not sure how to progress and whether people are willing to patch sofia-sip or not. :(
so there is a simple workaround to always signal POLLERR and two changes to map select to poll instead of poll to select. They seem to work but are bigger than the simple workaround.
All can be seen in the osmo-sip-connector project on gerrit.osmocom.org. If you could try either of the two, that would be nice.
holger
Hi Holger,
On Tue, Mar 07, 2017 at 03:16:01PM +0100, Holger Freyther wrote:
so there is a simple workaround to always signal POLLERR and two changes to map select to poll instead of poll to select. They seem to work but are bigger than the simple workaround.
What's the problem / disadvantage with the simple POLLERR hack/workaround? Yes, it's not elegant, but it should come without any performance overhead, as poll will only return when actual errors are pending in the queue.
On 7 Mar 2017, at 16:03, Harald Welte laforge@gnumonks.org wrote:
Hi Holger,
Hey LaForge,
What's the problem / disadvantage with the simple POLLERR hack/workaround? Yes, it's not elegant, but it should come without any performance overhead, as poll will only return when actual errors are pending in the queue.
The logic is FD_ISSET(fd, &readset) => POLLIN | POLLERR. We speculate that there is an error for every readable socket and force another recvmsg call.
Not an act of beauty, but it is only done when there is work anyway and is not performance critical. At the same time, the change that emulates select with poll seems to be working(tm), so we might use that.
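The mapping described above can be sketched as a small C function. This only illustrates the rule FD_ISSET(fd, &readset) => POLLIN | POLLERR and is not the actual osmo-sip-connector change; the function name and the handling of the write set are assumptions made for the example.

```c
#include <poll.h>
#include <sys/select.h>

/* Hypothetical translation of select() results into poll-style revents.
 * A readable fd is reported as POLLIN | POLLERR, so the consumer also
 * checks the error queue whenever there is work to do anyway. */
short revents_from_select(int fd, fd_set *readset, fd_set *writeset)
{
    short revents = 0;

    if (FD_ISSET(fd, readset))
        revents |= POLLIN | POLLERR;  /* speculate that an error may be pending */
    if (FD_ISSET(fd, writeset))
        revents |= POLLOUT;           /* assumed mapping, not taken from the thread */
    return revents;
}
```

The cost of the speculation is one extra error-queue read per readable event, which comes back empty when no error is pending.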
holger
Hi Holger,
On Mon, Mar 06, 2017 at 10:08:57PM +0100, Holger Freyther wrote:
glib/gmain.c:g_main_context_query
/* In direct contradiction to the Unix98 spec, IRIX runs into
 * difficulty if you pass in POLLERR, POLLHUP or POLLNVAL
 * flags in the events field of the pollfd while it should
 * just ignoring them. So we mask them out here.
 */
events = pollrec->fd->events & ~(G_IO_ERR|G_IO_HUP|G_IO_NVAL);
I think we should try to get this fixed upstream. If there's an IRIX work-around, then it should be a compile-time decision and only enabled on IRIX, right?
It won't help the problem in the short term, as a fixed/updated glib would first have to make it into their next release, get picked up by distributions, etc. - but sooner or later somebody else will run into the same trap, with glib disabling POLLERR on Linux and thus destroying quite a bit of capability the operating system offers.
On 7 Mar 2017, at 16:06, Harald Welte laforge@gnumonks.org wrote:
Hi Holger,
Hey!
I think we should try to get this fixed upstream. If there's an IRIX work-around, then it should be a compile-time decision and only enabled on IRIX, right?
It won't help the problem in the short term, as a fixed/updated glib would first have to make it into their next release, get picked up by distributions, etc. - but sooner or later somebody else will run into the same trap, with glib disabling POLLERR on Linux and thus destroying quite a bit of capability the operating system offers.
I didn't know how poll works (hehe, used select all my life). Putting these flags into pollfd.events has no effect; only the kernel will set them in revents. For us they would indicate the intention of sofia-sip, but then the RECVERR will not be signaled in the exception set unless we set another socket option (for a socket we can't access directly).
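The point that error flags show up in revents regardless of what was put into events can be checked with a small experiment. The sketch below uses a pipe rather than a UDP socket to keep it self-contained; the function name is made up, and the exact flag reported (POLLERR here) is Linux behaviour.

```c
#include <poll.h>
#include <unistd.h>

/* Returns the revents poll() reports for the write end of a pipe whose
 * read end has been closed, while requesting no events at all.
 * Returns -1 if creating the pipe or calling poll() fails. */
int revents_without_requested_events(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;
    close(fds[0]);  /* closing the reader puts the write end in an error state */

    struct pollfd pfd = { .fd = fds[1], .events = 0, .revents = 0 };
    int n = poll(&pfd, 1, 0);
    close(fds[1]);
    return n < 0 ? -1 : pfd.revents;
}
```

On Linux the returned value should have POLLERR set even though events was 0, so masking POLLERR out of events does not by itself stop the kernel from reporting it.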
So the joke is on me and I didn't know what I was doing when implementing poll with select. So...
a.) We do POLLIN | POLLERR when a socket becomes readable
b.) We merge the code to use ppoll (added in Linux 2.6.16)
cheers holger