Hi all
In some communities we are using 5Ghz wifi links between the box running openBSC and the sbts2050.
In often works well for some time, but then we get plagued with BROKEN_UNUSABLE channels: I cannot say exactly why, probably rain, solar radiation, aliens, goblins.. etc. I have yet to actually be able to watch it happen. I usually just discover the problem, correct it and then it works fine until I get bored watching.
BTS 1, TRX 0, Timeslot 1, Lchan 2: Type NONE Connection: 0, State: BROKEN UNUSABLE Error reason: activation timeout
This requires then either
A) use the drop command in the vty B) restart the osmo-bts process on the 2050 (same thing as A really) C) restart osmo-nitb
All these things are ugly, as they will drop other calls on the BTS at the time.
The question is, would somebody help make this more robust, as in have it self repair?
Initial questions:
Why does the broken unusable channel stay as such until manual intervention, Is is just because the code has not been written to monitor and recover, or is there some other reason? I am thinking that nobody ever did this, probably all testing and installation has been done using nice reliable cabled ethernet and not ugly nasty packet dropping 3km wifi links. :-)
Thanks!
On 17 Aug 2016, at 13:11, Keith keith@rhizomatica.org wrote:
Initial questions:
Why does the broken unusable channel stay as such until manual intervention, Is is just because the code has not been written to monitor and recover, or is there some other reason? I am thinking that nobody ever did this, probably all testing and installation has been done using nice reliable cabled ethernet and not ugly nasty packet dropping 3km wifi links. :-)
A RF Channel Release NACKed, or two responses or no response. With equipment like a nanoBTS you just don't know what the state of the channel is.
/* * The BTS didn't respond within the timeout to our channel * release request and we have marked the channel as broken. * Now we do receive an ACK and let's be conservative. If it * is a sysmoBTS we know that only one RF Channel Release ACK * will be sent. So let's "repair" the channel. */ if (lchan->state == LCHAN_S_BROKEN) { int do_free = is_sysmobts_v2(lchan->ts->trx->bts); LOGP(DRSL, LOGL_NOTICE, "%s CHAN REL ACK for broken channel. %s.\n", gsm_lchan_name(lchan), do_free ? "Releasing it" : "Keeping it broken"); if (do_free) do_lchan_free(lchan); return 0; }
so I assume you want to put "type sysmobts" into the OpenBSC config. IIRC I added that specially for Rhizomatica.
cheers holger
On 17/08/2016 13:22, Holger Freyther wrote:
so I assume you want to put "type sysmobts" into the OpenBSC config. IIRC I added that specially for Rhizomatica.
It's there. So, something else is up. I will try to get some more data to work with. I'm thinking the best thing to do is setup a lab using tc or something to mimic the nasty wifi link.
Thanks!
On 17 Aug 2016, at 13:34, Keith keith@rhizomatica.org wrote:
On 17/08/2016 13:22, Holger Freyther wrote:
so I assume you want to put "type sysmobts" into the OpenBSC config. IIRC I added that specially for Rhizomatica.
It's there. So, something else is up. I will try to get some more data to work with. I'm thinking the best thing to do is setup a lab using tc or something to mimic the nasty wifi link.
Maybe not the version of osmo-bts you run but I think later versions mark channels as broken as well (e.g. if the DSP did something odd).
*If* the channel release ack (or activation) is just delayed it should clean-up properly if "type sysmobts" is set in the config.
holger
On 17/08/2016 21:43, Holger Freyther wrote:
Maybe not the version of osmo-bts you run but I think later versions mark channels as broken as well (e.g. if the DSP did something odd). *If* the channel release ack (or activation) is just delayed it should clean-up properly if "type sysmobts" is set in the config.
type sysmobts is set in the config. Unfortunately, in a new installation (https://twitter.com/ninalakhani/status/765250788046143488) with a wifi link of some 3 KM or so we are ending up with this kind of situation after a short time:
OpenBSC# show lchan su BTS 0, TRX 0, Timeslot 1, Lchan 0, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -101 dBm RXL-FULL-ul: -104 dBm BTS 0, TRX 0, Timeslot 1, Lchan 1, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 1, Lchan 2, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 1, Lchan 3, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 1, Lchan 4, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 1, Lchan 5, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 1, Lchan 6, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 1, Lchan 7, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -104 dBm RXL-FULL-ul: -106 dBm BTS 0, TRX 0, Timeslot 2, Lchan 0, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 3, Lchan 0, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 4, Lchan 0, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 5, Lchan 0, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 0, TRX 0, Timeslot 6, Lchan 0, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -87 dBm RXL-FULL-ul: -91 dBm BTS 0, TRX 0, Timeslot 7, Lchan 0, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 1, TRX 0, Timeslot 1, Lchan 0, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 1, TRX 0, Timeslot 1, Lchan 1, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -87 dBm RXL-FULL-ul: -95 dBm BTS 1, TRX 0, Timeslot 1, Lchan 2, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 1, TRX 0, Timeslot 1, Lchan 3, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -78 dBm RXL-FULL-ul: -82 dBm BTS 1, TRX 0, Timeslot 1, Lchan 4, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 1, TRX 0, Timeslot 1, Lchan 5, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -101 dBm RXL-FULL-ul: -98 dBm BTS 1, TRX 0, Timeslot 1, Lchan 6, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 1, TRX 0, Timeslot 1, Lchan 7, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -96 dBm RXL-FULL-ul: -95 dBm BTS 1, TRX 0, Timeslot 3, Lchan 0, Type NONE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm BTS 1, TRX 0, Timeslot 4, Lchan 0, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -75 dBm RXL-FULL-ul: -83 dBm BTS 1, TRX 0, Timeslot 5, Lchan 0, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -93 dBm RXL-FULL-ul: -94 dBm BTS 1, TRX 0, Timeslot 6, Lchan 0, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -95 dBm RXL-FULL-ul: -105 dBm BTS 1, TRX 0, Timeslot 7, Lchan 0, Type NONE - L1 MS Power: 33 dBm RXL-FULL-dl: -78 dBm RXL-FULL-ul: -77 dBm
During all the much time I have been watching LOG output, I cannot recall ever having seen: CHAN REL ACK for broken channel. Releasing it nor: CHAN REL ACK for broken channel. Keeping it broken
Can I ask, What triggers the eventual calling of rsl_rx_rf_chan_rel_ack(), (I tried to trace it back but get a bit lost.) I do not understand the logic, as it would seem that once the REL ACK is lost, then it is lost, right? It would seem that something in openBSC needs to check for broken channels and release them. But then.. my understanding here is limited.
Any help very much appreciated
holger
On 17 Aug 2016, at 22:27, Keith keith@rhizomatica.org wrote:
On 17/08/2016 21:43, Holger Freyther wrote:
Maybe not the version of osmo-bts you run but I think later versions mark channels as broken as well (e.g. if the DSP did something odd). *If* the channel release ack (or activation) is just delayed it should clean-up properly if "type sysmobts" is set in the config.
type sysmobts is set in the config. Unfortunately, in a new installation (https://twitter.com/ninalakhani/status/765250788046143488) with a wifi link of some 3 KM or so we are ending up with this kind of situation after a short time:
OpenBSC# show lchan su BTS 0, TRX 0, Timeslot 1, Lchan 0, Type NONE - L1 MS Power: 33 dBm
type none? What OpenBSC version is that? So just the none already looks broken.
Can I ask, What triggers the eventual calling of rsl_rx_rf_chan_rel_ack(), (I tried to trace it back but get a bit lost.) I do not understand the logic, as it would seem that once the REL ACK is lost, then it is lost, right? It would seem that something in openBSC needs to check for broken channels and release them. But then.. my understanding here is limited.
the actual arrival of the REL_ACK message.
holger
On 17 Aug 2016, at 23:35, Holger Freyther holger@freyther.de wrote:
OpenBSC# show lchan su BTS 0, TRX 0, Timeslot 1, Lchan 0, Type NONE - L1 MS Power: 33 dBm
type none? What OpenBSC version is that? So just the none already looks broken.
So please do the following for reference:
* Create a ticket in the Osmocom tracker * List libosmocore/libosmo-abis/../OpenBSC/osmo-bts git commit (+patches if it makes sense) * Attach both BTS and NITB config * PCAP file (if it doesn't include personal data, otherwise try to pseudomize it or share out of band).
thank you holger
On 17/08/2016 23:52, Holger Freyther wrote:
So please do the following for reference:
- Create a ticket in the Osmocom tracker
OK
- List libosmocore/libosmo-abis/../OpenBSC/osmo-bts git commit (+patches if it makes sense)
Unknown, as I didn't build it, but we could dispense with this, and I can just compile from a know good/recent commit and go from there.
- Attach both BTS and NITB config
- PCAP file (if it doesn't include personal data, otherwise try to pseudomize it or share out of band).
I think specifically a capture of the point at which this kind of thing starts to happen would be the interesting part, although I have to sit around for a while to get it. Not sure how to filter then the resulting pcap to just that part but I'll figure it out.
<0004> abis_rsl.c:630 (bts=1,trx=0,ts=1,ss=0) is back in operation. <0004> abis_rsl.c:615 (bts=0,trx=0,ts=1,ss=0) DEACTivate SACCH CMD <0004> abis_rsl.c:1102 (bts=0,trx=0,ts=1,ss=0): MEAS RES for inactive channel <0004> abis_rsl.c:1330 (bts=0,trx=0,ts=1,ss=0) SACCH deactivation timeout. <0004> abis_rsl.c:661 (bts=0,trx=0,ts=1,ss=0) RF Channel Release CMD due error 1 <0004> abis_rsl.c:1102 (bts=0,trx=0,ts=1,ss=0): MEAS RES for inactive channel <0004> abis_rsl.c:223 (bts=0,trx=0,ts=1,ss=0) Timeout during deactivation! Marked as broken.
now, Just as soon as what I said in a previous mail about not having seen it, those rsl log entries are followed by
<0004> abis_rsl.c:717 (bts=0,trx=0,ts=1,ss=0) RF CHANNEL RELEASE ACK <0004> abis_rsl.c:735 (bts=0,trx=0,ts=1,ss=0) CHAN REL ACK for broken channel. Releasing it.
However, this follows in the log and these remain broken "forever"
<0004> abis_rsl.c:1198 (bts=0,trx=0,ts=1,ss=1) CHANNEL ACTIVATE ACK <0004> abis_rsl.c:949 (bts=0,trx=0,ts=1,ss=1) CHAN ACT ACK for broken channel. <0004> abis_rsl.c:1198 (bts=0,trx=0,ts=1,ss=2) CHANNEL ACTIVATE ACK <0004> abis_rsl.c:949 (bts=0,trx=0,ts=1,ss=2) CHAN ACT ACK for broken channel. <0004> abis_rsl.c:1198 (bts=0,trx=0,ts=1,ss=3) CHANNEL ACTIVATE ACK <0004> abis_rsl.c:949 (bts=0,trx=0,ts=1,ss=3) CHAN ACT ACK for broken channel. <0004> abis_rsl.c:1198 (bts=0,trx=0,ts=1,ss=4) CHANNEL ACTIVATE ACK <0004> abis_rsl.c:949 (bts=0,trx=0,ts=1,ss=4) CHAN ACT ACK for broken channel.
On 18 Aug 2016, at 00:28, Keith keith@rhizomatica.org wrote:
Good Morning Keith,
- List libosmocore/libosmo-abis/../OpenBSC/osmo-bts git commit (+patches if it makes sense)
Unknown, as I didn't build it, but we could dispense with this, and I can just compile from a know good/recent commit and go from there.
well, dpkg -l and changelogs should give you an idea what rhizomatica has deployed. But without knowing which versions are in place, it is extremely difficult to look.
Same for the BTS and firmware version (opkg list_installed). E.g. it wouldn't make sense to chase issues in old firmware releases.
- Attach both BTS and NITB config
missing. Very crucial part of the puzzle. My assumptions right now:
* Do you use depends-on-bts to add dependency between BTS #1 and BTS #2 of the sysmoBTS 2050?
- PCAP file (if it doesn't include personal data, otherwise try to pseudomize it or share out of band).
I think specifically a capture of the point at which this kind of thing starts to happen would be the interesting part, although I have to sit around for a while to get it. Not sure how to filter then the resulting pcap to just that part but I'll figure it out.
a good point will be to filter for "bts=0, trx=0, ts=1, ss=0" in RSL packages.
<0004> abis_rsl.c:630 (bts=1,trx=0,ts=1,ss=0) is back in operation. <0004> abis_rsl.c:615 (bts=0,trx=0,ts=1,ss=0) DEACTivate SACCH CMD <0004> abis_rsl.c:1102 (bts=0,trx=0,ts=1,ss=0): MEAS RES for inactive channel <0004> abis_rsl.c:1330 (bts=0,trx=0,ts=1,ss=0) SACCH deactivation timeout. <0004> abis_rsl.c:661 (bts=0,trx=0,ts=1,ss=0) RF Channel Release CMD due error 1 <0004> abis_rsl.c:1102 (bts=0,trx=0,ts=1,ss=0): MEAS RES for inactive channel <0004> abis_rsl.c:223 (bts=0,trx=0,ts=1,ss=0) Timeout during deactivation! Marked as broken.
So SACCH didn't detect signal, error handling, channel release and then no answer. If you enable timestamps in the printing we would see how much time has passed. But it must be at least four seconds.
now, Just as soon as what I said in a previous mail about not having seen it, those rsl log entries are followed by
<0004> abis_rsl.c:717 (bts=0,trx=0,ts=1,ss=0) RF CHANNEL RELEASE ACK <0004> abis_rsl.c:735 (bts=0,trx=0,ts=1,ss=0) CHAN REL ACK for broken channel. Releasing it.
So here it "repaired it"
do_lchan_free() ... } else { rsl_lchan_set_state(lchan, LCHAN_S_NONE); } lchan_free(lchan);
However, this follows in the log and these remain broken "forever"
<0004> abis_rsl.c:1198 (bts=0,trx=0,ts=1,ss=1) CHANNEL ACTIVATE ACK <0004> abis_rsl.c:949 (bts=0,trx=0,ts=1,ss=1) CHAN ACT ACK for broken channel.
We are missing the activation. But only non-broken channels should be allocated. So it looks like the BTS(firmware) had enough? We can add the same "freeing" to the channel_act_ack as well. But you should have a look if:
* Delay is added by network * Delay is added by BTS being too busy
holger
PS: With TCP it is not possible for something that has been written into the socket to arrive out-of-order. It might be that the BTS has not written into it (but that doesn't seem to be the case either).
On 18 Aug 2016, at 08:38, Holger Freyther holger@freyther.de wrote:
We are missing the activation. But only non-broken channels should be allocated. So it looks like the BTS(firmware) had enough? We can add the same "freeing" to the channel_act_ack as well. But you should have a look if:
https://gerrit.osmocom.org/#/c/713/ for the untested activation fix-up. Do you have a gerrit log-in? For a ticket like this it would be nice to be able to put you as reviewer to make you aware of the change.
There are two patchsets, one before dynts was completed and one for latest master.
holger
On 17/08/2016 23:35, Holger Freyther wrote:
OpenBSC# show lchan su BTS 0, TRX 0, Timeslot 1, Lchan 0, Type NONE - L1 MS Power: 33 dBm
type none? What OpenBSC version is that? So just the none already looks broken.
Sure, If you look at the non summary version of lchan, you see the likes of BTS 1, TRX 0, Timeslot 7, Lchan 0: Type NONE Connection: 0, State: BROKEN UNUSABLE Error reason: activation timeout BS Power: 37 dBm, MS Power: 33 dBm No Subscriber Bound IP: 0.0.0.0 Port 0 RTP_TYPE2=0 CONN_ID=0 Measurement Report: Flags: DLinval RXL-FULL-ul: -110 dBm, RXL-SUB-ul: -110 dBm RXQ-FULL-ul: 0, RXQ-SUB-ul: 0 osmo-nitb tells me it is OpenBSC version 0.14.0-dirty
You mentioned off list about the sbts2050 software, yes I'm afraid it's 201208-testing.
Can I ask, What triggers the eventual calling of rsl_rx_rf_chan_rel_ack(),
the actual arrival of the REL_ACK message.
So it won't ever be called, once it's "lost" What I don't understand is how it can get lost. Or arrive out of order of something, that's not possible with tcp is it? I have tried momentarily interrupting the connection BSC<->BTS with iptables but I cannot as yet provoke the event that causes this.. ah just now it happens again (meanwhile I am watching a straight ping to the bts (for what it's worth) and I see nothing unusual there. Maybe this actually has not got to do with the wifi link after all, and that is just coincidence?
<0004> abis_rsl.c:717 (bts=0,trx=0,ts=1,ss=0) RF CHANNEL RELEASE ACK <0004> abis_rsl.c:735 (bts=0,trx=0,ts=1,ss=0) CHAN REL ACK for broken channel. Releasing it. <0004> abis_rsl.c:211 (bts=1,trx=0,ts=1,ss=1) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=1,trx=0,ts=1,ss=0) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=1,trx=0,ts=1,ss=2) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=1,trx=0,ts=1,ss=3) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=1,trx=0,ts=1,ss=4) Timeout during activation. Marked as broken. <0004> abis_rsl.c:1464 (bts=1,trx=0,ts=1,ss=5) Activating ARFCN(251) SS(5) lctype SDCCH r=LOCATION_UPDATE ra=0x0b ta=3 <0004> abis_rsl.c:1464 (bts=1,trx=0,ts=1,ss=6) Activating ARFCN(251) SS(6) lctype SDCCH r=LOCATION_UPDATE ra=0x0f ta=3 <0004> abis_rsl.c:1464 (bts=1,trx=0,ts=1,ss=7) Activating ARFCN(251) SS(7) lctype SDCCH r=LOCATION_UPDATE ra=0x0a ta=3 <0004> abis_rsl.c:1464 (bts=0,trx=0,ts=1,ss=0) Activating ARFCN(249) SS(0) lctype SDCCH r=LOCATION_UPDATE ra=0x0c ta=2 <0004> abis_rsl.c:1464 (bts=0,trx=0,ts=1,ss=1) Activating ARFCN(249) SS(1) lctype SDCCH r=OTHER ra=0x1c ta=1 <0004> abis_rsl.c:1464 (bts=0,trx=0,ts=1,ss=2) Activating ARFCN(249) SS(2) lctype SDCCH r=LOCATION_UPDATE ra=0x0d ta=2 <0004> abis_rsl.c:1464 (bts=0,trx=0,ts=1,ss=3) Activating ARFCN(249) SS(3) lctype SDCCH r=OTHER ra=0x10 ta=1 <0004> abis_rsl.c:1464 (bts=0,trx=0,ts=1,ss=4) Activating ARFCN(249) SS(4) lctype SDCCH r=LOCATION_UPDATE ra=0x07 ta=2 <0004> abis_rsl.c:1464 (bts=0,trx=0,ts=1,ss=5) Activating ARFCN(249) SS(5) lctype SDCCH r=OTHER ra=0x17 ta=1 <0004> abis_rsl.c:211 (bts=1,trx=0,ts=1,ss=5) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=1,trx=0,ts=1,ss=6) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=1,trx=0,ts=1,ss=7) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=0,trx=0,ts=1,ss=0) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=0,trx=0,ts=1,ss=1) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=0,trx=0,ts=1,ss=2) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=0,trx=0,ts=1,ss=3) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=0,trx=0,ts=1,ss=4) Timeout during activation. Marked as broken. <0004> abis_rsl.c:211 (bts=0,trx=0,ts=1,ss=5) Timeout during activation. Marked as broken.