Hi Tom and others,
in our testing setup, we have sporadic failures (~2 out of 10 times) with:
DOML <0001> bts.c:208 Shutting down BTS 0, Reason No clock from osmo-trx
What would be possible reasons for this failure, and how can we go about fixing it? Some more logging around it:
20170614032014399 DRSL <0000> rsl.c:2333 (bts=0,trx=0,ts=0,ss=2) Fwd RLL msg EST_IND from LAPDm to A-bis
20170614032018533 DL1C <0006> scheduler_trx.c:1451 PC clock skew: elapsed uS 4136730
20170614032018533 DOML <0001> bts.c:208 Shutting down BTS 0, Reason No clock from osmo-trx
20170614032018533 DL1C <0006> scheduler.c:240 Exit scheduler for trx=0
20170614032018533 DL1C <0006> scheduler.c:216 Init scheduler for trx=0
20170614032018533 DOML <0001> oml.c:280 OC=RADIO-CARRIER INST=(00,00,ff) AVAIL STATE OK -> Off line
[...] Shutdown timer expired
(We're using an external 10MHz OCXO timing source)
It appears there are four seconds of nothing from osmo-trx?
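The four-second figure can be checked against the log timestamps. This is a minimal sketch, assuming the leading 17-digit field is YYYYMMDDhhmmss followed by milliseconds; the helper names parse_ts and gap_seconds are made up for illustration.

```python
from datetime import datetime


def parse_ts(stamp: str) -> datetime:
    """Parse a 17-digit log timestamp, assumed to be YYYYMMDDhhmmss + milliseconds."""
    base = datetime.strptime(stamp[:14], "%Y%m%d%H%M%S")
    return base.replace(microsecond=int(stamp[14:17]) * 1000)


def gap_seconds(earlier: str, later: str) -> float:
    """Return the time between two log timestamps in seconds."""
    return (parse_ts(later) - parse_ts(earlier)).total_seconds()


# Last line before the gap vs. the "PC clock skew" line from the log above.
print(gap_seconds("20170614032014399", "20170614032018533"))
```

The result is about 4.13 seconds, in line with the "elapsed uS 4136730" reported by the scheduler.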
Most curious is that the next run will be completely fine, until some time later we get this same failure.
We wait until osmo-trx logs
-- Transceiver active with 1 channel(s)
and then we "immediately" or up to a second later launch osmo-bts-trx. Would it help to give it more grace time?
Thanks!
~N
Neels,
A reason could be that osmo-trx is losing connection with the SDR. Are you running this on bare metal or a VM?
USB based SDRs like USRP B2x0 have a hard time keeping the Tx/Rx alignment when there are any disturbances. So osmo-trx features a sophisticated algorithm to maintain this alignment for USB based devices. Thomas has spent tremendous effort tuning it to perform well, but maybe there are edge cases which are not handled there yet. Let's wait for his comments.
(That's one of the primary reasons we use Ethernet in UmTRX, btw - it's much more robust to issues like this)
Please excuse typos. Written with a touchscreen keyboard.
-- Regards, Alexander Chemeris CTO/Founder Fairwaves, Inc. https://fairwaves.co
On Wed, Jun 14, 2017 at 3:47 PM, Alexander Chemeris alexander.chemeris@gmail.com wrote:
USB based SDRs like USRP B2x0 have a hard time keeping the Tx/Rx alignment when there are any disturbances. So osmo-trx features a sophisticated algorithm to maintain this alignment for USB based devices. Thomas has spent tremendous effort tuning it to perform well, but maybe there are edge cases which are not handled there yet. Let's wait for his comments.
The 'clock' issue is occurring between osmo-bts and osmo-trx and not between osmo-trx and the device. For the latter, irregular packet timing would appear as underruns, overflows, late packets, etc. - errors non-specific to GSM numerology.
There are timing considerations at startup because the device needs time to initialize. In the case of the B200 on first boot, the startup time is especially long because of the FPGA load. Running the uhd_usrp_probe utility will give an indication of the device initialization time. On top of that delay, osmo-trx could add another second for Tx/Rx synchronization purposes.
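To put a number on the device initialization time, the probe run can be timed. The following is only a sketch: time_command is a hypothetical helper, and it assumes uhd_usrp_probe is on the PATH and runs to completion without arguments.

```python
import subprocess
import time


def time_command(argv):
    """Run argv to completion and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode, time.monotonic() - start


# Example (needs UHD installed and a device attached):
#   code, elapsed = time_command(["uhd_usrp_probe"])
#   print(f"exit code {code}, took {elapsed:.1f} s")
```

Running it once right after power-up and once again afterwards should show how much of the delay is the first-boot FPGA load.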
If clock skew is not occurring at startup, then process scheduling is probably related. If the flow of CLK IND stops entirely, as in the case when osmo-trx stops running, the message would be "No clock from osmo-trx". Clock skew could also occur because of variability in calling gettimeofday(), but I have not encountered that on any systems that I run.
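One way to test the gettimeofday() theory independently of osmo-bts is to compare the wall clock with a monotonic clock over time. This sketch is not the check osmo-bts performs; clock_drift and watch are hypothetical names, and the interval and threshold values are arbitrary.

```python
import time


def clock_drift(wall_start, mono_start, wall_now, mono_now):
    """Return how much further the wall clock advanced than the monotonic clock (s)."""
    return (wall_now - wall_start) - (mono_now - mono_start)


def watch(interval=1.0, threshold=0.1, rounds=10):
    """Sample both clocks and return (round, drift) pairs where |drift| > threshold."""
    jumps = []
    wall_prev, mono_prev = time.time(), time.monotonic()
    for i in range(rounds):
        time.sleep(interval)
        wall_now, mono_now = time.time(), time.monotonic()
        drift = clock_drift(wall_prev, mono_prev, wall_now, mono_now)
        if abs(drift) > threshold:
            jumps.append((i, drift))
        wall_prev, mono_prev = wall_now, mono_now
    return jumps


# Example: watch(interval=1.0, threshold=0.1, rounds=3600) for an hour-long run.
```

A non-empty result would point at something stepping the system clock. An empty result over a run that still shows the skew message would point back at process scheduling.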
-TT
On Wed, Jun 14, 2017 at 04:03:44PM -0700, Tom Tsou wrote:
There are timing considerations at startup because the device needs time to initialize. In the case of the B200 on first boot, the startup time is especially long because of the FPGA load.
We are specifically waiting for "Transceiver active" on stdout, so that the FPGA load is done before we continue.
Running the uhd_usrp_probe utility will give an indication of the device initialization time. On top of that delay, osmo-trx could add another second for Tx/Rx synchronization purposes.
Ok, I'll try adding a little headroom after receiving the "Transceiver active" message.
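Roughly, the launch sequence with headroom could look like the sketch below. The function names and the 2-second default are assumptions for illustration, not our actual test scripts, and the real command lines are left to the caller.

```python
import subprocess
import threading
import time

MARKER = "Transceiver active"  # substring of the osmo-trx stdout line quoted above


def wait_for_marker(lines, marker=MARKER):
    """Consume lines until one contains marker; return True if it was seen."""
    for line in lines:
        if marker in line:
            return True
    return False


def _drain(stream):
    """Keep reading so the first process never blocks on a full pipe."""
    for _ in stream:
        pass


def start_after_marker(trx_argv, bts_argv, grace_seconds=2.0):
    """Start trx_argv, wait for MARKER plus a grace period, then start bts_argv."""
    trx = subprocess.Popen(
        trx_argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    if not wait_for_marker(trx.stdout):
        trx.wait()
        raise RuntimeError("first process exited before printing the marker")
    threading.Thread(target=_drain, args=(trx.stdout,), daemon=True).start()
    time.sleep(grace_seconds)  # headroom for Tx/Rx synchronization (assumed value)
    bts = subprocess.Popen(bts_argv)
    return trx, bts
```

Since osmo-trx may take about another second for Tx/Rx synchronization after that message, anything comfortably above one second seems like a reasonable first try.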
One point may be that it's not a very powerful machine: an APU with an 800MHz dual core.
If clock skew is not occurring at startup, then process scheduling is probably related. If the flow of CLK IND stops entirely, as in the case when osmo-trx stops running, the message would be "No clock from osmo-trx". Clock skew could also occur because of variability in calling gettimeofday(), but I have not encountered that on any systems that I run.
With NTP switched off, I have no idea why the system clock could jump around. I also looked in the root crontab and so on; maybe something is still calling ntpdate on that system...
Could also make sense to wipe that OS to be sure. A lot has happened on there in the past...
Will keep you posted.
~N
Hi Neels and Tom,
On Fri, Jun 16, 2017 at 08:20:35PM +0200, Neels Hofmeyr wrote:
One point may be that it's not a very powerful machine: an APU with an 800MHz dual core.
That actually means an AMD Embedded G-Series T40E APU. We like to use passively cooled, low-end processors whenever possible.
If clock skew is not occurring at startup, then process scheduling is probably related. If the flow of CLK IND stops entirely, as in the case when osmo-trx stops running, the message would be "No clock from osmo-trx". Clock skew could also occur because of variability in calling gettimeofday(), but I have not encountered that on any systems that I run.
With NTP switched off, I have no idea why the system clock could jump around. I also looked in the root crontab and so on; maybe something is still calling ntpdate on that system...
You can always run tcpdump on the NTP port (UDP 123) if you're worried about that.