On Sun, Jun 25, 2017 at 05:20:49PM +0200, Harald Welte wrote:
Hi Neels,
On Sun, Jun 25, 2017 at 03:22:06PM +0200, Neels Hofmeyr wrote:
On Fri, Jun 23, 2017 at 11:19:10AM +0200, Harald Welte wrote:
On Fri, Jun 23, 2017 at 04:51:07AM +0200, Neels Hofmeyr wrote:
We're still having massive stability problems with osmo-bts-trx on the
osmo-gsm-tester.
I'm sorry, but I have to ask for more specifics:
What exactly is a 'massive stability problem'? How does it manifest?
To quantify: between 30 and 40% of all osmo-gsm-tester runs fail because of:
20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
that's the higher-layer code complaining that the frame number as
reported by the lower layer code (osmo-bts-trx) has not incremented by
+1. The normal expectation is that osmo-bts-* feeds every FN into
the common layer (via l1sap).
20170625121036320 DL1C <0006> scheduler_trx.c:1527 GSM clock skew: old fn=2289942, new fn=2290004
That's 62 frames "missed", which is quite a lot (translating to roughly 286 ms).
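
To double-check the numbers from that log line, here is a minimal sketch of the arithmetic. It is not taken from osmo-bts; the helper names fn_delta() and frames_to_us() are made up for illustration. It assumes the usual GSM hyperframe length of 2048 * 26 * 51 frames and a TDMA frame duration of 120/26 ms.

```c
#include <stdint.h>

/* The GSM frame number wraps after 2048 * 26 * 51 frames. */
#define GSM_HYPERFRAME 2715648u

/* Frames elapsed from old_fn to new_fn, taking hyperframe wrap-around
 * into account. Hypothetical helper, not an osmo-bts function. */
static uint32_t fn_delta(uint32_t old_fn, uint32_t new_fn)
{
	return (new_fn + GSM_HYPERFRAME - old_fn) % GSM_HYPERFRAME;
}

/* Duration of n TDMA frames in microseconds (one frame is 120/26 ms). */
static double frames_to_us(uint32_t n)
{
	return n * (120000.0 / 26.0);
}
```

With the values from the log, fn_delta(2289942, 2290004) gives 62, and frames_to_us(62) comes out at about 286 ms. A delta of exactly 1 is what the l1sap check above expects.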
I can of course test things in case anyone has more ideas.
As indicated in the related ticket, I have submitted a patch to gerrit
that switches from gettimeofday() based osmo_timer_list to a monotonic
timerfd based interval timer for the FN clock inside osmo-bts-trx. It
would be good if you can see to this being tested.
I have put your trx patches on a branch and built a binary from it, from
http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester…
the patches are being tested on the gsm-tester. branch: osmo-bts:neels/trx_test
(it actually started from 973, which failed because 'settsc' config is removed
by one of the patches but was still in the osmo-bts-trx config file)
976 has a different failure in *one* of two trx tests:
20170626171713445 DOML <0001> oml.c:333 OC=CHANNEL INST=(00,00,07) AVAIL STATE Dependency -> OK
20170626171713445 DOML <0001> oml.c:340 OC=CHANNEL INST=(00,00,07) OPER STATE Disabled -> Enabled
20170626171713445 DOML <0001> oml.c:301 OC=CHANNEL INST=(00,00,07) Tx STATE CHG REP
20170626171713513 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713514 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713515 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713727 DL1C <0006> scheduler_trx.c:1600 PC clock skew: elapsed_us=614659, error_us=610044
20170626171713727 DOML <0001> bts.c:208 Shutting down BTS 0, Reason No clock from osmo-trx
[...]
Shutdown timer expired
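
A note on reading the "PC clock skew" line: the two logged values differ by exactly 4615, so error_us looks like elapsed_us minus one nominal frame duration of 4615 us. That is an inference from the log, not something checked against the source. The sketch below encodes that assumption; clock_error_us() and clock_skew_too_large() are hypothetical helpers, and the max_frames threshold is an illustrative parameter, not a value taken from scheduler_trx.c.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed nominal TDMA frame duration in microseconds (rounded). */
#define FRAME_DURATION_US 4615

/* Deviation of the measured interval between two clock ticks from one
 * nominal frame. Assumption: this is how error_us relates to elapsed_us
 * in the log line above. */
static int64_t clock_error_us(int64_t elapsed_us)
{
	return elapsed_us - FRAME_DURATION_US;
}

/* Hypothetical policy: treat the clock as lost once the error exceeds
 * max_frames frame durations in either direction. */
static bool clock_skew_too_large(int64_t elapsed_us, int64_t max_frames)
{
	int64_t err = clock_error_us(elapsed_us);
	int64_t limit = max_frames * FRAME_DURATION_US;

	return err > limit || err < -limit;
}
```

clock_error_us(614659) yields 610044, matching the log. In other words, more than 600 ms passed between two ticks that should have been one frame apart, which is over 130 frame durations.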
The next run, 977, is successful.
All following runs until now (982) are failing.
See
http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester…
and click once on the (+) to expand one level of child nodes.
So at first glance it appears that the patches make things worse.
Starting from build #983, we are testing an osmo-bts-trx with *only* the
CLOCK_MONOTONIC patch applied.
Notably we have removed the settsc config option from the osmo-bts-trx config,
but then again settsc seems to not have any effect in the code.
fashion, I would use the opposite approach: Set up osmo-bts-trx on the
same hardware (APU) next to your laptop on your personal desk, and then
try to see if and when the above problems can be reproduced, maybe by
putting some more CPU load on the APU, or I/O load, or whatever..
Yes, maybe something Pau should take on?
If osmo-bts-trx is too unstable for the "production" osmo-gsm-tester, I
would simply disable it until we have addressed related bugs.
We'll see about disabling soon.
We *did* catch a regression with it recently...
~N