On Sun, Jun 25, 2017 at 05:20:49PM +0200, Harald Welte wrote:
Hi Neels,
On Sun, Jun 25, 2017 at 03:22:06PM +0200, Neels Hofmeyr wrote:
On Fri, Jun 23, 2017 at 11:19:10AM +0200, Harald Welte wrote:
On Fri, Jun 23, 2017 at 04:51:07AM +0200, Neels Hofmeyr wrote:
We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
I'm sorry, but I have to ask for more specifics: What exactly is a 'massive stability problem'? How does it manifest?
To quantify: between 30 and 40% of all osmo-gsm-tester runs fail because of:
20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
That's the higher-layer code complaining that the frame number as reported by the lower-layer code (osmo-bts-trx) has not incremented by +1. The normal expectation is that osmo-bts-* feeds every FN into the common layer (via l1sap).
20170625121036320 DL1C <0006> scheduler_trx.c:1527 GSM clock skew: old fn=2289942, new fn=2290004
That's 62 frames "missed", which is quite a lot (translating to 285ms).
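To make the arithmetic behind the "62 frames" figure explicit, here is a minimal C sketch. It assumes the standard GSM values of 120/26 ms per TDMA frame and a hyperframe of 26 * 51 * 2048 frames; the names fn_delta(), fn_to_us() and GSM_HYPERFRAME are illustrative and not taken from the osmo-bts sources.

```c
#include <assert.h>
#include <stdint.h>

/* GSM frame numbers wrap at the hyperframe: 26 * 51 * 2048 frames. */
#define GSM_HYPERFRAME 2715648u

/* Forward distance from old_fn to new_fn, taking FN wrap-around into account.
 * (Illustrative helper, not an osmo-bts function.) */
uint32_t fn_delta(uint32_t old_fn, uint32_t new_fn)
{
	assert(old_fn < GSM_HYPERFRAME && new_fn < GSM_HYPERFRAME);
	return (new_fn + GSM_HYPERFRAME - old_fn) % GSM_HYPERFRAME;
}

/* Duration of n_frames in microseconds, rounded down.
 * One TDMA frame lasts 120/26 ms, i.e. 60000/13 us. */
uint64_t fn_to_us(uint32_t n_frames)
{
	return (uint64_t)n_frames * 60000u / 13u;
}
```

With the values from the clock skew log line, fn_delta(2289942, 2290004) gives 62 and fn_to_us(62) gives 286153, i.e. about 286 ms, in line with the roughly 285 ms quoted.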
I can of course test things in case anyone has more ideas.
As indicated in the related ticket, I have submitted a patch to gerrit that switches the FN clock inside osmo-bts-trx from a gettimeofday()-based osmo_timer_list to a monotonic, timerfd-based interval timer. It would be good if you could see to this being tested.
I have put your trx patches on a branch and built a binary from it. From http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_... onwards, the patches are being tested on the gsm-tester. Branch: osmo-bts:neels/trx_test
(it actually started from 973, which failed because 'settsc' config is removed by one of the patches but was still in the osmo-bts-trx config file)
976 has a different failure in *one* of two trx tests:
20170626171713445 DOML <0001> oml.c:333 OC=CHANNEL INST=(00,00,07) AVAIL STATE Dependency -> OK
20170626171713445 DOML <0001> oml.c:340 OC=CHANNEL INST=(00,00,07) OPER STATE Disabled -> Enabled
20170626171713445 DOML <0001> oml.c:301 OC=CHANNEL INST=(00,00,07) Tx STATE CHG REP
20170626171713513 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713514 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713515 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713727 DL1C <0006> scheduler_trx.c:1600 PC clock skew: elapsed_us=614659, error_us=610044
20170626171713727 DOML <0001> bts.c:208 Shutting down BTS 0, Reason No clock from osmo-trx
[...]
Shutdown timer expired
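In the "PC clock skew" line of this log, elapsed_us and error_us differ by exactly 4615, which is one TDMA frame rounded to whole microseconds. A minimal C sketch of such a check follows, assuming the error is simply the elapsed time minus one nominal frame. The function names and the threshold parameter are hypothetical and not copied from scheduler_trx.c.

```c
#include <stdint.h>
#include <stdlib.h>

/* One TDMA frame (120/26 ms), rounded to whole microseconds. */
#define FRAME_DURATION_US 4615

/* Deviation of the measured time between two frame ticks from the nominal
 * frame duration. (Assumed relationship, inferred from the log values.) */
int64_t clock_error_us(int64_t elapsed_us)
{
	return elapsed_us - FRAME_DURATION_US;
}

/* Returns 1 if the deviation exceeds max_error_us in either direction.
 * The threshold is a caller-supplied, hypothetical value. */
int clock_skew_exceeded(int64_t elapsed_us, int64_t max_error_us)
{
	return llabs((long long)clock_error_us(elapsed_us)) > (long long)max_error_us;
}
```

For the logged elapsed_us=614659, clock_error_us() returns 610044, matching the error_us value in the log. That is more than 130 frame durations between two ticks of the frame timer.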
The next run, 977, is successful. All following runs until now (982) are failing.
See http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_... and click once on the (+) to expand one level of child nodes.
So at first glance it appears that the patches make things worse.
Starting from build #983, we are testing an osmo-bts-trx with *only* the CLOCK_MONOTONIC patch applied.
Notably we have removed the settsc config option from the osmo-bts-trx config, but then again settsc seems to not have any effect in the code.
fashion, I would use the opposite approach: Set up osmo-bts-trx on the same hardware (APU) next to your laptop on your personal desk, and then try to see if and when the above problems can be reproduced, maybe by putting some more CPU load on the APU, or I/O load, or whatever..
Yes, maybe something Pau should take on?
If osmo-bts-trx is too unstable for the "production" osmo-gsm-tester, I would simply disable it until we have addressed the related bugs.
We'll see about disabling soon. We *did* catch a regression with it recently...
~N