osmo-bts-trx fails frequently on osmo-gsm-tester

Neels Hofmeyr nhofmeyr at sysmocom.de
Mon Jun 26 18:39:45 UTC 2017


On Sun, Jun 25, 2017 at 05:20:49PM +0200, Harald Welte wrote:
> Hi Neels,
> 
> On Sun, Jun 25, 2017 at 03:22:06PM +0200, Neels Hofmeyr wrote:
> > On Fri, Jun 23, 2017 at 11:19:10AM +0200, Harald Welte wrote:
> > > On Fri, Jun 23, 2017 at 04:51:07AM +0200, Neels Hofmeyr wrote:
> > > > We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
> > > 
> > > I'm sorry, but I have to ask for more specifics:
> > > What exactly is a 'massive stability problem'?  How does it manifest
> > 
> > To quantify: between 30 and 40% of all osmo-gsm-tester runs fail because of:
> > 
> > 20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
> 
> that's the higher-layer code complaining that the frame number as
> reported by the lower layer code (osmo-bts-trx) has not incremented by
> +1.  The normal expectation is that osmo-bts-* feeds every FN into
> the common layer (via l1sap).
> 
> > 20170625121036320 DL1C <0006> scheduler_trx.c:1527 GSM clock skew: old fn=2289942, new fn=2290004
> 
> That's 62 frames "missed", which is quite a lot (translating to roughly 286 ms).
> 
> > I can of course test things in case anyone has more ideas.
> 
> As indicated in the related ticket, I have submitted a patch to gerrit
> that switches from gettimeofday() based osmo_timer_list to a monotonic
> timerfd based interval timer for the FN clock inside osmo-bts-trx.  It
> would be good if you can see to this being tested.
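
For reference, the clock skew line quoted above (old fn=2289942, new
fn=2290004) can be checked with a little arithmetic. This is a minimal Python
sketch, not code from osmo-bts-trx: the helper names fn_delta and frames_to_ms
are made up for illustration. It assumes the GSM TDMA frame period of 120/26 ms
(about 4.615 ms) and the hyperframe length of 2715648 frames, at which frame
numbers wrap.

```python
from fractions import Fraction

GSM_HYPERFRAME = 2715648               # frame numbers wrap at 2048 * 26 * 51
FRAME_PERIOD_MS = Fraction(120, 26)    # one TDMA frame lasts 120/26 ms (~4.615 ms)


def fn_delta(old_fn: int, new_fn: int) -> int:
    """Number of frames from old_fn to new_fn, allowing for FN wrap-around."""
    return (new_fn - old_fn) % GSM_HYPERFRAME


def frames_to_ms(frames: int) -> float:
    """Duration of the given number of TDMA frames in milliseconds."""
    return float(frames * FRAME_PERIOD_MS)


# Values taken from the "GSM clock skew" log line quoted above.
skipped = fn_delta(2289942, 2290004)
print(skipped, round(frames_to_ms(skipped), 1))   # 62 286.2
```

So the jump is 62 frames, or a bit over 286 ms, in line with the rough figure
given above.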

I have put your trx patches on a branch (osmo-bts:neels/trx_test) and built a
binary from it. Starting with
http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_run/976
the patches are being tested on the gsm-tester.

(Testing actually started with run 973, which failed because one of the patches
removes the 'settsc' config option, while it was still present in the
osmo-bts-trx config file.)

976 has a different failure in *one* of two trx tests:

20170626171713445 DOML <0001> oml.c:333 OC=CHANNEL INST=(00,00,07) AVAIL STATE Dependency -> OK
20170626171713445 DOML <0001> oml.c:340 OC=CHANNEL INST=(00,00,07) OPER STATE Disabled -> Enabled
20170626171713445 DOML <0001> oml.c:301 OC=CHANNEL INST=(00,00,07) Tx STATE CHG REP
20170626171713513 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713514 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713515 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713727 DL1C <0006> scheduler_trx.c:1600 PC clock skew: elapsed_us=614659, error_us=610044
20170626171713727 DOML <0001> bts.c:208 Shutting down BTS 0, Reason No clock from osmo-trx
[...]
Shutdown timer expired
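
A side note on the numbers in the log above: subtracting error_us from
elapsed_us in the "PC clock skew" line gives 4615 us, which matches one GSM
frame period (120/26 ms). That suggests, though I have not confirmed it in the
code, that the timer was expected to fire after one frame period and instead
fired after about 615 ms. The Python sketch below only does that bookkeeping on
a shortened copy of the log; summarize() is a made-up helper, not part of any
Osmocom tool.

```python
import re

# Shortened copy of the log excerpt above (one line per distinct message).
LOG = """\
20170626171713513 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713727 DL1C <0006> scheduler_trx.c:1600 PC clock skew: elapsed_us=614659, error_us=610044
"""

FASTER_RE = re.compile(r"We were (\d+) FN faster than TRX")
SKEW_RE = re.compile(r"PC clock skew: elapsed_us=(\d+), error_us=(\d+)")


def summarize(log: str) -> dict:
    """Collect the FN lead values and the PC clock skew figures from a log."""
    leads = [int(m.group(1)) for m in FASTER_RE.finditer(log)]
    skew = SKEW_RE.search(log)
    elapsed_us = int(skew.group(1)) if skew else None
    error_us = int(skew.group(2)) if skew else None
    return {"fn_leads": leads, "elapsed_us": elapsed_us, "error_us": error_us}


info = summarize(LOG)
# Interval the timer was apparently expected to take, in microseconds.
expected_us = info["elapsed_us"] - info["error_us"]
print(info["fn_leads"], expected_us)   # [47, 44] 4615
```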



The next run, 977, is successful.
All following runs until now (982) are failing.

See http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_run/test_results_analyzer/
and click once on the (+) to expand one level of child nodes.

So at first glance it appears that the patches make things worse.

Starting from build #983, we are testing an osmo-bts-trx with *only* the
CLOCK_MONOTONIC patch applied.

Notably we have removed the settsc config option from the osmo-bts-trx config,
but then again settsc does not seem to have any effect in the code.
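
To illustrate the general idea behind a monotonic FN clock (this is not the
actual patch, which is C code using a timerfd interval timer), here is a small
Python sketch that derives the expected frame number from a monotonic clock
instead of wall-clock time. expected_fn() is a hypothetical helper, and the
starting FN is just the value from the earlier log line. On Linux,
time.monotonic_ns() is backed by CLOCK_MONOTONIC, so it is not affected when
the system time is stepped.

```python
import time

GSM_HYPERFRAME = 2715648        # frame numbers wrap at 2048 * 26 * 51
NS_PER_26_FRAMES = 120_000_000  # 26 TDMA frames last exactly 120 ms


def expected_fn(start_fn: int, start_ns: int, now_ns: int) -> int:
    """Frame number that should be current once now_ns - start_ns has elapsed."""
    elapsed_frames = (now_ns - start_ns) * 26 // NS_PER_26_FRAMES
    return (start_fn + elapsed_frames) % GSM_HYPERFRAME


# Take a monotonic reference point, wait a little, then ask which FN we
# should have reached. The printed value depends on scheduling latency.
start_ns = time.monotonic_ns()
start_fn = 2289942
time.sleep(0.05)
print(expected_fn(start_fn, start_ns, time.monotonic_ns()) - start_fn)
```

Computing the FN from the elapsed time in integer nanoseconds avoids
accumulating rounding error from a truncated per-frame period.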

> fashion, I would use the opposite approach:  Set up osmo-bts-trx on the
> same hardware (APU) next to your laptop on your personal desk, and then
> try to see if and when the above problems can be reproduced, maybe by
> putting some more CPU load on the APU, or I/O load, or whatever..

Yes, maybe that is something Pau should take on?


> If osmo-bts-trx is too unstable for the "production" osmo-gsm-tester, I
> would simply disable it until we have addressed related bugs.

We'll see about disabling soon.
We *did* catch a regression with it recently...

~N