osmo-bts-trx fails frequently on osmo-gsm-tester

Neels Hofmeyr nhofmeyr at sysmocom.de
Sun Jun 25 13:22:06 UTC 2017


On Fri, Jun 23, 2017 at 11:19:10AM +0200, Harald Welte wrote:
> Hi Neels,
> 
> On Fri, Jun 23, 2017 at 04:51:07AM +0200, Neels Hofmeyr wrote:
> > We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
> 
> I'm sorry, but I have to ask for more specifics:
> What exactly is a 'massive stability problem'?  How does it manifest

To quantify: between 30 and 40% of all osmo-gsm-tester runs fail because of:

20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
20170625121036320 DL1C <0006> scheduler_trx.c:1527 GSM clock skew: old fn=2289942, new fn=2290004
20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
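
To put the skew line above into units of time, here is a minimal sketch that
parses it and converts the frame jump to milliseconds. It is not part of
osmo-bts-trx; the helper name clock_skew() is made up for illustration. It
assumes the standard GSM TDMA frame duration (120/26 ms) and hyperframe length
(2715648 frames), and that the jump is forward (a backward jump would show up
as a very large number).

```python
import re

# GSM TDMA frame duration: a 26-frame multiframe lasts 120 ms (~4.615 ms each).
FRAME_MS = 120.0 / 26.0
# Frame numbers wrap around at the hyperframe length.
HYPERFRAME = 2715648

SKEW_RE = re.compile(r"GSM clock skew: old fn=(\d+), new fn=(\d+)")


def clock_skew(line):
    """Return (frames, milliseconds) for a clock skew log line, else None.

    Hypothetical helper; assumes the frame number jumped forward.
    """
    m = SKEW_RE.search(line)
    if m is None:
        return None
    old_fn, new_fn = int(m.group(1)), int(m.group(2))
    frames = (new_fn - old_fn) % HYPERFRAME
    return frames, frames * FRAME_MS


line = ("20170625121036320 DL1C <0006> scheduler_trx.c:1527 "
        "GSM clock skew: old fn=2289942, new fn=2290004")
frames, ms = clock_skew(line)
print(f"{frames} frames, about {ms:.0f} ms")
```

For the line quoted above this gives a jump of 62 frames, i.e. somewhere
around 286 ms.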

Detailed logs in
http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_run/940/artifact/trial-940-run.tgz
in /run.2017-06-25_12-05-43/sms:trx/mo_mt_sms.py/osmo-bts-trx/osmo-bts-trx/stderr

Related osmo-trx output is in the same tgz at
/run.2017-06-25_12-05-43/sms:trx/mo_mt_sms.py/osmo-bts-trx/osmo-trx/stderr

(Number crunching: if 30% of the test runs fail, where each run contains two
osmo-bts-trx tests, it means that roughly 15% of osmo-bts-trx tests fail.)
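
The estimate above can be sketched a bit more carefully. This assumes that a
run fails exactly when at least one of its two osmo-bts-trx tests fails, that
the two tests fail independently with the same probability, and that nothing
else makes a run fail. None of these assumptions is verified against the
jenkins data; the function name is made up for illustration.

```python
def per_test_failure_rate(run_failure_rate, tests_per_run=2):
    """Per-test failure probability p, solved from 1 - (1 - p)**n = run rate.

    Assumes independent, identically likely test failures (an assumption,
    not something measured on the tester).
    """
    return 1.0 - (1.0 - run_failure_rate) ** (1.0 / tests_per_run)


for run_rate in (0.30, 0.40):
    p = per_test_failure_rate(run_rate)
    print(f"run failure {run_rate:.0%} -> per-test failure about {p:.1%}")
```

Under these assumptions the per-test rate comes out slightly above the simple
"divide by two" figure: about 16% for a 30% run failure rate and about 23%
for 40%.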

(The reason why I say "massive": it's really annoying to have this rate of
sporadic failure. Instead of investigating upon first failure, we will only
notice a regression when runs fail consistently, i.e. when there are no
successful runs for, say, 5 or more runs. We don't take action immediately, yet
we have to be careful to not be too late and lose jenkins run logs of the last
successful run. The first failing runs in a series can well be just trx
failures, so it needs more effort to find out which run introduced an actual
regression.)
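
The "no successful runs for, say, 5 or more runs" rule can be sketched as
follows. The run numbers and results below are made up for illustration, and
the function is a hypothetical helper, not part of osmo-gsm-tester. Note that
it only returns the earliest candidate: as said above, the first failures in
a streak may still be sporadic trx failures.

```python
def first_consistent_failure(results, streak=5):
    """Return the run number that starts the first streak of at least
    `streak` consecutive failed runs, or None if there is no such streak.

    results: iterable of (run_number, passed) in chronological order.
    """
    start = None
    count = 0
    for run, passed in results:
        if passed:
            start, count = None, 0
            continue
        if count == 0:
            start = run
        count += 1
        if count >= streak:
            return start
    return None


# Hypothetical history: sporadic failures, then a consistent streak.
history = [(930, True), (931, False), (932, True), (933, False),
           (934, False), (935, False), (936, False), (937, False)]
print(first_consistent_failure(history))

# Chance that 5 failures in a row are purely sporadic, assuming runs fail
# independently at the 30% and 40% rates quoted above.
for rate in (0.30, 0.40):
    print(f"{rate:.0%} per run -> {rate ** 5:.2%} for 5 in a row")
```

With independent failures at 30 to 40% per run, five sporadic failures in a
row would be rare (well under 2%), which is why a streak of that length is a
reasonable signal for an actual regression.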

> > I have run a tcpdump on the ntp port for the past days, and nothing is doing
> > ntp besides the actual ntp service.
> 
> And that service was presumably disabled (before your test described in
> the next paragraph)?

Yes. I started the tcpdump filtering on the ntp port, saw ntp packets (to verify
that it works), disabled the ntp service, saw that the packets ceased, restarted
the tcpdump in a tmux, forgot about it for a couple of days, then came back to
the tmux and saw that the tcpdump was completely empty. Then I started the ntp
service again, immediately saw ntp packets in the tcpdump, and the osmo-bts-trx
test run failed promptly.


Let me mention that I see myself as "the messenger", relaying the results I see
on the tester setup; I will pursue a solution in a limited fashion, to not
neglect other tasks.

I can of course test things in case anyone has more ideas.

Tom mentioned the idea of running osmo-bts-trx on a different machine from
osmo-trx -- that is certainly possible in a manual test, but I guess not really
an option for the regular tests. It would be a lot of manual supervision to
perform a series of tests, like 20 or more, to find out the success rate; or a
code and jenkins config change to run the osmo-bts-trx binary on a different
build slave, not trivial. It would be much preferred to stay on a single host
computer...

~N