We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
I have run a tcpdump on the ntp port for the past days, and nothing is doing ntp besides the actual ntp service.
Today I started ntp while an osmo-bts-trx run was active and what do you know, the osmo-bts-trx process exits immediately. I think this is bad, osmo-bts-trx shouldn't use wall clock time for precise timing needs.
Besides that, I have no idea what could cause the clock skews, except maybe that the CPU or the USB are not fast enough?? I'm wondering, is there still such a thing as a separate Linux realtime kernel?
We will soon put another main unit into productive use, with a cleanly installed OS. If we see the same problems on that system and can't find a software fix, we may need to reconsider the tester for osmo-bts-trx...
~N
On Ubuntu they maintain a version of the kernel called "lowlatency" (linux-image-lowlatency), it might be worth a try.
Regards, Domi
On 23 Jun 2017, at 4:52, Neels Hofmeyr nhofmeyr@sysmocom.de wrote:
We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
I have run a tcpdump on the ntp port for the past days, and nothing is doing ntp besides the actual ntp service.
Today I started ntp while an osmo-bts-trx run was active and what do you know, the osmo-bts-trx process exits immediately. I think this is bad, osmo-bts-trx shouldn't use wall clock time for precise timing needs.
Besides that, I have no idea what could cause the clock skews, except maybe that the CPU or the USB are not fast enough?? I'm wondering, is there still such a thing as a separate Linux realtime kernel?
We will soon put another main unit into productive use, with a cleanly installed OS. If we see the same problems on that system and can't find a software fix, we may need to reconsider the tester for osmo-bts-trx...
~N
Hi Neels,
On Fri, Jun 23, 2017 at 04:51:07AM +0200, Neels Hofmeyr wrote:
We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
I'm sorry, but I have to ask for more specifics: What exactly is a 'massive stability problem'? How does it manifest itself in detail at the lowest possible interface (i.e. log output of osmo-trx, osmo-bts-trx, ...)?
I have run a tcpdump on the ntp port for the past days, and nothing is doing ntp besides the actual ntp service.
And that service was presumably disabled (before your test described in the next paragraph)?
Today I started ntp while an osmo-bts-trx run was active and what do you know, the osmo-bts-trx process exits immediately. I think this is bad, osmo-bts-trx shouldn't use wall clock time for precise timing needs.
Yes, I think it's a sign of very poor design if we cannot even sync the local wall clock to a NTP or GPS reference. CLOCK_MONOTONIC_RAW should be used on Linux for use cases like the one in osmo-bts-trx, having to schedule bursts at specific time intervals.
In fact, I think the entire TRX<->BTS interface is not all that good an idea to begin with.
In OsmoTRX, we have the ADC/DAC sample clock that is driving transmission of samples. Normally, the entire PHY layer runs synchronous to that, and it would drive the "clock" of L2 by means of PH-RTS.ind, so the L2 knows whenever it wants to transmit something.
However, the OsmoTRX <-> osmo-bts-trx interface is not at the PHY<->L2 boundary, but it is at an inner boundary between the radio modem (OsmoTRX) and the L1 (in osmo-bts-trx). And those are two separate processes, without any way to synchronously trigger some action based on the ADC/DAC master sample clock. As a result, osmo-bts-trx needs to keep its own clock, based on whatever clock source available in the operating system / hardware, and make sure it sends bursts at the right speed to OsmoTRX. So OsmoTRX and osmo-bts-trx run actually asynchronous, at something that is specified/designed to be a synchronous interface in the GSM architecture.
But then, I guess we don't have the luxury of changing all of this, so migrating to something like CLOCK_MONOTONIC_RAW or CLOCK_MONOTONIC is the pragmatic option. Instead of osmocom timers, using timer_create(CLOCK_MONOTONIC, ..) sounds like a good idea, or even timerfd_create(), which would integrate with our select() loop. The only problem is that those are periodic timers. While we do want periodicity (once every burst period of 577us), the local clock of the Linux system is >= 1000 times less accurate than the clock of the GSM transmitting hardware, i.e. we need to adjust the expiration of our timer based on clock information provided by osmo-trx.
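For illustration, a minimal sketch of such a timerfd-based interval timer with an adjustment hook (function names, the adjustment logic and the 4615us value are illustrative assumptions, not the actual osmo-bts-trx implementation):

#include <stdint.h>
#include <sys/timerfd.h>
#include <unistd.h>

#define FRAME_DURATION_US 4615	/* one GSM TDMA frame, 8 * ~577us */

/* Create a CLOCK_MONOTONIC interval timer whose fd can be polled via
 * select(). Assumes a sub-second interval (tv_nsec must stay < 1e9). */
static int fn_timer_create(unsigned int interval_us)
{
	struct itimerspec its = {
		.it_interval = { .tv_sec = 0, .tv_nsec = interval_us * 1000L },
		.it_value    = { .tv_sec = 0, .tv_nsec = interval_us * 1000L },
	};
	int fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
	if (fd < 0)
		return -1;
	if (timerfd_settime(fd, 0, &its, NULL) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Re-arm with a corrected interval, e.g. after comparing our local FN
 * against the clock indications received from osmo-trx. */
static int fn_timer_adjust(int fd, long corrected_us)
{
	struct itimerspec its = {
		.it_interval = { .tv_sec = 0, .tv_nsec = corrected_us * 1000L },
		.it_value    = { .tv_sec = 0, .tv_nsec = corrected_us * 1000L },
	};
	return timerfd_settime(fd, 0, &its, NULL);
}

/* In the select() loop, when the fd is readable: the 8-byte counter read
 * from it tells us how many intervals expired, i.e. how many FNs to advance. */
static int fn_timer_read(int fd, uint64_t *expirations)
{
	return read(fd, expirations, sizeof(*expirations)) == sizeof(*expirations) ? 0 : -1;
}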
Besides that, I have no idea what could cause the clock skews, except maybe that the CPU or the USB are not fast enough??
where is evidence of that?
- do we get underruns / overruns in reading/writing from/to the SDR?
 ** if this is not properly logged yet, we should make sure all such instances are properly logged, and that we have a counter that counts such events since the process start. Printing of related counters could be done at time of sending a signal to the process, or in periodic intervals (every 10s?) on stdout (see the counter sketch below)
- do we see indications of packet loss between TRX and BTS?
 ** each UDP packet on the per-TRX data interface contains frame number and timeslot index in its header, so detecting missing frames is easy, whether or not this is currently already implemented.
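As a rough sketch of what such under/overrun counters with a signal-triggered and periodic dump could look like (the struct, the function names and the SIGUSR1 choice are made up for illustration; nothing like this is claimed to exist in osmo-trx today):

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Illustrative counters, incremented from the SDR read/write path. */
static struct {
	uint64_t rx_overruns;	/* device -> host samples dropped */
	uint64_t tx_underruns;	/* host -> device buffer ran dry */
} io_stats;

static volatile sig_atomic_t dump_requested;

static void on_sigusr1(int sig)
{
	(void)sig;
	dump_requested = 1;	/* async-signal-safe: only set a flag */
}

static void dump_io_stats(void)
{
	fprintf(stdout, "SDR I/O stats since start: overruns=%llu underruns=%llu\n",
		(unsigned long long)io_stats.rx_overruns,
		(unsigned long long)io_stats.tx_underruns);
}

/* Called once per main-loop iteration: dump on signal, or every 10s. */
static void maybe_dump_io_stats(void)
{
	static time_t last_dump;
	time_t now = time(NULL);

	if (dump_requested || now - last_dump >= 10) {
		dump_io_stats();
		dump_requested = 0;
		last_dump = now;
	}
}

/* Setup would be: signal(SIGUSR1, on_sigusr1); */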
Regards, Harald
On Fri, Jun 23, 2017 at 2:19 AM, Harald Welte laforge@gnumonks.org wrote:
Yes, I think it's a sign of very poor design if we cannot even sync the local wall clock to a NTP or GPS reference. CLOCK_MONOTONIC_RAW should be used on Linux for use cases like the one in osmo-bts-trx, having to schedule bursts at specific time intervals.
In fact, I think the entire TRX<->BTS interface is not all that good an idea to begin with.
I agree that the L1<->L0 socket interface is quite unusual. The historical reason for a distinct mid-PHY split was to create a license shim layer between commercially licensed OpenBTS code and GPL-based GNU Radio. I don't believe that there was ever a good technical reason in terms of code or structure for the separation.
Currently, the only reason that the socket layer needs to exist is for backwards compatibility with OpenBTS, and I'm not sure how much support there is for that option now. Perhaps there are some fronthaul / C-RAN application benefits, but I'm not aware of that being a popular use case for osmo-trx. So the justification for the existing TRX<->BTS interface for use with osmo-bts is not very strong.
Besides that, I have no idea what could cause the clock skews, except maybe that the CPU or the USB are not fast enough??
where is evidence of that?
- do we get underruns / overruns in reading/writing from/to the SDR?
 ** if this is not properly logged yet, we should make sure all such instances are properly logged, and that we have a counter that counts such events since the process start. Printing of related counters could be done at time of sending a signal to the process, or in periodic intervals (every 10s?) on stdout
We do not have overrun / underrun counters in osmo-trx, but I agree that this is a good idea.
- do we see indications of packet loss between TRX and BTS?
 ** each UDP packet on the per-TRX data interface contains frame number and timeslot index in its header, so detecting missing frames is easy, whether or not this is currently already implemented.
Packet loss between TRX-BTS is definitely a concern, but I think that is unlikely. The skew between OS time and device time is likely driven by scheduling and transient delays in BTS burst processing and/or late UDP arrival from TRX. In that case, a faster machine certainly helps. Another test could be running BTS and TRX on separate machines to isolate process scheduling for each application.
-TT
Hi Tom,
thanks for your input.
On Fri, Jun 23, 2017 at 11:19:53AM -0700, Tom Tsou wrote:
On Fri, Jun 23, 2017 at 2:19 AM, Harald Welte laforge@gnumonks.org wrote:
I agree that the L1<->L0 socket interface is quite unusual. The historical reason for a distinct mid-PHY split was to create a license shim layer between commercially licensed OpenBTS code and GPL-based GNU Radio. I don't believe that there was ever a good technical reason in terms of code or structure for the separation.
ah, I didn't know (or remember) there was actually any gnuradio dependency in the OpenBTS transceiver.
Currently, the only reason that the socket layer needs to exist is for backwards compatibility with OpenBTS, and I'm not sure how much support there is for that option now.
I also don't think there is much value in this. I'm not aware of anyone regularly testing that configuration, and I don't think there is value in supporting something that nobody is testing.
Perhaps there are some fronthaul / C-RAN application benefits, but I'm not aware of that being a popular use case for osmo-trx.
not yet, at least. Maybe once such systems become more deployed in regions where GSM is not phased out. But even in such cases, I would expect that actual I/Q baseband samples are required (e.g. in CPRI or OBSAI), and not unmodulated/demodulated symbols.
So the justification for the existing TRX<->BTS interface for use with osmo-bts is not very strong.
Agreed. On the other hand, changing it would be quite some amount of work, so we might just as well keep it.
where is evidence of that?
- do we get underruns / overruns in reading/writing from/to the SDR?
 ** if this is not properly logged yet, we should make sure all such instances are properly logged, and that we have a counter that counts such events since the process start. Printing of related counters could be done at time of sending a signal to the process, or in periodic intervals (every 10s?) on stdout
We do not have overrun / underrun counters in osmo-trx, but I agree that this is a good idea.
I think they should definitely be added. It might also make sense to do some runtime evaluation of how long it typically takes us to process a burst in uplink and downlink, to get an idea about how much margin there is. I'm thinking of something like taking a timestamp when we read from the UDP socket and another when we go back to sleep (and the same in the inverse direction, from samples to UDP).
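Roughly, such a per-burst timing measurement could look like this (function names and the reporting condition are illustrative assumptions, not existing code):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static int64_t ts_diff_us(const struct timespec *a, const struct timespec *b)
{
	return (int64_t)(b->tv_sec - a->tv_sec) * 1000000 +
	       (b->tv_nsec - a->tv_nsec) / 1000;
}

/* call at the top of the burst handler, right after reading from the socket */
static void burst_start(struct timespec *t_start)
{
	clock_gettime(CLOCK_MONOTONIC, t_start);
}

/* call just before going back to select()/sleep; budget_us would be the
 * available burst/frame period, so the difference is the remaining margin */
static void burst_done(const struct timespec *t_start, int64_t budget_us)
{
	struct timespec t_end;
	int64_t used_us;

	clock_gettime(CLOCK_MONOTONIC, &t_end);
	used_us = ts_diff_us(t_start, &t_end);
	if (used_us > budget_us)
		fprintf(stderr, "burst processing took %lld us, over the %lld us budget\n",
			(long long)used_us, (long long)budget_us);
}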
Packet loss between TRX-BTS is definitely a concern, but I think that is unlikely. The skew between OS time and device time is likely driven by scheduling and transient delays in BTS burst processing and/or late UDP arrival from TRX. In that case, a faster machine certainly helps.
The problem I have is that right now there is no clear indication of what's happening. If a given machine is unable to provide sufficient CPU to operate, we should fail gracefully with some explicit message in that regard. Looking into under/overruns on the SDR side, as well as keeping an eye on (and exporting/reporting) the "margin" in terms of how soon we finish our processing before the next burst period happens, would improve the situation here. This is true for OsmoTRX doing modulation/demodulation, but probably even more so for osmo-bts-trx doing convolutional decoding, etc.
It might also be worthwhile to consider whether short, occasional drop-outs are acceptable, and whether we can recover from that in a more meaningful way - short of exiting the process and having it re-spawned by systemd.
Individual missing samples on some occasions shouldn't be that critical, I guess? Sure, they will increase BER when they happen, but beyond that?
And in terms of osmo-bts-trx missing received UDP burst data, it is basically FER.
In both cases, it might make sense to accept this in rare intervals + raise a related OML ALERT to the BSC.
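For illustration, a minimal sketch of how missing bursts could be detected and counted from the frame number in the per-TRX data header, as input for such an FER statistic / OML alert decision (struct and function names are illustrative, and it assumes a burst is expected on the timeslot every frame; this is not the existing osmo-bts-trx code):

#include <stdint.h>
#include <stdio.h>

#define GSM_HYPERFRAME 2715648u	/* FN wraps after 2048*26*51 frames */

/* Illustrative per-timeslot state for detecting gaps in received bursts. */
struct ts_rx_state {
	uint32_t last_fn;	/* FN of the last burst seen on this TS */
	int have_last;
	uint64_t lost_bursts;	/* counts toward an FER-style statistic */
};

/* Called for every burst received on the per-TRX data socket; returns the
 * number of bursts skipped since the previous one. */
static uint32_t ts_rx_check_gap(struct ts_rx_state *st, uint32_t fn)
{
	uint32_t gap = 0;

	if (st->have_last) {
		/* expected FN is last_fn + 1, modulo the hyperframe */
		uint32_t expected = (st->last_fn + 1) % GSM_HYPERFRAME;
		gap = (fn + GSM_HYPERFRAME - expected) % GSM_HYPERFRAME;
		if (gap) {
			st->lost_bursts += gap;
			fprintf(stderr, "missed %u burst(s): expected fn=%u, got fn=%u\n",
				gap, expected, fn);
		}
	}
	st->last_fn = fn;
	st->have_last = 1;
	return gap;
}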
I'm likely going to look a bit more into the osmo-bts-trx side soon (beyond the CLOCK_MONOTONIC patches under review), but for OsmoTRX I have no current plans to do any of the above improvements myself.
Regards, Harald
On Fri, Jun 23, 2017 at 11:19:10AM +0200, Harald Welte wrote:
Hi Neels,
On Fri, Jun 23, 2017 at 04:51:07AM +0200, Neels Hofmeyr wrote:
We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
I'm sorry, but I have to ask for more specifics: What exactly is a 'massive stability problem'? How does it manifest
To quantify: between 30 and 40% of all osmo-gsm-tester runs fail because of:
20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
20170625121036320 DL1C <0006> scheduler_trx.c:1527 GSM clock skew: old fn=2289942, new fn=2290004
20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
Detailed logs in http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_... in /run.2017-06-25_12-05-43/sms:trx/mo_mt_sms.py/osmo-bts-trx/osmo-bts-trx/stderr
Related osmo-trx output is in the same tgz in /run.2017-06-25_12-05-43/sms:trx/mo_mt_sms.py/osmo-bts-trx/osmo-trx/stderr
(Number crunching: if 30% of the test runs fail, where each run contains two osmo-bts-trx tests, it means that roughly 15% of osmo-bts-trx tests fail.)
(The reason why I say "massive": it's really annoying to have this rate of sporadic failure. Instead of investigating upon first failure, we will only notice a regression when runs fail consistently, i.e. when there are no successful runs for, say, 5 or more runs. We don't take action immediately, yet we have to be careful to not be too late and lose jenkins run logs of the last successful run. The first failing runs in a series can well be just trx failures, so it needs more effort to find out which run introduced an actual regression.)
I have run a tcpdump on the ntp port for the past days, and nothing is doing ntp besides the actual ntp service.
And that service was presumably disabled (before your test described in the next paragraph)?
Yes, started the tcpdump filtering on the ntp port, saw ntp packets (to verify that it works), disabled the ntp service, saw that packets cease, restarted the tcpdump in a tmux, forgot about it for a couple of days, then came back to the tmux and saw that the tcpdump was completely empty. Then again I started the ntp service, immediately saw ntp packets in the tcpdump and the osmo-bts-trx test run failed promptly.
Let me mention that I see myself as "the messenger", relaying the results I see on the tester setup; I will pursue a solution in a limited fashion, to not neglect other tasks.
I can of course test things in case anyone has more ideas.
Tom mentioned the idea of running osmo-bts-trx on a different machine from osmo-trx -- that is certainly possible in a manual test, but I guess not really an option for the regular tests. It would be a lot of manual supervision to perform a series of tests, like 20 or more, to find out the success rate; or a code and jenkins config change to run the osmo-bts-trx binary on a different build slave, not trivial. It would be much preferred to stay on a single host computer...
~N
Hi Neels,
On Sun, Jun 25, 2017 at 03:22:06PM +0200, Neels Hofmeyr wrote:
On Fri, Jun 23, 2017 at 11:19:10AM +0200, Harald Welte wrote:
On Fri, Jun 23, 2017 at 04:51:07AM +0200, Neels Hofmeyr wrote:
We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
I'm sorry, but I have to ask for more specifics: What exactly is a 'massive stability problem'? How does it manifest
To quantify: between 30 and 40% of all osmo-gsm-tester runs fail because of:
20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
that's the higher-layer code complaining that the frame number as reported by the lower layer code (osmo-bts-trx) has not incremented by +1. The normal expectation is that osmo-bts-* feeds every FN into the common layer (via l1sap).
20170625121036320 DL1C <0006> scheduler_trx.c:1527 GSM clock skew: old fn=2289942, new fn=2290004
That's 62 frames "missed", which is quite a lot (translating to 285ms).
I can of course test things in case anyone has more ideas.
As indicated in the related ticket, I have submitted a patch to gerrit that switches from gettimeofday() based osmo_timer_list to a monotonic timerfd based interval timer for the FN clock inside osmo-bts-trx. It would be good if you can see to this being tested. I am travelling more than I'm at home or at the office (i.e. no access to related equipment), nor do I have insight into how we could test a non-master patch in the osmo-gsm-tester setup.
There are more odd parts in osmo-bts-trx that I could imagine having an impact on this, but we should take it step by step. One problem, for example, is that the UDP sockets for the TRX/BTS communication are not set to non-blocking mode, so a blocking write could mess a lot with timing.
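For illustration, switching a socket to non-blocking mode is a small fcntl() change (a sketch only, not the actual patch):

#include <fcntl.h>

/* Put a TRX/BTS UDP socket into non-blocking mode so that a full socket
 * buffer results in EAGAIN instead of stalling the FN clock loop.
 * Error handling in the caller is assumed. */
static int sock_set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}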
Tom mentioned the idea of running osmo-bts-trx on a different machine from osmo-trx -- that is certainly possible in a manual test, but I guess not really an option for the regular tests.
It is an option. However, we need to understand what exactly is the problem here. Rather than adding additional hardware to the osmo-gsm-tester setup in a "trial and error" aka "stumbling in the dark" fashion, I would use the opposite approach: Set up osmo-bts-trx on the same hardware (APU) next to your laptop on your personal desk, and then try to see if and when the above problems can be reproduced, maybe by putting some more CPU load on the APU, or I/O load, or whatever..
If osmo-bts-trx is too unstable for the "production" osmo-gsm-tester, I would simply disable it until we have addressed related bugs.
Regards, Harald
On Sun, Jun 25, 2017 at 05:20:49PM +0200, Harald Welte wrote:
Hi Neels,
On Sun, Jun 25, 2017 at 03:22:06PM +0200, Neels Hofmeyr wrote:
On Fri, Jun 23, 2017 at 11:19:10AM +0200, Harald Welte wrote:
On Fri, Jun 23, 2017 at 04:51:07AM +0200, Neels Hofmeyr wrote:
We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
I'm sorry, but I have to ask for more specifics: What exactly is a 'massive stability problem'? How does it manifest
To quantify: between 30 and 40% of all osmo-gsm-tester runs fail because of:
20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
that's the higher-layer code complaining that the frame number as reported by the lower layer code (osmo-bts-trx) has not incremented by +1. The normal expectation is that osmo-bts-* feeds every FN into the common layer (via l1sap).
20170625121036320 DL1C <0006> scheduler_trx.c:1527 GSM clock skew: old fn=2289942, new fn=2290004
That's 62 frames "missed", which is quite a lot (translating to 285ms).
I can of course test things in case anyone has more ideas.
As indicated in the related ticket, I have submitted a patch to gerrit that switches from gettimeofday() based osmo_timer_list to a monotonic timerfd based interval timer for the FN clock inside osmo-bts-trx. It would be good if you can see to this being tested.
I have put your trx patches on a branch and built a binary from it; from http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_... on, the patches are being tested on the gsm-tester. Branch: osmo-bts:neels/trx_test
(it actually started from 973, which failed because 'settsc' config is removed by one of the patches but was still in the osmo-bts-trx config file)
976 has a different failure in *one* of two trx tests:
20170626171713445 DOML <0001> oml.c:333 OC=CHANNEL INST=(00,00,07) AVAIL STATE Dependency -> OK
20170626171713445 DOML <0001> oml.c:340 OC=CHANNEL INST=(00,00,07) OPER STATE Disabled -> Enabled
20170626171713445 DOML <0001> oml.c:301 OC=CHANNEL INST=(00,00,07) Tx STATE CHG REP
20170626171713513 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713514 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713515 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713727 DL1C <0006> scheduler_trx.c:1600 PC clock skew: elapsed_us=614659, error_us=610044
20170626171713727 DOML <0001> bts.c:208 Shutting down BTS 0, Reason No clock from osmo-trx
[...]
Shutdown timer expired
The next run, 977, is successful. All following runs until now (982) are failing.
See http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_... and click once on the (+) to expand one level of child nodes.
So at first glance it appears that the patches make things worse.
Starting from build #983, we are testing an osmo-bts-trx with *only* the CLOCK_MONOTONIC patch applied.
Notably we have removed the settsc config option from the osmo-bts-trx config, but then again settsc seems to not have any effect in the code.
fashion, I would use the opposite approach: Set up osmo-bts-trx on the same hardware (APU) next to your laptop on your personal desk, and then try to see if and when the above problems can be reproduced, maybe by putting some more CPU load on the APU, or I/O load, or whatever..
Yes, that may be something Pau should take on?
If osmo-bts-trx is too unstable for the "production" osmo-gsm-tester, I would simply disable it until we have addressed related bugs.
We'll see about disabling soon. We *did* catch a regression with it recently...
~N
On Mon, Jun 26, 2017 at 08:39:45PM +0200, Neels Hofmeyr wrote:
976 has a different failure in *one* of two trx tests:
20170626171713513 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713514 DL1C <0006> scheduler_trx.c:1704 We were 47 FN faster than TRX, compensating
20170626171713515 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713517 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713518 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
20170626171713519 DL1C <0006> scheduler_trx.c:1704 We were 44 FN faster than TRX, compensating
the above are all very odd, as they indicate that the 4.6ms timer constantly expires way faster than the actual 4.6ms as per the clock we receive from the TRX. And it's not an occasional frame every so often, but lots (44*4.6 = 202.4ms) within one second - that's a 20% clock deviation.
20170626171713727 DL1C <0006> scheduler_trx.c:1600 PC clock skew: elapsed_us=614659, error_us=610044
This means that the Linux kernel was supposed to schedule us after 4.6ms, but actually took 610ms longer, i.e. more than half a second. This is highly unusual. Something really odd must be happening to the system here. What other tasks with realtime priority (SCHED_RR) are running on the system?
On a related note: what kernel are you running - could you share "uname -a" output from that system?
On 27/06/17 10:24, Max wrote:
On a related note: what kernel are you running - could you share "uname -a" output from that system?
Linux osmo-gsm-tester-rnd 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
It's been suggested by a few people (me included) here to use -lowlatency kernels from Ubuntu. If there's a particular reason why we can't use Ubuntu instead of Debian for osmo-gsm-tester, then we could try installing kernels from the https://liquorix.net/ repository.
On 27.06.2017 11:42, Pau Espin Pedrol wrote:
Linux osmo-gsm-tester-rnd 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
Hi Max,
On Tue, Jun 27, 2017 at 11:49:24AM +0200, Max wrote:
It's been suggested by a few people (me included) here to use -lowlatency kernels from Ubuntu. If there's a particular reason why we can't use Ubuntu instead of Debian for osmo-gsm-tester, then we could try installing kernels from the https://liquorix.net/ repository.
I would appreciate it if we would invest our time and energy into actually *debugging* the issue and finding out what the problem is, rather than a very high-level trial+error approach of swapping distributions and kernels.
Hi Neels,
Are you running osmo-trx in a single TRX or dual-TRX configuration?
Do you have a CPU usage information from the system?
Could you try disabling all timeslots but the ones that are needed? It won't completely disable them with the current code, but IIRC it will somewhat help with the CPU load, which I think is the real issue here.
Please excuse typos. Written with a touchscreen keyboard.
-- Regards, Alexander Chemeris CTO/Founder Fairwaves, Inc. https://fairwaves.co
On Jun 25, 2017 22:23, "Neels Hofmeyr" nhofmeyr@sysmocom.de wrote:
On Fri, Jun 23, 2017 at 11:19:10AM +0200, Harald Welte wrote:
Hi Neels,
On Fri, Jun 23, 2017 at 04:51:07AM +0200, Neels Hofmeyr wrote:
We're still having massive stability problems with osmo-bts-trx on the osmo-gsm-tester.
I'm sorry, but I have to ask for more specifics: What exactly is a 'massive stability problem'? How does it manifest
To quantify: between 30 and 40% of all osmo-gsm-tester runs fail because of:
20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
20170625121036320 DL1C <0006> scheduler_trx.c:1527 GSM clock skew: old fn=2289942, new fn=2290004
20170625121036320 DL1P <0007> l1sap.c:423 Invalid condition detected: Frame difference is > 1!
Detailed logs in http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_run/940/artifact/trial-940-run.tgz in /run.2017-06-25_12-05-43/sms:trx/mo_mt_sms.py/osmo-bts-trx/osmo-bts-trx/stderr
Related osmo-trx output is in the same tgz in /run.2017-06-25_12-05-43/sms:trx/mo_mt_sms.py/osmo-bts-trx/osmo-trx/stderr
(Number crunching: if 30% of the test runs fail, where each run contains two osmo-bts-trx tests, it means that roughly 15% of osmo-bts-trx tests fail.)
(The reason why I say "massive": it's really annoying to have this rate of sporadic failure. Instead of investigating upon first failure, we will only notice a regression when runs fail consistently, i.e. when there are no successful runs for, say, 5 or more runs. We don't take action immediately, yet we have to be careful to not be too late and lose jenkins run logs of the last successful run. The first failing runs in a series can well be just trx failures, so it needs more effort to find out which run introduced an actual regression.)
I have run a tcpdump on the ntp port for the past days, and nothing is doing ntp besides the actual ntp service.
And that service was presumably disabled (before your test described in the next paragraph)?
Yes, started the tcpdump filtering on the ntp port, saw ntp packets (to verify that it works), disabled the ntp service, saw that packets cease, restarted the tcpdump in a tmux, forgot about it for a couple of days, then came back to the tmux and saw that the tcpdump was completely empty. Then again I started the ntp service, immediately saw ntp packets in the tcpdump and the osmo-bts-trx test run failed promptly.
Let me mention that I see myself as "the messenger", relaying the results I see on the tester setup; I will pursue a solution in a limited fashion, to not neglect other tasks.
I can of course test things in case anyone has more ideas.
Tom mentioned the idea of running osmo-bts-trx on a different machine from osmo-trx -- that is certainly possible in a manual test, but I guess not really an option for the regular tests. It would be a lot of manual supervision to perform a series of tests, like 20 or more, to find out the success rate; or a code and jenkins config change to run the osmo-bts-trx binary on a different build slave, not trivial. It would be much preferred to stay on a single host computer...
~N
On Mon, Jun 26, 2017 at 07:25:29PM +0900, Alexander Chemeris wrote:
Hi Neels,
Are you running osmo-trx in a single TRX or dual-TRX configuration?
Single TRX.
Do you have a CPU usage information from the system?
Like load average? It doesn't really give hard information...
I like Harald's suggestion to put load on the system and try to trigger the failure. In the sense of trying to make the failure more frequent to be able to figure out the cause more easily.
@pespin, can you probe in that direction?
~N
On Sun, Jun 25, 2017 at 6:22 AM, Neels Hofmeyr nhofmeyr@sysmocom.de wrote:
Tom mentioned the idea of running osmo-bts-trx on a different machine from osmo-trx -- that is certainly possible in a manual test, but I guess not really an option for the regular tests. It would be a lot of manual supervision to perform a series of tests, like 20 or more, to find out the success rate; or a code and jenkins config change to run the osmo-bts-trx binary on a different build slave, not trivial. It would be much preferred to stay on a single host computer...
To be clear, I am not advocating the use of separate machines as a permanent solution, which I agree is not ideal, but as a method to confirm that the issue is directly related to process scheduling.
-TT
pespin has submitted a patch for osmo-trx that appears to completely fix the osmo-bts-trx clock skew instability problem! Please merge it if possible :)
https://gerrit.osmocom.org/3120
Thanks,
~N
On 05/07/17 12:13, Neels Hofmeyr wrote:
pespin has submitted a patch for osmo-trx that appears to completely fix the osmo-bts-trx clock skew instability problem! Please merge it if possible :)
The issue is still there; however, the error rate dropped dramatically after that patch. See for instance: https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester...
I think there are some bits remaining to be improved on the osmo-bts-trx side now in order to get rid of the issue.
Regards,
On Thu, Jul 06, 2017 at 05:07:00PM +0200, Pau Espin Pedrol wrote:
On 05/07/17 12:13, Neels Hofmeyr wrote:
pespin has submitted a patch for osmo-trx that appears to completely fix the osmo-bts-trx clock skew instability problem! Please merge it if possible :)
The issue is still there; however, the error rate dropped dramatically after that patch. See for instance: https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester...
To clarify the term "dramatically": failure rate is now about three out of 50 tester runs, i.e. three out of 100 osmo-bts-trx tests. Lately it had been failing every second test, so the improvement is huge, from 50% down to 3%.
Kudos to pespin for finding that issue. Hopefully we can also get rid of the remaining odd failure.
Interesting that it doesn't happen every time, nor on everyone else's equipment. Maybe CPU speed or faster I/O makes it less likely to happen.
~N
On Thu, Jul 6, 2017 at 5:59 PM, Neels Hofmeyr nhofmeyr@sysmocom.de wrote:
Kudos to pespin for finding that issue. Hopefully we can also get rid of the remaining odd failure.
Absolutely.
Interesting that it doesn't happen every time, nor on everyone else's equipment. Maybe CPU speed or faster I/O makes it less likely to happen.
That is likely. The bts-trx socket interface has been fairly static and I know that L2 scheduling issues have existed in OpenBTS. I doubt that there has been any recent regression; the use of faster CPUs has probably been covering up the issue for a very long time.
-TT
Hi,
On 07/07/17 03:23, Tom Tsou wrote:
On Thu, Jul 6, 2017 at 5:59 PM, Neels Hofmeyr nhofmeyr@sysmocom.de wrote:
Kudos to pespin for finding that issue. Hopefully we can also get rid of the remaining odd failure.
Absolutely.
Interesting that it doesn't happen every time, nor on everyone else's equipment. Maybe CPU speed or faster I/O makes it less likely to happen.
That is likely. The bts-trx socket interface has been fairly static and I know that L2 scheduling issues have existed in OpenBTS. I doubt that there has been any recent regression; the use of faster CPUs has probably been covering up the issue for a very long time.
On top of that, I think the issues also come from the fact that we are starting the whole network automatically at the same time, which increases the number of messages and the time required to sync everything and become stable.
On Sun, Jul 09, 2017 at 06:35:51PM +0200, Pau Espin Pedrol wrote:
On top of that, I think the issues also come from the fact that we are starting the whole network automatically at the same time, which increases the number of messages and the time required to sync everything and become stable.
I did try once to add several seconds of delay between launching osmo-trx and osmo-bts-trx, without effect, so I doubt that it's system load. (osmo-bts-trx is typically the last program to be launched.)
~N
On 10/07/17 11:59, Neels Hofmeyr wrote:
On Sun, Jul 09, 2017 at 06:35:51PM +0200, Pau Espin Pedrol wrote:
On top of that, I think the issues also come from the fact that we are starting the whole network automatically at the same time, which increases the number of messages and the time required to sync everything and become stable.
I did try once to add several seconds of delay between launching osmo-trx and osmo-bts-trx, without effect, so I doubt that it's system load. (osmo-bts-trx is typically the last program to be launched.)
Indeed, that was not the root cause of the issue I already fixed in osmo-trx, which was the most common one. I'm still not sure about the other one we are still facing; I need to look into it.
Hi everybody,
On 10/07/17 17:59, Pau Espin Pedrol wrote:
Indeed, that was not the root cause of the issue I already fixed in osmo-trx, which was the most common one. I'm still not sure about the other one we are still facing; I need to look into it.
After storing output from several test failures in osmo-gsm-tester using a B200, I was able to narrow the scope of the issue down to osmo-trx / UHD / B200 HW.
Please find the related information in this task comment: https://osmocom.org/issues/2325#note-34
I would really appreciate help from anybody with better knowledge of UHD and B200 HW to provide some feedback or advice on how to proceed here.
osmo-bts-trx works fine for me with UHD 3.10.2. I never install UHD via apt; my advice is to remove any UHD installed via apt and then rebuild it fresh from the source code. Hope this helps!
On Mon, Oct 9, 2017 at 7:04 PM, Pau Espin Pedrol pespin@sysmocom.de wrote:
Hi everybody,
On 10/07/17 17:59, Pau Espin Pedrol wrote:
Indeed, that was not the root cause for the issue I already fixed in osmo-trx, which was the most common issue, but I'm still not sure about the other one we are still facing, I need to look into it.
After storing output from several test failures in osmo-gsm-tester using a B200, I was able to narrow the scope of the issue down to osmo-trx / UHD / B200 HW.
Please find the related information in this task comment: https://osmocom.org/issues/2325#note-34
I would really appreciate help from anybody with better knowledge of UHD and B200 HW to provide some feedback or advice on how to proceed here.
Hi Sandi,
On Mon, Oct 09, 2017 at 07:42:47PM +0700, Sandi Suhendro wrote:
osmo-bts-trx works fine for me with UHD 3.10.2. I never install UHD via apt; my advice is to remove any UHD installed via apt and then rebuild it fresh from the source code. Hope this helps!
Thanks for your feedback. However, this is nothing but a work-around and no solution to the problem.
* application software like osmo-trx should work with any version of UHD that it successfully builds against
* if there are some specific bugs in the UHD versions that are shipped by stable/actively maintained distributions, then related patches have to be submitted to the package maintainers in those distributions
I know it will work fine with any version of UHD, because I have myself tried many versions of UHD and they seem OK. What I mean is the installation of UHD... just in case, some people install libuhd via apt, for example:
- sudo apt-get install uhd-host libuhd003 libuhd-dev
You need to remove all uhd-host and libuhd-dev packages and then re-install from source.
I hope someone with good knowledge of UHD will tell us what the problem is between the HW, the UHD driver and osmo-trx.
cheers,
DUO
On Mon, Oct 9, 2017 at 10:37 PM, Harald Welte laforge@gnumonks.org wrote:
Hi Sandi,
On Mon, Oct 09, 2017 at 07:42:47PM +0700, Sandi Suhendro wrote:
osmo-bts-trx works fine for me with UHD 3.10.2. I never install UHD via apt; my advice is to remove any UHD installed via apt and then rebuild it fresh from the source code. Hope this helps!
Thanks for your feedback. However, this is nothing but a work-around and no solution to the problem.
* application software like osmo-trx should work with any version of UHD that it successfully builds against
* if there are some specific bugs in the UHD versions that are shipped by stable/actively maintained distributions, then related patches have to be submitted to the package maintainers in those distributions