Hi all,
T200 is the LAPDm re-transmission timer: once it expires, the sender assumes that a transmitted L2 frame was lost and starts recovery.
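To illustrate the principle (a minimal hypothetical sketch, not the actual libosmocore code; the struct and helper names are made up):

#include <osmocom/core/timer.h>

/* hypothetical per-datalink state, just for illustration */
struct my_datalink {
	struct osmo_timer_list t200;
	unsigned int retrans_ctr;
};

static void retransmit_i_frame(struct my_datalink *dl)
{
	/* re-send the oldest unacknowledged I-frame here */
}

/* T200 expired: the peer did not acknowledge our frame in time,
 * so we assume it was lost on the radio path and enter recovery */
static void t200_expired(void *data)
{
	struct my_datalink *dl = data;

	dl->retrans_ctr++;
	retransmit_i_frame(dl);
	osmo_timer_schedule(&dl->t200, 1, 0); /* re-arm for the next attempt */
}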
Until recently, OsmoBTS ignored the T200 values that the BSC specified via OML, always falling back to the (relatively long) T200 values that exist in libosmocore (1s for the main channel, 2s for the associated channel).
As you know, OsmoBTS is used in production in this configuration, and we never received any associated bug reports.
Recently, in e9f12acbeb5a369282719f8e0deecc88034a5488 I started to use the T200 values as communicated from the BSC via OML. This is what proprietary BTSs like the BS-11, nanoBTS etc. (supposedly) have been using all along anyway, so I thought of it as a bug fix.
However, as it turns out, it breaks our LAPDm implementation in many ways. LAPDm performance degrades so badly that you cannot transmit even a single SMS anymore, and even Location Updates only occasionally succeed.
You can see the erroneous behavior in the attached PCAP file showing OsmoBTS-generated GSMTAP and RSL. Also attached are LAPDm log files for mo-sms and mt-sms. In them you can find troubling lines like 'S frame response with F=1 error', which should never happen...
I suspect two issues related to this:
1) Our lapdm.c code uses regular osmo_timer_* functions to determine when T200 expires, rather than a time-base derived from GSM frame numbers.
This wouldn't be a problem in a synchronous real-time environment. However, in OsmoBTS (as in OsmocomBB), there is a relatively long queue/delay between the point where a frame is pulled out of the bottom end of LAPDm and its actual transmission on the radio interface. Yet LAPDm's T200 starts ticking from the point where the frame was pulled out, rather than from the point where transmission actually started.
In order to change this, I suggest that we either change the LAPDm timers to work on frame numbers passed up from L1 via every L1SAP primitive (comparing RTS.ind for downlink vs. DATA.ind for uplink), or simply keep a per-PHY/per-TRX measurement of the 'round trip time between the actual radio and L2'. We could then compensate by adding that measurement to T200, as in the sketch below.
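A rough sketch of the second option (all names here are hypothetical, not the actual OsmoBTS ones; it assumes one TDMA frame lasts ~4.615ms and frame numbers wrap at 26*51*2048):

#include <stdint.h>

#define GSM_FN_MOD	(26 * 51 * 2048)	/* frame number wrap-around */
#define GSM_FRAME_US	4615			/* one TDMA frame, ~4.615 ms */

struct trx_rtt_state {
	uint32_t last_rts_fn;	/* fn of the most recent PH-RTS.ind */
	uint32_t rtt_frames;	/* smoothed L1<->L2 round trip in frames */
};

/* difference fn_b - fn_a, taking wrap-around into account */
static uint32_t fn_diff(uint32_t fn_a, uint32_t fn_b)
{
	return (fn_b + GSM_FN_MOD - fn_a) % GSM_FN_MOD;
}

/* called with the frame number of each PH-DATA.ind coming up from L1 */
static void update_rtt(struct trx_rtt_state *st, uint32_t data_ind_fn)
{
	/* The fn in PH-RTS.ind (a frame still to be sent) runs ahead of the
	 * fn in PH-DATA.ind (a frame just received); their distance
	 * approximates the round trip between the radio and L2. */
	uint32_t rtt = fn_diff(data_ind_fn, st->last_rts_fn);

	/* simple exponential smoothing of the estimate */
	st->rtt_frames = (3 * st->rtt_frames + rtt) / 4;
}

/* T200 compensated by the measured round trip, in microseconds */
static uint32_t t200_compensated_us(const struct trx_rtt_state *st,
				    uint32_t t200_us)
{
	return t200_us + st->rtt_frames * GSM_FRAME_US;
}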
I briefly tried this here; one given hardware indicated a round-trip time of about 56ms (13 frames difference between RTS.ind and DATA.ind). However, that compensation didn't help, and even compensating by 120ms was not sufficient. This needs to be revisited.
2) I think the libosmogsm LAPDm implementation is actually buggy, specifically in situations where T200 expires. We don't see this often, as the 1s/2s defaults are so long that expiration rarely happens in practice. Once the T200 value is reduced, the probability of running into T200 expiration increases, and so does the probability of seeing related problems.
Our existing lapdm unit tests don't seem to cover timing-related behavior, so that is definitely something that needs improvement.
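A timing-focused test could look roughly like this (a sketch, assuming libosmocore's osmo_gettimeofday_override fake-time facility; the elided LAPDm setup and the tx_count hook counting transmitted frames are hypothetical):

#include <assert.h>
#include <osmocom/core/timer.h>

static unsigned int tx_count; /* hypothetical: bumped by a test tx callback */

static void test_t200_retransmit(void)
{
	/* use a fake clock so the test can jump over T200 deterministically */
	osmo_gettimeofday_override = true;
	osmo_gettimeofday_override_time.tv_sec = 123456;
	osmo_gettimeofday_override_time.tv_usec = 0;

	/* ... set up a LAPDm entity and transmit a single I-frame here ... */

	/* advance the fake clock beyond the 1s default T200 */
	osmo_gettimeofday_override_add(1, 100000);
	osmo_timers_update(); /* fire all expired timers */

	/* T200 expiry must cause exactly one retransmission of the I-frame */
	assert(tx_count == 2);
}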
Any help on those issues is appreciated.
Meanwhile, I decided to revert to the libosmogsm T200 defaults by means of commit 3ca59512d2f4eb1f87699e8fada67f33674918b4.
Regards, Harald