Hi all,
T200 is the LAPDm re-transmission timer: once it expires, the sender
assumes a transmitted L2 frame was lost and starts recovery.
Until recently, OsmoBTS ignored the T200 values that the BSC specified
via OML, always falling back to the (relatively long) T200 values that
exist in libosmocore (1s for the main channel, 2s for the associated
channel).
As you know, OsmoBTS is used in production in this configuration and we
never received any associated bug reports.
Recently, in commit e9f12acbeb5a369282719f8e0deecc88034a5488 I started
to use the T200 values as communicated by the BSC via OML. This is what
proprietary BTSs like the BS-11, nanoBTS etc. have (supposedly) been
using all along anyway, so I considered it a bug fix.
However, as it turns out, it breaks our LAPDm implementation in many
ways. LAPDm performance degrades so badly that you cannot even transmit
a single SMS anymore, and even Location Updates only occasionally
succeed.
You can see the erroneous behavior in the attached PCAP file showing
OsmoBTS-generated GSMTAP and RSL. Also attached are log file outputs of
LAPDm logging for mo-sms and mt-sms. In them you can find troubling
lines like 'S frame response with F=1 error' which should never
happen...
I suspect two issues related to this:
1) Our lapdm.c code uses regular osmo_timer_* functions to determine
   when T200 expires, rather than a GSM frame number time-base.
   This wouldn't be a problem in a synchronous real-time environment.
   However, in OsmoBTS (as in OsmocomBB), there is a relatively long
   queue/delay between the point where a frame is pulled out of the
   bottom end of LAPDm and its actual transmission on the radio
   interface. Yet LAPDm's T200 starts ticking from the moment the
   frame was pulled out, rather than from the moment transmission
   actually started.
   In order to change this, I suggest that we either change the LAPDm
   timers to work on frame numbers passed up from L1 via every L1SAP
   primitive (comparing RTS.ind for downlink vs. DATA.ind for uplink),
   or simply keep a per-PHY/per-TRX measurement of the 'round trip
   time between the actual radio and the L2'. We can then compensate
   for this delay by adding it to T200.
   I briefly tried the latter approach here; one particular hardware
   indicated a round-trip time of about 56ms (13 frames difference
   between RTS.ind and DATA.ind). However, it didn't help: even
   compensating by 120ms was not sufficient. This needs to be
   revisited.
2) I think the libosmogsm LAPDm implementation is actually buggy,
   specifically in situations where T200 expires. We don't see that
   often, as the 1s/2s defaults are so long that in reality T200
   rarely expires.
Once the T200 value is reduced, the probability of running into T200
expiration increases, and so does the probability of seeing related
problems.
Our existing lapdm unit tests don't seem to cover timing-related
behavior, so that is definitely something that needs improvement.
Any help on those issues is appreciated.
Meanwhile, I decided to revert to the libosmogsm T200 defaults by
means of commit 3ca59512d2f4eb1f87699e8fada67f33674918b4.
Regards,
Harald
--
- Harald Welte <laforge(a)gnumonks.org>
http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
(ETSI EN 300 175-7 Ch. A6)