Attention is currently required from: fixeria, kirr.
Hello Jenkins Builder,
I'd like you to reexamine a change. Please visit
https://gerrit.osmocom.org/c/osmocom-bb/+/39327?usp=email
to look at the new patch set (#2).
The following approvals got outdated and were removed: Verified+1 by Jenkins Builder
Change subject: trx_toolkit/clck_gen.py: Fix clock generator not to accumulate timing error
......................................................................
trx_toolkit/clck_gen.py: Fix clock generator not to accumulate timing error
CLCKGen currently works as follows:
    sleep(ctr_interval)
    some work
    sleep(ctr_interval)
    some work
    sleep(ctr_interval)
    some work
    ...
The intent here is to do some work at timestamps that are multiples of ctr_interval. However, the implementation does not match the intent, because
1) sleep(ctr_interval) is not guaranteed by the OS to be ideal, so there will always be some jitter in the time actually slept, without any guarantee that the error will fluctuate around zero rather than accumulate.
2) "some work" takes some time to run, and that time is added again and again to the current time at which the next sleep(ctr_interval) starts. As a result, even if the sleep implementation were ideal, the n'th sleep would start not at
t₀ + n·ctr_interval
but instead at
t₀ + n·ctr_interval + Σ1..n t(work_i)
where the trailing Σ term keeps growing and accumulates as timing error. This can be seen, for example, as the increasing trend of received GSM clock jitter in https://osmocom.org/issues/4658#note-10 .
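The accumulation described above can be illustrated with a small virtual-time model. This is only a sketch: the function name, the 4.615 ms interval (roughly one GSM TDMA frame period) and the 0.5 ms work time are illustrative assumptions, not values taken from clck_gen.py.

```python
def naive_sleep_start(n, ctr_interval, work_times, t0=0.0):
    """Start time of the n'th sleep (n = 0, 1, ...) in the naive
    "sleep; work; sleep; work; ..." loop, modelled in virtual time
    with an ideal sleep() that never oversleeps."""
    t = t0
    for i in range(n):
        t += ctr_interval    # ideal sleep(ctr_interval)
        t += work_times[i]   # "some work" done after that sleep
    return t

# Hypothetical numbers, chosen only for illustration.
ctr_interval = 0.004615      # seconds, about one GSM TDMA frame period
n = 200
work_times = [0.0005] * n    # assume every tick spends 0.5 ms working

actual = naive_sleep_start(n, ctr_interval, work_times)
ideal = n * ctr_interval     # t0 + n*ctr_interval with t0 = 0
drift = actual - ideal       # equals the sum of all work times so far

print("drift after %d ticks: %.1f ms" % (n, drift * 1e3))
```

Even with a perfect sleep, the drift in this model is exactly the Σ term and grows linearly with the number of ticks.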
The thinko in the clock generator logic is hardly visible if "some work" takes only a little time or is done infrequently. That was actually the case before fake_trx added tx queueing in 6e1c82d2 (trx_toolkit/transceiver.py: implement the transmit burst queue), because before that commit the work was only "send IND CLOCK data every ~ 100th tick". After 6e1c82d2, however, the work was changed to do a linear scan of the tx queue at every tick, which amplified the error accumulation and highlighted the problem.
As a result, tx queueing in fake_trx was disabled in d4ed09df (Revert "trx_toolkit/transceiver.py: implement the transmit burst queue"), most likely with the rationale given in https://osmocom.org/issues/4658#note-10 :
    Unfortunately, Python is not fast enough to handle the queues in time. Despite the relatively low CPU usage, fake_trx.py fails to scheduler everything during one TDMA frame period. This causes some of our TTCN-3 test cases to fail.
    ...
    Most likely, the problem is that Python's threading.Event is not accurate enough. Running with SCHED_RR does not change anything.
However, the above analysis shows that it is the logic in CLCKGen that needs fixing, not threading.Event. For reference, threading.Event did indeed use a dumb timeout implementation on Python2:
https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/threading.py#L...
https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/threading.py#L...
but on Python3 it essentially uses plain Lock.acquire(timeout) which, under the hood, uses PyThread_acquire_lock_timed - a plain wrapper over sem_timedwait:
https://github.com/python/cpython/blob/v3.11.9-9-g1b0e63c81b5/Lib/threading....
https://github.com/python/cpython/blob/v3.11.9-9-g1b0e63c81b5/Modules/_threa...
https://github.com/python/cpython/blob/v3.11.9-9-g1b0e63c81b5/Python/thread_...
so at least with Python3 there should be no question about threading.Event.
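One way to check this claim on a given machine is to measure how far threading.Event.wait(timeout) overshoots the requested timeout. The helper below is a hypothetical measurement sketch, not part of trx_toolkit, and the 5 ms timeout is an arbitrary choice close to one TDMA frame period; results will vary with the OS and system load.

```python
import threading
import time

def measure_wait_overshoot(timeout, repeats=20):
    """Measure how much longer than `timeout` (in seconds) each
    threading.Event.wait(timeout) call takes on an event that is never set."""
    ev = threading.Event()
    overshoots = []
    for _ in range(repeats):
        t_start = time.monotonic()
        was_set = ev.wait(timeout)        # returns False when the wait times out
        elapsed = time.monotonic() - t_start
        if not was_set:
            overshoots.append(elapsed - timeout)
    return overshoots

overshoots = measure_wait_overshoot(0.005)
print("max overshoot: %.0f µs" % (max(overshoots) * 1e6))
```

Whatever the per-call overshoot turns out to be, it only matters for the clock generator if the loop lets it add up from tick to tick.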
-> Fix timing error accumulation by reworking the clock generator loop to compensate for the observed jitter, caused both by OS noise and by the work taking time, adjusting the to-sleep δt on each tick accordingly.
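The general idea of such compensation can be sketched as follows. This is a minimal illustration under stated assumptions and not the code from this patch: run_ticks, FakeClock and all numbers are hypothetical, and the clock and sleep functions are injectable only so that the demo can run in deterministic virtual time.

```python
import time

def run_ticks(nticks, ctr_interval, work, now=time.monotonic, sleep=time.sleep):
    """Call work(i) for i = 1..nticks, aiming each call at t0 + i*ctr_interval.

    The sleep length δt is recomputed on every tick from the absolute
    deadline, so oversleeping and work time do not accumulate.
    Returns the times (relative to t0) at which the ticks actually fired."""
    t0 = now()
    fired = []
    for i in range(1, nticks + 1):
        deadline = t0 + i * ctr_interval
        dt = deadline - now()          # to-sleep δt for this tick
        if dt > 0:                     # if already late, do not sleep at all
            sleep(dt)
        fired.append(now() - t0)
        work(i)
    return fired

class FakeClock:
    """Virtual clock whose sleep() always oversleeps by a fixed amount."""
    def __init__(self, oversleep):
        self.t = 0.0
        self.oversleep = oversleep
    def now(self):
        return self.t
    def sleep(self, dt):
        self.t += dt + self.oversleep

# Hypothetical numbers: 4.615 ms interval, 0.2 ms oversleep, 1 ms of work per tick.
ctr_interval = 0.004615
clock = FakeClock(oversleep=0.0002)

def work(i):
    clock.t += 0.001                   # pretend the work takes 1 ms

fired = run_ticks(200, ctr_interval, work, now=clock.now, sleep=clock.sleep)
errors = [f - (i + 1) * ctr_interval for i, f in enumerate(fired)]
print("error at tick 1: %.0f µs, at tick 200: %.0f µs"
      % (errors[0] * 1e6, errors[-1] * 1e6))
```

In this model the per-tick error stays at the single-sleep oversleep instead of growing with the tick number. In a real generator an interruptible wait, for example threading.Event.wait(dt), could stand in for sleep.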
This is generally good for correctness and will allow us to reinstate tx queueing in fake_trx.
Without the fix, the added test fails as follows:
    FAIL: test_no_timing_error_accumulated (test_clck_gen.CLCKGen_Test.test_no_timing_error_accumulated)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/kirr/src/osmocom/bb/src/target/trx_toolkit/test_clck_gen.py", line 60, in test_no_timing_error_accumulated
        self.assertTrue((ntick+1)*clck.ctr_interval > δT, "tick #%d: time overrun by %dµs total" %
    AssertionError: False is not true : tick #200: time overrun by 572478µs total
Change-Id: I928801422c9af80c368261f617b91d7ecfedbabf
Related: OS#4658, OS#6672
---
M src/target/trx_toolkit/clck_gen.py
A src/target/trx_toolkit/test_clck_gen.py
2 files changed, 87 insertions(+), 1 deletion(-)
git pull ssh://gerrit.osmocom.org:29418/osmocom-bb refs/changes/27/39327/2