kirr has uploaded this change for review. ( https://gerrit.osmocom.org/c/osmocom-bb/+/40047?usp=email )
Change subject: trx_toolkit/clck_gen: Don't use threads because Python GIL is latency killer
......................................................................
trx_toolkit/clck_gen: Don't use threads because Python GIL is latency killer
Currently, running fake_trx.py with just 1 BTS and 1 ccch_scan shows rather high CPU usage on 2 cores and regular time overruns in CLCKGen:
1 BTS + 1 ccch_scan:
2 threads: first ~ 85-90% CPU, second ~ 50% CPU; regular time overruns in CLCKGen
Trying to run fake_trx.py with 1 BTS and 2 ccch_scan shows even more CPU usage and non-stop CLCKGen time overruns:
1 BTS + 2 ccch_scan:
2 threads: first ~ 100% CPU, second ~ 70% CPU; time overrun all the time
with the BTS also emitting something like the following all the time:
20250416102139774 <0006> scheduler_trx.c:591 We were 42 FN faster than TRX, compensating
20250416102140358 <0006> scheduler_trx.c:591 We were 24 FN faster than TRX, compensating
Profiling fake_trx.py for the 1 BTS + 1 ccch_scan case via perf (see https://docs.python.org/3/howto/perf_profiling.html) as
$ perf record -g python -X perf fake_trx.py --trx M2@127.0.0.1:7700 --trx M3@127.0.0.1:8700
shows that, besides in-kernel work related to send/recv/select syscalls, the system is also releasing/reacquiring the GIL frequently:
http://navytux.spb.ru/~kirr/osmo/fake_trx/gil-pingpong.html (search for "gil" and "Thread" there)
This is understandable, because in the current architecture there are 2 threads:

- T1 is running CLCKGen loop + TX work
- T2 is running FakeTRX loop + RX work
Unfortunately, even though the functions that take/release the GIL do not themselves account for much time in the profile, the GIL still creates a serious latency problem. And the fake_trx problem of CLCKGen time overruns is directly a latency problem: an overrun means that CLCKGen could not be woken up in time and missed 2 GSM frames, failing to forward queued bursts timely.
The GIL creates latency problems for the following reason: in py2 the interpreter checked for a pending GIL request every 100 bytecode instructions and, if another python thread had made such a request, performed the switch. In py3 the way the GIL works was changed: instead of checking every 100 instructions, a thread may now hold the GIL for up to 5 _milliseconds_ before being preempted, in order to avoid unnecessary switches. This helped the throughput of computation-heavy workloads, but harmed latency-sensitive ones.
Please see the following presentation by David Beazley for an overview of how the GIL works on py2 and py3:
https://speakerdeck.com/dabeaz/understanding-the-python-gil https://dabeaz.com/GIL/
That presentation actually mentions the problem of GIL latency on py3; it is further explained at
https://www.dabeaz.com/blog/2010/02/revisiting-thread-priorities-and-new.htm...
That problem was filed to the Python issue tracker in the same year, 2010:
https://bugs.python.org/issue7946
but was never addressed despite several proposed patches. In the end the issue was closed by the original reporter ten years later:
https://bugs.python.org/issue7946#msg377865
So, after many years of this problem being known and several proposed patches never being merged, there is practically no hope that it will be addressed at all. What happens in the fake_trx case is that the CLCKGen thread can be prevented from being woken up timely when the main loop thread is e.g. busy decoding received bursts. That is why we see lots of time overruns: the Python GIL alone can practically cost us 5 milliseconds. On top of that, threads and frequent switching between them bring further costs, e.g. the 2 cores need to ping-pong data between their caches.
-> Drop all that threading to avoid the latency hits, and hook the CLCKGen timer into the sole IO loop by way of the timerfd system call.
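The general shape of that single-loop design can be sketched in portable Python even without timerfd, by deriving the select() timeout from an absolute deadline. This is an illustrative sketch, not the actual fake_trx code; the function and callback names are made up:

```python
import select
import time

GSM_FRAME_US = 4615.0
TICK_NS = int(GSM_FRAME_US * 1000)  # GSM frame period in ns

def io_loop(sock_list, on_tick, on_rx, nticks):
    # One loop serves both the clock and the sockets, so no GIL
    # hand-off between threads is ever needed.  Each deadline is
    # derived from the previous one (not from "now"), so timing
    # error does not accumulate across iterations.
    t_next = time.monotonic_ns() + TICK_NS
    ticks = 0
    while ticks < nticks:
        dt = t_next - time.monotonic_ns()
        r, _, _ = select.select(sock_list, [], [], max(dt, 0) / 1e9)
        if time.monotonic_ns() >= t_next:
            on_tick()          # clock is priority: emit the indication first
            ticks += 1
            t_next += TICK_NS
        for sock in r:
            on_rx(sock)        # then serve whatever sockets are ready

if __name__ == "__main__":
    # With no sockets, 10 ticks should take ~ 10 * 4.615 ms = ~46 ms.
    t0 = time.monotonic()
    io_loop([], lambda: None, lambda s: None, 10)
    print("10 ticks in %.1f ms" % ((time.monotonic() - t0) * 1e3))
```

The actual patch replaces the manual deadline bookkeeping with a timerfd, which lets the kernel do the same non-accumulating arithmetic and makes the clock just another fd in select().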
This helps latency and CPU usage significantly. After this patch:
1 BTS + 1 ccch_scan:
1 thread ~ 55% CPU; no CLCKGen time overruns
1 BTS + 2 ccch_scan:
1 thread ~ 65% CPU; seldom CLCKGen time overruns
Unfortunately os.timerfd_create() & friends are only available starting from Python 3.13. If this poses a problem, it can be easily solved by doing those timerfd-related system calls in Cython, after we switch most of the codebase to Cython later.
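For pre-3.13 Pythons the same timerfd syscalls can also be reached via ctypes, without waiting for the Cython move. A minimal Linux-only sketch, assuming glibc and a 64-bit time_t (the struct layout is hand-written here and would need checking per ABI before real use):

```python
import ctypes
import ctypes.util
import os
import struct

CLOCK_MONOTONIC = 1  # value from <linux/time.h>

class timespec(ctypes.Structure):
    # assumes time_t and long are both 64-bit (e.g. x86_64)
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]

class itimerspec(ctypes.Structure):
    _fields_ = [("it_interval", timespec), ("it_value", timespec)]

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def timerfd_create(clockid=CLOCK_MONOTONIC):
    fd = libc.timerfd_create(clockid, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))
    return fd

def timerfd_settime_ns(fd, initial, interval):
    spec = itimerspec(
        it_interval=timespec(interval // 10**9, interval % 10**9),
        it_value=timespec(initial // 10**9, initial % 10**9))
    if libc.timerfd_settime(fd, 0, ctypes.byref(spec), None) < 0:
        raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))

if __name__ == "__main__":
    fd = timerfd_create()
    timerfd_settime_ns(fd, initial=10**6, interval=10**6)  # 1 ms tick
    # A blocking read returns the number of expirations as a uint64.
    (ticks,) = struct.unpack("Q", os.read(fd, 8))
    assert ticks >= 1
    os.close(fd)
```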
Change-Id: Iaa675c95059ec8ccfad667f69984d5a7f608c249
---
M src/target/trx_toolkit/.gitignore
M src/target/trx_toolkit/clck_gen.py
M src/target/trx_toolkit/fake_trx.py
M src/target/trx_toolkit/test_clck_gen.py
4 files changed, 63 insertions(+), 65 deletions(-)
git pull ssh://gerrit.osmocom.org:29418/osmocom-bb refs/changes/47/40047/1
diff --git a/src/target/trx_toolkit/.gitignore b/src/target/trx_toolkit/.gitignore
index 749ccda..dc634e2 100644
--- a/src/target/trx_toolkit/.gitignore
+++ b/src/target/trx_toolkit/.gitignore
@@ -2,3 +2,5 @@
 __pycache__/
 *.py[cod]
 *$py.class
+/perf.data*
+/flamegraph.html
diff --git a/src/target/trx_toolkit/clck_gen.py b/src/target/trx_toolkit/clck_gen.py
index f769c3e..a223572 100755
--- a/src/target/trx_toolkit/clck_gen.py
+++ b/src/target/trx_toolkit/clck_gen.py
@@ -21,25 +21,26 @@
 APP_CR_HOLDERS = [("2017-2019", "Vadim Yanitskiy <axilirator@gmail.com>")]
 import logging as log
-import threading
 import time
 import signal
 import os
+import sys

 from app_common import ApplicationBase
 from udp_link import UDPLink
 from gsm_shared import *

+
+ns = 1e-9
+us = 1e-6
+
+
 class CLCKGen:
 	# GSM TDMA definitions
 	SEC_DELAY_US = 1000 * 1000
 	GSM_FRAME_US = 4615.0
-	def __init__(self, clck_links, clck_start = 0, ind_period = 102, sched_rr_prio = None):
-		# This event is needed to control the thread
-		self._breaker = threading.Event()
-		self._thread = None
-
+	def __init__(self, clck_links, clck_start = 0, ind_period = 102):
 		self.clck_links = clck_links
 		self.ind_period = ind_period
 		self.clck_start = clck_start
@@ -47,71 +48,56 @@
 		# Calculate counter time
 		self.ctr_interval = self.GSM_FRAME_US
 		self.ctr_interval /= self.SEC_DELAY_US
+		self._t_tick = int(self.ctr_interval // ns)
+
+		# Initialize timer fd
+		self._timerfd = os.timerfd_create(time.CLOCK_MONOTONIC)

 		# (Optional) clock consumer
 		self.clck_handler = None

-		# RR Scheduler priority of thread. None = don't set it.
-		self.sched_rr_prio = sched_rr_prio
+	def __del__(self):
+		os.close(self._timerfd)
 	@property
 	def running(self):
-		if self._thread is None:
-			return False
-		return self._thread.is_alive()
+		t_next, _ = os.timerfd_gettime_ns(self._timerfd)
+		return (t_next != 0)

 	def start(self):
-		# Make sure we won't start two threads
-		assert(self._thread is None)
-
 		# (Re)set the clock counter
 		self.clck_src = self.clck_start

-		# Initialize and start a new thread
-		self._thread = threading.Thread(target = self._worker)
-		self._thread.daemon = True
-		self._thread.start()
+		# start timer fd
+		os.timerfd_settime_ns(self._timerfd, initial=self._t_tick, interval=self._t_tick)

 	def stop(self):
-		# No thread, no problem ;)
-		if self._thread is None:
-			return
+		# stop timer fd
+		os.timerfd_settime_ns(self._timerfd, initial=0, interval=0)
-		# Stop the thread first
-		self._breaker.set()
-		self._thread.join()
-
-		# Free memory, reset breaker
-		self._thread = None
-		self._breaker.clear()
-
-	def _worker(self):
-		if self.sched_rr_prio is not None:
-			sched_param = os.sched_param(self.sched_rr_prio)
-			try:
-				log.info("CLCKGen: Setting real time process scheduler to SCHED_RR, priority %u" % (self.sched_rr_prio))
-				os.sched_setscheduler(0, os.SCHED_RR, sched_param)
-			except OSError:
-				log.error("CLCKGen: Failed to set real time process scheduler to SCHED_RR, priority %u" % (self.sched_rr_prio))
+	# tick must be called periodically by CLCKGen user.
+	#
+	# It waits for the next GSM frame to happen and emits corresponding clock indication at that time.
+	# It also runs attached .clck_handler if there is one.
+	#
+	# It is possible to use .tick in both blocking and non-blocking ways:
+	#
+	# - without extra care .tick will block waiting for the next GSM frame as explained above,
+	# - client code can also poll/select on ._timerfd to wait for GSM frame.
+	#   After ._timerfd becomes ready it is guaranteed that the next .tick call will not block.
+	def tick(self):
 		# run .send_clck_ind() every .ctr_interval
-		# be careful not to accumulate timing error when organizing the clock loop
-		ns = 1e-9
-		us = 1e-6
-		t_tick = int(self.ctr_interval // ns)
-		t_next = time.monotonic_ns()
-		while 1:
-			t_next += t_tick
-			t = time.monotonic_ns()
-			dt = (t_next - t)
-			if dt < 0:
-				log.warning("CLCKGen: time overrun by %dus; resetting the clock" % (dt * ns // us))
-				t_next = time.monotonic_ns()
-				dt = 0
+		# NOTE timerfd is careful not to accumulate timing error when organizing the clock loop

-			if self._breaker.wait(dt * ns):
-				break
+		_ = os.read(self._timerfd, 8)
+		assert len(_) == 8, len(_)
+		ticks = int.from_bytes(_, byteorder=sys.byteorder)
+		assert ticks > 0, ticks
+		if ticks > 1:
+			log.warning("CLCKGen: time overrun by %dus; resetting the clock" % ((ticks-1)*self._t_tick * ns // us))
+			# (the kernel does clock reset by itself)

-			self.send_clck_ind()
+		self.send_clck_ind()
 	def send_clck_ind(self):
 		# We don't need to send so often
@@ -135,6 +121,8 @@
 # Just a wrapper for independent usage
 class Application(ApplicationBase):
 	def __init__(self):
+		self.stop = False
+
 		# Print copyright
 		self.app_print_copyright(APP_CR_HOLDERS)
@@ -150,13 +138,16 @@
 		self.clck = CLCKGen([self.link], ind_period = 51)
 		self.clck.start()

-		# Block unless we receive a signal
-		self.clck._thread.join()
+		while not self.stop:
+			self.clck.tick()
+
+		self.clck.stop()
+

 	def sig_handler(self, signum, frame):
 		log.info("Signal %d received" % signum)
 		if signum == signal.SIGINT:
-			self.clck.stop()
+			self.stop = True
 if __name__ == '__main__':
 	app = Application()
diff --git a/src/target/trx_toolkit/fake_trx.py b/src/target/trx_toolkit/fake_trx.py
index 711ad21..12d5b77 100755
--- a/src/target/trx_toolkit/fake_trx.py
+++ b/src/target/trx_toolkit/fake_trx.py
@@ -397,7 +397,7 @@
 		self.trx_list = TRXList()

 		# Init shared clock generator
-		self.clck_gen = CLCKGen([], sched_rr_prio = None if self.argv.sched_rr_prio is None else self.argv.sched_rr_prio + 1)
+		self.clck_gen = CLCKGen([])
 		# This method will be called on each TDMA frame
 		self.clck_gen.clck_handler = self.clck_handler

@@ -459,7 +459,7 @@
 			log.error("Failed to set real time process scheduler to SCHED_RR, priority %u" % (self.argv.sched_rr_prio))

 		# Compose list of to be monitored sockets
-		sock_list = []
+		sock_list = [self.clck_gen._timerfd]
 		for trx in self.trx_list.trx_list:
 			sock_list.append(trx.ctrl_if.sock)
 			sock_list.append(trx.data_if.sock)
@@ -469,6 +469,10 @@
 			# Wait until we get any data on any socket
 			r_event, _, _ = select.select(sock_list, [], [])

+			# clock is priority
+			if self.clck_gen._timerfd in r_event:
+				self.clck_gen.tick()
+
 			# Iterate over all transceivers
 			for trx in self.trx_list.trx_list:
 				# DATA interface
@@ -479,7 +483,7 @@
 				if trx.ctrl_if.sock in r_event:
 					trx.ctrl_if.handle_rx()

-	# This method will be called by the clock thread
+	# This method will be called by the clock generator
 	def clck_handler(self, fn):
 		# We assume that this list is immutable at run-time
 		for trx in self.trx_list.trx_list:
diff --git a/src/target/trx_toolkit/test_clck_gen.py b/src/target/trx_toolkit/test_clck_gen.py
index 92e5e55..3eaa4de 100644
--- a/src/target/trx_toolkit/test_clck_gen.py
+++ b/src/target/trx_toolkit/test_clck_gen.py
@@ -14,7 +14,6 @@
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 # GNU General Public License for more details.
-import threading
 import time
 import unittest

@@ -32,23 +31,25 @@
 		ntick = 200  # ~ 1 s
 		tick = 0
 		tstart = tend = None
-		done = threading.Event()
+		done = False

 		def _(fn):
-			nonlocal tick, tstart, tend
+			nonlocal tick, tstart, tend, done
 			if tick == 0:
 				tstart = time.monotonic()
 			if tick == ntick:
 				tend = time.monotonic()
-				done.set()
+				done = True
 			tick += 1
 			time.sleep(clck.ctr_interval / 2)
 		clck.clck_handler = _

 		clck.start()
 		try:
-			ok = done.wait(10)
-			self.assertTrue(ok, "clck_gen stuck")
+			t0 = time.monotonic()
+			while not done:
+				self.assertTrue(time.monotonic() - t0 < 10, "clck_gen stuck")
+				clck.tick()

 		self.assertIsNotNone(tstart)
 		self.assertIsNotNone(tend)