kirr has uploaded this change for review. ( https://gerrit.osmocom.org/c/osmocom-bb/+/40047?usp=email )
Change subject: trx_toolkit/clck_gen: Don't use threads because Python GIL is latency killer
......................................................................
trx_toolkit/clck_gen: Don't use threads because Python GIL is latency killer
Currently, running fake_trx.py with just 1 BTS and 1 ccch_scan shows rather high CPU usage on 2 cores and regular time overruns in CLCKGen:
1 BTS + 1 ccch_scan:
2 threads: first ~ 85-90% CPU, second ~ 50% CPU; regular time overruns in CLCKGen
Trying to run fake_trx.py with 1 BTS and 2 ccch_scan shows even more CPU usage and non-stop CLCKGen time overruns:
1 BTS + 2 ccch_scan:
2 threads: first ~ 100% CPU, second ~ 70% CPU; time overrun all the time
with the BTS also emitting something like the following all the time:
20250416102139774 <0006> scheduler_trx.c:591 We were 42 FN faster than TRX, compensating
20250416102140358 <0006> scheduler_trx.c:591 We were 24 FN faster than TRX, compensating
Profiling fake_trx.py for the 1 BTS + 1 ccch_scan case via perf (see https://docs.python.org/3/howto/perf_profiling.html) as
$ perf record -g python -X perf fake_trx.py --trx M2@127.0.0.1:7700 --trx M3@127.0.0.1:8700
shows that, besides in-kernel work related to send/recv/select syscalls, the system is also releasing/reacquiring the GIL frequently:
http://navytux.spb.ru/~kirr/osmo/fake_trx/gil-pingpong.html (search for "gil" and "Thread" there)
This is understandable, because in the current architecture there are 2 threads:

- T1 is running CLCKGen loop + TX work
- T2 is running FakeTRX loop + RX work
Unfortunately, even though the functions that take/release the GIL do not themselves account for much time in the profile, the GIL still creates a serious latency problem. And the fake_trx problem of CLCKGen time overruns is directly a latency problem: an overrun means that CLCKGen could not be woken up in time and missed 2 GSM frames, failing to forward queued bursts timely.
The GIL creates latency problems for the following reason: in py2 the interpreter checked for a pending GIL request every 100 bytecode instructions and, if another python thread had made such a request, performed the switch. In py3 the way the GIL works was changed: instead of checking every 100 instructions, a thread may now hold the GIL for up to 5 _milliseconds_ before being preempted, in order to avoid unnecessary switches. This helped the throughput of computation-heavy workloads, but harmed latency-sensitive ones.
Please see the following presentation by David Beazley for an overview of how the GIL works on py2 and py3:
https://speakerdeck.com/dabeaz/understanding-the-python-gil https://dabeaz.com/GIL/
That presentation actually mentions the problem of GIL latency on py3; it is further explained at
https://www.dabeaz.com/blog/2010/02/revisiting-thread-priorities-and-new.htm...
That problem was filed to the Python issue tracker in the same year, 2010:
https://bugs.python.org/issue7946
but was never addressed despite several proposed patches. In the end the issue was closed by the original reporter ten years later:
https://bugs.python.org/issue7946#msg377865
So, after many years of this problem being known and several proposed patches never being merged, there is practically no hope that it will be addressed at all. What happens in the fake_trx case is that the CLCKGen thread can be prevented from being woken up timely when the main loop thread is e.g. busy decoding received bursts. That is why we see lots of time overruns: the Python GIL alone can practically cost us 5 milliseconds. On top of that, threads and frequent switching between them bring further costs, e.g. the 2 cores need to ping-pong data between their caches.
-> Drop all that threading to avoid the latency hits, and hook the CLCKGen timer into the sole IO loop by way of the timerfd system call.
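The general shape of that single-loop design can be sketched in portable Python even without timerfd, by deriving the select() timeout from an absolute deadline. This is an illustrative sketch, not the actual fake_trx code; the function and callback names are made up:

```python
import select
import time

GSM_FRAME_US = 4615.0
TICK_NS = int(GSM_FRAME_US * 1000)  # GSM frame period in ns

def io_loop(sock_list, on_tick, on_rx, nticks):
    # One loop serves both the clock and the sockets, so no GIL
    # hand-off between threads is ever needed.  Each deadline is
    # derived from the previous one (not from "now"), so timing
    # error does not accumulate across iterations.
    t_next = time.monotonic_ns() + TICK_NS
    ticks = 0
    while ticks < nticks:
        dt = t_next - time.monotonic_ns()
        r, _, _ = select.select(sock_list, [], [], max(dt, 0) / 1e9)
        if time.monotonic_ns() >= t_next:
            on_tick()          # clock is priority: emit the indication first
            ticks += 1
            t_next += TICK_NS
        for sock in r:
            on_rx(sock)        # then serve whatever sockets are ready

if __name__ == "__main__":
    # With no sockets, 10 ticks should take ~ 10 * 4.615 ms = ~46 ms.
    t0 = time.monotonic()
    io_loop([], lambda: None, lambda s: None, 10)
    print("10 ticks in %.1f ms" % ((time.monotonic() - t0) * 1e3))
```

The actual patch replaces the manual deadline bookkeeping with a timerfd, which lets the kernel do the same non-accumulating arithmetic and makes the clock just another fd in select().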
This helps latency and CPU usage significantly. After this patch:
1 BTS + 1 ccch_scan:
1 thread ~ 55% CPU; no CLCKGen time overruns
1 BTS + 2 ccch_scan:
1 thread ~ 65% CPU; seldom CLCKGen time overruns
Unfortunately os.timerfd_create() & friends are only available starting from Python 3.13. If this poses a problem, it can be easily solved by doing those timerfd-related system calls in Cython, after we switch most of the codebase to Cython later.
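For pre-3.13 Pythons the same timerfd syscalls can also be reached via ctypes, without waiting for the Cython move. A minimal Linux-only sketch, assuming glibc and a 64-bit time_t (the struct layout is hand-written here and would need checking per ABI before real use):

```python
import ctypes
import ctypes.util
import os
import struct

CLOCK_MONOTONIC = 1  # value from <linux/time.h>

class timespec(ctypes.Structure):
    # assumes time_t and long are both 64-bit (e.g. x86_64)
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]

class itimerspec(ctypes.Structure):
    _fields_ = [("it_interval", timespec), ("it_value", timespec)]

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def timerfd_create(clockid=CLOCK_MONOTONIC):
    fd = libc.timerfd_create(clockid, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))
    return fd

def timerfd_settime_ns(fd, initial, interval):
    spec = itimerspec(
        it_interval=timespec(interval // 10**9, interval % 10**9),
        it_value=timespec(initial // 10**9, initial % 10**9))
    if libc.timerfd_settime(fd, 0, ctypes.byref(spec), None) < 0:
        raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))

if __name__ == "__main__":
    fd = timerfd_create()
    timerfd_settime_ns(fd, initial=10**6, interval=10**6)  # 1 ms tick
    # A blocking read returns the number of expirations as a uint64.
    (ticks,) = struct.unpack("Q", os.read(fd, 8))
    assert ticks >= 1
    os.close(fd)
```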
Change-Id: Iaa675c95059ec8ccfad667f69984d5a7f608c249
---
M src/target/trx_toolkit/.gitignore
M src/target/trx_toolkit/clck_gen.py
M src/target/trx_toolkit/fake_trx.py
M src/target/trx_toolkit/test_clck_gen.py
4 files changed, 63 insertions(+), 65 deletions(-)
git pull ssh://gerrit.osmocom.org:29418/osmocom-bb refs/changes/47/40047/1
diff --git a/src/target/trx_toolkit/.gitignore b/src/target/trx_toolkit/.gitignore
index 749ccda..dc634e2 100644
--- a/src/target/trx_toolkit/.gitignore
+++ b/src/target/trx_toolkit/.gitignore
@@ -2,3 +2,5 @@
 __pycache__/
 *.py[cod]
 *$py.class
+/perf.data*
+/flamegraph.html
diff --git a/src/target/trx_toolkit/clck_gen.py b/src/target/trx_toolkit/clck_gen.py
index f769c3e..a223572 100755
--- a/src/target/trx_toolkit/clck_gen.py
+++ b/src/target/trx_toolkit/clck_gen.py
@@ -21,25 +21,26 @@
 APP_CR_HOLDERS = [("2017-2019", "Vadim Yanitskiy <axilirator@gmail.com>")]
 import logging as log
-import threading
 import time
 import signal
 import os
+import sys

 from app_common import ApplicationBase
 from udp_link import UDPLink
 from gsm_shared import *

+
+ns = 1e-9
+us = 1e-6
+
+
 class CLCKGen:
 	# GSM TDMA definitions
 	SEC_DELAY_US = 1000 * 1000
 	GSM_FRAME_US = 4615.0
-	def __init__(self, clck_links, clck_start = 0, ind_period = 102, sched_rr_prio = None):
-		# This event is needed to control the thread
-		self._breaker = threading.Event()
-		self._thread = None
-
+	def __init__(self, clck_links, clck_start = 0, ind_period = 102):
 		self.clck_links = clck_links
 		self.ind_period = ind_period
 		self.clck_start = clck_start
@@ -47,71 +48,56 @@
 		# Calculate counter time
 		self.ctr_interval = self.GSM_FRAME_US
 		self.ctr_interval /= self.SEC_DELAY_US
+		self._t_tick = int(self.ctr_interval // ns)
+
+		# Initialize timer fd
+		self._timerfd = os.timerfd_create(time.CLOCK_MONOTONIC)

 		# (Optional) clock consumer
 		self.clck_handler = None

-		# RR Scheduler priority of thread. None = don't set it.
-		self.sched_rr_prio = sched_rr_prio
+	def __del__(self):
+		os.close(self._timerfd)
 	@property
 	def running(self):
-		if self._thread is None:
-			return False
-		return self._thread.is_alive()
+		t_next, _ = os.timerfd_gettime_ns(self._timerfd)
+		return (t_next != 0)

 	def start(self):
-		# Make sure we won't start two threads
-		assert(self._thread is None)
-
 		# (Re)set the clock counter
 		self.clck_src = self.clck_start

-		# Initialize and start a new thread
-		self._thread = threading.Thread(target = self._worker)
-		self._thread.daemon = True
-		self._thread.start()
+		# start timer fd
+		os.timerfd_settime_ns(self._timerfd, initial=self._t_tick, interval=self._t_tick)

 	def stop(self):
-		# No thread, no problem ;)
-		if self._thread is None:
-			return
+		# stop timer fd
+		os.timerfd_settime_ns(self._timerfd, initial=0, interval=0)
-		# Stop the thread first
-		self._breaker.set()
-		self._thread.join()
-
-		# Free memory, reset breaker
-		self._thread = None
-		self._breaker.clear()
-
-	def _worker(self):
-		if self.sched_rr_prio is not None:
-			sched_param = os.sched_param(self.sched_rr_prio)
-			try:
-				log.info("CLCKGen: Setting real time process scheduler to SCHED_RR, priority %u" % (self.sched_rr_prio))
-				os.sched_setscheduler(0, os.SCHED_RR, sched_param)
-			except OSError:
-				log.error("CLCKGen: Failed to set real time process scheduler to SCHED_RR, priority %u" % (self.sched_rr_prio))
+	# tick must be called periodically by CLCKGen user.
+	#
+	# It waits for the next GSM frame to happen and emits corresponding clock indication at that time.
+	# It also runs attached .clck_handler if there is one.
+	#
+	# It is possible to use .tick in both blocking and non-blocking ways:
+	#
+	# - without extra care .tick will block waiting for the next GSM frame as explained above,
+	# - client code can also poll/select on ._timerfd to wait for GSM frame.
+	#   After ._timerfd becomes ready it is guaranteed that the next .tick call will not block.
+	def tick(self):
 		# run .send_clck_ind() every .ctr_interval
-		# be careful not to accumulate timing error when organizing the clock loop
-		ns = 1e-9
-		us = 1e-6
-		t_tick = int(self.ctr_interval // ns)
-		t_next = time.monotonic_ns()
-		while 1:
-			t_next += t_tick
-			t = time.monotonic_ns()
-			dt = (t_next - t)
-			if dt < 0:
-				log.warning("CLCKGen: time overrun by %dus; resetting the clock" % (dt * ns // us))
-				t_next = time.monotonic_ns()
-				dt = 0
+		# NOTE timerfd is careful not to accumulate timing error when organizing the clock loop

-			if self._breaker.wait(dt * ns):
-				break
+		_ = os.read(self._timerfd, 8)
+		assert len(_) == 8, len(_)
+		ticks = int.from_bytes(_, byteorder=sys.byteorder)
+		assert ticks > 0, ticks
+		if ticks > 1:
+			log.warning("CLCKGen: time overrun by %dus; resetting the clock" % ((ticks-1)*self._t_tick * ns // us))
+			# (the kernel does clock reset by itself)

-			self.send_clck_ind()
+		self.send_clck_ind()
 	def send_clck_ind(self):
 		# We don't need to send so often
@@ -135,6 +121,8 @@
 # Just a wrapper for independent usage
 class Application(ApplicationBase):
 	def __init__(self):
+		self.stop = False
+
 		# Print copyright
 		self.app_print_copyright(APP_CR_HOLDERS)
@@ -150,13 +138,16 @@
 		self.clck = CLCKGen([self.link], ind_period = 51)
 		self.clck.start()

-		# Block unless we receive a signal
-		self.clck._thread.join()
+		while not self.stop:
+			self.clck.tick()
+
+		self.clck.stop()
+

 	def sig_handler(self, signum, frame):
 		log.info("Signal %d received" % signum)
 		if signum == signal.SIGINT:
-			self.clck.stop()
+			self.stop = True
 if __name__ == '__main__':
 	app = Application()
diff --git a/src/target/trx_toolkit/fake_trx.py b/src/target/trx_toolkit/fake_trx.py
index 711ad21..12d5b77 100755
--- a/src/target/trx_toolkit/fake_trx.py
+++ b/src/target/trx_toolkit/fake_trx.py
@@ -397,7 +397,7 @@
 		self.trx_list = TRXList()

 		# Init shared clock generator
-		self.clck_gen = CLCKGen([], sched_rr_prio = None if self.argv.sched_rr_prio is None else self.argv.sched_rr_prio + 1)
+		self.clck_gen = CLCKGen([])
 		# This method will be called on each TDMA frame
 		self.clck_gen.clck_handler = self.clck_handler

@@ -459,7 +459,7 @@
 			log.error("Failed to set real time process scheduler to SCHED_RR, priority %u" % (self.argv.sched_rr_prio))

 		# Compose list of to be monitored sockets
-		sock_list = []
+		sock_list = [self.clck_gen._timerfd]
 		for trx in self.trx_list.trx_list:
 			sock_list.append(trx.ctrl_if.sock)
 			sock_list.append(trx.data_if.sock)
@@ -469,6 +469,10 @@
 			# Wait until we get any data on any socket
 			r_event, _, _ = select.select(sock_list, [], [])

+			# clock is priority
+			if self.clck_gen._timerfd in r_event:
+				self.clck_gen.tick()
+
 			# Iterate over all transceivers
 			for trx in self.trx_list.trx_list:
 				# DATA interface
@@ -479,7 +483,7 @@
 				if trx.ctrl_if.sock in r_event:
 					trx.ctrl_if.handle_rx()

-	# This method will be called by the clock thread
+	# This method will be called by the clock generator
 	def clck_handler(self, fn):
 		# We assume that this list is immutable at run-time
 		for trx in self.trx_list.trx_list:
diff --git a/src/target/trx_toolkit/test_clck_gen.py b/src/target/trx_toolkit/test_clck_gen.py
index 92e5e55..3eaa4de 100644
--- a/src/target/trx_toolkit/test_clck_gen.py
+++ b/src/target/trx_toolkit/test_clck_gen.py
@@ -14,7 +14,6 @@
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 # GNU General Public License for more details.
-import threading
 import time
 import unittest

@@ -32,23 +31,25 @@
 		ntick = 200  # ~ 1 s
 		tick = 0
 		tstart = tend = None
-		done = threading.Event()
+		done = False

 		def _(fn):
-			nonlocal tick, tstart, tend
+			nonlocal tick, tstart, tend, done
 			if tick == 0:
 				tstart = time.monotonic()
 			if tick == ntick:
 				tend = time.monotonic()
-				done.set()
+				done = True
 			tick += 1
 			time.sleep(clck.ctr_interval / 2)
 		clck.clck_handler = _

 		clck.start()
 		try:
-			ok = done.wait(10)
-			self.assertTrue(ok, "clck_gen stuck")
+			t0 = time.monotonic()
+			while not done:
+				self.assertTrue(time.monotonic() - t0 < 10, "clck_gen stuck")
+				clck.tick()

 		self.assertIsNotNone(tstart)
 		self.assertIsNotNone(tend)