kirr has uploaded this change for review.


trx_toolkit/clck_gen: Don't use threads because Python GIL is latency killer

Currently, running fake_trx.py with just 1 BTS and 1 ccch_scan shows fairly
high CPU usage on 2 cores and regular time overruns in CLCKGen:

1 BTS + 1 ccch_scan:

2 threads: first ~ 85-90% CPU, second ~ 50% CPU; regular time overruns in CLCKGen

Trying to run fake_trx.py with 1 BTS and 2 ccch_scan shows even more CPU
usage and non-stop CLCKGen time overruns:

1 BTS + 2 ccch_scan:

2 threads: first ~ 100% CPU, second ~ 70% CPU; time overrun all the time

with the BTS also emitting messages like the following all the time:

20250416102139774 <0006> scheduler_trx.c:591 We were 42 FN faster than TRX, compensating
20250416102140358 <0006> scheduler_trx.c:591 We were 24 FN faster than TRX, compensating

Profiling fake_trx.py for the 1 BTS + 1 ccch_scan case via perf (see
https://docs.python.org/3/howto/perf_profiling.html) as

$ perf record -g python -X perf fake_trx.py --trx M2@127.0.0.1:7700 --trx M3@127.0.0.1:8700

shows that, besides in-kernel work related to send/recv/select syscalls,
the system is also frequently releasing/reacquiring the GIL:

http://navytux.spb.ru/~kirr/osmo/fake_trx/gil-pingpong.html (search for "gil" and "Thread" there)

This is understandable because in the current architecture there are 2
threads:

- T1 is running CLCKGen loop + TX work
- T2 is running FakeTRX loop + RX work

Unfortunately, even though the functions that take/release the GIL are
not themselves large in the profile, the GIL still creates a huge latency
problem, and the fake_trx CLCKGen time overruns are directly related to
latency: an overrun means that CLCKGen could not be woken up in time and
missed 2 GSM frames during which it should have forwarded queued bursts.

The GIL creates latency problems for the following reason: in py2 the
main interpreter evaluator checked for a GIL-acquisition request every
100 bytecode instructions, and if another python thread had posted such
a request, it made the switch. In py3, however, the way the GIL works
was changed: instead of switching every 100 instructions, the GIL was
made to preemptively switch every 5 _milliseconds_ to avoid unnecessary
switches. This helped the throughput of computation-heavy workloads, but
harmed workloads that are latency sensitive.
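As a small standalone illustration (not part of this change), the py3
switch interval is visible and tunable via sys.getswitchinterval() /
sys.setswitchinterval(); lowering it is only a partial mitigation, since
it shortens the stall but does not fix the wakeup-ordering problem:

```python
import sys

# py3 exposes the GIL preemption interval directly;
# the default is 5 ms (0.005 s).
print(sys.getswitchinterval())

# It can be lowered to trade throughput for latency, but this only
# shortens the worst-case stall, it does not remove it.
sys.setswitchinterval(0.0005)
print(sys.getswitchinterval())
```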

Please see the following presentation by David Beazley with an overview
of how the GIL works on py2 and py3:

https://speakerdeck.com/dabeaz/understanding-the-python-gil
https://dabeaz.com/GIL/

That presentation actually mentions the GIL latency problem on py3,
which is further explained at

https://www.dabeaz.com/blog/2010/02/revisiting-thread-priorities-and-new.html

That problem was filed in the Python issue tracker in the same year, 2010:

https://bugs.python.org/issue7946

but was never addressed despite several proposed patches. In the end the
issue was closed by the original reporter ten years later:

https://bugs.python.org/issue7946#msg377865

So, despite this problem being known for many years and several patches
having been proposed, there is practically no hope that it will ever be
addressed. What happens in the fake_trx case is that the CLCKGen thread
can be prevented from being woken up in time if the main-loop thread is
e.g. busy decoding received bursts. That's why we see lots of time
overruns: the Python GIL alone can practically waste 5 milliseconds. Not
to mention that there are more problems related to threads and frequent
switching between them, e.g. the 2 cores needing to ping-pong data
between their caches and similar things.
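Why a 5 ms stall translates into missed frames can be seen from simple
arithmetic with the constants from clck_gen.py; this is just an
illustration, not code from the patch:

```python
# GSM TDMA frame duration used by CLCKGen (clck_gen.py: GSM_FRAME_US)
GSM_FRAME_US = 4615.0

# worst-case single GIL preemption stall on py3
GIL_STALL_US = 5 * 1000

# one stall already spans more than one whole frame period, so a clock
# thread that slept through it straddles at least two frame boundaries
print(GIL_STALL_US / GSM_FRAME_US)
```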

-> cancel all that threading to avoid latency hits and organize the
CLCKGen timer to be hooked into the sole IO loop by way of the timerfd
system call.
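The resulting scheme can be sketched as follows; this is an illustrative
standalone sketch (function and variable names are mine, not from the
patch) of a periodic timerfd driven from a select-based IO loop. It
requires Linux and Python >= 3.13 and degrades to returning None
elsewhere:

```python
import os
import select
import sys
import time

def run_ticks(n, interval_ns=4_615_000):
    """Drive n periodic ticks from a select loop via timerfd.

    Sketch only; needs Linux + Python >= 3.13 (os.timerfd_create & friends).
    """
    if not hasattr(os, "timerfd_create"):
        return None                     # timerfd API not available
    tfd = os.timerfd_create(time.CLOCK_MONOTONIC)
    try:
        # arm the timer: first expiry and period = one GSM frame
        os.timerfd_settime_ns(tfd, initial=interval_ns, interval=interval_ns)
        ticks = 0
        while ticks < n:
            # the timer fd is just one more fd in the IO loop, next to
            # the TRX data/ctrl sockets in the fake_trx case
            r_event, _, _ = select.select([tfd], [], [])
            if tfd in r_event:
                # the 8-byte read yields the expiration count since the
                # last read; a value > 1 means some periods were overrun,
                # and the kernel resets the clock by itself
                buf = os.read(tfd, 8)
                ticks += int.from_bytes(buf, sys.byteorder)
    finally:
        os.close(tfd)
    return ticks

if __name__ == "__main__":
    print(run_ticks(5))
```

Because the kernel keeps counting expirations on an absolute schedule,
the loop does not accumulate timing error and overruns are detected for
free from the read count.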

This helps latency and CPU usage significantly. After this patch:

1 BTS + 1 ccch_scan:

1 thread ~ 55% CPU; no CLCKGen time overruns

1 BTS + 2 ccch_scan:

1 thread ~ 65% CPU; seldom CLCKGen time overruns

Unfortunately, os.timerfd_create() & friends are only available starting
from Python 3.13. If this poses a problem, it can be easily solved by
doing those timerfd-related system calls in Cython, after we switch most
of the codebase to Cython later.

Change-Id: Iaa675c95059ec8ccfad667f69984d5a7f608c249
---
M src/target/trx_toolkit/.gitignore
M src/target/trx_toolkit/clck_gen.py
M src/target/trx_toolkit/fake_trx.py
M src/target/trx_toolkit/test_clck_gen.py
4 files changed, 63 insertions(+), 65 deletions(-)

git pull ssh://gerrit.osmocom.org:29418/osmocom-bb refs/changes/47/40047/1
diff --git a/src/target/trx_toolkit/.gitignore b/src/target/trx_toolkit/.gitignore
index 749ccda..dc634e2 100644
--- a/src/target/trx_toolkit/.gitignore
+++ b/src/target/trx_toolkit/.gitignore
@@ -2,3 +2,5 @@
__pycache__/
*.py[cod]
*$py.class
+/perf.data*
+/flamegraph.html
diff --git a/src/target/trx_toolkit/clck_gen.py b/src/target/trx_toolkit/clck_gen.py
index f769c3e..a223572 100755
--- a/src/target/trx_toolkit/clck_gen.py
+++ b/src/target/trx_toolkit/clck_gen.py
@@ -21,25 +21,26 @@
APP_CR_HOLDERS = [("2017-2019", "Vadim Yanitskiy <axilirator@gmail.com>")]

import logging as log
-import threading
import time
import signal
import os
+import sys

from app_common import ApplicationBase
from udp_link import UDPLink
from gsm_shared import *

+
+ns = 1e-9
+us = 1e-6
+
+
class CLCKGen:
# GSM TDMA definitions
SEC_DELAY_US = 1000 * 1000
GSM_FRAME_US = 4615.0

- def __init__(self, clck_links, clck_start = 0, ind_period = 102, sched_rr_prio = None):
- # This event is needed to control the thread
- self._breaker = threading.Event()
- self._thread = None
-
+ def __init__(self, clck_links, clck_start = 0, ind_period = 102):
self.clck_links = clck_links
self.ind_period = ind_period
self.clck_start = clck_start
@@ -47,71 +48,56 @@
# Calculate counter time
self.ctr_interval = self.GSM_FRAME_US
self.ctr_interval /= self.SEC_DELAY_US
+ self._t_tick = int(self.ctr_interval // ns)
+
+ # Initialize timer fd
+ self._timerfd = os.timerfd_create(time.CLOCK_MONOTONIC)

# (Optional) clock consumer
self.clck_handler = None

- # RR Scheduler priority of thread. None = don't set it.
- self.sched_rr_prio = sched_rr_prio
+ def __del__(self):
+ os.close(self._timerfd)

@property
def running(self):
- if self._thread is None:
- return False
- return self._thread.is_alive()
+ t_next, _ = os.timerfd_gettime_ns(self._timerfd)
+ return (t_next != 0)

def start(self):
- # Make sure we won't start two threads
- assert(self._thread is None)
-
# (Re)set the clock counter
self.clck_src = self.clck_start

- # Initialize and start a new thread
- self._thread = threading.Thread(target = self._worker)
- self._thread.daemon = True
- self._thread.start()
+ # start timer fd
+ os.timerfd_settime_ns(self._timerfd, initial=self._t_tick, interval=self._t_tick)

def stop(self):
- # No thread, no problem ;)
- if self._thread is None:
- return
+ # stop timer fd
+ os.timerfd_settime_ns(self._timerfd, initial=0, interval=0)

- # Stop the thread first
- self._breaker.set()
- self._thread.join()
-
- # Free memory, reset breaker
- self._thread = None
- self._breaker.clear()
-
- def _worker(self):
- if self.sched_rr_prio is not None:
- sched_param = os.sched_param(self.sched_rr_prio)
- try:
- log.info("CLCKGen: Setting real time process scheduler to SCHED_RR, priority %u" % (self.sched_rr_prio))
- os.sched_setscheduler(0, os.SCHED_RR, sched_param)
- except OSError:
- log.error("CLCKGen: Failed to set real time process scheduler to SCHED_RR, priority %u" % (self.sched_rr_prio))
+ # tick must be called periodically by CLCKGen user.
+ #
+ # It waits for the next GSM frame to happen and emits corresponding clock indication at that time.
+ # It also runs attached .clck_handler if there is one.
+ #
+ # It is possible to use .tick in both blocking and non-blocking ways:
+ #
+ # - without extra care .tick will block waiting for the next GSM frame as explained above,
+ # - client code can also poll/select on ._timerfd to wait for GSM frame.
+ # After ._timerfd becomes ready it is guaranteed that the next .tick call will not block.
+ def tick(self):
# run .send_clck_ind() every .ctr_interval
- # be careful not to accumulate timing error when organizing the clock loop
- ns = 1e-9
- us = 1e-6
- t_tick = int(self.ctr_interval // ns)
- t_next = time.monotonic_ns()
- while 1:
- t_next += t_tick
- t = time.monotonic_ns()
- dt = (t_next - t)
- if dt < 0:
- log.warning("CLCKGen: time overrun by %dus; resetting the clock" % (dt * ns // us))
- t_next = time.monotonic_ns()
- dt = 0
+ # NOTE timerfd is careful not to accumulate timing error when organizing the clock loop

- if self._breaker.wait(dt * ns):
- break
+ _ = os.read(self._timerfd, 8)
+ assert len(_) == 8, len(_)
+ ticks = int.from_bytes(_, byteorder=sys.byteorder)
+ assert ticks > 0, ticks
+ if ticks > 1:
+ log.warning("CLCKGen: time overrun by %dus; resetting the clock" % ((ticks-1)*self._t_tick * ns // us))
+ # (the kernel does clock reset by itself)

- self.send_clck_ind()
+ self.send_clck_ind()

def send_clck_ind(self):
# We don't need to send so often
@@ -135,6 +121,8 @@
# Just a wrapper for independent usage
class Application(ApplicationBase):
def __init__(self):
+ self.stop = False
+
# Print copyright
self.app_print_copyright(APP_CR_HOLDERS)

@@ -150,13 +138,16 @@
self.clck = CLCKGen([self.link], ind_period = 51)
self.clck.start()

- # Block unless we receive a signal
- self.clck._thread.join()
+ while not self.stop:
+ self.clck.tick()
+
+ self.clck.stop()
+

def sig_handler(self, signum, frame):
log.info("Signal %d received" % signum)
if signum == signal.SIGINT:
- self.clck.stop()
+ self.stop = True

if __name__ == '__main__':
app = Application()
diff --git a/src/target/trx_toolkit/fake_trx.py b/src/target/trx_toolkit/fake_trx.py
index 711ad21..12d5b77 100755
--- a/src/target/trx_toolkit/fake_trx.py
+++ b/src/target/trx_toolkit/fake_trx.py
@@ -397,7 +397,7 @@
self.trx_list = TRXList()

# Init shared clock generator
- self.clck_gen = CLCKGen([], sched_rr_prio = None if self.argv.sched_rr_prio is None else self.argv.sched_rr_prio + 1)
+ self.clck_gen = CLCKGen([])
# This method will be called on each TDMA frame
self.clck_gen.clck_handler = self.clck_handler

@@ -459,7 +459,7 @@
log.error("Failed to set real time process scheduler to SCHED_RR, priority %u" % (self.argv.sched_rr_prio))

# Compose list of to be monitored sockets
- sock_list = []
+ sock_list = [self.clck_gen._timerfd]
for trx in self.trx_list.trx_list:
sock_list.append(trx.ctrl_if.sock)
sock_list.append(trx.data_if.sock)
@@ -469,6 +469,10 @@
# Wait until we get any data on any socket
r_event, _, _ = select.select(sock_list, [], [])

+ # clock is priority
+ if self.clck_gen._timerfd in r_event:
+ self.clck_gen.tick()
+
# Iterate over all transceivers
for trx in self.trx_list.trx_list:
# DATA interface
@@ -479,7 +483,7 @@
if trx.ctrl_if.sock in r_event:
trx.ctrl_if.handle_rx()

- # This method will be called by the clock thread
+ # This method will be called by the clock generator
def clck_handler(self, fn):
# We assume that this list is immutable at run-time
for trx in self.trx_list.trx_list:
diff --git a/src/target/trx_toolkit/test_clck_gen.py b/src/target/trx_toolkit/test_clck_gen.py
index 92e5e55..3eaa4de 100644
--- a/src/target/trx_toolkit/test_clck_gen.py
+++ b/src/target/trx_toolkit/test_clck_gen.py
@@ -14,7 +14,6 @@
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

-import threading
import time
import unittest

@@ -32,23 +31,25 @@
ntick = 200 # ~ 1 s
tick = 0
tstart = tend = None
- done = threading.Event()
+ done = False

def _(fn):
- nonlocal tick, tstart, tend
+ nonlocal tick, tstart, tend, done
if tick == 0:
tstart = time.monotonic()
if tick == ntick:
tend = time.monotonic()
- done.set()
+ done = True
tick += 1
time.sleep(clck.ctr_interval / 2)
clck.clck_handler = _

clck.start()
try:
- ok = done.wait(10)
- self.assertTrue(ok, "clck_gen stuck")
+ t0 = time.monotonic()
+ while not done:
+ self.assertTrue(time.monotonic() - t0 < 10, "clck_gen stuck")
+ clck.tick()

self.assertIsNotNone(tstart)
self.assertIsNotNone(tend)


Gerrit-MessageType: newchange
Gerrit-Project: osmocom-bb
Gerrit-Branch: master
Gerrit-Change-Id: Iaa675c95059ec8ccfad667f69984d5a7f608c249
Gerrit-Change-Number: 40047
Gerrit-PatchSet: 1
Gerrit-Owner: kirr <kirr@nexedi.com>
Gerrit-CC: fixeria <vyanitskiy@sysmocom.de>
Gerrit-CC: osmith <osmith@sysmocom.de>
Gerrit-CC: pespin <pespin@sysmocom.de>