 
            Hi!
From time to time we've been discussing logging related stalling of osmocom
programs, particularly in the context of the stderr log target.
In general, when we log to stderr, we perform a blocking, buffered write. So any slow output on stderr will slow down the entire osmocom program.
I've taken some time to play with this using bpftrace and usdt (static userspace probes) in order to determine the latency of log statements in osmocom programs.
My test program doesn't do anything else but an endless LOGP() of verying length (from 1..512 characters). It was executed on a Lenovo Thinkpad X260 (i7-6600U, 16GB RAM, SATA SSD with btrfs on top of dm-crypt).
I've compared * launching the program in uxterm * launching the program in gnome-terminal * launching the program in screen (detached) * launching the program in screen (attached; printing via uxterm) * launching the program with 2>/dev/null * launching the program with 2>/file/on/btrfs/on/ssd * launching the program from systemd (with journald) * replacing stderr logging with gsmtap log target (without wqueue)
The resulting histograms of the time spent for writing the logs are below.
The git repo containing the test program, the libosmocore usdt patch and the bpftrace program can be found at git://git.osmocom.org/laforge/osmo-logbench.git
== launching the program in uxterm
@usecs: [2, 4) 30 | | [4, 8) 199963 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [8, 16) 82294 |@@@@@@@@@@@@@@@@@@@@@ | [16, 32) 6897 |@ | [32, 64) 1649 | | [64, 128) 57 | | [128, 256) 221 | | [256, 512) 8588 |@@ | [512, 1K) 13062 |@@@ | [1K, 2K) 144 | | [2K, 4K) 14 | | [4K, 8K) 3 | | [8K, 16K) 0 | | [16K, 32K) 0 | | [32K, 64K) 0 | | [64K, 128K) 0 | | [128K, 256K) 0 | | [256K, 512K) 0 | | [512K, 1M) 0 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 0 | | [8M, 16M) 0 | | [16M, 32M) 0 | | [32M, 64M) 0 | | [64M, 128M) 0 | | [128M, 256M) 0 | | [256M, 512M) 0 | | [512M, 1G) 28 | |
== launching the program in gnome-terminal
@usecs: [2, 4) 32387 |@@@ | [4, 8) 457668 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [8, 16) 44408 |@@@@@ | [16, 32) 466 | | [32, 64) 63 | | [64, 128) 9 | | [128, 256) 3 | | [256, 512) 53 | | [512, 1K) 0 | | [1K, 2K) 0 | | [2K, 4K) 0 | | [4K, 8K) 1 | | [8K, 16K) 0 | | [16K, 32K) 0 | | [32K, 64K) 39 | | [64K, 128K) 125 | | [128K, 256K) 0 | | [256K, 512K) 0 | | [512K, 1M) 0 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 0 | | [8M, 16M) 0 | | [16M, 32M) 0 | | [32M, 64M) 0 | | [64M, 128M) 0 | | [128M, 256M) 0 | | [256M, 512M) 0 | | [512M, 1G) 0 | | [1G, 2G) 1 | |
So we can see that gnome-terminal tends to be faster on most stderr-writes, but at the expens of introducing a few long-delays in the 32..128ms bucket.
== launching the program in screen (detached)
@usecs: [2, 4) 107649 |@@@@@@@@ | [4, 8) 699672 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [8, 16) 15262 |@ | [16, 32) 622 | | [32, 64) 5840 | | [64, 128) 42728 |@@@ | [128, 256) 14000 |@ | [256, 512) 68 | | [512, 1K) 3 | | [1K, 2K) 0 | | [2K, 4K) 1 | | [4K, 8K) 0 | | [8K, 16K) 0 | | [16K, 32K) 0 | | [32K, 64K) 0 | | [64K, 128K) 0 | | [128K, 256K) 0 | | [256K, 512K) 0 | | [512K, 1M) 0 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 0 | | [8M, 16M) 0 | | [16M, 32M) 0 | | [32M, 64M) 0 | | [64M, 128M) 0 | | [128M, 256M) 0 | | [256M, 512M) 0 | | [512M, 1G) 0 | | [1G, 2G) 50 | |
* launching the program in screen (attached; printing via uxterm)
@usecs: [2, 4) 2 | | [4, 8) 67887 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [8, 16) 30378 |@@@@@@@@@@@@@@@@@@@@@@@ | [16, 32) 58881 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32, 64) 5188 |@@@ | [64, 128) 301 | | [128, 256) 61 | | [256, 512) 3963 |@@@ | [512, 1K) 7603 |@@@@@ | [1K, 2K) 407 | | [2K, 4K) 42 | | [4K, 8K) 20 | | [8K, 16K) 0 | | [16K, 32K) 0 | | [32K, 64K) 0 | | [64K, 128K) 0 | | [128K, 256K) 0 | | [256K, 512K) 0 | | [512K, 1M) 0 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 0 | | [8M, 16M) 0 | | [16M, 32M) 0 | | [32M, 64M) 0 | | [64M, 128M) 0 | | [128M, 256M) 0 | | [256M, 512M) 0 | | [512M, 1G) 0 | | [1G, 2G) 1 | |
so intrestingly, attaching the screen will significantly slow down stderr. It seems screen doesn't do sufficient buffering internally, or it blocks the osmocom program until not only it has updated its internal buffer, but also written to the output of the attached uxterm?
== launching the program with 2>/dev/null
@usecs: [1] 2202080 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [2, 4) 26966 | | [4, 8) 1107 | | [8, 16) 305 | | [16, 32) 10 | | [32, 64) 6 | | [64, 128) 0 | | [128, 256) 0 | | [256, 512) 1 | | [512, 1K) 0 | | [1K, 2K) 0 | | [2K, 4K) 0 | | [4K, 8K) 0 | | [8K, 16K) 0 | | [16K, 32K) 0 | | [32K, 64K) 0 | | [64K, 128K) 0 | | [128K, 256K) 0 | | [256K, 512K) 0 | | [512K, 1M) 0 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 0 | | [8M, 16M) 0 | | [16M, 32M) 0 | | [32M, 64M) 0 | | [64M, 128M) 0 | | [128M, 256M) 0 | | [256M, 512M) 0 | | [512M, 1G) 0 | | [1G, 2G) 153 | |
unsurprisingly, this turns out to be rather fast
== launching the program with 2>/file/on/btrfs/on/ssd
@usecs: [2, 4) 2275464 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [4, 8) 265612 |@@@@@@ | [8, 16) 9384 | | [16, 32) 368 | | [32, 64) 4 | | [64, 128) 24 | | [128, 256) 3 | | [256, 512) 1 | | [512, 1K) 0 | | [1K, 2K) 0 | | [2K, 4K) 0 | | [4K, 8K) 0 | | [8K, 16K) 0 | | [16K, 32K) 0 | | [32K, 64K) 0 | | [64K, 128K) 0 | | [128K, 256K) 0 | | [256K, 512K) 0 | | [512K, 1M) 0 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 0 | | [8M, 16M) 0 | | [16M, 32M) 0 | | [32M, 64M) 0 | | [64M, 128M) 0 | | [128M, 256M) 0 | | [256M, 512M) 0 | | [512M, 1G) 0 | | [1G, 2G) 95 | |
I've tried creating some load using a cocnrrent dd of the underlying block device, but couldn't make out any difference. I guess this shows how good the I/O subsystem is. Yes, some writes are slower than writing to /dev/null, but
== launching the program from systemd (with journald)
[2, 4) 1060607 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [4, 8) 75133 |@@@ | [8, 16) 724 | | [16, 32) 107 | | [32, 64) 8 | | [64, 128) 8 | | [128, 256) 0 | | [256, 512) 0 | | [512, 1K) 0 | | [1K, 2K) 0 | | [2K, 4K) 0 | | [4K, 8K) 0 | | [8K, 16K) 0 | | [16K, 32K) 0 | | [32K, 64K) 0 | | [64K, 128K) 38 | | [128K, 256K) 0 | | [256K, 512K) 0 | | [512K, 1M) 0 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 0 | | [8M, 16M) 0 | | [16M, 32M) 0 | | [32M, 64M) 0 | | [64M, 128M) 0 | | [128M, 256M) 0 | | [256M, 512M) 0 | | [512M, 1G) 0 | | [1G, 2G) 1 | |
This is also pretty fast. There are some writes in the 64..128ms range, though. Those are likely going to cause problems in timing critical programs like osmo-bts, osmo-trx or osmo-mgw.
* replacing stderr logging with gsmtap log target (without wqueue)
@usecs: [4, 8) 776978 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [8, 16) 1771 | | [16, 32) 60 | | [32, 64) 0 | | [64, 128) 1 | | [128, 256) 0 | | [256, 512) 0 | | [512, 1K) 0 | | [1K, 2K) 0 | | [2K, 4K) 0 | | [4K, 8K) 0 | | [8K, 16K) 0 | | [16K, 32K) 0 | | [32K, 64K) 0 | | [64K, 128K) 0 | | [128K, 256K) 0 | | [256K, 512K) 0 | | [512K, 1M) 0 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 0 | | [8M, 16M) 0 | | [16M, 32M) 0 | | [32M, 64M) 0 | | [64M, 128M) 0 | | [128M, 256M) 0 | | [256M, 512M) 0 | | [512M, 1G) 0 | | [1G, 2G) 0 | | [2G, 4G) 82 | |
Ignoring the strange (and certainly bogus) artefacts in the 2..4s bin at the bottom, writes to the GSMTAP UDP socket tend to be quite fast, much faster than anything via stderr. Even faster than stderr-to-dev-null.
== conclusion
all-in-all, it looks not too bad at least on a relatively unloaded couple-of-years-old laptop. I would have expected worse.
We can definitely see that there are non-neglectable delays created by both systemd/journald as well as gnome-terminal.
One idea would be to play with setvbuf(3) to increase stderr buffering inside libc.
The other idea would be to do our own buffering using osmo_wqueue() on the raw, unbuffered stderr-fd, which would then be in non-blocking mode and handled vis osmocom select. We'd have to add some kind of limit for that buffer in order to avoid OOM situations.