Hi!
From time to time we've been discussing logging-related stalling of osmocom programs, particularly in the context of the stderr log target.
In general, when we log to stderr, we perform a blocking, buffered write. So any slow output on stderr will slow down the entire osmocom program.
I've taken some time to play with this using bpftrace and USDT (userspace statically defined tracing) probes in order to determine the latency of log statements in osmocom programs.
My test program does nothing but issue an endless stream of LOGP() calls of varying length (1..512 characters). It was executed on a Lenovo Thinkpad X260 (i7-6600U, 16GB RAM, SATA SSD with btrfs on top of dm-crypt).
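The core of the test program boils down to something like this (a rough reconstruction for illustration only; the actual code, the libosmocore usdt patch and the bpftrace script are in the repository linked below):

/* rough sketch of the benchmark idea, not the exact code from osmo-logbench.git */
#include <string.h>
#include <osmocom/core/talloc.h>
#include <osmocom/core/application.h>
#include <osmocom/core/logging.h>

static const struct log_info bench_log_info = {};
static char line[512 + 1];

int main(int argc, char **argv)
{
	void *ctx = talloc_named_const(NULL, 0, "logbench");
	unsigned int len;

	osmo_init_logging2(ctx, &bench_log_info);
	log_set_use_color(osmo_stderr_target, 0);

	memset(line, 'x', sizeof(line) - 1);

	/* endless LOGP() of varying length (1..512 characters); the USDT
	 * probe measures the time spent inside each LOGP() call */
	for (len = 1; ; len = (len % 512) + 1)
		LOGP(DLGLOBAL, LOGL_NOTICE, "%.*s\n", (int) len, line);

	return 0;
}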
I've compared:
* launching the program in uxterm
* launching the program in gnome-terminal
* launching the program in screen (detached)
* launching the program in screen (attached; printing via uxterm)
* launching the program with 2>/dev/null
* launching the program with 2>/file/on/btrfs/on/ssd
* launching the program from systemd (with journald)
* replacing stderr logging with gsmtap log target (without wqueue)
The resulting histograms of the time spent writing the logs are below.
The git repo containing the test program, the libosmocore usdt patch and the bpftrace program can be found at git://git.osmocom.org/laforge/osmo-logbench.git
== launching the program in uxterm
@usecs:
[2, 4) 30
[4, 8) 199963
[8, 16) 82294
[16, 32) 6897
[32, 64) 1649
[64, 128) 57
[128, 256) 221
[256, 512) 8588
[512, 1K) 13062
[1K, 2K) 144
[2K, 4K) 14
[4K, 8K) 3
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 0
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 28
== launching the program in gnome-terminal
@usecs:
[2, 4) 32387
[4, 8) 457668
[8, 16) 44408
[16, 32) 466
[32, 64) 63
[64, 128) 9
[128, 256) 3
[256, 512) 53
[512, 1K) 0
[1K, 2K) 0
[2K, 4K) 0
[4K, 8K) 1
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 39
[64K, 128K) 125
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 1
So we can see that gnome-terminal tends to be faster on most stderr writes, but at the expense of introducing a few long delays in the 32..128ms buckets.
== launching the program in screen (detached)
@usecs:
[2, 4) 107649
[4, 8) 699672
[8, 16) 15262
[16, 32) 622
[32, 64) 5840
[64, 128) 42728
[128, 256) 14000
[256, 512) 68
[512, 1K) 3
[1K, 2K) 0
[2K, 4K) 1
[4K, 8K) 0
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 0
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 50
== launching the program in screen (attached; printing via uxterm)
@usecs:
[2, 4) 2
[4, 8) 67887
[8, 16) 30378
[16, 32) 58881
[32, 64) 5188
[64, 128) 301
[128, 256) 61
[256, 512) 3963
[512, 1K) 7603
[1K, 2K) 407
[2K, 4K) 42
[4K, 8K) 20
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 0
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 1
So interestingly, attaching the screen significantly slows down stderr. It seems screen doesn't do sufficient buffering internally, or it blocks the osmocom program until it has not only updated its internal buffer but also written to the output of the attached uxterm?
== launching the program with 2>/dev/null
@usecs:
[1] 2202080
[2, 4) 26966
[4, 8) 1107
[8, 16) 305
[16, 32) 10
[32, 64) 6
[64, 128) 0
[128, 256) 0
[256, 512) 1
[512, 1K) 0
[1K, 2K) 0
[2K, 4K) 0
[4K, 8K) 0
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 0
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 153
Unsurprisingly, this turns out to be rather fast.
== launching the program with 2>/file/on/btrfs/on/ssd
@usecs:
[2, 4) 2275464
[4, 8) 265612
[8, 16) 9384
[16, 32) 368
[32, 64) 4
[64, 128) 24
[128, 256) 3
[256, 512) 1
[512, 1K) 0
[1K, 2K) 0
[2K, 4K) 0
[4K, 8K) 0
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 0
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 95
I've tried creating some load using a concurrent dd of the underlying block device, but couldn't make out any difference. I guess this shows how good the I/O subsystem is. Yes, some writes are slower than writing to /dev/null, but not by much.
== launching the program from systemd (with journald)
[2, 4) 1060607
[4, 8) 75133
[8, 16) 724
[16, 32) 107
[32, 64) 8
[64, 128) 8
[128, 256) 0
[256, 512) 0
[512, 1K) 0
[1K, 2K) 0
[2K, 4K) 0
[4K, 8K) 0
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 38
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 1
This is also pretty fast. There are some writes in the 64..128ms range, though. Those are likely going to cause problems in timing-critical programs like osmo-bts, osmo-trx or osmo-mgw.
== replacing stderr logging with gsmtap log target (without wqueue)
@usecs:
[4, 8) 776978
[8, 16) 1771
[16, 32) 60
[32, 64) 0
[64, 128) 1
[128, 256) 0
[256, 512) 0
[512, 1K) 0
[1K, 2K) 0
[2K, 4K) 0
[4K, 8K) 0
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 0
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 0
[2G, 4G) 82
Ignoring the strange (and certainly bogus) artefacts in the bottom [2G, 4G) bucket, writes to the GSMTAP UDP socket tend to be quite fast, much faster than anything via stderr. Even faster than stderr to /dev/null.
== conclusion
All in all, it looks not too bad, at least on a relatively unloaded couple-of-years-old laptop. I would have expected worse.
We can definitely see that there are non-negligible delays created by both systemd/journald as well as gnome-terminal.
One idea would be to play with setvbuf(3) to increase stderr buffering inside libc.
The other idea would be to do our own buffering using osmo_wqueue() on the raw, unbuffered stderr fd, which would then be in non-blocking mode and handled via the osmocom select loop. We'd have to add some kind of limit for that buffer in order to avoid OOM situations.
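Roughly what I have in mind (just a sketch to illustrate the idea; the helper names log_wq_setup()/log_wq_output() and the queue limit are made up, only the osmo_wqueue/msgb/osmo_fd calls are existing libosmocore API):

#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <osmocom/core/select.h>
#include <osmocom/core/write_queue.h>
#include <osmocom/core/msgb.h>

#define LOG_WQUEUE_LEN 1000	/* arbitrary cap to avoid unbounded memory use / OOM */

static struct osmo_wqueue log_wqueue;

/* called from the osmocom select loop once stderr is writable again */
static int log_wq_write_cb(struct osmo_fd *ofd, struct msgb *msg)
{
	return write(ofd->fd, msgb_data(msg), msgb_length(msg));
}

static void log_wq_setup(void)
{
	int flags;

	osmo_wqueue_init(&log_wqueue, LOG_WQUEUE_LEN);
	log_wqueue.write_cb = log_wq_write_cb;
	log_wqueue.bfd.fd = STDERR_FILENO;

	/* raw stderr fd in non-blocking mode, handled via osmocom select */
	flags = fcntl(STDERR_FILENO, F_GETFL);
	fcntl(STDERR_FILENO, F_SETFL, flags | O_NONBLOCK);
	osmo_fd_register(&log_wqueue.bfd);
}

/* per log line: copy the formatted string into a msgb and enqueue it */
static void log_wq_output(const char *line, size_t len)
{
	struct msgb *msg = msgb_alloc(len, "log_line");

	if (!msg)
		return;
	memcpy(msgb_put(msg, len), line, len);
	if (osmo_wqueue_enqueue(&log_wqueue, msg) != 0)
		msgb_free(msg);	/* queue full: drop the line */
}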
On Sat, Sep 26, 2020 at 07:50:14PM +0200, Harald Welte wrote:
> One idea would be to play with setvbuf(3) to increase stderr buffering inside libc.
I just tried; at least for gnome-terminal it did not change the results at all, even when a 1MByte buffer was used in setvbuf(). I thought this could be related to our use of fflush() in stderr logging. But even after removing the fflush() I could still see those 64..128ms stalls when running the test program via gnome-terminal. Not sure what's going on, but we can definitely stop looking at setvbuf() as a solution.
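For reference, the attempt amounted to roughly the following (a sketch; the 1MByte size is the one mentioned above and otherwise arbitrary):

#include <stdio.h>

static char stderr_buf[1 << 20];	/* 1 MByte */

static void buffer_stderr(void)
{
	/* switch stderr from its default (unbuffered) mode to fully
	 * buffered mode; must be done before anything is written to it */
	setvbuf(stderr, stderr_buf, _IOFBF, sizeof(stderr_buf));
}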
Dear list,
I did some more experiments.
On Sat, Sep 26, 2020 at 07:50:14PM +0200, Harald Welte wrote:
> == launching the program in gnome-terminal
> [gnome-terminal histogram snipped, see above]
> So we can see that gnome-terminal tends to be faster on most stderr writes, but at the expense of introducing a few long delays in the 32..128ms buckets.
> The other idea would be to do our own buffering using osmo_wqueue() on the raw, unbuffered stderr fd, which would then be in non-blocking mode and handled via the osmocom select loop. We'd have to add some kind of limit for that buffer in order to avoid OOM situations.
I just implemented that. Results are promising.
This is with gnome-terminal, where above we saw a significant number of 64..128ms blocking delays. Now:
@usecs:
[1] 404920
[2, 4) 308383
[4, 8) 3191
[8, 16) 1434
[16, 32) 1438
[32, 64) 65
[64, 128) 22
[128, 256) 9
[256, 512) 9
So no more than 512us for any log output line (this is the time spent within the process, allocating a msgb and enqueueing it to the write_queue). We have to be a bit careful here in order to prevent re-entrancy problems: for example, write_queue currently wants to log if the queue is full, which obviously causes infinite recursion.
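One way to guard against this kind of recursion could be a simple flag around the enqueue path (sketch only, reusing the made-up log_wq_output() helper from the earlier mail):

#include <stdbool.h>

static bool log_in_progress;

static void log_wq_output(const char *line, size_t len)
{
	if (log_in_progress)
		return;			/* drop any recursive log attempt */
	log_in_progress = true;

	/* msgb_alloc() + osmo_wqueue_enqueue() as sketched before; if the
	 * enqueue path itself tries to LOGP() (e.g. "queue full"), that
	 * inner call hits the guard above and becomes a no-op instead of
	 * recursing endlessly */

	log_in_progress = false;
}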
We make logging more expensive this way, as we add a msgb allocation + memcpy to each log statement. At the same time, we remove the potential for any write blocking, which is probably worth it.
There are some other side-effects, though: stderr logging no longer works for small programs that don't ever call osmo_select_main(). So maybe we need to support both the old blocking as well as the new non-blocking method?
But how do we make sure everyone gets the best default behavior? Start with the blocking/buffered I/O and switch to the write_queue backend on the first call to osmo_select_main()?
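Sketched out, such a switch-over inside libosmocore could look something like this (illustrative only; log_targets_switch_to_wqueue() is a made-up helper, not existing API):

#include <stdbool.h>

static bool log_wqueue_enabled;

int osmo_select_main(int polling)
{
	if (!log_wqueue_enabled) {
		/* first time through the select loop: from now on we know
		 * the program runs osmo_select_main(), so it is safe to
		 * switch stderr/file log targets to the write_queue backend */
		log_targets_switch_to_wqueue();
		log_wqueue_enabled = true;
	}

	/* ... existing poll/select handling, unchanged ... */
	return 0;	/* placeholder for the existing return value */
}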
What do you think?
Hi,
LGTM. Indeed, we should keep allowing use of the logging API without requiring osmo_select_main(). I submitted several comments on the related gerrit patch you submitted.
Dear list,
Vadim requested that I also compare syslog-based logging. In theory, I would have expected it to perform similarly to gsmtap, given that it also doesn't do anything but send UDP packets.
However, the performance looks slightly worse than with gsmtap:
@usecs:
[2, 4) 1167
[4, 8) 2056285
[8, 16) 8403
[16, 32) 1880
[32, 64) 6
[64, 128) 6
[128, 256) 0
[256, 512) 2
[512, 1K) 1
[1K, 2K) 738
[2K, 4K) 2
[4K, 8K) 0
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 0
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 49
Note the significant count of samples in the 1..2ms bucket. That's still quite a lot, compared to the usual 4..8us.
Reminder: with gsmtap logging we had no log lines with more than 128us delay.
The results are reproducible; I just double-checked that gsmtap definitely outperforms syslog in terms of the maximum delay/latency caused by logging.
Vadim also requested a comparison of the performance of his systemd-journald log target in the same benchmark.
Please find the results below:
systemd-journald (raw=false)
[4, 8) 1427
[8, 16) 331242
[16, 32) 95378
[32, 64) 2874
[64, 128) 449
[128, 256) 48
[256, 512) 7
[512, 1K) 1
[1K, 2K) 0
[2K, 4K) 0
[4K, 8K) 0
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 0
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 0
[2G, 4G) 81
systemd-journald (raw=true)
[4, 8) 5477
[8, 16) 576683
[16, 32) 143985
[32, 64) 3000
[64, 128) 772
[128, 256) 107
[256, 512) 8
[512, 1K) 0
[1K, 2K) 0
[2K, 4K) 0
[4K, 8K) 1
[8K, 16K) 0
[16K, 32K) 0
[32K, 64K) 0
[64K, 128K) 0
[128K, 256K) 0
[256K, 512K) 0
[512K, 1M) 0
[1M, 2M) 0
[2M, 4M) 0
[4M, 8M) 0
[8M, 16M) 0
[16M, 32M) 0
[32M, 64M) 0
[64M, 128M) 0
[128M, 256M) 0
[256M, 512M) 0
[512M, 1G) 0
[1G, 2G) 0
[2G, 4G) 48
So we can see that both of these are actually quite good: up to 1ms for almost all log writes in the raw=false case, and more or less the same order of magnitude for the raw case. Both are much better than blocking stderr + systemd picking up stderr (where latencies of up to 128ms are observed at times).
On Sat, Sep 26, 2020 at 09:52:51PM +0200, Harald Welte wrote:
> > The other idea would be to do our own buffering using osmo_wqueue() on
> > [...]
> I just implemented that. Results are promising.
I'm wondering, when logging to a wqueue, you are measuring the timing of adding the log statement to the queue? Or also the write operation via select() later?
Working off the logging wqueue could also delay handling pending incoming packets? So maybe it just moves latency, at least in high load situations with a lot of incoming work pending?
When we do a select() for, say, an incoming message, and that creates a lot of interesting logging, we'll only actually see the logging on some output in subsequent select() loop iterations, right?
And if we get a program crash in between? Then the stderr (for example) won't show the final logging leading up to the crash, right?
So maybe there should be some trivial cfg or cmdline switch that changes over to non-wqueue logging at least for debugging?
~N
Hi,
Neels, I agree with you; I already raised the same concerns as comments in the gerrit patch.
Hi Neels,
On Mon, Sep 28, 2020 at 10:25:48PM +0200, Neels Hofmeyr wrote:
> On Sat, Sep 26, 2020 at 09:52:51PM +0200, Harald Welte wrote:
> > > The other idea would be to do our own buffering using osmo_wqueue() on
> > > [...]
> > I just implemented that. Results are promising.
> I'm wondering, when logging to a wqueue, you are measuring the timing of adding the log statement to the queue? Or also the write operation via select() later?
I'm only measuring the delay of adding the log statement to the queue.
> Working off the logging wqueue could also delay handling pending incoming packets? So maybe it just moves latency, at least in high load situations with a lot of incoming work pending?
For sure, that write will take some time (later). However:
* we only process one log write per select() loop, so we have some kind of fair sharing of opportunity with incoming packets
* I would definitely expect those writes to take less time, since contrary to right now the file descriptors are non-blocking
I'm happy to do some additional benchmarking if you'd like, but a non-blocking write - as the name implies - doesn't block, and hence we shouldn't see any significant delays. I would expect this to be in the < 1ms order of magnitude, compared to the multiple-ms to multiple-hundreds-of-ms category. That is of course assuming that the CPU has time to schedule the process again after the syscall completes. But we have to assume that is the case; if your CPU is overloaded then of course you will see higher latencies, no matter what.
> When we do a select() for, say, an incoming message, and that creates a lot of interesting logging, we'll only actually see the logging on some output in subsequent select() loop iterations, right?
Yes, that's correct. But only one log statement per select() loop iteration, to ensure there is no starving of other work by logging.
> And if we get a program crash in between? Then the stderr (for example) won't show the final logging leading up to the crash, right?
correct.
> So maybe there should be some trivial cfg or cmdline switch that changes over to non-wqueue logging at least for debugging?
fine with me.
one more follow-up:
On Tue, Sep 29, 2020 at 11:04:49AM +0200, Harald Welte wrote:
> > And if we get a program crash in between? Then the stderr (for example) won't show the final logging leading up to the crash, right?
> correct.
The only way around that would be to push the job of 'flushing the queue' to a separate process. So the log messages would be sent over a socket (or via shared memory, if you want to optimize further), and that separate process would then take care of flushing/draining the queue even after the main process has exited.
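For illustration, a crude version of that idea would be a forked child draining a socketpair (all names made up; a real implementation would need error handling and a hook into the log targets):

#include <unistd.h>
#include <sys/socket.h>

static int log_pipe_fd = -1;

static void log_flusher_start(void)
{
	int sp[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) < 0)
		return;

	if (fork() == 0) {
		/* child: keep copying log data from the socket to the real
		 * stderr; it keeps draining even after the parent crashed */
		char buf[4096];
		ssize_t n;

		close(sp[1]);
		while ((n = read(sp[0], buf, sizeof(buf))) > 0) {
			if (write(STDERR_FILENO, buf, n) < 0)
				break;
		}
		_exit(0);
	}

	/* parent: subsequent log writes would go into log_pipe_fd instead
	 * of directly to STDERR_FILENO */
	close(sp[0]);
	log_pipe_fd = sp[1];
}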
To get back down to earth, I think the stderr/file write_queue would usually only contain one element to begin with. Only in the exceptional case of some stalling would we have some log messages queued up.
The fact that we currently only stall occasionally when using blocking writes shows us how infrequently the existing stdio buffer (or pipe buffer) is insufficient. That gives us an idea of how infrequently we would see a write queue of significant size (in excess of the current stdio or pipe buffering).
In the "ideal world" we would probably have a system that is somewhere in between of tht two extremes of * old behavior: always a synchronous, potentially blocking write * new behavior: always asynchronous, queued non-blocking write
So what I think would be good is:
* If the write_queue is empty, attempt a synchronous, non-blocking write to stderr/file.
* Only if that write returns short (in the extreme case '0'), enqueue the [remainder of the] log message to the write_queue for delayed storage.
This way we should get the best of both worlds: More or less unmodified behavior for 99% of all cases, while avoiding any delays in those few situations where right now we are stalling/blocking due to a full stdio buffer.
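In code, that could look roughly like this (a sketch with made-up helper names, assuming the fd has already been put into non-blocking mode):

#include <string.h>
#include <unistd.h>
#include <osmocom/core/write_queue.h>
#include <osmocom/core/msgb.h>
#include <osmocom/core/linuxlist.h>

static void log_line_out(struct osmo_wqueue *wq, const char *line, size_t len)
{
	struct msgb *msg;
	ssize_t written = 0;

	if (llist_empty(&wq->msg_queue)) {
		/* fast path: nothing queued, try an immediate write; the fd
		 * is non-blocking, so this may return short but never stall */
		written = write(wq->bfd.fd, line, len);
		if (written < 0)
			written = 0;
		if ((size_t) written == len)
			return;
	}

	/* slow path: queue the unwritten remainder, to be flushed from the
	 * select loop once the fd becomes writable again */
	msg = msgb_alloc(len - written, "log_line");
	if (!msg)
		return;
	memcpy(msgb_put(msg, len - written), line + written, len - written);
	if (osmo_wqueue_enqueue(wq, msg) != 0)
		msgb_free(msg);
}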
Regards, Harald
Hi,
On 9/29/20 11:26 AM, Harald Welte wrote:
> So what I think would be good is:
> - If the write_queue is empty, attempt a synchronous, non-blocking write to stderr/file.
> - Only if that write returns short (in the extreme case '0'), enqueue the [remainder of the] log message to the write_queue for delayed storage.
Fine with this approach, I was going to suggest that too while reading your last comments. The important point here is changing the fd to be non-blocking; all the rest is basically about how to keep ordering and delays at a minimum once we hit a would-block situation. In any case, I'd go for: when pushing stuff to the fd from the queue, also push until a short write is hit, not only 1 msgb per select().
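Something along these lines, i.e. keep writing until we hit a short write (sketch only; this is not what osmo_wqueue_bfd_cb() currently does, and the wqueue's internal length bookkeeping is ignored for brevity):

#include <unistd.h>
#include <osmocom/core/write_queue.h>
#include <osmocom/core/msgb.h>
#include <osmocom/core/linuxlist.h>

static void log_wq_drain(struct osmo_wqueue *wq)
{
	while (!llist_empty(&wq->msg_queue)) {
		struct msgb *msg = msgb_dequeue(&wq->msg_queue);
		ssize_t written = write(wq->bfd.fd, msgb_data(msg), msgb_length(msg));

		if (written < 0)
			written = 0;	/* EAGAIN etc.: nothing was written */
		if (written < msgb_length(msg)) {
			/* short write: strip what was written, put the rest
			 * back at the head of the queue and stop for now */
			msgb_pull(msg, written);
			llist_add(&msg->list, &wq->msg_queue);
			return;
		}
		msgb_free(msg);
	}
}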
> This way we should get the best of both worlds: More or less unmodified behavior for 99% of all cases, while avoiding any delays in those few situations where right now we are stalling/blocking due to a full stdio buffer.
ACK.