Attention is currently required from: iedemam, laforge, fixeria, pespin.
neels has posted comments on this change. (
https://gerrit.osmocom.org/c/osmo-bsc/+/27081
)
Change subject: stats: new trackers for lchan life duration
......................................................................
Patch Set 15:
(9 comments)
Patchset:
PS15:
The patch is turning out nicely, but we still have a couple of problems.
Some of those are my fault of giving recommendations in code review without thinking it
through.
I hope I am not causing frustration by "always having yet more comments".
File src/osmo-bsc/lchan_fsm.c:
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/c0c177bc_13624a22
PS15, Line 208: if (lchan->active_ms.cfg.rate_ctr) {
I understand now: the rate_ctr is set only for SDCCH and TCH type.
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/085bc7b3_caa11933
PS15, Line 209: osmo_time_cc_cleanup(&lchan->active_ms);
osmo_time_cc_cleanup() is not necessary, the forget_sum configuration does this
implicitly
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/9a4b8bc7_c242c7e9
PS15, Line 465: uint
Is it a valid type from <stdint.h>? Never saw it
before. […]
I like to use "uint" in private projects, but there i typedef
it manually as unsigned int. Indeed i've never seen it in osmocom code. it is
certainly not from stdint.h, wondering why this compiles.
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/f5fed5b1_a399fc61
PS15, Line 478: counter_id = -1;
This should go to the 'default' branch of the
'switch' statement below.
could, yes, but totally fine here IMHO.
But in fact, I would rather see counters for all channel types added, instead of counting
"only" TCH and SDCCH. Those are the most interesting ones of course, but why not
do all of them while we're at it.
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/724984ee_b0144023
PS15, Line 479: switch (lchan->type) {
the lchan->type changes during operation of osmo-bsc, it is not known at the time of
lchan_fsm_alloc() yet. For example, a TCH/F timeslot can be used as type SDCCH when all
dedicated SDCCH timeslots are exhausted, or a dynamic timeslot can be used as TCH/F or
TCH/H depending.
I think I would do this:
This here, lchan_fsm_alloc(), happens "only once" when a BTS starts up.
Configure here the lchan->active_ms.cfg, i.e. set the gran_usec and forget_sum_usec,
unconditionally. But here do not set a rate_ctr yet (keep it NULL). So each lchan now has
an accumulator for the time that it is active -- as long as the rate_ctr is NULL, all it
does is increment its total_sum, no harm done. So we can always set the flag to true or
false, for any and all lchan types, regardless of the rate_ctr being present.
Just before setting the flag to 'true' in lchan_on_fully_established(), point the
lchan->active_ms.cfg.rate_ctr to the counter matching the lchan->type, because only
then do we know the actual lchan->type in use. (We already know the type a few FSM
states earlier, but we only really care upon fully established, when the flag goes true.)
lchan->active_ms always directs new counter increments to the rate_ctr it currently
points at.
When setting the rate_ctr, do take care to set it to NULL if the lchan type has no
counter. (personally I would add counters for all types while at it, but it's not
mandatory)
The active_ms time_cc does never need to be reset, no need to call osmo_time_cc_cleanup().
forget_sum takes care that leftover timings are discarded; especially when forget_sum_usec
= 1 it's guaranteed that we instantly forget surplus timings from previous times of
flag == true. As soon as newly accumulated sums reach the threshold, a counter increment
is triggered, to whichever rate counter it currently points at.
(Like this, we also have in active_ms.total_sum a sum of how long each individual lchan
has been active since the BTS started operating, across all lchan->types it has
operated as. We could maybe print that in the VTY output, as a nice-to-know value, not
sure if it bears much meaning though, besides which lchans are the favorites.)
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/b461f303_2281ae3f
PS15, Line 493: .gran_usec = 1*1000,
Sorry for my earlier comment, I've come to realize what exactly I recommended by
mentioning 1 ms granularity. By setting gran_usec to one millisecond, we will actually
schedule an osmo_timer to increment a rate counter every single millisecond that the flag
is true. Modern machines should be able to handle that, but I fear it might make a
difference in load. 1000 times per second for every active lchan feels like a lot for me
as a human.
With gran_usec == 1000 and flag == true, every millisecond we re-schedule an osmo_timer to
come up in the main select() loop. It will increment the rate counter, and schedule
another timer to "interrupt" the select() loop a millisecond later. Mind you it
is fine if the select() loop skips a couple of milliseconds here or there from handling
messages, the rate counter will still be incremented correctly. But we would load the main
select() loop a lot less with a larger granularity.
What is a meaningful precision you would like to see in the metrics? If one second is too
large, I guess a tenth of a second (gran_usec = 100000) is sufficient for a humanly
noticeable time scale?
(Would be interesting to benchmark that, see if this granularity does indeed make any
difference at all on the CPU load. So far it's just a gut feeling of mine.)
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/63b93209_f57327e9
PS15, Line 498: .T_forget_sum = -18,
timers X16, X17, X18 are defined to configure a different counter (the all_allocated
counters). important: these X timers will override above gran_usec, forget_sum_usec
settings!
I think it is sufficient to not define any T_* and always use the hardcoded configuration
here.
If you'd like to make the active_ms configurable from VTY, we should define different
X numbers.
Related:
https://osmocom.org/projects/cellular-infrastructure/wiki/List_of_Timer_num…
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/d7f488bb_41c99b30
PS15, Line 562: /* Stop the timekeeper, which triggers a report */
the comment suggests that the report is triggered only now. In fact a rate_ctr increment
is triggered every time gran_usec elapses the rounding threshold while the flag is set
'true', and not really when the flag changes to 'false'.
--
To view, visit
https://gerrit.osmocom.org/c/osmo-bsc/+/27081
To unsubscribe, or for help writing mail filters, visit
https://gerrit.osmocom.org/settings
Gerrit-Project: osmo-bsc
Gerrit-Branch: master
Gerrit-Change-Id: I1b0670c47cb5e0b7776eda89d1e71545ba0e3347
Gerrit-Change-Number: 27081
Gerrit-PatchSet: 15
Gerrit-Owner: iedemam <michael(a)kapsulate.com>
Gerrit-Reviewer: Jenkins Builder
Gerrit-Reviewer: laforge <laforge(a)osmocom.org>
Gerrit-Reviewer: neels <nhofmeyr(a)sysmocom.de>
Gerrit-CC: fixeria <vyanitskiy(a)sysmocom.de>
Gerrit-CC: pespin <pespin(a)sysmocom.de>
Gerrit-Attention: iedemam <michael(a)kapsulate.com>
Gerrit-Attention: laforge <laforge(a)osmocom.org>
Gerrit-Attention: fixeria <vyanitskiy(a)sysmocom.de>
Gerrit-Attention: pespin <pespin(a)sysmocom.de>
Gerrit-Comment-Date: Tue, 08 Mar 2022 22:35:27 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
Comment-In-Reply-To: fixeria <vyanitskiy(a)sysmocom.de>
Gerrit-MessageType: comment