Attention is currently required from: iedemam.
neels has posted comments on this change. (
https://gerrit.osmocom.org/c/osmo-bsc/+/27081
)
Change subject: stats: New stats for lchan life duration.
......................................................................
Patch Set 8:
(11 comments)
Patchset:
PS8:
Hi Michael!
Good to see a patch from outside of sysmocom.
And what a patch it is. Reporting time durations in statistics is a head-spinning
mathematically challenging topic. When reviewing the patch I went from "oh,
that's simple" and "why didn't I think of this and went for all the
effort with osmo_time_cc?" at first, all the way to "wait, no, how does this
work again??" and to "oh that's why I wrote osmo_time_cc". This stuff
can be really confusing.
Let me try to break down what I figured out so far:
This current patch has these flaws:
- (there should be no averaging, see below, but) The N to average over should be the
released-count, not the currently-active-count, because the millis value is added only
later upon channel release.
In stats, when we report the TCH_ACTIVE_MILLISECONDS_TOTAL and the BTS_CTR_CHAN_ACT_TCH,
that gives an inaccurate metric, because N increments a long time before the millis
increment.
The metric takes a dip upon each channel activation,
and un-dips on each channel release.
If there were a metric counting releases, the value would be much more accurate, updated
only when a channel releases with the correct duration value.
- The value written on vty_out reflects an average over the process lifetime, not the
current situation. A useful value would gradually "forget" the values of past
time periods, literature seems to suggest averaging over an hour or 15 minutes. Instead
osmocom offers rate_ctr. See below.
- For the ACTIVE_MILLISECONDS_TOTAL, a rate_ctr would be better suited. A stat item is
designed for current absolute values that can decrease, like how many lchans are currently
active. A rate counter reports increments of a value over-time, like how many releases
have occured over time (per second, per minute, etc); this over-time is very desirable.
If we report the nr of channel releases as a rate counter as well as the sum of active
milliseconds as a rate counter, a stats engine can form an accurate average over any time
period it chooses.
However, think that a phone call is taking 2 hours to complete, then the trivial
implementation would report those 2 hours worth of milliseconds only when the call is hung
up. Or extrapolated, if *all* channels are occupied with calls going on for 2 hours, your
dashboard will only flag a high Erlang value two hours after your entire network has
become congested. With osmo_time_cc, the milliseconds count would gradually increase as
the call continues, even before it is hung up, and the metric would also stay accurate the
entire time (no dips from chan act).
No dips because no averaging:
IIUC for an Erlang we would not actually average, but simply add up all lchans.
We could put a separate osmo_time_cc for each lchan in place, and report the sum of those
per BTS and chan type.
Then if we get 10 minutes of increase within one minute, that's 10 Erlangs, or 10 TCH
were active concurrently. I think this is what we should go for. From explaining this,
I've actually come to appreciate the simple yet powerful nature of the Erlang unit,
had not realized this before.
In detail, I would:
- add a new rate counter per BTS and per chan type; each gradually reports the seconds of
activity per second, added up over all lchans of that type. E.g. BTS_ACTIVE_TIME_TCH,
_SDCCH,...
- we add one time_cc per lchan, and make all of them accumulate to the same BTS-wide rate
counter
(one per channel type).
- in lchan_st_established_onenter() we set the time_cc = true, on established_onleave() to
false.
Nothing else is required. The patch might actually become simpler this way.
(maybe there could also be separate rate counters for each TRX, and also globally over the
entire BSC; a bit of implementation detail would be involved there, skipping that for
now.)
I would advise against using the current patch as-is. It would provide an improved value,
but there would still be the flaws I pointed out, and eventually we'd probably need to
fix those and have compat issues with existing setups.
What do you think? Am I making sense?
(I could also imagine taking over, but not sure how soon that would work out; would need
to ask about our current priorities).
I also wrote a bunch of code review on details of this patch. I'll keep some for
sharing, in case you're interested in the style differences that come up.
File include/osmocom/bsc/gsm_data.h:
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/55df34cc_7eee50f1
PS8, Line 1130: unsigned long long gsm_lchan_active_duration_ms(const struct gsm_lchan
*lchan);
(idea: use uint64_t for a fixed size independent from arch.
uint32_t spans 49 days of ms, uint64_t half a billion years)
File src/osmo-bsc/bts.c:
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/9e38af99_1efeb800
PS8, Line 1514: "Cummulative number of active milliseconds on TCH chans",
("Cumulative")
doc reads a bit weird, as if the milliseconds are active, also leaves a lot of guessing
about the nature of the metric.
suggestion:
"Average lchan activity duration per TCH lchan activation in milliseconds"
File src/osmo-bsc/bts_trx_vty.c:
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/d479f6ec_6c688a6f
PS8, Line 574: uint duration_s = (gsm_lchan_active_duration_ms(lchan) + 500) / 1000;
the +500 acts as a round(), yet the reported descriptions below suggest that floor()
should be used instead. IOW please drop the 500.
File src/osmo-bsc/bts_vty.c:
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/d3a14304_f147c16c
PS8, Line 3826: uint64_t activations_tch = rate_ctr_group_get_ctr(bts->bts_ctrs,
BTS_CTR_CHAN_ACT_TCH)->current;
osmocom style dictates that local variables are declared at the start of the function (or
scope)
File src/osmo-bsc/gsm_data.c:
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/6b1bdb85_9746e7b5
PS8, Line 348: /* Get duration of active time for this lchan in milliseconds */
Rather:
/* If the lchan is currently active, return the duration since activation in milliseconds.
Otherwise return 0. */
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/73cb38ff_8091843b
PS8, Line 352: if (lchan->activate.concluded) {
osmocom style asks for early-exit:
struct timespec now;
if (!lchan->activate.concluded)
return 0;
osmo_clock_gettime
...
return elapsed.tv_sec * ...
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/cdcce2f6_c8642f83
PS8, Line 357: duration = elapsed.tv_sec * 1000LL + elapsed.tv_nsec / 1000000;
the LL on the 1000 has no effect, the first operand defines the integer size for
calculation. So you want one of these:
(uint64_t)elapsed.tv_sec * 1000 + ...
or
1000LL * elapsed.tv_sec + ...
(i prefer the first)
File src/osmo-bsc/lchan_fsm.c:
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/af32bb7e_b5fe7283
PS8, Line 1810:
You should rather put this bit in a new lchan_st_established_onleave() function.
This is the single place that is absolutely guaranteed to be called whenever the lchan
leaves the active state.
If you want to also count the time that the lchan release process takes (including any
error timeouts that may be), then you'd instead put this in
lchan_st_unused_onenter().
In lchan_fsm.c:1628, add a new onleave cb:
[LCHAN_ST_ESTABLISHED] = {
.onleave = lchan_st_established_onleave,
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/ad4f3c23_f4db6493
PS8, Line 1811: /* Add active milliseconds to cummulative counts per channel type */
/* Report how long this lchan was active, to add to the average activity duration metric.
*/
https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/e0fc54f9_b83bf4f8
PS8, Line 1817: LOG_LCHAN(lchan, LOGL_INFO, "GSM_LCHAN_TCH was active for %llu
milliseconds\n", duration_ms);
i would rather just have a single LOG_LCHAN() that uses gsm_chan_t_name() above the
switch, because then you don't need to duplicate which names we use for channel types
in logging, and you can drop the weird 'default:' case.
Also osmo_int_to_float_str() is nice.
/* "TCH/H was active for 1.23 seconds" */
LOG_LCHAN("%s was active for %s seconds", gsm_chan_t_name(lchan),
osmo_int_to_float_str_c(OTC_SELECT, millis, 3);
switch(lchan->type) {
osmo_stat_item_inc...
break
--
To view, visit
https://gerrit.osmocom.org/c/osmo-bsc/+/27081
To unsubscribe, or for help writing mail filters, visit
https://gerrit.osmocom.org/settings
Gerrit-Project: osmo-bsc
Gerrit-Branch: master
Gerrit-Change-Id: I1b0670c47cb5e0b7776eda89d1e71545ba0e3347
Gerrit-Change-Number: 27081
Gerrit-PatchSet: 8
Gerrit-Owner: iedemam <michael(a)kapsulate.com>
Gerrit-Reviewer: Jenkins Builder
Gerrit-Reviewer: neels <nhofmeyr(a)sysmocom.de>
Gerrit-CC: laforge <laforge(a)osmocom.org>
Gerrit-CC: pespin <pespin(a)sysmocom.de>
Gerrit-Attention: iedemam <michael(a)kapsulate.com>
Gerrit-Comment-Date: Tue, 15 Feb 2022 18:42:19 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
Gerrit-MessageType: comment