Attention is currently required from: iedemam.
11 comments:
Patchset:
Hi Michael!
Good to see a patch from outside of sysmocom.
And what a patch it is. Reporting time durations in statistics is a head-spinning mathematically challenging topic. When reviewing the patch I went from "oh, that's simple" and "why didn't I think of this and went for all the effort with osmo_time_cc?" at first, all the way to "wait, no, how does this work again??" and to "oh that's why I wrote osmo_time_cc". This stuff can be really confusing.
Let me try to break down what I figured out so far:
This current patch has these flaws:
- (there should be no averaging, see below, but) The N to average over should be the released-count, not the currently-active-count, because the millis value is added only later upon channel release.
In stats, when we report the TCH_ACTIVE_MILLISECONDS_TOTAL and the BTS_CTR_CHAN_ACT_TCH, that gives an inaccurate metric, because N increments a long time before the millis increment.
The metric takes a dip upon each channel activation,
and un-dips on each channel release.
If there were a metric counting releases, the value would be much more accurate, updated only when a channel releases with the correct duration value.
If we report the nr of channel releases as a rate counter as well as the sum of active milliseconds as a rate counter, a stats engine can form an accurate average over any time period it chooses.
However, think that a phone call is taking 2 hours to complete, then the trivial implementation would report those 2 hours worth of milliseconds only when the call is hung up. Or extrapolated, if *all* channels are occupied with calls going on for 2 hours, your dashboard will only flag a high Erlang value two hours after your entire network has become congested. With osmo_time_cc, the milliseconds count would gradually increase as the call continues, even before it is hung up, and the metric would also stay accurate the entire time (no dips from chan act).
No dips because no averaging:
IIUC for an Erlang we would not actually average, but simply add up all lchans.
We could put a separate osmo_time_cc for each lchan in place, and report the sum of those per BTS and chan type.
Then if we get 10 minutes of increase within one minute, that's 10 Erlangs, or 10 TCH were active concurrently. I think this is what we should go for. From explaining this, I've actually come to appreciate the simple yet powerful nature of the Erlang unit, had not realized this before.
In detail, I would:
- add a new rate counter per BTS and per chan type; each gradually reports the seconds of activity per second, added up over all lchans of that type. E.g. BTS_ACTIVE_TIME_TCH, _SDCCH,...
- we add one time_cc per lchan, and make all of them accumulate to the same BTS-wide rate counter
(one per channel type).
- in lchan_st_established_onenter() we set the time_cc = true, on established_onleave() to false.
Nothing else is required. The patch might actually become simpler this way.
(maybe there could also be separate rate counters for each TRX, and also globally over the entire BSC; a bit of implementation detail would be involved there, skipping that for now.)
I would advise against using the current patch as-is. It would provide an improved value, but there would still be the flaws I pointed out, and eventually we'd probably need to fix those and have compat issues with existing setups.
What do you think? Am I making sense?
(I could also imagine taking over, but not sure how soon that would work out; would need to ask about our current priorities).
I also wrote a bunch of code review on details of this patch. I'll keep some for sharing, in case you're interested in the style differences that come up.
File include/osmocom/bsc/gsm_data.h:
Patch Set #8, Line 1130: unsigned long long gsm_lchan_active_duration_ms(const struct gsm_lchan *lchan);
(idea: use uint64_t for a fixed size independent from arch.
uint32_t spans 49 days of ms, uint64_t half a billion years)
File src/osmo-bsc/bts.c:
Patch Set #8, Line 1514: "Cummulative number of active milliseconds on TCH chans",
("Cumulative")
doc reads a bit weird, as if the milliseconds are active, also leaves a lot of guessing about the nature of the metric.
suggestion:
"Average lchan activity duration per TCH lchan activation in milliseconds"
File src/osmo-bsc/bts_trx_vty.c:
Patch Set #8, Line 574: uint duration_s = (gsm_lchan_active_duration_ms(lchan) + 500) / 1000;
the +500 acts as a round(), yet the reported descriptions below suggest that floor() should be used instead. IOW please drop the 500.
File src/osmo-bsc/bts_vty.c:
Patch Set #8, Line 3826: uint64_t activations_tch = rate_ctr_group_get_ctr(bts->bts_ctrs, BTS_CTR_CHAN_ACT_TCH)->current;
osmocom style dictates that local variables are declared at the start of the function (or scope)
File src/osmo-bsc/gsm_data.c:
Patch Set #8, Line 348: /* Get duration of active time for this lchan in milliseconds */
Rather:
/* If the lchan is currently active, return the duration since activation in milliseconds. Otherwise return 0. */
Patch Set #8, Line 352: if (lchan->activate.concluded) {
osmocom style asks for early-exit:
struct timespec now;
if (!lchan->activate.concluded)
return 0;
osmo_clock_gettime
...
return elapsed.tv_sec * ...
Patch Set #8, Line 357: duration = elapsed.tv_sec * 1000LL + elapsed.tv_nsec / 1000000;
the LL on the 1000 has no effect, the first operand defines the integer size for calculation. So you want one of these:
(uint64_t)elapsed.tv_sec * 1000 + ...
or
1000LL * elapsed.tv_sec + ...
(i prefer the first)
File src/osmo-bsc/lchan_fsm.c:
You should rather put this bit in a new lchan_st_established_onleave() function.
This is the single place that is absolutely guaranteed to be called whenever the lchan leaves the active state.
If you want to also count the time that the lchan release process takes (including any error timeouts that may be), then you'd instead put this in lchan_st_unused_onenter().
In lchan_fsm.c:1628, add a new onleave cb:
[LCHAN_ST_ESTABLISHED] = {
.onleave = lchan_st_established_onleave,
Patch Set #8, Line 1811: /* Add active milliseconds to cummulative counts per channel type */
/* Report how long this lchan was active, to add to the average activity duration metric. */
Patch Set #8, Line 1817: LOG_LCHAN(lchan, LOGL_INFO, "GSM_LCHAN_TCH was active for %llu milliseconds\n", duration_ms);
i would rather just have a single LOG_LCHAN() that uses gsm_chan_t_name() above the switch, because then you don't need to duplicate which names we use for channel types in logging, and you can drop the weird 'default:' case.
Also osmo_int_to_float_str() is nice.
/* "TCH/H was active for 1.23 seconds" */
LOG_LCHAN("%s was active for %s seconds", gsm_chan_t_name(lchan),
osmo_int_to_float_str_c(OTC_SELECT, millis, 3);
switch(lchan->type) {
osmo_stat_item_inc...
break
To view, visit change 27081. To unsubscribe, or for help writing mail filters, visit settings.