Change in osmo-bsc[master]: add time_cc API: cumulative counter for time, reported as rate_ctr

neels has posted comments on this change. ( https://gerrit.osmocom.org/c/osmo-bsc/+/25973 )

Change subject: add time_cc API: cumulative counter for time, reported as rate_ctr
......................................................................

Patch Set 1:

> > Well maybe then the question is why are you using rate_ctr and not stat_items here, it really confuses me.
> 
> At least at first sight, I agree. The resulting metric computed by this new code yields a single value, which matches a stat_item better than a rate_ctr. Is there any particular argument for going with rate_ctr, Neels?

The decision to use a rate_ctr is based on discussion with the customer,
and it also makes a lot of sense in practice.

Logically, a stat_item is not actually a good choice. We can of course report the total time of all-allocated, and thus get for example the complete amount of seconds that all SDCCH channels were allocated since osmo-bsc started. But it's not interesting to get an arbitrary amount of time of all-allocated since forever; instead, it is important to qualify in which period of elapsed time this amount was accumulated. A rate_ctr is well suited since it also provides the "per time" aspect. All rate_ctr stats reflect a number-of-events-per-time. For all_allocated, it is the number of seconds that all channels were allocated per a given amount of time. For example, if the VTY shows all_allocated:sdcch of 10/min, it means all channels were allocated for 10 seconds of the last minute. For a stat item, getting this "per time" part is a complex problem.
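
To illustrate the 10/min example, here is a minimal sketch of how a per-minute rate falls out of a plain event counter. It is a simplified model and not libosmocore's rate_ctr implementation; model_rate_ctr and all function names are made up for this example, and it assumes someone calls the "minute elapsed" function exactly once per minute.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of a rate counter; NOT libosmocore's struct rate_ctr. */
struct model_rate_ctr {
	uint64_t current;     /* total events counted so far */
	uint64_t at_last_min; /* value of 'current' when the previous minute ended */
	uint64_t per_min;     /* events seen during the last completed minute */
};

/* Count one event. For all_allocated, one event is one second of all-allocated. */
static void model_rate_ctr_inc(struct model_rate_ctr *ctr)
{
	ctr->current++;
}

/* Call once at the end of each minute: the rate is the increase since the previous call. */
static void model_rate_ctr_minute_elapsed(struct model_rate_ctr *ctr)
{
	assert(ctr->current >= ctr->at_last_min);
	ctr->per_min = ctr->current - ctr->at_last_min;
	ctr->at_last_min = ctr->current;
}

/* Simulate one minute in which all channels are allocated for the first 'busy_seconds' seconds. */
static uint64_t simulate_minute(struct model_rate_ctr *ctr, unsigned int busy_seconds)
{
	unsigned int s;

	for (s = 0; s < 60; s++) {
		if (s < busy_seconds)
			model_rate_ctr_inc(ctr);
	}
	model_rate_ctr_minute_elapsed(ctr);
	return ctr->per_min;
}
```

In this model, a minute with 10 seconds of all-allocated yields a per-minute rate of 10, while the total count keeps growing independently of any reporting period.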

When reporting as a stat_item, we open a new dimension of options:
The spec defines different reporting periods, suggesting at least the options of 5 minutes, 15 minutes, 30 minutes, 60 minutes. We could periodically clear the stat item based on user config.
The customer requesting this feature already implements these reporting periods outside of osmo-bsc, based on stats received from osmo-bsc. So instead of introducing these reporting periods to osmo-bsc and choosing some method of adding a per-time aspect to stat_item, it is best to just trigger a count for each second of all-allocated-channels.

> simply a counter value changing over time.

When I started on it, I thought it would take half an hour.
When thinking about the exact implementation, the options and complexity unfolded...
This patch is the result that ensures correct counts with minimal complexity.

> So I'm not really following on why you need all this infrastructure sorry,

I would appreciate if your criticism could be qualified as well as constructive.
What do you mean by "all this"? What do you suggest instead?

> this all looks super complicated for no reason (I'm able to see). Maybe someone else can also shed some light on it.

It's straightforward:

The aim is to report for how many seconds per given time period all channels of a type were allocated.
To achieve that, we need to count free/allocated lchans.
When a count reveals that all chans of type X are allocated, we set a flag to true.
Based on that flag, a time counter increments. The flag-per-time counter is a generalized API (time_cc).
In order to periodically report that time counter to stats, an osmo_timer is involved.
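
As a rough sketch of that chain (lchan count, flag, time counter, periodic report), assuming millisecond timestamps are passed in by the caller: the struct and function names below are hypothetical and do not match the actual time_cc API in the patch, and the periodic timer is represented only by whoever calls the report function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag-per-time accumulator; names do not match the real time_cc API. */
struct flag_time {
	bool flag;           /* current flag state, e.g. "all SDCCH allocated" */
	uint64_t last_ms;    /* timestamp of the last update */
	uint64_t true_ms;    /* total milliseconds during which the flag was true */
	uint64_t reported_s; /* full seconds already forwarded to the counter */
};

/* Add the time elapsed since the last update, if the flag was true during it. */
static void flag_time_update(struct flag_time *ft, uint64_t now_ms)
{
	assert(now_ms >= ft->last_ms);
	if (ft->flag)
		ft->true_ms += now_ms - ft->last_ms;
	ft->last_ms = now_ms;
}

/* Compare used vs. total lchans of one type and set the flag accordingly. */
static void flag_time_set_from_lchans(struct flag_time *ft, uint64_t now_ms,
				      unsigned int used, unsigned int total)
{
	flag_time_update(ft, now_ms);
	ft->flag = (total > 0 && used >= total);
}

/* Meant to be called from a periodic timer. Returns the number of newly
 * completed seconds, i.e. by how much to increment the reported counter. */
static uint64_t flag_time_report(struct flag_time *ft, uint64_t now_ms)
{
	uint64_t full_s, delta;

	flag_time_update(ft, now_ms);
	full_s = ft->true_ms / 1000;
	delta = full_s - ft->reported_s;
	ft->reported_s = full_s;
	return delta;
}
```

Keeping the accumulated time in milliseconds and reporting only completed seconds means that fractions of a second are carried over to the next report instead of being lost.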

I am open to simplifications, if possible.

There are some additional options to configure time_cc with different granularity,
and to allow tweaking the counter precision vs response time.
These options aren't strictly necessary. I think they make sense to keep time_cc generally useful.

> So the question remains: Should the result be exposed as rate_ctr or as stat_item?

We could do both, in fact. All the complex parts are already implemented and working correctly.

Next to the rate_ctr, we can just add a stat_item to time_cc, and publish the time count as stat item. But then we need to define the time periods and exact meaning of the stat_item values.
I encourage you to think the solution through in practice; you should see that the problem is not as trivial as it sounds at first. It is easy to add the stat_item as soon as it is clear which value the stat_item should reflect. We already have a value implemented that counts all seconds where all channels were allocated since osmo-bsc started. But does it make sense to publish that as a stat_item?

Here are the various ideas I had before we decided on a rate_ctr as the simplest and most effective solution:

"
I am thinking about the allAvailable{TCH,SDCCH}AllocatedTime indicators:

In 3GPP TS 52.402, there is a defined Granularity Period, which is configurable,
and suggested to have at least the settings of 5, 15, 30, 60 minutes.
The allAvailableXxxAllocatedTime indicators are defined as a cumulative counter (CC),
which I interpret as the number of seconds that all channels of the given kind were occupied.

A "problem" is that the meaning of this cumulative value depends on the Granularity Period.
For example, if the granularity period is 30 minutes, a cumulative value of 5 minutes for
"all channels allocated" means that the cell was congested roughly 17% of the time.
If the granularity period is only 5 minutes, then the exact same value means 100% congestion.
So it appears to me that it is less confusing / more meaningful to report the value in % of time?
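
The arithmetic behind these two figures, as a small sketch (congestion_percent is a name made up for this example; it rounds to the nearest integer percent):

```c
#include <assert.h>

/* Share of the granularity period during which all channels were allocated,
 * in percent, rounded to the nearest integer. Both arguments are in seconds. */
static unsigned int congestion_percent(unsigned int allocated_s, unsigned int period_s)
{
	assert(period_s > 0 && allocated_s <= period_s);
	return (100 * allocated_s + period_s / 2) / period_s;
}
```

With 5 minutes of all-allocated, a 30 minute period gives 17 and a 5 minute period gives 100.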

Looking at details of how to implement this, it appears that we need to first introduce this concept
of a Granularity Period to our statistics API. We have a stats reporter interval, which is usually
a lot shorter than 5 minutes. Also, so far this interval only affects the times at which an independently
defined value gets reported. IIUC we so far don't have any values that are dependent on the
reporting interval itself, where some cumulative counter value gets reset to zero whenever a reporting
period has elapsed.

Here are my ideas to implement such cumulative counters:

variant 1:
Internally, we clearly define a Granularity Period, as described in the spec. Let's say it is set to 5 minutes.
This Granularity Period is implemented completely independently from the stats reporting period.
At first, the cumulative counter is zero. For the next 5 minutes, we add up the times (in seconds) where all
channels were occupied. When the five minutes have elapsed, we "push" the cumulative value to a stat item and
reset the counter. So only one value will be published in a stat item every 5 minutes, and the value does not
change while we are busy accumulating the counter value for the next 5 minutes.
This seems most spec conforming. But this also seems kind of low resolution / slowly responsive.
The 5 minute period would be independent from the stat reporting period, i.e. there would be N stat reporting
periods where the stat does not change at all, e.g. for 5 minutes, and only then would we get a sum of the last
5 minutes, again staying fixed on the dashboard for the next 5 minutes.
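
A minimal sketch of variant 1, assuming a once-per-second tick; gran_period_cc and its fields are hypothetical names, not anything that exists in osmo-bsc:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of variant 1; none of these names exist in osmo-bsc. */
struct gran_period_cc {
	unsigned int period_s;    /* Granularity Period in seconds, e.g. 300 */
	unsigned int elapsed_s;   /* seconds elapsed in the current period */
	unsigned int accum_s;     /* seconds of all-allocated in the current period */
	unsigned int published_s; /* value visible in the stat item */
};

/* Call once per second. Returns true when a new value was published. */
static bool gran_period_cc_tick(struct gran_period_cc *cc, bool all_allocated)
{
	assert(cc->period_s > 0);
	if (all_allocated)
		cc->accum_s++;
	cc->elapsed_s++;
	if (cc->elapsed_s < cc->period_s)
		return false;
	/* Period complete: push the sum to the stat item and start over. */
	cc->published_s = cc->accum_s;
	cc->accum_s = 0;
	cc->elapsed_s = 0;
	return true;
}
```

The published value stays unchanged for the whole period, which is the slow response described above.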

variant 2:
We have two rate counters, one incrementing for each second where all channels were occupied (A), one incrementing
for each second where at least one channel was still available (B). These get reported continuously and also degrade
as rate counters do. Comparing one to the other, e.g. A / (A + B), gives a continuous indication of congestion rate.
So the value will gradually rise and fall as the seconds pass, and we don't have to wait five minutes to see that
congestion has occurred.

variant 2b:
It should actually suffice to have only one rate counter incrementing for each second where all channels were occupied.
Since rate counters implicitly count events per second, per minute, per hour, we can see that e.g. a rate of
60 per minute means that we have been continuously congested for the last minute.

variant 3:
We introduce a new kind of cumulative stat item which gets reset to zero whenever a stat reporting period has elapsed.
We have two such stat items, one counting the seconds congested (A), one counting seconds not congested (B),
and a meaningful statistic comes from comparing A to A+B. (the reporting period may then fluctuate without ill effects)

variant 3b:
A new cumulative stat item as in 3 could always implicitly report a percentage of the elapsed reporting period.

variant 3c:
just use a normal stat item, and introduce some callback function that can be set up to clear the stat item to zero
every time the stat report has been sent out.
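
Variant 3c could look roughly like this. The struct and the hook are hypothetical and only model the idea; this is not the libosmocore osmo_stat_item API, and "reporting" is reduced to returning the value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stat item with an optional "after report" hook; this is not
 * the libosmocore osmo_stat_item API. */
struct model_stat_item {
	int32_t value;
	void (*after_report)(struct model_stat_item *item);
};

/* Hook implementing the "clear to zero after each report" behavior. */
static void clear_after_report(struct model_stat_item *item)
{
	item->value = 0;
}

/* Send out the current value (here: just return it), then run the hook, if any. */
static int32_t model_stat_item_report(struct model_stat_item *item)
{
	int32_t sent = item->value;

	if (item->after_report != NULL)
		item->after_report(item);
	return sent;
}
```

An item without the hook keeps reporting its last value, while an item with the hook starts each reporting period from zero.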

For variant 2 (rate counters), we don't need to introduce configuration of a granularity period, nor invent a new kind
of stat item. But this is also the farthest away from how the performance indicator is defined in the spec.

We could also implement multiple variants. To me it would make sense to implement both variant 1 and 2b,
to have a most spec conforming stat item that reports less frequently, as well as a "running congestion counter" as
a rate counter that continuously shows a curve of congestion seen per time.
"

-- 
To view, visit https://gerrit.osmocom.org/c/osmo-bsc/+/25973

Gerrit-Project: osmo-bsc
Gerrit-Branch: master
Gerrit-Change-Id: Icdd36f27cb54b2e1b940c9e6404ba9dd3692a310
Gerrit-Change-Number: 25973
Gerrit-PatchSet: 1
Gerrit-Owner: neels <nhofmeyr at sysmocom.de>
Gerrit-Reviewer: Jenkins Builder
Gerrit-CC: laforge <laforge at osmocom.org>
Gerrit-CC: pespin <pespin at sysmocom.de>
Gerrit-Comment-Date: Mon, 01 Nov 2021 12:32:21 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment