Change in osmo-bsc[master]: add time_cc API: cumlative counter for time, reported as rate_ctr


neels has posted comments on this change. ( https://gerrit.osmocom.org/c/osmo-bsc/+/25973 )

Change subject: add time_cc API: cumlative counter for time, reported as rate_ctr
......................................................................

Patch Set 1:

Sorry for writing so much, but it seems necessary...

First off, still open point, orthogonal to rate_ctr vs stat_item decision:

Should I remove the configurability features and reduce to one fixed counting behavior, e.g. fix the granularity to seconds and the rounding scheme to round()? (The customer expressed that either round() or ceil() would be suitable, and that floor() is not desirable.) I think it makes more sense to keep the configurability, now that the code already works correctly. It is a bit of feature creep, and the only two arguments for keeping it are a handwavy "maybe useful at some point in the future" / "maybe some user likes it, idk", and a concrete "it would require investing even more effort to remove the features".
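To make the rounding question concrete, here is a minimal sketch of how the three schemes differ for the same elapsed exhaustion time. This is not the time_cc code: the function name, the parameters and the round-half-up behavior are assumptions made up for illustration.

```python
def count_ticks(elapsed_usec, gran_usec=1_000_000, scheme="round"):
    """Convert elapsed time into counter increments (hypothetical helper).

    elapsed_usec: accumulated time with the flag set, in microseconds
    gran_usec:    granularity, i.e. how much time one increment stands for
    scheme:       "floor", "round" (half rounds up, an assumption) or "ceil"
    """
    whole, rest = divmod(elapsed_usec, gran_usec)
    if scheme == "floor":
        return whole
    if scheme == "ceil":
        return whole + (1 if rest > 0 else 0)
    if scheme == "round":
        return whole + (1 if 2 * rest >= gran_usec else 0)
    raise ValueError("unknown rounding scheme: %r" % scheme)


if __name__ == "__main__":
    for scheme in ("floor", "round", "ceil"):
        # 0.4 s of exhaustion: only ceil() reports it at all
        print(scheme, count_ticks(400_000, scheme=scheme))
```

With floor(), an exhaustion shorter than one granularity period never shows up, which would explain why floor() is the scheme the customer does not want.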

> Well the question would then be: Can one still use the same external tools (grafana, elastic search, etc) with rate_ctr? I'm not sure how are those exported over statsd.

Yes, of course! The customer expressed the preferred way of reporting would be a rate counter.
Let me explain why, hopefully making more sense this time:

In the stats exporting, the main difference is that

- a stat_item is exported as the current value in each report. stat_item makes sense for values that rise and (possibly) fall, and where you want to read the actual current value, like number of active cells or say CPU load in percent, or uptime.

- a rate_ctr is exported as the number of increments since the last stats report. This makes sense for counting spike events over time. It is suitable where one is interested in the increments of a value per time, to see how busy a constantly rising value currently is, rather than in the absolute value itself.

A stat_item used for this cumulative time counter looks like a staircase: a horizontal line at times of no chan exhaustion, rising at times of chan exhaustion; given enough time, the value would at some point wrap the integer range. Imagine osmo-bsc has been running for weeks: the value could be a flat line at, say, 123000, and increment to 123001 when chan exhaustion occurs. When displaying such a graph with the entire y axis range, the increment is hardly visible. You need to zoom in on the y axis range, say 123000 to 123100, to even be able to see that channels are currently exhausted. Or you need to employ math to graph the gradient of the line instead, so that you get 0 for non-exhausted times and spikes for exhausted times. We're interested in the gradient's spikes, i.e. in the "1", and not in the "123000" baseline.

A rate_ctr looks like a city skyline, flat line at value 0 for no exhaustion, with spikes at times where chan exhaustion occurs, going back to zero when exhaustion is over.
IOW it already *is* the gradient of the chan exhaustion time counter, which is exactly the interesting information: the number of exhausted seconds since the last stat report.
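The staircase vs. skyline difference can be sketched in a few lines. This is only a model of the two reporting styles as described above, not libosmocore code; the class name and the sample values (apart from the 123000 baseline) are made up for illustration.

```python
class DeltaReporter:
    """Models rate_ctr style reporting: increments since the last report.

    Hypothetical illustration, not the libosmocore stats reporter.
    """

    def __init__(self, initial=0):
        self.last = initial

    def report(self, current):
        delta = current - self.last
        self.last = current
        return delta


# Cumulative exhausted seconds, sampled at each stats report (made-up values)
cumulative = [123000, 123000, 123001, 123003, 123003]

# stat_item style: each report carries the current absolute value (staircase)
staircase = list(cumulative)

# rate_ctr style: each report carries only what was added (skyline)
reporter = DeltaReporter(initial=123000)
skyline = [reporter.report(value) for value in cumulative]

if __name__ == "__main__":
    print(staircase)
    print(skyline)
```

The second list is what we want to look at: zero while nothing happens, nonzero exactly in the reports that cover exhausted seconds.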

A stat_item *would* make sense if it reported, say, the current percent of time where channels are exhausted. If you see the stat showing 100, you know the channels are currently all exhausted for all of the time. So something where the current value is the interesting metric. This however introduces complex design decisions: over what amount of time do we calculate the percentage? should that be configurable? when/how do we degrade the percentage when exhaustion is over?

A rate_ctr *is* the simplest, least convoluted and true way of passing the actual useful information to an external stats tool, "and letting grafana figure it out" if the user wants some exhaustion percentage graph; and letting the user's infrastructure figure out whether to evaluate exhaustion over 5/15/30/60 minutes as the spec suggests, without actually introducing these choices to the osmo-bsc code base.
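As a rough sketch of what "letting the user's infrastructure figure it out" could mean: given the per-report increments (exhausted seconds since the previous report), an external tool can derive an exhaustion percentage over whatever window it likes. The function name, the 60 second report interval and the sample values are assumptions for illustration, nothing of this is in osmo-bsc.

```python
def exhaustion_percent(increments, report_interval_s, window_s):
    """Percentage of time spent exhausted over the most recent window.

    increments:        exhausted seconds per stats report, oldest first
    report_interval_s: seconds between two stats reports (assumed constant)
    window_s:          evaluation window, e.g. 5 * 60 or 15 * 60 seconds
    """
    n_reports = max(1, window_s // report_interval_s)
    recent = increments[-n_reports:]
    if not recent:
        return 0.0
    return 100.0 * sum(recent) / (len(recent) * report_interval_s)


if __name__ == "__main__":
    per_report = [0, 0, 30, 60, 0]  # made-up values, one report per minute
    print(exhaustion_percent(per_report, 60, 5 * 60))
    print(exhaustion_percent(per_report, 60, 2 * 60))
```

The window is purely a parameter on the evaluating side, so the same exported counter serves any of the 5/15/30/60 minute evaluations.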

> In general I think the main difference on how we see it, is that your focus is to have it look nice when using VTY

Not at all. Looking useful on the VTY is just a side argument.
It is a compelling argument nevertheless, the main point being to visualize to you that a rate counter is the proper design choice. Read: "even the VTY output becomes more useful".
If I want to quickly check channel exhaustion without Grafana, a stat item is very much harder to interpret than a rate counter: you need to repeatedly watch the value change. A rate counter gives you instant information about the gradient, as explained earlier.

I have asked a number of times, but you have still not explained to me how a stat_item value should be designed in a useful and simple way. As I'm pointing out, a forever rising value is not very useful.

> while my point is that it should in first place be usable for external tools.

It *is* more useful to external tools as a rate counter.

> Moreover, I have the feeling you are just abusing the rate_ctr infrastructure with some logic
> just to get some output in VTY which you can understand (rate_ctr is aimed at tick events, not counting time).

Please understand that this is exactly what a rate counter is designed for.
We are interested in the current gradient, not the current value.
The "tick event" here being "channel exhaustion occurred for one entire second". Do not be confused by the fact that we are counting time over time. Time is involved twice in this metric! The metric is: "exhausted time, over time". Not simply "time since X", like e.g. uptime would be. That is an important difference that needs to be acknowledged.
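To illustrate "exhausted time, over time", here is a simplified model: time is accumulated only while a flag is set, and each full second of accumulated time becomes one tick event. This is not the time_cc implementation; the class, its methods and the fixed one-second granularity with round() are assumptions for the sake of the example.

```python
class ExhaustionTimeCounter:
    """Accumulates time while a flag is set, emits one tick per second.

    Hypothetical model: timestamps are passed in explicitly, in microseconds.
    """

    GRAN_USEC = 1_000_000

    def __init__(self):
        self.exhausted = False
        self.since = None       # timestamp of the last accumulation
        self.total_usec = 0     # cumulative exhausted time
        self.reported = 0       # ticks already handed out

    def _accumulate(self, now_usec):
        if self.exhausted:
            self.total_usec += now_usec - self.since
        self.since = now_usec

    def set_flag(self, now_usec, exhausted):
        self._accumulate(now_usec)
        self.exhausted = exhausted

    def poll(self, now_usec):
        """Return the new ticks since the previous poll (round() scheme)."""
        self._accumulate(now_usec)
        ticks = (self.total_usec + self.GRAN_USEC // 2) // self.GRAN_USEC
        new_ticks = ticks - self.reported
        self.reported = ticks
        return new_ticks


if __name__ == "__main__":
    cc = ExhaustionTimeCounter()
    cc.set_flag(0, False)
    print(cc.poll(10_000_000))   # nothing exhausted so far
    cc.set_flag(10_000_000, True)
    print(cc.poll(12_600_000))   # 2.6 s exhausted, rounded
    cc.set_flag(13_000_000, False)
    print(cc.poll(20_000_000))   # exhaustion is over, no new ticks
```

Wall-clock time passing while the flag is clear adds nothing, which is the difference to an uptime-like "time since X" value.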

I started out a long time ago thinking that a stat item would be best, and the math about it as well as a customer discussion convinced me otherwise. I would appreciate it if you could acknowledge these arguments and, if the argument is flawed, actually suggest a detailed way of reporting as a stat item in a useful way. What I am reading so far is merely generally brushing over my argument, and I read a dismissive tone; I hope I'm wrong there. I would appreciate it if we could keep this technical and detailed.

I'm happy to change this and make it more useful, if there is a compelling argument to do so. Haven't seen one yet. What part am I not getting?

-- 
To view, visit https://gerrit.osmocom.org/c/osmo-bsc/+/25973
To unsubscribe, or for help writing mail filters, visit https://gerrit.osmocom.org/settings

Gerrit-Project: osmo-bsc
Gerrit-Branch: master
Gerrit-Change-Id: Icdd36f27cb54b2e1b940c9e6404ba9dd3692a310
Gerrit-Change-Number: 25973
Gerrit-PatchSet: 1
Gerrit-Owner: neels <nhofmeyr at sysmocom.de>
Gerrit-Reviewer: Jenkins Builder
Gerrit-Reviewer: laforge <laforge at osmocom.org>
Gerrit-CC: pespin <pespin at sysmocom.de>
Gerrit-Comment-Date: Thu, 04 Nov 2021 11:27:46 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment