After 35c3 and talking about statistics, it has become apparent to me that we lack a good way of monitoring channel usage / availability; TL;DR: I think we need min/max aggregators that sync with the stats push/poll time period.
This is just an idea I'm getting, not planning on implementing anything now, nor have I really done my homework and tried to achieve useful stats. Has anyone discussed this before / created an issue / solved it in a different way?
Here goes...
IIUC we have counters that we can poll/push in a given time period, so that the data we get out makes sense and has no "holes" or "overlaps" in it.
However, we also have highly volatile numbers that are extremely interesting for an operator to see: how many channels of which kind are currently still available?
I asked around and one solution I heard is to poll the CTRL interface once per second and aggregate min/max values before sending on to statistics once per minute. That could be considered close enough, but we can do better.
Polling momentary values has holes in it, and doesn't scale well. If, e.g., one new channel request comes in while at the same time another channel is released, we might for a short time hit a situation of no more channels being available, and the polling might just miss that and would show more available channels than we actually had. Scaling that situation up hypothetically: we might have turned down 5 channel requests while the polled number still shows available lchans at all times. So, we should allow:
- seeing peak usage
- in a pushing-stats fashion
- that is still useful when sampled only, say, once per minute.
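To make the blind spot concrete, here is a minimal sketch with a made-up event trace (the timestamps, the cell size of 4 lchans and the helper name `free_at` are assumptions for illustration, not taken from osmo-bsc). It compares what a once-per-second poller sees with what actually happened between polls.

```python
# Hypothetical event trace: (time in seconds, change in free lchans).
# Not taken from a real BSC; values chosen to illustrate the blind spot.
TOTAL_LCHANS = 4

events = [
    (0.2, -1), (0.4, -1), (1.3, -1), (1.5, -1),  # all four lchans taken
    (1.7, +1), (1.8, +1),                         # two released before the next poll
    (2.6, -1), (3.4, +1),
]

def free_at(t):
    """Free lchans at time t, replaying all events up to and including t."""
    return TOTAL_LCHANS + sum(delta for when, delta in events if when <= t)

# What a once-per-second poller sees at t = 0, 1, 2, 3, 4.
polled = [free_at(t) for t in range(0, 5)]

# What actually happened: evaluate availability after every single event.
true_min = min(free_at(when) for when, _ in events)

print("polled samples:", polled)
print("polled minimum:", min(polled))
print("true minimum:  ", true_min)
```

In this trace the poller never sees fewer than one free lchan, although the cell was fully occupied between t=1.5 and t=1.7.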
One idea would be to push out a new number as soon as channel availability changes, but that again doesn't scale well (might generate too many events when monitoring a large number of cells).
So I'm thinking that we should aggregate the minimum-available lchan counts within osmo-bsc per stats timeframe.
There should be separate minimum-available numbers for each lchan kind.
I guess minimum-available is more useful than maximum-used lchans, but we could also provide both.
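A rough sketch of what such an aggregator could look like, written in Python for brevity rather than as osmo-bsc C code. The class name `LchanAvailability`, its methods and the lchan counts are hypothetical; the only idea taken from the text above is that the minimum is updated on every change and reset when the stats period rolls over.

```python
class LchanAvailability:
    """Tracks free lchans of one kind plus min-free / max-used per stats period."""

    def __init__(self, total):
        self.total = total
        self.free = total
        self.min_free = total

    def _update(self, delta):
        self.free += delta
        # Updated on every change, so no short dip can be missed.
        self.min_free = min(self.min_free, self.free)

    def allocate(self):
        if self.free == 0:
            return False  # request would be turned down
        self._update(-1)
        return True

    def release(self):
        self._update(+1)

    def report(self):
        """Called once per stats period; returns the extremes and starts a new period."""
        result = {"min_free": self.min_free, "max_used": self.total - self.min_free}
        self.min_free = self.free  # the new period starts from the current level
        return result


# One aggregator per lchan kind (kinds and sizes are example values).
lchans = {"SDCCH": LchanAvailability(8), "TCH/F": LchanAvailability(4)}

tch = lchans["TCH/F"]
for _ in range(4):
    tch.allocate()          # cell briefly fully occupied
for _ in range(3):
    tch.release()           # most calls end before the period is over

first_period = tch.report()
second_period = tch.report()
print(first_period, second_period)
```

Since max-used is just total minus min-free, providing both costs nothing extra as long as the total stays constant within a period.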
Also, if I want to find out how many lchans I need to add to provide adequate service, it would be good to somehow determine the maximum number of "concurrent" turned down channel requests. We probably already have that in a per-second moving average? But here again, if I sample a per-second moving average only once per minute, I will miss the maximum value that this per-second value has reached in that minute. This probably also needs a think-over from a practical "I want useful stats" POV.
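One possible reading of "concurrent" is "within the same second". Under that assumption, a sketch of keeping the per-second peak alongside the per-period total could look like this (the function name and the timestamps are made up for illustration):

```python
from collections import Counter

def peak_rejections_per_second(reject_times, period_start, period_len=60):
    """Return (total, peak per-second count) of rejections within one period."""
    per_second = Counter(
        int(t) for t in reject_times
        if period_start <= t < period_start + period_len
    )
    total = sum(per_second.values())
    peak = max(per_second.values(), default=0)
    return total, peak

# Hypothetical timestamps (seconds) at which channel requests were turned down.
reject_times = [12.1, 12.3, 12.4, 12.9, 40.5, 75.0]

total, peak = peak_rejections_per_second(reject_times, period_start=0)
print(total, peak)
```

A once-per-minute sample of a per-second average would most likely report 0 or 1 here, while the peak shows that four requests were turned down within one second.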
Or am I missing something?
Thanks,
~N
On 30/01/2019 14:43, Neels Hofmeyr wrote:
> I asked around and one solution I heard is to poll the CTRL interface once per second and aggregate min/max values before sending on to statistics once per minute. That could be considered close enough, but we can do better.
So I do that, except that I actually poll the VTY (yes, I know...) and much less often. I only poll channels in use every 60 seconds, which gives a rough idea of general usage patterns during the day, but is rather useless for detecting how often we experience saturation.
> Polling momentary values has holes in it, and doesn't scale well. If, e.g., one new channel request comes in while at the same time another channel is released, we might for a short time hit a situation of no more channels being available, and the polling might just miss that and would show more available channels than we factually had. If hypothetically scaling up such a situation: we might actually have turned down 5 channel requests while the polled number still shows available lchans at all times. So, we should allow:
So, in order to compensate somewhat for what I just described, I poll the "no channel" counter and that gives me an idea of how many chan requests were rejected in the period. I only do this every 5 mins though.
OpenBSC# show statistics
Channel Requests : 1 total, 0 no channel
                            ^^^^^^^^^^^ this one.
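For reference, the polling approach described above boils down to something like the following sketch. The regular expression only matches the line as quoted in this mail; the helper names are hypothetical, and treating a counter that went backwards as a restart is an assumption.

```python
import re

# Matches the line as shown above; spacing around ':' is allowed to vary.
LINE_RE = re.compile(r"Channel Requests\s*:\s*(\d+) total,\s*(\d+) no channel")

def parse_channel_requests(vty_output):
    """Return (total, no_channel) from 'show statistics' output."""
    m = LINE_RE.search(vty_output)
    if m is None:
        raise ValueError("no 'Channel Requests' line found")
    return int(m.group(1)), int(m.group(2))

def rejected_in_period(prev, curr):
    """Difference of two cumulative readings; assume a restart if it went backwards."""
    return curr - prev if curr >= prev else curr

sample = "Channel Requests : 1 total, 0 no channel"
print(parse_channel_requests(sample))
print(rejected_in_period(3, 10))
```

This tells you how many requests were rejected in the period, but not when within the period, nor for how long the cell was saturated.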
> One idea would be to push out a new number as soon as channel availability changes, but that again doesn't scale well (might generate too many events when monitoring a large number of cells).
Yes, I think so.
> So I'm thinking that we should aggregate the minimum-available lchan counts within osmo-bsc per stats timeframe.
> There should be separate minimum-available numbers for each lchan kind.
yep. that would be great.
In general I have a very basic to zero knowledge of the "science" of stats collection, KPI etc.
But I imagine the industry has a standard? Maybe we can follow it?
I will take a look at the KPI talk from OsmoDevCon again https://media.ccc.de/v/SE8HRK
> I guess minimum-available is more useful than maximum-used lchans, but we could also provide both.
I do think there's something to be said for counters that count "error" situations, like no channel available. Then you know that this happened, without trying to constantly count channels in use and having to be concerned about that microsecond between channel release and channel request that may or may not overlap. Actually, I do not know how big that window is; maybe more than a microsecond :)
> Also, if I want to find out how many lchans I need to add to provide adequate service, it would be good to somehow determine the maximum number of "concurrent" turned down channel requests.
Maybe "what was the duration of complete saturation" might be a good question.
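As a rough sketch of how that question could be answered: accumulate the time spent at zero free lchans and report it once per stats period. The class `SaturationTimer`, its methods and the timestamps below are hypothetical, just to illustrate the bookkeeping.

```python
class SaturationTimer:
    """Accumulates how long free lchans stayed at zero within a stats period."""

    def __init__(self):
        self.saturated_since = None  # timestamp when free count hit zero, if any
        self.accumulated = 0.0

    def on_change(self, free, now):
        """Call whenever the number of free lchans changes."""
        if free == 0 and self.saturated_since is None:
            self.saturated_since = now
        elif free > 0 and self.saturated_since is not None:
            self.accumulated += now - self.saturated_since
            self.saturated_since = None

    def report(self, now):
        """Return seconds of saturation in this period and start a new period."""
        total = self.accumulated
        if self.saturated_since is not None:
            total += now - self.saturated_since
            self.saturated_since = now  # still saturated; carry into next period
        self.accumulated = 0.0
        return total


timer = SaturationTimer()
timer.on_change(free=0, now=10.0)   # last lchan taken
timer.on_change(free=1, now=12.5)   # one released
timer.on_change(free=0, now=58.0)   # saturated again, across the period boundary
first = timer.report(now=60.0)
timer.on_change(free=2, now=61.0)
second = timer.report(now=120.0)
print(first, second)
```

Unlike a "no channel" counter, this also registers saturation during which no request happened to arrive.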
I'll try to come up with a list of "questions" like that.