Right now I am wondering about fault recovery, i.e. what should the
trx do once it has detected missing data? Whatever it does has a low
chance of fixing the situation: once triggered, the condition
persists. This is also indicated by the fact that the logged "diff"
value is the *same* value in subsequent log messages, i.e. the
trx does not recover / rewind / adjust its timing to get back to normal.
This is a
very "dangerous" area. In a system like GSM, where there are
performance figures specified as part of the spec conformance, we should
be very careful about plastering over bugs like this.
Any system (hardware + software) must be able to handle processing of
all samples at any given point in time. If it can't handle this, it
introduces bit errors which, if they happen frequently/reproducibly,
will for sure degrade performance of the base station.
So the "right" solution is to find the issue and solve it, not to
"recover" by simply continuing with increased BER and degraded
performance.
If the system just magically recovers, I'm afraid people will put this
into production operation without understanding the gravity of the
problem, or that there is one at all.
I am in violent agreement, but the process did NOT exit, and sometimes
it DID recover; apart from the log messages I would not have seen
the problem at all, except for sporadic outages and other issues.
I was just wondering what the thinking had been on handling this
particular condition, apart from logging a LOG message...
i.e. what is the best thing to do, when the error is detected?
Mind you, I am just starting to learn how the trx does its job, and
thinking about what should happen next once this condition has occurred.
Is this something that *could* happen (without broken hw), and
is it meaningful to keep repeating the error?
Perhaps the "jump" in timestamp has that effect on the "rest" of the
trx.
What if the timestamp gets corrupted on its way from the Lime to the trx?
I think ftrace with the irqsoff, preemptoff and wakeup_rt tracers could
be one option to debug this further. If the periods with irqs/preemption
disabled correlate with your "high latency bursts", that would be a very
clear message.
Debugging continues, and my confusion will
rise to a higher level...
* maybe the polling interval (bInterval) in the
endpoint descriptors is
set too low?
Hmm, my crude measurements indicate that the trx's retrieval of the
samples is the cause, not a lack of data.
I'm not sure I understand yet how you reach that conclusion? It would
be interesting to get some kind of watermarks of the amount of "used"
libusb USB transfers inside LimeSuite. Maybe it's also worth increasing
them or their size?
Well, possible explanations:
1. The limesdr sometimes fails to deliver a significant number of packets,
since the time "jumps" by a large amount.
2. The trx / hw / linux fails to read packets, causing the Lime to be unable
to deliver the data, until trx / hw / linux becomes responsive again.
3. ???
It looks like my crude tests show that the trx can loop and get data
every 170 µs, but sometimes it does not come back within 100 times
that. Why?
To me 2. seems most probable....but I will see if I can check in LimeSuite.
Also, tests with Limesdr and *other* applications can give clues....
* maybe there's some other process using too much
cpu / pre-empting
osmo-trx?
Yes, it looks like that.
What about modifying osmo-trx to
simply read and discard the samples,
rather than processing them? Do you still get the overruns then?
I'll
check....
>> Your test seems to be looking at the second
>> part. You can use a
>> CLOCK_MONOTONIC time source to take timestamps, as you indicated.
I modified it to use CLOCK_MONOTONIC; no obvious change...
It tells you how much CPU time a given process has consumed; it is
not an absolute/reference clock. At least my understanding was that you
wanted to take "absolute" timestamps. CLOCK_MONOTONIC_RAW is probably
the best candidate for that.
The fight goes on....
Regards,
Harald
Regards,
Gullik