Ok, some more experiments.

I have made a small table that logs Linux time diffs in nanoseconds each time LMSDevice is called. From my first test this indicates that on this particular platform, within the last 8 calls the time can be as low as 170 µs, i.e. a value of roughly 170000. But I also get times up to 44625499, i.e. 44.6 ms, and the values in the table can either look like:

$2 = {32345376, 28481917, 16771791, 15794875, 16805792, 17252958, 44625499, 33037584}

indicating that several calls after one another had long times, or like:

$4 = {198750, 179625, 33702624, 16127416, 27990666, 16007875, 13552168, 16100124}

where the latest and second-latest values are low, but follow a sequence of long times.
Thus, we are not dealing with a single interruption of short latency, but with an extended period of long latency / interference. Once the condition occurs, I get hundreds of time-mismatch log entries, so it does not recover. Right now I am wondering about fault recovery, i.e. what should the trx do once it has detected missing data? Whatever it does has a low chance of fixing the situation; once triggered, the condition persists. This is also indicated by the fact that the logged "diff" value is the *same* value in subsequent log entries, i.e. the trx does not recover / rewind / adjust timing to get back to normal.
> Are you running the osmo-trx process with real-time priority (SCHED_RR)?

I tried that with no obvious effect...
> What is the CPU load? Please note that on a multi-core system the
> interesting bit is not the average load over all CPU cores, but the
> maximum load of any one of the cores.
"Normal" load is the trx process taking 80-100 % out of 4 CPUs, i.e. htop shows 4 CPUs each with 20-25 % load. trx seems to spread its threads over all CPUs.
> Correct. This is a problem we've been observing on a variety of
> platforms for quite some time. Some samples are lost.
>
> * maybe the polling interval (bInterval) in the endpoint descriptors is
>   set too low?
Hmm, my crude measurements indicate that the trx retrieval side is the cause, not lack of data.
> * maybe the number / size of bulk-in USB transfers (URBs) is
>   insufficient and/or they are not re-submitted fast enough.
> * maybe there's some other process using too much CPU / pre-empting
>   osmo-trx?
Yes, it looks like that.
> Your test seems to be looking at the second part. You can use a
> CLOCK_MONOTONIC time source to take timestamps, as you indicated.
I used

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time);

Maybe I should refine my test...
Thanx for your comments,
Gullik