I have now moved the LimeSDR mini back and forth between the Orange Pi Zero and a Supermicro (i386). On the Pi Zero I get the following on the console:
Tue Jan 22 20:36:43 2019 DMAIN <0000> Transceiver.cpp:1039 [tid=3043130448] ClockInterface: sending IND CLOCK 321960
Tue Jan 22 20:36:44 2019 DMAIN <0000> Transceiver.cpp:1039 [tid=3043130448] ClockInterface: sending IND CLOCK 322176
Tue Jan 22 20:36:45 2019 DDEV <0002> LMSDevice.cpp:600 [tid=3043130448] chan 0 recv buffer of len 2500 expect 894a7c got 895e68 (895e68) diff=13ec
Tue Jan 22 20:36:45 2019 DDEV <0002> LMSDevice.cpp:600 [tid=3043130448] chan 0 recv buffer of len 2500 expect 895440 got 897024 (897024) diff=1be4
Tue Jan 22 20:36:45 2019 DDEV <0002> LMSDevice.cpp:600 [tid=3043130448] chan 0 recv buffer of len 2500 expect 895e04 got 897de4 (897de4) diff=1fe0
The DDEV messages can continue for a long time, and typically end with the transceiver exiting. It can however sometimes recover; the first sign of recovery is an IND CLOCK message, after which the transceiver restarts and all is well again for some undefined time.
I have followed your suggestions, rebuilt with --with-neon-vfpv4, and enabled debugging. If I understand correctly, the function involved is readSamples, which is the actual input from the Lime rx. The length seems always correct, since I do not get that log entry, but rather the time has "slipped", i.e. the LMS API has not delivered anything for "diff" time, or the timestamp received has "jumped" in the Lime.
This indicates to me that this specific ARM CPU, in combination with the LimeSDR mini and the software, "drops data". I will gladly try to debug or narrow this down, but would like some suggestions on how to proceed. One thought is to save a timestamp each time through readSamples and compare to some constant, to rule out that the problem is that we are unable to read as fast as required. Reading 2500 samples should take 2.3 ms if I understand this correctly, so we need to call readSamples at about this rate...
Possible causes could be something *else* locking out the program.
Hi Gullik,
On Wed, Jan 23, 2019 at 12:21:49AM +0100, Gullik Webjorn wrote:
I have followed your suggestions, and rebuilt --with-neon-vfpv4 , and I have enabled debugging.
Are you running the osmo-trx process with real-time priority (SCHED_RR)? What is the CPU load? Please note on a multi-core system the interesting bit is not the average load over all CPU cores, but the maximum load of any one of the cores.
The length seems always correct, since I do not get that log entry, but rather that the time has "slipped", i.e. the LMS api has not delivered anything for "diff" time, or the timestamp received has "jumped" in the Lime. [...] This indicates to me that this specific arm cpu in combination with limesdr mini and the software "drops data". I will gladly try to debug or narrow this down, but ask for some suggestions on how to proceed.
Correct. This is a problem we've been observing on a variety of platforms for quite some time. Some samples are lost.
* maybe the polling interval (bInterval) in the endpoint descriptors is set too low?
* maybe the number / size of bulk-in USB transfers (URBs) is insufficient and/or they are not re-submitted fast enough.
* maybe there's some other process using too much cpu / pre-empting osmo-trx?
* maybe there's some [buggy?] driver used on this system that disables/masks interrupts or otherwise causes high scheduler latencies, by disabling pre-emption or the like?
* maybe there's some bios/firmware/management-mode code that can interrupt normal OS processing without the OS even knowing about it.
* maybe there's some power management (cpu speed throttling, thermal throttling, ...) interfering?
I'm not familiar with the inner workings of LimeSuite, but any program that expects to achieve high performance with libusb should (IMHO) be using the asynchronous API of libusb, and it should make sure there are always multiple URBs submitted at any given point in time, so that the kernel can handle data from the USB device without interruption.
If I read LimeSuite correctly, they are submitting 16 URBs for reading (USB_MAX_CONTEXTS, returned by GetBuffersCount(), used by ReceivePacketsLoop(), which in turn calls the somewhat interestingly named method dataPort->BeginDataReading() for each of the buffers).
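For illustration, the general "always keep several bulk-in transfers in flight" pattern looks roughly like this (a sketch only, not LimeSuite's actual code; the buffer count, buffer size and endpoint are made up):

#include <libusb-1.0/libusb.h>

static const int NUM_XFERS = 16;     /* cf. USB_MAX_CONTEXTS in LimeSuite */
static const int XFER_SIZE = 16384;  /* arbitrary bulk buffer size */

static void LIBUSB_CALL rx_cb(struct libusb_transfer *xfer)
{
        if (xfer->status == LIBUSB_TRANSFER_COMPLETED) {
                /* hand xfer->buffer / xfer->actual_length to the sample FIFO here */
        }
        /* re-submit immediately so the kernel never runs out of buffers */
        libusb_submit_transfer(xfer);
}

static void start_streaming(libusb_device_handle *h, unsigned char ep_in)
{
        for (int i = 0; i < NUM_XFERS; i++) {
                struct libusb_transfer *xfer = libusb_alloc_transfer(0);
                unsigned char *buf = new unsigned char[XFER_SIZE];
                libusb_fill_bulk_transfer(xfer, h, ep_in, buf, XFER_SIZE,
                                          rx_cb, nullptr, 1000 /* ms timeout */);
                libusb_submit_transfer(xfer);
        }
        /* a dedicated thread then loops on libusb_handle_events(nullptr) */
}

With enough transfers queued like this, a short delay in the userspace reader should only raise the FIFO fill level, not drop samples on the bus.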
https://elinux.org/images/c/c8/Debugging_Methodologies_for_Realtime_Issues_i... is a good introduction, though it may be a bit dated.
One thought is to save a timestamp each time through readSamples and compare to some constant, to rule out that the problem is that we are unable to read as fast as required. Reading 2500 samples should take 2.3 ms if I understand this correctly, so we need to call readSamples at about this rate...
The data can be lost (at least)
* between the USB device and the USB host, if the bus is overloaded or somehow the kernel / hardware cannot handle/schedule transfers fast enough, or
* between kernel and userspace
Your test seems to be looking at the second part. You can use a CLOCK_MONOTONIC time source to take timestamps, as you indicated.
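A minimal sketch of such instrumentation (names and the 8-entry table size are purely illustrative; in practice this would be called once per readSamples() invocation and the array then inspected from gdb):

#include <time.h>
#include <stdint.h>

/* ring buffer of the last 8 inter-call intervals, in nanoseconds */
static int64_t rx_intervals[8];
static unsigned rx_idx;

static void log_rx_interval(void)
{
        static struct timespec prev;
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        if (prev.tv_sec || prev.tv_nsec) {
                int64_t diff_ns = (int64_t)(now.tv_sec - prev.tv_sec) * 1000000000LL
                                + (now.tv_nsec - prev.tv_nsec);
                rx_intervals[rx_idx++ % 8] = diff_ns;
        }
        prev = now;
}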
Possible causes could be something *else* locking out the program.
Yes. That's why in the normal mode of operation you usually start by running osmo-trx with SCHED_RR / realtime priority. This way, normal tasks that run at regular priority are not going to interfere anymore. But then, that still leaves tons of kernel/driver code, and hardware/bios/firmware...
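For reference, a minimal sketch of what raising a thread to SCHED_RR looks like via the POSIX API (the priority value here is arbitrary; the same effect can be had by starting the process under chrt -r):

#include <sched.h>
#include <sys/mman.h>
#include <stdio.h>

static int enable_realtime(void)
{
        struct sched_param sp;
        sp.sched_priority = 18;        /* arbitrary; valid RT range is 1..99 */

        /* lock memory so page faults do not add to scheduling latency */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                perror("mlockall");

        /* 0 = calling thread; threads created afterwards inherit the policy */
        return sched_setscheduler(0, SCHED_RR, &sp);
}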
It would be great to shed more light on this, but it likely needs very thorough analysis across all layers of a system.
Ok, some more experiments. I have made a small table that logs Linux time diffs in nanoseconds each time LMSDevice is called. From my first test this indicates that on this particular platform, within the last 8 calls the time can be as low as 170 us, i.e. a value of roughly 170000. But I also get times up to 44625499, i.e. 44.6 ms, and the values in the table can either look like:
$2 = {32345376, 28481917, 16771791, 15794875, 16805792, 17252958, 44625499, 33037584}
indicating that several consecutive calls took a long time, or like:
$4 = {198750, 179625, 33702624, 16127416, 27990666, 16007875, 13552168, 16100124}
where the latest and second-latest values are low, but follow a sequence of long times.
Thus, we are not dealing with a single interruption of short latency, but with an extended period of long latency / interference. Once the condition occurs, I get hundreds of logs of time mismatch, so it does not recover. Right now I am wondering about fault recovery, i.e. what should the trx do once it has detected missing data? Whatever it does has a low chance of fixing the situation; once triggered, the condition persists. This is also indicated by the fact that the logged "diff" value is the *same* value in subsequent loggings, i.e. the trx does not recover / rewind / adjust timing to get back to normal.
Are you running the osmo-trx process with real-time priority (SCHED_RR)?

I tried that, with no obvious effect...
What is the CPU load? Please note on a multi-core system the interesting bit is not the average load over all CPU cores, but the maximum load of any one of the cores.
"Normal" load is trx process taking 80 - 100 % out of 4 cpus, i.e. htop shows 4 cpus each with 20-25% load. trx seems to spread its threads over all cpus.
Correct. This is a problem we've been observing on a variety of platforms for quite some time. Some samples are lost.
- maybe the polling interval (bInterval) in the endpoint descriptors is set too low?
Hmm, my crude measurements indicate that the trx's retrieval is the cause, not a lack of data.
- maybe the number / size of bulk-in USB transfers (URBs) is insufficient and/or they are not re-submitted fast enough.
- maybe there's some other process using too much cpu / pre-empting osmo-trx?
Yes it looks like that
Your test seems to be looking at the second part. You can use a CLOCK_MONOTONIC time source to take timestamps, as you indicated.
I used
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time);
Maybe I should refine my test....
Thanx for your comments, Gullik
Just to make sure:
We are reading 270 kbit/sec from the radio. Each sample is I + Q, 16 bits, 4 bytes. With the Lime, samples per symbol is 4, so we are getting 4 bytes * 4 * 270k.
So, we need to read 4.333 Mbytes / second. We read these 2500 bytes at a time, i.e. 1733 "packets" per second, each packet at a 576 us interval (average).
Is this what to expect? Gullik
Hi Gullik,
I'm not the expert on osmo-trx, but as nobody else has followed-up so far:
On Wed, Jan 23, 2019 at 01:44:09PM +0100, Gullik Webjorn wrote:
We are reading 270 kbit/sec from the radio. Each sample is I + Q, 16 bits, 4 bytes
To be more precise: 270 ksymbols/sec. Every sample is 16-bit I + 16-bit Q, hence four bytes in total.
With the Lime, samples per symbol is 4, so we are getting 4 bytes * 4 * 270k.
indeed.
So, we need to read 4.333 Mbytes / second.
also seems correct as per my understanding.
We read these 2500 bytes at a time, i.e. 1733 "packets" per second, each packet at a 576 us interval (average).
I cannot comment on that, but if those are the figures you found in the code, then that's it.
I have tried to investigate *where* the type #3339 bug occurs. I was thinking in terms of either the Lime hw / fw / driver dropping packets, or something screwing up the timestamp, leading to the perception of lost data.
I have added some debug printouts, where LMS_GetStreamStatus is called, and as the error occurs, I also get dropped packets. The explanation from the API is that droppedPackets = "Number of dropped packets by HW."
To me this indicates that the program / usb driver / usb is not emptying the on-board FIFO frequently enough, and that this is the cause of the problem (as Harald indicated he suspected).
At least I have convinced myself that:

1. Packets are lost within the Lime board (since it reports them) and not in the higher levels of code.
2. It is indeed "low level", i.e. a Lime driver or USB driver bug, or just the fact that we do not get time to process quickly enough, which I feel we should have plenty of CPU to compensate for.
Gullik
And yes, we should not confuse bits and symbols :-)
Findings so far:
I changed to timing the RXlower thread, measuring the time it takes from exiting the readSamples function in Transceiver to the entry just before the call into LimeSuite. These are the last 32 values, in microseconds:
$1 = {236, 190, 267, 4226, 4173, 5822, 3881, 56079, 7598, 7085, 4959, 7832, 7808, 7772, 5079, 222, 199, 222, 202, 201, 191, 209, 210, 198, 211, 188, 203, 175, 239, 177, 215, 215} ( the index is 15 )
The first 3 samples and the last 17 show that it is typically possible to execute all code in RXlower (not including recvStream) in about 200 us.
There are 12 values in sequence, beginning with 4.226 ms and including 56.079 ms, where the "nonblocking" code is interrupted by other activity on the Orange Pi Zero / Armbian 9. Whatever this is, it runs at a higher priority than -19, and causes the FIFO on board the LimeSDR mini to fill, until we get a report of "dropped by HW".
From this point the error is not recoverable, and can only be cleared by the Transceiver restarting (with new timestamps).
The conclusion is that there is no problem with Transceiver (except for detection of this condition): as long as the Transceiver process gets enough CPU, all works as expected. However, "something" runs for 122 ms, severely degrading the latency. There IS processing time available for Transceiver, but the reduction in available CPU cycles is too severe for proper operation.
The 2500 samples read correspond to 625 symbols, i.e. 4 timeslots, for a total of 2.307 ms, which is the rate we must keep on average to keep the LimeSDR happy. We can tolerate several "misses", since we can complete the loop in approx. 200 us, but if the condition persists too long, the LimeSDR will overwrite its FIFO, log dropped packets, and Transceiver will hit error #3339 or exit. If Transceiver *always* got about 200 us of running time every 2.3 ms we would be fine...
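To spell out that time budget (a back-of-the-envelope check, assuming the GSM symbol rate of 1625000/6 symbols/s and 4 samples per symbol; the numbers are not taken from the osmo-trx source):

#include <cstdio>

int main()
{
        const double sym_rate = 1625000.0 / 6.0;  /* GSM symbol rate, ~270833 sym/s */
        const int sps = 4;                        /* samples per symbol */
        const int chunk = 2500;                   /* samples per readSamples() call */

        double sample_rate = sym_rate * sps;      /* ~1.083 Msamples/s */
        double bytes_per_sec = sample_rate * 4;   /* 16-bit I + 16-bit Q per sample */
        double chunk_ms = chunk / sample_rate * 1e3;  /* time one chunk covers on air */

        printf("%.3f MB/s, one 2500-sample chunk = %.3f ms (%.0f symbols)\n",
               bytes_per_sec / 1e6, chunk_ms, (double)chunk / sps);
        return 0;
}

This prints roughly 4.333 MB/s and 2.308 ms per 625-symbol chunk, matching the figures above.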
Now on to finding the culprit... or just change HW / OS... this combo has lousy RT characteristics...
Regards, Gullik
On 2019-01-29 12:20, Gullik Webjorn wrote:
I have tried to investigate *where* the type #3339 bug occurs. I was thinking in terms of either the Lime hw / fw / driver dropping packets, or something screwing up the timestamp, leading to the perception of lost data.
I have added some debug printouts, where LMS_GetStreamStatus is called, and as the error occurs, I also get dropped packets. The explanation from the API is that droppedPackets = "Number of dropped packets by HW."
To me this indicates that the program / usb driver / usb is not emptying the on-board FIFO frequently enough, and that this is the cause of the problem (as Harald indicated he suspected).
At least I have convinced myself that:

1. Packets are lost within the Lime board (since it reports them) and not in the higher levels of code.
2. It is indeed "low level", i.e. a Lime driver or USB driver bug, or just the fact that we do not get time to process quickly enough, which I feel we should have plenty of CPU to compensate for.
Gullik
And yes, we should not confuse bits and symbols :-)
Hi Gullik,
thanks a lot for your excellent analysis and for sharing it here.
It is by far the most in-depth study of LimeSDR<->osmo-trx related problems that I've seen. Great.
Now the interesting question of course is how to analyze this further.
Linux has received *tons* of tracing mechanisms during the last decade; [un]fortunately I've only read docs/tutorials and never had to use any of them so far. But there are for sure tools that allow one to analyze where the latency is coming from.
It may be some particularly nasty driver on your platform. It may be thermal throttling due to insufficient cooling of the CPU, ...
I guess you have already deactivated virtually everything that's not required on the system and running in the background? I'm not so much worried about other background tasks at normal priorities, but more about what kind of peripherals they might interact with, through which device drivers.
Another idea is to exclude any influence of the block layer by ensuring that no log files are written, either directly from osmo-trx or indirectly via stdio redirection / syslog or the like. Otherwise the occasional flush to the block device might be a possible cause.
On Tue, Feb 05, 2019 at 04:54:49PM +0100, Gullik Webjorn wrote:
Now on to finding the culprit... or just change HW / OS... this combo has lousy RT characteristics...
People have observed the same pattern also on other hardware, so finding a way to systematically debug this kind of issue will help not only on your particular platform today.
Regards, Harald
Closing this thread.
On 2019-02-05 19:03, Harald Welte wrote:
It may be some particularly nasty driver on your platform. It may be thermal throttling due to insufficient cooling of the CPU, ...
Regards, Harald
Well, now the puzzle is solved for this platform. The Orange Pi Zero in conjunction with Armbian has a CPU temperature governor. Reducing the max CPU speed to 816 MHz, the CPU utilization goes up to 200% (out of 400%). This lowers the temperature by approx. 10 C. When the CPU temperature exceeds some value, the governor downshifts the clock, execution of Transceiver is disturbed, and I am now convinced this is the cause of the problem on this platform.
It does not make complete sense from a mathematical point of view, but I do not have insight into the inner workings of the "governor". By reducing the maximum rate and increasing the minimum rate I have reached stability overnight. A heat sink has also been added to the CPU. Using armbianmonitor -m, the critical temperature seems to be approx. 65 C.
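For anyone wanting to correlate the latency bursts with governor activity, a small monitor along these lines can help (the sysfs paths are the ones commonly found on ARM boards; they are an assumption here and may differ per platform):

#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <chrono>

static long read_sysfs_long(const std::string &path)
{
        std::ifstream f(path);
        long v = -1;
        f >> v;
        return v;
}

int main()
{
        for (;;) {
                long khz = read_sysfs_long(
                    "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
                long mdeg = read_sysfs_long(
                    "/sys/class/thermal/thermal_zone0/temp");  /* millidegrees C */
                std::cout << khz / 1000 << " MHz, " << mdeg / 1000 << " C" << std::endl;
                std::this_thread::sleep_for(std::chrono::seconds(1));
        }
}

A frequency downshift lining up with the LMSDevice "diff" log entries would confirm the throttling theory.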
My earlier observation is that the bug was hit "randomly", sometimes after 30 seconds, sometimes not for several hours. This randomness is most probably due to room temperature, drafts from people entering or exiting the lab, and air flow around the system.
With regards to the osmo-trx software, the changes I propose are:
1) An ALERT log in:
void LMSDevice::update_stream_stats(size_t chan, bool *underrun, bool *overrun)
{
        lms_stream_status_t status;

        if (LMS_GetStreamStatus(&m_lms_stream_rx[chan], &status) == 0) {
                if (status.underrun > m_last_rx_underruns[chan])
                        *underrun = true;
                m_last_rx_underruns[chan] = status.underrun;

                if (status.overrun > m_last_rx_overruns[chan])
                        *overrun = true;
                m_last_rx_overruns[chan] = status.overrun;

                /* added: if the radio drops packets it is good to know it; this is a
                 * FATAL condition with regards to stable operation of Transceiver */
                if (status.droppedPackets != 0) {
                        dropped = dropped + status.droppedPackets;
                        LOGC(DDEV, ALERT) << "Dropped " << status.droppedPackets
                                          << " Total dropped " << dropped;
                }
        }
}
2) Possibly making this condition trigger either a Transceiver exit or a controlled restart (which happens anyway, but many seconds later).
Now on to "balancing LimeSDR up/down links"......
Regards,
Gullik
On Wed, Jan 23, 2019 at 01:05:58PM +0100, Gullik Webjorn wrote:
platform, within the last 8 calls the time can be as low as 170 us, i.e. a value of roughly 170000. But I also get times up to 44625499, i.e. 44.6 ms,
the question is what is the target/expected value here, i.e. how many samples at which sample rate do we expect to read in every call, and what's the resulting interval?
Right now I am wondering about fault recovery, i.e. what should the trx do once it has detected missing data? Whatever it does has a low chance of fixing the situation; once triggered, the condition persists. This is also indicated by the fact that the logged "diff" value is the *same* value in subsequent loggings, i.e. the trx does not recover / rewind / adjust timing to get back to normal.
This is a very "dangerous" area. In a system like GSM, where there are performance figures specified as part of the spec conformance, we should be very careful about plastering over bugs like this.
Any system (hardware + software) must be able to handle processing of all samples at any given point in time. If it can't handle this, it introduces bit errors which, if they happen frequently/reproducibly, will for sure degrade performance of the base station.
So the "right" solution is to find the issue and solve it, not to "recover" by simply continuing with increased BER and degraded performance.
If the system just magically recovers, I'm afraid people will put this into production operation without understanding the gravity of the problem, or that there is one at all.
Are you running the osmo-trx process with real-time priority (SCHED_RR)?

I tried that, with no obvious effect...
I think ftrace with irqsoff, preemptoff, wakeup_rt tracers could be one option to debug this further. If there's a correlation between time with irqs/preemption disabled around the time of your "high latency bursts", that would be a very clear message.
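A rough sketch of how the irqsoff tracer could be driven from a small helper, assuming a kernel built with the relevant tracers, root privileges, and tracefs reachable under /sys/kernel/debug/tracing (the path and the 10-minute window are assumptions; the same can of course be done from a shell):

#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <chrono>

static const std::string T = "/sys/kernel/debug/tracing/";

static void write_file(const std::string &path, const std::string &val)
{
        std::ofstream f(path);
        f << val;
}

int main()
{
        write_file(T + "current_tracer", "irqsoff");  /* or preemptoff / wakeup_rt */
        write_file(T + "tracing_max_latency", "0");   /* reset the high-water mark */
        write_file(T + "tracing_on", "1");

        /* let osmo-trx run until a latency burst has been observed */
        std::this_thread::sleep_for(std::chrono::minutes(10));

        std::ifstream max(T + "tracing_max_latency");
        std::string us;
        max >> us;
        std::cout << "worst irqs-off period: " << us << " us (details in "
                  << T << "trace)" << std::endl;
        return 0;
}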
- maybe the polling interval (bInterval) in the endpoint descriptors is set too low?
Hmm, my crude measurements indicate that the trx's retrieval is the cause, not a lack of data.
I'm not sure I understand yet how you reach that conclusion? It would be interesting to get some kind of watermarks of the amount of "used" libusb USB transfers inside LimeSuite. Maybe it's also worth increasing them or their size?
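As a side note, without patching LimeSuite one can already get a coarse watermark one level up, via the stream status exposed by the LMS API; if I read the header right, lms_stream_status_t also carries fifoFilledCount / fifoSize and linkRate. This is the host-side sample FIFO rather than the libusb transfer pool itself, so it is only an approximation (rx_stream is assumed to be the already-opened lms_stream_t):

#include <lime/LimeSuite.h>
#include <cstdio>

/* rough sketch: periodically print how full the host-side RX FIFO is */
static void print_fifo_watermark(lms_stream_t *rx_stream)
{
        lms_stream_status_t st;

        if (LMS_GetStreamStatus(rx_stream, &st) != 0)
                return;

        printf("fifo %u/%u, overruns %u, dropped by HW %u, link %.2f MB/s\n",
               (unsigned)st.fifoFilledCount, (unsigned)st.fifoSize,
               (unsigned)st.overrun, (unsigned)st.droppedPackets,
               st.linkRate / 1e6);
}

A FIFO that sits near full just before the drops would point at the reader side; one that stays near empty would point back at the USB/driver side.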
- maybe there's some other process using too much cpu / pre-empting osmo-trx?
Yes it looks like that
What about modifying osmo-trx to simply read and discard the samples, rather than processing them. Do you still get the overruns then?
Your test seems to be looking at the second part. You can use a CLOCK_MONOTONIC time source to take timestamps, as you indicated.
I used
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time);

Maybe I should refine my test....
This tells you how much CPU time a given process has consumed. It is not an absolute/reference clock. At least my understanding was that you wanted to take "absolute" timestamps. CLOCK_MONOTONIC_RAW is probably the best candidate for that.
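To illustrate the difference: CLOCK_PROCESS_CPUTIME_ID stops counting while the process is blocked or preempted, which is exactly the time of interest here, whereas CLOCK_MONOTONIC_RAW keeps running. A minimal sketch:

#include <time.h>
#include <stdio.h>
#include <unistd.h>

static double ts_diff_ms(const struct timespec *a, const struct timespec *b)
{
        return (b->tv_sec - a->tv_sec) * 1e3 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

int main()
{
        struct timespec cpu0, cpu1, mono0, mono1;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);
        clock_gettime(CLOCK_MONOTONIC_RAW, &mono0);

        usleep(10000);  /* stands in for "preempted for 10 ms" */

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);
        clock_gettime(CLOCK_MONOTONIC_RAW, &mono1);

        /* CPU time barely advances, wall time shows the full 10 ms gap */
        printf("cpu: %.3f ms, wall: %.3f ms\n",
               ts_diff_ms(&cpu0, &cpu1), ts_diff_ms(&mono0, &mono1));
        return 0;
}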
Regards, Harald
Right now I am wondering about fault recovery, i.e. what should the trx do once it has detected missing data? Whatever it does has a low chance of fixing the situation; once triggered, the condition persists. This is also indicated by the fact that the logged "diff" value is the *same* value in subsequent loggings, i.e. the trx does not recover / rewind / adjust timing to get back to normal.
This is a very "dangerous" area. In a system like GSM, where there are performance figures specified as part of the spec conformance, we should be very careful about plastering over bugs like this.
Any system (hardware + software) must be able to handle processing of all samples at any given point in time. If it can't handle this, it introduces bit errors which, if they happen frequently/reproducibly, will for sure degrade performance of the base station.
So the "right" solution is to find the issue and solve it, not to "recover" by simply continuing with increased BER and degraded performance.
If the system just magically recovers, I'm afraid people will put this into production operation without understanding the gravity of the problem, or that there is one at all.
I am in violent agreement, but the process did NOT exit, and sometimes it DID recover, and except for the log messages I would not have seen the problem but for sporadic outages and other problems. I was just wondering what the thinking had been on handling this particular condition, apart from logging a message...
i.e. what is the best thing to do, when the error is detected?
Mind you, I am just starting to learn how the trx does its job, and thinking of what should happen next when this condition has occurred.
Is this something that *could* happen (without broken hw), and is it meaningful to continue to repeat the error?
Perhaps the "jump" in timestamp has that effect on the "rest" of the trx. What if the timestamp is screwed up on its way from the Lime to the trx?
I think ftrace with irqsoff, preemptoff, wakeup_rt tracers could be one option to debug this further. If there's a correlation between time with irqs/preemption disabled around the time of your "high latency bursts", that would be a very clear message.
Debugging, and my confusion will rise to a higher level...
- maybe the polling interval (bInterval) in the endpoint descriptors is set too low?
Hmm, my crude measurements indicate that the trx's retrieval is the cause, not a lack of data.
I'm not sure I understand yet how you reach that conclusion? It would be interesting to get some kind of watermarks of the amount of "used" libusb USB transfers inside LimeSuite. Maybe it's also worth increasing them or their size?
Well, possible explanations..
1. The LimeSDR sometimes fails to deliver a significant number of packets, since the time "jumps" by a large amount.
2. The trx / hw / Linux fails to read packets, causing the Lime to be unable to deliver the data, until trx / hw / Linux becomes responsive again.
3. ???
It looks like my crude tests show that the trx can loop and get data at 170 us, but sometimes does not come back within 100 times that. Why? To me 2. seems most probable... but I will see if I can check in LimeSuite.
Also, tests with the LimeSDR and *other* applications can give clues...
- maybe there's some other process using too much cpu / pre-empting osmo-trx?
Yes it looks like that
What about modifying osmo-trx to simply read and discard the samples, rather than processing them. Do you still get the overruns then?
I'll check....
Your test seems to be looking at the second part. You can use a CLOCK_MONOTONIC time source to take timestamps, as you indicated.
I modified for MONOTONIC, no obvious change....
This tells you how much CPU time a given process has consumed. It is not an absolute/reference clock. At least my understanding was that you wanted to take "absolute" timestamps. CLOCK_MONOTONIC_RAW is probably the best candidate for that.
The fight goes on....
Regards, Harald
Regards, Gullik