On Sat, Dec 22, 2018 at 10:00:50AM +0100, Gullik Webjorn wrote:
At 3 am the trx stopped again. This time it exited (itself) after logging large amounts of packet loss,
The interesting question is: was there a cron job or some other activity running at 3 am on that system that could have caused a system load high enough to interrupt the data flow between the B100, the kernel USB stack, libusb, UHD and osmo-trx-uhd?
Something like this is likely the root cause of the problem.
Sure, osmo-trx could "plaster around" it with a more elegant recovery mechanism, but failing fast by exiting and letting osmo-trx-uhd respawn (it is normally executed via systemd) isn't actually all too bad.
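To illustrate the fail-fast plus respawn pattern, here is a minimal Python sketch. It is not how systemd or osmo-trx implement it; supervise() and its parameters are hypothetical names, and the child command is only a stand-in for osmo-trx-uhd. The behaviour is roughly what Restart=on-failure with RestartSec= gives you in a systemd unit.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay_s=0.1):
    """Run cmd, restarting it whenever it exits with an error.

    Gives up after max_restarts respawns. Returns the list of exit
    codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        # Block until the child exits; a fail-fast child ends up here on error.
        result = subprocess.run(cmd)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # clean exit: do not respawn
        if attempt < max_restarts:
            time.sleep(delay_s)  # short back-off before the next respawn
    return exit_codes


if __name__ == "__main__":
    # Hypothetical stand-in for osmo-trx-uhd: a child that exits with status 1.
    failing_child = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(failing_child, max_restarts=2))
```

Note that a supervisor like this only helps when the process actually exits; it cannot see a process that keeps running without doing useful work.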
What's definitely a real problem that needs immediate fixing is if we somehow get stuck with osmo-trx continuing to run but failing to transmit a valid signal, without exit + respawn.
Regards, Harald