Good day!
I was struggling with an occasional crash of the rtl-sdr tools on windows when closing the device after pressing CTRL-C. Both with the versions as provided with VCPKG and when directly compiling the latest version in MSVC (with latest of libusb). It is a bit difficult to spot a crash at close as usually it is silent but you can see it running it from MSVC or in a debugger. It does not happen always but frequent enough.
Also, consider this example for the binary versions from Jan 2 (https://ftp.osmocom.org/binaries/windows/rtl-sdr/):
C:>rtl_adsb.exe Found 1 device(s): 0: Realtek, RTL2838UHIDIR, SN: 00000001
Using device 0: Generic RTL2832U OEM Found Rafael Micro R820T tuner .... Signal caught, exiting!
User cancel, exiting...
C:>echo %ERRORLEVEL% -1073741819
After investigation of the libusb logs and experimentation I think this crash can be avoided by 1) defining zerotv in rtlsdr_read_async as 1 ms instead of zero (see libusb/libusb#1043) or 2) include a short call to Sleep to allow async operations to finish. Please find below a patch for the latter fix for your consideration.
More information, here: https://github.com/osmocom/rtl-sdr/pull/18 and here: https://github.com/libusb/libusb/issues/1043 Thanks for your great work on the rtlsdr library and kind regards.
Author: jvde.github jvde.github@gmail.com Date: Sat Jan 8 13:30:34 2022 +0100
force wait state after cancel of usb transfer and before handling usb events
diff --git a/src/librtlsdr.c b/src/librtlsdr.c index 0146298..2682d77 100644 --- a/src/librtlsdr.c +++ b/src/librtlsdr.c @@ -1930,6 +1930,9 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx, /* handle events after canceling * to allow transfer status to * propagate */ +#ifdef _WIN32 + Sleep(1); +#endif
libusb_handle_events_timeout_completed(dev->ctx,
&zerotv, NULL); if (r < 0)
Hi Jasper,
thanks for your patch, I've just merged it.
Best Regards, Steve
Thank you Steve.
For the async to be really watertight we probably cannot assume that all the callbacks are executed directly after the cancels without delay. I have been playing with a variation below that waits for all callbacks to be finished based on counting the number of transfers cancelled. This requires a bit more testing as has more code changes.
For now the short Sleep has solved my issues on Windows. with device close. Thank you for merging it.
Kind regards, Jasper
Author: jvde.github jvde.github@gmail.com Date: Mon Jan 10 21:40:48 2022 +0100
rework async_wait to ensure all events handled at close
diff --git a/src/librtlsdr.c b/src/librtlsdr.c index 2682d77..a4ae212 100644 --- a/src/librtlsdr.c +++ b/src/librtlsdr.c @@ -125,6 +125,7 @@ struct rtlsdr_dev { int dev_lost; int driver_active; unsigned int xfer_errors; + int xfer_cb_cancel_count; };
void rtlsdr_set_gpio_bit(rtlsdr_dev_t *dev, uint8_t gpio, int val); @@ -1706,8 +1707,11 @@ static void LIBUSB_CALL _libusb_callback(struct libusb_transfer *xfer) dev->cb(xfer->buffer, xfer->actual_length, dev->cb_ctx);
libusb_submit_transfer(xfer); /* resubmit transfer */ + dev->xfer_errors = 0; - } else if (LIBUSB_TRANSFER_CANCELLED != xfer->status) { + } else if (LIBUSB_TRANSFER_CANCELLED == xfer->status) { + dev->xfer_cb_cancel_count++; + } else { #ifndef _WIN32 if (LIBUSB_TRANSFER_ERROR == xfer->status) dev->xfer_errors++; @@ -1853,9 +1857,9 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx, { unsigned int i; int r = 0; + int ret = 0; + int cancel_count = 0; struct timeval tv = { 1, 0 }; - struct timeval zerotv = { 0, 0 }; - enum rtlsdr_async_status next_status = RTLSDR_INACTIVE;
if (!dev) return -1; @@ -1865,6 +1869,7 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx,
dev->async_status = RTLSDR_RUNNING; dev->async_cancel = 0; + dev->xfer_cb_cancel_count = 0; /* number of cancelled transfers */
dev->cb = cb; dev->cb_ctx = ctx; @@ -1881,6 +1886,7 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx,
_rtlsdr_alloc_async_buffers(dev);
+ /* creating initial submits */ for(i = 0; i < dev->xfer_buf_num; ++i) { libusb_fill_bulk_transfer(dev->xfer[i], dev->devh, @@ -1890,7 +1896,6 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx, _libusb_callback, (void *)dev, BULK_TIMEOUT); - r = libusb_submit_transfer(dev->xfer[i]); if (r < 0) { fprintf(stderr, "Failed to submit transfer %i\n" @@ -1900,62 +1905,44 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx, "echo 0 > /sys/module/usbcore" "/parameters/usbfs_memory_mb\n", i); dev->async_status = RTLSDR_CANCELING; + ret = r; break; } }
- while (RTLSDR_INACTIVE != dev->async_status) { - r = libusb_handle_events_timeout_completed(dev->ctx, &tv, - &dev->async_cancel); + /* handling events until device is cancelled */ + while (RTLSDR_CANCELING != dev->async_status) { + r = libusb_handle_events_timeout_completed(dev->ctx, &tv, &dev->async_cancel); + if (r < 0) { - /*fprintf(stderr, "handle_events returned: %d\n", r);*/ + if (r == LIBUSB_ERROR_INTERRUPTED) /* stray signal */ continue; + ret = r; break; } + }
- if (RTLSDR_CANCELING == dev->async_status) { - next_status = RTLSDR_INACTIVE; - - if (!dev->xfer) - break; - - for(i = 0; i < dev->xfer_buf_num; ++i) { - if (!dev->xfer[i]) - continue; - - if (LIBUSB_TRANSFER_CANCELLED != - dev->xfer[i]->status) { - r = libusb_cancel_transfer(dev->xfer[i]); - /* handle events after canceling - * to allow transfer status to - * propagate */ -#ifdef _WIN32 - Sleep(1); -#endif - libusb_handle_events_timeout_completed(dev->ctx, - &zerotv, NULL); - if (r < 0) - continue; + /* cancel and count all transfers */ + for(i = 0; i < dev->xfer_buf_num; ++i) {
- next_status = RTLSDR_CANCELING; - } - } + if(dev->xfer[i]) {
- if (dev->dev_lost || RTLSDR_INACTIVE == next_status) { - /* handle any events that still need to - * be handled before exiting after we - * just cancelled all transfers */ - libusb_handle_events_timeout_completed(dev->ctx, - &zerotv, NULL); - break; - } + r = libusb_cancel_transfer(dev->xfer[i]); + if (r == 0) cancel_count++; } }
+ /* handle events but expect at least "cancelled" CB calls with LIBUSB_TRANSFER_CANCELLED */ + do + { + libusb_handle_events_timeout(dev->ctx, &tv); + } + while (dev->xfer_cb_cancel_count != cancel_count); + _rtlsdr_free_async_buffers(dev);
- dev->async_status = next_status; + dev->async_status = RTLSDR_INACTIVE;
return r; }
Hi Jasper,
It was interesting to read your below, because I have just been tackling more or less exactly the same issue in libhackrf [0].
It looks like you're on the right track for fixing it.
In our case there were two orthogonal problems. One was that we weren't waiting for all the cancellations to complete, which you've addressed temporarily with the sleep, and are now tackling properly by counting out the cancelled transfers.
Our other problem was that libhackrf was running libusb_cancel_transfer from the API call that stops the capture - which gets called from a different thread, so there was a race in which the USB handling thread could resubmit a transfer that had just completed while the cancellations were going on.
I've just looked at librtlsdr and it seems it's immune to the latter problem, because rtlsdr_cancel_async() just sets a flag, and the actual cancellation is done from the same thread that handles libusb events after the handling has been interrupted.
That approach is a lot simpler, but does mean that you have a delay involved, because libusb_handle_events_timeout_completed() has to either see an event or time out, before it will notice the flag and break out of its loop.
That delay could be eliminated though by having rtlsdr_cancel_async() call libusb_interrupt_event_handler() after setting async_cancel = 1. That might be a change worth making, although I guess the delay is short if there is plenty of data flowing.
I'm going to try adopting librtlsdr's approach for libhackrf, as it seems a lot simpler.
Regards,
Martin
[0] https://github.com/greatscottgadgets/hackrf/pull/1029
On Mon, Jan 10, 2022 at 09:08:48PM -0000, Jasper wrote:
Thank you Steve.
For the async to be really watertight we probably cannot assume that all the callbacks are executed directly after the cancels without delay. I have been playing with a variation below that waits for all callbacks to be finished based on counting the number of transfers cancelled. This requires a bit more testing as has more code changes.
For now the short Sleep has solved my issues on Windows. with device close. Thank you for merging it.
Kind regards, Jasper
Author: jvde.github jvde.github@gmail.com Date: Mon Jan 10 21:40:48 2022 +0100
rework async_wait to ensure all events handled at close
diff --git a/src/librtlsdr.c b/src/librtlsdr.c index 2682d77..a4ae212 100644 --- a/src/librtlsdr.c +++ b/src/librtlsdr.c @@ -125,6 +125,7 @@ struct rtlsdr_dev { int dev_lost; int driver_active; unsigned int xfer_errors;
- int xfer_cb_cancel_count;
};
void rtlsdr_set_gpio_bit(rtlsdr_dev_t *dev, uint8_t gpio, int val); @@ -1706,8 +1707,11 @@ static void LIBUSB_CALL _libusb_callback(struct libusb_transfer *xfer) dev->cb(xfer->buffer, xfer->actual_length, dev->cb_ctx);
libusb_submit_transfer(xfer); /* resubmit transfer */
- dev->xfer_errors = 0;
- } else if (LIBUSB_TRANSFER_CANCELLED != xfer->status) {
- } else if (LIBUSB_TRANSFER_CANCELLED == xfer->status) {
dev->xfer_cb_cancel_count++;
- } else {
#ifndef _WIN32 if (LIBUSB_TRANSFER_ERROR == xfer->status) dev->xfer_errors++; @@ -1853,9 +1857,9 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx, { unsigned int i; int r = 0;
- int ret = 0;
- int cancel_count = 0; struct timeval tv = { 1, 0 };
struct timeval zerotv = { 0, 0 };
enum rtlsdr_async_status next_status = RTLSDR_INACTIVE;
if (!dev) return -1;
@@ -1865,6 +1869,7 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx,
dev->async_status = RTLSDR_RUNNING; dev->async_cancel = 0;
dev->xfer_cb_cancel_count = 0; /* number of cancelled transfers */
dev->cb = cb; dev->cb_ctx = ctx;
@@ -1881,6 +1886,7 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx,
_rtlsdr_alloc_async_buffers(dev);
- /* creating initial submits */ for(i = 0; i < dev->xfer_buf_num; ++i) { libusb_fill_bulk_transfer(dev->xfer[i], dev->devh,
@@ -1890,7 +1896,6 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx, _libusb_callback, (void *)dev, BULK_TIMEOUT);
- r = libusb_submit_transfer(dev->xfer[i]); if (r < 0) { fprintf(stderr, "Failed to submit transfer %i\n"
@@ -1900,62 +1905,44 @@ int rtlsdr_read_async(rtlsdr_dev_t *dev, rtlsdr_read_async_cb_t cb, void *ctx, "echo 0 > /sys/module/usbcore" "/parameters/usbfs_memory_mb\n", i); dev->async_status = RTLSDR_CANCELING;
} }ret = r; break;
- while (RTLSDR_INACTIVE != dev->async_status) {
r = libusb_handle_events_timeout_completed(dev->ctx, &tv,
&dev->async_cancel);
- /* handling events until device is cancelled */
- while (RTLSDR_CANCELING != dev->async_status) {
r = libusb_handle_events_timeout_completed(dev->ctx, &tv, &dev->async_cancel);
- if (r < 0) {
/*fprintf(stderr, "handle_events returned: %d\n", r);*/
if (r == LIBUSB_ERROR_INTERRUPTED) /* stray signal */ continue;
}ret = r; break;
- }
if (RTLSDR_CANCELING == dev->async_status) {
next_status = RTLSDR_INACTIVE;
if (!dev->xfer)
break;
for(i = 0; i < dev->xfer_buf_num; ++i) {
if (!dev->xfer[i])
continue;
if (LIBUSB_TRANSFER_CANCELLED !=
dev->xfer[i]->status) {
r = libusb_cancel_transfer(dev->xfer[i]);
/* handle events after canceling
* to allow transfer status to
* propagate */
-#ifdef _WIN32
Sleep(1);
-#endif
libusb_handle_events_timeout_completed(dev->ctx,
&zerotv, NULL);
if (r < 0)
continue;
- /* cancel and count all transfers */
- for(i = 0; i < dev->xfer_buf_num; ++i) {
next_status = RTLSDR_CANCELING;
}
}
if(dev->xfer[i]) {
if (dev->dev_lost || RTLSDR_INACTIVE == next_status) {
/* handle any events that still need to
* be handled before exiting after we
* just cancelled all transfers */
libusb_handle_events_timeout_completed(dev->ctx,
&zerotv, NULL);
break;
}
r = libusb_cancel_transfer(dev->xfer[i]);
if (r == 0) cancel_count++;
} }
/* handle events but expect at least "cancelled" CB calls with LIBUSB_TRANSFER_CANCELLED */
do
{
libusb_handle_events_timeout(dev->ctx, &tv);
}
while (dev->xfer_cb_cancel_count != cancel_count);
_rtlsdr_free_async_buffers(dev);
- dev->async_status = next_status;
dev->async_status = RTLSDR_INACTIVE;
return r;
}
Hi Martin,
Many thanks for sharing the insights.
For my understanding, librtlsdr uses libusb_handle_events_timeout_completed which should terminate early if the dev->async_cancel flag is set. Is there still a purpose to call libusb_interrupt_event_handler in this case or is it required to guarantee the handler inspects the flag immediately, before any timeout or a new event (i.e. like a notification)? That did not became clear to me from the libusb documentation.
The easiest way I can see now to assess we can safely free up the transfers is to count the live transfers (increase a counter by one if there was a successful resubmit) and reduce by one at the start of a libusb callback. It is then safe to terminate if the counter is zero. I have implemented that here: https://github.com/jvde-github/rtl-sdr/commit/325d47f1f3c1d9fbece98770348283...
Seems to work really well but given the various versions of libusb and supported platforms around, I am hesitant to submit as patch until it solves a reported problem.
Thanks, Jasper
On Sun, Jan 23, 2022 at 06:08:17PM -0000, Jasper wrote:
For my understanding, librtlsdr uses libusb_handle_events_timeout_completed which should terminate early if the dev->async_cancel flag is set. Is there still a purpose to call libusb_interrupt_event_handler in this case or is it required to guarantee the handler inspects the flag immediately, before any timeout or a new event (i.e. like a notification)? That did not became clear to me from the libusb documentation.
Just passing libusb_handle_events_completed() a pointer to the flag doesn't give it any way to be notified or interrupted when that flag is changed. It only gives it the ability to look at that flag when it chooses to - which will be when it's just handled some event and is deciding whether to wait for another.
Calling libusb_interrupt_event_handler() will interrupt whatever poll() or similar call libusb_handle_events_completed() is blocked in, causing it to inspect the flag and return if it's set.
Otherwise, the flag will only be inspected at the next iteration of the event loop (i.e. when some USB event occurs, or a timeout expires). In practice that may happen soon anyway, but I still think it makes sense to interrupt the handler to ensure there is no unnecessary delay.
The easiest way I can see now to assess we can safely free up the transfers is to count the live transfers (increase a counter by one if there was a successful resubmit) and reduce by one at the start of a libusb callback. It is then safe to terminate if the counter is zero.
The counting approach seems reasonable to me.
In my case I kept a boolean flag for each individual transfer, and signalled completion once all of the transfers were flagged as finished. This was partly because when debugging I wanted to be able to see the state of each transfer.
But in retrospect a count would suffice, and is much simpler, so I think I'll probably adopt that approach too.
I have implemented that here: https://github.com/jvde-github/rtl-sdr/commit/325d47f1f3c1d9fbece98770348283...
I've looked at this version and I see one potential failure mode.
It's the same as one of the problems I tackled on libhackrf - the one I previously said you were immune to, due to cancellation and event handling being in the same thread.
On closer inspection though, I think it can still go wrong as follows:
When libusb_cancel_transfer() is called, it's possible that one of the transfers has just completed at the kernel level - but not yet had its completion callback called, because the event handling was interrupted before that happened.
In this case libusb_cancel_transfer() returns LIBUSB_ERROR_NOT_FOUND, because that transfer is no longer available to be cancelled.
You don't check the result of libusb_cancel_transfer(). But actually even if you did, there isn't anything very helpful you can do with the result at that stage. The real problem occurs later - read on.
Next, when libusb_handle_events_timeout_completed() is called again at line 1930, to await the cancellation callbacks, the completion of that transfer is still pending and now gets processed - so _libusb_callback() gets called.
The status seen by _libusb_callback() will be LIBUSB_TRANSFER_COMPLETED, so the transfer gets resubmitted and the transfer count reincremented.
This then repeats indefinitely - there is no logic in the callback to prevent a completed transfer being resubmitted, so the same transfer will be constantly resubmitted and re-run, and live_transfers will never reach zero.
My suggested fix would be to check dev->async_cancel in the callback, and not resubmit a transfer if that flag is set.
Seems to work really well but given the various versions of libusb and supported platforms around, I am hesitant to submit as patch until it solves a reported problem.
I am very much of the opinion that once concurrency issues are identified, they want fixing proactively fixed even if the potential failures aren't being observed in practice.
When failures do occur they can be extremely hard to track down because of the timing dependency. It's taken multiple reports of intermittent, hard-to-reproduce problems, over a long period, for us to zero in on these issues in libhackrf.
We got to a point where only one diligent reporter could reproduce a failure, that seemed to only show up with their particular distro, hardware, software versions etc.
Eventually I managed to reproduce it myself, but still had to drill down through all the layers of gqrx, gr, osmosdr & libhackrf to figure out what was going wrong.
Having spotted these problems, I say let's fix them properly while we have the chance.
Regards,
Martin
Thanks Martin. This is helpful. I will do some more automatic testing (starting/stopping) on a few systems and then submit a patch for consideration.
Regards, Jasper