Request for Comment: Creating a binary format output for rtl_power


Hi,

I missed this discussion on a topic I value.

I would propose moving towards audio/video containers like MKV, so that
norms for things like tags and channels can develop, and algorithms for
stream compression can be used when relevant.  Standardization is always a work
in progress, always helpful to everyone, and gets more people working on
each other's problems.

rtl_power_fftw also has a binary output format; see
https://github.com/AD-Vega/rtl-power-fftw/pull/11

soapy_power also has a binary output format, along with a lot of proposed
contributions that I am unfortunately involved with but didn't have the
capacity to take over when xmikos tired out:
https://github.com/xmikos/soapy_power/pulls

One thing I tend to desire of spectrum storage formats is the ability to
store raw I/Q logs for parts of the recording alongside them, or, with a
glitchy device like rtlsdr, the serial number and USB packet log (see
https://github.com/keenerd/rtl-sdr/compare/master...xloem:logfile-official
), with a way to mark synchronization of events.

On Fri, Jan 3, 2020, 2:51 PM Bill Gaylord <chibill110 at gmail.com> wrote:

> I will be starting by defining my format, then writing a converter to
> convert between text CSV and the binary format. I will also try to write a
> stream-based converter that you can pipe the output of rtl_power into to
> write the binary format directly.
> I figure this is more universal than trying to make a binary format in the
> application itself.
>
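For what it's worth, a converter like the one described above could start
out as small as the Python sketch below. The record layout is purely
hypothetical (nothing in this thread has agreed on one), and the sketch
assumes the usual rtl_power CSV columns: date, time, Hz low, Hz high,
Hz step, samples, then one dB value per bin. The CSV carries no time zone,
so the sketch simply treats the timestamp as UTC.

```python
import struct
from datetime import datetime, timezone

# Hypothetical little-endian record layout (an assumption, not an agreed format):
#   float64 unix time, float64 hz_low, float64 hz_high, float64 hz_step,
#   uint32 samples, uint32 bin count, then <bin count> float32 dB values
RECORD_HEAD = struct.Struct("<ddddII")


def csv_line_to_record(line: str) -> bytes:
    """Convert one rtl_power CSV line into one binary record."""
    fields = [f.strip() for f in line.strip().split(",")]
    # Assumption: the timestamp is interpreted as UTC.
    stamp = datetime.strptime(f"{fields[0]} {fields[1]}", "%Y-%m-%d %H:%M:%S")
    unix_time = stamp.replace(tzinfo=timezone.utc).timestamp()
    hz_low, hz_high, hz_step = (float(f) for f in fields[2:5])
    samples = int(fields[5])
    powers = [float(f) for f in fields[6:]]
    head = RECORD_HEAD.pack(unix_time, hz_low, hz_high, hz_step,
                            samples, len(powers))
    return head + struct.pack(f"<{len(powers)}f", *powers)


def convert_stream(text_in, binary_out) -> int:
    """Read CSV lines from text_in, write records to binary_out, return the count."""
    count = 0
    for line in text_in:
        if line.strip():  # skip blank lines
            binary_out.write(csv_line_to_record(line))
            count += 1
    return count
```

Passing sys.stdin and sys.stdout.buffer to convert_stream would let it sit
at the end of a pipe from rtl_power.
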
> On Thu, Jan 2, 2020 at 7:30 PM Hayati Ayguen <h_ayguen at web.de> wrote:
>
>>
>> Hi,
>>
>> I'd agree that having text encoding + compression is far from ideal.
>>
>> However, another aspect/goal might be the following:
>> have the main data readable as binary from gnuplot.
>>
>> see
>> http://gnuplot.sourceforge.net/docs_4.2/node103.html
>>
>>
>> kind regards,
>> Hayati
>>
>>
>> On 02.01.2020 at 14:49, Müller, Marcus (CEL) wrote:
>> > Hi Abhishek,
>> >
>> > On Fri, 2019-12-27 at 20:11 +0530, Abhishek Goyal wrote:
>> >> In practice you will find that [text format+ compression] will be
>> >> fairly close to [binary format + compression] in final size.
>> >
>> > Is that so? Color me surprised! While certainly any dictionary-based
>> > compressor could find the bytes that make up the individual digits and
>> > compress them to an average of a little less than 4 bits each, that'd still
>> > be worse than the 8 bits you need to represent any number 0-255, for example.
>> > And if your dictionary allows for variable-length words, like an LZ(W)
>> > kind of algorithm, the compression ratio should saturate pretty early.
>> >
>> > Now, I haven't worked with the specific text data coming out of
>> > rtl_power, so I'd be very interested in the results!
>> > Bill, could you compress a few of your textual rtl_power output files
>> > (using gzip --best, and maybe xz) for us and tell us how many
>> > numbers were in the original files and how many bytes long the
>> > resulting files are?
>> >
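If anyone wants to run that measurement without shell tools, a rough Python
sketch follows. It assumes the rtl_power CSV layout with dB values starting
at the seventh column, and the "binary" side packs only those dB values as
float32 (a hypothetical choice; per-row metadata is left out), so the
numbers it reports are only a first approximation.

```python
import gzip
import lzma
import struct


def size_report(csv_text: str) -> dict:
    """Count the dB values in rtl_power CSV text and report encoded sizes in bytes."""
    powers = []
    for line in csv_text.splitlines():
        fields = line.split(",")
        # Assumption: columns 0-5 are metadata, the rest are dB values.
        powers.extend(float(f) for f in fields[6:])

    text = csv_text.encode("ascii")
    # Hypothetical binary encoding: dB values only, little-endian float32.
    binary = struct.pack(f"<{len(powers)}f", *powers)

    return {
        "numbers": len(powers),
        "text": len(text),
        "text_gzip": len(gzip.compress(text, compresslevel=9)),
        "text_xz": len(lzma.compress(text)),
        "binary": len(binary),
        "binary_gzip": len(gzip.compress(binary, compresslevel=9)),
        "binary_xz": len(lzma.compress(binary)),
    }
```

Calling size_report(open("scan.csv").read()) on a real capture (the file
name is just a placeholder) would give the counts asked for above.
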
>> > (zstd: would be very interesting to have a detached dictionary, because
>> > I presume the dictionary overhead to be non-negligible with large
>> > numbers of smaller observation files)
>> > (BTW, tar is the worst format under the sun to compress many small
>> > files; it pads every file to 512B; of course, zeros compress nicely,
>> > but suddenly your shortest codeword is a useless padding symbol and
>> > that has a measurable compressed file size effect)
>> >
>> >> Compression obviously will reduce random access to data, so if your
>> >> intended use involves seeking randomly around in the data, things get
>> >> tricky.
>> >
>> > Indeed, that's what I'd have to bring forward: A "compressor" based on
>> > simply converting the tabular text data to binary format would be not
>> > so far away, worst case, from an actual entropy encoder, but allow for
>> > random seeks, AND be faster. I honestly don't see the downsides of
>> > that!
>> >
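To illustrate the random-seek point: with fixed-length rows, the byte offset
of row n is just n times the row size. The sketch below assumes a
hypothetical row of one float64 timestamp followed by a fixed number of
float32 bins; both the layout and BINS_PER_ROW are made up for the example.

```python
import struct

BINS_PER_ROW = 4  # hypothetical number of dB values per row
# Hypothetical row layout: float64 timestamp + BINS_PER_ROW float32 values
ROW = struct.Struct(f"<d{BINS_PER_ROW}f")


def read_row(f, index: int):
    """Return (timestamp, bins) for row `index` without reading earlier rows."""
    f.seek(index * ROW.size)  # fixed-length rows make the offset trivial
    data = f.read(ROW.size)
    if len(data) != ROW.size:
        raise IndexError(f"row {index} is past the end of the file")
    timestamp, *bins = ROW.unpack(data)
    return timestamp, bins
```

Any seekable binary file object works here, for example one opened with
open(path, "rb").
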
>> >> If the format is intended to be shared with other people, and/or
>> >> manage large collections of such data, either use hdf5[1] or look
>> >> into it for inspiration.
>> >
>> > Yep; or other formats. GNU Radio, for example, simply uses raw binary
>> > numbers packed end-to-end; there's the SigMF project which strives to
>> > provide metadata (sample format, acquisition time, and other
>> > parameters) in a separate file. It's JSON lying next to your data file.
>> > Whether or not that's useful to you...
>> >
>> >> If that sounds too complicated, then protobufs[2] might be another
>> >> option. Both hdf5 and protobufs benefit from having readily available
>> >> code for access to the data for later analysis, and from having
>> >> hundreds of man-years of bugfixes and portability fixes behind them.
>> >
>> > Yeah, but a protobuf that's mostly a buffer of ints really is only
>> > binary numbers right after each other, plus a header that you define
>> > yourself. It's a good idea to let some library like protobuf handle
>> > that, I agree!
>> >
>> >> Again, depending on the use case, another option might be to store
>> >> the data in a sqlite[3] database file; it's an underrated option for
>> >> large amounts of data: here the binary conversion to and fro can be
>> >> handled by the sqlite tools themselves, and you have access to the
>> >> data in a fully random-access fashion. There are ways for sqlite to
>> >> do online compression of data as well[4], in case you find the
>> >> standard size reduction from going to binary isn't enough.
>> >
>> > Not quite sure how well sqlite handles compression of BLOBs, or are you
>> > suggesting you insert samples as values individually?
>> >
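One possible reading of the sqlite suggestion is one table row per sweep,
with the bins packed into a single BLOB rather than one value per row. A
minimal sketch using Python's built-in sqlite3 module; the table name,
column names, and float32 packing are all hypothetical.

```python
import sqlite3
import struct


def open_db(path=":memory:"):
    """Open (or create) a database with a hypothetical one-row-per-sweep table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sweeps "
        "(unix_time REAL, hz_low REAL, hz_step REAL, powers BLOB)"
    )
    return db


def store_sweep(db, unix_time, hz_low, hz_step, powers):
    """Pack the dB values as little-endian float32 and store them as one BLOB."""
    blob = struct.pack(f"<{len(powers)}f", *powers)
    db.execute("INSERT INTO sweeps VALUES (?, ?, ?, ?)",
               (unix_time, hz_low, hz_step, blob))
    db.commit()


def load_sweeps(db, hz_low):
    """Return [(unix_time, [dB, ...]), ...] for one start frequency, oldest first."""
    rows = db.execute(
        "SELECT unix_time, powers FROM sweeps WHERE hz_low = ? ORDER BY unix_time",
        (hz_low,),
    )
    return [(t, list(struct.unpack(f"<{len(b) // 4}f", b))) for t, b in rows]
```

This keeps lookups by time and frequency in SQL while the samples themselves
stay in packed binary form.
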
>> > By the way, I think "so much data it becomes a burden to my server" is
>> > actually not covered by what sqlite is designed to do. There might be
>> > more optimized databases for that.
>> >
>> >> Greg brought up endianness, then there's framing (indicating the start
>> >> of each "row" of data, in case a few bytes got corrupted on disk, thus
>> >> allowing the rest to still be recovered)
>> >
>> > Which should be no problem, seeing that rows are fixed-length,
>> >
>> >> versioning (if you change the data format, how do you have the
>> >> reading code still remain able to read the older format)
>> >
>> > Important point, imho, but this feels like a one-off format, so really,
>> > might be a bit of overengineering. Anyways, never hurts to simply have
>> > a header field that says "version". Do that!
>> >
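A version field costs very little. The sketch below shows one hypothetical
header: a 4-byte signature, a format version, and the number of bins per
row, all little-endian so that the endianness question is settled by the
format itself. The signature "RTLP" and the field choices are assumptions
made up for the example.

```python
import struct

MAGIC = b"RTLP"                  # hypothetical 4-byte file signature
HEADER = struct.Struct("<4sHH")  # signature, format version, bins per row
SUPPORTED_VERSIONS = {1}


def write_header(f, bins_per_row: int, version: int = 1) -> None:
    """Write the fixed-size header at the current position of f."""
    f.write(HEADER.pack(MAGIC, version, bins_per_row))


def read_header(f):
    """Read and validate the header, returning (version, bins_per_row)."""
    data = f.read(HEADER.size)
    if len(data) != HEADER.size:
        raise ValueError("truncated header")
    magic, version, bins_per_row = HEADER.unpack(data)
    if magic != MAGIC:
        raise ValueError("not a recognised file")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported format version {version}")
    return version, bins_per_row
```

A reader that later learns a second layout only has to add its number to
SUPPORTED_VERSIONS and branch on the returned version.
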
>> >> debugging (it's very hard to be 100% sure that the binary data is what you
>> >> intended to write; typically there will be no error messages from the
>> >> code and no crashes, just wrong and misleading results due to
>> >> miswritten/misread data), etc.
>> >
>> > In my experience, writing textual data is way harder to keep
>> > consistent, due to the non-fixed amount of bytes required per word.
>> >
>> > Best regards,
>> > Marcus
>> >
>>
>