Request for Comment: Creating a binary format output for rtl_power


I will start by defining the format, then write a converter to
convert between text CSV and the binary format. I will also try to write a
stream-based converter that you can pipe the output of rtl_power into to
write the binary format directly.
I figure this is more universal than trying to produce a binary format in the
application itself.
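
To make the converter idea concrete, here is a rough Python sketch, not the
actual format (which is still to be defined). It assumes the usual rtl_power
CSV columns (date, time, Hz low, Hz high, Hz step, samples, then one dB value
per bin). The magic bytes, the version number, the record layout and the
function names are all hypothetical placeholders, and the timestamp is simply
treated as UTC.

```python
import struct
from datetime import datetime, timezone

MAGIC = b"RPWR"   # hypothetical file signature, not an agreed value
VERSION = 1       # hypothetical format version number

# Fixed part of each record, little-endian, no padding (hypothetical layout):
# int64 timestamp, uint64 hz_low, uint64 hz_high, float64 hz_step,
# uint32 samples, uint32 count of float32 power values that follow.
ROW_HEAD = struct.Struct("<qQQdII")


def encode_row(line):
    """Encode one rtl_power CSV line as one binary record."""
    fields = [f.strip() for f in line.split(",")]
    stamp = datetime.strptime(fields[0] + " " + fields[1], "%Y-%m-%d %H:%M:%S")
    # Assumption: the timestamp is interpreted as UTC.
    ts = int(stamp.replace(tzinfo=timezone.utc).timestamp())
    hz_low, hz_high = int(float(fields[2])), int(float(fields[3]))
    hz_step = float(fields[4])
    samples = int(fields[5])
    powers = [float(v) for v in fields[6:]]
    head = ROW_HEAD.pack(ts, hz_low, hz_high, hz_step, samples, len(powers))
    return head + struct.pack(f"<{len(powers)}f", *powers)


def convert(lines, out):
    """Write a small file header, then one record per non-empty CSV line."""
    out.write(struct.pack("<4sH", MAGIC, VERSION))
    for line in lines:
        if line.strip():
            out.write(encode_row(line))
```

For the streaming case, something like convert(sys.stdin, sys.stdout.buffer)
at the end of the script would let you pipe rtl_power straight into it. The
byte order is fixed to little-endian so that files are portable between
machines.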

On Thu, Jan 2, 2020 at 7:30 PM Hayati Ayguen <h_ayguen at web.de> wrote:

>
> Hi,
>
> i'd agree that having text encoding + compression is far from ideal.
>
> However, another aspect/goal might be the following:
> have the main data readable as binary by gnuplot.
>
> see
> http://gnuplot.sourceforge.net/docs_4.2/node103.html
>
>
> kind regards,
> Hayati
>
>
> On 02.01.2020 at 14:49, Müller, Marcus (CEL) wrote:
> > Hi Abhishek,
> >
> > On Fri, 2019-12-27 at 20:11 +0530, Abhishek Goyal wrote:
> >> In practice you will find that [text format + compression] will be
> >> fairly close to [binary format + compression] in final size.
> >
> > Is that so? Color me surprised! While certainly any dictionary-based
> > compressor could find the bytes that make up the individual digits and
> > compress them to an average of a little less than 4 bits each, that'd
> > still be worse than the 8 bits you need to represent any number 0-255
> > (up to three digits plus a separator in text), for example.
> > And if your dictionary allows for variable-length words, like an LZ(W)
> > kind of algorithm, the compression ratio should saturate pretty early.
> >
> > Now, I haven't worked with the specific text data coming out of
> > rtl_power, so I'd be very interested in the results!
> > Bill, could you compress a few of your textual rtl_power output files
> > (using gzip --best, and maybe xz) for us and tell us how many
> > numbers were in the original files and how many bytes long the
> > resulting files are?
> >
> > (zstd: would be very interesting to have a detached dictionary, because
> > I presume the dictionary overhead to be non-negligible with large
> > numbers of smaller observation files)
> > (BTW, tar is the worst format under the sun to compress many small
> > files; it pads every file to 512B; of course, zeros compress nicely,
> > but suddenly your shortest codeword is a useless padding symbol and
> > that has a measurable compressed file size effect)
> >
> >> Compression obviously will reduce random access to data, so if your
> >> intended use involves seeking randomly around in the data, things get
> >> tricky.
> >
> > Indeed, that's what I'd have to bring forward: A "compressor" based on
> > simply converting the tabular text data to binary format would be not
> > so far away, worst case, from an actual entropy encoder, but allow for
> > random seeks, AND be faster. I honestly don't see the downsides of
> > that!
> >
> >> If the format is intended to be shared with other people, and/or
> >> manage large collections of such data, either use hdf5[1] or look
> >> into it for inspiration.
> >
> > Yep; or other formats. GNU Radio, for example, simply uses raw binary
> > numbers packed end-to-end; there's the SigMF project which strives to
> > provide metadata (sample format, acquisition time, and other
> > parameters) in a separate file. It's JSON lying next to your data file.
> > Whether or not that's useful to you...
> >
> >> If that sounds too complicated, then protobufs[2] might be another
> >> option. Both hdf5 and protobufs benefit from having readily available
> >> code for access to the data for later analysis, and from having
> >> hundreds of man-years of bugfixes and portability fixes behind them.
> >
> > Yeah, but a protobuf that's mostly a buffer of ints really is only
> > binary numbers right after each other, plus a header that you define
> > yourself. It's a good idea to let some library like protobuf handle
> > that, I agree!
> >
> >> Again, depending on the use case, another option might be to store
> >> the data in a sqlite[3] database file; it's an underrated option for
> >> large amounts of data: here the binary conversion to and fro can be
> >> handled by the sqlite tools themselves, and you have access to the
> >> data in a fully random-access fashion. There are ways for sqlite to
> >> do online compression of data as well[4], in case you find the
> >> standard size reduction from going to binary isn't enough.
> >
> > Not quite sure how well sqlite handles compression of BLOBs, or are you
> > suggesting you insert samples as values individually?
> >
> > By the way, I think "so much data it becomes a burden to my server" is
> > actually not covered by what sqlite is designed to do. There might be
> > more optimized databases for that.
> >
> >> Greg brought up endianness, then there's framing (indicating the start
> >> of each "row" of data, in case a few bytes got corrupted on disk, thus
> >> allowing the rest to still be recovered)
> >
> > Which should be no problem, seeing that rows are fixed-length,
> >
> >> versioning (if you change the data format, how do you have the
> >> reading code still remain able to read the older format)
> >
> > Important point, imho, but this feels like a one-off format, so really,
> > might be a bit of overengineering. Anyways, never hurts to simply have
> > a header field that says "version". Do that!
> >
> >> debugging (it's very hard to be 100% sure that the binary data you
> >> wrote is what you intended to write; typically there will be no error
> >> messages from the code and no crashes - just wrong and misleading
> >> results due to miswritten/misread data), etc.
> >
> > In my experience, writing textual data is way harder to keep
> > consistent, due to the non-fixed amount of bytes required per word.
> >
> > Best regards,
> > Marcus
> >
>
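
On Marcus's point that fixed-length rows allow random seeks: a minimal
sketch of what that could look like, assuming a hypothetical layout in which
a short header carries a version field and the number of bins per row, and
every row is an int64 timestamp followed by that many float32 values. None of
these names or field choices are settled; they are only for illustration.

```python
import struct

# Hypothetical header: 4-byte magic, uint16 version, uint32 bins per row.
HEADER = struct.Struct("<4sHI")


def write_file(f, rows, nbins):
    """Write the header, then fixed-length rows of (timestamp, powers)."""
    f.write(HEADER.pack(b"RPWR", 1, nbins))
    for ts, powers in rows:
        f.write(struct.pack(f"<q{nbins}f", ts, *powers))


def read_row(f, index):
    """Jump straight to row `index` without reading the rows before it."""
    f.seek(0)
    magic, version, nbins = HEADER.unpack(f.read(HEADER.size))
    if version != 1:
        raise ValueError("unsupported format version")
    row_size = 8 + 4 * nbins              # int64 + nbins * float32
    f.seek(HEADER.size + index * row_size)
    data = f.read(row_size)
    if len(data) != row_size:
        raise IndexError("row index out of range")
    ts = struct.unpack_from("<q", data)[0]
    powers = struct.unpack_from(f"<{nbins}f", data, 8)
    return ts, list(powers)
```

Because every row has the same size, the offset of row i is just the header
size plus i times the row size, which is what makes seeking cheap. The
version check is the "header field that says version" suggested above.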