Request for Comment: Creating a binary format output for rtl_power

Hi,

I'd agree that having text encoding + compression is far from ideal.

However, another aspect/goal might be the following:
have the main data readable as binary by gnuplot.

See
http://gnuplot.sourceforge.net/docs_4.2/node103.html
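
To make the gnuplot idea concrete, here is a minimal sketch of a writer for
gnuplot's single-precision "binary matrix" layout: one float holding the
column count N, then N column coordinates, then for every row one row
coordinate followed by N values. The function name, the file name and the
idea of using frequency as the column coordinate and time as the row
coordinate are assumptions for illustration, not anything rtl_power does
today.

```python
import struct


def write_gnuplot_matrix(path, col_coords, row_coords, rows):
    """Write float32 'binary matrix' data in native byte order.

    Layout: N, then N column coordinates, then for each row
    one row coordinate followed by N values.
    """
    n = len(col_coords)
    with open(path, "wb") as f:
        # First record: column count followed by the column coordinates.
        f.write(struct.pack(f"={n + 1}f", float(n), *col_coords))
        for coord, values in zip(row_coords, rows):
            if len(values) != n:
                raise ValueError("every row must have one value per column")
            # One record per row: row coordinate, then the N values.
            f.write(struct.pack(f"={n + 1}f", float(coord), *values))
```

A file written this way should be readable with something like
`plot 'scan.bin' binary matrix with image`; the exact keywords depend on the
gnuplot version, so check them against the documentation linked above.
Because everything is float32, frequencies are better stored in MHz than in
Hz to avoid losing precision.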

kind regards,
Hayati

On 02.01.2020 at 14:49, Müller, Marcus (CEL) wrote:
> Hi Abhishek,
>
> On Fri, 2019-12-27 at 20:11 +0530, Abhishek Goyal wrote:
>> In practice you will find that [text format+ compression] will be
>> fairly close to [binary format + compression] in final size.
>
> Is that so? Color me surprised! While certainly any dictionary-based
> compressor could find the bytes that make up the individual digits and
> compress them to an average of a little less than 4b per digit, that'd
> still be worse than the 8b you need to represent any number 0-255, for
> example.
> And if your dictionary allows for variable-length words, like an LZ(W)
> kind of algorithm, the compression ratio should saturate pretty early.
>
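
The question above is easy to measure. An ideal code needs log2(10), about
3.32 bits, per decimal digit, which is where the "a little less than 4b"
comes from. The following is a minimal sketch that encodes the same values
once as text (one number per line) and once as one byte per value, then
gzips both. The synthetic uniform random input is an assumption and says
nothing about real rtl_power output, so treat the numbers it prints only as
a way to run the experiment, not as a result.

```python
import gzip
import math
import random


def compare_sizes(values):
    """Return byte counts for text and binary encodings, raw and gzipped."""
    text = "".join(f"{v}\n" for v in values).encode("ascii")
    binary = bytes(values)  # one byte per value, requires 0 <= v <= 255
    return {
        "text_raw": len(text),
        "text_gzip": len(gzip.compress(text, compresslevel=9)),
        "binary_raw": len(binary),
        "binary_gzip": len(gzip.compress(binary, compresslevel=9)),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    values = [rng.randrange(256) for _ in range(100_000)]
    print(f"ideal bits per decimal digit: {math.log2(10):.2f}")
    for name, size in compare_sizes(values).items():
        bits = 8 * size / len(values)
        print(f"{name:12s} {size:8d} bytes  ({bits:.2f} bits/value)")
```

Replacing the synthetic list with values parsed from an actual rtl_power
file would give the comparison asked for here.
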
> Now, I haven't worked with the specific text data coming out of
> rtl_power, so I'd be very interested in the results!
> Bill, could you compress a few of your textual rtl_power output files
> (using gzip --best, and maybe xz) for us and tell us how many numbers
> were in the original files and how many bytes long the resulting files
> are?
>
> (zstd: would be very interesting to have a detached dictionary, because
> I presume the dictionary overhead to be non-negligible with large
> numbers of smaller observation files)
> (BTW, tar is the worst format under the sun to compress many small
> files; it pads every file to 512B; of course, zeros compress nicely,
> but suddenly your shortest codeword is a useless padding symbol and
> that has a measurable compressed file size effect)
>
>> Compression obviously will reduce random access to data, so if your
>> intended use involves seeking randomly around in the data, things get
>> tricky.
>
> Indeed, that's what I'd have to bring forward: A "compressor" based on
> simply converting the tabular text data to binary format would, worst
> case, not be far off an actual entropy encoder, but allow for
> random seeks, AND be faster. I honestly don't see the downsides of
> that!
>
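
The random-seek argument can be sketched in a few lines: with fixed-length
binary rows, row i starts at byte i * row_size, so no scanning or parsing of
earlier rows is needed. The row layout below (a float64 timestamp plus 64
float32 power bins, little-endian, no file header) is purely hypothetical
and only serves the illustration.

```python
import struct

# Hypothetical row: float64 timestamp + 64 float32 power bins, little-endian.
ROW = struct.Struct("<d64f")


def read_row(f, index):
    """Seek straight to row `index` in a file of back-to-back fixed rows."""
    f.seek(index * ROW.size)
    raw = f.read(ROW.size)
    if len(raw) != ROW.size:
        raise IndexError(f"row {index} is past the end of the file")
    timestamp, *bins = ROW.unpack(raw)
    return timestamp, bins
```

With a text file, finding row i means reading every line before it, or
keeping a separate index of line offsets.
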
>> If the format is intended to be shared with other people, and/or
>> manage large collections of such data, either use hdf5[1] or look
>> into it for inspiration.
>
> Yep; or other formats. GNU Radio, for example, simply uses raw binary
> numbers packed end-to-end; there's the SigMF project which strives to
> provide metadata (sample format, acquisition time, and other
> parameters) in a separate file. It's JSON lying next to your data file.
> Whether or not that's useful to you...
>
>> If that sounds too complicated, then protobufs[2] might be another
>> option. Both hdf5 and protobufs benefit from having readily available
>> code for access to the data for later analysis, and from having
>> hundreds of man-years of bugfixes and portability fixes behind them.
>
> Yeah, but a protobuf that's mostly a buffer of ints really is only
> binary numbers right after each other, plus a header that you define
> yourself. It's a good idea to let some library like protobuf handle
> that, I agree!
>
>> Again, depending on the use case, another option might be to store
>> the data in a sqlite[3] database file, it's an underrated option for
>> large amounts of data: here the binary conversion to and fro can be
>> handled by the sqlite tools themselves, and you have access to the
>> data in a fully random-access fashion. There are ways for sqlite to
>> do online compression of data as well[4], in case you find the
>> standard size reduction from going to binary isn't enough.
>
> Not quite sure how well sqlite handles compression of BLOBs, or are you
> suggesting you insert samples as values individually?
>
> By the way, I think "so much data it becomes a burden to my server" is
> actually not covered by what sqlite is designed to do. There might be
> more optimized databases for that.
>
>> Greg brought up endianness, then there's framing (indicating the start
>> of each "row" of data, in case a few bytes got corrupted on disk, thus
>> allowing the rest to still be recovered)
>
> Which should be no problem, seeing that rows are fixed-length,
>
>> versioning (if you change the data format, how do you have the
>> reading code still remain able to read the older format)
>
> Important point, imho, but this feels like a one-off format, so really,
> might be a bit of overengineering. Anyways, never hurts to simply have
> a header field that says "version". Do that!
>
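
A version field costs very little. The sketch below assumes a 12-byte
little-endian header made of a 4-byte signature, a 16-bit version, 16
reserved bits and a 32-bit count of bins per row. The signature "RTLP", the
field layout and the function names are all made up for this example;
fixing the byte order in the format also settles the endianness question.

```python
import struct

MAGIC = b"RTLP"                   # hypothetical 4-byte file signature
HEADER = struct.Struct("<4sHHI")  # magic, version, reserved, bins per row


def write_header(f, version, bins_per_row):
    """Write the fixed-size header at the current file position."""
    f.write(HEADER.pack(MAGIC, version, 0, bins_per_row))


def read_header(f):
    """Read and validate the header; return (version, bins_per_row)."""
    raw = f.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("file too short to contain a header")
    magic, version, _reserved, bins_per_row = HEADER.unpack(raw)
    if magic != MAGIC:
        raise ValueError("bad magic, not a file in this format")
    if version != 1:
        # A newer reader would dispatch on the version here instead.
        raise ValueError(f"unsupported format version {version}")
    return version, bins_per_row
```

A reader that sees an unknown version can then refuse the file loudly
instead of silently misinterpreting the rows that follow.
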
>> debugging (it's very hard to be 100% sure that the binary data you
>> wrote is what you intended to write; typically there will be no error
>> messages from the code and no crashes - just wrong and misleading
>> results due to miswritten/misread data), etc.
>
> In my experience, writing textual data is way harder to keep
> consistent, due to the non-fixed amount of bytes required per word.
>
> Best regards,
> Marcus
>