<div dir="ltr">I will start by defining the format, then write a converter between the text CSV and the binary format. I will also try to write a stream-based converter, so that you can pipe the output of rtl_power into it and write the binary format directly. <div>I figure this is more universal than trying to add a binary format to the application itself.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jan 2, 2020 at 7:30 PM Hayati Ayguen &lt;<a href="mailto:h_ayguen@web.de">h_ayguen@web.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

Hi,<br>

<br>

I'd agree that having text encoding + compression is far from ideal.<br>

<br>

However, another aspect/goal might be the following:<br>

have the main data readable as binary by gnuplot.<br>

<br>

see<br>

<a href="http://gnuplot.sourceforge.net/docs_4.2/node103.html" rel="noreferrer" target="_blank">http://gnuplot.sourceforge.net/docs_4.2/node103.html</a><br>

<br>

<br>

kind regards,<br>

Hayati<br>

<br>

<br>

On 02.01.2020 at 14:49, Müller, Marcus (CEL) wrote:<br>

> Hi Abhishek,<br>

><br>

> On Fri, 2019-12-27 at 20:11 +0530, Abhishek Goyal wrote:<br>

>> In practice you will find that [text format+ compression] will be<br>

>> fairly close to [binary format + compression] in final size.<br>

><br>

> Is that so? Color me surprised! While certainly any dictionary-based<br>

> compressor could find the bytes that make up the individual digits and<br>

> compress them to an average of a little less than 4b, that'd still be<br>

> worse than the 8b you need to represent any number 0-255, for example.<br>

> And if your dictionary allows for variable-length words, like an LZ(W)<br>

> kind of algorithm, the compression ratio should saturate pretty early.<br>

><br>

> Now, I haven't worked with the specific text data coming out of<br>

> rtl_power, so I'd be very interested in the results!<br>

> Bill, could you compress a few of your textual rtl_power output files<br>

> (using gzip --best, and maybe xz) for us and tell us how many<br>

> numbers were in the original files and how many bytes are the resulting<br>

> files in length?<br>

><br>

> (zstd: would be very interesting to have a detached dictionary, because<br>

> I presume the dictionary overhead to be non-negligible with large<br>

> numbers of smaller observation files)<br>

> (BTW, tar is the worst format under the sun to compress many small<br>

> files; it pads every file to 512B; of course, zeros compress nicely,<br>

> but suddenly your shortest codeword is a useless padding symbol and<br>

> that has a measurable compressed file size effect)<br>

><br>

>> Compression obviously will reduce random access to data, so if your<br>

>> intended use involves seeking randomly around in the data, things get<br>

>> tricky.<br>

><br>

> Indeed, that's what I'd have to bring forward: A "compressor" based on<br>

> simply converting the tabular text data to binary format would be not<br>

> so far away, worst case, from an actual entropy encoder, but allow for<br>

> random seeks, AND be faster. I honestly don't see the downsides of<br>

> that!<br>

><br>

>> If the format is intended to be shared with other people, and/or<br>

>> manage large collections of such data, either use hdf5[1] or look<br>

>> into it for inspiration.<br>

><br>

> Yep; or other formats. GNU Radio, for example, simply uses raw binary<br>

> numbers packed end-to-end; there's the SigMF project which strives to<br>

> provide metadata (sample format, acquisition time, and other<br>

> parameters) in a separate file. It's JSON lying next to your data file.<br>

> Whether or not that's useful to you...<br>

><br>

>> If that sounds too complicated, then protobufs[2] might be another<br>

>> option. Both hdf5 and protobufs benefit from having readily available<br>

>> code for access to the data for later analysis, and from having<br>

>> hundreds of man-years of bugfixes and portability fixes behind them.<br>

><br>

> Yeah, but a protobuf that's mostly a buffer of ints really is only<br>

> binary numbers right after each other, plus a header that you define<br>

> yourself. It's a good idea to let some library like protobuf handle<br>

> that, I agree!<br>

><br>

>> Again, depending on the use case, another option might be to store<br>

>> the data in a sqlite[3] database file; it's an underrated option for<br>

>> large amounts of data: here the binary conversion to and fro can be<br>

>> handled by the sqlite tools themselves, and you have access to the<br>

>> data in a fully random-access fashion. There are ways for sqlite to<br>

>> do online compression of data as well[4], in case you find the<br>

>> standard size reduction from going to binary isn't enough.<br>

><br>

> Not quite sure how well sqlite handles compression of BLOBs, or are you<br>

> suggesting you insert samples as values individually?<br>

><br>

> By the way, I think "so much data it becomes a burden to my server" is<br>

> actually not covered by what sqlite is designed to do. There might be<br>

> more optimized databases for that.<br>

><br>

>> Greg brought up endianness, then there's framing (indicating the start<br>

> of each "row" of data, in case a few bytes got corrupted on disk, thus<br>

> allowing the rest to still be recovered)<br>

><br>

> Which should be no problem, seeing that rows are fixed-length,<br>

><br>

>> versioning (if you change the data format, how do you have the<br>

> reading code still remain able to read the older format)<br>

><br>

> Important point, imho, but this feels like a one-off format, so really,<br>

> might be a bit of overengineering. Anyways, never hurts to simply have<br>

> a header field that says "version". Do that!<br>

><br>

>> debugging (it's very hard to be 100% sure that the binary data is what you<br>

> intended to write; typically there will be no error messages from the<br>

> code and no crashes - just wrong and misleading results due to<br>

> miswritten/misread data), etc.<br>

><br>

> In my experience, writing textual data is way harder to keep<br>

> consistent, due to the non-fixed amount of bytes required per word.<br>

><br>

> Best regards,<br>

> Marcus<br>

><br>

</blockquote></div>
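<div><br></div><div>To make the converter idea concrete, here is a minimal sketch of one possible record layout, not a settled format. It assumes the usual rtl_power CSV columns (date, time, Hz low, Hz high, Hz step, samples, then one dB value per bin) and treats the timestamps as UTC. The magic bytes, the field order and the names MAGIC, VERSION, HEADER, encode_row, decode_row and convert_lines are all hypothetical. It uses network (big-endian) byte order to settle the endianness question, and carries a version field as Marcus suggested.</div>
<pre>
```python
import struct
from datetime import datetime, timezone

MAGIC = b"RPWR"   # hypothetical magic bytes marking the start of a record
VERSION = 1       # hypothetical format version

# "!" selects network (big-endian) byte order with standard sizes, no padding.
# Fields: magic, version, timestamp, hz_low, hz_high, hz_step, samples, n_bins
HEADER = struct.Struct("!4sHddddII")


def encode_row(line: str) -> bytes:
    """Convert one rtl_power CSV line into one binary record."""
    fields = [f.strip() for f in line.split(",")]
    # Assumption: the date/time columns are interpreted as UTC.
    stamp = datetime.strptime(
        fields[0] + " " + fields[1], "%Y-%m-%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc).timestamp()
    hz_low, hz_high, hz_step = (float(x) for x in fields[2:5])
    samples = int(fields[5])
    powers = [float(x) for x in fields[6:] if x]
    header = HEADER.pack(
        MAGIC, VERSION, stamp, hz_low, hz_high, hz_step, samples, len(powers)
    )
    # The dB values follow the header as 32-bit floats.
    return header + struct.pack("!%df" % len(powers), *powers)


def decode_row(buf: bytes, offset: int = 0):
    """Read one record back; returns (stamp, lo, hi, step, samples, powers)."""
    magic, version, stamp, lo, hi, step, samples, n = HEADER.unpack_from(
        buf, offset
    )
    if magic != MAGIC or version != VERSION:
        raise ValueError("not a version %d record" % VERSION)
    powers = struct.unpack_from("!%df" % n, buf, offset + HEADER.size)
    return stamp, lo, hi, step, samples, list(powers)


def convert_lines(lines):
    """Yield one binary record per non-empty CSV line (stream friendly)."""
    for line in lines:
        if line.strip():
            yield encode_row(line)
```
</pre>
<div>For the piped case, a small wrapper script could loop over convert_lines(sys.stdin) and write each record to sys.stdout.buffer. As long as every row has the same number of bins, each record is HEADER.size + 4 * n_bins bytes long, so seeking to row k is a simple multiplication.</div>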