Hi,

I'd agree that having text encoding + compression is far from ideal.
However, another aspect/goal might be the following: keep the main
data in a binary format that gnuplot can read directly.

See below.

Hi Abhishek,
On Fri, 2019-12-27 at 20:11 +0530, Abhishek Goyal wrote:
> In practice you will find that [text format + compression] will be
> fairly close to [binary format + compression] in final size.

Is that so? Color me surprised! Certainly, any dictionary-based
compressor could find the bytes that make up the individual digits and
compress them to an average of a little under 4 bits each, but a
three-digit number plus separator would then still cost roughly 16
bits, twice the 8 bits you need to represent any number 0-255 in
binary. And if your dictionary allows for variable-length words, like
an LZ(W) kind of algorithm, the compression ratio should saturate
pretty early.
Now, I haven't worked with the specific text data coming out of
rtl_power, so I'd be very interested in the results!
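
For a quick sanity check without real captures at hand, here is a
minimal sketch (synthetic, normally-distributed fake dB readings, so
the resulting sizes only illustrate the methodology, not real
rtl_power data):

# Compare text+gzip against binary+gzip on synthetic power readings.
import gzip
import random
import struct

random.seed(0)
values = [random.gauss(-70.0, 5.0) for _ in range(100_000)]  # fake dB values

text = ",".join(f"{v:.2f}" for v in values).encode("ascii")
binary = struct.pack(f"<{len(values)}f", *values)  # 4 bytes per value

print("text          :", len(text))
print("text + gzip   :", len(gzip.compress(text, 9)))
print("binary        :", len(binary))
print("binary + gzip :", len(gzip.compress(binary, 9)))
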
Bill, could you compress a few of your textual rtl_power output files
(using gzip --best, and maybe xz) for us and tell us how many numbers
were in the original files and how many bytes long the resulting files
are?
(zstd: it would be very interesting to have a detached dictionary,
because I presume the dictionary overhead to be non-negligible with
large numbers of smaller observation files.)
(BTW, tar is the worst format under the sun for compressing many small
files; it pads every file to a multiple of 512 B. Of course, zeros
compress nicely, but suddenly your shortest codeword is a useless
padding symbol, and that has a measurable effect on compressed file
size.)
> Compression obviously will reduce random access to data, so if your
> intended use involves seeking randomly around in the data, things get
> tricky.

Indeed, that's what I'd have to bring forward: a "compressor" that
simply converts the tabular text data to binary format would, even in
the worst case, not be far from an actual entropy coder, but it would
allow random seeks, AND be faster. I honestly don't see the downsides
of that!
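
A minimal sketch of that idea (the column count and the purely numeric
CSV layout are assumptions; real rtl_power rows start with date/time
fields that would need separate handling):

# Fixed-size binary records: row n starts at byte n * RECORD.size,
# so any row can be reached with a single seek.
import struct

COLS = 4                             # values per row (assumption)
RECORD = struct.Struct(f"<{COLS}f")  # little-endian float32s

def text_to_binary(txt_path: str, bin_path: str) -> None:
    with open(txt_path) as src, open(bin_path, "wb") as dst:
        for line in src:
            fields = [float(x) for x in line.split(",")[:COLS]]
            dst.write(RECORD.pack(*fields))

def read_record(bin_path: str, n: int) -> tuple:
    with open(bin_path, "rb") as f:  # reopening per call keeps the sketch simple
        f.seek(n * RECORD.size)      # random access: one O(1) seek
        return RECORD.unpack(f.read(RECORD.size))
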
> If the format is intended to be shared with other people, and/or
> manage large collections of such data, either use hdf5[1] or look
> into it for inspiration.

Yep; or other formats. GNU Radio, for example, simply uses raw binary
numbers packed end-to-end; the SigMF project strives to provide
metadata (sample format, acquisition time, and other parameters) in a
separate file: JSON lying next to your data file. Whether or not
that's useful to you...
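
The sidecar idea in a nutshell (this is NOT the actual SigMF schema;
the field names here are made up for illustration):

# Raw samples packed end-to-end, plus a JSON metadata file next to them.
import json
import struct

samples = [0.0, 0.5, -0.5, 1.0]           # pretend measurement data

with open("capture.bin", "wb") as f:       # raw little-endian float32
    f.write(struct.pack(f"<{len(samples)}f", *samples))

meta = {
    "sample_format": "float32_le",         # illustrative field names
    "acquisition_time": "2019-12-27T20:11:00Z",
    "num_samples": len(samples),
}
with open("capture.bin.json", "w") as f:   # metadata lives next to the data
    json.dump(meta, f, indent=2)
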
> If that sounds too complicated, then protobufs[2] might be another
> option. Both hdf5 and protobufs benefit from having readily available
> code for access to the data for later analysis, and from having
> hundreds of man-years of bugfixes and portability fixes behind them.

Yeah, but a protobuf that's mostly a buffer of ints really is just
binary numbers packed one after the other, plus a header that you
define yourself. Letting some library like protobuf handle that is a
good idea, I agree!
> Again, depending on the use case, another option might be to store
> the data in a sqlite[3] database file; it's an underrated option for
> large amounts of data: here the binary conversion to and fro can be
> handled by the sqlite tools themselves, and you have access to the
> data in a fully random-access fashion. There are ways for sqlite to
> do online compression of data as well[4], in case you find the
> standard size reduction from going to binary isn't enough.

Not quite sure how well sqlite handles compression of BLOBs; or are
you suggesting inserting the samples as individual values?

By the way, I think "so much data it becomes a burden to my server" is
actually not a case sqlite is designed for. There might be databases
better optimized for that.
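
For the per-value variant, a sketch of how little code that takes (the
table layout here is invented):

# One row per (time, frequency) bin; sqlite does the binary conversion
# and SQL gives you indexed random access without reading the whole file.
import sqlite3

con = sqlite3.connect("power.db")
con.execute("""CREATE TABLE IF NOT EXISTS power (
                   ts   REAL,   -- unix timestamp
                   freq REAL,   -- Hz
                   db   REAL    -- measured power
               )""")
con.executemany("INSERT INTO power VALUES (?, ?, ?)",
                [(1577458860.0, 100e6 + i * 1e3, -70.0) for i in range(1000)])
con.commit()

row = con.execute("SELECT db FROM power WHERE ts = ? AND freq = ?",
                  (1577458860.0, 100e6 + 5e3)).fetchone()
print(row)
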
> Greg brought up endianness; then there's framing (indicating the
> start of each "row" of data, in case a few bytes got corrupted on
> disk, thus allowing the rest to still be recovered),

Which should be no problem, seeing that rows are fixed-length.
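
With fixed-length rows you can resynchronize by offset arithmetic
alone; if you wanted explicit framing anyway, a per-row sync marker is
the usual trick. A sketch, with an invented marker and row layout:

# Scan forward for a magic marker to resynchronize after corruption.
# Note: the marker can also occur inside the payload by chance; real
# framing schemes add escaping or checksums on top.
import struct

MAGIC = b"\xde\xad\xbe\xef"
ROW = struct.Struct("<4f")           # payload: four float32s

def write_row(f, values):
    f.write(MAGIC + ROW.pack(*values))

def recover_rows(f):
    data = f.read()
    pos = 0
    while True:
        pos = data.find(MAGIC, pos)
        if pos < 0 or pos + len(MAGIC) + ROW.size > len(data):
            return
        yield ROW.unpack_from(data, pos + len(MAGIC))
        pos += len(MAGIC) + ROW.size
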
> versioning (if you change the data format, how do you have the
> reading code still remain able to read the older format),

An important point, imho, but this feels like a one-off format, so it
might be a bit of over-engineering. Anyway, it never hurts to simply
have a header field that says "version". Do that!
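
For instance (magic string, field widths and version policy are all
made up for illustration):

# A tiny self-describing header: magic + version + row count. Readers
# check the version before interpreting the payload.
import struct

HEADER = struct.Struct("<4sHQ")      # magic, version, number of rows

def write_header(f, version: int, nrows: int) -> None:
    f.write(HEADER.pack(b"RTLP", version, nrows))

def read_header(f):
    magic, version, nrows = HEADER.unpack(f.read(HEADER.size))
    if magic != b"RTLP":
        raise ValueError("not one of our files")
    if version > 1:
        raise ValueError(f"version {version} is newer than this reader")
    return version, nrows
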
> debugging (it's very hard to be 100% sure that the binary data you
> read back is the data you intended to write; typically there will be
> no error messages from the code and no crashes, just wrong and
> misleading results due to miswritten/misread data), etc.

In my experience, writing textual data is way harder to keep
consistent, due to the non-fixed number of bytes required per value.

Best regards,
Marcus