The design of such a format would depend on the end goal. You mentioned
that you wanted to save space on your server. A streaming compression of
the current textual format should work quite well, and even better if we do
a sort of preprocessing to "delta encode" the data: i.e., if a
series is, say, [241, 234, 221, 201, 100, 43, 0, -10, ...], convert it to
[241, -7, -13, -20, -101, -57, -43, -10, ...].
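To make the transform concrete, here is a minimal sketch in Python (function names are my own, not from any existing tool):

```python
def delta_encode(values):
    # Keep the first value, then store each value as its
    # difference from the previous one.
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    # Reverse the transform with a running sum.
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

series = [241, 234, 221, 201, 100, 43, 0, -10]
encoded = delta_encode(series)
print(encoded)            # [241, -7, -13, -20, -101, -57, -43, -10]
assert delta_decode(encoded) == series
```

The deltas of a slowly varying power sweep are small and repetitive, which is exactly what a generic compressor exploits.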
If the use only involves streaming access to the data (i.e., process each
entry and move on), compression works really well, and most compression
formats allow both streaming while writing and streaming while reading
(meaning one won't need to fully decompress a file on the filesystem before
using it); look at tools like zcat, zstdcat, etc.
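As a hedged illustration of that write-as-you-go / read-as-you-go pattern (file name and row layout are made up, not rtl_power's actual columns):

```python
import gzip
import os
import tempfile

# Hypothetical compressed log; two columns stand in for a real sweep row.
path = os.path.join(tempfile.mkdtemp(), "power.csv.gz")

# Streaming write: each row is compressed as it is written out.
with gzip.open(path, "wt") as f:
    for ts in range(1000):
        f.write(f"{ts},{-50.0 - (ts % 10)}\n")

# Streaming read: rows come back one at a time, with no need to
# decompress the whole file on disk first (this is what zcat does).
count = 0
with gzip.open(path, "rt") as f:
    for line in f:
        count += 1  # process each entry and move on
print(count)  # 1000
```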
In practice you will find that [text format + compression] will be fairly
close to [binary format + compression] in final size.
Compression will, of course, make random access to the data harder, so if
your intended use involves seeking randomly around in the data, things get
tricky.
If the format is intended to be shared with other people, and/or used to
manage large collections of such data, either use hdf5[1] or look into it
for inspiration.
If that sounds too complicated, then protobufs[2] might be another option.
Both hdf5 and protobufs benefit from having readily available code for
access to the data for later analysis, and from having hundreds of
man-years of bugfixes and portability fixes behind them.
Again, depending on the use case, another option might be to store the data
in a sqlite[3] database file; it's an underrated option for large amounts
of data. Here the conversion to and from binary is handled by the sqlite
tools themselves, and you have access to the data in a fully random-access
fashion. There are ways for sqlite to do online compression of data as
well[4], in case you find the standard size reduction from going to binary
isn't enough.
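A rough sketch of what that looks like, using Python's built-in sqlite3 module (the table layout is invented for illustration; an in-memory database stands in for a file like "spectrum.db"):

```python
import sqlite3

# In-memory database for the example; on a server this would be
# sqlite3.connect("spectrum.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE power (ts INTEGER, hz_low INTEGER, hz_high INTEGER, db REAL)"
)

# sqlite stores these values in its own compact binary encoding for us;
# no hand-rolled byte layout to get wrong.
rows = [(1577404800 + i, 88_000_000, 88_100_000, -50.0 - i) for i in range(5)]
conn.executemany("INSERT INTO power VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Fully random access: fetch one timestamp without scanning everything.
(db_value,) = conn.execute(
    "SELECT db FROM power WHERE ts = ?", (1577404803,)
).fetchone()
print(db_value)  # -53.0
```

An index on ts would make such lookups fast even for very large files.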
Having dealt with custom binary formats for large amounts of data quite a
bit, I tend to recommend against doing that if standard options can be used
instead: there are a large number of pitfalls in going binary, both when
writing the data and when reading it back. Greg brought up endianness;
then there's framing (indicating the start of each "row" of data, so that
if a few bytes get corrupted on disk, the rest can still be recovered),
versioning (if you change the data format, how does the reading code remain
able to read the older format?), debugging (it's very hard to be 100% sure
that the binary data you intended to write is what actually got written;
typically there will be no error messages from the code and no crashes -
just wrong and misleading results due to miswritten/misread data), etc.
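To show what framing alone costs you, here is a hedged sketch of one way to do it (the magic bytes, header layout, and CRC choice are all invented for this example, not a proposed format):

```python
import struct
import zlib

MAGIC = b"RPWR"   # hypothetical row marker
VERSION = 1
HEADER = ">4sBI"  # magic, version, payload length, big endian

def frame(payload: bytes) -> bytes:
    # magic lets a reader resynchronize after corrupted bytes;
    # the CRC32 tells it whether the recovered row can be trusted.
    header = struct.pack(HEADER, MAGIC, VERSION, len(payload))
    return header + payload + struct.pack(">I", zlib.crc32(payload))

def unframe(buf: bytes):
    # Scan forward to the next magic marker and try to decode one record.
    idx = buf.find(MAGIC)
    if idx < 0:
        return None
    magic, version, length = struct.unpack_from(HEADER, buf, idx)
    start = idx + struct.calcsize(HEADER)
    payload = buf[start:start + length]
    (crc,) = struct.unpack_from(">I", buf, start + length)
    if zlib.crc32(payload) != crc:
        return None  # corrupted row; a real reader would keep scanning
    return version, payload

record = frame(b"1577404800,-53.0")
# Garbage bytes up front simulate corruption before the record.
print(unframe(b"\x00\xff" + record))  # (1, b'1577404800,-53.0')
```

And that still leaves versioning, debugging, and endianness of the payload itself to handle.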
Just my 0.02 USD.
-Abhishek
[1]:
https://www.neonscience.org/about-hdf5
[2]:
https://github.com/protocolbuffers/protobuf
[3]:
https://www.sqlite.org/about.html ("Think of SQLite not as a
replacement for Oracle but as a replacement for fopen()")
[4]:
https://sqlite.org/zipvfs/doc/trunk/www/readme.wiki
On Fri, Dec 27, 2019 at 1:46 AM Greg Troxel <gdt(a)lexort.com> wrote:
Bill Gaylord <chibill110(a)gmail.com> writes:
Hello,
I am working on making a binary format for rtl_power to save space on
my server for observing the spectrum. I am wondering if anyone else would
be interested in this idea.
I am hoping to keep the same format in terms of how the data is organized
in the file, but instead of using text, use some form of binary format. For
example, for the date and time I think using a unix epoch timestamp would
work.
I am hoping to receive any comments or criticism.
My suggestions are:
Think about whether you want the binary format to be portable. This
means being more careful about types and also endianness.
Post a draft file format for comments.
With a binary format for an existing text format, it would be nice to
have programs (pipes ideally) that convert text->binary and
binary->text.
For portability, I think the obvious answer is that the file format
should be independent of CPU type, compiler, and specifically
endianness. Thus you are writing a format that could be used over
the network (even if it's only intended to be stored). This means you
have to choose fixed-width types, and you have to store in a particular
endianness. I am 99% sure that the network protocol standard is big
endian, and I think you should follow that. I think there is precedent
in things like sqlite. For pgsql I think so, but there is a culture of
having to use pg_dump to change versions (and as a result pg_dump is
100% satisfactory; see my comment above about converters). I don't know
about mysql.
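To make the fixed-width, big-endian point above concrete, here is a hedged sketch with Python's struct module (the field layout is invented for illustration, not a proposed rtl_power format):

```python
import struct

# Hypothetical row: epoch seconds (uint32), hz_low/hz_high (uint64),
# power in dB (float32). ">" forces big endian (network byte order),
# so the bytes are identical regardless of CPU type or compiler.
LAYOUT = ">IQQf"

def pack_row(ts, hz_low, hz_high, db):
    return struct.pack(LAYOUT, ts, hz_low, hz_high, db)

def unpack_row(buf):
    return struct.unpack(LAYOUT, buf)

buf = pack_row(1577404800, 88_000_000, 88_100_000, -53.0)
print(len(buf))         # 4 + 8 + 8 + 4 = 24 bytes per row, on every platform
print(unpack_row(buf))
```

A pair of small filters built on pack_row/unpack_row would give the text->binary and binary->text pipes suggested above.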