Hello,
I am working on making a binary format for rtl_power to save space on my server for observing the spectrum. I am wondering if anyone else would be interested in this idea. I am hoping to keep the same format in terms of how the data is organized in the file, but instead of using text, use some form of binary encoding. For example, for the date and time I think using a unix epoch timestamp would work.
I am hoping to receive any comments or criticism.
Thanks, KD9KCK
Bill Gaylord chibill110@gmail.com writes:
My suggestions are:
Think about whether you want the binary format to be portable. This means being more careful about types and also endianness.
Post a draft file format for comments.
With a binary format for an existing text format, it would be nice to have programs (pipes ideally) that convert text->binary and binary->text.
For portability, I think the obvious answer is that the file format should be independent of CPU type, compiler, and specifically endianness. Thus you are writing a format that could be used over the network (even if it's only intended to be stored). This means you have to choose fixed-width types, and you have to store in a particular endianness. I am 99% sure that the network protocol standard is big endian, and I think you should follow that. I think there is precedent in things like sqlite. For pgsql I think so, but there is a culture of having to use pg_dump to change versions (and as a result pg_dump is 100% satisfactory; see my comment above about converters). I don't know about mysql.
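To make the fixed-width/endianness point concrete, here is a minimal sketch (Python, purely as an illustration; the row layout is invented, not a proposal) of writing one row with big-endian fields:

    import struct, time

    # Invented row layout: u32 unix timestamp, u64 low/high frequency in Hz,
    # u32 bin width in Hz, u32 bin count, then that many float32 dB values.
    # ">" forces big endian (network byte order) regardless of the host CPU.
    HEADER = struct.Struct(">IQQII")

    def write_row(f, timestamp, freq_low, freq_high, bin_hz, powers_db):
        f.write(HEADER.pack(int(timestamp), freq_low, freq_high, bin_hz, len(powers_db)))
        f.write(struct.pack(">%df" % len(powers_db), *powers_db))

    with open("scan.bin", "wb") as f:
        write_row(f, time.time(), 88_000_000, 108_000_000, 10_000, [-72.5, -71.0, -80.3])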
The design of such a format would depend on the end goal. You mentioned that you wanted to save space on your server. A streaming compression of the current textual format should work quite well, and even better if we do a sort of preprocessing to "delta encode" the data: i.e., if a series is, say, [241, 234, 221, 201, 100, 43, 0, -10, ...], convert it to [241, -7, -13, -20, -101, -57, -43, -10, ...].
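For instance, a minimal sketch of that delta step (plain Python; shown on integers for clarity):

    def delta_encode(values):
        # Keep the first value, then store only the difference to the previous one.
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    def delta_decode(deltas):
        out, acc = [], 0
        for d in deltas:
            acc += d
            out.append(acc)
        return out

    assert delta_encode([241, 234, 221, 201, 100, 43, 0, -10]) == \
           [241, -7, -13, -20, -101, -57, -43, -10]
    assert delta_decode([241, -7, -13, -20, -101, -57, -43, -10]) == \
           [241, 234, 221, 201, 100, 43, 0, -10]

The small deltas repeat far more often than the absolute values do, which is exactly what a general-purpose compressor exploits.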
If the use only involves streaming access to the data (i.e., process each entry and move on), compression works really well, and most compression formats allow both streaming while writing and streaming while reading (meaning one won't need to fully decompress a file on the filesystem before using it); look at tools like zcat, zstdcat, etc.
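For example, with the gzip module from the Python standard library you can write and read the text rows as a stream, never holding a fully decompressed file on disk (zstd and xz have equivalent modules/tools):

    import gzip

    # Streaming write: append each text row as it arrives (e.g. piped from rtl_power).
    def append_row(text_row):
        with gzip.open("scan.csv.gz", "at") as out:
            out.write(text_row + "\n")

    # Streaming read: the gzip module inflates on the fly, one row at a time.
    def process_all(handle_row):
        with gzip.open("scan.csv.gz", "rt") as inp:
            for line in inp:
                handle_row(line.rstrip("\n").split(","))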
In practice you will find that [text format + compression] will be fairly close to [binary format + compression] in final size. Compression obviously will reduce random access to data, so if your intended use involves seeking randomly around in the data, things get tricky.
If the format is intended to be shared with other people, and/or manage large collections of such data, either use hdf5[1] or look into it for inspiration. If that sounds too complicated, then protobufs[2] might be another option. Both hdf5 and protobufs benefit from having readily available code for access to the data for later analysis, and from having hundreds of man-years of bugfixes and portability fixes behind them.
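As a tiny illustration of the hdf5 route (using the h5py package; the dataset names here are made up):

    import numpy as np
    import h5py

    timestamps = np.array([1577436932, 1577436942], dtype=np.uint32)
    power_db = np.array([[-72.5, -71.0, -80.3],
                         [-72.1, -70.8, -79.9]], dtype=np.float32)

    with h5py.File("scans.h5", "w") as f:
        f.create_dataset("timestamps", data=timestamps)
        # Chunked, gzip-compressed dataset; hdf5 also handles endianness for you.
        f.create_dataset("power_db", data=power_db, chunks=True, compression="gzip")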
Again, depending on the use case, another option might be to store the data in a sqlite[3] database file; it's an underrated option for large amounts of data. Here the conversion to and from binary can be handled by the sqlite tools themselves, and you have access to the data in a fully random-access fashion. There are ways for sqlite to do online compression of data as well[4], in case you find the standard size reduction from going to binary isn't enough.
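A rough sketch of that, using Python's built-in sqlite3 module (the schema is just one possible layout, not a proposal):

    import array
    import sqlite3

    con = sqlite3.connect("scans.db")
    con.execute("""CREATE TABLE IF NOT EXISTS scan (
                       ts        INTEGER,   -- unix epoch
                       freq_low  INTEGER,   -- Hz
                       freq_high INTEGER,   -- Hz
                       step_hz   REAL,
                       power_db  BLOB       -- packed float32 values for this row
                   )""")

    row = array.array("f", [-72.5, -71.0, -80.3])
    con.execute("INSERT INTO scan VALUES (?, ?, ?, ?, ?)",
                (1577436932, 88_000_000, 108_000_000, 10_000.0, row.tobytes()))
    con.commit()

    # Random access later, e.g. everything in a time window:
    for ts, lo, hi, step, blob in con.execute(
            "SELECT * FROM scan WHERE ts BETWEEN ? AND ?", (1577400000, 1577500000)):
        powers = array.array("f")
        powers.frombytes(blob)

(Note that array.array("f") uses the machine's native byte order; for a file meant to move between machines you would pack with an explicit endianness instead.)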
Having dealt with custom binary formats for large amounts of data quite a bit, I tend to recommend against doing that if standard options can be used instead: there are a large number of pitfalls in going binary, both when writing the data and when reading it back. Greg brought up endianness; then there's framing (indicating the start of each "row" of data, so that if a few bytes get corrupted on disk the rest can still be recovered), versioning (if you change the data format, how do you keep the reading code able to read the older format?), debugging (it's very hard to be 100% sure that the binary data you intended to write is what actually got written; typically there will be no error messages from the code and no crashes, just wrong and misleading results due to miswritten/misread data), etc.
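If you do roll your own anyway, the framing and versioning points roughly translate into something like this (a sketch only; the magic bytes, CRC and field choices are invented):

    import struct
    import zlib

    MAGIC = b"RTPW"   # invented marker that lets a reader re-sync after corruption
    VERSION = 1
    HEADER = struct.Struct(">4sBII")   # magic, version, payload length, CRC32

    def frame_record(payload):
        return HEADER.pack(MAGIC, VERSION, len(payload), zlib.crc32(payload)) + payload

    def read_records(f):
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                return
            magic, version, length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if magic != MAGIC or zlib.crc32(payload) != crc:
                # Corrupted frame: a real reader would scan forward for the next MAGIC.
                return
            yield version, payload

The version byte is what lets a later reader refuse, or specially handle, old files instead of silently misreading them.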
Just my 0.02USD.
-Abhishek
[1]: https://www.neonscience.org/about-hdf5
[2]: https://github.com/protocolbuffers/protobuf
[3]: https://www.sqlite.org/about.html ("Think of SQLite not as a replacement for Oracle but as a replacement for fopen()")
[4]: https://sqlite.org/zipvfs/doc/trunk/www/readme.wiki
On Fri, Dec 27, 2019 at 1:46 AM Greg Troxel gdt@lexort.com wrote:
Abhishek Goyal abgoyal@gmail.com writes:
Having dealt with custom binary formats for large amounts of data quite a bit, I tend to recommend against doing that if standard options can be used instead: there are a large number of pitfalls in going binary, both when writing the data and when reading it back.
A message full of excellent points. I failed to think of hdf5/protobuf and all the rest of the great suggestions. So I will second the notion that a custom binary format is very likely not a good idea.
Hi Abhishek,
On Fri, 2019-12-27 at 20:11 +0530, Abhishek Goyal wrote:
In practice you will find that [text format + compression] will be fairly close to [binary format + compression] in final size.
Is that so? Color me surprised! While certainly any dictionary-based compressor could find the bytes that make up the individual digits and compress them to an average of a little less than 4 bits each, that'd still be worse than the 8 bits you need to represent any number 0-255, for example. And if your dictionary allows for variable-length words, like an LZ(W) kind of algorithm, the compression ratio should saturate pretty early.
Now, I haven't worked with the specific text data coming out of rtl_power, so I'd be very interested in the results! Bill, could you compress a few of your textual rtl_power output files (using gzip --best, and maybe xz) for us and tell us how many numbers were in the original files and how many bytes long the resulting files are?
(zstd: it would be very interesting to have a detached dictionary, because I presume the dictionary overhead to be non-negligible with large numbers of smaller observation files.) (BTW, tar is the worst format under the sun for compressing many small files; it pads every file to a multiple of 512 B. Of course, zeros compress nicely, but suddenly your shortest codeword is a useless padding symbol, and that has a measurable effect on the compressed file size.)
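If you'd rather not script it yourself, a quick way to get those numbers could be something like this (gzip and lzma are in the Python standard library; level 9 corresponds to gzip --best):

    import gzip, lzma, sys

    raw = open(sys.argv[1], "rb").read()       # one of your textual rtl_power files
    rows = raw.count(b"\n")
    numbers = raw.count(b",") + rows           # rough: fields per line = commas + 1

    print("rows:", rows, "  numbers (approx):", numbers)
    print("original bytes:   ", len(raw))
    print("gzip --best bytes:", len(gzip.compress(raw, compresslevel=9)))
    print("xz bytes:         ", len(lzma.compress(raw)))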
Compression obviously will reduce random access to data, so if your intended use involves seeking randomly around in the data, things get tricky.
Indeed, that's what I'd have to bring forward: a "compressor" based on simply converting the tabular text data to binary format would, in the worst case, not be far from an actual entropy encoder in size, but it would allow random seeks AND be faster. I honestly don't see the downsides of that!
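With fixed-width rows, "give me observation number k" is just pointer arithmetic; a sketch (the row layout is invented):

    import struct

    N_BINS = 512
    ROW = struct.Struct(">I%df" % N_BINS)   # u32 epoch + N_BINS float32 dB values

    def read_row(f, k):
        f.seek(k * ROW.size)                # O(1); no decompressing of earlier rows
        ts, *powers = ROW.unpack(f.read(ROW.size))
        return ts, powers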
If the format is intended to be shared with other people, and/or manage large collections of such data, either use hdf5[1] or look into it for inspiration.
Yep; or other formats. GNU Radio, for example, simply uses raw binary numbers packed end-to-end; there's the SigMF project which strives to provide metadata (sample format, acquisition time, and other parameters) in a separate file. It's JSON lying next to your data file. Whether or not that's useful to you...
If that sounds too complicated, then protobufs[2] might be another option. Both hdf5 and protobufs benefit from having readily available code for access to the data for later analysis, and from having hundreds of man-years of bugfixes and portability fixes behind them.
Yeah, but a protobuf that's mostly a buffer of ints really is only binary numbers right after each other, plus a header that you define yourself. It's a good idea to let some library like protobuf handle that, I agree!
Again, depending on the use case, another option might be to store the data in a sqlite[3] database file; it's an underrated option for large amounts of data. Here the conversion to and from binary can be handled by the sqlite tools themselves, and you have access to the data in a fully random-access fashion. There are ways for sqlite to do online compression of data as well[4], in case you find the standard size reduction from going to binary isn't enough.
Not quite sure how well sqlite handles compression of BLOBs, or are you suggesting you insert samples as values individually?
By the way, I think "so much data it becomes a burden to my server" is actually not covered by what sqlite is designed to do. There might be more optimized databases for that.
Greg brought up endianness; then there's framing (indicating the start of each "row" of data, so that if a few bytes get corrupted on disk the rest can still be recovered)
Which should be no problem, seeing that rows are fixed-length.
versioning (if you change the data format, how do you keep the reading code able to read the older format?)
Important point, imho, but this feels like a one-off format, so really, it might be a bit of overengineering. Anyway, it never hurts to simply have a header field that says "version". Do that!
debugging (it's very hard to be 100% sure that the binary data you intended to write is what actually got written; typically there will be no error messages from the code and no crashes, just wrong and misleading results due to miswritten/misread data), etc.
In my experience, writing textual data is way harder to keep consistent, due to the non-fixed number of bytes required per word.
Best regards, Marcus
Hi,
I'd agree that having text encoding + compression is far from ideal.
However, another aspect/goal might be the following: have the main data readable as binary from gnuplot.
see http://gnuplot.sourceforge.net/docs_4.2/node103.html
kind regards, Hayati
On 02.01.2020 at 14:49, Müller, Marcus (CEL) wrote:
I will be starting by defining the format, then writing a converter to convert between the text CSV and the binary format. I will also try to write a stream-based converter so that you can pipe the output of rtl_power into it to write the binary format directly. I figure this is more universal than trying to build a binary format into the application itself.
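A first cut of that pipe stage could look something like this (Python for brevity; the binary layout is just a placeholder until the real format is defined, and the parsing assumes the usual rtl_power CSV columns of date, time, Hz low, Hz high, Hz step, samples, then dB values):

    import struct
    import sys
    from datetime import datetime, timezone

    # Placeholder record: u32 epoch, u64 low Hz, u64 high Hz, f64 step Hz,
    # u32 bin count, then float32 dB values -- all big endian.
    HEADER = struct.Struct(">IQQdI")

    def main():
        out = sys.stdout.buffer
        for line in sys.stdin:
            fields = [f.strip() for f in line.split(",")]
            if len(fields) < 7:
                continue                       # skip malformed or empty lines
            date, tm, low, high, step, _samples = fields[:6]
            # Treating the timestamp as UTC here; adjust if rtl_power logs local time.
            epoch = int(datetime.strptime(date + " " + tm, "%Y-%m-%d %H:%M:%S")
                        .replace(tzinfo=timezone.utc).timestamp())
            powers = [float(x) for x in fields[6:]]
            out.write(HEADER.pack(epoch, int(float(low)), int(float(high)),
                                  float(step), len(powers)))
            out.write(struct.pack(">%df" % len(powers), *powers))
        out.flush()

    if __name__ == "__main__":
        main()

Usage would then be roughly: rtl_power -f 88M:108M:10k | python3 rtl2bin.py > scan.bin (the flags and script name are only an example).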
On Thu, Jan 2, 2020 at 7:30 PM Hayati Ayguen h_ayguen@web.de wrote:
Hi,
I missed this discussion on a topic I value.
I would propose moving towards audio/video containers like mkv, so that norms for things like tags and channels can develop, and so that algorithms for stream compression can be used when relevant. Standardization is always a work in progress, always helpful to everyone, and gets more people working on each other's problems.
rtl_power_fftw also has a binary output format; see https://github.com/AD-Vega/rtl-power-fftw/pull/11
soapy_power has a binary output format with a lot of proposed contributions; I am unfortunately involved with them but didn't have the capacity to take over when xmikos tired out: https://github.com/xmikos/soapy_power/pulls
One thing I tend to desire of spectrum storage formats is the ability to store raw i/q logs for parts of the recording alongside them, or, with a glitchy device like the rtlsdr, the serial number and USB packet log (see https://github.com/keenerd/rtl-sdr/compare/master...xloem:logfile-official ), with a way to mark synchronization of events.
On Fri, Jan 3, 2020, 2:51 PM Bill Gaylord chibill110@gmail.com wrote: