This is merely a historical archive of years 2008-2021, before the migration to mailman3.
A maintained and still updated list archive can be found at https://lists.osmocom.org/hyperkitty/list/osmocom-sdr@lists.osmocom.org/.
Müller, Marcus (CEL) mueller at kit.edu

Hi Abhishek,

On Fri, 2019-12-27 at 20:11 +0530, Abhishek Goyal wrote:
> In practice you will find that [text format+ compression] will be
> fairly close to [binary format + compression] in final size.

Is that so? Color me surprised! While certainly any dictionary-based compressor could find the bytes that make up the individual digits and compress them to an average of a little less than 4 bits each, that'd still be worse than the 8 bits you need to represent any number 0-255, for example. And if your dictionary allows for variable-length words, like an LZ(W) kind of algorithm, the compression ratio should saturate pretty early.

Now, I haven't worked with the specific text data coming out of rtl_power, so I'd be very interested in the results! Bill, could you compress a few of your textual rtl_power output files (using gzip --best, and maybe xz) for us and tell us how many numbers were in the original files and how many bytes the resulting files are in length?

(zstd: would be very interesting to have a detached dictionary, because I presume the dictionary overhead to be non-negligible with large numbers of smaller observation files)

(BTW, tar is the worst format under the sun to compress many small files; it pads every file to 512B; of course, zeros compress nicely, but suddenly your shortest codeword is a useless padding symbol and that has a measurable compressed file size effect)

> Compression obviously will reduce random access to data, so if your
> intended use involves seeking randomly around in the data, things get
> tricky.

Indeed, that's what I'd have to bring forward: A "compressor" based on simply converting the tabular text data to binary format would be not so far away, worst case, from an actual entropy encoder, but allow for random seeks, AND be faster. I honestly don't see the downsides of that!
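A minimal sketch of that text-to-binary conversion, with a size comparison against gzip. Everything concrete here is an assumption for illustration: the row layout (64 values per row, each already quantised to 0-255), the comma-separated text form, and the synthetic random data, which is not real rtl_power output.

```python
import gzip
import random
import struct

# Hypothetical layout: 64 readings per row, each quantised to 0..255.
BINS_PER_ROW = 64
ROW = struct.Struct(f"<{BINS_PER_ROW}B")  # fixed length: row i starts at i * ROW.size


def text_to_binary(text: str) -> bytes:
    """Pack each comma-separated text line into one fixed-length binary row."""
    rows = (ROW.pack(*(int(v) for v in line.split(","))) for line in text.splitlines())
    return b"".join(rows)


def read_row(blob: bytes, index: int) -> tuple:
    """Random access: unpack row `index` without decoding any other row."""
    return ROW.unpack_from(blob, index * ROW.size)


# Synthetic stand-in data, NOT real rtl_power output.
rng = random.Random(0)
table = [[rng.randrange(256) for _ in range(BINS_PER_ROW)] for _ in range(1000)]
text = "\n".join(",".join(map(str, row)) for row in table)
binary = text_to_binary(text)

sizes = {
    "text": len(text.encode()),
    "text+gzip": len(gzip.compress(text.encode(), compresslevel=9)),
    "binary": len(binary),
    "binary+gzip": len(gzip.compress(binary, compresslevel=9)),
}
for name, size in sizes.items():
    print(f"{name:12s} {size:8d} bytes")
```

Uniformly random values give gzip very little to exploit, so these numbers say nothing about real measurement files; feeding in actual rtl_power output is the only way to settle the question.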
> If the format is intended to be shared with other people, and/or
> manage large collections of such data, either use hdf5[1] or look
> into it for inspiration.

Yep; or other formats. GNU Radio, for example, simply uses raw binary numbers packed end-to-end; there's the SigMF project which strives to provide metadata (sample format, acquisition time, and other parameters) in a separate file. It's JSON lying next to your data file. Whether or not that's useful to you...

> If that sounds too complicated, then protobufs[2] might be another
> option. Both hdf5 and protobufs benefit from having readily available
> code for access to the data for later analysis, and from having
> hundreds of man-years of bugfixes and portability fixes behind them.

Yeah, but a protobuf that's mostly a buffer of ints really is only binary numbers right after each other, plus a header that you define yourself. It's a good idea to let some library like protobuf handle that, I agree!

> Again, depending on the use case, another option might be to store
> the data in a sqlite[3] database file, it's an underrated option for
> large amounts of data: here the binary conversion to and fro can be
> handled by the sqlite tools themselves, and you have access to the
> data in a fully random-access fashion. There are ways for sqlite to
> do online compression of data as well[4], in case you find the
> standard size reduction from going to binary isn't enough.

Not quite sure how well sqlite handles compression of BLOBs, or are you suggesting you insert samples as values individually? By the way, I think "so much data it becomes a burden to my server" is actually not covered by what sqlite is designed to do. There might be more optimized databases for that.
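To illustrate the "raw binary numbers plus a JSON file lying next to them" approach, here is a rough sketch. It only mimics the general idea; the metadata keys (`sample_format`, `count`, `center_freq_hz`) and the helper names are made up for this example and do not follow the actual SigMF schema.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_capture(stem: Path, samples: list, meta: dict) -> None:
    """Write samples as packed little-endian float32 plus a JSON sidecar file."""
    stem.with_suffix(".bin").write_bytes(struct.pack(f"<{len(samples)}f", *samples))
    # Hypothetical metadata keys, not the SigMF field names.
    sidecar = dict(meta, sample_format="float32_le", count=len(samples))
    stem.with_suffix(".json").write_text(json.dumps(sidecar, indent=2))


def read_capture(stem: Path):
    """Read the sidecar first, then use it to interpret the raw bytes."""
    meta = json.loads(stem.with_suffix(".json").read_text())
    raw = stem.with_suffix(".bin").read_bytes()
    samples = struct.unpack(f"<{meta['count']}f", raw)
    return samples, meta


with tempfile.TemporaryDirectory() as tmp:
    stem = Path(tmp) / "capture"
    write_capture(stem, [-42.5, -40.25, -39.0], {"center_freq_hz": 100_000_000})
    samples, meta = read_capture(stem)
    print(samples, meta)
```

The data file stays trivially seekable (sample i lives at byte offset 4 * i), while everything a reader needs to interpret it sits in human-readable form beside it.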
> Greg brought up endianness, then there's framing (indicating the start
> of each "row" of data, in case a few bytes got corrupted on disk, thus
> allowing the rest to still be recovered)

Which should be no problem, seeing that rows are fixed-length,

> versioning (if you change the data format, how do you have the
> reading code still remain able to read the older format)

Important point, imho, but this feels like a one-off format, so really, might be a bit of overengineering. Anyways, never hurts to simply have a header field that says "version". Do that!

> debugging (it's very hard to be 100% sure that a binary data you
> intended to write, typically there will be no error messages from the
> code and no crashes - just wrong and misleading results due to
> miswritten/misread data), etc.

In my experience, writing textual data is way harder to keep consistent, due to the non-fixed amount of bytes required per word.

Best regards,
Marcus
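A small sketch of the points above (explicit endianness, a "version" header field, fixed-length rows), under assumed details: the `RPWR` magic bytes, the header layout, and the int16 value type are all hypothetical choices for illustration.

```python
import struct

MAGIC = b"RPWR"  # hypothetical file signature
# Header: magic, format version, values per row; "<" pins little-endian.
HEADER = struct.Struct("<4sHH")


def dump(rows, version=1):
    """Serialise equal-length rows of int16 values behind a versioned header."""
    n = len(rows[0])
    out = [HEADER.pack(MAGIC, version, n)]
    out += [struct.pack(f"<{n}h", *row) for row in rows]
    return b"".join(out)


def load(blob):
    """Check magic and version, then slice the body into fixed-length rows."""
    magic, version, n = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a recognised file")
    if version != 1:
        raise ValueError(f"unsupported format version {version}")
    if n == 0:
        raise ValueError("header claims zero values per row")
    row = struct.Struct(f"<{n}h")
    body = blob[HEADER.size:]
    if len(body) % row.size:
        raise ValueError("truncated file: length is not a whole number of rows")
    return [row.unpack_from(body, off) for off in range(0, len(body), row.size)]


blob = dump([[1, -2, 3], [4, 5, -6]])
print(load(blob))
```

Because every row has the same size, a reader can always find row boundaries by arithmetic, and a file written with a future layout fails loudly on the version check instead of yielding misleading numbers.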