Hello All, We have executed profiling test of two versions of algorithm for decoding compressed bitmap of EPDAN.
From the results , we see that performance is better in Tree based decoding (Version2 as given below )
Version 1: Bitmap based decoding as present in current master branch ( Function name osmo_t4_decode ) Version 2: Tree based decoding ( Function name decompress_crbb , as proposed in patch "Decompress the CRBB bitmap using tree based approach"http://lists.osmocom.org/pipermail/osmocom-net-gprs/2016-March/000525.html )
A sample bitmap taken from a real mobile log is used for the test. Host execution: Time taken to decode (micro seconds) Version 1: (Bitmap based decoding) : MIN -17 MAX -19 AVERAGE -17.9 Version 2 (tree based decoding): MIN -4 MAX -13 AVERAGE - 5.2
Target execution: Time taken to decode (micro seconds) Version 1: (Bitmap based decoding) : MIN -277 MAX -583 AVERAGE - 353 Version 2 (tree based decoding): MIN -67 MAX -86 AVERAGE - 69.8
Regards Prasad
On 08 Jun 2016, at 12:36, Prasad Kaup Prasad.Kaup@radisys.com wrote:
Hello All,
Hi!
We have executed profiling test of two versions of algorithm for decoding compressed bitmap of EPDAN.From the results , we see that performance is better in Tree based decoding (Version2 as given below ) Version 1: Bitmap based decoding as present in current master branch ( Function name osmo_t4_decode ) Version 2: Tree based decoding ( Function name decompress_crbb , as proposed in patch “Decompress the CRBB bitmap using tree based approach” )
A sample bitmap taken from a real mobile log is used for the test. Host execution: Time taken to decode (micro seconds) Version 1: (Bitmap based decoding) : MIN -17 MAX -19 AVERAGE -17.9 Version 2 (tree based decoding): MIN -4 MAX -13 AVERAGE - 5.2
Target execution: Time taken to decode (micro seconds) Version 1: (Bitmap based decoding) : MIN -277 MAX -583 AVERAGE - 353 Version 2 (tree based decoding): MIN -67 MAX -86 AVERAGE – 69.8
please release the benchmark and the test data so one can re-measure and see where time is spent.
thank you holger
Hello,
please release the benchmark and the test data so one can re-measure and see where time is spent.
Test data is " 0x40,0x20,0x0a,0x7f,0x79,0x61,0x00,0x49,0x2b,0x4c,0x00,0xe1,0xf2,0xe9,0xfe,0xae,0x7a,0x00,0xf0,0x30,0x3b,0x2b,0x2b"
Timing is collected using gettimeofday call. Below is the data in microseconds.
Bitmap based decoder Tree based decoder Host Target Host Target 17 312 4 86 18 317 4 67 19 289 5 68 18 317 13 68 18 553 5 69 17 294 5 67 18 311 4 68 18 277 4 68 18 278 4 68 18 583 4 69
Regards Prasad
On 09 Jun 2016, at 16:01, Prasad Kaup Prasad.Kaup@radisys.com wrote:
Hello,
please release the benchmark and the test data so one can re-measure and see where time is spent.
Test data is " 0x40,0x20,0x0a,0x7f,0x79,0x61,0x00,0x49,0x2b,0x4c,0x00,0xe1,0xf2,0xe9,0xfe,0xae,0x7a,0x00,0xf0,0x30,0x3b,0x2b,0x2b"
now only the benchmark (sourcecode) is missing. E.g. you can put it into the contrib folder of the pcu and push it to a branch in the radisyst repository?
holger
On 10 Jun 2016, at 17:36, Prasad Kaup Prasad.Kaup@radisys.com wrote:
Hello,
Dear Passad,
now only the benchmark (sourcecode) is missing. E.g. you can put it into the> contrib folder of the pcu and push it to a branch in the radisyst repository?
I have attached the patches with latest master baseline.
you do not seem to understand my intent. I want to run "perf" on your benchmark and compare the two implementations. To be able to do this, you will need to share the code that executes the two different implementations (and the testdata).
So please put the sourcecode for your test/benchmark application into the contrib/ directory and push it into a branch of your osmo-pcu-radisys git repository.
kind regards holger
Note also that UT environment is cross compiled and TBFTest is run on target to get the timing data for target. Regards Prasad
On 10 Jun 2016, at 18:36, Prasad Kaup Prasad.Kaup@radisys.com wrote:
Note also that UT environment is cross compiled and TBFTest is run on target to get the timing data for target.
Will you provide the source + test data for your benchmark? If not why not? A benchmark (and its result) is only useful if it is perfectly clear of what has been measured and compared with each other. The easiest way to make that clear is to provide the source and data (+ explanation on compiling/execution).
kind regards holger
On 10 Jun 2016, at 19:34, Holger Freyther holger@freyther.de wrote:
Will you provide the source + test data for your benchmark? If not why not? A benchmark (and its result) is only useful if it is perfectly clear of what has been measured and compared with each other. The easiest way to make that clear is to provide the source and data (+ explanation on compiling/execution).
After going through the patch again, I see that you instrument a single call of the decoding code? I don't think that this is a good approach to comparing the decoding code. What I expected (and probably Harald as well) is:
* One collects some typical data (e.g. all data used during routing area update, pdp context activation and then opening a common website, e.g. m.heise.de), alternative you can use other compressed bitmaps if you can say how they represent a public page.
* Ones creates a bench_compressed_window.c with at least two paths:
** Call the existing code ** Call the new tree code
Initialize the code tree in both cases and one can use "time" to compare it. E.g. the non-compiling skeleton like this:
struct test_data { const uint8_t *data; const size_t len; const ... expected_result; // (what is the point if it is more quick if it does not work) };
static const struct test_data tests[] = { TEST_DATA(data, NULL),.. };
static void test_old(int rounds, int verify) { while (rounds-- > 0) { int i; for (i = 0; i < ARRAY_SIZE(tests); ++i) { call_old_code(); if (verify) check_result(); } } }
static void test_new(int rounds, int verify) { while (rounds-- > 0) { int i; for (i = 0; i < ARRAY_SIZE(tests); ++i) { call_new_code(); if (verify) check_result(); } } }
int main() { initialize_tree(); ... if (new) test_new(10000, 0) else test_old(10000, 0); }
Hello Holger,
After going through the patch again, I see that you instrument a single call of the decoding code? I don't think that this is a good approach to comparing the decoding code.
We will add more test samples for the profiling test as suggested and share the benchmark.
Regards Prasad
Hello Holger, We have pushed the branch with tree based decoding algorithm and profiling test for comparison of existing method in master over the tree based algorithm developed by Radisys. Branch name: users/pravin/epdan_profiling at http://git.osmocom.org/radisys/osmo-pcu/
The comments suggested in earlier mail are addressed in the branch. Following is the profiling results in micro-seconds. Each of 4 inputs is decoded 10000 times for getting the below data Existing decoding algorithm in Target Input Min Max Avg 1 165 174 276 2 100 101 168 3 16 26 26 4 185 190 312
Tree based decoding algorithm Target Input Min Max Avg 1 65 430 110 2 58 59 98 3 3 4 6 4 96 469 162
Moreover, we also tried the same on host machine below are the results. Existing decoding algorithm in Host Input Min Max Avg 1 5 6 12 2 3 4 8 3 0 1 0 4 5 7 14
Tree based decoding algorithm in Host Input Min Max Avg 1 2 3 7 2 2 3 5 3 0 1 0 4 3 4 9
Regards, Pravin
Hello Holger, In the submitted patch we see that there a bug in computing max value we will correct it and resubmit the patch and results.
Regards, Pravin
Hi All, Following is the profiling results comparison of the decoding algorithms (unit in micro-seconds). Each of the 4 test inputs is decoded 10000 times for getting below data.
Input test In target In host Vector id Tree based Existing Tree based Existing Min Avg Min Avg Min Avg Min Avg 1 66 69.6 164 173.4 2 6.8 5 13.3 2 59 62.3 118 124.8 2 5.5 3 11.5 3 12 13.1 20 22 0 0.5 0 1.3 4 96 102.4 251 264 3 9.2 7 19.8
Important summary: 1. For few other identified test vectors existing algorithm fails functionally to decode whereas tree based decoding algorithm succeeds. 2. Details of these test vectors are in the latest patch available in users/pravin/epdan_profiling at http://git.osmocom.org/radisys/osmo-pcu/ 3. From the above results it shows that Tree based decoding algorithm is better than existing algorithm for the decoding time consumed. 4. The max value is not included because it is abnormally high which occurs at very low frequency like once in 10000 iterations.
The detailed Profiling data follows, In target:
Tree based decoding algorithm: Input test Min Avg Max vector Id 1 66 69.6 651 2 59 62.3 1125 3 12 13.1 259 4 96 102.4 1418
Existing Algorithm: Input test Min Avg Max vector Id 1 164 173.4 1372 2 118 124.8 691 3 20 22 1376 4 251 264 1229
In host:
Tree based decoding algorithm: Input test Min Avg Max Vector id 1 2 6.8 16065 2 2 5.5 20063 3 0 0.5 67 4 3 9.2 16219
Existing Algorithm: Input test Min Avg Max Vector id 1 5 13.3 12042 2 3 11.5 10816 3 0 1.3 5227 4 7 19.8 12078
Regards, Pravin
On 23 Jun 2016, at 16:39, Pravin Kumaravel Manoharan Pravin.Manoharan@radisys.com wrote:
Dear Parvin,
Following is the profiling results comparison of the decoding algorithms (unit in micro-seconds). Each of the 4 test inputs is decoded 10000 times for getting below data.
Important summary:
- For few other identified test vectors existing algorithm fails functionally to decode whereas tree based decoding algorithm succeeds.
- Details of these test vectors are in the latest patch available in users/pravin/epdan_profiling at http://git.osmocom.org/radisys/osmo-pcu/
I used it from the testcase you added on gerrit and re-added osmo_t4_decode to it. More about results later:
- From the above results it shows that Tree based decoding algorithm is better than existing algorithm for the decoding time consumed.
- The max value is not included because it is abnormally high which occurs at very low frequency like once in 10000 iterations.
max. Well without distribution this argument is difficult. With a tree we will "jump" through memory a lot more than with a table. E.g. outside a microbenchmark the numbers might not be that stable.
Anyway on sysmoBTS #1 with nightly build as of today/yesterday and 89ce5adbe881f4292d5bd5bfc85bd142b3fb4151 (from gerrit refs/changes/17/417/2). Modified to disable verification and 100000 iterations. The difference is big enough.
root@sysmobts-v2:~# time ./TbfTest.TREE
real 0m24.754s user 0m22.940s sys 0m1.790s root@sysmobts-v2:~# time ./TbfTest
real 1m15.535s user 1m13.100s sys 0m1.730s
perf report -i perf.data.tree
33.00% TbfTest.TREE libosmocore.so.7.0.0 [.] bitvec_set_bit_pos 20.46% TbfTest.TREE TbfTest.TREE [.] bitvec_write_field(bitvec*, unsigned int&, unsigned long long, unsigned int) 14.30% TbfTest.TREE libosmocore.so.7.0.0 [.] bitvec_set_bit 9.94% TbfTest.TREE TbfTest.TREE [.] search_runlen(node*, unsigned char const*, unsigned char, unsigned char*, unsigned short*) 5.27% TbfTest.TREE TbfTest.TREE [.] Decoding::decompress_crbb(signed char, unsigned char, unsigned char const*, bitvec*) 4.97% TbfTest.TREE libosmocore.so.7.0.0 [.] 0x0000523c 4.29% TbfTest.TREE TbfTest.TREE [.] 0x000023a8
perf report -i perf.data.roll
57.51% TbfTest libosmocore.so.7.0.0 [.] osmo_t4_decode 12.09% TbfTest libosmocore.so.7.0.0 [.] bitvec_shiftl 10.63% TbfTest libosmocore.so.7.0.0 [.] bitvec_set_bit_pos 4.49% TbfTest libosmocore.so.7.0.0 [.] bitvec_set_bit 4.17% TbfTest libosmocore.so.7.0.0 [.] bitvec_fill 3.62% TbfTest libosmocore.so.7.0.0 [.] 0x00004d88 2.21% TbfTest libosmocore.so.7.0.0 [.] bitvec_get_int16_msb 1.60% TbfTest libc-2.18.so [.] memcpy 0.68% TbfTest [kernel.kallsyms] [k] read_cycles
Conclusions and comments in the next mail.
holger
On 13 Jul 2016, at 11:43, Holger Freyther holger@freyther.de wrote:
Hi all,
perf report -i perf.data.tree
33.00% TbfTest.TREE libosmocore.so.7.0.0 [.] bitvec_set_bit_pos 20.46% TbfTest.TREE TbfTest.TREE [.] bitvec_write_field(bitvec*, unsigned int&, unsigned long long, unsigned int) 14.30% TbfTest.TREE libosmocore.so.7.0.0 [.] bitvec_set_bit
so crazily neither the original of the C++ bitvec_write_field nor the C version end up inlining bitvec_set_bit/bitvec_set_bit_pos.
1st) the C++ bitvec_write_field with the reference should be a inline function that calls the C version and passes the parameter as pointer
2nd) We need to get set_bit_pos and set_bit inlined into bitvec_write_field. The wall clock time of my benchmark run goes from ~24s to ~13s if these routines are inlined.
9.94% TbfTest.TREE TbfTest.TREE [.] search_runlen(node*, unsigned char const*, unsigned char, unsigned char*, unsigned short*)
5.27% TbfTest.TREE TbfTest.TREE [.] Decoding::decompress_crbb(signed char, unsigned char, unsigned char const*, bitvec*)
57.51% TbfTest libosmocore.so.7.0.0 [.] osmo_t4_decode
osmo_t4_decode (got the runlen step and such inlined). What I think decompress_crbb is doing better is 1st not using bitvec as input but iterating over the bits itself and being more direct in applying the codeword in the result. What I am missing and have to check is if search_runlen can be implemented around the "table" we have and what the performance difference is. I have asked Max for help.
I will follow up after I have seen the performance difference.
holger
After looking at the code, I don't think reusing table implementation would be easy - the approaches are too different as well as underlying data structures.
On 07/13/2016 11:53 AM, Holger Freyther wrote:
osmo_t4_decode (got the runlen step and such inlined). What I think decompress_crbb is doing better is 1st not using bitvec as input but iterating over the bits itself and being more direct in applying the codeword in the result. What I am missing and have to check is if search_runlen can be implemented around the "table" we have and what the performance difference is. I have asked Max for help.
On 13 Jul 2016, at 12:21, Max msuraev@sysmocom.de wrote:
Hi all,
After looking at the code, I don't think reusing table implementation would be easy - the approaches are too different as well as underlying data structures.
okay. then we will need to use the tree based version and from my point of view we should do it the following way:
* Remove osmo_t4_decode (and related routines) from libosmocore (last step) * Make the tree based decoder ready in terms of the surrounding style * Adopt/Clone the osmo_t4_decode tests and move them to the PCU (as part of the commit that adds the decoder to the PCU)
@Me: After the infrastructure is in the PCU I will remove osmo_t4_decode (and tests) from libosmocore
@Radisys:
I am afraid the tree based decoding is not ready yet and as it impacts the upcoming encoder (and other RLE code as far as I understand) let me put it here.
In osmo-pcu (and all other projects) we use libtalloc for memory allocations. This means the tree should use talloc with proper parent/child allocations. This way the destructor of the decoder will also free all nodes of the tree. There should be no call to malloc/free in the PCU.
Please take the tests from libosmocore and make sure they work with the new decoder. Tests are the lifeline and we need more and not less of them.
The decoder needs to be robust against invalid input. search_runlen can fail but the caller doesn't check for that. Please test with invalid/truncated or broken inputs to see we don't end in a never terminating while loop. Test using unit tests. Similar the current osmo_t4_decode can return an error that is propagated back to the compressed bitmap decoding and we should have the same.
Reduce code duplication. With the decoder the only difference between the if/else is if the data is filled with 0x00 or 0xFF and which table/root to use. Instead of having code clones please parameterize either by having local variables in the beginning or by calling different (inline) functions.
kind regards holger
On 13 Jul 2016, at 13:51, Holger Freyther holger@freyther.de wrote:
Hey!
After looking at the code, I don't think reusing table implementation would be easy - the approaches are too different as well as underlying data structures.
okay. then we will need to use the tree based version and from my point of view we should do it the following way:
- Remove osmo_t4_decode (and related routines) from libosmocore (last step)
- Make the tree based decoder ready in terms of the surrounding style
- Adopt/Clone the osmo_t4_decode tests and move them to the PCU (as part of the commit that adds the decoder to the PCU)
@Me: After the infrastructure is in the PCU I will remove osmo_t4_decode (and tests) from libosmocore
thanks to the review and patience of Neels and Harald the tree based t4 decoding is now in osmo-pcu and I have pushed https://gerrit.osmocom.org/1176 for review.
holger
osmocom-net-gprs@lists.osmocom.org