Dear all,
I would like to hear your opinions about some optimizations of the Viterbi decoder which were discussed previously.
First of all, I would like to share some benchmarking results. I used the test program ("osmo-conv-test") written by Tom Tsou to ensure the SIMD optimization is integrated correctly. In short, the results are almost equal: the older version of the decoder is a little faster, but I think that's because it is compiled with "-march=native".
Returning to the subject: since we allocate and free some memory on every osmo_conv_decode_acc() call, which may happen very frequently and hurt performance on some hardware, the following suggestions were made:
1) Use static memory allocation where possible.
2) Use talloc for dynamic allocation.
3) Internal caching:
Fri May 9 18:23:03 UTC 2014, Tom Tsou wrote:
Internal caching was in the original implementation, but stripped from the submitted version mainly for simplicity and avoiding the need for global variables, though we seem to be having that discussion anyway ;-) The trellis values can be cached based on pointer or hashed code. That works well until threading is involved and cache access needs to be locked. Those are features I need, but can probably be ignored in this case.
Again, I think the API should be kept intact. Internal caching can be a topic for later discussion.
So, I am open to your ideas, opinions and remarks.
With best regards, Vadim Yanitskiy.
Hi Vadim,
On Fri, Jun 16, 2017 at 04:43:12AM +0700, Vadim Yanitskiy wrote:
I used the test program ("osmo-conv-test") written by Tom Tsou to ensure the SIMD optimization is integrated correctly. In short, the results are almost equal: the older version of the decoder is a little faster, but I think that's because it is compiled with "-march=native".
You can compile the new code with the same optimization flags (specified in CFLAGS at configure time). We simply cannot use -march=native as a default, as that is not safe. But you can use it to verify your assumption.
Returning to the subject: since we allocate and free some memory on every osmo_conv_decode_acc() call, which may happen very frequently and hurt performance on some hardware, the following suggestions were made:
- Use static memory allocation where it's possible.
That's generally what we do a lot in osmocom code. If it's predictable how much memory a given process takes, we allocate that memory once (statically or dynamically), use that all the time and then release it once that process/procedure/instance is destroyed.
When I look at the current API, the decode_init()/decode_deinit() functions are great candidates to perform such allocation/release of memory. If applications use the osmo_conv_decode() convenience wrapper, it simply means they don't care about efficiency, but if they use the real init/scan/flush/deinit API, they would benefit from it.
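To illustrate what I mean, here is a rough sketch of the reuse pattern on top of the existing conv.h calls; the burst_decoder_* wrappers are made up for the example, and it assumes a flush-terminated code, like the convenience wrapper handles for CONV_TERM_FLUSH:

#include <osmocom/core/bits.h>
#include <osmocom/core/conv.h>

/* One-time setup: allocate the trellis buffers for 'code' once. */
void burst_decoder_setup(struct osmo_conv_decoder *dec,
			 const struct osmo_conv_code *code)
{
	/* len = 0 means "use code->len"; start from state 0 */
	osmo_conv_decode_init(dec, code, 0, 0);
}

/* Per-frame work: reuse the already allocated buffers instead of
 * allocating and freeing them on every call. */
int burst_decoder_run(struct osmo_conv_decoder *dec,
		      const struct osmo_conv_code *code,
		      const sbit_t *input, ubit_t *output)
{
	int l;

	osmo_conv_decode_reset(dec, 0);

	l = osmo_conv_decode_scan(dec, input, code->len);
	osmo_conv_decode_flush(dec, &input[l]);

	return osmo_conv_decode_get_output(dec, output, 1, 0);
}

/* One-time teardown, e.g. when the channel/instance is destroyed. */
void burst_decoder_release(struct osmo_conv_decoder *dec)
{
	osmo_conv_decode_deinit(dec);
}

The decoder object can then live as long as the channel it serves, so the per-call cost reduces to resetting the accumulated metrics.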
Not sure if scan+flush should/could become one call for simplicity?
The viterbi.c code (which by the way should be renamed to conv_acc.c or the like) would then have to be restructured to plug into init/fini for allocation+release.
- Use talloc for dynamic allocation.
I doubt this will improve your speed. Feel free to try, but this is more of a clean-up to ensure all the memory allocated within osmocom projects is tracked/accounted for, and you can get talloc usage reports, debug memory leaks, etc.
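For example (just a sketch, the context name and buffer sizes are arbitrary), hanging all decoder allocations off one talloc context gives you a single usage report and a single free:

#include <stdio.h>
#include <stdint.h>
#include <talloc.h>

int main(void)
{
	/* Named context that all decoder buffers become children of */
	void *ctx = talloc_named_const(NULL, 0, "conv_acc");

	int16_t *sums  = talloc_zero_array(ctx, int16_t, 64);
	int16_t *paths = talloc_zero_array(ctx, int16_t, 64 * 192);
	(void)sums; (void)paths;

	/* Ownership/usage report, useful when hunting memory leaks */
	talloc_report_full(ctx, stderr);

	/* Releases the whole tree in one call */
	talloc_free(ctx);
	return 0;
}

That is the kind of accounting benefit I mean, not a speed-up.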
Hi Harald,
When I look at the current API, the decode_init()/decode_deinit() functions are great candidates to perform such allocation/release of memory. If applications use the osmo_conv_decode() convenience wrapper, it simply means they don't care about efficiency, but if they use the real init/scan/flush/deinit API, they would benefit from it.
You are talking about Sylvain's implementation of the Viterbi decoder, which is currently used as a fail-safe back-end for the new algorithmically and SIMD-optimized implementation written by Tom Tsou.
In Sylvain's implementation (conv.c), init/scan/flush/deinit are exposed for external usage, while Tom's implementation is transparently integrated into the former and doesn't expose anything outside.
The viterbi.c code (which by the way should be renamed to conv_acc.c or the like) would then have to be restructured to plug into init/fini for allocation+release.
Do you suggest exposing alloc_vdec, free_vdec and osmo_conv_decode_acc as is done in conv.c?
Anyway, using init/fini is profitable only if you are doing serial (en/de)coding of the same convolutional code. As soon as you need to (en/de)code another convolutional code, you have to reallocate and reconfigure the vdec object.
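The only way I see to keep persistent state in that situation is one long-lived decoder per code in use; just a sketch, the setup/teardown helpers and the code pointers are placeholders:

#include <osmocom/core/conv.h>

static struct osmo_conv_decoder dec_a;
static struct osmo_conv_decoder dec_b;

/* Called once at start-up with the two codes we expect to decode */
void decoders_setup(const struct osmo_conv_code *code_a,
		    const struct osmo_conv_code *code_b)
{
	osmo_conv_decode_init(&dec_a, code_a, 0, 0);
	osmo_conv_decode_init(&dec_b, code_b, 0, 0);
}

/* Called once at shutdown */
void decoders_teardown(void)
{
	osmo_conv_decode_deinit(&dec_a);
	osmo_conv_decode_deinit(&dec_b);
}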
(which by the way should be renamed to conv_acc.c or the like)
I like this idea; it's already on my TODO list :)
I doubt this will improve your speed. Feel free to try, but this is more of a clean-up to ensure all the memory allocated within osmocom projects is tracked/accounted for, and you can get talloc usage reports, debug memory leaks, etc.
I agree with you. But what do you think about talloc memory pools?
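Something like this is what I have in mind; the pool size is just an arbitrary example:

#include <stdint.h>
#include <talloc.h>

void decode_with_pool(void)
{
	/* Pre-allocate a 64 KiB pool once (per instance or per thread) */
	void *pool = talloc_pool(NULL, 64 * 1024);

	/* Per-call scratch buffers are carved out of the pool instead of
	 * going through malloc(), as long as the pool has room */
	int16_t *metrics = talloc_array(pool, int16_t, 64);
	uint8_t *history = talloc_array(pool, uint8_t, 64 * 192);
	(void)metrics; (void)history;

	talloc_free(pool);
}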
With best regards, Vadim Yanitskiy.
On Sun, Jun 18, 2017 at 12:54 PM, Vadim Yanitskiy axilirator@gmail.com wrote:
In Sylvain's implementation (conv.c), init/scan/flush/deinit are exposed for external usage, while Tom's implementation is transparently integrated into the former and doesn't expose anything outside.
One note on API usage. Sylvain's init-deinit calls are exposed, but I'm not aware of those calls being used anywhere - only the all-in-one version exists in osmo-bts.
So an API change on the init-deinit or alloc-dealloc of the decoder object probably won't cause many dependency issues. But putting those calls into use involves a large number of changes in the current osmo-bts regardless of which implementation is used.
-TT
Hi.
Microbenchmarking is nice, but I'd like to see the bigger picture: why do we need to optimize the current implementation further? Is it just for the sake of aesthetics and the fun of it? Do we have some profiling result showing that in scenario A the Viterbi decoder occupies XX% of the time?
Don't get me wrong - I'm not against optimizations, just would like to know more about the general context.
On 15.06.2017 23:43, Vadim Yanitskiy wrote:
So, I am open to your ideas, opinions and remarks.
Hi Vadim,
On Thu, Jun 15, 2017 at 2:43 PM, Vadim Yanitskiy axilirator@gmail.com wrote:
Returning to the subject: since we allocate and free some memory on every osmo_conv_decode_acc() call, which may happen very frequently and hurt performance on some hardware, the following suggestions were made:
Max has a valid point in asking whether there is significant value in further optimization when weighed against the cost of a non-trivial API change. There are fielded deployments running the baseline code. Will further optimization make a significant difference in those cases? The answer is system and situation dependent; I can't answer those questions in the general sense.
- Use static memory allocation where it's possible.
- Use talloc for dynamic allocation.
- Internal caching:
Persistent allocation is the best solution. Talloc will minimally affect performance if at all. Internal caching is somewhat of a hack to hide static allocation behind an API that does not allow it.
Fri May 9 18:23:03 UTC 2014, Tom Tsou wrote:
Internal caching was in the original implementation, but stripped from the submitted version mainly for simplicity and avoiding the need for global variables, though we seem to be having that discussion anyway ;-) The trellis values can be cached based on pointer or hashed code. That works well until threading is involved and cache access needs to be locked. Those are features I need, but can probably be ignored in this case.
Three years ago, which feels like much longer, I was performing brute force RNTI searches on LTE control channels. Convolutional decoding was the limiting factor and maxed out multiple threads; every optimization helped. That was a much different set of requirements.
-TT
Hi Max and Tom,
why do we need to optimize the current implementation further? Is it just for the sake of aesthetics and the fun of it?
You are right, this is for the sake of aesthetics. I had a task: to take some dead code from the mailing lists, make it satisfy the Osmocom coding requirements, and merge it part by part. There were some warnings about object visibility (fixed), about the naming convention (fixed) and about memory allocation.
In the past, the discussion about memory allocation was set aside until the code was merged. So, until now, this task has remained in my mind as some kind of unfinished work...
Do we have some profiling result showing that in scenario A the Viterbi decoder occupies XX% of the time?
I am not sure we can profile different parts of the code without actual modifications (probably perf can do that, correct me if I'm wrong). But we can profile the whole implementation using the benchmark code from Tom, which I mentioned before.
Don't get me wrong - I'm not against optimizations, just would like to know more about the general context.
It's ok, I am always happy to hear your and other people's opinions :)
Persistent allocation is the best solution. Talloc will minimally affect performance if at all. Internal caching is somewhat of a hack to hide static allocation behind an API that does not allow it.
I'll try to use persistent allocation where possible, e.g. the struct vdecoder in osmo_conv_decode_acc() could be allocated on the stack instead of the heap.
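Roughly like this; the struct layout and the vdec_init/vdec_deinit helpers below are only stand-ins so the sketch compiles on its own (the real struct vdecoder in viterbi.c has more fields):

#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the internal decoder state */
struct vdecoder {
	int16_t *sums;
	int16_t *paths;
};

static int vdec_init(struct vdecoder *v, int num_states, int len)
{
	v->sums  = malloc(sizeof(int16_t) * num_states);
	v->paths = malloc(sizeof(int16_t) * num_states * len);
	if (!v->sums || !v->paths) {
		free(v->sums);
		free(v->paths);
		return -1;
	}
	return 0;
}

static void vdec_deinit(struct vdecoder *v)
{
	free(v->sums);
	free(v->paths);
}

/* The decoder object itself lives on the stack, saving one malloc/free
 * pair per call; the trellis buffers still need their own storage. */
void decode_once(int num_states, int len)
{
	struct vdecoder dec;

	if (vdec_init(&dec, num_states, len) != 0)
		return;

	/* ... run the actual Viterbi forward/traceback passes here ... */

	vdec_deinit(&dec);
}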
Three years ago, which feels like much longer, I was performing brute force RNTI searches on LTE control channels. Convolutional decoding was the limiting factor and maxed out multiple threads; every optimization helped. That was a much different set of requirements.
Ok, that is exactly what I needed to know. Now I can let internal caching go, as it isn't critical for GSM.
With best regards, Vadim Yanitskiy.