Hi all,
I'm moving this conversation from private to public. I think Sylvain and Andreas might be interested in participating.
---------- Forwarded message ---------- From: Thomas Tsou tom@tsou.cc Date: Tue, Jul 9, 2013 at 2:51 AM Subject: Re: DSP optimization To: Alexander Chemeris alexander.chemeris@gmail.com
On Mon, Jul 8, 2013 at 6:52 AM, Alexander Chemeris alexander.chemeris@gmail.com wrote:
Wow, optimization of 5-16x for Viterbi is huge indeed. I wonder what would be results for our Atoms.
Without SSE, just C only butterfly, the improvement is around 4x. SSE 3 (Atom) forces a small change on the normalization (not separated out yet), but the results weren't very far off from SSE 4.1 when I tested on Core 2 Duo.
I might try to manipulate the interface to read in the state tables instead of the generator polynomials. That would really help with testing and integration, but I'm not sure yet. There are many ways to go here.
What is problematic with the runtime detection? CPU autodetection on Linux should be as easy as reading /proc/cpuinfo. But I see an issue is with correctly setting up build system to generate all version on the same run. I think we could leave CPU autodetection for the "everything else" milestone, using compile time selection for now.
I think compile time detection is more appropriate. For GSM / LTE we're almost always dealing with fixed sized vectors and not odd calculations (e.g. 1023 size FFT), so it's unlikely that the results will change on repeated runs.
/proc/cpuinfo parsing scripts I've seen have been prone to breakage. If you have a really good one, let me know. I usually prefer to run configure checks against the actual instruction, but that can get messy with a lot of checks. Anyhow, I'm not worrying about this now.
What repository will you push at? We need to have at least master branch and dual-channel branch working with the optimizations. And I believe everyone would be happy to see optimizations in the libosmocore for the benefit of other projects as well. I don't foresee any issues with a slight change in the API of libosmocore if it is justified - just send an RFC/patch to the OpenBSC mailing list and it will be reviewed.
Non-Viterbi changes are sigProc.cpp changes only, so they are not branch-specific - they will probably merge into the oldest available OpenBTS releases. The Viterbi changes merge into Andreas's branch, which is a very large change. For now, somebody needs to write it, which is why I'm considering making the interfaces match.
Attached are the standalone unit test cases for SSE 4.2. As previously mentioned, Atom needs SSE3 only. I'll add the ifdefs for those shortly. I don't know if there's an appropriate repository for these right now - linking libosmocore from the transceiver for comparison purposes only seems silly. I just generated a temporary tarball for the time being.
Thomas
-- Regards, Alexander Chemeris. CEO, Fairwaves LLC / ООО УмРадио http://fairwaves.ru
Hi Thomas,
From: Thomas Tsou tom@tsou.cc On Mon, Jul 8, 2013 at 6:52 AM, Alexander Chemeris alexander.chemeris@gmail.com wrote:
I think compile time detection is more appropriate.
Yes, that's fine for now.
/proc/cpuinfo parsing scripts I've seen have been prone to breakage. If you have a really good one, let me know. I usually prefer to run configure checks against the actual instruction, but that can get messy with a lot of checks. Anyhow, I'm not worrying about this now.
Sorry, I'm not aware of any.
What repository will you push at? We need to have at least master branch and dual-channel branch working with the optimizations. And I believe everyone would be happy to see optimizations in the libosmocore for the benefit of other projects as well. I don't foresee any issues with a slight change in the API of libosmocore if it is justified - just send an RFC/patch to the OpenBSC mailing list and it will be reviewed.
Non-Viterbi changes are sigProc.cpp changes only, so they are not branch-specific - they will probably merge into the oldest available OpenBTS releases. The Viterbi changes merge into Andreas's branch, which is a very large change. For now, somebody needs to write it, which is why I'm considering making the interfaces match.
Could you publish them for review? It's hard to talk about a code without looking at it.
Attached are the standalone unit test cases for SSE 4.2. As previously mentioned, Atom needs SSE3 only. I'll add the ifdefs for those shortly. I don't know if there's an appropriate repository for these right now - linking libosmocore from the transceiver for comparison purposes only seems silly. I just generated a temporary tarball for the time being.
Ok, waiting for the updated version to test it on my Core 2 Duo and Atoms.
-- Regards, Alexander Chemeris. CEO, Fairwaves LLC / ООО УмРадио http://fairwaves.ru
On Tue, Jul 9, 2013 at 4:54 PM, Alexander Chemeris alexander.chemeris@gmail.com wrote:
From: Thomas Tsou tom@tsou.cc Attached are the standalone unit test cases for SSE 4.2. As previously mentioned, Atom needs SSE3 only. I'll add the ifdefs for those shortly. I don't know if there's an appropriate repository for these right now - linking libosmocore from the transceiver for comparison purposes only seems silly. I just generated a temporary tarball for the time being.
Ok, waiting for the updated version to test it on my Core 2 Duo and Atoms.
Pushed for the time being to:
https://github.com/ttsou/sse-tests.git
To enable SSE4:
./configure --with-sse4
Vector 16-bit integer to floating point conversion is only enabled with SSE4. In general, the SSE type conversion benefits are marginal because the native conversion instructions operate on 32-bit widths, which requires a separate sign extension. If there is no dedicated instruction for sign extension (SSE3) then the benefits (if any) are reduced even more.
Thomas
Hi,
I don't foresee any issues with a slight change in the API of libosmocore if it is justified - just send an RFC/patch to the OpenBSC mailing list and it will be reviewed.
I _do_ see an issue with ABI and/or API change.
The conv code is used in many projects with a bunch of different code so any change would need to have a very good justification of why it's needed and should be absolutely necessary.
Also, it should be obvious but any change in libosmocore must stay compatible with all codes that have been supported until now and not just a subset of GSM/LTE specific codes.
All the GSM channel coding will also be moving to libosmocore soonish to allow usage in other projects so keeping some specific stuff in osmo-bts will not be a viable option for long.
Cheers,
Sylvain
On Wed, Jul 10, 2013 at 2:26 PM, Sylvain Munaut 246tnt@gmail.com wrote:
I don't foresee any issues with a slight change in the API of libosmocore if it is justified - just send an RFC/patch to the OpenBSC mailing list and it will be reviewed.
I _do_ see an issue with ABI and/or API change.
The conv code is used in many projects with a bunch of different code so any change would need to have a very good justification of why it's needed and should be absolutely necessary.
This was the issue I was trying to raise. The patch (that doesn't exist) for the API change on osmo-bts alone is huge. I don't like any API breaking changes, and I have limited interest in making changes to osmo-bts. I do like compatibility layers though.
Also, it should be obvious but any change in libosmocore must stay compatible with all codes that have been supported until now and not just a subset of GSM/LTE specific codes.
Aside from testing all the various combinations, I'm less concerned about structural (non-API) compatibility. The convolutional codes are defined by the trellis structure or, equivalently, shift register and generator polynomials. The Viterbi algorithm is fundamentally the same in all cases. Unless there are some truly bizarre cases, which you would know more about than I do, I see this as less of an issue.
All the GSM channel coding will also be moving to libosmocore soonish to allow usage in other projects so keeping some specific stuff in osmo-bts will not be a viable option for long.
Leaving aside LTE code, the implementation in question only exists in the posted test cases.
Thomas
Hi,
This was the issue I was trying to raise. The patch (that doesn't exist) for the API change on osmo-bts alone is huge. I don't like any API breaking changes, and I have limited interest in making changes to osmo-bts. I do like compatibility layers though.
Yes, both adapting the code and/or adding compatibility layer are no issue.
Ideally the code should just end up in libosmocore, (being compiled if supported), and just "work" with any code that previously worked, possibly fallbacking to the reference impl at runtime if something weird is not supported.
And all in all be completely transparent to the user, just accelerating it if possible.
Also, it should be obvious but any change in libosmocore must stay compatible with all codes that have been supported until now and not just a subset of GSM/LTE specific codes.
Aside from testing all the various combinations, I'm less concerned about structural (non-API) compatibility. The convolutional codes are defined by the trellis structure or, equivalently, shift register and generator polynomials. The Viterbi algorithm is fundamentally the same in all cases. Unless there are some truly bizarre cases, which you would know more about than I do, I see this as less of an issue.
The treillis scan is very similar in all cases. The only thing to support is the puncturing during the scan (I wanted to avoid having the de-puncture first). It also currently supports progressive decoding (i.e. you can start decoding a message before you have all of it if you want, you just feed it chunk by chunk and it can resume where it left off).
Most of the changes between variations come from the way the termination is handled but since that's always only a few symbols, that could be left alone unoptimized (assuming the internal state is compatible enough).
Cheers,
Sylvain
On Wed, Jul 10, 2013 at 3:46 PM, Sylvain Munaut 246tnt@gmail.com wrote:
Ideally the code should just end up in libosmocore, (being compiled if supported), and just "work" with any code that previously worked, possibly fallbacking to the reference impl at runtime if something weird is not supported.
The optimized portions, PMU and BMU, can stand independently, and my preferred approach would be to integrate those pieces with the existing trellis parts intact. That assumes that the accumulated path metrics, branch metrics, and path decisions are each represented by contiguously allocated 16-bit values. The downside of such an approach is that it would probably take far more time to integrate than the time to actually create the optimizations.
The treillis scan is very similar in all cases. The only thing to support is the puncturing during the scan (I wanted to avoid having the de-puncture first).
What is the benefit of this? I explicitly separated puncturing because puncturing is inherently not part of the Viterbi algorithm. Admittedly, though, I'm probably biased because for LTE I can remove the puncturing unit entirely and directly replace it with a rate-matching unit.
The SSE optimization operates vertically through the trellis, so in-scan puncturing can be accommodated.
It also currently supports progressive decoding (i.e. you can start decoding a message before you have all of it if you want, you just feed it chunk by chunk and it can resume where it left off).
If not enough bits exist to decode a message, why attempt to partially decode it instead of waiting for the other bits? I could add support for this; it's just not a use case that I ever encountered.
Most of the changes between variations come from the way the termination is handled but since that's always only a few symbols, that could be left alone unoptimized (assuming the internal state is compatible enough).
Right. I don't think termination cases (or tail cases in general) are worth handling with any optimized code.
Thomas
Hi,
The optimized portions, PMU and BMU, can stand independently, and my preferred approach would be to integrate those pieces with the existing trellis parts intact. That assumes that the accumulated path metrics, branch metrics, and path decisions are each represented by contiguously allocated 16-bit values.
Well, the internal state is "private" to the implementation so if there is need for change to better match the SSE and make it easier to match SSE/non-SSE code, that shouldn't be an issue.
The downside of such an approach is that it would probably take far more time to integrate than the time to actually create the optimizations.
Well, that's often the case. Getting some proof of concept is "easy". The cleanup and proper integration is definitely a significant part of the work.
The treillis scan is very similar in all cases. The only thing to support is the puncturing during the scan (I wanted to avoid having the de-puncture first).
What is the benefit of this? I explicitly separated puncturing because puncturing is inherently not part of the Viterbi algorithm.
It was done inside the "common code" because all the protocols that were targeted but the initial code used it and it allowed better code re-use. Now some actually do it externally themselves (TETRA) and just set the 'puncture' to NULL.
Now, it was done in-line vs two-steps to avoid an allocation and data movement. IIRC I tested both cases and the perf difference weren't significant. But I'm not strictly attached to it. If we have to change it into two steps were it's depunctured in a first loop then processed, it would be acceptable.
It also currently supports progressive decoding (i.e. you can start decoding a message before you have all of it if you want, you just feed it chunk by chunk and it can resume where it left off).
If not enough bits exist to decode a message, why attempt to partially decode it instead of waiting for the other bits? I could add support for this; it's just not a use case that I ever encountered.
It was an effort to support "continuous" streams rather than single messages. Part of this capability also useful to support tail-biting codes where you need to basically feeding the data twice without resetting everything.
But all that needed is to store offsets and support not resetting path costs, so nothing touching the 'core' of the loop so it shouldn't be an issue.
Cheers,
Sylvain