Re: [PATCH 1/4] core/conv: Add optimized Viterbi decoding

10 May 2014

      On Fri, May 9, 2014 at 4:22 AM, Sylvain Munaut 246tnt@gmail.com wrote:
...
Among all those not marked 'static' and don't have an osmo_ prefix, I didn't
yet do an exhaustive check to see what was an oversight and what's really
needed to be global.
Overall, I don't think there should be _any_ non-static or global
variable as that was the original intent. Admittedly, there were some
oversights in the architecture specific BMU/PMU calls.
...
...
...

Did you check (and test in make test) that it works for all codes

that matches
   "if ((code->N <= 4) && ((code->K == 5) || (code->K == 7)))" ?
Almost all codes used by osmo-bts and a few others are covered by make
check. Rates N=5 are skipped and TCH/HR passes with the posted
puncturing fix for osmo-bts.
I was more thinking about random codes. Like if someone invents a random
code with N=4 and K=5, will it work ? When implementing it, you didnt see
any other limitations that those right ?
The original implementation used octal based generator polynomials, so
all non-degenerate cases are handled. By degenerate case I mean
something like a K=7 code with generator polynomial of 0104, which is
really a shifted K=5 code. No practical implementation will handle
those cases of bad codes because the symmetric/anti-symmetric ACS
properties of the trellis vanish.
Similarly for recursive codes, the register feedback is based on the
generator polynomial with the allowance that the systematic bit can be
on any output. The AFS codes require that, which was really annoying.
...
...

Single file for generic and architecture specific implementations

with large ifdef blocks.
2. Separate files with exported symbols (current patch).
3. Separate files with restricted visibility with
__attribute__((__visibility__("hidden"))) or other combinations with
the gcc -fvisibility flag.
I think we'll need to play with 3. Still probably need a prefix (not
the full "osmo_conv_", but something less conflicty thatn gen_)
Agreed on internal prefixing.
Currently, libosmocore checks for the availability of hidden symbol
visibility, sets SYMBOL_VISIBILITY="-fvisibility=hidden", but ignores
use the result. Without making library wide changes or using ifdef
blocks, the only approach is to mark the osmo_conv_gen calls with
hidden visibility
...
When I designed the API, I did do benchmark and just repeated them, and
there is virtually no difference between the two versions. ( < 1% )
The allocation effects are case specific and more significant as the
compute load drops. Also, you may have noticed the bitswap operations,
which only exist because the trellis and API have conflicting trellis
endian formats. Those calls are only used during initialization.
Parallel decoding threads also seems affected.
Overall though, for GSM where the data rates are quite low, the
allocate and tear-down effects, while measurable, are probably not
significant enough to justify an API change. There may be a case on
ARM or Epiphany, but I have not tested those platforms enough to
comment.
...
But I guess that's because in the current version, the 'decoder' struct
stores virtually nothing that can be re-used between different bursts.
(and malloc itself is pretty fast nowadays)
The main memory allocation is the path metric storage, which is
unavoidable, and similar in both cases. The current version does store
paths with 8-bits instead of 16-bits, so that allocation is halved.
The SSE outputs are 16-bit so converting to 8-bit at every stage
wasn't worth the overhead. Still, cache performance on the submitted
code codes out far ahead.
$ perf stat -e cache-misses,page-faults ./conv_test -c 3 -b -i 500000 -o
=================================================
[+] Testing: GPRS CS3
[.] Specs: (N=2, K=5, non-recursive, flushed, not punctured)
[.] Input length  : ret = 334  exp = 334 -> OK
[.] Output length : ret = 676  exp = 676 -> OK
[.] Performance benchmark:
[..] Encoding / Decoding 500000 bursts on 1 thread(s):
[..] Testing base:
[..] Elapsed time....................... 18.189684 secs
[..] Rate............................... 9.181028 Mbps
Performance counter stats for './conv_test -c 3 -b -i 500000 -o':
65,116      cache-misses
               284      page-faults
18.191569647 seconds time elapsed
$ perf stat -e cache-misses,page-faults ./conv_test -c 3 -b -i 500000 -s
=================================================
[+] Testing: GPRS CS3
[.] Specs: (N=2, K=5, non-recursive, flushed, not punctured)
[.] Input length  : ret = 334  exp = 334 -> OK
[.] Output length : ret = 676  exp = 676 -> OK
[.] Performance benchmark:
[..] Encoding / Decoding 500000 bursts on 1 thread(s):
[..] Testing SIMD:
[..] Elapsed time....................... 1.820236 secs
[..] Rate............................... 91.746345 Mbps
Performance counter stats for './conv_test -c 3 -b -i 500000 -s':
11,508      cache-misses
               288      page-faults
1.822036339 seconds time elapsed
...
In your case there is the treillis derived from the 'code' struct. And I
guess this is where the performance hits comes from. An option would be
to have an internal 'cache' of codes that have been used in the past
so those data would be re-used.
Internal caching was in the original implementation, but stripped from
the submitted version mainly for simplicity and avoiding the need for
global variables, though we seem to be having that discussion anyway
;-) The trellis values can be cached based on pointer or hashed code.
That works well until threading is involved and cache access needs to
be locked. Those are features I need, but can probably be ignored in
this case.
Again, I think the API should be kept intact. Internal caching, can be
a topic for later discussion.
-TT

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Re: [PATCH 1/4] core/conv: Add optimized Viterbi decoding