On Fri, May 9, 2014 at 4:22 AM, Sylvain Munaut 246tnt@gmail.com wrote:
Among all those not marked 'static' and don't have an osmo_ prefix, I didn't yet do an exhaustive check to see what was an oversight and what's really needed to be global.
Overall, I don't think there should be _any_ non-static or global variable as that was the original intent. Admittedly, there were some oversights in the architecture specific BMU/PMU calls.
- Did you check (and test in make test) that it works for all codes
that matches "if ((code->N <= 4) && ((code->K == 5) || (code->K == 7)))" ?
Almost all codes used by osmo-bts and a few others are covered by make check. Rates N=5 are skipped and TCH/HR passes with the posted puncturing fix for osmo-bts.
I was more thinking about random codes. Like if someone invents a random code with N=4 and K=5, will it work ? When implementing it, you didnt see any other limitations that those right ?
The original implementation used octal based generator polynomials, so all non-degenerate cases are handled. By degenerate case I mean something like a K=7 code with generator polynomial of 0104, which is really a shifted K=5 code. No practical implementation will handle those cases of bad codes because the symmetric/anti-symmetric ACS properties of the trellis vanish.
Similarly for recursive codes, the register feedback is based on the generator polynomial with the allowance that the systematic bit can be on any output. The AFS codes require that, which was really annoying.
- Single file for generic and architecture specific implementations
with large ifdef blocks. 2. Separate files with exported symbols (current patch). 3. Separate files with restricted visibility with __attribute__((__visibility__("hidden"))) or other combinations with the gcc -fvisibility flag.
I think we'll need to play with 3. Still probably need a prefix (not the full "osmo_conv_", but something less conflicty thatn gen_)
Agreed on internal prefixing.
Currently, libosmocore checks for the availability of hidden symbol visibility, sets SYMBOL_VISIBILITY="-fvisibility=hidden", but ignores use the result. Without making library wide changes or using ifdef blocks, the only approach is to mark the osmo_conv_gen calls with hidden visibility
When I designed the API, I did do benchmark and just repeated them, and there is virtually no difference between the two versions. ( < 1% )
The allocation effects are case specific and more significant as the compute load drops. Also, you may have noticed the bitswap operations, which only exist because the trellis and API have conflicting trellis endian formats. Those calls are only used during initialization. Parallel decoding threads also seems affected.
Overall though, for GSM where the data rates are quite low, the allocate and tear-down effects, while measurable, are probably not significant enough to justify an API change. There may be a case on ARM or Epiphany, but I have not tested those platforms enough to comment.
But I guess that's because in the current version, the 'decoder' struct stores virtually nothing that can be re-used between different bursts. (and malloc itself is pretty fast nowadays)
The main memory allocation is the path metric storage, which is unavoidable, and similar in both cases. The current version does store paths with 8-bits instead of 16-bits, so that allocation is halved. The SSE outputs are 16-bit so converting to 8-bit at every stage wasn't worth the overhead. Still, cache performance on the submitted code codes out far ahead.
$ perf stat -e cache-misses,page-faults ./conv_test -c 3 -b -i 500000 -o
================================================= [+] Testing: GPRS CS3 [.] Specs: (N=2, K=5, non-recursive, flushed, not punctured) [.] Input length : ret = 334 exp = 334 -> OK [.] Output length : ret = 676 exp = 676 -> OK
[.] Performance benchmark: [..] Encoding / Decoding 500000 bursts on 1 thread(s): [..] Testing base: [..] Elapsed time....................... 18.189684 secs [..] Rate............................... 9.181028 Mbps
Performance counter stats for './conv_test -c 3 -b -i 500000 -o':
65,116 cache-misses 284 page-faults
18.191569647 seconds time elapsed
$ perf stat -e cache-misses,page-faults ./conv_test -c 3 -b -i 500000 -s
================================================= [+] Testing: GPRS CS3 [.] Specs: (N=2, K=5, non-recursive, flushed, not punctured) [.] Input length : ret = 334 exp = 334 -> OK [.] Output length : ret = 676 exp = 676 -> OK
[.] Performance benchmark: [..] Encoding / Decoding 500000 bursts on 1 thread(s): [..] Testing SIMD: [..] Elapsed time....................... 1.820236 secs [..] Rate............................... 91.746345 Mbps
Performance counter stats for './conv_test -c 3 -b -i 500000 -s':
11,508 cache-misses 288 page-faults
1.822036339 seconds time elapsed
In your case there is the treillis derived from the 'code' struct. And I guess this is where the performance hits comes from. An option would be to have an internal 'cache' of codes that have been used in the past so those data would be re-used.
Internal caching was in the original implementation, but stripped from the submitted version mainly for simplicity and avoiding the need for global variables, though we seem to be having that discussion anyway ;-) The trellis values can be cached based on pointer or hashed code. That works well until threading is involved and cache access needs to be locked. Those are features I need, but can probably be ignored in this case.
Again, I think the API should be kept intact. Internal caching, can be a topic for later discussion.
-TT