From chiefy.padua at gmail.com  Fri Aug 16 14:27:06 2019
From: chiefy.padua at gmail.com (Tony Clark)
Date: Fri, 16 Aug 2019 15:27:06 +0100
Subject: high ksoftirqd while using module gtp?
In-Reply-To:
References:
Message-ID:

Firstly I would like to say great thanks to Firat for the reply; it
certainly put me on a different investigation path. And apologies for not
replying sooner: I wanted to make sure it was the correct path before I
replied back to the group with the findings and the associated solution.

If the GTP-U connection to the P-GW uses a single IP on each side
(src/dst), and UDP flow hashing hasn't been enabled on the network card of
the host using gtp.ko, all the associated network traffic will be received
on a single queue on the network card, which is then serviced by a single
ksoftirqd thread. At some point the system will be receiving more traffic
than that one thread can service, and ksoftirqd will burn at 100%. That
means all your traffic is bound to a single network queue, bound to a
single IRQ thread, limiting your overall throughput no matter how big your
network pipe is. This is because the network card hashes each packet via
SRC_IP:SRC_PORT:DEST_IP:DEST_PORT:PROTO to a single queue, and for a
single GTP-U tunnel that whole tuple is constant.

# take note of the discussions about udp-flow-hash udp4 using ethtool
https://home.regit.org/tag/performance/
https://www.joyent.com/blog/virtualizing-nics
https://www.serializing.me/2015/04/25/rxtx-buffers-rss-others-on-boot/

You can check whether your card supports adjustable parameters using
"ethtool -k DEV | egrep -v fixed". As Firat alludes to (below), UDP flow
hashing should be supported; if you enable it, the hash will spread the
traffic over multiple queues. The default number of queues on the network
card varies depending on your hardware, firmware, driver and any
additional associated kernel parameters. I would recommend having the
latest firmware for your network card, and the latest kernel driver for it
if possible.

Alas the network cards used by my hardware didn't support flow hashing;
they had Intel Flow Director, which wasn't granular enough and only worked
with TCP. To work around this limitation, having multiple SRC_IPs in
different namespaces with the same GTP UDP port numbers resolved the
problem. Of course, if you send GTP-U to a single destination from
multiple sources (say 6 IPs) via 6 different kernel namespaces, you spread
the load over 6 queues, which is better than nothing on a feature-limited
network card. Time to upgrade the 10G network card...

This took the system from 100% ksoftirqd on a single CPU at about 1 Gbit/s
throughput, to around 7-8 Gbit/s at 90% ksoftirqd spread over multiple
CPUs. There is still massive room for improvement.

Some things to investigate/consider for performance, with which I had
varying levels of success. Here are my ramblings, all on the Linux host,
and assuming your traffic is now spread across multiple queues (above), or
at least spread as well as it can be.

Kernel sysctl tweaking is always of benefit if you're using an
out-of-the-box kernel config: for example UDP buffers, queue sizes, paging
and virtual memory settings (see the sketch below). There is an
application called "tuned" which lets you switch profiles for the kernel
sysctls; the performance profile which suited my testing best was
"throughput-performance".
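To make that concrete, here is a minimal sketch of the flow hash check and
the sort of sysctl/tuned tweaks I mean. The device name ens192 is a
placeholder, and the buffer values are purely illustrative, not
recommendations:

  # how udp4 traffic is currently hashed onto RX queues
  ethtool -n ens192 rx-flow-hash udp4
  # include the UDP ports in the hash: s/d = src/dst IP, f/n = src/dst port
  ethtool -N ens192 rx-flow-hash udp4 sdfn

  # example sysctl bumps for heavy UDP traffic
  sysctl -w net.core.rmem_max=26214400
  sysctl -w net.core.rmem_default=26214400
  sysctl -w net.core.netdev_max_backlog=5000

  # apply the tuned profile mentioned above
  tuned-adm profile throughput-performance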
If you're looking for straight performance, disable audit processing
(auditd), and question the use of SELinux (enforcing/permissive or
disabled); relaxing it can bring results on performance if you're doing
testing or load testing, though of course it's a security consideration.

If you don't need ipfilters/firewalling, disabling it (flushing the filter
tables and unloading the modules) increased throughput by a third in my
case. Blacklist the modules so they don't get loaded at boot time. Note
you can stop modules getting loaded at all with kernel.modules_disabled=1,
but be careful if you're also messing with initramfs rebuilds, because you
don't get any modules once you set that parameter; I learnt that the hard
way :)

Investigate smp_affinity and affinity_hint, along with irqbalance using
--hintpolicy=exact. Understand which IRQs service the network cards, and
how many queues you have; /proc/interrupts will guide you
(egrep 'CPU|rxtx' /proc/interrupts). Understand the smp_affinity numbers:

  for ((irq=START_IRQ; irq<=END_IRQ; irq++)); do
      cat /proc/irq/$irq/smp_affinity
  done | sort -u

as you can adjust which queue goes to which ksoftirqd, to manually balance
the queues if you so desire. A brilliant document on IRQ debugging:
https://events.static.linuxfound.org/sites/events/files/slides/LinuxConJapan2016_makita_160714.pdf

You can monitor which calls are being executed on the CPUs using
FlameGraph (a minimal perf recipe is sketched at the end of this section).
I found this most useful for understanding that ipfilter was eating a
significant amount of CPU cycles, and also which other calls were eating
up cycles inside ksoftirqd.
https://github.com/brendangregg/FlameGraph

Investigate additional memory management using numactl (and the numad
daemon). Remember, if you are using virtualisation you might want to pin
guests to specific sockets, along with NUMA pinning on the VM host. Also
look at reserved memory allocation in the VM host for the guest; this will
make your guest perform better.

Enable sysstat (sar) if you haven't already, as it will aid your
investigation (sar -u ALL -P ALL 1). This will show which softirqs are
eating the most CPU and which CPU they are bound to, which also translates
directly to the network queue the traffic is coming in on. I.e. network
card queue 6 talks to CPU 6, talking to IRQ 6, and so on. Using FlameGraph
will help you understand which syscalls are chewing the CPU.

If you're using virtualisation then the number of default queues that
vmxnet3 (VMware in this example) presents to the guest might be less than
the number of network card queues the VM host sees, so watch out for that.
You can adjust the number of queues given to the guest via parameters in
the VMware network driver. Investigate VMDQ / netqueue to increase the
number of hardware queues available from the VM host to the guest. Also,
depending on which guest driver you're using (vmxnet3 or others), some
drivers don't support NAPI (see further down). The relevant driver
parameters:

  VMDQ: array of int
      Number of Virtual Machine Device Queues: 0/1 = disable,
      2-16 enable (default=8)
  RSS: array of int
      Number of Receive-Side Scaling Descriptor Queues,
      default 1=number of cpus
  MQ: array of int
      Disable or enable Multiple Queues, default 1
  Node: array of int
      set the starting node to allocate memory on, default -1
  IntMode: array of int
      Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2
  InterruptType: array of int
      Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X),
      default IntMode (deprecated)

Make sure your virtual switch (VMware), if used, has pass-through
(DirectPath I/O) enabled. NIC teaming policy should be validated depending
on your requirements; for example the policy "route based on IP hash" can
be of benefit.
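The FlameGraph recipe promised above, as a minimal sketch. It assumes perf
is installed and Brendan Gregg's FlameGraph repository is cloned into
./FlameGraph; adjust the paths for your own layout:

  # sample on-CPU stacks system-wide at 99 Hz for 30 seconds
  perf record -F 99 -a -g -- sleep 30
  # fold the stacks and render an interactive SVG flame graph
  perf script | ./FlameGraph/stackcollapse-perf.pl \
              | ./FlameGraph/flamegraph.pl > out.svg

The ksoftirqd towers in the resulting graph show exactly which kernel
paths (ipfilter in my case) are eating the cycles.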
Check the network card is MSI-X, and that the Linux driver supports NAPI
(most should these days, but you never know). Also check your VM host
driver supports NAPI; if not, get a NAPI-supporting KVM driver or VMware
driver (vib update).

Upgrade your kernel to a later 4.x release, and even consider using a
later Linux distro; I tried Fedora 29. I also compiled the latest osmocom
from source, with compile options for optimisation (-O3 and such).

"bmon -b" was a good tool for understanding throughput loads, along with
the loading through the qdisc/fq_codel MQs. Understand qdiscs via ip link
or ifconfig (http://tldp.org/HOWTO/Traffic-Control-HOWTO/components.html);
adjusting the queues has some traction, but if unsure leave them at the
defaults.

TSO/UFO/GSO/LRO/GRO: understand your network card with respect to these.
Enabling them (or in some cases disabling them, since they don't always
actually help) can improve performance if you haven't already done so. You
can get your card's options using ethtool (example checks below).

TCP Segmentation Offload (TSO)
    Uses the TCP protocol to send large packets. Uses the NIC to handle
    segmentation, and then adds the TCP, IP and data link layer protocol
    headers to each segment.

UDP Fragmentation Offload (UFO)
    Uses the UDP protocol to send large packets. Uses the NIC to handle IP
    fragmentation into MTU-sized packets for large UDP datagrams.

Generic Segmentation Offload (GSO)
    Uses the TCP or UDP protocol to send large packets. If the NIC cannot
    handle segmentation/fragmentation, GSO performs the same operations,
    bypassing the NIC hardware. This is achieved by delaying segmentation
    until as late as possible, for example when the packet is processed by
    the device driver.

Large Receive Offload (LRO)
    Uses the TCP protocol. All incoming packets are re-segmented as they
    are received, reducing the number of segments the system has to
    process. They can be merged either in the driver or using the NIC. A
    problem with LRO is that it tends to resegment all incoming packets,
    often ignoring differences in headers and other information, which can
    cause errors. It is generally not possible to use LRO when IP
    forwarding is enabled; LRO in combination with IP forwarding can lead
    to checksum errors. Forwarding is enabled if
    /proc/sys/net/ipv4/ip_forward is set to 1.

Generic Receive Offload (GRO)
    Uses either the TCP or UDP protocols. GRO is more rigorous than LRO
    when resegmenting packets. For example, it checks the MAC headers of
    each packet, which must match; only a limited number of TCP or IP
    headers can be different; and the TCP timestamps must match.
    Resegmenting can be handled by either the NIC or the GSO code.
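The example ethtool checks referred to above, as a minimal sketch (ens192
is a placeholder; which flags can be toggled varies by card and driver):

  # show current offload state; TSO/UFO/GSO/LRO/GRO all appear in this list
  ethtool -k ens192 | egrep -v fixed

  # LRO plus IP forwarding is the known bad combination described above
  cat /proc/sys/net/ipv4/ip_forward
  ethtool -K ens192 lro off gro on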
Traffic steering was on by default with the version of Linux I was using,
but it is worth checking if you're on an older version:
https://www.kernel.org/doc/Documentation/networking/scaling.txt

(from the txt link) note: Some advanced NICs allow steering packets to
queues based on programmable filters. For example, webserver bound TCP
port 80 packets can be directed to their own receive queue. Such "n-tuple"
filters can be configured from ethtool (--config-ntuple).

Interestingly, investigate your network card's hashing algorithms, i.e.
how it distributes the traffic over its ring buffers; on some cards you
can adjust the RSS hash function. Alas the card I was using was stuck with
"toeplitz" for its hashing, while the others (xor and crc32) were disabled
and unavailable. The indirection table can be adjusted based on the tuples
("ethtool -X"), but that didn't really assist too much here.

  ethtool -x ens192
  RX flow hash indirection table for ens192 with 8 RX ring(s):
      0:    0    1    2    3    4    5    6    7
      8:    0    1    2    3    4    5    6    7
     16:    0    1    2    3    4    5    6    7
     24:    0    1    2    3    4    5    6    7
  RSS hash key:
  Operation not supported
  RSS hash function:
      toeplitz: on
      xor: off
      crc32: off

Check the default sizes of the rx/tx ring buffers; they may be suboptimal.

  ethtool -g ens192
  Ring parameters for ens192:
  Pre-set maximums:
  RX:             4096
  RX Mini:        0
  RX Jumbo:       4096
  TX:             4096
  Current hardware settings:
  RX:             1024
  RX Mini:        0
  RX Jumbo:       256
  TX:             512

If you're using port channels, make sure you have the correct hashing
policy enabled at the switch end. I haven't investigated this option yet,
but some switches also do scaling to assist (certainly with
virtualisation); maybe one day I will get around to this. Additionally,
Cisco describe the VM-FEX optimisations you should have:
https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/unified-computing/vm_fex_best_practices_deployment_guide.html

note:
Table 4. Scaling of Dynamic vNIC with VMDirectPath, Virtual Machines
Running on Linux Guest with VMXNET3 Emulated Driver and Multi-Queue
Enabled
Table 5. Scaling of Dynamic vNIC with VMDirectPath, Virtual Machines
Running on Linux Guest with VMXNET3 Emulated Driver and Multi-Queue
Disabled

Another thing to consider/investigate: Open vSwitch versus plain bridging.
If you're using veth pairs to send your traffic down namespaces, you can
get varied performance results by trying Open vSwitch against brctl.
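For reference, a hypothetical sketch of the veth-pair-into-namespace
pattern, which is also how I gave each GTP-U source its own IP earlier.
All names and addresses here are placeholders:

  # a namespace with its own source IP for one GTP-U flow
  ip netns add gtp1
  ip link add veth-gtp1 type veth peer name veth-gtp1-peer
  ip link set veth-gtp1-peer netns gtp1
  ip addr add 192.0.2.1/30 dev veth-gtp1
  ip link set veth-gtp1 up
  ip netns exec gtp1 ip addr add 192.0.2.2/30 dev veth-gtp1-peer
  ip netns exec gtp1 ip link set veth-gtp1-peer up
  ip netns exec gtp1 ip route add default via 192.0.2.1

Repeat per source IP; the host ends of the pairs can then be plugged into
a Linux bridge or an Open vSwitch bridge to compare the two.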
I really enjoyed the investigation path. Again, thanks to Firat for the
pointer, otherwise it would have taken much longer to get to the answer.

Tony

On Fri, Jun 21, 2019 at 6:50 AM Fırat Sönmez wrote:

> Hi,
>
> It has been over 2 years since I worked with gtp, and I had much the
> same problem at the time: we had a 10gbit cable and tried to see how much
> udp flow we could get. I think we used iperf to test it, and when we
> listed all the processes, ksoftirqd was using all the resources. Then I
> found this page:
> https://blog.cloudflare.com/how-to-receive-a-million-packets/. I do not
> remember the exact solution, but I guess when you configure your outbound
> ethernet interface with the command below, it should work. To my
> understanding, all the packets are processed on the same core in your
> situation, because the port number is always the same. So, for example,
> if you add another network with a gtp-u tunnel on another port (different
> from 3386) then your packets will be processed on another core, too. But
> with the command below, the interface will be configured in a way that it
> won't use the port to decide on which core a packet should be processed,
> but will use the hash from the packet to distribute over the cores.
>
> ethtool -N (your_out_eth_interface) rx-flow-hash udp4 sdfn
>
> Hope it will work for you.
>
> Fırat
>
> Tony Clark wrote on Wed, 19 Jun 2019 at 15:07:
>
>> Dear All,
>>
>> I've been using the GTP-U kernel module to communicate with a P-GW.
>>
>> Running Fedora 29, kernel 4.18.16-300.fc29.x86_64.
>>
>> At high traffic levels through the GTP-U tunnel I see the performance
>> degrade as 100% CPU is consumed by a single ksoftirqd process.
>>
>> It is running on a multi-cpu machine and as far as I can tell the load
>> is evenly spread across the cpus (ie either manually via smp_affinity,
>> or even irqbalance, checking /proc/interrupts and so forth).
>>
>> Has anyone else experienced this?
>>
>> Is there any particular area you could recommend I investigate to find
>> the root cause of this bottleneck, as I'm starting to scratch my head
>> where to look next...
>>
>> Thanks in advance
>> Tony
>>
>> ---- FYI
>>
>> modinfo gtp
>> filename:       /lib/modules/4.18.16-300.fc29.x86_64/kernel/drivers/net/gtp.ko.xz
>> alias:          net-pf-16-proto-16-family-gtp
>> alias:          rtnl-link-gtp
>> description:    Interface driver for GTP encapsulated traffic
>> author:         Harald Welte
>> license:        GPL
>> depends:        udp_tunnel
>> retpoline:      Y
>> intree:         Y
>> name:           gtp
>> vermagic:       4.18.16-300.fc29.x86_64 SMP mod_unload
>>
>> modinfo udp_tunnel
>> filename:       /lib/modules/4.18.16-300.fc29.x86_64/kernel/net/ipv4/udp_tunnel.ko.xz
>> license:        GPL
>> depends:
>> retpoline:      Y
>> intree:         Y
>> name:           udp_tunnel
>> vermagic:       4.18.16-300.fc29.x86_64 SMP mod_unload