high ksoftirqd while using module gtp?


chiefy chiefy.padua at gmail.com
Fri Nov 22 10:26:51 UTC 2019


Dear All,

To update you on investigations.

If you want to push throughput even further and you're running
hypervisors or similar, I recommend enabling SR-IOV on the network cards.

Naturally your network cards need to support SR-IOV (check your tech
specs). And in the case of virtualisation, SR-IOV might require
licensing from the vendor.

This does need some changes to the BIOS settings on your hardware (to
enable SR-IOV and VT-x/VT-d, or the IOMMU if you're on AMD).

You will also have to configure your hypervisors to support sr-iov.

You also need to configure your VM guests to use the newly presented
network cards (VFs).

Don't over-allocate VFs on your physical NICs via SR-IOV; you might
run out of interrupts :-D
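
On a modern kernel the VFs can be created through sysfs. A minimal
sketch, assuming a PF named eth0 (adjust the interface name and VF
count for your hardware):

# check how many VFs the physical NIC supports
cat /sys/class/net/eth0/device/sriov_totalvfs
# create 4 VFs (must not exceed the total above; write 0 to remove)
echo 4 > /sys/class/net/eth0/device/sriov_numvfs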

You will see even better throughput, reduced latency, lower power
consumption and lower resource utilisation on your hypervisors.


Hope this helps.


On Fri, 2019-08-16 at 15:27 +0100, Tony Clark wrote:
> Firstly I would like to say great thanks to Firat for the reply; it
> certainly put me on a different investigation path. And apologies
> for not replying sooner - I wanted to make sure it was the correct
> path before I replied back to the group with the findings and
> associated solution.
> 
> 
> If the GTP-U connection connects to the P-GW with a single IP at
> each side (src/dst), and UDP flow hashing hasn't been enabled on
> the network card of the host using gtp.ko in the kernel, all the
> associated network traffic will be received on a single queue on
> the network card, which is then serviced by a single ksoftirqd
> thread. At some point the system will be receiving more traffic
> than the thread can service, and your ksoftirqd will burn at 100%.
> That means all your traffic will be bound to a single network
> queue, bound to a single IRQ thread, limiting your overall
> throughput, no matter how big your network pipe is.
> 
> This is because the network card hashes the packet via
> SRC_IP:SRC_PORT:DEST_IP:DEST_PORT:PROTO to a single queue.
> 
> # take note of the discussions about udp-flow-hash udp4 using ethtool
> https://home.regit.org/tag/performance/
> https://www.joyent.com/blog/virtualizing-nics
> https://www.serializing.me/2015/04/25/rxtx-buffers-rss-others-on-boot/
> 
> You can check if your card supports adjustable parameters by using
> "ethtool -k DEV | egrep -v fixed". As Firat alludes to (below), UDP
> flow hashing should be supported.
> 
> If you enable UDP flow hashing then the hash will be spread over
> multiple queues. The default number of queues on the network card
> can vary, depending on your hardware, firmware, driver, and any
> additional associated kernel parameters.
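> 
> As a sketch of what that looks like with ethtool (eth0 is an example
> device name; "sdfn" means hash on src/dst IP plus src/dst port):
> 
> # show which fields are currently hashed for UDP over IPv4
> ethtool -n eth0 rx-flow-hash udp4
> # include src/dst IP (sd) and src/dst UDP port (fn) in the hash
> ethtool -N eth0 rx-flow-hash udp4 sdfn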
> 
> I would recommend having the latest firmware for your network card,
> and the latest kernel driver for the network card if possible.
> 
> Alas the network cards used by my hardware didn't support flow
> hashing; they had Intel Flow Director, which wasn't granular enough
> and only worked with TCP. So to work around this limitation, having
> multiple SRC_IPs in different namespaces with the same GTP UDP port
> numbers resolved the problem. Of course if you are sending GTP-U to
> a single destination from multiple sources (say 6 IPs), via 6
> different kernel namespaces, you spread the load over 6 queues,
> which is better than nothing on a limited-feature network card.
> Time to upgrade the 10G network card....
> 
> This took the system from 100% ksoftirqd on a single CPU at 1G
> throughput, to around 7 to 8G throughput at 90% ksoftirqd over
> multiple CPUs... There is still massive room for improvement.
> 
> 
> 
> For performance, some things to investigate/consider, with which I
> had different levels of success... Here are my ramblings.....
> 
> On the Linux host... assuming your traffic is now spread across
> multiple queues (above) - or at least spread as best as it can be...
> 
> Kernel sysctl tweaking is always of benefit if you're using an
> out-of-the-box kernel config... For example UDP buffers, queue
> sizes, paging and virtual memory settings... There is an application
> called "tuned", which allows you to adjust profiles for the kernel
> sysctls... My performance profile that suited the testing best was
> "throughput-performance".
> 
> If you're looking for straight performance, disable audit processing
> like "auditd".
> 
> Question your use of SELinux (enforcing/permissive or disabled); it
> can bring results on performance if you're doing testing or load
> testing... of course it's a security consideration..
> 
> If you don't need to use ipfilters/a firewall, disabling them
> (flushing the filter tables and unloading the modules) increased the
> throughput by a third in my case. Blacklist the modules so they
> don't get loaded at boot time. Note you can stop modules being
> loaded at all with kernel.modules_disabled=1, but be careful if
> you're also messing with initramfs rebuilds, because you don't get
> any modules once you set that parameter - I learnt that the hard
> way :)
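> 
> Roughly what that looks like (module names vary by kernel and by
> which netfilter features you actually have loaded; note blacklisting
> only stops alias-based autoloading):
> 
> # flush the filter rules, then unload the modules
> iptables -F
> modprobe -r iptable_filter ip_tables
> # stop them loading at boot (file name is just an example)
> echo "blacklist ip_tables" > /etc/modprobe.d/no-ipfilter.conf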
> 
> Investigate smp_affinity and affinity_hint, along with irqbalance
> using --hintpolicy=exact. Understand which IRQs service the network
> cards, and how many queues you have... /proc/interrupts will guide
> you (grep -E 'CPU|rxtx' /proc/interrupts)... Understand the
> smp_affinity numbers: "for ((irq=START_IRQ; irq<=END_IRQ; irq++));
> do cat /proc/irq/$irq/smp_affinity; done | sort -u", as you can
> adjust which queue goes to which ksoftirqd to manually balance the
> queues if you so desire. Brilliant document on IRQ debugging:
> https://events.static.linuxfound.org/sites/events/files/slides/LinuxConJapan2016_makita_160714.pdf
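> 
> For manual balancing, a sketch (IRQ 42 is an example number taken
> from /proc/interrupts; the bitmask form and the list form are
> equivalent):
> 
> # pin IRQ 42 to CPU 2 via bitmask (0x4)...
> echo 4 > /proc/irq/42/smp_affinity
> # ...or via the cpu-list form
> echo 2 > /proc/irq/42/smp_affinity_list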
> 
> You can monitor what calls are being executed on the CPUs using
> FlameGraph. I found this most useful to understand that ipfilter was
> eating a significant amount of CPU cycles, and also what other calls
> were eating up cycles inside ksoftirqd.
> https://github.com/brendangregg/FlameGraph
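> 
> The usual recipe, per the FlameGraph README (the two scripts live in
> the cloned repo):
> 
> # sample kernel+user stacks on all CPUs for 30 seconds
> perf record -F 99 -a -g -- sleep 30
> # fold the stacks and render an interactive SVG
> perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > out.svg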
> 
> Investigate additional memory management using numactl and numad
> (the NUMA daemon). Remember if you are using virtualisation you
> might want to pin guests to specific sockets, along with NUMA
> pinning on the vmhost... Also look at reserved memory allocation in
> the vmhost for the guest... This will make your guest perform
> better.
> 
> Enable sysstat (sar), as it will aid your investigation if you
> haven't already (sar -u ALL -P ALL 1). This will show which softirqs
> are eating the most CPU and which CPU they are bound to; this also
> translates directly to the network queue that the traffic is coming
> in on. I.e. network card queue 6 talks to cpu/6 talking to irq/6 and
> so on... Using FlameGraph will help you understand what syscalls are
> chewing the CPU..
> 
> If you're using virtualisation then the number of default queues
> that vmxnet (vmware in this example) presents to the guest might be
> less than the number of network card queues the vmhost sees (so
> watch out for that). You can adjust the number of queues to the
> guest by params in the vmware network driver... Investigate VMDq /
> netqueue to increase the number of available hardware queues from
> the vmhost to the guest. Depending on which guest driver you're
> using (vmxnet3, or others), some drivers don't support NAPI (see
> further down).
>   VMDQ: array of int
>     Number of Virtual Machine Device Queues: 0/1 = disable, 2-16 enable (default=8)
>   RSS: array of int
>     Number of Receive-Side Scaling Descriptor Queues, default 1=number of cpus
>   MQ: array of int
>     Disable or enable Multiple Queues, default 1
>   Node: array of int
>     set the starting node to allocate memory on, default -1
>   IntMode: array of int
>     Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2
>   InterruptType: array of int
>     Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default IntMode (deprecated)
> 
> Make sure your virtual switch (vmware), if used, has pass-through
> (DirectPath I/O) enabled. The NIC teaming policy should be validated
> depending on your requirements; for example the policy "route based
> on IP hash" can be of benefit.
> 
> Check the network card is MSI-X and the Linux driver supports NAPI
> (most should these days, but you never know); also check your vmhost
> driver supports NAPI, and if not, get a NAPI-supporting KVM driver
> or vmware driver (vib update).
> 
> Upgrade your kernel to a later 4.x release... even consider using a
> later Linux distro... I tried Fedora 29. I also compiled the latest
> osmocom from source, with compile options for "optimisation -O3 and
> other such".
> 
> "bmon -b" was a good tool understand throughput loads, along with
> loading through qdisc/fq_dodel mq's.... Understand qdisc via ip link
> or ifconfig (http://tldp.org/HOWTO/Traffic-Control-HOWTO/components.h
> tml), adjusting the queues has some traction, but if unsure leave as
> default.
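> 
> Quick ways to see what you have (eth0 is an example device):
> 
> # show the qdisc(s) attached to the interface
> tc qdisc show dev eth0
> # one rx-N/tx-N directory per queue the stack can see
> ls /sys/class/net/eth0/queues/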
> 
> TSO/UFO/GSO/LRO/GRO - understand your network card with respect to
> these; they can improve performance if you haven't already enabled
> them (or adversely disabled options, since sometimes they don't
> actually help). You can get your card's options using ethtool.
> TCP Segmentation Offload (TSO)
>     Uses the TCP protocol to send large packets. Uses the NIC to
> handle segmentation, and then adds the TCP, IP and data link layer
> protocol headers to each segment. 
> UDP Fragmentation Offload (UFO)
>     Uses the UDP protocol to send large packets. Uses the NIC to
> handle IP fragmentation into MTU sized packets for large UDP
> datagrams. 
> Generic Segmentation Offload (GSO)
>     Uses the TCP or UDP protocol to send large packets. If the NIC
> cannot handle segmentation/fragmentation, GSO performs the same
> operations, bypassing the NIC hardware. This is achieved by delaying
> segmentation until as late as possible, for example, when the packet
> is processed by the device driver. 
> Large Receive Offload (LRO)
>     Uses the TCP protocol. All incoming packets are re-segmented as
> they are received, reducing the number of segments the system has to
> process. They can be merged either in the driver or using the NIC. A
> problem with LRO is that it tends to resegment all incoming packets,
> often ignoring differences in headers and other information which can
> cause errors. It is generally not possible to use LRO when IP
> forwarding is enabled. LRO in combination with IP forwarding can lead
> to checksum errors. Forwarding is enabled if
> /proc/sys/net/ipv4/ip_forward is set to 1. 
> Generic Receive Offload (GRO)
>     Uses either the TCP or UDP protocols. GRO is more rigorous than
> LRO when resegmenting packets. For example, it checks the MAC
> headers of each packet, which must match; only a limited number of
> TCP or IP headers can be different, and the TCP timestamps must
> match. Resegmenting can be handled by either the NIC or the GSO
> code.
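> 
> A sketch of checking and toggling these (eth0 is an example; measure
> before and after, since offloads are not always a win):
> 
> # list offload settings; entries without [fixed] can be changed
> ethtool -k eth0 | grep -v fixed
> # illustrative toggles
> ethtool -K eth0 gro on
> ethtool -K eth0 lro off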
> 
> Traffic steering was on by default with the version of Linux I was
> using, but worth checking if you're using older versions.
> https://www.kernel.org/doc/Documentation/networking/scaling.txt
> (from that link) note: Some advanced NICs allow steering packets to
> queues based on programmable filters. For example, webserver bound
> TCP port 80 packets can be directed to their own receive queue. Such
> "n-tuple" filters can be configured from ethtool (--config-ntuple).
> 
> Interestingly, investigate your network card for its hashing
> algorithms and how it distributes the traffic over its ring buffers;
> on some cards you can adjust the RSS hash function. Alas the card I
> was using stuck to "toeplitz" for its hashing, while the others (xor
> and crc32) were disabled and unavailable. The indirection table can
> be adjusted based on the tuples ("ethtool -X"), but that didn't
> really assist too much here.
> ethtool -x <dev>
> RX flow hash indirection table for ens192 with 8 RX ring(s):
>     0:      0     1     2     3     4     5     6     7
>     8:      0     1     2     3     4     5     6     7
>    16:      0     1     2     3     4     5     6     7
>    24:      0     1     2     3     4     5     6     7
> RSS hash key:
> Operation not supported
> RSS hash function:
>     toeplitz: on
>     xor: off
>     crc32: off
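> 
> For reference, the indirection table can be rewritten like this
> (illustrative; ens192 with 8 rings as in the output above):
> 
> # spread the table evenly across all 8 rings
> ethtool -X ens192 equal 8
> # or weight some rings more heavily
> ethtool -X ens192 weight 2 2 1 1 1 1 1 1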
> 
> 
> Check the default size of the rx/tx ring buffers; they may be
> suboptimal.
> ethtool -g ens192
> Ring parameters for ens192:
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       4096
> TX:             4096
> Current hardware settings:
> RX:             1024
> RX Mini:        0
> RX Jumbo:       256
> TX:             512
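> 
> If they are low, raising them toward the pre-set maximums is a
> one-liner (sketch for the ens192 output above):
> 
> # grow the rx/tx rings; re-check with 'ethtool -g ens192'
> ethtool -G ens192 rx 4096 tx 4096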
> 
> If you're using port channels, make sure you have the correct
> hashing policy enabled at the switch end...
> 
> I haven't investigated this option yet, but some switches also do
> scaling to assist (certainly with virtualisation)... Maybe one day I
> will get around to this...
> Additionally Cisco describe the VM-FEX optimisation you should have:
> https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/unified-computing/vm_fex_best_practices_deployment_guide.html
> note:
> Table 4. Scaling of Dynamic vNIC with VMDirectPath, Virtual Machines
> Running on Linux Guest with VMXNET3 Emulated Driver and Multi-Queue
> Enabled
> Table 5. Scaling of Dynamic vNIC with VMDirectPath, Virtual Machines
> Running on Linux Guest with VMXNET3 Emulated Driver and Multi-Queue
> Disabled
> 
> 
> Another thing to consider/investigate - openvswitch/bridging... If
> you're using veth pairs to send your traffic into namespaces, you
> can have some varied results with performance by trying
> openvswitch/brctl.
> 
> 
> 
> I really enjoyed the investigation path, again thanks to Firat for
> the pointer, otherwise it would have taken longer to get the
> answer...
> 
> Tony
> 
> On Fri, Jun 21, 2019 at 6:50 AM fırat sönmez <firatssonmez at gmail.com>
> wrote:
> > Hi,
> > 
> > It has been over 2 years since I worked with gtp, and I kind of
> > had the same problem at that time: we had a 10gbit cable and tried
> > to see how much UDP flow we could get. I think we used iperf to
> > test it, and when we listed all the processes, ksoftirqd was using
> > all the resources. Then I found this page:
> > https://blog.cloudflare.com/how-to-receive-a-million-packets/. I
> > do not remember the exact solution, but I guess when you configure
> > your outgoing ethernet interface with the command below, it should
> > work. To my understanding all the packets are processed on the
> > same core in your situation, because the port number is always the
> > same. So, for example, if you add another network with a gtp-u
> > tunnel on another port (different from 3386) then your packets
> > will be processed on the other core, too. But with the below
> > command, the interface will be configured in a way that it won't
> > check the port to decide on which core a packet should be
> > processed, but will use the hash from the packet to distribute
> > over the cores.
> > ethtool -N (your_out_eth_interface) rx-flow-hash udp4 sdfn
> > 
> > Hope it will work for you.
> > 
> > Fırat
> > 
> > On Wed, 19 Jun 2019 at 15:07, Tony Clark <chiefy.padua at gmail.com>
> > wrote:
> > > Dear All,
> > > 
> > > I've been using the GTP-U kernel module to communicate with a
> > > P-GW.
> > > 
> > > Running Fedora 29, kernel 4.18.16-300.fc29.x86_64.
> > > 
> > > At high traffic levels through the GTP-U tunnel I see the
> > > performance degrade as 100% CPU is consumed by a single ksoftirqd
> > > process.
> > > 
> > > It is running on a multi-CPU machine and as far as I can tell
> > > the load is evenly spread across the CPUs (i.e. either manually
> > > via smp_affinity, or via irqbalance, checking /proc/interrupts
> > > and so forth).
> > > 
> > > Has anyone else experienced this?
> > > 
> > > Is there any particular area you could recommend I investigate
> > > to find the root cause of this bottleneck, as I'm starting to
> > > scratch my head about where to look next...
> > > 
> > > Thanks in advance
> > > Tony
> > >  
> > > ---- FYI
> > > 
> > > modinfo gtp
> > > filename:       /lib/modules/4.18.16-300.fc29.x86_64/kernel/drivers/net/gtp.ko.xz
> > > alias:          net-pf-16-proto-16-family-gtp
> > > alias:          rtnl-link-gtp
> > > description:    Interface driver for GTP encapsulated traffic
> > > author:         Harald Welte <hwelte at sysmocom.de>
> > > license:        GPL
> > > depends:        udp_tunnel
> > > retpoline:      Y
> > > intree:         Y
> > > name:           gtp
> > > vermagic:       4.18.16-300.fc29.x86_64 SMP mod_unload 
> > > 
> > > modinfo udp_tunnel
> > > filename:       /lib/modules/4.18.16-300.fc29.x86_64/kernel/net/ipv4/udp_tunnel.ko.xz
> > > license:        GPL
> > > depends:        
> > > retpoline:      Y
> > > intree:         Y
> > > name:           udp_tunnel
> > > vermagic:       4.18.16-300.fc29.x86_64 SMP mod_unload 
> > > 


