Dear All,
An update on the investigations.
If you want to push throughput even further, and you are running
hypervisors or similar, I recommend enabling SR-IOV on the network
cards. Naturally your network cards need to support SR-IOV (check your
tech specs), and in the case of virtualisation SR-IOV might require
licensing from the vendor.
This does need some changes to the BIOS settings on your hardware (to
enable SR-IOV and VT-x/VT-d, or the IOMMU if you are on AMD). You will
also have to configure your hypervisors to support SR-IOV, and
configure your VM guests to use the newly presented virtual functions
(VFs) of the network cards.
Don't over-allocate VFs on your physical NICs via SR-IOV; you might
run out of interrupts :-D
You will see even better throughput, reduced latency, lower power
consumption and lower resource utilisation on your hypervisors.
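As a rough sketch of enabling VFs on the host (the device name and VF
count here are illustrative; the sysfs path is the standard one for
SR-IOV capable drivers, and this needs root):

```shell
# How many VFs does the card support? (illustrative device name)
cat /sys/class/net/enp3s0f0/device/sriov_totalvfs
# Create 4 VFs; they appear as extra PCI functions / net interfaces
echo 4 > /sys/class/net/enp3s0f0/device/sriov_numvfs
lspci | grep -i "Virtual Function"
```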
Hope this helps.
On Fri, 2019-08-16 at 15:27 +0100, Tony Clark wrote:
Firstly, great thanks to Fırat for the reply; it certainly put me on a
different investigation path. And apologies for not replying sooner: I
wanted to make sure it was the correct path before I replied back to
the group with the findings and the associated solution.
If the GTP-U connection to the P-GW uses a single IP on each side
(src/dst), and UDP flow hashing hasn't been enabled on the network
card of the host running gtp.ko, all the associated network traffic
will be received on a single queue on the network card, which is then
serviced by a single ksoftirqd thread. At some point the system will
be receiving more traffic than that thread can service, and ksoftirqd
will burn at 100%. All your traffic is then bound to a single network
queue, bound to a single IRQ thread, limiting your overall throughput
no matter how big your network pipe is.
This is because the network card hashes each packet via
SRC_IP:SRC_PORT:DEST_IP:DEST_PORT:PROTO to a single queue.
# take note of the discussions about udp4 rx-flow-hash using ethtool
https://home.regit.org/tag/performance/
https://www.joyent.com/blog/virtualizing-nics
https://www.serializing.me/2015/04/25/rxtx-buffers-rss-others-on-boot/
You can check if your card supports adjustable parameters by using
"ethtool -k DEV | egrep -v fixed". As Fırat alludes to (below), UDP
flow hashing should be supported.
If you enable UDP flow hashing it will spread the hash over multiple
queues. The default number of queues on the network card can vary,
depending on your hardware, firmware, driver, and any additional
associated kernel parameters.
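A minimal sketch of checking and enabling UDP flow hashing (the device
name is illustrative, the driver must support rx-flow-hash, and this
needs root):

```shell
# Show which fields currently feed the UDP/IPv4 hash
ethtool -n eth0 rx-flow-hash udp4
# Hash on src IP, dst IP, src port and dst port ("sdfn"), so distinct
# UDP flows land on distinct queues
ethtool -N eth0 rx-flow-hash udp4 sdfn
# Confirm how many RX queues are in use
ethtool -l eth0
```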
I would recommend having the latest firmware for your network card,
and the latest kernel driver for it, if possible.
Alas, the network cards used by my hardware didn't support flow
hashing; they had Intel Flow Director, which wasn't granular enough
and only worked with TCP. To work around this limitation, having
multiple SRC_IPs in different namespaces with the same GTP UDP port
numbers resolved the problem. Of course, if you send GTP-U to a single
destination from multiple sources (say 6 IPs), via 6 different kernel
namespaces, you spread the load over 6 queues, which is better than
nothing on a feature-limited network card. Time to upgrade the 10G
network card....
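Roughly what that workaround looks like (names and addresses are made
up, it needs root, and the GTP side still has to be configured per
namespace, which is omitted here):

```shell
# One namespace per source IP, plumbed via a veth pair, so the NIC
# hashes each namespace's GTP-U traffic to a different queue
for i in 1 2 3 4 5 6; do
    ip netns add gtp$i
    ip link add veth$i type veth peer name vpeer$i
    ip link set vpeer$i netns gtp$i
    ip netns exec gtp$i ip addr add 10.0.$i.2/24 dev vpeer$i
    ip netns exec gtp$i ip link set vpeer$i up
    ip link set veth$i up
done
```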
This took the system from 100% ksoftirqd on a single CPU at 1G
throughput, to around 7 to 8G throughput at 90% ksoftirqd spread over
multiple CPUs... There is still massive room for improvement.
For performance, some things to investigate/consider, with which I had
different levels of success... Here are my ramblings.....
On the Linux host, assuming your traffic is now spread across multiple
queues (above), or at least spread as well as it can be:
Kernel sysctl tweaking is always of benefit if you are using an
out-of-the-box kernel config: for example UDP buffers, queue sizes,
paging and virtual memory settings. There is an application called
"tuned" which allows you to apply tuning profiles for the kernel
sysctls; the profile which suited my testing best was "throughput-
performance".
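Applying that profile is a one-liner (assuming the tuned package and
daemon are installed and running):

```shell
# List available profiles, switch to throughput-performance, verify
tuned-adm list
tuned-adm profile throughput-performance
tuned-adm active
```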
If you are looking for straight performance, disable audit processing
like auditd.
Question your use of SELinux (enforcing/permissive or disabled); it
can affect performance results if you are doing load testing. Of
course this is a security consideration.
If you don't need to use ipfilters/firewall (as in my case), you can
increase throughput by a third by disabling them (flushing the filter
tables and unloading the modules), then blacklisting the modules so
they don't get loaded at boot time. Note you can stop modules being
loaded with kernel.modules_disabled=1, but be careful if you are also
messing with initramfs rebuilds, because you won't get any modules
once you set that parameter; I learnt that the hard way :)
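A sketch of that flush/unload/blacklist sequence (module names are
illustrative; check lsmod for what is actually loaded on your box, and
run as root):

```shell
# Flush the filter tables, then unload the modules
iptables -F
modprobe -r iptable_filter ip_tables
# Blacklist them so they don't come back at boot
cat > /etc/modprobe.d/blacklist-netfilter.conf <<'EOF'
blacklist iptable_filter
blacklist ip_tables
EOF
```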
Investigate smp_affinity and affinity_hint, along with irqbalance
using --hintpolicy=exact. Understand which IRQs service the network
cards, and how many queues you have; /proc/interrupts will guide you
(egrep 'CPU|rx|tx' /proc/interrupts). Understand the smp_affinity
numbers: "for ((irq=START_IRQ; irq<=END_IRQ; irq++)); do cat
/proc/irq/$irq/smp_affinity; done | sort -u", as you can adjust which
queue goes to which ksoftirqd to manually balance the queues if you so
desire. Brilliant document on IRQ debugging:
https://events.static.linuxfound.org/sites/events/files/slides/LinuxConJapan2016_makita_160714.pdf
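To pin a queue's IRQ to one CPU you write a hex CPU mask into
smp_affinity: bit N set means CPU N. A small sketch (the IRQ number 42
is illustrative; use whatever /proc/interrupts shows for your NIC):

```shell
# Compute the affinity mask for a given CPU: bit N set for CPU N
cpu=3
mask=$(printf '%x' $((1 << cpu)))
echo "CPU $cpu -> mask $mask"
# Then (as root) bind e.g. IRQ 42 to that CPU:
# echo $mask > /proc/irq/42/smp_affinity
```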
You can monitor which calls are being executed on the CPUs using
FlameGraph:
https://github.com/brendangregg/FlameGraph
I found this most useful for understanding that ipfilter was eating a
significant amount of CPU cycles, and also what other calls were
eating up cycles inside ksoftirqd.
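The typical FlameGraph workflow looks roughly like this (it assumes
perf is installed and a clone of the FlameGraph repo in ./FlameGraph;
paths and the sample duration are illustrative):

```shell
# Sample all CPUs with call stacks for 30 seconds
perf record -F 99 -a -g -- sleep 30
# Fold the stacks and render the SVG flame graph
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg
```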
Investigate additional memory management using numactl (and numad, the
NUMA daemon). Remember if you are using virtualisation you might want
to pin guests to specific sockets, along with NUMA pinning on the VM
host. Also look at reserving memory on the VM host for the guest; this
will make your guest perform better.
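For example (the node number and workload name are illustrative;
ideally bind to the NUMA node your NIC is attached to):

```shell
# Inspect the NUMA topology first
numactl --hardware
# Run a process with CPUs and memory confined to node 0
numactl --cpunodebind=0 --membind=0 ./my-workload
```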
Enable sysstat (sar) if you haven't already, as it will aid your
investigation (sar -u ALL -P ALL 1). This will show which softirqs are
eating the most CPU and which CPU they are bound to, which also
translates directly to the network queue that the traffic is coming in
on; i.e. network card queue 6 talks to CPU 6, talking to IRQ 6, and so
on. Using FlameGraph will help you understand which syscalls are
chewing the CPU.
If you are using virtualisation then the number of default queues that
vmxnet3 (VMware in this example) presents to the guest might be less
than the number of network card queues the VM host sees (so watch out
for that). You can adjust the number of queues given to the guest via
parameters in the VMware network driver. Investigate VMDq / NetQueue
to increase the number of available hardware queues from the VM host
to the guest. Depending on which guest driver you are using (vmxnet3
or others), some drivers don't support NAPI (see further down).
VMDQ: array of int
Number of Virtual Machine Device Queues: 0/1 = disable, 2-16
enable (default=8)
RSS: array of int
Number of Receive-Side Scaling Descriptor Queues, default
1=number of cpus
MQ: array of int
Disable or enable Multiple Queues, default 1
Node: array of int
set the starting node to allocate memory on, default -1
IntMode: array of int
Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2
InterruptType: array of int
Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default IntMode
(deprecated)
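Parameters like the above are set as module options; a sketch,
assuming an ixgbe-style driver that exposes them (check modinfo for
what your driver actually supports, and the values here are
illustrative):

```shell
# List the parameters the driver actually exposes
modinfo -p ixgbe
# Persist example VMDQ/RSS settings across reboots (run as root)
echo "options ixgbe VMDQ=8 RSS=8 MQ=1" > /etc/modprobe.d/ixgbe.conf
```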
Make sure your virtual switch (VMware), if used, has pass-through
(DirectPath I/O) enabled. The NIC teaming policy should be validated
depending on your requirements; for example the policy "route based on
IP hash" can be of benefit.
Check the network card is MSI-X capable, and that the Linux driver
supports NAPI (most should these days, but you never know); also check
your VM host driver supports NAPI. If not, get a NAPI-capable KVM
driver, or VMware driver (VIB update).
Upgrade your kernel to a later 4.x release, or even consider using a
later Linux distro; I tried Fedora 29. I also compiled the latest
osmocom code from source, with compiler options such as -O3.
"bmon -b" was a good tool to understand throughput loads, along with
loading through the qdisc/fq_codel mq's. Understand qdisc via ip link
or ifconfig
(http://tldp.org/HOWTO/Traffic-Control-HOWTO/components.html);
adjusting the queues has some traction, but if unsure leave them at
the default.
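Inspecting what is attached is straightforward (device name
illustrative; on multiqueue NICs you will typically see an mq root
qdisc with a fq_codel child per hardware queue):

```shell
# Show the qdiscs attached to the interface
tc qdisc show dev eth0
# Per-interface byte/packet/drop counters alongside
ip -s link show dev eth0
```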
TSO/UFO/GSO/LRO/GRO: understand your network card with respect to
these; this can improve performance if you haven't already enabled
them (or adversely disabled options, since sometimes they don't
actually help). You can get your card's options using ethtool.
TCP Segmentation Offload (TSO)
Uses the TCP protocol to send large packets. Uses the NIC to
handle segmentation, and then adds the TCP, IP and data link layer
protocol headers to each segment.
UDP Fragmentation Offload (UFO)
Uses the UDP protocol to send large packets. Uses the NIC to
handle IP fragmentation into MTU sized packets for large UDP
datagrams.
Generic Segmentation Offload (GSO)
Uses the TCP or UDP protocol to send large packets. If the NIC
cannot handle segmentation/fragmentation, GSO performs the same
operations, bypassing the NIC hardware. This is achieved by delaying
segmentation until as late as possible, for example, when the packet
is processed by the device driver.
Large Receive Offload (LRO)
Uses the TCP protocol. All incoming packets are re-segmented as
they are received, reducing the number of segments the system has to
process. They can be merged either in the driver or using the NIC. A
problem with LRO is that it tends to resegment all incoming packets,
often ignoring differences in headers and other information which can
cause errors. It is generally not possible to use LRO when IP
forwarding is enabled. LRO in combination with IP forwarding can lead
to checksum errors. Forwarding is enabled if
/proc/sys/net/ipv4/ip_forward is set to 1.
Generic Receive Offload (GRO)
Uses either the TCP or UDP protocols. GRO is more rigorous than
LRO when resegmenting packets. For example it checks the MAC headers
of each packet, which must match, only a limited number of TCP or IP
headers can be different, and the TCP timestamps must match.
Resegmenting can be handled by either the NIC or the GSO code.
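All of these offloads can be inspected and toggled with ethtool; a
sketch (device name illustrative; which features are adjustable
depends on the card and driver):

```shell
# Show current offload settings; "[fixed]" means not adjustable
ethtool -k eth0
# Example toggles (illustrative): enable GRO, disable LRO
ethtool -K eth0 gro on
ethtool -K eth0 lro off
```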
Traffic steering was on by default in the version of Linux I was
using, but worth checking if you are using older versions.
https://www.kernel.org/doc/Documentation/networking/scaling.txt
(from that document) note: Some advanced NICs allow steering packets
to queues based on programmable filters. For example, webserver bound
TCP port 80 packets can be directed to their own receive queue. Such
"n-tuple" filters can be configured from ethtool (--config-ntuple).
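On cards that support it, such a filter might look like this (device
name and queue number illustrative; 2152 is the usual GTP-U port, and
"ethtool -k" will tell you whether ntuple-filters is fixed on your
card):

```shell
# Enable ntuple filtering, then steer UDP port 2152 to RX queue 3
ethtool -K eth0 ntuple on
ethtool -N eth0 flow-type udp4 dst-port 2152 action 3
# List the configured filters
ethtool -n eth0
```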
Interestingly, investigate your network card's hashing algorithms and
how it distributes the traffic over its ring buffers; on some cards
you can adjust the RSS hash function. Alas the card I was using was
stuck with "toeplitz" for its hashing, while the others (xor and
crc32) were disabled and unavailable. The indirection table can be
adjusted based on the tuples ("ethtool -X"), but that didn't really
assist too much here.
ethtool -x ens192
RX flow hash indirection table for ens192 with 8 RX ring(s):
0: 0 1 2 3 4 5 6 7
8: 0 1 2 3 4 5 6 7
16: 0 1 2 3 4 5 6 7
24: 0 1 2 3 4 5 6 7
RSS hash key:
Operation not supported
RSS hash function:
toeplitz: on
xor: off
crc32: off
Check the default size of the rx/tx ring buffers; they may be
suboptimal.
ethtool -g ens192
Ring parameters for ens192:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 4096
TX: 4096
Current hardware settings:
RX: 1024
RX Mini: 0
RX Jumbo: 256
TX: 512
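Given output like the above, the rings can be raised towards the
pre-set maximums (values illustrative; larger rings trade a little
latency and memory for fewer drops under load):

```shell
# Grow the RX/TX rings to the maximums reported by "ethtool -g"
ethtool -G ens192 rx 4096 tx 4096
# Verify the new settings
ethtool -g ens192
```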
If you are using port channels, make sure you have the correct hashing
policy enabled at the switch end...
I haven't investigated this option yet, but some switches also do
scaling to assist (certainly with virtualisation)... Maybe one day I
will get around to this...
Additionally, Cisco describe how you should do VM-FEX optimisation:
https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/unified-computing/vm_fex_best_practices_deployment_guide.html
note:
Table 4. Scaling of Dynamic vNIC with VMDirectPath, Virtual Machines
Running on Linux Guest with VMXNET3 Emulated Driver and Multi-Queue
Enabled
Table 5. Scaling of Dynamic vNIC with VMDirectPath, Virtual Machines
Running on Linux Guest with VMXNET3 Emulated Driver and Multi-Queue
Disabled
Another thing to consider/investigate: Open vSwitch / bridging. If you
are using veth pairs to send your traffic down namespaces, you can get
varied performance results by trying Open vSwitch vs brctl.
I really enjoyed the investigation path; again thanks to Fırat for the
pointer, otherwise it would have taken longer to get to the answer...
Tony
On Fri, Jun 21, 2019 at 6:50 AM fırat sönmez <firatssonmez(a)gmail.com>
wrote:
> Hi,
>
> It has been over 2 years that I have worked with gtp and I kind of
> had the same problem that time, we had a 10gbit cable and tried to
> see how much udp flow we could get. I think we used iperf to test
> it and when we list all the processes, the ksoftirq was using all
> the resource. Then I found this page:
> https://blog.cloudflare.com/how-to-receive-a-million-packets/
> I do not remember the exact
> solution, but I guess when you configure your out ethernet
> interface with the command below, it must work then. To my
> understanding all the packets are processed in the same core in
> your situation, because the port number is always the same. So, for
> example, if you add another network with gtp-u tunnel on another
> port (different than 3386) then again your packets will be
> processed on the other core, too. But with the below command, the
> interface will be configured in a way that it wont check the port
> to process on which core it should be processed, but it will use
> the hash from the packet to distribute over the cores.
> ethtool -n (your_out_eth_interface) rx-flow-hash udp4
>
> Hope it will work you.
>
> Fırat
>
> Tony Clark <chiefy.padua(a)gmail.com> wrote on Wed, 19 Jun 2019 at
> 15:07:
> > Dear All,
> >
> > I've been using the GTP-U kernel module to communicate with a P-
> > GW.
> >
> > Running Fedora 29, kernel 4.18.16-300.fc29.x86_64.
> >
> > At high traffic levels through the GTP-U tunnel I see the
> > performance degrade as 100% CPU is consumed by a single ksoftirqd
> > process.
> >
> > It is running on a multi-cpu machine and as far as I can tell the
> > load is evenly spread across the cpus (ie either manually via
> > smp_affinity, or even irqbalance, checking /proc/interrupts so
> > forth.).
> >
> > Has anyone else experienced this?
> >
> > Is there any particular area you could recommend I investigate to
> > find the root cause of this bottleneck, as i'm starting to
> > scratch my head where to look next...
> >
> > Thanks in advance
> > Tony
> >
> > ---- FYI
> >
> > modinfo gtp
> > filename: /lib/modules/4.18.16-
> > 300.fc29.x86_64/kernel/drivers/net/gtp.ko.xz
> > alias: net-pf-16-proto-16-family-gtp
> > alias: rtnl-link-gtp
> > description: Interface driver for GTP encapsulated traffic
> > author: Harald Welte <hwelte(a)sysmocom.de>
> > license: GPL
> > depends: udp_tunnel
> > retpoline: Y
> > intree: Y
> > name: gtp
> > vermagic: 4.18.16-300.fc29.x86_64 SMP mod_unload
> >
> > modinfo udp_tunnel
> > filename: /lib/modules/4.18.16-
> > 300.fc29.x86_64/kernel/net/ipv4/udp_tunnel.ko.xz
> > license: GPL
> > depends:
> > retpoline: Y
> > intree: Y
> > name: udp_tunnel
> > vermagic: 4.18.16-300.fc29.x86_64 SMP mod_unload
> >