This is merely a historical archive of years 2008-2021, before the migration to mailman3.
A maintained and still updated list archive can be found at https://lists.osmocom.org/hyperkitty/list/osmocom-net-gprs@lists.osmocom.org/.
chiefy chiefy.padua at gmail.com

Dear All,

To update you on the investigation: if you want to push throughput even further, I recommend enabling SR-IOV on the network cards if you are running hypervisors. Naturally your network cards need to support SR-IOV (check your tech specs), and in the case of virtualisation SR-IOV might require licensing from the vendor. This does need some changes to the BIOS settings on your hardware (to enable SR-IOV and VT-x/VT-d, or the IOMMU if you are on AMD). You will also have to configure your hypervisors to support SR-IOV, and configure your VM guests to use the newly presented virtual functions (VFs). Don't over-allocate VFs on your physical NICs via SR-IOV, or you might run out of interrupts :-D You will see even better throughput, reduced latency, lower power consumption and lower resource utilisation on your hypervisors.

Hope this helps.

On Fri, 2019-08-16 at 15:27 +0100, Tony Clark wrote:
> Firstly I would like to say great thanks to Fırat for the reply; it
> certainly put me on a different investigation path. And apologies for
> not replying sooner: I wanted to make sure it was the correct path
> before I replied back to the group with the findings and associated
> solution.
>
> If the GTP-U connection is connecting to the P-GW with a single IP on
> each side (src/dst), and UDP flow hashing hasn't been enabled on the
> network card of the host using gtp.ko in the kernel, all the
> associated network traffic will be received on a single queue on the
> network card, which is then serviced by a single ksoftirqd thread. At
> some point the system will be receiving more traffic than that thread
> can service, and your ksoftirqd will burn at 100%. That means all
> your traffic is bound to a single network queue, bound to a single
> IRQ thread, limiting your overall throughput no matter how big your
> network pipe is.
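The SR-IOV setup suggested at the top of this mail can be sketched roughly as follows (the device name and VF count are illustrative, and your card, BIOS and hypervisor must all support SR-IOV; treat this as an outline, not a recipe):

```shell
DEV=eth0   # placeholder: the physical NIC (the PF)

# How many VFs does the card support, and how many are currently enabled?
cat /sys/class/net/"$DEV"/device/sriov_totalvfs
cat /sys/class/net/"$DEV"/device/sriov_numvfs

# Enable 4 VFs - keep this well below the total, per the note above
# about running out of interrupts.
echo 4 > /sys/class/net/"$DEV"/device/sriov_numvfs

# The VFs appear as extra PCI functions to pass through to guests.
lspci | grep -i 'virtual function'
```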
>
> This is because the network card hashes each packet via
> SRC_IP:SRC_PORT:DEST_IP:DEST_PORT:PROTO to a single queue.
>
> Take note of the discussions about rx-flow-hash udp4 using ethtool:
> https://home.regit.org/tag/performance/
> https://www.joyent.com/blog/virtualizing-nics
> https://www.serializing.me/2015/04/25/rxtx-buffers-rss-others-on-boot/
>
> You can check whether your card supports adjustable parameters by
> using "ethtool -k DEV | egrep -v fixed". As Fırat alludes to (below),
> UDP flow hashing should be supported.
>
> If you enable UDP flow hashing, the hash will be spread over multiple
> queues. The default number of queues on the network card can vary,
> depending on your hardware, firmware, driver and any additional
> associated kernel parameters.
>
> I would recommend having the latest firmware for your network card,
> and the latest kernel driver for it, if possible.
>
> Alas, the network cards used by my hardware didn't support flow
> hashing; they had Intel Flow Director, which wasn't granular enough
> and only worked with TCP. So to work around this limitation, having
> multiple SRC_IPs in different namespaces with the same GTP UDP port
> numbers resolved the problem. If you send GTP-U to a single
> destination from multiple sources (say 6 IPs), via 6 different kernel
> namespaces, you spread the load over 6 queues, which is better than
> nothing on a limited-feature network card. Time to upgrade the 10G
> network card....
>
> This took the system from 100% ksoftirqd on a single CPU at 1 Gbit/s
> throughput, to around 7-8 Gbit/s throughput at 90% ksoftirqd spread
> over multiple CPUs... There is still massive room for improvement.
>
> For performance, some things to investigate/consider, with which I
> had different levels of success... Here are my ramblings.....
>
> On the Linux host... assuming your traffic is now spread across
> multiple queues (above), or at least spread as well as it can be...
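A sketch of the ethtool side of this (DEV is a placeholder, and whether the udp4 hash and the full sdfn field set are accepted depends on your driver; run against your real NIC):

```shell
DEV=eth0   # placeholder: substitute your actual NIC

# Show which fields are currently hashed for UDP-over-IPv4 flows.
ethtool -n "$DEV" rx-flow-hash udp4

# Hash on src IP (s), dst IP (d), src port (f) and dst port (n), so
# that distinct UDP flows land on distinct RX queues.
ethtool -N "$DEV" rx-flow-hash udp4 sdfn
```

Note that for GTP-U this only spreads load if something in the tuple actually varies, which is exactly why the multiple-SRC_IP workaround above helps on cards that cannot hash on ports.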
>
> Kernel sysctl tweaking is always of benefit if you are using an
> out-of-the-box kernel config: for example, UDP buffers, queue sizes,
> paging and virtual memory settings. There is an application called
> "tuned" which lets you apply profiles of kernel sysctl settings; the
> profile which suited my testing best was "throughput-performance".
>
> If you are looking for straight performance, disable audit processing
> such as "auditd".
>
> Question the use of SELinux (enforcing/permissive or disabled); this
> can bring performance results if you are doing load testing. Of
> course it is a security consideration.
>
> If you don't need to use ipfilters/firewalling, disabling it
> (flushing the filter tables and unloading the modules) increased
> throughput by a third in my case. Blacklist the modules so they don't
> get loaded at boot time. Note you can stop modules being loaded at
> all with kernel.modules_disabled=1, but be careful if you are also
> messing with initramfs rebuilds, because you don't get any modules
> once you set that parameter; I learnt that the hard way :)
>
> Investigate smp_affinity and affinity_hint, along with irqbalance
> using --hintpolicy=exact. Understand which IRQs service the network
> cards, and how many queues you have; /proc/interrupts will guide you
> (egrep 'CPU|rxtx' /proc/interrupts). Understand the smp_affinity
> numbers: "for ((irq=START_IRQ; irq<=END_IRQ; irq++)); do
> cat /proc/irq/$irq/smp_affinity; done | sort -u", as you can adjust
> which queue goes to which ksoftirqd to manually balance the queues if
> you so desire. A brilliant document on IRQ debugging:
> https://events.static.linuxfound.org/sites/events/files/slides/LinuxConJapan2016_makita_160714.pdf
>
> You can monitor which calls are being executed on the CPUs using
> FlameGraph. I found this most useful to understand that ipfilter was
> eating a significant amount of CPU cycles, and also what other calls
> were eating up cycles inside ksoftirqd.
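To make the smp_affinity numbers concrete, here is a small sketch of building and decoding the hex masks (pure shell arithmetic, touching nothing under /proc, so it is safe to run anywhere):

```shell
# smp_affinity is a hex bitmask: bit N set means the IRQ may run on CPU N.

# Build the mask that pins an IRQ to CPU 5 only:
cpu=5
printf 'cpu %d -> mask %x\n' "$cpu" $((1 << cpu))

# Decode a mask back into its CPU list (example mask 0x14 = CPUs 2 and 4):
mask=0x14
for c in 0 1 2 3 4 5 6 7; do
  if [ $(( (mask >> c) & 1 )) -eq 1 ]; then
    echo "cpu $c"
  fi
done
```

On a real system you would echo the mask into /proc/irq/<irq>/smp_affinity as root (with irqbalance stopped, or using its hint policy), one queue IRQ per CPU.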
> https://github.com/brendangregg/FlameGraph
>
> Investigate additional memory management using numactl (and the numad
> daemon). Remember, if you are using virtualisation you might want to
> pin guests to specific sockets, along with NUMA pinning on the VM
> host. Also look at reserved memory allocation on the VM host for the
> guest; this will make your guest perform better.
>
> Enable sysstat (sar), as it will aid your investigation if you
> haven't already (sar -u ALL -P ALL 1). This will show which softirqs
> are eating the most CPU and to which CPU they are bound; this also
> translates directly to the network queue the traffic is coming in on.
> I.e., network card queue 6 talks to CPU 6, talking to IRQ 6, and so
> on. Using FlameGraph will help you understand which syscalls are
> chewing the CPU.
>
> If you are using virtualisation, then the number of default queues
> that vmxnet3 (VMware in this example) presents to the guest might be
> less than the number of network card queues the VM host sees (so
> watch out for that). You can adjust the number of queues presented to
> the guest by parameters in the VMware network driver. Investigate
> VMDq / NetQueue to increase the number of available hardware queues
> from the VM host to the guest. Depending which guest driver you are
> using (vmxnet3 or others), some drivers don't support NAPI (see
> further down).
>     VMDQ: array of int
>         Number of Virtual Machine Device Queues: 0/1 = disable,
>         2-16 enable (default=8)
>     RSS: array of int
>         Number of Receive-Side Scaling Descriptor Queues, default
>         1=number of cpus
>     MQ: array of int
>         Disable or enable Multiple Queues, default 1
>     Node: array of int
>         Set the starting node to allocate memory on, default -1
>     IntMode: array of int
>         Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2
>     InterruptType: array of int
>         Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default
>         IntMode (deprecated)
>
> Make sure your virtual switch (VMware), if used, has pass-through
> (DirectPath I/O) enabled.
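To illustrate the queue-to-CPU mapping above, here is a sketch that parses a canned /proc/interrupts fragment (the device name and counters are made up) and reports which CPU is servicing each RX/TX queue vector; point the same awk at the real /proc/interrupts on your host:

```shell
# For each rxtx queue vector, find the CPU column with the highest
# interrupt count - that is the CPU the queue's ksoftirqd load lands on.
awk 'NR==1{next} /rxtx/{
  best=0; cpu=0
  for (i=2; i<=3; i++) if ($i+0 > best) { best=$i+0; cpu=i-2 }
  printf "%s busiest on CPU%d (%d interrupts)\n", $NF, cpu, best
}' <<'EOF'
           CPU0       CPU1
  56:    1021334         12   PCI-MSI-edge   eth0-rxtx-0
  57:         40    2930111   PCI-MSI-edge   eth0-rxtx-1
EOF
```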
> NIC teaming policy should be validated depending on your
> requirements; for example, the policy "route based on IP hash" can be
> of benefit.
>
> Check the network card is MSI-X, and that the Linux driver supports
> NAPI (most should these days, but you never know). Also check your VM
> host driver supports NAPI; if not, get a NAPI-supporting KVM driver,
> or VMware driver (VIB update).
>
> Upgrade your kernel to a later 4.x release; even consider using a
> later Linux distro. I tried Fedora 29. I also compiled the latest
> Osmocom from source, with compile options such as "-O3".
>
> "bmon -b" was a good tool to understand throughput loads, along with
> loading through the qdisc/fq_codel mq's. Understand qdisc via ip link
> or ifconfig
> (http://tldp.org/HOWTO/Traffic-Control-HOWTO/components.html);
> adjusting the queues has some traction, but if unsure leave them at
> the defaults.
>
> TSO/UFO/GSO/LRO/GRO - understand your network card with respect to
> these; they can improve performance if you haven't already enabled
> them (or adversely disabled options, since sometimes they don't
> actually help). You can get your card's options using ethtool.
>     TCP Segmentation Offload (TSO)
>         Uses the TCP protocol to send large packets. Uses the NIC to
>         handle segmentation, and then adds the TCP, IP and data link
>         layer protocol headers to each segment.
>     UDP Fragmentation Offload (UFO)
>         Uses the UDP protocol to send large packets. Uses the NIC to
>         handle IP fragmentation into MTU-sized packets for large UDP
>         datagrams.
>     Generic Segmentation Offload (GSO)
>         Uses the TCP or UDP protocol to send large packets. If the
>         NIC cannot handle segmentation/fragmentation, GSO performs
>         the same operations, bypassing the NIC hardware. This is
>         achieved by delaying segmentation until as late as possible,
>         for example when the packet is processed by the device
>         driver.
>     Large Receive Offload (LRO)
>         Uses the TCP protocol.
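The "ethtool -k | egrep -v fixed" check mentioned earlier can be demonstrated on a canned sample (the feature list below is made up; run the real command against your own NIC):

```shell
# Canned sample of `ethtool -k DEV` output. Filtering out "[fixed]"
# leaves only the offloads you can actually toggle, e.g. with
# `ethtool -K DEV generic-receive-offload off` on a real device.
grep -v 'fixed' <<'EOF'
tcp-segmentation-offload: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
large-receive-offload: off [fixed]
generic-receive-offload: on
EOF
```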
>         All incoming packets are re-segmented as they are received,
>         reducing the number of segments the system has to process.
>         They can be merged either in the driver or using the NIC. A
>         problem with LRO is that it tends to re-segment all incoming
>         packets, often ignoring differences in headers and other
>         information, which can cause errors. It is generally not
>         possible to use LRO when IP forwarding is enabled; LRO in
>         combination with IP forwarding can lead to checksum errors.
>         Forwarding is enabled if /proc/sys/net/ipv4/ip_forward is
>         set to 1.
>     Generic Receive Offload (GRO)
>         Uses either the TCP or UDP protocols. GRO is more rigorous
>         than LRO when re-segmenting packets. For example, it checks
>         the MAC headers of each packet, which must match; only a
>         limited number of TCP or IP headers can be different; and the
>         TCP timestamps must match. Re-segmenting can be handled by
>         either the NIC or the GSO code.
>
> Traffic steering was on by default with the version of Linux I was
> using, but it is worth checking if you are on older versions.
> https://www.kernel.org/doc/Documentation/networking/scaling.txt
> (From that document) note: Some advanced NICs allow steering packets
> to queues based on programmable filters. For example, webserver-bound
> TCP port 80 packets can be directed to their own receive queue. Such
> "n-tuple" filters can be configured from ethtool (--config-ntuple).
>
> Interestingly, investigate your network card for its hashing
> algorithms and how it distributes the traffic over its ring buffers;
> on some cards you can adjust the RSS hash function. Alas, the card I
> was using was stuck with "toeplitz" for its hashing, while the others
> (xor and crc32) were disabled and unavailable. The indirection table
> can be adjusted based on the tuples ("ethtool -X"), but that didn't
> really assist too much here.
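A hedged sketch of such an n-tuple rule (DEV, the GTP-U user-plane port 2152 and the queue number are illustrative, and the card must support ntuple filtering):

```shell
DEV=eth0   # placeholder NIC

# Check the feature is available and switch it on.
ethtool -k "$DEV" | grep ntuple
ethtool -K "$DEV" ntuple on

# Steer inbound GTP-U (UDP dst port 2152 here) to RX queue 2.
ethtool -N "$DEV" flow-type udp4 dst-port 2152 action 2

# List the configured classification rules.
ethtool -n "$DEV"
```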
> ethtool -x <dev>
> RX flow hash indirection table for ens192 with 8 RX ring(s):
>     0:      0     1     2     3     4     5     6     7
>     8:      0     1     2     3     4     5     6     7
>    16:      0     1     2     3     4     5     6     7
>    24:      0     1     2     3     4     5     6     7
> RSS hash key:
> Operation not supported
> RSS hash function:
>     toeplitz: on
>     xor: off
>     crc32: off
>
> Check the default size of the rx/tx ring buffers; they may be
> suboptimal.
> ethtool -g ens192
> Ring parameters for ens192:
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       4096
> TX:             4096
> Current hardware settings:
> RX:             1024
> RX Mini:        0
> RX Jumbo:       256
> TX:             512
>
> If you are using port channels, make sure you have the correct
> hashing policy enabled at the switch end.
>
> I haven't investigated this option yet, but some switches also do
> scaling to assist (certainly with virtualisation); maybe one day I
> will get around to this.
> Additionally, Cisco describe that you should have VM-FEX
> optimisation:
> https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/unified-computing/vm_fex_best_practices_deployment_guide.html
> Note:
> Table 4. Scaling of Dynamic vNIC with VMDirectPath, Virtual Machines
> Running on Linux Guest with VMXNET3 Emulated Driver and Multi-Queue
> Enabled
> Table 5. Scaling of Dynamic vNIC with VMDirectPath, Virtual Machines
> Running on Linux Guest with VMXNET3 Emulated Driver and Multi-Queue
> Disabled
>
> Another thing to consider/investigate: Open vSwitch/bridging. If you
> are using veth pairs to send your traffic into namespaces, you can
> get some varied performance results by trying Open vSwitch vs.
> brctl.
>
> I really enjoyed the investigation path. Again, thanks to Fırat for
> the pointer; otherwise it would have taken longer to get to the
> answer.
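For example (ens192 taken from the output above; the maximums vary by card and driver), the current rings can be compared against the hardware maximum and grown with `ethtool -G`. Larger rings absorb bursts at the cost of a little extra latency and memory:

```shell
DEV=ens192   # device name from the example output above

# Show pre-set maximums vs. current settings.
ethtool -g "$DEV"

# Grow the RX/TX rings toward the advertised maximum (4096 here).
ethtool -G "$DEV" rx 4096 tx 4096
```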
>
> Tony
>
> On Fri, Jun 21, 2019 at 6:50 AM fırat sönmez <firatssonmez at gmail.com> wrote:
> > Hi,
> >
> > It has been over 2 years since I worked with GTP, and I had much
> > the same problem at the time: we had a 10 Gbit link and tried to
> > see how much UDP flow we could get. I think we used iperf to test
> > it, and when we listed all the processes, ksoftirqd was using all
> > the resources. Then I found this page:
> > https://blog.cloudflare.com/how-to-receive-a-million-packets/. I do
> > not remember the exact solution, but I guess when you configure
> > your outgoing ethernet interface with the command below, it should
> > work. To my understanding, all the packets are processed on the
> > same core in your situation, because the port number is always the
> > same. So, for example, if you add another network with a GTP-U
> > tunnel on another port (different from 3386), then your packets
> > will be processed on another core, too. But with the command below,
> > the interface will be configured in a way that it won't just use
> > the port to decide on which core a packet should be processed; it
> > will use the hash of the packet to distribute them over the cores.
> > ethtool -N (your_out_eth_interface) rx-flow-hash udp4 sdfn
> >
> > Hope it will work for you.
> >
> > Fırat
> >
> > Tony Clark <chiefy.padua at gmail.com> wrote on Wed, 19 Jun 2019
> > at 15:07:
> > > Dear All,
> > >
> > > I've been using the GTP-U kernel module to communicate with a
> > > P-GW.
> > >
> > > Running Fedora 29, kernel 4.18.16-300.fc29.x86_64.
> > >
> > > At high traffic levels through the GTP-U tunnel I see the
> > > performance degrade as 100% CPU is consumed by a single
> > > ksoftirqd process.
> > >
> > > It is running on a multi-CPU machine and as far as I can tell
> > > the load is evenly spread across the CPUs (i.e. either manually
> > > via smp_affinity, or via irqbalance, checking /proc/interrupts
> > > and so forth).
> > >
> > > Has anyone else experienced this?
> > >
> > > Is there any particular area you could recommend I investigate
> > > to find the root cause of this bottleneck, as I'm starting to
> > > scratch my head over where to look next...
> > >
> > > Thanks in advance
> > > Tony
> > >
> > > ---- FYI
> > >
> > > modinfo gtp
> > > filename:       /lib/modules/4.18.16-300.fc29.x86_64/kernel/drivers/net/gtp.ko.xz
> > > alias:          net-pf-16-proto-16-family-gtp
> > > alias:          rtnl-link-gtp
> > > description:    Interface driver for GTP encapsulated traffic
> > > author:         Harald Welte <hwelte at sysmocom.de>
> > > license:        GPL
> > > depends:        udp_tunnel
> > > retpoline:      Y
> > > intree:         Y
> > > name:           gtp
> > > vermagic:       4.18.16-300.fc29.x86_64 SMP mod_unload
> > >
> > > modinfo udp_tunnel
> > > filename:       /lib/modules/4.18.16-300.fc29.x86_64/kernel/net/ipv4/udp_tunnel.ko.xz
> > > license:        GPL
> > > depends:
> > > retpoline:      Y
> > > intree:         Y
> > > name:           udp_tunnel
> > > vermagic:       4.18.16-300.fc29.x86_64 SMP mod_unload
> > >