On an Intel Ivy Bridge test system with an ExaNIC X25, the median application-to-network-to-application latency at 10G is 643 nanoseconds for small packets, when taking advantage of all of the latency optimisation techniques available for ExaNICs. This increases approximately linearly with frame size: for example, sending a 164 byte frame rather than a minimum-size 64 byte frame adds the serialisation time of the extra 100 bytes at 10G line rate (100 bytes x 8 bits / 10 Gbps = 80 ns) to this latency figure, assuming frames are sent and received in full. Results will vary with system architecture; see below for the benchmarking setup, utilities and troubleshooting.
Following is a general guide to optimizing your system for performance benchmarks.
In general, when performing initial benchmarking investigations, it is advisable to set up with as little equipment as possible. We suggest a simple looped-back cable for initial validation tests, without other switches or network equipment in series.
ExaNIC Configuration
The simplest starting point is to install a loopback cable from the first two ports of an ExaNIC device. (i.e. exanic0:0 to exanic0:1)
With the cable present, exanic-config should report "SFP present" and "signal detected" for the two ports. If this is not the case, confirm that the SFPs are fully inserted and, if possible, replace the cable with a known good one.
The next step is to validate that the speeds of the two ports match. Again run exanic-config and confirm that the two "Port speed:" values are the same. For best benchmarking results this should be the highest available speed, e.g. 10000 Mbps. If the speeds differ, or are lower than expected, change them by running:
exanic-config exanic0:0 speed 10000
exanic-config exanic0:1 speed 10000
Next, confirm that the "Port status:" values show enabled. If they do not, bring each port up with the exanic-config exanic0:0 up command (and likewise for exanic0:1).
The final exanic-config output for both ports should look similar to:
Port 0:
Interface: enp1s0
Port speed: 10000 Mbps
Port status: enabled, SFP present, signal detected, link active
and
Port 1:
Interface: enp1s0d1
Port speed: 10000 Mbps
Port status: enabled, SFP present, signal detected, link active
For simple loopback tests, also enable bypass-only mode, which stops received frames from being delivered to the Linux kernel network stack and so removes unnecessary host processing from the measurement: exanic-config exanic0:0 bypass-only on
Running benchmarking utilities
A number of benchmarking utilities, both for the ExaNIC and for other cards, are located in the perf-test directory provided with the distribution. To build these benchmark utilities for ExaNIC:
$ cd perf-test
$ make exanic
The exanic_perf_test application can be used to benchmark the performance of ExaNIC cards in a variety of configurations. This guide will demonstrate how to run a "loopback" benchmark, which measures the time taken for a frame to be sent and received by software with libexanic.
$ ./exanic_perf_test
exanic_perf_test: Measure the latency performance of ExaNICs with libexanic
Usage: ./exanic_perf_test -d device
[-m testmode] [-t txport] [-r rxport]
[-T txmode] [-R rxmode]
[-s size] [-c count] [-w warmups] [-a]
-m: specify the test mode (loopback/forward)
-d: specify the exanic device name (e.g. exanic0)
-t/-r set the port to transmit/receive packets on
-T set the method to transmit packets (frame/preloaded)
-R set the method to receive packets (frame/chunk_inplace)
-s: specify the packet size to send (default 60)
-c: specify how many packets to send (default 1000000)
-w: specify how many warmup frames to send (default 100000)
-a: print raw cycle counts instead of a percentile breakdown
To get an initial indication of your host's performance with libexanic, exanic_perf_test can be used to run a loopback benchmark between ports 0 and 1. Connect a cable from port 0 to port 1 and run exanic_perf_test as follows:
$ ./exanic_perf_test -d exanic0 -T frame -R frame
CPU GHz = 3.31
Percentile 0.000 = 625ns
Percentile 1.000 = 642ns
Percentile 5.000 = 679ns
Percentile 10.000 = 683ns
Percentile 25.000 = 689ns
Percentile 50.000 = 696ns
Percentile 75.000 = 709ns
Percentile 90.000 = 759ns
Percentile 95.000 = 770ns
Percentile 99.000 = 959ns
Percentile 100.000 = 1336ns
This configuration causes exanic_perf_test to loop frames back from port 0 to port 1, using exanic_transmit_frame() to send frames and exanic_receive_frame() to receive them. The libexanic documentation provides more information on what these function calls do.
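For reference, the core of such a loopback measurement looks roughly like the sketch below. This is a minimal illustration rather than the actual exanic_perf_test source, and it assumes the standard libexanic calls (exanic_acquire_handle(), exanic_acquire_tx_buffer(), exanic_acquire_rx_buffer(), exanic_transmit_frame() and exanic_receive_frame()); error handling and the percentile reporting are omitted.

/* Minimal libexanic loopback sketch: port 0 TX -> cable -> port 1 RX.
 * Build (assuming libexanic is installed): gcc -o loopback loopback.c -lexanic */
#include <exanic/exanic.h>
#include <exanic/fifo_tx.h>
#include <exanic/fifo_rx.h>
#include <stdio.h>
#include <sys/types.h>
#include <x86intrin.h>

int main(void)
{
    char frame[60] = { 0 };     /* dst/src MAC, EtherType, payload */
    char rx_buf[2048];

    exanic_t *nic = exanic_acquire_handle("exanic0");
    exanic_tx_t *tx = exanic_acquire_tx_buffer(nic, 0, 0);  /* port 0 */
    exanic_rx_t *rx = exanic_acquire_rx_buffer(nic, 1, 0);  /* port 1 */

    /* Time one round trip with the TSC; exanic_perf_test similarly
     * counts cycles and converts to ns using the measured CPU GHz */
    unsigned long long start = __rdtsc();
    exanic_transmit_frame(tx, frame, sizeof(frame));

    ssize_t len;
    do
        len = exanic_receive_frame(rx, rx_buf, sizeof(rx_buf), NULL);
    while (len <= 0);           /* spin until the frame loops back */

    printf("round trip: %llu cycles (%zd bytes)\n",
           __rdtsc() - start, len);

    exanic_release_rx_buffer(rx);
    exanic_release_tx_buffer(tx);
    exanic_release_handle(nic);
    return 0;
}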
It is possible to improve upon these values by making changes to the host's BIOS settings, kernel build configuration and kernel boot parameters. This guide also demonstrates the performance gains that are possible by taking advantage of latency-saving techniques in libexanic.
BIOS Configuration
Turn off hyperthreading, SpeedStep, power saving, and any other energy saving settings that may be enabled. These can cause poor latency while the CPU ramps back up to full speed. It is sometimes possible to identify whether this is happening by looking at /proc/cpuinfo. To do this run:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
stepping : 9
microcode : 0x1c
cpu MHz : 1600.00
In the above example the CPU is running in a low power state at 1600 MHz, well below its 3500 MHz nominal speed.
The exanic_perf_test application will by default transmit warmup frames before the benchmark actually begins, to bring the CPU out of a power saving state for optimal results.
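If you are writing your own libexanic benchmark, the same idea applies. A minimal sketch of such a warmup (reusing the tx/rx handles from the earlier example) might look like:

#include <exanic/fifo_tx.h>
#include <exanic/fifo_rx.h>

/* Sketch: run the send/receive path untimed before measuring, so the
 * CPU leaves power-saving states and caches/predictors are primed */
static void warmup(exanic_tx_t *tx, exanic_rx_t *rx,
                   const char *frame, size_t len, int count)
{
    char rx_buf[2048];
    for (int i = 0; i < count; i++)
    {
        exanic_transmit_frame(tx, frame, len);
        while (exanic_receive_frame(rx, rx_buf, sizeof(rx_buf), NULL) <= 0)
            ;   /* spin until the looped-back frame arrives */
    }
}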
Kernel Build Configuration
Ensure that your kernel is built with CONFIG_NO_HZ_FULL=y. This setting allows you to run the kernel in fully tickless mode on your performance cores; otherwise, timer ticks from the kernel will interrupt your process and cause latency spikes. To check whether your kernel supports full tickless behaviour, examine the kernel config file, e.g.:
cat /boot/config-4.10.11-100.fc24.x86_64 | grep NO_HZ_FULL
CONFIG_NO_HZ_FULL=y
Kernel Boot Configuration
When testing software performance, ensure that the kernel boot parameters are configured for realtime performance. This is usually done by modifying /etc/default/grub. The following is an example:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 intel_idle.max_cstate=0 irqaffinity=0,1 selinux=0 audit=0 tsc=reliable"
isolcpus=2,3 causes the scheduler to remove CPUs 2 and 3 from the scheduling pool
nohz_full=2,3 causes CPUs 2 and 3 to run in fully tickless mode
rcu_nocbs=2,3 offloads RCU callbacks from these cores
intel_idle.max_cstate=0 disables the intel_idle driver, falling back to acpi_idle
irqaffinity=0,1 sets the default IRQ affinity mask to cores 0 and 1
selinux=0 disables the SELinux extensions
audit=0 disables the kernel auditing system
tsc=reliable marks the TSC clocksource as reliable, which disables clocksource verification at runtime
After regenerating the boot image (e.g. with update-grub on Debian-based systems, or grub2-mkconfig -o /boot/grub2/grub.cfg on Red Hat-based systems) and rebooting, you can check that the parameters have taken effect by running the command below; you should see the parameters above:
$ cat /proc/cmdline
See Linux kernel parameters documentation for more information.
Hardware Configuration
Make sure the ExaNIC is plugged into a PCIe x8 Gen 3 slot and is running at 8.0 GT/s per lane (for systems that support PCIe Gen 3). This can be verified by running the lspci command and looking at the LnkSta (link status) output:
$ sudo lspci -d 1ce4:* -vvv |grep LnkSta:
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Make sure the ExaNIC is plugged into a PCIe slot directly connected to a CPU. The server or motherboard documentation should indicate which slots are connected to CPUs and which are connected to the chipset. If unsure, the following procedure can be used. First determine the bus number of the ExaNIC from lspci:
$ sudo lspci -d 1ce4:*
02:00.0 Ethernet controller: Exablaze ExaNIC X25
In this case, the bus number is 02. Now search for the device that has secondary=02 in the output of lspci -v, for example:
$ sudo lspci -v
...
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core
processor PCI Express Root Port (rev 09) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
...
For optimal performance, this should be a processor root port (in this case, “Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port”).
Software Settings
There are a number of configuration options for Linux that will improve realtime performance.
We recommend the following:
cpus="2 3"
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo 0 > /proc/sys/kernel/watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 3 > /proc/irq/default_smp_affinity
for irq in `ls /proc/irq/`; do echo 1 > /proc/irq/$irq/smp_affinity; done
for irq in `ls /proc/irq/`; do echo -n "$irq "; cat /proc/irq/$irq/smp_affinity_list; done
for cpu in $cpus
do
echo "performance" > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
echo 0 >/sys/devices/system/machinecheck/machinecheck$cpu/check_interval
done
The above code:
- disables Linux realtime throttling, which normally prevents realtime processes from monopolising the CPUs; disabling it lets a busy-polling benchmark run uninterrupted.
- disables the Linux watchdog timer, which is used to detect and recover from software faults.
- disables the NMI watchdog, a debugging feature for catching hardware lockups.
- sets the default IRQ affinity mask to 0b11 (3), so that only CPUs 0 and 1 handle interrupts.
- moves all existing interrupts off CPUs 2 and 3.
- sets the performance frequency governor and disables periodic machine check polling on the isolated CPUs.
See Improving Linux Realtime Properties for more information.
Improving performance with libexanic (raw frames)
When running a benchmark, pin the process to one of the isolated cores as follows:
$ sudo taskset -c 2 ./exanic_perf_test -d exanic0 -T frame -R frame
It is possible to obtain better results by using faster transmit and receive methods. Run exanic_perf_test with the following options:
$ sudo taskset -c 2 ./exanic_perf_test -d exanic0 -T preloaded -R chunk_inplace
CPU GHz = 3.31
Percentile 0.000 = 568ns
Percentile 1.000 = 597ns
Percentile 5.000 = 603ns
Percentile 10.000 = 607ns
Percentile 25.000 = 617ns
Percentile 50.000 = 629ns
Percentile 75.000 = 637ns
Percentile 90.000 = 642ns
Percentile 95.000 = 671ns
Percentile 99.000 = 750ns
Percentile 100.000 = 1209ns
This causes exanic_perf_test to use transmit preloading and in-place chunked receive. Running the exanic_perf_test application in this manner ensures the best results.
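Transmit preloading writes the frame into the card's TX buffer ahead of time, so that only a small doorbell write remains on the critical path; in-place chunked receive hands the application a pointer into the RX ring as soon as the first chunk of a frame arrives, avoiding a copy. A minimal sketch of the two techniques follows. It assumes the libexanic calls exanic_begin_transmit_frame(), exanic_end_transmit_frame(), exanic_receive_chunk_inplace() and exanic_receive_chunk_recheck(); consult the libexanic documentation for the exact semantics, as this is an illustration rather than the actual exanic_perf_test source.

#include <exanic/exanic.h>
#include <exanic/fifo_tx.h>
#include <exanic/fifo_rx.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

void fast_path(exanic_tx_t *tx, exanic_rx_t *rx,
               const char *frame, size_t frame_len)
{
    /* Preload: copy the frame into the TX buffer ahead of time... */
    char *slot = exanic_begin_transmit_frame(tx, frame_len);
    memcpy(slot, frame, frame_len);

    /* ...critical path starts here: only the doorbell write remains */
    exanic_end_transmit_frame(tx, frame_len);

    /* In-place receive: obtain a pointer into the RX ring as soon as
     * the first chunk of the frame arrives, with no copy */
    char *chunk;
    uint32_t chunk_id;
    int more_chunks;
    ssize_t len;
    do
        len = exanic_receive_chunk_inplace(rx, &chunk, &chunk_id,
                                           &more_chunks);
    while (len <= 0);

    /* ... act on the first bytes of the frame via 'chunk' ... */

    /* Because the NIC can overwrite the ring under a slow reader, the
     * chunk must be rechecked before its contents are trusted */
    if (!exanic_receive_chunk_recheck(rx, chunk_id))
        ;   /* data was overwritten: discard and retry */
}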
Benchmarking latency with exasock (UDP/TCP)
We use sockperf for testing because it is open source and well understood. Before testing UDP/TCP, ensure that raw frames are working correctly (as described above).
Start by downloading the sockperf source from the GitHub repository, then build the application by running:
$ ./autogen.sh
$ ./configure --prefix=
$ make
$ make install
Ensure that bypass-only and local loopback are disabled:
$ exanic-config exanic0:0 bypass-only off
$ exanic-config exanic0:0 local-loopback off
Set up a second machine with the configuration options from the steps above, and connect port 0 of each ExaNIC together using a short fibre or direct attach cable.
Then, set up IP addresses on both hosts (e.g. 10.10.0.2 on the server, matching the sockperf commands below) with:
$ ifconfig <interface> <ip-address> netmask <mask>
Run accelerated TCP/UDP sockperf on client and server:
server# exasock taskset -c 2 sockperf sr -i 10.10.0.2
client# exasock taskset -c 2 sockperf pp -i 10.10.0.2 -t5 -m 14
sockperf: == version #3.1-16.gitc6a0d0e3ab53 ==
sockperf: [Total Run] RunTime=5.450 sec; SentMessages=2882949;
sockperf: ====> avg-lat= 0.805 (std-dev=0.034)
sockperf: Summary: Latency is 0.805 usec
sockperf: ---> <MAX> observation = 3.062
sockperf: ---> percentile 99.999 = 1.174
sockperf: ---> percentile 99.900 = 1.002
sockperf: ---> percentile 99.000 = 0.931
sockperf: ---> percentile 90.000 = 0.838
sockperf: ---> percentile 75.000 = 0.817
sockperf: ---> percentile 50.000 = 0.802
sockperf: ---> percentile 25.000 = 0.784
sockperf: ---> <MIN> observation = 0.736
RESULT: ExaSock UDP 1/2RTT latency <810ns.
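Note that exasock requires no application changes: it transparently accelerates standard BSD sockets calls, which is why an unmodified sockperf binary can be run under it as above. For illustration, a hypothetical UDP ping client like the sketch below (assuming a UDP echo service listening on 10.10.0.2 port 7777, neither of which is part of the distribution) would be accelerated in exactly the same way, by launching it as exasock ./udp_ping.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical UDP ping client: sends a small payload to an echo
 * service and measures the round trip with clock_gettime(). Running
 * it under exasock accelerates the unmodified sockets calls:
 *   $ exasock taskset -c 2 ./udp_ping */
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7777);            /* hypothetical echo port */
    inet_pton(AF_INET, "10.10.0.2", &addr.sin_addr);
    connect(fd, (struct sockaddr *)&addr, sizeof(addr));

    char msg[14] = "ping", buf[64];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    send(fd, msg, sizeof(msg), 0);
    recv(fd, buf, sizeof(buf), 0);          /* wait for the echo */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
            + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: %ld ns\n", ns);
    close(fd);
    return 0;
}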