4. Performance Tuning
Here are some tips and tricks to get the best possible performance with Corundum.
First, it’s always a good idea to test with a commercial 100G NIC as a sanity check - if a commercial 100G NIC doesn’t run near 100G line rate, then Corundum will definitely have issues.
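For that sanity check, a simple iperf3 run between two hosts is usually enough; the address and stream count below are just examples, not values from the original text:

$ iperf3 -s                      # server side
$ iperf3 -c 10.0.0.1 -P 8        # client side: several parallel TCP streams toward the server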
Second, check the PCIe configuration with lspci. You’ll want to make sure that the card is actually running with the full PCIe bandwidth. Some x16 PCIe slots don’t have all of the lanes physically wired, and in many cases lanes can be switched between slots depending on which slots are populated. For example, two slots may share 16 lanes: if the second slot is empty, the first slot uses all 16 lanes, but if the second slot has a card installed, both slots run with 8 lanes.
This is what
lspci reports for Corundum in a gen 3 x16 configuration in a machine with an AMD EPYC 7302P CPU:
$ sudo lspci -d 1234:1001 -vvv
81:00.0 Ethernet controller: Device 1234:1001
        Subsystem: Silicom Denmark Device a00e
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 337
        NUMA node: 1
        IOMMU group: 13
        Region 0: Memory at 20020000000 (64-bit, prefetchable) [size=16M]
        Capabilities: Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: MSI: Enable+ Count=32/32 Maskable+ 64bit+
                Address: 00000000fee00000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
                        10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                        EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                        FRS- TPHComp- ExtTPHComp-
                        AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                        AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                        Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                        Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                        EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                        Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [1c0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: mqnic
The device driver also prints out some PCIe-related information when it attaches to the device, to save the trouble of running lspci:
[ 349.460705] mqnic 0000:81:00.0: mqnic PCI probe
[ 349.460712] mqnic 0000:81:00.0: Vendor: 0x1234
[ 349.460715] mqnic 0000:81:00.0: Device: 0x1001
[ 349.460717] mqnic 0000:81:00.0: Subsystem vendor: 0x1c2c
[ 349.460719] mqnic 0000:81:00.0: Subsystem device: 0xa00e
[ 349.460721] mqnic 0000:81:00.0: Class: 0x020000
[ 349.460723] mqnic 0000:81:00.0: PCI ID: 0000:81:00.0
[ 349.460730] mqnic 0000:81:00.0: Max payload size: 512 bytes
[ 349.460733] mqnic 0000:81:00.0: Max read request size: 512 bytes
[ 349.460735] mqnic 0000:81:00.0: Link capability: gen 3 x16
[ 349.460737] mqnic 0000:81:00.0: Link status: gen 3 x16
[ 349.460739] mqnic 0000:81:00.0: Relaxed ordering: enabled
[ 349.460740] mqnic 0000:81:00.0: Phantom functions: disabled
[ 349.460742] mqnic 0000:81:00.0: Extended tags: enabled
[ 349.460744] mqnic 0000:81:00.0: No snoop: enabled
[ 349.460745] mqnic 0000:81:00.0: NUMA node: 1
[ 349.460753] mqnic 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 349.460767] mqnic 0000:81:00.0: enabling device (0000 -> 0002)
[ 349.460802] mqnic 0000:81:00.0: Control BAR size: 16777216
[ 349.462723] mqnic 0000:81:00.0: Configured 32 IRQs
The first thing to check in the lspci output is LnkSta: Speed 8GT/s (ok), Width x16 (ok), indicating that the link is running at the maximum supported speed and maximum supported link width. If one of those is reported as (degraded), then further investigation is required. Even if (degraded) is not shown, compare LnkCap against LnkSta to see whether LnkSta reports lower values. In this case, LnkCap: Port #0, Speed 8GT/s, Width x16, which matches LnkSta. The output also reports MSI: Enable+ Count=32/32, indicating that all 32 MSI channels are active. Some motherboards do not fully implement MSI and limit devices to a single channel. Eventually, Corundum will migrate to MSI-X to mitigate this issue, as well as to support more interrupt channels. Also note MaxPayload 512 bytes; this is the largest value I have seen so far (on AMD EPYC), and most modern systems report 256 bytes. The larger the maximum payload size, the lower the PCIe protocol overhead.
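To spot-check just these fields without reading the full dump, the relevant lines can be filtered out of the lspci output; this is one possible filter, using the same device ID as the example above:

$ sudo lspci -d 1234:1001 -vvv | grep -E 'LnkCap:|LnkSta:|MaxPayload|MSI: Enable'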
Non-uniform memory access (NUMA) is another potential pitfall to be aware of. Systems with multiple CPU sockets will generally have at least one NUMA node associated with each socket, and some CPUs, like AMD EPYC, have internal NUMA nodes even with a single CPU. For best performance, any processes that access the NIC should be pinned to the NIC’s local NUMA node. If packets are stored in memory located on a different NUMA node, then there will be a performance penalty associated with the NIC accessing that memory via QPI, UPI, etc. Use
numactl -s to get a list of all physical CPUs and NUMA nodes on the system. If only one node is listed, then no binding is required. If you’re running a CPU with internal NUMA nodes such as AMD EPYC, make sure that BIOS is set up to expose the internal NUMA nodes. The NUMA node associated with the network interface is shown both in the
lspci and driver output (NUMA node: 1 in the example above), and it can also be read from sysfs (/sys/class/net/<dev>/device/numa_node). Use numactl -l -N <node> <command> to run programs on a specified NUMA node, for example, numactl -l -N 1 iperf3 -s. Recent versions of numactl also support automatically determining the NUMA node from the network device name, so in this case numactl -l -N netdev:enp129s0 iperf3 -s would run iperf3 on the NUMA node that
enp129s0 is associated with. It’s important to make sure that both the client and the server are run on the correct NUMA node, so it’s probably a better idea to manually run
iperf3 -s under
numactl than to run
iperf3 as a system service that could potentially run on any NUMA node. On Intel CPUs, PCM can be used to monitor QPI/UPI traffic to confirm that processes are bound to the correct NUMA nodes.
It’s also advisable to go into BIOS setup and disable any power-management features to get the system into its highest-performance state.
Notes on the performance evaluation for the FCCM paper: the servers used were Dell R540 machines with dual Intel Xeon 6138 CPUs, all memory channels populated, and MaxPayload 256 bytes. The machines have two NUMA nodes, so only one CPU was used for the performance evaluation to prevent traffic from traversing the UPI link. On these machines, a single iperf process would run at 20-30 Gbps with a 1500-byte MTU, or 40-50 Gbps with a 9000-byte MTU. The Corundum design for those tests was configured with 8192 TX queues and 256 RX queues.
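Since a single process tops out well below line rate, one way to drive more traffic (a sketch, not the exact procedure from the paper; the ports, address, and node number are placeholders) is to run several iperf3 instances in parallel on different ports, all pinned to the NIC's NUMA node:

$ for p in 5201 5202 5203 5204; do numactl -l -N 0 iperf3 -s -p $p & done              # server side: one iperf3 server per port
$ for p in 5201 5202 5203 5204; do numactl -l -N 0 iperf3 -c 10.0.0.1 -p $p -t 30 & done   # client side: one client per server port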