3. Debugging

3.1. The server rebooted when configuring the FPGA

This is a common problem caused by the server management subsystem (IPMI, iLO, iDRAC, or whatever your server manufacturer calls it). It detects the PCIe device falling off the bus when the FPGA is reset, and reacts to it. In some machines it just complains bitterly by logging an event and turning on an angry red LED. In other machines, it does that, and then immediately reboots the machine. Highly annoying. However, there is a simple solution: poke around in the PCIe configuration registers to disable PCIe fatal error reporting on the port that the device is connected to.

Run the pcie_disable_fatal_err.sh script before you configure the FPGA. Specify the PCIe address of the FPGA board as the argument (bus:device.function, in the form xy:00.0, as shown in lspci). You’ll have to do a warm reboot after loading the configuration if the PCIe BAR configuration has changed. This will most likely be the case when going from a stock flash image that brings up the PCIe link to Corundum, but not when going from Corundum to Corundum, unless something was changed in the PCIe IP core configuration. If the BAR configuration has not changed, then using the pcie_hot_reset.sh script to perform a hot reset of the device may be sufficient. The firmware update utility mqnic-fw includes the same functionality as pcie_disable_fatal_err.sh and pcie_hot_reset.sh, so if the card is already running a Corundum design, you can use mqnic-fw -t to disable fatal error reporting and reset the card before connecting to it via JTAG.
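To make the register poking less mysterious, here is a minimal sketch of what disabling fatal error reporting on a port can look like with setpci from pciutils. It is not the contents of pcie_disable_fatal_err.sh; the function names clear_bits, parent_port, and disable_fatal_err are hypothetical, and the sketch assumes the relevant bits are SERR# enable (bit 8 of the PCI command register) and fatal error reporting enable (bit 2 of the PCIe device control register).

```shell
#!/bin/sh
# Sketch only, under the assumptions stated above. Nothing here touches
# hardware until disable_fatal_err is called explicitly (as root).

# clear_bits VALUE MASK: both arguments are hex strings without "0x".
# Prints VALUE with every bit in MASK cleared, as four hex digits.
clear_bits() {
    printf '%04x\n' $(( 0x$1 & ~0x$2 & 0xffff ))
}

# parent_port PATH: PATH is the resolved sysfs path of the device, e.g. the
# output of: readlink -f /sys/bus/pci/devices/0000:xy:00.0
# Prints the address of the port directly above the device.
parent_port() {
    basename "$(dirname "$1")"
}

# disable_fatal_err PORT: hypothetical helper, requires root and setpci.
disable_fatal_err() {
    port=$1
    # PCI command register, bit 8 (0x0100): SERR# enable
    cmd=$(setpci -s "$port" COMMAND)
    setpci -s "$port" COMMAND="$(clear_bits "$cmd" 0100)"
    # PCIe device control register (capability offset 8), bit 2 (0x0004):
    # fatal error reporting enable
    ctrl=$(setpci -s "$port" CAP_EXP+8.w)
    setpci -s "$port" CAP_EXP+8.w="$(clear_bits "$ctrl" 0004)"
}
```

With these definitions, something like disable_fatal_err "$(parent_port "$(readlink -f /sys/bus/pci/devices/0000:xy:00.0)")" would apply the change to the port above the card. In practice the provided script or mqnic-fw -t is the better choice; the sketch only illustrates which bits are involved.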

3.3. Ping and iperf don’t work

Things to check, in no particular order:

  • Check that the interface is up (ip link set dev <interface> up)

  • Check that the interface has an IP address assigned (use ip addr, ip -c a, or ip -c -br a to check, and ip addr add 192.168.1.1/24 dev <interface> to set one)

  • The Corundum driver does not currently report the link status to the OS, so check for a link light (not all design variants implement this) and check the link status on the link partner (in the output of ip link, NO-CARRIER means the link is down at the PHY layer)

  • Try hot resetting the card with the link partner connected (this can clear up a possible RX DFE problem)

  • Check tcpdump for inbound traffic on both ends of the link (tcpdump -i <interface> -Q in) to see what is actually traversing the link. If the TX direction works but the RX direction does not, there is a high probability that it is a transceiver DFE issue, which may be fixable with a hot reset.

  • Check with mqnic-dump to see if there is anything stuck in transmit queues, transmit or receive completion queues, or event queues.
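The link status check from the list above can be scripted on the link partner. The following is a minimal sketch that reads the operstate and carrier attributes from sysfs; the function name link_report and its optional second argument (an alternative sysfs root, useful mainly for testing) are hypothetical. Since the Corundum driver does not currently report link status, the result is only meaningful on the link partner side.

```shell
#!/bin/sh
# link_report IFACE [SYSFS_ROOT]
# Prints the operational state and carrier flag of IFACE as seen by the kernel.
# SYSFS_ROOT defaults to /sys/class/net.
link_report() {
    dir="${2:-/sys/class/net}/$1"
    if [ ! -d "$dir" ]; then
        echo "$1: no such interface"
        return 1
    fi
    # Reading carrier fails while the interface is administratively down,
    # so fall back to "unknown" instead of printing an error.
    state=$(cat "$dir/operstate" 2>/dev/null || echo unknown)
    carrier=$(cat "$dir/carrier" 2>/dev/null || echo unknown)
    echo "$1: operstate=$state carrier=$carrier"
}
```

For example, link_report <interface> on the link partner shows carrier=0 when the link is down at the PHY layer, which corresponds to NO-CARRIER in the output of ip link.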

3.4. The device loses its IP address

This is not a Corundum issue. It is NetworkManager or a similar application causing trouble by attempting to run DHCP or similar on the interface. There are basically four options here: disable NetworkManager, configure NetworkManager to ignore the interface, use NetworkManager to configure the interface and assign the IP address you want, or use network namespaces to isolate the interface from NetworkManager. Unfortunately, if you have a board that doesn’t support persistent MAC addresses, it may not be possible to configure NetworkManager to deal with the interface correctly.
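As an illustration of the network namespace option, here is a minimal sketch that prints the iproute2 commands for moving an interface into its own namespace and configuring it there. The function name netns_cmds, the namespace name, and the address are placeholders rather than anything Corundum prescribes. The function only prints the commands, so they can be reviewed before being run as root.

```shell
#!/bin/sh
# netns_cmds IFACE NAMESPACE ADDRESS
# Prints (does not run) the commands that move IFACE into NAMESPACE,
# assign ADDRESS to it, and bring it up inside that namespace.
netns_cmds() {
    cat <<EOF
ip netns add $2
ip link set dev $1 netns $2
ip netns exec $2 ip addr add $3 dev $1
ip netns exec $2 ip link set dev $1 up
EOF
}

# For the "ignore the interface" option, NetworkManager's own CLI offers
#   nmcli device set <interface> managed no
# which may be simpler when a namespace is not needed.
```

A possible usage is netns_cmds <interface> ns0 192.168.1.1/24 | sudo sh, after which commands such as ping and iperf have to be run via ip netns exec ns0 so that they see the interface.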