One of the key differences between an ExaNIC and most other network cards is that the ExaNIC is optimised for the lowest possible latency. This focus on latency comes with a few design decisions that differ from conventional NICs.
Receive architecture
To keep latency low, ExaNIC cards do very little buffering: data received from the network is sent through to host memory as soon as possible. This could be considered an extreme form of "cut-through networking".
In host memory, the receive (RX) buffers are implemented as 2 MiB ring buffers. There are up to 33 ring buffers per port: the default buffer (0) and up to 32 flow steering buffers.
As a packet comes off the wire, data is sent to the host in 'chunks' of 128 bytes, consisting of 120 bytes of data and 8 bytes of metadata. (This size was chosen as a trade-off between bandwidth and latency, and is assumed throughout the current architecture; it is not user-configurable.) Software normally waits for a whole frame to arrive before starting processing; however, for some applications it may be useful to process frames chunk-by-chunk, which is possible with the ExaNIC API.
If the hardware gets more than 2 megabytes ahead of software, packets may be lost (this is described in the API as a "software overflow", or informally as "getting lapped"). The lack of flow control allows multiple applications to receive packets from the same receive buffers with no performance penalty. However, it means that keeping up with the packet rate matters more than for conventional cards.
It is possible to achieve loss-free 10G line rate reception from a single buffer if attention is paid to CPU allocation (e.g. pinning the receive thread to a free CPU) and if the work done by the receive thread is kept to a minimum. Larger receive buffers are a common request and may be implemented in the future for capture applications. Note however that if an application is lagging by 2 MiB of data, then the packets it is receiving are many milliseconds old. Thus larger buffers are not a good solution for low-latency applications. Drops in a low-latency application are an indication of application slowness or hiccups that should be investigated.
Flow steering and flow hashing can be used to distribute load to multiple buffers and increase the effective buffer space while still maintaining low latency for important flows. Note that, for latency and PCI Express bandwidth reasons, the ExaNIC only ever delivers each packet to one buffer. If one application needs to listen to flows A and B and another needs to listen to flows B and C, then either all the flows need to be directed to one buffer, or each application needs to listen on two buffers.
Transmit architecture
Unlike the receive buffers which are located in host RAM, the transmit (TX) buffers are located on the ExaNIC itself. They can be written directly by software via a memory mapping (but not read). To transmit a packet, software writes the packet data into the transmit buffer (together with a short header), and then writes a register on the ExaNIC to start packet transmission.
The amount of transmit buffer memory varies from card to card, e.g.:
- X25: 128KB per port (256KB total). A beta firmware that provides 1MB per port (2MB total) is available on request; submit a support ticket to request access to the beta image.
- X10/GM/HPT: 128KB per port (256KB total)
- X40/V5P/X100: 64KB per port (512KB total)
- X2/X4: 32KB per port (64KB/128KB total)
Applications can request parts of the transmit buffer, at a minimum granularity of 4KiB. The ExaNIC driver usually also acquires one 4KiB chunk for packets being sent from the kernel, although this can be disabled by setting the interface to bypass-only mode with exanic-config. (Note that exasock will not function correctly in this mode, since it relies on the kernel driver for TCP slow path processing; bypass-only mode is only useful when using libexanic exclusively.)
The packet header written to the TX buffer together with each packet contains a feedback slot index and feedback ID. After sending the packet, the card writes the given ID to the given slot (within a feedback structure in host memory). This allows software to determine when packet transmission is complete.