Keysight AI Fabric RoCEv2 Test Solution

Training large AI models requires thousands of AI-accelerated compute nodes working in parallel and using collective communication operations to share data and results. In many cases, the time and cost to complete a training task are determined by the performance of the network that carries these collective communications. Keysight’s AI Fabric Test Solution recreates these collective communication patterns so engineers can optimize network performance under AI training workloads.
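
To make the idea of a collective communication pattern concrete, the following minimal sketch enumerates the flows of a ring all-reduce, one widely used collective algorithm. The choice of ring all-reduce and the function names are assumptions made for illustration only; they are not taken from the Keysight product.

```python
def ring_allreduce_flows(num_nodes, message_bytes):
    """List (step, src, dst, bytes) tuples for a ring all-reduce.

    Illustrative assumption: the message is split into num_nodes equal
    chunks, and in each of the 2 * (num_nodes - 1) steps every node sends
    one chunk to its right-hand neighbor in the ring.
    """
    chunk = message_bytes / num_nodes
    flows = []
    for step in range(2 * (num_nodes - 1)):
        for src in range(num_nodes):
            dst = (src + 1) % num_nodes  # right-hand neighbor in the ring
            flows.append((step, src, dst, chunk))
    return flows


def bytes_sent_per_node(num_nodes, message_bytes):
    """Total bytes each node transmits during one ring all-reduce."""
    return 2 * (num_nodes - 1) * message_bytes / num_nodes
```

Every node transmits in every step, so the slowest link in the ring sets the pace of the whole operation. This is why the completion time of a collective is so sensitive to fabric performance.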

To provide the most timely and cost-effective platform for collective communications, the underlying network must deliver high throughput, low latency, and lossless traffic. Equal-cost multipath (ECMP) hashing, Priority Flow Control (PFC) deadlock, and end-to-end communication latency all need careful attention. Validating and benchmarking the AI network fabric therefore requires exercising its RoCE congestion control and PFC behavior so that buffer management can be optimized.
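
The ECMP concern can be illustrated with a small sketch. It assumes a switch that picks an uplink by hashing the flow's addresses and UDP ports. RoCEv2 always uses UDP destination port 4791, so the source port is typically the main per-flow source of entropy. The CRC32 hash and the function names are hypothetical stand-ins; real switches use vendor-specific hash functions.

```python
import zlib
from collections import Counter

ROCEV2_UDP_DST_PORT = 4791  # IANA-assigned UDP port for RoCEv2


def pick_uplink(src_ip, dst_ip, src_port, num_links):
    """Choose an uplink index by hashing the flow key (CRC32 is a stand-in)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{ROCEV2_UDP_DST_PORT}|udp".encode()
    return zlib.crc32(key) % num_links


def link_load(flows, num_links):
    """Count how many flows land on each uplink."""
    load = Counter({link: 0 for link in range(num_links)})
    for src_ip, dst_ip, src_port in flows:
        load[pick_uplink(src_ip, dst_ip, src_port, num_links)] += 1
    return load


# Eight hypothetical long-lived flows spread over four uplinks.
example_flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i) for i in range(8)]
example_load = link_load(example_flows, 4)
```

AI training traffic tends to consist of a small number of long-lived, high-rate flows, so there is little statistical averaging. If two such flows hash to the same uplink, that link can saturate while others sit idle.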

The Keysight AI Fabric Test Solution includes a high-density, cost-effective test hardware platform (AresONE-M 800GE or AresONE-S 400GE) and the IxNetwork test application. IxNetwork models the AI training workload running on the tester’s target topology. The system generates the traffic that results from collective communications between simulated endpoints. This includes emulating Queue-Pair (QP) connections and flows, generating congestion notifications, performing dynamic rate control based on Data Center Quantized Congestion Notification (DCQCN), and providing the flexibility to test throughput, buffer management, and ECMP hashing. Together, the hardware and software allow engineers to optimize the fabric’s performance under the stress of the target AI workload and its collective communication patterns.
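
As a rough illustration of DCQCN-based dynamic rate control, the sketch below models a sender that cuts its rate when a congestion notification packet (CNP) arrives and recovers toward its previous rate when none arrive. It follows the rate-decrease and fast-recovery rules from the published DCQCN algorithm in simplified form. It omits the timers, byte counters, and additive and hyper increase phases, and it does not describe how IxNetwork implements the feature. The class and method names are hypothetical.

```python
class DcqcnSender:
    """Simplified DCQCN reaction point (rate decrease and fast recovery only)."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.line_rate = line_rate_gbps
        self.current = line_rate_gbps  # current sending rate
        self.target = line_rate_gbps   # rate to recover toward
        self.alpha = 1.0               # running estimate of congestion severity
        self.g = g                     # gain used to update alpha

    def on_cnp(self):
        """A CNP arrived: remember the old rate, cut the rate, raise alpha."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        """No CNP in the last update window: decay alpha, recover halfway."""
        self.alpha *= 1 - self.g
        self.current = min(self.line_rate, (self.target + self.current) / 2)


sender = DcqcnSender(line_rate_gbps=400)
sender.on_cnp()             # rate is halved because alpha starts at 1.0
rate_after_cnp = sender.current
sender.on_quiet_period()    # rate moves halfway back toward the target
rate_after_recovery = sender.current
```

A tester that emulates this behavior on every QP can show how quickly a fabric's congestion marking settles to a stable rate, and whether PFC has to step in before it does.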