Finding hard-to-find data plane bugs with PTA
A collaboration between Dr Pietro Bressana (USI), Prof Robert Soulé (Yale) and Prof Noa Zilberman (Oxford)
Everyone hates bugs, and hardware bugs can be especially mischievous. Not only can these bugs be very hard to find, but they can have catastrophic results if your network device is already part of a production system. A bug in an ASIC can cost millions of dollars to fix, and even more in reputation and market share.
There are many reasons for bugs, ranging from errors in the design that lead to functional and performance bugs, through compiler and architecture bugs, to bugs caused by under-specification.
In the PTA project, we develop a framework that allows users to find such hidden bugs in network hardware. PTA is a Portable Test Architecture that allows a user to easily design tests and port them between devices. The framework is programmable, using P4, and configurable, so a single P4 program allows users to run many different tests on their device. This is especially important in programmable network devices, such as programmable switches or SmartNICs, where the same device can be used to run many different data plane programs. In the field, data plane programs can change all the time, and a vendor cannot validate the device's functionality as they could with fixed-function network devices.
The PTA architecture, shown in the figure below, combines a programmable test packet generator and a programmable checker, both controlled by an external host. Beyond directly writing tests, PTA is also integrated with P4v, a well-known P4 verification tool. Tests written in P4v during the design stage can be automatically translated and run by PTA. This ensures that the validation stage runs the same tests as the verification stage (and more!).
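To make the generator/checker split concrete, here is a minimal software sketch of the idea. It is not PTA's code or API: PTA itself is written in P4 and runs on hardware, and every name below (HostConfig, generate, check, run_tests, byte_swapping_device) is a hypothetical stand-in chosen for illustration. The sketch assumes the device under test can be modelled as a function from bytes to bytes.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List

# NOTE: illustrative model only. None of these names come from PTA.


@dataclass
class HostConfig:
    """Hypothetical parameters a host might push to the generator."""
    payload_sizes: List[int]
    first_byte: int = 0xAB


def generate(config: HostConfig) -> Iterator[bytes]:
    """Generator role: emit one test packet per configured size."""
    for size in config.payload_sizes:
        # An incrementing pattern makes reordered bytes easy to spot.
        yield bytes((config.first_byte + i) % 256 for i in range(size))


def check(sent: bytes, received: bytes,
          expected: Callable[[bytes], bytes]) -> bool:
    """Checker role: compare device output with the expected output."""
    return received == expected(sent)


def run_tests(config: HostConfig,
              device: Callable[[bytes], bytes],
              expected: Callable[[bytes], bytes]) -> List[int]:
    """Host role: drive generator and checker, return failing sizes."""
    failures = []
    for packet in generate(config):
        if not check(packet, device(packet), expected):
            failures.append(len(packet))
    return failures


def identity(packet: bytes) -> bytes:
    """Expected behaviour in this sketch: forward the packet unchanged."""
    return packet


def byte_swapping_device(packet: bytes) -> bytes:
    """Stand-in for a faulty device that swaps the first two bytes."""
    if len(packet) < 2:
        return packet
    return packet[1:2] + packet[0:1] + packet[2:]


if __name__ == "__main__":
    config = HostConfig(payload_sizes=[1, 2, 64])
    print(run_tests(config, identity, identity))
    print(run_tests(config, byte_swapping_device, identity))
```

Because the test parameters live in the configuration rather than in the generator or checker, the same three functions can run many different tests, which mirrors the role of a single configurable P4 program in the description above.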
PTA was implemented and tested on two platforms: Intel Barefoot Tofino (ASIC) and NetFPGA SUME (FPGA). Using PTA, we have found a range of bugs on both platforms, spanning the full range of bug types. Examples include compiler bugs (e.g., byte swapping), under-specification (e.g., the parser's reject state), performance bugs (e.g., throughput loss for certain packet sizes) and more. We even found a NetFPGA architecture bug that had been hidden for over 10 years and across multiple generations of the platform!
The PTA paper was published at ACM CoNEXT 2020, and we are making PTA open to the community.