When visiting the Supercomputing conference this year, there were plenty of big GPU systems on display for machine learning. A large number were geared towards the heavy duty cards for training, but also a number of inference-only systems cropped up. In heeding the NVIDIA CEO’s mantra of ‘the more you buy, the more you save’, Supermicro showed me one of their scalable inference systems based around the newly released NVIDIA Tesla T4 inference GPU.

The big server, based on two Xeon Scalable processors and 24 memory slots, has access to 20 PCIe slots, each running at PCIe 3.0 x16, for a total of 320 PCIe lanes. This is achieved using Broadcom 9797-series PLX chips, splitting each PCIe x16 root complex from each processor into five x16 links, each of which can take a T4 accelerator.

These accelerators are full-length but half-height, so even though the chassis supports full height cards, Supermicro told me that the whole system is designed to be modular, should the customer wants a different PCIe layout or a different CPU design, it could potentially be arranged.

The reason why Supermicro says that this design is scalable is that they expect to deploy it to customers with varying numbers of inference cards equipped. It was explained that some customers only want a development platform to start with, and may only request four cards to begin. Over time, as their needs change, or as new hardware is developed, new GPUs or new accelerators can be added, especially if a combination requirement develops. Each of the slots can run PCIe 3.0 x16 and up to 75W, which is the sweet spot for a lot of inference deployments.

With only one card in, it did look a little empty (!)

There looks like plenty of cooling. Those are some big Delta fans, so this thing is going to be loud.

There are technically 21 PCIe slots in this configuration. The slot in the middle is for an additional non-inference card, either a low powered FPGA for offload or a custom networking card.