We largely avoided vGPU on purpose. Instead, we focused on whole-GPU and NVSwitch-partitioned passthrough (Shared NVSwitch Multitenancy Mode), which is a better fit for the workloads we care about.
We haven't looked deeply at inter-machine communication yet. NVLink/NVSwitch (which this post focuses on) are intra-node, so InfiniBand is mostly orthogonal, I think; it comes down to NIC passthrough, NUMA/PCIe placement, and validating RDMA inside the VM.
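On the placement part, the basic sanity check is making sure the NIC and the GPUs it serves hang off the same NUMA node. A rough sketch of that check, just reading sysfs on the host (the PCI addresses here are made up):

    from pathlib import Path

    def numa_node(pci_addr: str) -> int:
        # /sys/bus/pci/devices/<addr>/numa_node reports -1 if the
        # platform doesn't expose an affinity for that device.
        return int(Path(f"/sys/bus/pci/devices/{pci_addr}/numa_node").read_text())

    gpu = "0000:17:00.0"   # hypothetical GPU address
    nic = "0000:18:00.0"   # hypothetical IB NIC address

    if numa_node(gpu) != numa_node(nic):
        print("warning: GPU and NIC are on different NUMA nodes")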
Thanks for the comment! You're right that a lot of the mechanics apply more generally. On point (3) specifically: we handle this by allocating at the IOMMU-group level rather than individual devices. Our allocator selects an IOMMU group and passes through all devices in that group (e.g., GPU video + audio), which avoids the partial-passthrough wonkiness you mentioned. For reference: https://github.com/ubicloud/ubicloud/blob/main/scheduling/al...
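For anyone curious what "IOMMU-group level" means concretely, here's a minimal sketch of the idea (not our actual allocator, which is in the repo linked above): enumerate groups from sysfs and treat each group, not each device, as the unit you pass through.

    from pathlib import Path

    def iommu_groups() -> dict[str, list[str]]:
        # Each directory under /sys/kernel/iommu_groups is one group;
        # its devices/ subdirectory lists the PCI functions in it.
        groups = {}
        for group in Path("/sys/kernel/iommu_groups").iterdir():
            groups[group.name] = [d.name for d in (group / "devices").iterdir()]
        return groups

    # Pass through *every* device in the chosen group, never a subset,
    # so companion functions (GPU audio, USB, etc.) always move together.
    for group_id, devices in iommu_groups().items():
        print(group_id, devices)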
Fabric Manager itself is not open source. It's NVIDIA-provided software, and today it's required to bring up and manage the NVLink/NVSwitch fabric on HGX systems. What we meant by "open" is that everything around it - the hypervisor, our control plane logic, partition selection, host configuration, etc. - is implemented in the open and available in our repos. You're right that this isn't a fully open GPU stack.
On isolation: in Shared NVSwitch Multitenancy mode, isolation is enforced at multiple layers. Fabric Manager programs the NVSwitch routing tables so GPUs in different partitions cannot exchange NVLink traffic, and each VM receives exclusive ownership of its assigned GPUs via VFIO passthrough. Large providers apply additional hardening and operational controls beyond what we describe here. We're not claiming this is equivalent to AWS's internal threat model, but it does rely on NVIDIA's documented isolation mechanisms.
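To make the VFIO side concrete, this is roughly the host-side step that hands a GPU to vfio-pci so one VM gets exclusive ownership (hedged sketch: the NVSwitch routing side goes through NVIDIA's Fabric Manager tooling and isn't shown, the PCI address is made up, and this needs root):

    from pathlib import Path

    def bind_to_vfio(pci_addr: str) -> None:
        dev = Path(f"/sys/bus/pci/devices/{pci_addr}")
        driver = dev / "driver"
        if driver.exists():
            (driver / "unbind").write_text(pci_addr)      # detach the current driver
        (dev / "driver_override").write_text("vfio-pci")  # force vfio-pci on next probe
        Path("/sys/bus/pci/drivers_probe").write_text(pci_addr)

    bind_to_vfio("0000:17:00.0")  # hypothetical GPU address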
Thanks! I haven't looked deeply into slicing up a single GPU. My understanding is that vGPU (which we briefly mention in the post) can partition memory but time-shares compute, while MIG is the only mechanism that provides partitioning of both SMs and memory bandwidth within a single GPU.
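For completeness, MIG partitioning is driven through nvidia-smi; something like the following (from memory, so double-check the profile IDs against `nvidia-smi mig -lgip` on your card):

    import subprocess

    # Enable MIG mode on GPU 0 (typically takes effect after a GPU reset).
    subprocess.run(["nvidia-smi", "-i", "0", "-mig", "1"], check=True)

    # Carve GPU 0 into two GPU instances plus matching compute instances.
    # Profile IDs differ by card; "9,9" is just an example.
    subprocess.run(["nvidia-smi", "mig", "-i", "0", "-cgi", "9,9", "-C"], check=True)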
Would it be possible to implement "virtual memory" for a GPU this way? Let's say you have GPUs at 30% utilization but memory-limited. Could you run two workloads by offloading GPU memory when it's not in use?
Once you oversubscribe GPU memory, performance usually collapses. Frameworks like vLLM can explicitly offload things like the KV cache to CPU memory, but that's an application-level tradeoff, not transparent GPU virtual memory.
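To make "application-level tradeoff" concrete, here's a toy sketch of explicit offloading between two workloads (made-up models, plain PyTorch, not how vLLM does it internally). The PCIe copy in swap() is exactly the cost that makes naive oversubscription collapse:

    import torch

    model_a = torch.nn.Linear(4096, 4096).cuda()
    model_b = torch.nn.Linear(4096, 4096)          # starts parked in CPU RAM

    def swap(active, parked):
        parked.cpu()      # free the parked model's GPU memory
        active.cuda()     # pay the PCIe transfer to bring the active one in
        return active

    x = torch.randn(8, 4096, device="cuda")
    swap(model_a, model_b); y = model_a(x)
    swap(model_b, model_a); z = model_b(x)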
In this article, we're primarily concerned with whole-GPU or multi-GPU partitions that preserve NVLink bandwidth, rather than finer-grained fractional sharing of a single GPU.