Splitting GPU resources with SR-IOV across multiple containers

Hi,

I am currently building a server that will run Docker, with containers that each need to use part of a GPU via SR-IOV.

Many questions are still open, but can anyone describe their experience with this kind of setup?

I know NVIDIA basically allows it, but only with very expensive hardware setups, and I am not after that currently.

I understand that AMD has the MI100 and later series for this kind of use, and I would appreciate anyone sharing their experience with that kind of setup.

I also know an SR-IOV plugin exists for Docker, but does it support GPUs? Its intended use case is fast networking, and I am unsure whether the same plugin will work to enable GPU splitting across containers.

As for AMD GPUs, I know SR-IOV is not behind a paywall, and the MI100 seems to have better driver support for this use case than the earlier MI25 and MI50 series. For those earlier cards, AMD gave exclusive access to support to Alibaba and Microsoft but did not make the solution public.
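
For context, here is a minimal sketch of how I plan to check whether a card actually advertises SR-IOV on the host. It assumes a Linux host where the kernel exposes the standard `sriov_totalvfs` and `sriov_numvfs` attributes under `/sys/bus/pci/devices`. The function name `list_sriov_devices` is just my own helper, not part of any driver or plugin, and it says nothing about whether the GPU driver can really use the virtual functions.

```python
from pathlib import Path


def list_sriov_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return (pci_address, total_vfs, enabled_vfs) for each PCI device
    that advertises SR-IOV through sysfs. Hypothetical helper for illustration."""
    results = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return results  # not a Linux host, or sysfs is not mounted

    for dev in sorted(root.iterdir()):
        total_file = dev / "sriov_totalvfs"
        if not total_file.is_file():
            continue  # this device does not advertise SR-IOV

        # Maximum number of virtual functions the device supports.
        total = int(total_file.read_text().strip())

        # Number of virtual functions currently enabled (0 if none).
        num_file = dev / "sriov_numvfs"
        enabled = int(num_file.read_text().strip()) if num_file.is_file() else 0

        results.append((dev.name, total, enabled))
    return results


if __name__ == "__main__":
    for addr, total, enabled in list_sriov_devices():
        print(f"{addr}: {enabled}/{total} virtual functions enabled")
```

If a GPU shows up here with a non-zero total, that only tells me the PCI capability is present. Whether the resulting virtual functions can be passed into containers is exactly the part I am unsure about.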

I appreciate any leads and input on the situation.

Thanks in advance!