I need to efficiently manage multiple NVIDIA driver versions, each supporting a different CUDA environment for different TensorFlow versions, on a server with four NVIDIA GPUs.
I work on several machine-learning projects, and each one needs a different CUDA toolkit with a matching NVIDIA driver. Our current approach of installing and removing drivers per project is unstable, and less experienced engineers keep breaking the setup when they install new drivers or remove old ones.
I’m looking for a more robust solution that maintains performance and compatibility across all projects. Can you recommend a best practice for handling this scenario? Also, would the NVIDIA GPU Operator, with Docker containers managed by Kubernetes, help in this case?
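
For reference, this is the kind of sanity check I run inside each project's environment to confirm which CUDA version the installed TensorFlow build expects and which GPUs it can actually see (a minimal sketch; the `cuda_version`/`cudnn_version` fields are only present in GPU-enabled TensorFlow 2.x builds):

```python
# Sanity check for one project's environment/container:
# prints the TensorFlow version, the CUDA/cuDNN versions this
# build was compiled against, and the GPUs visible to it.
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("TensorFlow:", tf.__version__)
print("Built against CUDA:", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```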