We have a few cloud machines for deep learning training. Each node is a physical machine with 8 V100 cards, and we deployed OpenPAI on them. I tested PyTorch on OpenPAI, and training was much slower than I expected. After investigating, I found that the bottleneck is related to pinned memory usage during training.
As you may know, the DataLoader class in PyTorch has a pin_memory parameter that controls whether pinned (page-locked) memory is used to speed up tensor transfers between host memory and device memory. However, with pin_memory set to True, training is about 10x slower than with False. I also ran the same Docker image with the same code on the same machine after logging in over SSH, using nvidia-docker to run the image. With pin_memory set to False, the training speed matches the False run launched by OpenPAI. With True, it is a bit faster than with False, as expected.
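To make the comparison reproducible, here is a minimal sketch of how the pin_memory effect could be timed in isolation. It is not my actual training code: the synthetic tensor shape, the batch size, and the helper names time_transfers and slowdown are all assumptions chosen for illustration, and it assumes a CUDA device is visible in the container.

```python
import time


def time_transfers(pin_memory: bool, num_batches: int = 20) -> float:
    """Return seconds spent loading batches and copying them to the GPU."""
    # Imported lazily so the file can be loaded on a machine without PyTorch.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Synthetic data (an assumption): batches of 32 images, 3 x 128 x 128 floats.
    data = torch.randn(32 * num_batches, 3, 128, 128)
    loader = DataLoader(TensorDataset(data), batch_size=32, pin_memory=pin_memory)

    device = torch.device("cuda")
    torch.cuda.synchronize()  # make sure CUDA is initialized before timing
    start = time.perf_counter()
    for (batch,) in loader:
        # non_blocking only overlaps the copy when the source tensor is pinned
        batch = batch.to(device, non_blocking=pin_memory)
    torch.cuda.synchronize()  # wait for all pending copies to finish
    return time.perf_counter() - start


def slowdown(pinned_seconds: float, pageable_seconds: float) -> float:
    """Ratio > 1 means the pinned run was slower than the pageable run."""
    return pinned_seconds / pageable_seconds
```

Running time_transfers(False) and time_transfers(True) inside both containers and comparing them with slowdown should show whether the roughly 10x gap appears without the rest of the training loop.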
The only difference between the two tests is the docker launch parameters. I checked the log of the OpenPAI job, and OpenPAI does not seem to use nvidia-docker. It launched the container with
docker run --runtime=runc --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/fuse ...
whereas I ran the container on the machine with
docker run --runtime=nvidia ...
I wonder whether this difference in docker run parameters causes the poor pinned memory performance.
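For reference, the visible difference between the two commands can be listed mechanically. This is only a sketch over the parts of the commands quoted above; the "..." stands for arguments I have not reproduced here, so the real difference may be larger. The helper name flag_diff is made up for this example.

```python
import shlex

# Commands as quoted in this post; "..." marks arguments elided there.
OPENPAI_CMD = (
    "docker run --runtime=runc --device=/dev/nvidiactl "
    "--device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/fuse ..."
)
MANUAL_CMD = "docker run --runtime=nvidia ..."


def flag_diff(cmd_a: str, cmd_b: str):
    """Return (arguments only in cmd_a, arguments only in cmd_b), sorted."""
    args_a = set(shlex.split(cmd_a))
    args_b = set(shlex.split(cmd_b))
    return sorted(args_a - args_b), sorted(args_b - args_a)


only_openpai, only_manual = flag_diff(OPENPAI_CMD, MANUAL_CMD)
print("only in the OpenPAI launch:", only_openpai)
print("only in the manual launch: ", only_manual)
```

In the quoted parts, the OpenPAI launch uses the default runc runtime and passes the NVIDIA device nodes explicitly, while the manual launch relies on the nvidia runtime to set up GPU access.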
I have also posted an issue on the OpenPAI GitHub page. Here I want to focus on the Docker side of the question. The nodes have Docker 18.09.6, build 481bc77, installed. What is the difference between the two launch methods mentioned above? Is there any chance that this specific version of the Docker engine has a defect in this area?
Any advice would be appreciated. Thanks.