Slow I/O performance inside container while writing a CSV file as compared with the host server

Hi Everyone,

Before I explain my problem, I should mention that I am new to Docker and have only recently started working with it, so please bear with me if I sound a bit naive.


I currently have a Python script running inside a Docker container. The script queries data from a database, loads the result into a pandas DataFrame, and then writes a CSV file to a folder on an NFS mount. The NFS share is mounted at runtime.

Server Properties:
fc8tdtsr@fc8tdbitmapconvs08]$ uname -r

Docker version:
fc8tdtsr@fc8tdbitmapconvs08]$ docker --version
Docker version 18.03.1-ce, build 9ee9f40

Docker Info
fc8tdtsr@fc8tdbitmapconvs08]$ docker info
Containers: 46
Running: 46
Paused: 0
Stopped: 0
Images: 459
Server Version: 18.03.1-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
Profile: default
Kernel Version: 3.10.0-957.5.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 128
Total Memory: 1.968TiB
Name: fc8tdbitmapconvs08
Docker Root Dir: /local/docker-data-root
Debug Mode (client): false
Debug Mode (server): false
HTTPS Proxy:
Experimental: false
Insecure Registries:
Live Restore Enabled: false

Docker Command that I run:

docker run -d --restart=always --volume-driver=nfs -v /td-bmp:/td-bmp:rw \
  --memory=128g --memory-reservation=32g --cpu-shares=28 \
  --name=worker-1 daas-worker:latest --spark-name daas1

My Problem

CSV generation takes much longer when I run the script inside the Docker container than when I run it on the host.

For example, to generate a 5 GB CSV file, the host takes 30 minutes on average (including querying the database and writing the CSV file). The same job inside the container takes almost 1.5 hours, which is an hour longer than on the host.

From what I understand, the difference shouldn’t be that large. I expect some overhead, but an extra hour seems excessive. Am I doing something wrong?

Please do let me know if you need anything else from me.
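
For anyone trying to narrow down a slowdown like this, it can help to time the query and the CSV write separately, on the host and in the container. Below is a minimal sketch of that idea, not the original script: the in-memory SQLite table stands in for the real database, the temporary directory stands in for the NFS folder, and the names `timed`, `measurements`, and `export.csv` are made up for illustration.

```python
import os
import sqlite3
import tempfile
import time
from contextlib import contextmanager

import pandas as pd

timings = {}


@contextmanager
def timed(stage):
    """Record the wall-clock seconds spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Assumption: an in-memory SQLite table stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(i, i * 0.5) for i in range(10_000)],
)
conn.commit()

# Assumption: a temporary local directory stands in for the NFS mount.
out_path = os.path.join(tempfile.mkdtemp(), "export.csv")

with timed("query"):
    data = pd.read_sql_query("SELECT id, value FROM measurements", conn)

with timed("write_csv"):
    data.to_csv(out_path, index=False)

conn.close()

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f} s")
```

Running the same timing code in both environments should show whether the extra time is spent in the query or in the write step.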


I’m having exactly the same problem!

Docker version 18.09.2, build 6247962
Python 3.7
pandas 0.24.2

I have a minimal working example that writes an 8 GiB CSV file to disk using pandas. It takes 15 minutes on the host and 1.5 hours in Docker.

Hi @meownoid,

I think I have a quick fix for this. It doesn’t seem to be a Docker problem. All you need to do is pass chunksize to DataFrame.to_csv(), and you should see a drastic improvement in the speed at which the CSV file is written.

Here is what I did:
data.to_csv(filename, index=False, header=True, chunksize=200000, encoding='utf-8')

By doing so I was able to generate a 5GB csv file within 6 mins.
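
To check whether chunksize makes a difference in a given environment, a small benchmark along these lines may help. It is only a sketch: the synthetic DataFrame, the helper `time_to_csv`, and the chunk size of 20,000 rows are illustrative assumptions, and on a frame this small the timing difference may be negligible or even reversed.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Assumption: a small synthetic frame; the real data came from a database.
rng = np.random.default_rng(0)
n_rows = 50_000
data = pd.DataFrame({
    "id": np.arange(n_rows),
    "value": rng.random(n_rows),
    "label": rng.choice(["a", "b", "c"], size=n_rows),
})

out_dir = tempfile.mkdtemp()


def time_to_csv(frame, path, **kwargs):
    """Write `frame` to `path` and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    frame.to_csv(path, index=False, header=True, encoding="utf-8", **kwargs)
    return time.perf_counter() - start


default_path = os.path.join(out_dir, "default.csv")
chunked_path = os.path.join(out_dir, "chunked.csv")

t_default = time_to_csv(data, default_path)
t_chunked = time_to_csv(data, chunked_path, chunksize=20_000)

print(f"default chunksize: {t_default:.3f} s")
print(f"chunksize=20000:   {t_chunked:.3f} s")

# chunksize only controls how many rows are written per batch,
# so both files should contain exactly the same bytes.
with open(default_path, "rb") as f1, open(chunked_path, "rb") as f2:
    same_bytes = f1.read() == f2.read()
print("identical output:", same_bytes)
```

Running it once on the host and once in the container, with the output directory pointed at the real target folder, gives a like-for-like comparison.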

You can improve this further by writing out a compressed file:

data.to_csv(filename + '.zip', index=False, header=True, chunksize=200000, compression='zip', encoding='utf-8')

With this I was able to generate the 5 GB file within 1 minute, which is super quick!
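
As a quick sanity check on the compressed variant, the sketch below writes the same small frame as plain CSV and as a zip archive, compares the file sizes, and reads the archive back. The frame, the paths, and the file name `export.csv` are assumptions for illustration, and it assumes a reasonably recent pandas.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Assumption: a tiny illustrative frame in place of the real 5 GB data set.
data = pd.DataFrame({
    "id": range(1_000),
    "value": [i * 0.5 for i in range(1_000)],
})

out_dir = tempfile.mkdtemp()
plain_path = os.path.join(out_dir, "export.csv")
zip_path = plain_path + ".zip"

# Same data, once uncompressed and once as a zip archive.
data.to_csv(plain_path, index=False, header=True, encoding="utf-8")
data.to_csv(zip_path, index=False, header=True, compression="zip", encoding="utf-8")

plain_size = os.path.getsize(plain_path)
zip_size = os.path.getsize(zip_path)
print(f"plain: {plain_size} bytes, zip: {zip_size} bytes")

# Confirm the archive is valid and the data survives the round trip.
is_zip = zipfile.is_zipfile(zip_path)
restored = pd.read_csv(zip_path, compression="zip")
print("valid zip:", is_zip, "| round trip ok:", restored.equals(data))
```

Fewer bytes reach the target folder with the compressed file, which may be part of why it was faster here, although that is a guess and not something measured in this thread.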

Hope this helps!

Thanks a lot @sdhanyam, that helped! It turned out I had different versions of pandas on the host and in Docker; since version 0.24, to_csv is much slower without the additional arguments.
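
Since the cause turned out to be a version mismatch, one simple check is to print the Python and pandas versions in both environments and compare them. A minimal sketch follows; the helper name `environment_summary` is made up for illustration.

```python
import sys

import pandas as pd


def environment_summary():
    """Return the Python and pandas versions of the current environment."""
    return {
        "python": sys.version.split()[0],
        "pandas": pd.__version__,
    }


summary = environment_summary()
print(summary)
```

Run it once on the host and once inside the container; if the pandas versions differ, the two timings are not directly comparable.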