Slow I/O performance inside a container when writing a CSV file compared with the host server

Hi Everyone,

Before I begin explaining my problem, I wanted to let you know that I am a new user of Docker and have only recently started working with it, so please bear with me if I sound a bit naive.

Scenario:

Currently I have a Python script running within a Docker container. The script queries data from a DB, loads the results into a pandas DataFrame, and then writes out a CSV file to a folder within an NFS mount. The NFS share is mounted at runtime.
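
In rough outline, the script does something like this (the connection string, query, and paths below are just placeholders, not the real ones):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and query; the real ones point at our production DB.
engine = create_engine("postgresql://user:password@dbhost/dbname")
df = pd.read_sql("SELECT * FROM some_table", engine)

# Write the result to a folder inside the NFS mount (mounted at /td-bmp in the container).
df.to_csv("/td-bmp/output/result.csv", index=False, header=True, encoding="utf-8")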

Server Properties:
[fc8tdtsr@fc8tdbitmapconvs08]$ uname -r
3.10.0-957.5.1.el7.x86_64

Docker version:
[fc8tdtsr@fc8tdbitmapconvs08]$ docker --version
Docker version 18.03.1-ce, build 9ee9f40

Docker Info
[fc8tdtsr@fc8tdbitmapconvs08]$ docker info
Containers: 46
Running: 46
Paused: 0
Stopped: 0
Images: 459
Server Version: 18.03.1-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-957.5.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 128
Total Memory: 1.968TiB
Name: fc8tdbitmapconvs08
ID: NXBX:GCN7:UY6S:QWW4:RB5G:JDLW:FRMI:YJQZ:37SY:RDV5:5NO6:V2MS
Docker Root Dir: /local/docker-data-root
Debug Mode (client): false
Debug Mode (server): false
HTTPS Proxy: uswwwp1.gfoundries.com:74
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
0.0.0.0/0
127.0.0.0/8
Live Restore Enabled: false

Docker Command that I run:

docker run -d --restart=always --volume-driver=nfs -v /td-bmp:/td-bmp:rw
--memory=128g --memory-reservation=32g --cpu-shares=28
--name=worker-1 daas-worker:latest --spark-name daas1

My Problem

The CSV generation takes a LONG time when I run the script within the Docker container compared to running it on the host.

For example, to generate a 5 GB CSV file, the host takes an average of 30 minutes (including querying the DB and writing out the CSV file), whereas running the same scenario within the container takes almost 1.5 hours to produce the same result. That is an hour more than the host.

From what I understand, the difference shouldn't be that huge. I do understand that there will be some trade-offs, but this extra hour seems really bad. Am I doing something wrong?

Please do let me know if you need anything else from me.

Thanks!

I'm having exactly the same problem!

Docker version 18.09.2, build 6247962
Python 3.7
pandas 0.24.2

I have a minimal working example that writes an 8 GiB CSV file to disk using pandas. It takes 15 minutes on the host and 1.5 hours in Docker to finish.
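
Roughly, the example looks like this (the row count and output path are placeholders; the real one just needs to produce a file of around 8 GiB):

import numpy as np
import pandas as pd

# Row count is a placeholder, chosen so the resulting CSV is several GiB in size.
rows = 50_000_000
data = pd.DataFrame({
    "a": np.random.rand(rows),
    "b": np.random.rand(rows),
    "c": np.random.randint(0, 1_000_000, size=rows),
})

# Same call on the host and in the container; no extra arguments passed to to_csv().
data.to_csv("output.csv", index=False)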

Hi @meownoid,

I think I have a quick fix for this. This doesn't seem to be a Docker problem. All you need to do is pass chunksize to pandas.to_csv() and you will notice a drastic improvement in the speed at which the CSV file gets written.

Here is what I did:
data.to_csv(filename, index=False, header=True, chunksize=200000, encoding='utf-8')

By doing so I was able to generate a 5 GB CSV file within 6 minutes.

You can further improve this by writing out a csv.zip file:

data.to_csv(filename + '.zip', index=False, header=True, chunksize=200000, compression='zip', encoding='utf-8')

--> I was able to generate a 5 GB csv.zip within 1 minute, which is super quick!

Hope this helps!

Thanks a lot @sdhanyam, that helped! It turned out I had different versions of pandas on the host and in Docker, and since version 0.24 to_csv works much more slowly without additional arguments.
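
For anyone else hitting this: a quick way to compare the two versions is to run the same snippet on the host and again inside the container (the container name is whatever yours is called):

# Run on the host, then inside the container, e.g. with: docker exec -it <container> python
import pandas
print(pandas.__version__)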