Running chromium inside a container results in inconsistent crashes

I'm currently having difficulty running chromium within a docker container. I have a custom python application that captures data from websites, and I'm in the process of containerizing it.

I run the non-containerized application in production on ec2 instances. The app is orchestrated via Jenkins, which is hooked up to an autoscaling group that starts/stops the nodes the app runs on. Each of these ec2 instances has an NFS mount - this is important to note because the host machine must already have the NFS mount in place before any docker containers can use it; I share the mount with the containers via the -v flag in the docker run command. I can't currently replace it with EFS because EFS is far more expensive and doesn't offer the performance my stack requires.

I am simply trying to hit the page https://bot.incolumitas.com/ and capture a screenshot. The error I see is a custom application error that says “you’ve hit your timeout limit of 30 seconds”.

The interesting part of this issue: when I re-run the app on an ec2 instance that has already scraped https://bot.incolumitas.com/ at least once, I get a near-100% success rate. I've tried adding a sleep to the ec2 startup script that waits X minutes before running the app. I've also tried hitting a very basic internally-hosted site immediately before running against bot.incolumitas on the first execution of a freshly-started instance. Neither idea worked.

This is the docker run command that I run for each execution:

docker run -d --privileged -it --rm --name container-${random_hash} --shm-size=4g --ipc=host -v /run/dbus/system_bus_socket:/host/run/dbus/system_bus_socket -v /nfs/mount/path/on/host:/mount/path/on/container ${image_id} /bin/bash -c "source script_that_starts_xvfb_display.sh; ./run_container.sh --headless false"

The contents of script_that_starts_xvfb_display.sh:

# this script returns "1366x768x24", "1920x1080x24", or "1536x864x24"
screen_res=$(python3 get_random_screen_resolution.py)

# start Xvfb only if it is not already running
check=$(pgrep -c Xvfb)
if [ "$check" -eq 0 ]; then
    Xvfb :101 -screen 0 "$screen_res" &
fi

export DISPLAY=:101
export DISPLAY_CONFIGURATION=${screen_res}

# randomizing the timezone as well
# this will return something like "America/New_York", or "Europe/London", or "Asia/Hong_Kong"
TZ="$(shuf -n 1 all_timezones_list.txt)"
export TZ
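One thing worth flagging in that script: Xvfb is backgrounded and DISPLAY is exported immediately, so on a cold start chromium may try to connect before the display is actually accepting connections. A polling helper I could add is sketched below (`wait_for` is a made-up name, and it assumes a probe like `xdpyinfo -display :101` is available in the image; `true` is just a stand-in probe for demonstration):

```shell
#!/bin/sh
# Sketch: poll a probe command until it succeeds or a timeout (in seconds)
# expires. In the real script, right after starting Xvfb, the call would be
# something like:  wait_for 10 xdpyinfo -display :101
wait_for() {
  timeout_s=$1
  shift
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0   # probe succeeded; display is ready
    fi
    elapsed=$((elapsed + 1))
    sleep 1
  done
  return 1       # gave up after timeout_s seconds
}

# stand-in probe for demonstration
wait_for 5 true && echo "display ready"
```

If the cold-start failures really are a race against Xvfb coming up, this should turn the random timeouts into either fast successes or an explicit "display never came up" failure.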

I’ve tried the stable chromium releases from version 88 through 94.

In the current production setup, I get a 100% success rate against bot.incolumitas, and a successful execution takes 5-10 seconds. In the containerized setup, on freshly-started ec2 instances (i.e. no previous executions), I get about a 20% success rate, and a successful execution takes around 30 seconds.

I do not believe the container is hitting a memory or CPU limit, because I've tailed the container stats during execution and everything looks fine. One theory is that the first execution has no cached files for the website, since it's accessing the site for the first time. However, the non-containerized app reaches the site without issue, within 5-10 seconds, even on the first execution on a freshly-started server.
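To rule out memory more definitively than live stats can: docker records in a container's final state whether the kernel OOM-killed it, via `docker inspect --format '{{.State.OOMKilled}}' <container-name>` after the container exits. The helper below just interprets that output (the argument is a canned value standing in for the real docker inspect call, which obviously can't run here):

```shell
#!/bin/sh
# Sketch: interpret the output of
#   docker inspect --format '{{.State.OOMKilled}}' container-name
# run after a failed container exits. A "true" here would mean the
# kernel killed the container for exceeding its memory limit, which
# live stats sampling can miss.
inspect_oomkilled() {
  if [ "$1" = "true" ]; then
    echo "container was OOM-killed"
  else
    echo "no OOM kill recorded"
  fi
}

# canned value standing in for real docker inspect output
inspect_oomkilled "false"
```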

Therefore, this leads me to believe that it could simply be a crash of chromium that just isn’t showing up in the error messages, but I’m not entirely sure. It could also be a docker configuration setting that I’m missing.
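To test the silent-crash theory, one option is to wrap the browser launch so its exit code and stderr survive even when the app only reports its generic timeout. A sketch (`run_logged` is a made-up helper, and `false` stands in for the real chromium invocation):

```shell
#!/bin/sh
# Sketch: run a command with stdout/stderr captured to files and report
# its exit status. A chromium crash would show up as a non-zero status
# plus whatever it wrote to the .err file before dying.
run_logged() {
  logprefix=$1
  shift
  "$@" > "${logprefix}.out" 2> "${logprefix}.err"
  status=$?
  echo "command '$1' exited with status $status"
  return $status
}

# "false" stands in for the real chromium launch command here.
run_logged /tmp/browser false || echo "non-zero exit captured"
```

If chromium really is dying silently, its exit status and last stderr lines should show up here instead of being swallowed by the application's 30-second timeout.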

Things I’ve tried so far [to no avail]:

  1. Adding --disable-dev-shm-usage to the chromium startup command (I’d already been doing this)
  2. Using --shm-size in my docker run command
  3. Using --ipc=host in my docker run command
  4. Trying different chromium versions
  5. Adding a “sleep X” before the initial execution on a freshly-started ec2 instance
  6. Hitting a dummy site (an internal site that returns the request’s user-agent) as the first execution on a freshly-started ec2 instance, then hitting bot.incolumitas - which failed in the same way

Is there anything that I could be missing? I'd appreciate any ideas or recommendations.