I’m trying to figure out the best way to debug builds that take a long time, specifically when the long stretches of time are not shown as occurring in specific steps of the build, but seemingly happen outside of a specific step of the build.
I currently have a multi-stage build (it’s using a Docker Hardened Image for node, so it first builds a stage that uses the -dev version of the image, and then builds a second stage using the non-dev version and copies the built artifacts from the first stage). The entire first stage of the build takes around 40 seconds (with the final step taking about 30 seconds, which makes sense since it’s the npm ci stage of the build that downloads and installs all the Node modules), but then the build just goes into 5-6 minutes of… just a pause, where the build output shows nothing at all, nor does any other log I can find (the Docker host, etc.), and the elapsed time at the top of the build output just increments and increments. Finally, after this 5-6 minutes, the build output shows the second stage build kicking off.
What could be happening in this intermission between the two stages being built, and how might I effectively debug it? I’ve tried to watch any and every log I can to see what might even be happening during the 5-6 minutes, but nothing shows me anything at all… so I don’t even know where the underlying problem might be.
For reference, here’s basically the Dockerfile I’m using; the long pause that I reference above is happening after the RUN npm ci step (step 7/7 of the first stage), and before the second WORKDIR /app step (step 2/4 of the second stage, noting that the build output doesn’t show anything for step 1/4 of that stage, the FROM line):
# create build stage
FROM dhi.io/node:24.13.0-alpine3.22-dev AS build-stage
# Set up app directory, copy node package files, and pull in dependencies -- early, to cache them.
WORKDIR /app
COPY package.json .
COPY package-lock.json .
COPY .npmrc .
COPY ssl/NIH-DPKI-chain.pem ./ssl/
ENV NODE_EXTRA_CA_CERTS="/app/ssl/NIH-DPKI-chain.pem"
RUN npm ci --ignore-scripts --omit=dev && npm cache clean --force && mkdir -p node_modules
# create runtime stage from DHI that doesn't have npm or a shell
FROM dhi.io/node:24.13.0-alpine3.22 AS runtime-stage
# Set up container environment and exposed resources.
ENV TZ=America/New_York
ENV NODE_EXTRA_CA_CERTS="/app/ssl/NIH-DPKI-chain.pem"
WORKDIR /app
EXPOSE 8443
# Copy node_modules from build stage
COPY --from=build-stage /app/node_modules ./node_modules
# Finish copying the app.
COPY --chown=node:node --chmod=755 . .
# Go time.
CMD ["node", "app.mjs"]
And there is the Docker DX extension for Visual Studio Code, which allows debugging image builds done with Buildx:
It lets you set breakpoints on Dockerfile instructions, inspect variables, explore the file system, and even exec into the current build step.
Note: if you prefer Neovim or JetBrains IDEs over Visual Studio Code, links are provided at the bottom of the blog post.
But I know that npm isn’t causing the problem here; that stage only takes 20 seconds or so (this project is big, so that’s a reasonable amount of time for npm ci to run). All my debugging to date tells me that the long-duration tasks are taking place between stage build steps — like, some sort of Docker build work on the image at that point (file compression? diff checking? exporting of something?) — and that’s what I’m trying to figure out how to “see” and debug.
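One thing worth ruling out (a guess on my part, not something the logs confirm): the second stage's COPY . . has to transfer the entire build context, and with BuildKit that transfer can happen lazily just before the instruction that needs it, which could look like a silent gap between stages. A quick sanity check of the context size, run from the project root (the --exclude flag is GNU du only):

```shell
# Total size of what "COPY . ." will send as build context:
du -sh .

# Size without node_modules, to estimate how much a .dockerignore
# entry for it would save (GNU du's --exclude; not available in BSD du):
du -sh --exclude=node_modules .
```

If the two numbers differ by hundreds of megabytes, context transfer is a plausible place for invisible time to go.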
BuildKit creates a new immutable intermediate image layer after each COPY/ADD/RUN instruction; the layer represents the delta from the previous instruction and is then cached. I am not entirely sure how it stores the intermediate layers. The legacy builder compressed each layer into a tar.gz file and created a sha256 digest for it; I assume BuildKit still does the same.
The build command (which is an alias for buildx) processes stages independently and in parallel, up to the point where a stage depends on artifacts from another stage. It uses a directed acyclic graph to determine which parent instructions must finish in each stage before the next instruction can start, which reduces unnecessary wait times.
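Those internal steps can be made visible: with plain progress output, BuildKit prints every vertex it executes, including internal ones such as "exporting layers" and "exporting to image", each with its duration, so a silent gap can usually be attributed to a named step (the image tag here is just an example):

```shell
# Plain (non-TTY) progress prints every BuildKit vertex with its duration,
# including internal steps like "exporting layers" / "exporting to image".
docker buildx build --progress=plain -t myapp:debug . 2>&1 | tee build.log

# Equivalent via environment variable, if you prefer plain "docker build":
BUILDKIT_PROGRESS=plain docker build -t myapp:debug .
```

Searching build.log for "exporting" or for the largest durations is a quick way to see whether the pause is the export/cache phase rather than a Dockerfile instruction.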
Do you know if there’s any way to debug what’s happening in this process, though? I suspect that this is where I’m seeing the 5-6 minute delay, but ultimately I’d like to pin down what’s happening during that delay so I can focus my efforts on solving it (e.g., optimizing disk access, or adding memory, or whatever resources might be in contention and causing the slowdown).
I had another instance today where the build took forever, and I didn’t get any useful information via the VSCode debug build process. This instance was the same as before: in my multi-stage build, steps 1-7 (all of them) of the first stage ran, steps 1-2 of the second stage ran in parallel, and then there was a 4-minute pause after the end of step 7 of the first stage before step 3 of the second stage started. The docker build output was entirely silent during the pause, and no other logs showed anything meaningful during it either (the Docker daemon logs on the host, the system logs on the host, etc.).
There HAS to be a way to enable verbose logging for the builder, or otherwise get the builder (which I presume is BuildKit?) to tell me what’s going on during this 4-5 minute pause!
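One way to get daemon-level verbosity out of BuildKit (a sketch; the builder name dbg and the image tag are arbitrary): run the build on a docker-container builder whose buildkitd was started with its --debug flag, then read the builder container's log, which records what buildkitd itself is doing rather than just the per-step build output:

```shell
# Create a builder whose buildkitd runs with debug logging
# ("dbg" is an arbitrary builder name).
docker buildx create --name dbg --driver docker-container \
  --buildkitd-flags '--debug'

# Run the build on that builder with plain progress output:
docker buildx build --builder dbg --progress=plain -t myapp:debug .

# The docker-container driver names its container buildx_buildkit_<name>0,
# so the verbose buildkitd log is available via:
docker logs buildx_buildkit_dbg0
```

If the pause corresponds to cache export, snapshotting, or layer differencing, it should show up in that log even though the normal build output stays silent.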