Unable to find user root: no matching entries in passwd file

I'm not sure whether this is the same issue, but it's the closest thing I've found to an explanation of a problem I'm having on my managed Kubernetes cluster:

$ kubectl -n jekyll exec -it jekyll-web-75c8dbb69c-qr4zf echo
unable to find user slug: no matching entries in passwd file
command terminated with exit code 126

I found https://github.com/mikelorant/kubectl-exec-user, which basically gives me access to the Docker socket even though I'm on a managed Kubernetes cluster (a surprise, but a welcome one…), and this happened:

$ kubectl -n jekyll plugin exec-user -u root jekyll-web-75c8dbb69c-qr4zf bash
unable to find user root: no matching entries in passwd file
pod "exec-user-jekyll-web-75c8dbb69c-qr4zf" deleted
pod jekyll/exec-user-jekyll-web-75c8dbb69c-qr4zf terminated (Error)
error: exit status 126

$ kubectl -n jekyll plugin exec-user -u 0 jekyll-web-75c8dbb69c-qr4zf bash
If you don't see a command prompt, try pressing enter.
root@jekyll-web-75c8dbb69c-qr4zf:~#
root@jekyll-web-75c8dbb69c-qr4zf:~#
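
For what it's worth, my understanding (an assumption on my part, based on the helper pod it creates and deletes above) is that the plugin just runs a privileged pod on the same node with the Docker socket mounted and calls docker exec from there. Doing roughly the same thing by hand would look like this; the container ID is a placeholder and the socket path is assumed:

# from a privileged shell on the node, with /var/run/docker.sock available
$ docker ps | grep jekyll-web-75c8dbb69c-qr4zf
$ docker exec -u root -it <container-id> bash   # I expect the same passwd error here
$ docker exec -u 0 -it <container-id> bash      # the numeric UID works, as above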

So, does anyone know what causes "unable to find user root: no matching entries in passwd file"?

Now that I'm able to peek inside the container, I can see that nothing looks especially unusual about /etc/passwd or /etc/shadow:

/etc/passwd:
root:x:0:0:root:/root:/bin/bash
...
slug:x:2000:2000::/app:/bin/bash

/etc/shadow:
root:*:16667:0:99999:7:::
...
slug:!:17623:0:99999:7:::

Those both look like normal entries for root and slug, but neither username is recognized by docker exec -u, nor by the more basic ways Kubernetes gets processes exec'ed into the container (health-check probes, or kubectl exec, which doesn't accept a user parameter).
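
One check that might be worth doing from the UID 0 shell above (these are commands I'd try, not output I've captured): see whether name lookups inside the container itself still work.

# from the root@ shell obtained with -u 0
$ getent passwd root slug   # glibc/NSS lookup of both names from the container's own /etc/passwd
$ id slug                   # resolves the name to uid/gid the normal way

If those resolve fine, and given the entries above I'd expect them to, then the failure is presumably in the runtime's own parsing of the container filesystem's passwd file, not in anything a process inside the container can see.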

My container has this for its readiness check:

readinessProbe:
  exec:
    command:
    - bash
    - -c
    - '[[ ''$(ps -p 1 -o args)'' != *''bash /runner/init''* ]]'
  failureThreshold: 1
  initialDelaySeconds: 30
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 5
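
(Side note on the probe itself, separate from the user-lookup failure: I think the single quotes around $(ps -p 1 -o args) prevent the command substitution from ever running, so bash compares the literal string and the test always passes whenever bash can actually exec. If the intent is "not ready while PID 1 is still bash /runner/init", the substitution probably needs double quotes, something like:

    - '[[ "$(ps -p 1 -o args)" != *"bash /runner/init"* ]]'

That's orthogonal to this bug, since here the exec never even reaches bash.)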

and it is never marked ready, because the container definition specified USER slug, a user that, as we've already seen, the runtime can't resolve, just like in the other reports dating back to 2016. How I got into this state is still a mystery. I think deleting the pod will fix it (I had three of these and solved two that way), but as other reporters have said, that's not a sustainable solution: "kubectl delete pod" is a manual step, and if this issue can't be resolved by health checks, we're back to the stone age of pets rather than cattle, with a pager waking someone in the middle of the night when the website goes down.
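
In the meantime, one stopgap I'm considering (just a sketch, not something I've verified on this cluster): set a numeric UID in the pod spec so the runtime never has to resolve a name, the same way -u 0 worked where -u root didn't. The container name below is my guess, and 2000 is just the UID from the slug entry in /etc/passwd above:

spec:
  containers:
  - name: jekyll-web
    securityContext:
      runAsUser: 2000   # numeric UID for slug; skips the passwd name lookup

That wouldn't explain how the container got into this state, but I think it would at least let the probes and kubectl exec work again.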

I don't have steps to reproduce, since I didn't use docker cp to create this failure. It sounds like vanilla Docker users have no trouble reproducing the issue, though. So, do we have any idea what causes it (and can it be fixed)?