Cannot find folder even though the folder should exist

I am trying to use Docker to run a repo (verl) inside a Docker container. My machine is Linux, x86_64.

Following the steps in the installation instructions:

  1. I run docker pull verlai/verl:app-verl0.5-vllm0.9.1-mcore0.12.2-te2.2
  2. docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl db618adc68d5 sleep infinity
    docker start verl
    docker exec -it verl bash
  3. I then try git clone https://github.com/Qsingle/verl.git && cd verl, but verl already exists (when I cd into it, I see all of the folders from my server), so I create an env folder, cd into that, and then run git clone https://github.com/Qsingle/verl.git && cd verl.
    pip3 install --no-deps -e .
  4. Then, I download the data that I need (git clone of the Hugging Face datasets PAPOGalaxy/PAPO_ViRL39K_train and PAPOGalaxy/PAPO_MMK12_test) under the verl folder: /workspace/env/verl.

However, when I run:
python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo data.train_files="PAPO_ViRL39K_train" data.val_files="PAPO_MMK12_test" data.train_batch_size=4 data.max_prompt_length=18432 data.max_response_length=32768 data.filter_overlong_prompts=True data.filter_overlong_prompts_workers=8 data.truncation='error' data.image_key=images data.trust_remote_code=True actor_rollout_ref.model.path=OpenGVLab/InternVL3-2B actor_rollout_ref.model.trust_remote_code=True actor_rollout_ref.actor.optim.lr=1e-6 actor_rollout_ref.model.use_remove_padding=True actor_rollout_ref.actor.ppo_mini_batch_size=2 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 actor_rollout_ref.actor.use_kl_loss=False actor_rollout_ref.actor.kl_loss_coef=0.0 actor_rollout_ref.actor.kl_loss_type=low_var_kl actor_rollout_ref.actor.entropy_coeff=0 actor_rollout_ref.model.enable_gradient_checkpointing=True actor_rollout_ref.actor.fsdp_config.param_offload=False actor_rollout_ref.actor.fsdp_config.optimizer_offload=False actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 actor_rollout_ref.rollout.tensor_model_parallel_size=1 actor_rollout_ref.rollout.name=vllm actor_rollout_ref.rollout.gpu_memory_utilization=0.6 actor_rollout_ref.rollout.enable_chunked_prefill=False actor_rollout_ref.rollout.enforce_eager=False actor_rollout_ref.rollout.free_cache_engine=False

I get FileNotFound: /workers/verl/PAPO_ViRL39K_train
even though, when I do ls, I see it.

Why is this happening and how can I fix it?

I know others have had similar problems, but none of the answers helped or were relevant to my specific use case.

This image is huge (even the compressed version is almost 20 GB), so I didn’t try it myself, but please show us how you confirmed that the file is where you expect it to be.
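For example, something like this from the host should show what the container actually sees (a rough sketch, assuming the container is still named verl and using the paths from your steps):

  docker exec verl pwd                          # default working directory the image drops you into
  docker exec verl ls -la /workspace/verl       # the bind-mounted host folder
  docker exec verl ls -la /workspace/env/verl   # where you say the datasets were cloned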

Based on the steps you described, you mounted the local folder to /workspace/verl in the container. Since the workdir is /workspace (based on the layer history on Docker Hub), when you started a shell in the container you were in that workspace folder. Then you entered the verl folder and ran git clone, which created a new verl folder inside the existing one, so anything from that clone lives in /workspace/verl/verl. If the file you need is in the local folder that you mounted, then it is at /workspace/verl/PAPO_ViRL39K_train, not /workers/verl/PAPO_ViRL39K_train.
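If you want to double-check where the working directory and the bind mount actually land, docker inspect should show it (again just a sketch; adjust the container name if yours differs):

  docker inspect -f '{{ .Config.WorkingDir }}' verl   # working directory configured for the container
  docker inspect -f '{{ json .Mounts }}' verl         # source and destination of the bind mount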

I don’t know how /workspace/env/verl is relevant at step 4.

Hello!
Thanks for the info, I will try to figure it out some more.

Hello,
The FileNotFound error is occurring because the Docker container’s environment is separate from your host machine’s file system, despite the volume mount. The command python3 -m verl.trainer.main_ppo ... data.train_files="PAPO_ViRL39K_train" is looking for the dataset at a path relative to the container’s working directory (/workers/verl), but the dataset is located at /workspace/env/verl/PAPO_ViRL39K_train within the container. The simple fix is to either move the datasets to the expected location or, more robustly, update the data.train_files and data.val_files arguments in your Python command to the full, correct paths within the container’s file system: /workspace/env/verl/PAPO_ViRL39K_train and /workspace/env/verl/PAPO_MMK12_test respectively.
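As a quick sanity check before rerunning, listing those absolute paths inside the container (a sketch using the paths above; adjust them if your layout differs) should confirm they resolve:

  # run inside the container shell (docker exec -it verl bash)
  ls -ld /workspace/env/verl/PAPO_ViRL39K_train
  ls -ld /workspace/env/verl/PAPO_MMK12_test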

Regards,
Pearl

I just saw that that could be the problem as you were typing; however, when I tried changing the train and test paths to:
data.train_files=/workspace/env/verl/PAPO_ViRL39K_train data.val_files=/workspace/env/verl/PAPO_MMK12_test

I am still getting
FileNotFoundError: Unable to find '/workspace/env/verl/PAPO_ViRL39K_train'

and I know that the folder exists at this location because I can cd into it just fine:
root:/workspace/env/verl/PAPO_ViRL39K_train#

The entire command I am running is:

python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo data.train_files=/workspace/env/verl/PAPO_ViRL39K_train data.val_files=/workspace/env/verl/PAPO_MMK12_test data.train_batch_size=4 data.max_prompt_length=18432 data.max_response_length=32768 data.filter_overlong_prompts=True data.filter_overlong_prompts_workers=8 data.truncation='error' data.image_key=images data.trust_remote_code=True actor_rollout_ref.model.path=OpenGVLab/InternVL3-2B actor_rollout_ref.model.trust_remote_code=True actor_rollout_ref.actor.optim.lr=1e-6 actor_rollout_ref.model.use_remove_padding=True actor_rollout_ref.actor.ppo_mini_batch_size=2 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 actor_rollout_ref.actor.use_kl_loss=False actor_rollout_ref.actor.kl_loss_coef=0.0 actor_rollout_ref.actor.kl_loss_type=low_var_kl actor_rollout_ref.actor.entropy_coeff=0 actor_rollout_ref.model.enable_gradient_checkpointing=True actor_rollout_ref.actor.fsdp_config.param_offload=False actor_rollout_ref.actor.fsdp_config.optimizer_offload=False actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 actor_rollout_ref.rollout.tensor_model_parallel_size=1 actor_rollout_ref.rollout.name=vllm actor_rollout_ref.rollout.gpu_memory_utilization=0.6 actor_rollout_ref.rollout.enable_chunked_prefill=False actor_rollout_ref.rollout.enforce_eager=False actor_rollout_ref.rollout.free_cache_engine=False