Cannot find folder even though the folder should exist

I am trying to use Docker to run a repo (verl) inside a Docker container. My machine is Linux, x86_64.

Following the steps in the installation instructions:

  1. I run docker pull verlai/verl:app-verl0.5-vllm0.9.1-mcore0.12.2-te2.2
  2. docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl db618adc68d5 sleep infinity
    docker start verl
    docker exec -it verl bash
  3. I then try git clone https://github.com/Qsingle/verl.git && cd verl, but verl already exists (when I cd into it, I see all of the folders from my server), so I create an env folder, cd into that, and then run git clone https://github.com/Qsingle/verl.git && cd verl.
    pip3 install --no-deps -e .
  4. Then, I download the data that I need (git clone of the Hugging Face datasets PAPOGalaxy/PAPO_ViRL39K_train and PAPOGalaxy/PAPO_MMK12_test) under the verl folder: /workspace/env/verl.

However, when I run:
python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo data.train_files="PAPO_ViRL39K_train" data.val_files="PAPO_MMK12_test" data.train_batch_size=4 data.max_prompt_length=18432 data.max_response_length=32768 data.filter_overlong_prompts=True data.filter_overlong_prompts_workers=8 data.truncation='error' data.image_key=images data.trust_remote_code=True actor_rollout_ref.model.path=OpenGVLab/InternVL3-2B actor_rollout_ref.model.trust_remote_code=True actor_rollout_ref.actor.optim.lr=1e-6 actor_rollout_ref.model.use_remove_padding=True actor_rollout_ref.actor.ppo_mini_batch_size=2 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 actor_rollout_ref.actor.use_kl_loss=False actor_rollout_ref.actor.kl_loss_coef=0.0 actor_rollout_ref.actor.kl_loss_type=low_var_kl actor_rollout_ref.actor.entropy_coeff=0 actor_rollout_ref.model.enable_gradient_checkpointing=True actor_rollout_ref.actor.fsdp_config.param_offload=False actor_rollout_ref.actor.fsdp_config.optimizer_offload=False actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 actor_rollout_ref.rollout.tensor_model_parallel_size=1 actor_rollout_ref.rollout.name=vllm actor_rollout_ref.rollout.gpu_memory_utilization=0.6 actor_rollout_ref.rollout.enable_chunked_prefill=False actor_rollout_ref.rollout.enforce_eager=False actor_rollout_ref.rollout.free_cache_engine=False

I get FileNotFound: /workers/verl/PAPO_ViRL39K_train
even though, when I do ls, I see it.

Why is this happening and how can I fix it?

I know others have had similar problems, but none of the answers helped or were relevant to my specific use case.

This image is huge (even the compressed version is almost 20 GB), so I didn’t try it myself, but please show us how you confirmed that the file is where you expect it to be.
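For example, something like this from the host should show what the container actually sees (a rough sketch, assuming the container is still named verl and using the paths from your steps):

  docker exec verl pwd                          # default working directory the image drops you into
  docker exec verl ls -la /workspace/verl       # the bind-mounted host folder
  docker exec verl ls -la /workspace/env/verl   # where you say the datasets were cloned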

Based on the steps you described, you mounted the local folder to /workspace/verl in the container. Since the workdir is /workspace (based on the layer history on Docker Hub), when you started a shell in the container you were in that workspace folder. Then you entered the verl folder and ran git clone, which created a new verl folder inside the existing one, so anything from that clone lives in /workspace/verl/verl. If the file you need is in the local folder that you mounted, then it is at /workspace/verl/PAPO_ViRL39K_train, not /workers/verl/PAPO_ViRL39K_train.
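If you want to double-check where the working directory and the bind mount actually land, docker inspect should show it (again just a sketch; adjust the container name if yours differs):

  docker inspect -f '{{ .Config.WorkingDir }}' verl   # working directory configured for the container
  docker inspect -f '{{ json .Mounts }}' verl         # source and destination of the bind mount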

I don’t know how /workspace/env/verl is relevant at step 4.

Hello!
Thanks for the info, I will try to figure it out some more.

Hello,
The FileNotFound error is occurring because the Docker container’s environment is separate from your host machine’s file system, despite the volume mount. The command python3 -m verl.trainer.main_ppo ... data.train_files="PAPO_ViRL39K_train" is looking for the dataset at a path relative to the container’s working directory (/workers/verl), but the dataset is located at /workspace/env/verl/PAPO_ViRL39K_train within the container. The simple fix is to either move the datasets to the expected location or, more robustly, update the data.train_files and data.val_files arguments in your Python command to the full, correct paths within the container’s file system: /workspace/env/verl/PAPO_ViRL39K_train and /workspace/env/verl/PAPO_MMK12_test respectively.
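As a quick sanity check before rerunning, listing those absolute paths inside the container (a sketch using the paths above; adjust them if your layout differs) should confirm they resolve:

  # run inside the container shell (docker exec -it verl bash)
  ls -ld /workspace/env/verl/PAPO_ViRL39K_train
  ls -ld /workspace/env/verl/PAPO_MMK12_test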

Regards,
Pearl

I just saw that that could be the problem as you were typing; however, when I tried changing the train and test paths to:
data.train_files=/workspace/env/verl/PAPO_ViRL39K_train data.val_files=/workspace/env/verl/PAPO_MMK12_test

I am still getting
FileNotFoundError: Unable to find '/workspace/env/verl/PAPO_ViRL39K_train'

and I know that the folder exists at this location because I can cd into it just fine:
root:/workspace/env/verl/PAPO_ViRL39K_train#

The entire command I am running is:

python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo data.train_files=/workspace/env/verl/PAPO_ViRL39K_train data.val_files=/workspace/env/verl/PAPO_MMK12_test data.train_batch_size=4 data.max_prompt_length=18432 data.max_response_length=32768 data.filter_overlong_prompts=True data.filter_overlong_prompts_workers=8 data.truncation='error' data.image_key=images data.trust_remote_code=True actor_rollout_ref.model.path=OpenGVLab/InternVL3-2B actor_rollout_ref.model.trust_remote_code=True actor_rollout_ref.actor.optim.lr=1e-6 actor_rollout_ref.model.use_remove_padding=True actor_rollout_ref.actor.ppo_mini_batch_size=2 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 actor_rollout_ref.actor.use_kl_loss=False actor_rollout_ref.actor.kl_loss_coef=0.0 actor_rollout_ref.actor.kl_loss_type=low_var_kl actor_rollout_ref.actor.entropy_coeff=0 actor_rollout_ref.model.enable_gradient_checkpointing=True actor_rollout_ref.actor.fsdp_config.param_offload=False actor_rollout_ref.actor.fsdp_config.optimizer_offload=False actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 actor_rollout_ref.rollout.tensor_model_parallel_size=1 actor_rollout_ref.rollout.name=vllm actor_rollout_ref.rollout.gpu_memory_utilization=0.6 actor_rollout_ref.rollout.enable_chunked_prefill=False actor_rollout_ref.rollout.enforce_eager=False actor_rollout_ref.rollout.free_cache_engine=False