Fix an issue with distributed utils to enable launching jobs with tor…

Merged !554: Administrator requested to merge fix-distributed-utils into main, Dec 19, 2022

Created by: tangbinh

Summary of Changes

The all_reduce call in distributed_init currently fails for multi-node, non-Slurm jobs because GPU devices are not set correctly when we initialize distributed groups, so every rank on a node ends up on the same default CUDA device:

Traceback (most recent call last):
  File "/home/binhtang/src/metaseq/metaseq/scripts/interactive.py", line 66, in <module>
    distributed_utils.call_main(cfg, main)
  File "/home/binhtang/src/metaseq/metaseq/distributed/utils.py", line 289, in call_main
    return distributed_main(
  File "/home/binhtang/src/metaseq/metaseq/distributed/utils.py", line 222, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/home/binhtang/src/metaseq/metaseq/distributed/utils.py", line 157, in distributed_init
    dist.all_reduce(torch.zeros(1).cuda())
  File "/home/binhtang/.conda/envs/metaseq/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1534, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1666642975993/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 5 and rank 0 both on CUDA device 101c0

To fix it, we set the device ID from the LOCAL_RANK environment variable that torchrun sets for each worker (see the PyTorch documentation).
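
As a rough illustration of the idea (a sketch only, assuming a torchrun launch; init_distributed_sketch is a hypothetical name, not the actual metaseq function), each worker binds itself to the GPU indexed by LOCAL_RANK before the NCCL process group is exercised:

import os

import torch
import torch.distributed as dist


def init_distributed_sketch():
    # torchrun exports LOCAL_RANK for every worker process it spawns;
    # default to 0 for single-process runs.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # torchrun also exports RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT,
    # so the default env:// initialization needs no extra arguments.
    dist.init_process_group(backend="nccl")

    # Without set_device above, every rank on a node allocates this tensor on
    # cuda:0, which is what triggers NCCL's "Duplicate GPU detected" error.
    dist.all_reduce(torch.zeros(1).cuda())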

Test Plan

  • Launch multi-node jobs successfully without Slurm on AWS (command shown for node 0; the second node would run the same command with --node_rank 1):
NCCL_SOCKET_IFNAME=ens32 torchrun --nnodes 2 --node_rank 0 --nproc_per_node 8 \
  --master_addr 172.31.25.180 --master_port 29600 \
  metaseq/scripts/interactive.py \
  --merges-filename /data/checkpoints/gpt2-merges.txt \
  --vocab-filename /data/checkpoints/gpt2-vocab.json \
  --hf-tokenizer /data/checkpoints/gpt2-unified.json \
  --path /path/to/checkpoint/reshard.pt \
  --model-parallel-size 16 --distributed-world-size 16 \
  --ddp-backend fully_sharded --use-sharded-state \
  --beam 1 --max-source-positions 4 --max-target-positions 128