Unify code path for metaseq and metaseq-internal

Merged Administrator requested to merge github/fork/Xirider/unify_training_codepaths into main Nov 01, 2022

Created by: Xirider

Currently we have two versions of sweep.py and slurm.py: one is used when an opt_baseline script is run from metaseq, and the other is used with sweep_baseline from metaseq-internal. Maintaining both versions adds unnecessary complexity to the code base and makes testing more difficult.

This PR brings most features of the sweep and slurm files from metaseq-internal into metaseq, in preparation for deleting these files in metaseq-internal.

Some notes:

  • brought in the tombstone feature
  • metaseq-internal had a wrapper around the training script, "train_wrapper.py", that logs some worker info; I have now moved that logic into the train.py file
  • the path logic for the train command in slurm.py was reshuffled so that it now works independently of the user's working directory
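The last note, making the train command independent of the working directory, can be illustrated with a minimal sketch. This is not the code from this PR: the helper names `default_train_script` and `build_train_cmd` are hypothetical, and the layout (a launcher module at `<package>/launcher/slurm.py` next to a training entry point at `<package>/cli/train.py`) is an assumption made for the example. The idea is to anchor the script path on the launcher module's own location rather than on the current directory.

```python
import sys
from pathlib import Path
from typing import List


def default_train_script(launcher_file: str) -> Path:
    """Locate train.py relative to the launcher module, not the cwd.

    Assumed (hypothetical) layout:
        <package>/launcher/slurm.py   <- launcher_file
        <package>/cli/train.py        <- returned path
    """
    package_root = Path(launcher_file).resolve().parent.parent
    return package_root / "cli" / "train.py"


def build_train_cmd(train_script: Path, extra_args: List[str]) -> List[str]:
    """Build the train command using an absolute script path."""
    # An absolute path keeps the command valid regardless of where the
    # user invoked the launcher from.
    return [sys.executable, str(train_script.resolve()), *extra_args]
```

Inside a launcher module, something like `build_train_cmd(default_train_script(__file__), ["--seed", "1"])` would then produce the same command whether it is run from the repository root or from any other directory.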

Some things were not brought in (i.e. unused flags). For a number of other features that I did include, I'm not sure whether they are actually used currently:

  • post_cmds
  • container_image and container_save
  • array_length
  • args.dep and args.sequential

Should these stay?

Issue: https://github.com/facebookresearch/metaseq/issues/472
Internal PR: https://github.com/fairinternal/metaseq-internal/pull/558

Testing:

python metaseq/launcher/opt_baselines.py --prefix train.8m --model-size 8m --checkpoints-dir ./test-checkpoint --tensorboard-logdir ./test-checkpoint --num-trials 1 --azure --num-gpus 4 --num-nodes 1 --seed 1 --circleci --local --disable-validation --max-epoch 100 --max-update 100
