metaseq merge request !473

Added dynamic configs (profiling)

Merged: Administrator requested to merge dcfg into main on Nov 01, 2022

Created by: igormolybogFB

Patch Description

Made tombstoning work for RSC:

Added dynamic configs:

enabling: pass --dynamic-config-path followed by the path to a JSON file. The file must exist; it is picked up as the default dynamic config file.

using:

  1. When you want to change the value of a dynamic configuration, edit it in the dynamic config JSON file and save the file.
  2. Allow some time for the change to propagate (the default timeout is 30 sec).

how it works: the dynamic configuration is a dict with a built-in timeout. If you access a value before the timeout has expired, it returns the cached value with minimal overhead. If you access a value after the timeout has expired, the entire dict is reloaded from the file and the timer is reset.
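
The behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code from this merge request: the class name DynamicConfig, its get method, and the clock parameter are assumptions made for the example. Only the 30 second default and the reload-on-expiry behavior come from the description.

```python
import json
import time


class DynamicConfig:
    """Hypothetical sketch: a dict-like view of a JSON file, re-read after a timeout."""

    def __init__(self, path, timeout=30.0, clock=time.monotonic):
        self.path = path
        self.timeout = timeout  # seconds; 30 matches the default in the description
        self._clock = clock     # injectable so the timeout can be tested without sleeping
        self._load()

    def _load(self):
        # Reload the whole dict from the file and reset the timer.
        with open(self.path) as f:
            self._values = json.load(f)
        self._loaded_at = self._clock()

    def get(self, key, default=None):
        # Before the timeout: return the cached value (one clock read, one dict lookup).
        # After the timeout: reload from the file first, then answer.
        if self._clock() - self._loaded_at >= self.timeout:
            self._load()
        return self._values.get(key, default)
```

With this shape, an edit to the JSON file becomes visible on the first access made after the timeout has elapsed, which is why a change can take up to the timeout to propagate.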

why it is useful: If there is some computationally expensive logging/profiling that needs to be done only when weird behavior of the training procedure is observed, one should be able to trigger these operations on demand.

I have added the first dynamic config flag, "force_profile". It allows profiling to be turned on not only on step 5 but at any point during training, even if cfg.common.profile = False.
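
As a rough illustration of how such a flag could be consulted on each training step (a sketch under assumptions, not the actual trainer code): the function should_profile and its parameters are hypothetical, and the dynamic config is assumed to be any mapping with a get method.

```python
def should_profile(step, static_profile, dynamic_config, profile_step=5):
    """Hypothetical helper: decide whether to profile the given training step.

    static_profile stands in for cfg.common.profile; dynamic_config is any
    mapping that exposes get(), such as a plain dict.
    """
    # The dynamic flag wins regardless of the step or the static setting.
    if dynamic_config.get("force_profile", False):
        return True
    # Otherwise fall back to profiling only on the fixed step.
    return static_profile and step == profile_step


# Usage: profiling requested mid-run although the static flag is off.
print(should_profile(1200, False, {"force_profile": True}))
```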

Testing steps

Using metaseq-internal on Azure, run:

python -m metaseq_internal.projects.zucchini.sweep_baseline -g 8 -n 1 --azure --model-size 125m --data /data/gpt-z/zucchini/consolidated/v1.0 --tokenizer noregex --partition zetta --profile --dynamic-config --prefix dcfg_test

Results are in /shared/home/igormolybog/checkpoints/dcfg_test22/ (including traces from mnt/)

Source branch: dcfg