Issue created Aug 01, 2022 by Administrator (@root, Owner). 9 of 16 checklist items completed.

List of Usability Changes

Created by: stephenroller

🚀 Feature Request

Metaseq has a number of rough edges that can make it difficult to use. This is a list of my proposals.

  • Print the first batch on rank 0: when we start training, we should print the first batch. This can help alleviate issues where users don't know what is going into the model.
  • Eliminate the Hydra warnings: they clog up the stderr
  • Move NCCL logs to a different file
  • Dump the config as a separate yml or json file next to the train log
  • Make it so stdout logs don't contain all the gnorm/pnorm stuff: those should go in tboard/wandb/... but not stdout
  • Add a single log on rank0 saying all the nodes involved in the training (just print the SLURM env variable)
  • Get rid of the distributed initialization logs off of rank0, they make stdout horrible
  • Have the logger also dump raw metrics to a file that is only jsonl. We shouldn't be parsing stdout in order to generate analysis graphs
  • Reorganize the metrics with "/" in them so they're better grouped in tboard.
  • [maybe] Don't use separate tboards for train/valid.
  • Eliminate the "train" tboard. We only need "train_inner" and "valid"
  • We should switch megatron's kernels to be compiled/installed on pip install, not on import.
  • Need a way to forward step by step in distributed mode
  • Have code snapshots grouped with the training logs
  • We should make model loading "just work". I shouldn't need to pass so many args to get it to find the right checkpoint. https://github.com/facebookresearch/metaseq/issues/78
  • I should be able to specify sharded checkpoints by pointing to the shard0-rank0 pt. https://github.com/facebookresearch/metaseq/issues/78
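
To make the jsonl and config-dump proposals above more concrete, here is a minimal sketch of what they might look like. Everything in it (`JsonlMetricsLogger`, `dump_config`, `group_metric_name`, the file names) is hypothetical and not part of metaseq; it only assumes that the trainer knows its rank and has its metrics available as a plain dict.

```python
import json
import os
import time


def group_metric_name(split, name):
    """Join split and metric with "/" so tboard groups them, e.g. "train_inner/loss"."""
    return f"{split}/{name}"


class JsonlMetricsLogger:
    """Hypothetical helper: append one JSON object per line, on rank 0 only."""

    def __init__(self, path, rank=0):
        self.path = path
        self.rank = rank

    def log(self, step, split, metrics):
        # Only rank 0 writes, so the file holds one record per logged step.
        if self.rank != 0:
            return
        record = {"step": step, "time": time.time()}
        for name, value in metrics.items():
            record[group_metric_name(split, name)] = float(value)
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")


def dump_config(config, log_dir, filename="config.json"):
    """Hypothetical helper: write the config dict next to the train log."""
    path = os.path.join(log_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, sort_keys=True)
    return path


def read_metrics(path):
    """Load the jsonl file back into a list of dicts for analysis."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With something like this, analysis scripts could call `read_metrics("metrics.jsonl")` instead of parsing stdout, and noisy values such as gnorm/pnorm could be written to the file and to tboard/wandb without being printed.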