Issue created Aug 01, 2022 by Administrator (@root, Owner). 9 of 16 checklist items completed.

List of Usability Changes

Created by: stephenroller

🚀 Feature Request

Metaseq has a number of rough edges that can make it difficult to use. This is a list of my proposals.

Print the first batch on rank 0: when training starts, we should print the first batch. This helps in cases where users don't know what is going into the model
Eliminate the Hydra warnings: they clog up stderr
Move NCCL logs to a different file
Dump the config as a separate yml or json file next to the train log
Keep the gnorm/pnorm stats out of the stdout logs: those should go to tboard/wandb/... but not stdout
Add a single log line on rank 0 listing all the nodes involved in the training (just print the SLURM env variable)
Get rid of the distributed initialization logs off of rank 0: they make stdout horrible
Have the logger also dump raw metrics to a separate JSONL-only file. We shouldn't be parsing stdout in order to generate analysis graphs
Reorganize the metrics with "/" in them so they're better grouped in tboard.
[maybe] Don't use separate tboards for train/valid.
Eliminate the "train" tboard. We only need "train_inner" and "valid"
We should switch Megatron's kernels to be compiled/installed on pip install, not on import
Need a way to forward step by step in distributed mode
Have code snapshots grouped with the training logs
We should make model loading "just work". I shouldn't need to pass so many args to get it to find the right checkpoint. https://github.com/facebookresearch/metaseq/issues/78
I should be able to specify sharded checkpoints by pointing to the shard0-rank0 pt. https://github.com/facebookresearch/metaseq/issues/78
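
To make two of the logging proposals above concrete (dumping the config next to the train log, and writing raw metrics to a JSONL-only file from rank 0), here is a minimal sketch. It is not metaseq code: `JsonlMetricsLogger`, `dump_config`, `read_metrics`, the file names, and the metric keys are all hypothetical, and it assumes the caller already knows its own rank.

```python
import json
import os
import tempfile
import time


class JsonlMetricsLogger:
    """Append one JSON object per line. Hypothetical helper, not part of metaseq."""

    def __init__(self, path, rank=0):
        self.path = path
        # Only rank 0 writes, so the file never contains duplicate records.
        self.enabled = rank == 0

    def log(self, step, metrics):
        if not self.enabled:
            return
        record = {"step": step, "time": time.time(), **metrics}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")


def dump_config(config, log_dir, filename="config.json"):
    """Write the run config as a standalone JSON file in the log directory."""
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, sort_keys=True)
    return path


def read_metrics(path):
    """Load the JSONL file back as a list of dicts, e.g. for plotting."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    dump_config({"lr": 3e-4, "batch_size": 8}, log_dir)  # illustrative values
    logger = JsonlMetricsLogger(os.path.join(log_dir, "metrics.jsonl"), rank=0)
    # Keys use "/" so that a dashboard could group them by prefix.
    logger.log(1, {"train_inner/loss": 2.5, "train_inner/gnorm": 1.2})
    logger.log(2, {"train_inner/loss": 2.1, "train_inner/gnorm": 0.9})
    print(read_metrics(logger.path))
```

With a file like this, analysis scripts can read `metrics.jsonl` directly instead of parsing stdout, and stdout can stay limited to the few values a person wants to watch.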
