List of Usability Changes
Created by: stephenroller
🚀 Feature Request
Metaseq has a number of rough edges that can make it difficult to use. This is a list of my proposals.
- Print the first batch on rank 0: when we start training, we should print the first batch. This can help alleviate issues where users don't know what is going into the model.
- Eliminate the Hydra warnings: they clog up stderr.
- Move NCCL logs to a different file.
- Dump the config as a separate yml or json file next to the train log.
- Make it so stdout logs don't contain all the gnorm/pnorm stuff: those should go in tboard/wandb/... but not stdout.
- Add a single log on rank 0 listing all the nodes involved in the training (just print the SLURM env variable).
- Get rid of the distributed initialization logs off of rank 0; they make stdout horrible.
- Have the logger also dump raw metrics to a file that is only jsonl. We shouldn't be parsing stdout in order to generate analysis graphs.
- Reorganize the metrics with "/" in them so they're better grouped in tboard.
- [maybe] Don't use separate tboards for train/valid.
- Eliminate the "train" tboard. We only need "train_inner" and "valid".
- We should switch Megatron's kernels to be compiled/installed on pip install, not on import.
- Need a way to forward step by step in distributed mode.
- Have code snapshots grouped with the training logs.
- We should make model loading "just work". I shouldn't need to pass so many args to get it to find the right checkpoint. https://github.com/facebookresearch/metaseq/issues/78
- I should be able to specify sharded checkpoints by pointing to the shard0-rank0 pt. https://github.com/facebookresearch/metaseq/issues/78
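
To make the jsonl proposal above concrete, here is a minimal sketch of what a raw metrics dump could look like. Nothing here is existing metaseq code: `JsonlMetricsLogger`, `read_metrics`, the `rank` argument, and the metric names are all hypothetical, and the "/"-style names only illustrate the grouping idea from the list.

```python
import json
import time
from pathlib import Path


class JsonlMetricsLogger:
    """Hypothetical helper (not a metaseq API): one JSON object per line."""

    def __init__(self, path, rank=0):
        self.rank = rank
        self.path = Path(path)
        if self.rank == 0:
            # Create the log directory up front so the first write cannot fail on it.
            self.path.parent.mkdir(parents=True, exist_ok=True)

    def log(self, step, metrics):
        # Only rank 0 writes, so the file holds a single record per logged step.
        if self.rank != 0:
            return
        record = {"step": step, "time": time.time(), **metrics}
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")


def read_metrics(path):
    """Load every record back, with no stdout parsing involved."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Usage would be along the lines of `logger.log(step, {"train_inner/loss": loss})` inside the training loop, with analysis scripts calling `read_metrics` on the resulting file. Appending one line per call keeps the file readable even if the job is killed mid-run, which is the main reason to prefer jsonl over a single JSON document here.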