Logging every step for the 175B model costs roughly 1 second per step and is not counted in WPS or UPS
Created by: ngoyal2707
Most likely this is because we loop over a bunch of tensors (gradient norms, activation norms, etc.) and move each one to the CPU for logging.
Weirdly, this happens outside of the WPS and UPS counters, which is why we had not noticed it.
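One way to confirm this is to time the logging phase separately from the part of the step that the throughput counters cover. The following is only a minimal sketch, not the actual trainer code: `StepTimer`, `run_step`, `fake_train` and `fake_log` are hypothetical names, and the sleeps stand in for the real training and logging work.

```python
import time


class StepTimer:
    """Accumulates wall time for the counted and uncounted parts of a step."""

    def __init__(self):
        self.counted = 0.0    # time that a WPS/UPS-style counter would see
        self.uncounted = 0.0  # time spent in logging, outside that counter


def run_step(timer, train_fn, log_fn):
    """Run one step and attribute wall time to each phase."""
    t0 = time.perf_counter()
    train_fn()
    t1 = time.perf_counter()
    log_fn()
    t2 = time.perf_counter()
    timer.counted += t1 - t0
    timer.uncounted += t2 - t1


def fake_train():
    # Hypothetical stand-in for forward/backward/optimizer work.
    time.sleep(0.02)


def fake_log():
    # Hypothetical stand-in for moving stats to the CPU and logging them.
    time.sleep(0.01)


timer = StepTimer()
for _ in range(5):
    run_step(timer, fake_train, fake_log)

total = timer.counted + timer.uncounted
hidden_fraction = timer.uncounted / total
print(f"counted: {timer.counted:.3f}s, uncounted: {timer.uncounted:.3f}s, "
      f"hidden fraction: {hidden_fraction:.1%}")
```

If the uncounted time is a noticeable fraction of the total, the counters are over-reporting throughput by about that much.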
Fixing this is high priority: with our latest changes, 1 second per iteration can be roughly 8-10% of the training cost.
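For a rough sense of scale, the 8-10% figure can be turned back into an implied per-iteration time. This sketch assumes the percentage is relative to total wall time per iteration, including the logging overhead; that interpretation and the function name `implied_iteration_time` are assumptions, not something stated above.

```python
def implied_iteration_time(overhead_s, overhead_fraction):
    """Return (total, remainder) seconds per iteration implied by an overhead.

    Assumes overhead_fraction = overhead_s / total, where total includes
    the overhead itself.
    """
    total = overhead_s / overhead_fraction
    return total, total - overhead_s


overhead_s = 1.0  # logging cost per step, from the issue
results = {}
for fraction in (0.08, 0.10):
    total, remainder = implied_iteration_time(overhead_s, fraction)
    results[fraction] = (total, remainder)
    print(f"{fraction:.0%} overhead -> {total:.1f}s per iteration, "
          f"{remainder:.1f}s of it outside logging")
```

Under that assumption, an iteration takes about 10 to 12.5 seconds in total, so removing the 1 second of logging would recover close to a tenth of the wall time.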