Automatically detect and skip loss spikes

Merged Administrator requested to merge peter/detect_spikes into bigbig Nov 17, 2022

Created by: Xirider

Patch Description

Adds a new flag, max_loss_to_skip_batch. When set to a maximum acceptable loss, it aborts the iteration before the optimizer step whenever the loss exceeds that value. The loss used for the comparison is the same one written to the logs; it may or may not differ from the one shown in TensorBoard. The logic is similar to our skip_gradient_update_on_clip_norm flag, which skips a batch whenever the gradient norm is above the clip value, and to how we handle overflows.
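To illustrate the idea, here is a minimal, framework-free sketch of skipping the optimizer step when the loss exceeds a threshold. It is not the code in this MR: `train_step`, `compute_loss`, `optimizer_step`, and `run_demo` are hypothetical names, and treating `None` as "check disabled" is an assumption made for the example.

```python
from typing import Callable, List, Optional, Tuple


def train_step(
    batch: float,
    compute_loss: Callable[[float], float],
    optimizer_step: Callable[[float], None],
    max_loss_to_skip_batch: Optional[float] = None,
) -> bool:
    """Run one iteration; return True if the optimizer step was applied.

    Hypothetical helper: the loss is computed first, and the update is
    skipped when the loss is above the configured maximum.
    """
    loss = compute_loss(batch)
    # Assumption: None disables the check entirely.
    if max_loss_to_skip_batch is not None and loss > max_loss_to_skip_batch:
        return False  # abort before the optimizer step
    optimizer_step(loss)
    return True


def run_demo() -> Tuple[List[float], int]:
    """Feed a few fake losses through train_step, one of them a spike."""
    applied: List[float] = []
    skipped = 0
    for fake_loss in [2.1, 2.0, 9.5, 1.9]:
        updated = train_step(
            batch=fake_loss,
            compute_loss=lambda b: b,        # stand-in: the batch is its own loss
            optimizer_step=applied.append,   # stand-in: record applied updates
            max_loss_to_skip_batch=5.0,
        )
        if not updated:
            skipped += 1
    return applied, skipped
```

In this sketch the spike of 9.5 is dropped while the other three iterations update as usual. Note that a plain `>` comparison does not catch a NaN loss, so overflow handling would still need its own check.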

Testing steps

Tested this with our small sweep script. I think our disks are full, so I couldn't test it with a longer run. For testing, I increased the loss and checked that we skip correctly.

Source branch: peter/detect_spikes