metaseq / Issues / #142
Closed
Open
Issue created Jun 07, 2022 by Administrator (@root, Owner)

Requeue with restore file and broken checkpoint upload... is broken

Created by: suchenzang

Right now, it seems that if we requeue a job that was started from a restore file, and checkpoint uploads have been broken ever since that start, the run simply restarts from scratch while continuing to increment its iteration count.

Go through our checkpointing spaghetti and figure out how to clean this up: https://github.com/facebookresearch/metaseq/blob/ae825b2fa9010ab0406f20d6164ebb058a7e97cf/metaseq/checkpoint_utils.py#L257
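
For discussion, here is a minimal sketch of the selection order a requeue might use. It is not metaseq's actual code: `pick_checkpoint`, `is_complete`, the `checkpoint_<num_updates>.pt` naming, and the `.done` marker file are all hypothetical, chosen only to illustrate the idea. The intent is to prefer the newest checkpoint from this run that finished saving, fall back to the original restore file if none did, and reset counters only when the run truly starts from scratch.

```python
import os
import re
from typing import Optional, Tuple

# Hypothetical naming scheme for this sketch: checkpoint_<num_updates>.pt
_CKPT_RE = re.compile(r"^checkpoint_(\d+)\.pt$")


def is_complete(path: str) -> bool:
    """Hypothetical validity check: non-empty file with a ".done" marker beside it."""
    return os.path.getsize(path) > 0 and os.path.exists(path + ".done")


def pick_checkpoint(
    save_dir: str, restore_file: Optional[str]
) -> Tuple[Optional[str], bool]:
    """Return (path, reset_counters).

    reset_counters is True only when nothing usable exists and the run
    really starts from scratch, so the iteration count cannot drift.
    """
    candidates = []
    if os.path.isdir(save_dir):
        for name in os.listdir(save_dir):
            m = _CKPT_RE.match(name)
            if m:
                candidates.append((int(m.group(1)), os.path.join(save_dir, name)))

    # Newest first; skip any checkpoint whose save or upload did not finish.
    for _, path in sorted(candidates, reverse=True):
        if is_complete(path):
            return path, False

    # No usable checkpoint from this run: fall back to the original restore file.
    if restore_file and os.path.exists(restore_file):
        return restore_file, False

    # Nothing to load: start from scratch and reset the iteration count.
    return None, True
```

With this ordering, a requeue after broken uploads would land on the restore file again rather than on an empty model, and the caller could use the returned flag to decide whether the iteration count should be kept or reset.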
