metaseq, merge request !452

Simplify checkpoint download logic

Status: Merged. Administrator requested to merge davides/blob_cache into main on Oct 25, 2022

Created by: davides

Patch Description

  • Add trailing wildcard support to PathManager.ls() to support the changes below
  • Update checkpoint caching:
    • Remove the double call to get_local_path(), which may have been causing a race condition. Passing force=True should have the intended effect.
    • Add a utility to stress test file locking.
    • Fix load_checkpoint_to_cpu() to support remote checkpoints when DP>1 (I hit a stack trace in this case). I think this only worked before because get_local_path() is a no-op for local paths, so the other shard files were already sitting next to the requested one. load_checkpoint_to_cpu() now caches all shards locally before attempting to load, using the new wildcard support in PathManager.ls().
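
For reviewers who want a concrete picture of the shard caching idea, here is a minimal sketch. It is not the code in this patch: ls_trailing_wildcard() and cache_all_shards() are hypothetical helpers, the local filesystem stands in for remote storage, and the shard file names are made up for illustration.

```python
import os
import shutil
from typing import List


def ls_trailing_wildcard(path: str) -> List[str]:
    """List full paths under `path`; a trailing '*' matches a file name prefix.

    Hypothetical helper, not PathManager.ls() itself.
    """
    if path.endswith("*"):
        # "dir/ckpt-shard*" -> parent "dir", prefix "ckpt-shard"
        parent, prefix = os.path.split(path[:-1])
    else:
        parent, prefix = path, ""
    parent = parent or "."
    return sorted(
        os.path.join(parent, name)
        for name in os.listdir(parent)
        if name.startswith(prefix)
    )


def cache_all_shards(shard_pattern: str, cache_dir: str) -> List[str]:
    """Copy every file matching `shard_pattern` into `cache_dir` before loading."""
    os.makedirs(cache_dir, exist_ok=True)
    local_paths = []
    for src in ls_trailing_wildcard(shard_pattern):
        dst = os.path.join(cache_dir, os.path.basename(src))
        if not os.path.exists(dst):
            # Stand-in for a remote download (assumption for this sketch).
            shutil.copyfile(src, dst)
        local_paths.append(dst)
    return local_paths
```

The point of the sketch is the ordering: every shard is present locally before any of them is opened, so loading never depends on sibling files happening to be nearby.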

Testing steps

Multiple eval runs on OPT 125M:

  • remote path + consolidated
  • remote path + DP>1
  • local path + consolidated
  • local path + DP>1

Running the stress test:

python -m tests.file_io.async_download_test
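
For context on what a file locking stress test exercises, here is a rough sketch under stated assumptions. It is not the contents of tests.file_io.async_download_test: file_lock(), cached_fetch(), and stress() are hypothetical names, threads stand in for separate workers, and an exclusively created lock file stands in for whatever locking the cache actually uses.

```python
import os
import tempfile
import threading
import time
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path, poll=0.01):
    """Hold an exclusive lock by atomically creating `lock_path`."""
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll)  # another worker holds the lock
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def cached_fetch(cache_path, fetch, downloads):
    """Download into the cache once; later callers reuse the cached file."""
    with file_lock(cache_path + ".lock"):
        if not os.path.exists(cache_path):
            tmp = cache_path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(fetch())
            os.replace(tmp, cache_path)  # publish the complete file atomically
            downloads.append(1)
    return cache_path


def stress(num_workers=16):
    """Race many workers on one cache entry; return (download count, content)."""
    with tempfile.TemporaryDirectory() as d:
        cache_path = os.path.join(d, "checkpoint.pt")
        downloads = []

        def fake_fetch():
            time.sleep(0.05)  # simulate a slow remote read
            return b"weights"

        threads = [
            threading.Thread(target=cached_fetch, args=(cache_path, fake_fetch, downloads))
            for _ in range(num_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        with open(cache_path, "rb") as f:
            return len(downloads), f.read()
```

If the locking is sound, stress() reports exactly one download regardless of the number of workers, and the cached file is never observed half-written.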
Source branch: davides/blob_cache