metaseq · Issues · #166 (Closed)
Issue created Jun 21, 2022 by stephenroller

Add support for subshards

🚀 Feature Request

Fast-forwarding our on-the-fly tokenizer can be very slow when our data shards are very large, taking over an hour in some cases.

One easy solution is to chop the data into more shards, but that requires manual labor, and since our corpus now consists of many hundreds of files, it quickly becomes annoying. Instead, let's achieve the same effect in the data loader.

Sketch

  • Add a new --data-subshards <int> flag to StreamingLanguageModel (a rough config sketch follows this list)
  • When loading the data, use the epoch variable to skip documents: assuming 10 subshards, on epoch 1 you take documents 0, 10, 20, …; on epoch 2 you take documents 1, 11, 21, …
  • You'll need to modify JsonlDataset to be aware of this (see the selection sketch after the permalink below)
  • If epoch > subshards, wrap around, i.e. index by epoch modulo subshards
  • The effect will be roughly the same as if we had round-robin distributed our datasets into different shards.
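
A minimal sketch of the proposed flag, assuming metaseq's dataclass-style task configs (the stub class below is illustrative only; the real config class in streaming_language_modeling.py has many more fields):

```python
from dataclasses import dataclass, field


@dataclass
class StreamingLanguageModelingConfig:
    # Illustrative stub showing only the proposed option.
    data_subshards: int = field(
        default=1,
        metadata={
            "help": "split each data shard into this many round-robin "
            "subshards; each epoch reads a different subshard"
        },
    )
```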

https://github.com/facebookresearch/metaseq/blob/f5442a181c6b54dbcc1b56afc9c27b2092306e49/metaseq/tasks/streaming_language_modeling.py#L271
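
And a minimal sketch of the epoch-based selection itself, written here as a standalone wrapper rather than a change to JsonlDataset (the class name is hypothetical, and epochs are assumed to be 1-indexed as in the list above):

```python
from torch.utils.data import Dataset


class SubshardedDataset(Dataset):
    """Expose every num_subshards-th document, offset by the current epoch.

    Hypothetical wrapper; in metaseq the equivalent indexing would live
    inside JsonlDataset itself.
    """

    def __init__(self, dataset, num_subshards: int, epoch: int):
        self.dataset = dataset
        self.num_subshards = num_subshards
        # Epochs are 1-indexed; epoch > num_subshards wraps around.
        self.offset = (epoch - 1) % num_subshards

    def __len__(self):
        # Count of indices offset, offset + N, offset + 2N, ... below len(dataset).
        n = len(self.dataset) - self.offset
        return max(0, (n + self.num_subshards - 1) // self.num_subshards)

    def __getitem__(self, idx):
        return self.dataset[self.offset + idx * self.num_subshards]
```

With --data-subshards 10, epoch 1 reads documents 0, 10, 20, …; epoch 2 reads 1, 11, 21, …; and epoch 11 wraps back to the first slice, which matches the round-robin resharding described above without touching the files on disk.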

In practice, setting --data-subshards to 10 or 20 should sort us out.
