metaseq merge request !488

Add data proportions

Merged: Administrator requested to merge data_prop into opt_instruct on Nov 04, 2022

Created by: sriniiyer

PR to control data proportions

PER BENCHMARK PROPORTIONS

--data-sampling-prob '{"flan": 0.35, "ni_v2": 0.24, "exmix": 0.03, "crossfit": 0.03, "pretrain": 0.0, "cot": 0.02, "t5": 0.03, "promptsource": 0.28, "unified_skg": 0.02}'

  • You don't need to specify probabilities for every benchmark. The code distributes the remaining probability mass uniformly among the unspecified benchmarks, so you can pass just --data-sampling-prob '{"pretrain": 0.0}'

  • The logs print the proportions before and after this adjustment, so you can double-check them.
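
The fill-in rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `fill_sampling_probs` and the shortened benchmark list are hypothetical, and the behavior when every benchmark is specified is an assumption.

```python
def fill_sampling_probs(benchmarks, specified):
    """Spread the unspecified probability mass uniformly over the remaining benchmarks.

    Hypothetical helper for illustration; the PR's actual implementation may differ.
    """
    unknown = set(specified) - set(benchmarks)
    if unknown:
        raise ValueError(f"unknown benchmarks: {sorted(unknown)}")
    used = sum(specified.values())
    if used > 1.0 + 1e-9:
        raise ValueError("specified probabilities exceed 1.0")
    # Benchmarks without an explicit probability share what is left, equally.
    rest = [b for b in benchmarks if b not in specified]
    share = (1.0 - used) / len(rest) if rest else 0.0
    return {b: specified.get(b, share) for b in benchmarks}


# Shortened, illustrative benchmark list.
benchmarks = ["flan", "ni_v2", "pretrain", "cot", "t5"]
probs = fill_sampling_probs(benchmarks, {"pretrain": 0.0})
print(probs)
```

With only `pretrain` pinned to 0.0, the other four benchmarks in this example each receive an equal share of the remaining mass.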

EQUALIZE CLUSTER PROPORTIONS AFTER APPLYING BENCHMARK PROPS

--equalize-cluster-probs

This equalizes the probability of each cluster within a benchmark. Note that it requires cluster names to be included in the dataset names.
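
One way to picture the equalization is the sketch below. It assumes the benchmark probabilities are already fixed and that the cluster membership has already been extracted from the dataset names; the function name `equalize_cluster_probs`, the cluster names, and the input layout are all hypothetical, and how the PR splits a cluster's mass among its datasets is not covered here.

```python
def equalize_cluster_probs(benchmark_probs, clusters_by_benchmark):
    """Give every cluster of a benchmark an equal share of that benchmark's probability.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    cluster_probs = {}
    for bench, prob in benchmark_probs.items():
        clusters = clusters_by_benchmark[bench]
        for cluster in clusters:
            # Each cluster gets benchmark_prob / number_of_clusters.
            cluster_probs[(bench, cluster)] = prob / len(clusters)
    return cluster_probs


# Illustrative inputs; the cluster names are made up.
benchmark_probs = {"flan": 0.6, "cot": 0.4}
clusters_by_benchmark = {
    "flan": ["qa", "nli", "summarization"],
    "cot": ["math", "commonsense"],
}
cluster_probs = equalize_cluster_probs(benchmark_probs, clusters_by_benchmark)
print(cluster_probs)
```

The benchmark totals are unchanged by this step; only the split inside each benchmark becomes uniform across clusters.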

PROVIDE PER BENCHMARK EPS

--caps '{"flan": 30000, "ni_v2": 5000, "exmix": 20000, "crossfit": 20000, "cot": 10000, "t5": 20000, "promptsource": 20000, "unified_skg": 20000}'

Instead of applying a uniform eps, this lets you provide a per-benchmark eps. If it's not provided for a benchmark, it falls back to the default eps, and if no default is specified, it falls back to the dataset length.
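
The fallback order described above can be sketched like this. The function name `resolve_cap` and its parameters are hypothetical, chosen only to illustrate the lookup order (per-benchmark value, then default eps, then dataset length); the PR's real code may be organized differently.

```python
def resolve_cap(benchmark, dataset_len, caps, default_eps=None):
    """Pick the eps for a dataset: per-benchmark cap, else default eps, else dataset length.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    if benchmark in caps:
        return caps[benchmark]
    if default_eps is not None:
        return default_eps
    return dataset_len


caps = {"flan": 30000, "ni_v2": 5000}

print(resolve_cap("flan", 100000, caps))                      # per-benchmark cap
print(resolve_cap("exmix", 100000, caps, default_eps=20000))  # default eps
print(resolve_cap("exmix", 100000, caps))                     # dataset length
```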

Source branch: data_prop