metaseq / Issues / #308
Issue created Aug 20, 2022 by Administrator (@root, Owner)

Unify tokenizers

Created by: stephenroller

🚀 Feature Request

With #305, we now have two ways to specify a tokenizer: the GPT2 tokenizer (provided as two files) and the universal HF format (provided as one file). These live in two separate code paths, but they don't need to: we could (manually) merge the two GPT2 files into the universal HF format and switch to only that, and we should.
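
As a rough illustration of the manual merge, here is a minimal sketch. It assumes the two GPT2 files are the usual `vocab.json` (token to id mapping) and `merges.txt` (one merge rule per line after a `#version` header). The function names and the output layout are hypothetical: the output only mimics the `model` section of an HF tokenizer file (`type`, `vocab`, `merges`), and a real file also carries pre-tokenizer, decoder, and other fields that are omitted here.

```python
import json
from pathlib import Path


def merge_gpt2_files(vocab_path, merges_path):
    """Combine a GPT2-style vocab.json and merges.txt into one dict.

    Hypothetical helper: the returned layout is a simplified stand-in
    for the single-file format, not the complete HF schema.
    """
    vocab = json.loads(Path(vocab_path).read_text(encoding="utf-8"))

    lines = Path(merges_path).read_text(encoding="utf-8").split("\n")
    # Only the first line can be the "#version: ..." header; later lines
    # that start with "#" are real merge rules and must be kept.
    if lines and lines[0].startswith("#version"):
        lines = lines[1:]
    merges = [line for line in lines if line.strip()]

    return {"model": {"type": "BPE", "vocab": vocab, "merges": merges}}


def write_unified(vocab_path, merges_path, out_path):
    """Write the merged dict to a single JSON file and return it."""
    unified = merge_gpt2_files(vocab_path, merges_path)
    Path(out_path).write_text(
        json.dumps(unified, ensure_ascii=False), encoding="utf-8"
    )
    return unified
```

Something like `write_unified("vocab.json", "merges.txt", "tokenizer.json")` would then produce the one file to pass around. In practice it is probably safer to let the HF `tokenizers` library load the two files and save the result, so that the remaining fields are filled in correctly.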

The resulting code would be cleaner, but catching all the other places the old method is used (e.g. API and old sweeps) needs thorough review.
