metaseq · Merge request !104

Support document level attention while training model parallel models

Merged: Administrator requested to merge punitkoura/doc-attention into main, May 17, 2022

Created by: punitkoura

Patch Description

We are trying to optimize model training by using document-level attention. This means modifying the attention mask so that each token attends only to the tokens in its current document, instead of using a constant triangular (causal) attention mask over the whole sequence. The goal is to ensure that tokens do not attend across document boundaries.

To accomplish this, we:

  1. Change the constant upper-triangular mask so that it depends on the document separator tokens, specified by the self_attn_doc_sep field (see the sketch below).
  2. Reset the positional encoding at the end of each document.
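
The following is a minimal sketch, not the code in this patch, of how these two steps can be implemented in PyTorch. Here doc_sep_id stands in for the token id configured via self_attn_doc_sep; the function names are hypothetical, and details such as how the separator token itself is positioned and attended to are implementation choices.

import torch

def make_doc_attention_mask(tokens: torch.Tensor, doc_sep_id: int) -> torch.Tensor:
    """tokens: (batch, seq_len) token ids. Returns a (batch, seq_len, seq_len)
    additive mask: 0 where attention is allowed, -inf where it is blocked."""
    bsz, seq_len = tokens.shape
    # Label each position with a document id by counting separators seen so far.
    doc_id = (tokens == doc_sep_id).cumsum(dim=1)            # (bsz, seq)
    same_doc = doc_id.unsqueeze(2) == doc_id.unsqueeze(1)    # (bsz, seq, seq)
    causal = torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device).tril()
    allowed = same_doc & causal                              # block-diagonal causal mask
    mask = torch.zeros(bsz, seq_len, seq_len, device=tokens.device)
    return mask.masked_fill(~allowed, float("-inf"))

def make_doc_positions(tokens: torch.Tensor, doc_sep_id: int) -> torch.Tensor:
    """Position ids that restart from 0 after every separator (step 2 above)."""
    idx = torch.arange(tokens.size(1), device=tokens.device).expand_as(tokens)
    # Index of the most recent separator at or before each position (-1 if none yet).
    last_sep = torch.where(tokens == doc_sep_id, idx, torch.full_like(idx, -1)).cummax(dim=1).values
    # The separator token itself is arbitrarily given position 0 here.
    return (idx - last_sep - 1).clamp(min=0)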

Testing steps

Consider a list of 20 documents. We compare the following two cases:

  1. Passing the documents one per sequence, in a batch of 20 sequences. No special attention mask is used; we keep the standard upper-triangular mask.
  2. Passing the documents packed into a single sequence, separated by a special document separator token. Document-level attention is used here.

Intuitively, the gradients in these two scenarios should match, and we find that this is indeed the case.
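
The snippet below is a rough sketch of this check, assuming a loss helper compute_loss and two pre-built batches packed_batch and per_doc_batch; all of these names are placeholders, and the actual comparison was done by hand in pdb inside trainer.train_step(), as shown in the dumps below. For the gradients to match, the loss normalization has to be identical for both batchings.

import torch

def positional_grad(model, compute_loss, batch):
    """One forward/backward pass; return the positional-embedding gradient."""
    model.zero_grad()
    loss = compute_loss(model, batch)   # placeholder for the actual criterion / loss scaling
    loss.backward()
    return model.decoder.embed_positions.weight.grad.clone()

# Case 2: the 20 documents packed into one sequence, document-level attention enabled.
grad_packed = positional_grad(model, compute_loss, packed_batch)
# Case 1: the same 20 documents, one per sequence, with the plain causal mask.
grad_per_doc = positional_grad(model, compute_loss, per_doc_batch)

# The two gradients should agree up to floating-point noise.
print(torch.allclose(grad_packed, grad_per_doc, atol=1e-4))
print(grad_packed[2:10])
print(grad_per_doc[2:10])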

Documents packed, with document level attention

> /fsx-mudslide/punitkoura/src/metaseq/metaseq/trainer.py(781)train_step()
-> grad_norm = self.clip_grad_norm(
(Pdb) self.model.decoder.embed_positions.weight.grad[2:10]
tensor([[-6.5479e-03,  9.6962e-03,  2.6375e-03,  ..., -5.1121e-03,
         -1.9985e-02,  2.4901e-03],
        [ 2.5498e-02, -7.1761e-03,  2.2065e-02,  ...,  6.4638e-03,
         -1.0154e-02,  5.8792e-03],
        [ 7.9259e-05, -4.0577e-03,  3.9698e-03,  ...,  2.3798e-04,
          3.1224e-03,  1.0833e-02],
        ...,
        [-7.4849e-06,  1.7103e-03, -1.5433e-03,  ...,  2.2430e-03,
         -9.3732e-03,  1.5123e-03],
        [-1.4408e-03, -1.8149e-03,  6.9925e-06,  ...,  1.1794e-03,
         -1.4160e-03,  2.6913e-03],
        [-1.4234e-03, -6.3481e-03,  7.6257e-03,  ...,  7.9472e-03,
         -1.1467e-03,  3.3719e-03]], device='cuda:0')
(Pdb) 

Documents sent one by one

> /fsx-mudslide/punitkoura/src/metaseq/metaseq/trainer.py(781)train_step()
-> grad_norm = self.clip_grad_norm(
(Pdb) self.model.decoder.embed_positions.weight.grad[2:10]
tensor([[-6.5555e-03,  9.6926e-03,  2.6306e-03,  ..., -5.1131e-03,
         -1.9988e-02,  2.4959e-03],
        [ 2.5496e-02, -7.1761e-03,  2.2073e-02,  ...,  6.4701e-03,
         -1.0157e-02,  5.8840e-03],
        [ 7.5817e-05, -4.0496e-03,  3.9688e-03,  ...,  2.3734e-04,
          3.1189e-03,  1.0830e-02],
        ...,
        [-1.3878e-05,  1.7050e-03, -1.5432e-03,  ...,  2.2473e-03,
         -9.3719e-03,  1.5083e-03],
        [-1.4369e-03, -1.8114e-03,  5.8990e-06,  ...,  1.1820e-03,
         -1.4172e-03,  2.6962e-03],
        [-1.4229e-03, -6.3467e-03,  7.6208e-03,  ...,  7.9439e-03,
         -1.1527e-03,  3.3779e-03]], device='cuda:0')
(Pdb) 
Source branch: punitkoura/doc-attention