Convert OPT to Megatron-LM
Created by: appleeji
🚀 Feature Request
Convert the OPT checkpoint to the Megatron-LM or FasterTransformer format
Motivation
I am currently trying to use OPT in a production environment. However, the 175B model is too large to fit on a single A100 GPU. I tried Hugging Face's accelerate, but found that it gave no latency benefit.
- https://twitter.com/huggingface/status/1524783489593360385
- https://github.com/huggingface/accelerate/pull/345#issuecomment-1121411236
From what I have found, the key to faster inference in a multi-GPU environment is intra-layer (tensor) parallelism, as implemented by the projects below:
- DeepSpeed
- FasterTransformer (Megatron-LM)
So I tried to serve OPT with FasterTransformer, but converting the OPT model to the Megatron-LM or FasterTransformer format did not work well. I hope this feature gets supported so that OPT can be used more widely.
Additional context
Here is what I tried for the OPT → Megatron-LM conversion:
- Megatron-LM 345M checkpoint: https://catalog.ngc.nvidia.com/orgs/nvidia/models/megatron_lm_345m
- OPT from Hugging Face: https://huggingface.co/facebook/opt-1.3b
- Compare and match the keys of the two checkpoints (mapping table below; a renaming sketch follows the table)
| Megatron-LM | OPT from Hugging Face |
|---|---|
| word_embeddings/weight | decoder.embed_tokens.weight |
| position_embeddings/weight | decoder.embed_positions.weight |
| transformer/layers.{i}.input_layernorm.weight | decoder.layers.{i}.self_attn_layer_norm.weight |
| transformer/layers.{i}.input_layernorm.bias | decoder.layers.{i}.self_attn_layer_norm.bias |
| transformer/layers.{i}.attention.query_key_value.weight | decoder.layers.{i}.self_attn.q_proj.weight<br>decoder.layers.{i}.self_attn.k_proj.weight<br>decoder.layers.{i}.self_attn.v_proj.weight |
| transformer/layers.{i}.attention.query_key_value.bias | decoder.layers.{i}.self_attn.q_proj.bias<br>decoder.layers.{i}.self_attn.k_proj.bias<br>decoder.layers.{i}.self_attn.v_proj.bias |
| transformer/layers.{i}.attention.dense.weight | decoder.layers.{i}.self_attn.out_proj.weight |
| transformer/layers.{i}.attention.dense.bias | decoder.layers.{i}.self_attn.out_proj.bias |
| transformer/layers.{i}.post_attention_layernorm.weight | decoder.layers.{i}.final_layer_norm.weight |
| transformer/layers.{i}.post_attention_layernorm.bias | decoder.layers.{i}.final_layer_norm.bias |
| transformer/layers.{i}.mlp.dense_h_to_4h.weight | decoder.layers.{i}.fc1.weight |
| transformer/layers.{i}.mlp.dense_h_to_4h.bias | decoder.layers.{i}.fc1.bias |
| transformer/layers.{i}.mlp.dense_4h_to_h.weight | decoder.layers.{i}.fc2.weight |
| transformer/layers.{i}.mlp.dense_4h_to_h.bias | decoder.layers.{i}.fc2.bias |
| transformer/final_layernorm.weight | X |
| transformer/final_layernorm.bias | X |
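As a rough illustration of the mapping above, the per-layer renaming can be scripted like this (a minimal, unverified sketch; `opt_state` is assumed to be the Hugging Face state dict keyed as in the table, `num_layers` is the decoder depth, and the fused q/k/v and final_layernorm entries are handled separately in the next two steps):

```python
# Per-layer rename rules taken from the table above (Megatron-LM name <- OPT name).
LAYER_MAP = {
    "input_layernorm.weight":          "self_attn_layer_norm.weight",
    "input_layernorm.bias":            "self_attn_layer_norm.bias",
    "attention.dense.weight":          "self_attn.out_proj.weight",
    "attention.dense.bias":            "self_attn.out_proj.bias",
    "post_attention_layernorm.weight": "final_layer_norm.weight",
    "post_attention_layernorm.bias":   "final_layer_norm.bias",
    "mlp.dense_h_to_4h.weight":        "fc1.weight",
    "mlp.dense_h_to_4h.bias":          "fc1.bias",
    "mlp.dense_4h_to_h.weight":        "fc2.weight",
    "mlp.dense_4h_to_h.bias":          "fc2.bias",
}

def rename_keys(opt_state, num_layers):
    """Map OPT decoder weights onto Megatron-style names (q/k/v fused separately)."""
    out = {
        "word_embeddings/weight":     opt_state["decoder.embed_tokens.weight"],
        "position_embeddings/weight": opt_state["decoder.embed_positions.weight"],
    }
    for i in range(num_layers):
        for meg_suffix, opt_suffix in LAYER_MAP.items():
            out[f"transformer/layers.{i}.{meg_suffix}"] = \
                opt_state[f"decoder.layers.{i}.{opt_suffix}"]
    return out
```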
- Concatenate OPT's q, k, v projections into a single tensor, as Megatron-LM does:
```python
import numpy as np

# Build the fused Megatron-style query_key_value bias and dump it as a raw .bin file
with open(f"model.layers.{i}.attention.query_key_value.bias.0.bin", "wb") as f:
    val = np.concatenate((model[f'decoder.layers.{i}.self_attn.q_proj.bias'].numpy(),
                          model[f'decoder.layers.{i}.self_attn.k_proj.bias'].numpy(),
                          model[f'decoder.layers.{i}.self_attn.v_proj.bias'].numpy()),
                         axis=0)
    val.tofile(f)  # dtype follows the checkpoint; cast if the loader expects another
```
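Presumably the corresponding weight tensors need the same concatenation. One thing to watch: PyTorch `nn.Linear` stores its weight as `[out_features, in_features]`, and I am assuming FasterTransformer's loader wants the fused QKV weight transposed to `[hidden, 3 * hidden]`; the sketch below reflects that assumption and follows the same file-naming scheme as the bias above.

```python
# Assumption: same naming scheme as the bias file, fused weight stored transposed.
with open(f"model.layers.{i}.attention.query_key_value.weight.0.bin", "wb") as f:
    w = np.concatenate((model[f'decoder.layers.{i}.self_attn.q_proj.weight'].numpy(),
                        model[f'decoder.layers.{i}.self_attn.k_proj.weight'].numpy(),
                        model[f'decoder.layers.{i}.self_attn.v_proj.weight'].numpy()),
                       axis=0)                   # [3 * hidden, hidden]
    np.ascontiguousarray(w.T).tofile(f)          # written as [hidden, 3 * hidden]
```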
- Make stand-in weights for final_layernorm, which does not exist in this OPT checkpoint: fill final_layernorm.weight with ones and final_layernorm.bias with zeros (see the sketch below)
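For example (a sketch; `hidden_size` and the file names follow the same conventions assumed above):

```python
# Identity layer norm stand-ins: weight = ones, bias = zeros.
# hidden_size is the model dimension, e.g. 2048 for opt-1.3b.
np.ones(hidden_size, dtype=np.float32).tofile("model.final_layernorm.weight.bin")
np.zeros(hidden_size, dtype=np.float32).tofile("model.final_layernorm.bias.bin")
```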
Although the above procedure got the OPT model running under FasterTransformer, the generated completions were nonsensical.