DeepSeek-MoE

Π—Π΅Ρ€ΠΊΠ°Π»ΠΎ ΠΈΠ· https://github.com/deepseek-ai/deepseek-moe
0

ОписаниС

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Π―Π·Ρ‹ΠΊΠΈ

  • Python100%
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
README.md
DeepSeek LLM

Model Download | Evaluation Results | Quick Start | License | Citation

Paper LinkπŸ‘οΈ

1. Introduction

DeepSeekMoE 16B is a Mixture-of-Experts (MoE) language model with 16.4B parameters. It employs an innovative MoE architecture, which involves two principal strategies: fine-grained expert segmentation and shared experts isolation. It is trained from scratch on 2T English and Chinese tokens, and exhibits comparable performance with DeekSeek 7B and LLaMA2 7B, with only about 40% of computations. For research purposes, we release the model checkpoints of DeepSeekMoE 16B Base and DeepSeekMoE 16B Chat to the public, which can be deployed on a single GPU with 40GB of memory without the need for quantization. The model code file can be found here.

2. Evaluation Results

DeepSeekMoE 16B Base

We evaluate DeepSeekMoE 16B on various benchmarks and compare it with a series of models, as shown in the following.

  • Comparison with open source models on the Open LLM Leaderboard. DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin, and achieves comparable performance with LLaMA2 7B, which has approximately 2.5 times the activated parameters.

table

  • Comparison with DeepSeek 7B on our internal benchmarks. DeepSeek 7B is a dense model trained on the same corpus as DeepSeekMoE 16B. With only 40.5% of computations, DeepSeekMoE 16B achieves comparable performance with DeepSeek 7B.

table

  • Comparison with LLaMA2 7B on our internal benchmarks. With only 39.6% of computations, DeepSeekMoE 16B outperforms LLaMA2 7B on the majority of benchmarks.

table

DeepSeekMoE 16B Chat

We also evaluate DeepSeekMoE 16B Chat on various benchmarks and compare it with DeepSeek 7B Chat and LLaMA2 7B SFT. All of the compared models follow the same fine-tuning setting and data for fair comparison. The evaluation results are shown in the following. With only about 40% of computations, DeepSeekMoE 16B Chat achieves comparable or better performance than DeepSeek 7B Chat and LLaMA2 7B SFT.

table

3. Model Downloads

We release the DeepSeekMoE 16B, including both base and chat models, to the public. To support a broader and more diverse range of research within both academic and commercial communities. Please note that the use of this model is subject to the terms outlined in License section. Commercial usage is permitted under these terms.

Huggingface

ModelSequence LengthDownload
DeepSeekMoE 16B Base4096πŸ€— HuggingFace
DeepSeekMoE 16B Chat4096πŸ€— HuggingFace

4. Quick Start

Installation

On the basis of

Python >= 3.8
environment, install the necessary dependencies by running the following command:

Inference with Huggingface's Transformers

You can directly employ Huggingface's Transformers for model inference.

Text Completion

Chat Completion

Avoiding the use of the provided function

apply_chat_template
, you can also interact with our model following the sample template. Note that
messages
should be replaced by your input.

User: {messages[0]['content']} Assistant: {messages[1]['content']}<|end▁of▁sentence|>User: {messages[2]['content']} Assistant:

Note: By default (

add_special_tokens=True
), our tokenizer automatically adds a
bos_token
(
<|begin▁of▁sentence|>
) before the input text. Additionally, since the system prompt is not compatible with this version of our models, we DO NOT RECOMMEND including the system prompt in your input.

How to Fine-tune DeepSeekMoE

We provide script

fintune/finetune.py
for users to finetune our models on downstream tasks.

The script supports the training with DeepSpeed. You need install required packages by:

Please follow Sample Dataset Format to prepare your training data. Each item has two required fields

instruction
and
output
.

After data preparation, you can use the sample shell script to finetune the DeepSeekMoE model. Remember to specify

DATA_PATH
,
OUTPUT_PATH
. And please choose appropriate hyper-parameters(e.g.,
learning_rate
,
per_device_train_batch_size
) according to your scenario. We have used flash_attention2 by default. For devices supported by flash_attention, you can refer here. For this configuration, zero_stage needs to be set to 3, and we run it on eight A100 40 GPUs.

You can also finetune the model with 4/8-bits qlora, feel free to try it. For this configuration, it is possible to run on a single A100 80G GPU, and adjustments can be made according to your resources.

5. License

This code repository is licensed under the MIT License. The use of DeepSeekMoE models is subject to the Model License. DeepSeekMoE supports commercial use.

See the LICENSE-CODE and LICENSE-MODEL for more details.

6. Citation

@article{dai2024deepseekmoe, author={Damai Dai and Chengqi Deng and Chenggang Zhao and R. X. Xu and Huazuo Gao and Deli Chen and Jiashi Li and Wangding Zeng and Xingkai Yu and Y. Wu and Zhenda Xie and Y. K. Li and Panpan Huang and Fuli Luo and Chong Ruan and Zhifang Sui and Wenfeng Liang}, title={DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models}, journal = {CoRR}, volume = {abs/2401.06066}, year = {2024}, url = {https://arxiv.org/abs/2401.06066}, }

7. Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.