# sft-demos

Supervised finetuning of instruction-following Large Language Models (LLMs).

This repo contains demos for supervised finetuning (SFT) of large language models, such as Meta's Llama-2. In particular, we focus on tuning for short-form instruction-following capabilities.
## Instruction-tuning background
The goal of instruction tuning is to build LLMs that are capable of following natural language instructions to perform a wide range of tasks. The points below are adapted from Andrej Karpathy's "State of GPT" talk. The key steps for SFT:
- Collect small but high-quality datasets in the form of prompts and ideal responses.
- Do language modeling on this data; nothing changes algorithmically from pretraining.
- After training, we get an SFT model, which can be deployed as an assistant (and it works to some extent).
For more background, see any number of excellent papers on the subject, including Self-Instruct (2023), Orca (2023), and InstructGPT (2022).
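The "language modeling on this data" step is often implemented by masking the prompt tokens in the labels so the loss is computed only on the response. The sketch below is a minimal illustration of that idea (token IDs and the masking choice are illustrative; some setups train on the full sequence instead):

```python
# Minimal sketch of building SFT training labels from a prompt/response pair.
# Token IDs are made up; a real run would come from a tokenizer.
IGNORE_INDEX = -100  # label value that cross-entropy implementations skip

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

input_ids, labels = build_labels([101, 7592, 102], [2023, 2003, 103])
print(input_ids)  # [101, 7592, 102, 2023, 2003, 103]
print(labels)     # [-100, -100, -100, 2023, 2003, 103]
```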
## Finetuned models
See src for all finetuning runs.

- Scripts are included for both parameter-efficient (src/peft) and full-parameter (src/sft) finetuning.
- The full-parameter scripts support both single- and multi-GPU setups thanks to 🤗's accelerate package, while the peft scripts expect single-GPU setups only.
- Both peft and sft scripts leverage mixed-precision training, with the former running in 4-bit precision and the latter in fp16.
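As a rough illustration of why the parameter-efficient route fits on a single GPU: LoRA trains two small low-rank factors per adapted weight matrix instead of the full update. The dimensions below are illustrative, not measured from this repo's runs:

```python
# Back-of-the-envelope: trainable parameters for a LoRA update vs. a
# full-rank update of one weight matrix. Numbers are hypothetical.
d = 4096  # hidden size of an example d x d projection matrix
r = 16    # LoRA rank

full_params = d * d      # full-rank weight update
lora_params = 2 * d * r  # low-rank factors A (d x r) and B (r x d)

print(lora_params / full_params)  # 0.0078125 -> under 1% of the full update
```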
Here are some of my favorites:
- dfurman/Mixtral-8x7B-Instruct-v0.1 (peft)
- dfurman/Llama-2-70B-Instruct-v0.1 (peft)
  - Note: this model ranked 6th on 🤗's Open LLM Leaderboard in Aug 2023
## Basic inference
Note: Use the code below to get started with our SFT models, as run on 1x A100 (40 GB SXM).
### dfurman/Mixtral-8x7B-Instruct-v0.1

#### Setup
```sh
pip install -q -U transformers peft torch accelerate einops sentencepiece bitsandbytes
```
```python
import torch
from peft import PeftModel, PeftConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

peft_model_id = "dfurman/Mixtral-8x7B-Instruct-v0.1"
config = PeftConfig.from_pretrained(peft_model_id)

tokenizer = AutoTokenizer.from_pretrained(
    peft_model_id,
    use_fast=True,
    trust_remote_code=True,
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, peft_model_id)

messages = [
    {"role": "user", "content": "Tell me a recipe for a mai tai."},
]

print("\n\n*** Prompt:")
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
)
print(tokenizer.decode(input_ids[0]))

print("\n\n*** Generate:")
with torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        input_ids=input_ids.to("cuda"),
        max_new_tokens=1024,
        return_dict_in_generate=True,
    )

response = tokenizer.decode(
    output["sequences"][0][len(input_ids[0]):],
    skip_special_tokens=True,
)
print(response)
```
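The final decode step slices the generated sequence past the prompt length, so only newly generated tokens are decoded. The indexing works like this toy sketch (token IDs are made up):

```python
# model.generate returns the prompt tokens followed by the new tokens;
# slicing off the first len(prompt_ids) entries keeps only the response.
prompt_ids = [1, 5, 9]           # toy prompt token IDs
sequences = [1, 5, 9, 42, 7, 2]  # prompt + newly generated tokens
new_tokens = sequences[len(prompt_ids):]
print(new_tokens)  # [42, 7, 2]
```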
#### Outputs

```
*** Prompt:
<s> [INST] Tell me a recipe for a mai tai. [/INST]

*** Generate:
1.5 oz light rum
2 oz dark rum
1 oz lime juice
0.5 oz orange curaçao
0.5 oz orgeat syrup

In a shaker filled with ice, combine the light rum, dark rum, lime juice, orange curaçao, and orgeat syrup. Shake well.

Strain the mixture into a chilled glass filled with fresh ice.

Garnish with a lime wedge and a cherry.
```
## Evaluation
See src/eval for all evaluation runs.
We evaluate the models herein on 6 key benchmarks using EleutherAI's Language Model Evaluation Harness, a unified framework for testing generative language models.
- Precision: fp4

| Metric | Value |
|---|---|
| Avg. | 68.87 |
| ARC (25-shot) | 67.24 |
| HellaSwag (10-shot) | 86.03 |
| MMLU (5-shot) | 68.59 |
| TruthfulQA (0-shot) | 59.54 |
| Winogrande (5-shot) | 80.43 |
| GSM8K (5-shot) | 51.4 |
- Precision: fp16

| Metric | Value |
|---|---|
| Avg. | 65.72 |
| ARC (25-shot) | 69.62 |
| HellaSwag (10-shot) | 86.82 |
| MMLU (5-shot) | 69.18 |
| TruthfulQA (0-shot) | 57.43 |
| Winogrande (5-shot) | 83.9 |
| GSM8K (5-shot) | 27.37 |
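As a sanity check, the Avg. rows above are the unweighted means of the six benchmark scores:

```python
# Verify that Avg. is the arithmetic mean of the six benchmark scores.
fp4 = [67.24, 86.03, 68.59, 59.54, 80.43, 51.4]
fp16 = [69.62, 86.82, 69.18, 57.43, 83.9, 27.37]

print(round(sum(fp4) / 6, 2))   # 68.87
print(round(sum(fp16) / 6, 2))  # 65.72
```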
## Base models and datasets
We finetune from the following base models:
We use the following datasets:
- ehartford/dolphin
- jondurbin/airoboros-2.2.1
- garage-bAInd/Open-Platypus
- timdettmers/openassistant-guanaco
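These datasets provide instruction/response pairs, which get rendered into training text via a prompt template. The sketch below is a hypothetical formatter; the field names and template are illustrative, not the actual schema of the datasets above:

```python
# Hypothetical formatter: turn an instruction/response record into one
# SFT training string. "instruction"/"response" keys and the "###"
# template are illustrative, not this repo's exact format.
def format_example(record):
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['response']}"
    )

example = {"instruction": "Name a citrus fruit.", "response": "Lime."}
print(format_example(example))
```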
We use the following compute providers: