unsloth

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Gemma 7b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| Llama-2 7b | ▶️ Start on Colab | 2.2x faster | 43% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| CodeLlama 34b A100 | ▶️ Start on Colab | 1.9x faster | 27% less |
| Mistral 7b 1xT4 | ▶️ Start on Kaggle | 5x faster* | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

🦥 Unsloth.ai News

| Type | Links |
|---|---|
| 📚 Wiki & FAQ | Read Our Wiki |
| 📜 Documentation | Read The Doc |
| 💾 Installation | unsloth/README.md |
| Twitter (aka X) | Follow us on X |
| 🥇 Benchmarking | Performance Tables |
| 🌐 Released Models | Unsloth Releases |
| ✍️ Blog | Read our Blogs |

⭐ Key Features

  • All kernels written in OpenAI's Triton language. Manual backprop engine.
  • 0% loss in accuracy - no approximation methods - all exact.
  • No change of hardware. Supports NVIDIA GPUs released since 2018. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc). Check your GPU! GTX 1070 and 1080 work, but are slow.
  • Works on Linux and Windows via WSL.
  • Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
  • Open source trains 5x faster - see Unsloth Pro for 30x faster training!
  • If you trained a model with 🦥Unsloth, you can use this cool sticker!  
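
The minimum CUDA capability listed above can be checked from Python. The following is a minimal sketch, assuming PyTorch is already installed; `is_supported` and `describe_local_gpu` are hypothetical helper names used for illustration and are not part of Unsloth.

```python
MIN_CAPABILITY = (7, 0)  # minimum CUDA capability stated in the feature list


def is_supported(capability):
    """Return True if a (major, minor) capability meets the stated minimum."""
    return tuple(capability) >= MIN_CAPABILITY


def describe_local_gpu():
    """Report whether GPU 0 meets the minimum; degrades gracefully without PyTorch."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA GPU is visible to PyTorch"
    name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)
    verdict = "meets" if is_supported(capability) else "is below"
    return f"{name}: capability {capability[0]}.{capability[1]} {verdict} the 7.0 minimum"


if __name__ == "__main__":
    print(describe_local_gpu())
```

A T4 reports capability (7, 5) and passes the check. A GTX 1080 reports (6, 1), which falls below the minimum and corresponds to the "works, but is slow" caveat.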

🥇 Performance Benchmarking

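| 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥Unsloth Pro |
|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 14.82x |

| Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction |
|---|---|---|---|---|---|
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |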

💾 Installation Instructions

Conda Installation

Select either pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1, and substitute your choice for the <12.1/11.8> placeholder in the install command. If you have mamba, use mamba instead of conda for faster solving. See this GitHub issue for help on debugging Conda installs.

conda create --name unsloth_env python=3.10
conda activate unsloth_env
conda install pytorch cudatoolkit torchvision torchaudio pytorch-cuda=<12.1/11.8> -c pytorch -c nvidia
conda install xformers -c xformers
pip install bitsandbytes
pip install "unsloth[conda] @ git+https://github.com/unslothai/unsloth.git"

Pip Installation

Do NOT use this if you have Anaconda. You must use the Conda install method, or else stuff will BREAK.

  1. Find your CUDA version via
import torch; print(torch.version.cuda)
  2. For Pytorch 2.1.0: You can update Pytorch via Pip (interchange cu121 / cu118). Go to https://pytorch.org/ to learn more. Select either cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100 etc), use the "ampere" path. For Pytorch 2.1.1: go to step 3. For Pytorch 2.2.0: go to step 4.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
  --index-url https://download.pytorch.org/whl/cu121
# Run only the one line below that matches your CUDA version and GPU
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"
  3. For Pytorch 2.1.1: Use the "ampere" path for newer RTX 30xx GPUs or higher.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton \
  --index-url https://download.pytorch.org/whl/cu121
# Run only the one line below that matches your CUDA version and GPU
pip install "unsloth[cu118-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
  4. For Pytorch 2.2.0: Use the "ampere" path for newer RTX 30xx GPUs or higher.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.2.0 triton \
  --index-url https://download.pytorch.org/whl/cu121
# Run only the one line below that matches your CUDA version and GPU
pip install "unsloth[cu118-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
  5. If you get errors, try the below first, then go back to step 1:
pip install --upgrade pip
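
The choice between the install variants above can be sketched as a small lookup. This is an illustration only: `pick_unsloth_extra` is a hypothetical helper, not something Unsloth ships, and it assumes that "RTX 30xx or higher" corresponds to CUDA compute capability 8.0 and above.

```python
def pick_unsloth_extra(cuda_version, torch_version, capability_major):
    """Build the extra name (e.g. "cu121-ampere-torch211") from the steps above."""
    cuda_tags = {"11.8": "cu118", "12.1": "cu121"}
    torch_tags = {"2.1.0": "", "2.1.1": "-torch211", "2.2.0": "-torch220"}

    # torch.__version__ may carry a local suffix such as "+cu121"; drop it.
    torch_version = torch_version.split("+")[0]
    if cuda_version not in cuda_tags:
        raise ValueError(f"No documented extra for CUDA {cuda_version}")
    if torch_version not in torch_tags:
        raise ValueError(f"No documented extra for Pytorch {torch_version}")

    # Assumption: the "ampere" path applies to compute capability 8.0 and above.
    ampere = "-ampere" if capability_major >= 8 else ""
    return cuda_tags[cuda_version] + ampere + torch_tags[torch_version]


extra = pick_unsloth_extra("12.1", "2.1.1", 8)
print(f'pip install "unsloth[{extra}] @ git+https://github.com/unslothai/unsloth.git"')
```

On a real machine the three inputs would come from `torch.version.cuda`, `torch.__version__` and `torch.cuda.get_device_capability()[0]`.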

📜 Documentation

  • Go to our Wiki page for saving to GGUF, checkpointing, evaluation and more!
  • We support Hugging Face's TRL, Trainer, Seq2SeqTrainer or even plain Pytorch code!
  • We're in 🤗Hugging Face's official docs! Check out the SFT docs and DPO docs!
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support - 4x faster downloading!
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
] # Go to https://huggingface.co/unsloth for more 4-bit models!

# Load the model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        # Use bf16 where the GPU supports it, otherwise fall back to fp16
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
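
The batch settings in the example above determine how much data a run actually sees. As a rough sketch, assuming a single GPU and one dataset row per training example (no packing), the arithmetic is:

```python
# Values copied from the TrainingArguments in the example above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_gpus = 1  # assumption: single-GPU run

# Sequences that contribute to each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Total rows consumed before training stops at max_steps.
examples_seen = effective_batch_size * max_steps

print(effective_batch_size)  # 8
print(examples_seen)         # 480
```

With these settings the run is a short smoke test that touches 480 rows. For a full pass over the dataset, raise `max_steps` or set `num_train_epochs` instead.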

DPO Support

DPO (Direct Preference Optimization), PPO, Reward Modelling all seem to work as per 3rd party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on Tesla T4 here: notebook.

We're in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs!

from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer() # Patch TRL's DPOTrainer before it is imported below
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048 # Assumed value, same as the SFT example; adjust as needed

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE, # Placeholder: replace with your preference dataset
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()
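
The `YOUR_DATASET_HERE` placeholder above needs a preference dataset. TRL's `DPOTrainer` is documented to expect `prompt`, `chosen` and `rejected` text columns; the sketch below shows that shape with one made-up row and a hypothetical `validate_pairs` helper. The sample text is illustrative and not taken from any real dataset.

```python
REQUIRED_COLUMNS = ("prompt", "chosen", "rejected")

# One illustrative preference pair; a real dataset would hold many of these.
preference_pairs = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France is a country in Europe.",
    },
]


def validate_pairs(rows):
    """Check that every row has the three non-empty text columns; return the row count."""
    for index, row in enumerate(rows):
        for column in REQUIRED_COLUMNS:
            value = row.get(column)
            if not isinstance(value, str) or not value:
                raise ValueError(f"row {index}: column '{column}' is missing or empty")
    return len(rows)


print(validate_pairs(preference_pairs))  # 1
```

Once validated, the rows can be wrapped with `datasets.Dataset.from_list(preference_pairs)` and passed as `train_dataset`.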

🥇 Detailed Benchmarking Tables

  • Click "Code" for fully reproducible examples
  • "Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remain identical.
  • For the full list of benchmarking tables, go to our website
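| 1 A100 40GB | 🤗Hugging Face | Flash Attention 2 | 🦥Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | | |
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| memory MB | 18235 | 15365 | 9631 | 8525 | | |
| % saved | | 15.74 | 47.18 | 53.25 | | |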
Llama-Factory 3rd party benchmarking

  • Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.
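| Method | Bits | TGS | GRAM | Speed |
|---|---|---|---|---|
| HF | 16 | 2392 | 18GB | 100% |
| HF+FA2 | 16 | 2954 | 17GB | 123% |
| Unsloth+FA2 | 16 | 4007 | 16GB | 168% |
| HF | 4 | 2415 | 9GB | 101% |
| Unsloth+FA2 | 4 | 3726 | 7GB | 160% |

Specific model benchmarking tables (Mistral 7b, CodeLlama 34b etc.)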

Mistral 7b

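| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | | |
| % saved | | 40.99 | 62.06 | 68.74 | | |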

CodeLlama 34b

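| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Code Llama 34B | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | ▶️ Code | Code | Code | Code | | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | | |
| % saved | | 16.96 | 31.47 | 44.60 | | |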

1 Tesla T4

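| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | ▶️ Code | Code | Code | Code | | |
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| memory MB | 7199 | 7059 | 6459 | 5443 | | |
| % saved | | 1.94 | 10.28 | 24.39 | | |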

2 Tesla T4s via DDP

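| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | ▶️ Code | Code | Code | | | |
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| memory MB | 9176 | 9128 | 6904 | 6782 | | |
| % saved | | 0.52 | 24.76 | 26.09 | | |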

Performance comparisons on 1 Tesla T4 GPU:

Time taken for 1 epoch

One Tesla T4 on Google Colab: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |
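
The multipliers in parentheses can be reproduced from the epoch times in the Huggingface row. A small sketch, assuming every duration uses the "Nh Mm" format shown in the table (`to_minutes` and `epoch_speedup` are illustrative names):

```python
import re


def to_minutes(duration):
    """Convert a string such as "23h 15m" into whole minutes."""
    match = re.fullmatch(r"(\d+)h (\d+)m", duration)
    if match is None:
        raise ValueError(f"unexpected duration format: {duration!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def epoch_speedup(baseline, other):
    """Ratio of baseline epoch time to the compared epoch time."""
    return to_minutes(baseline) / to_minutes(other)


# Alpaca on 1 T4: Huggingface vs. Unsloth Open and Unsloth Pro.
print(round(epoch_speedup("23h 15m", "13h 7m"), 1))  # 1.8
print(round(epoch_speedup("23h 15m", "3h 6m"), 1))   # 7.5
```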

Peak Memory Usage

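| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |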

Performance comparisons on 2 Tesla T4 GPUs via DDP:

Time taken for 1 epoch

Two Tesla T4s on Kaggle: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |

Peak Memory Usage on a Multi GPU System (2 GPUs)

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 8.4GB / 6GB | 7.2GB / 5.3GB | 14.3GB / 6.6GB | 10.9GB / 5.9GB * |
| Unsloth Pro | 2 T4 | 7.7GB / 4.9GB | 7.5GB / 4.9GB | 8.5GB / 4.9GB | 6.2GB / 4.7GB * |
| Unsloth Max | 2 T4 | 10.5GB / 5GB | 10.6GB / 5GB | 10.6GB / 5GB | 10.5GB / 5GB * |

* Slim Orca uses bsz=1 for all benchmarks since bsz=2 OOMs. We can handle bsz=2, but we benchmark it with bsz=1 for consistency.


Credits

  1. RandomInternetPreson for confirming WSL support
  2. 152334H for experimental DPO support
  3. atgctg for syntax highlighting
