Python bindings for Transformer models implemented in C/C++ using the GGML library.

Supported Models

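| Models              | Model Type  | CUDA | Metal |
| :------------------ | :---------- | :--: | :---: |
| GPT-J, GPT4All-J    | gptj        |      |       |
| GPT-NeoX, StableLM  | gpt_neox    |      |       |
| LLaMA, LLaMA 2      | llama       |      |       |
| StarCoder, StarChat | gpt_bigcode |      |       |
| Dolly V2            | dolly-v2    |      |       |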


pip install ctransformers


It provides a unified interface for all models:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")
print(llm("AI is going to"))


To stream the output, set stream=True:

for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)
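When the streamed text is also needed afterwards (for logging or post-processing), the chunks can be accumulated while they are printed. A minimal sketch, assuming only that calling the model with stream=True yields string chunks as described above; `stream_and_collect` is a hypothetical helper, not part of ctransformers:

```python
from typing import Callable, Iterable


def stream_and_collect(llm: Callable[..., Iterable[str]], prompt: str) -> str:
    """Print chunks as they arrive and return the full generated text.

    `llm` is any callable that yields text chunks when called with
    stream=True, such as the object returned by from_pretrained().
    """
    chunks = []
    for text in llm(prompt, stream=True):
        print(text, end="", flush=True)  # show each chunk immediately
        chunks.append(text)
    print()  # finish the line once the stream ends
    return "".join(chunks)
```

Because the helper only relies on the call signature, it can be exercised with a plain generator function in place of a real model.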

You can load models from Hugging Face Hub directly:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use it with 🤗 Transformers, create the model and tokenizer using:

from ctransformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)


You can use 🤗 Transformers text generation pipeline:

from transformers import pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))

You can use 🤗 Transformers generation parameters:

pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)

You can use 🤗 Transformers tokenizers:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True) # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2") # Load tokenizer from original model repo.


It is integrated into LangChain. See LangChain docs.


To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)



Install CUDA libraries using:

pip install ctransformers[cuda]


To enable ROCm support, install the ctransformers package using:

CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers


To enable Metal support, install the ctransformers package using:

CT_METAL=1 pip install ctransformers --no-binary ctransformers


Note: GPTQ support is an experimental feature, and only LLaMA models are supported, using ExLlama.

Install additional dependencies using:

pip install ctransformers[gptq]

Load a GPTQ model using:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")


If the model name or path doesn't contain the word gptq, specify model_type="gptq".

It can also be used with LangChain. Low-level APIs are not fully supported.



| Parameter          | Type      | Description                                                      | Default |
| :----------------- | :-------- | :--------------------------------------------------------------- | :------ |
| top_k              | int       | The top-k value to use for sampling.                             | 40      |
| top_p              | float     | The top-p value to use for sampling.                             | 0.95    |
| temperature        | float     | The temperature to use for sampling.                             | 0.8     |
| repetition_penalty | float     | The repetition penalty to use for sampling.                      | 1.1     |
| last_n_tokens      | int       | The number of last tokens to use for repetition penalty.         | 64      |
| seed               | int       | The seed value to use for sampling tokens.                       | -1      |
| max_new_tokens     | int       | The maximum number of new tokens to generate.                    | 256     |
| stop               | List[str] | A list of sequences to stop generation when encountered.         | None    |
| stream             | bool      | Whether to stream the generated text.                            | False   |
| reset              | bool      | Whether to reset the model state before generating text.         | True    |
| batch_size         | int       | The batch size to use for evaluating tokens in a single prompt.  | 8       |
| threads            | int       | The number of threads to use for evaluating tokens.              | -1      |
| context_length     | int       | The maximum context length to use.                               | -1      |
| gpu_layers         | int       | The number of layers to run on GPU.                              | 0       |

Note: Currently only LLaMA, MPT and Falcon models support the context_length parameter.
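The gpu_layers example earlier passes a config value as a keyword argument to from_pretrained. Assuming the other parameters listed here are accepted in the same way, overrides can be checked against the documented names before loading. In this sketch `DEFAULT_CONFIG` simply copies the listed defaults and `merge_config` is a hypothetical helper, not a ctransformers function:

```python
from typing import Any, Dict

# Default values copied from the config parameter list in this document.
DEFAULT_CONFIG: Dict[str, Any] = {
    "top_k": 40,
    "top_p": 0.95,
    "temperature": 0.8,
    "repetition_penalty": 1.1,
    "last_n_tokens": 64,
    "seed": -1,
    "max_new_tokens": 256,
    "stop": None,
    "stream": False,
    "reset": True,
    "batch_size": 8,
    "threads": -1,
    "context_length": -1,
    "gpu_layers": 0,
}


def merge_config(**overrides: Any) -> Dict[str, Any]:
    """Return the defaults updated with overrides, rejecting unknown names."""
    unknown = sorted(set(overrides) - set(DEFAULT_CONFIG))
    if unknown:
        raise ValueError(f"Unknown config parameter(s): {', '.join(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}
```

The merged dictionary could then be unpacked into the loader, for example `from_pretrained(path, **merge_config(gpu_layers=50))`, provided the loader accepts every name used.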

class AutoModelForCausalLM

classmethod AutoModelForCausalLM.from_pretrained

from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    revision: Optional[str] = None,
    hf: bool = False,
) → LLM

Loads the language model from a local file or remote repo.


  • model_path_or_repo_id: The path to a model file or directory or the name of a Hugging Face Hub model repo.
  • model_type: The model type.
  • model_file: The name of the model file in repo or directory.
  • config: AutoConfig object.
  • lib: The path to a shared library or one of avx2, avx, basic.
  • local_files_only: Whether or not to only look at local files (i.e., do not try to download the model).
  • revision: The specific model version to use. It can be a branch name, a tag name, or a commit id.
  • hf: Whether to create a Hugging Face Transformers model.

Returns: LLM object.

class LLM

method LLM.__init__

__init__(
    model_path: str,
    model_type: Optional[str] = None,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)

Loads the language model from a local file.


  • model_path: The path to a model file.
  • model_type: The model type.
  • config: Config object.
  • lib: The path to a shared library or one of avx2, avx, basic.

property LLM.bos_token_id

The beginning-of-sequence token.

property LLM.config

The config object.

property LLM.context_length

The context length of the model.

property LLM.embeddings

The input embeddings.

property LLM.eos_token_id

The end-of-sequence token.

property LLM.logits

The unnormalized log probabilities.

property LLM.model_path

The path to the model file.

property LLM.model_type

The model type.

property LLM.pad_token_id

The padding token.

property LLM.vocab_size

The number of tokens in the vocabulary.

method LLM.detokenize

detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]

Converts a list of tokens to text.


  • tokens: The list of tokens.
  • decode: Whether to decode the text as a UTF-8 string.

Returns: The combined text of all tokens.

method LLM.embed

embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]

Computes embeddings for a text or list of tokens.

Note: Currently only LLaMA and Falcon models support embeddings.


  • input: The input text or list of tokens to get embeddings for.
  • batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1

Returns: The input embeddings.
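A common use of embeddings is comparing two texts. The following is a minimal sketch, assuming only that `embed` returns a list of floats as documented above; `cosine_similarity` and `text_similarity` are hypothetical helpers, not part of ctransformers:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def text_similarity(llm, first: str, second: str) -> float:
    """Embed both texts with llm.embed() and compare them."""
    return cosine_similarity(llm.embed(first), llm.embed(second))
```

Values close to 1 indicate vectors pointing in nearly the same direction; how well that tracks semantic similarity depends on the model.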

method LLM.eval

eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None

Evaluates a list of tokens.


  • tokens: The list of tokens to evaluate.
  • batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1

method LLM.generate

generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]

Generates new tokens from a list of tokens.


  • tokens: The list of tokens to generate tokens from.
  • top_k: The top-k value to use for sampling. Default: 40
  • top_p: The top-p value to use for sampling. Default: 0.95
  • temperature: The temperature to use for sampling. Default: 0.8
  • repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
  • last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
  • seed: The seed value to use for sampling tokens. Default: -1
  • batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1
  • reset: Whether to reset the model state before generating text. Default: True

Returns: The generated tokens.
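Since `generate` works on token ids and has no max_new_tokens parameter, a caller usually tokenizes the prompt, caps the output length, and detokenizes the result. A minimal sketch using only the methods documented in this reference; `generate_text` is a hypothetical helper name:

```python
from typing import List


def generate_text(llm, prompt: str, max_tokens: int = 32) -> str:
    """Generate up to max_tokens tokens with the token-level API."""
    tokens = llm.tokenize(prompt)
    generated: List[int] = []
    for token in llm.generate(tokens):
        if llm.is_eos_token(token):
            break  # defensive check; the generator may already stop here
        generated.append(token)
        if len(generated) >= max_tokens:
            break  # generate() takes no max_new_tokens, so cap it here
    return llm.detokenize(generated)
```

Sampling parameters such as top_k or temperature could be forwarded to `llm.generate` in the same call if the defaults are not suitable.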

method LLM.is_eos_token

is_eos_token(token: int) → bool

Checks if a token is an end-of-sequence token.


  • token: The token to check.

Returns: True if the token is an end-of-sequence token else False.

method LLM.prepare_inputs_for_generation

prepare_inputs_for_generation(
    tokens: Sequence[int],
    reset: Optional[bool] = None
) → Sequence[int]

Removes input tokens that are evaluated in the past and updates the LLM context.


  • tokens: The list of input tokens.
  • reset: Whether to reset the model state before generating text. Default: True

Returns: The list of tokens to evaluate.

method LLM.reset

reset() → None

Deprecated since 0.2.27.

method LLM.sample

sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int

Samples a token from the model.


  • top_k: The top-k value to use for sampling. Default: 40
  • top_p: The top-p value to use for sampling. Default: 0.95
  • temperature: The temperature to use for sampling. Default: 0.8
  • repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
  • last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
  • seed: The seed value to use for sampling tokens. Default: -1

Returns: The sampled token.
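For finer control than `generate`, `eval` and `sample` can be combined into a manual decoding loop. This sketch assumes that `sample` draws from the model state left by the most recent `eval` call, which is not stated explicitly above; `sample_loop` is a hypothetical helper name:

```python
from typing import List, Sequence


def sample_loop(llm, tokens: Sequence[int], max_tokens: int = 32) -> List[int]:
    """Manual decoding: evaluate the prompt, then sample and feed back."""
    llm.eval(tokens)  # process the prompt tokens first
    output: List[int] = []
    for _ in range(max_tokens):
        token = llm.sample()  # uses the configured sampling defaults
        if llm.is_eos_token(token):
            break
        output.append(token)
        llm.eval([token])  # feed the sampled token back into the model
    return output
```

The returned token ids can be turned back into text with `detokenize`.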

method LLM.tokenize

tokenize(text: str, add_bos_token: Optional[bool] = None) → List[int]

Converts text into a list of tokens.


  • text: The text to tokenize.
  • add_bos_token: Whether to add the beginning-of-sequence token.

Returns: The list of tokens.
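Tokenizing a prompt up front makes it possible to check whether the prompt plus the requested output fits in the model's context. A small sketch, assuming the `context_length` property reports the usable context size in tokens; `fits_in_context` is a hypothetical helper:

```python
def fits_in_context(llm, prompt: str, max_new_tokens: int = 256) -> bool:
    """Check whether the prompt plus the requested output fits the context."""
    prompt_tokens = llm.tokenize(prompt)
    return len(prompt_tokens) + max_new_tokens <= llm.context_length
```

The default of 256 mirrors the documented max_new_tokens default.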

method LLM.__call__

__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]

Generates text from a prompt.


  • prompt: The prompt to generate text from.
  • max_new_tokens: The maximum number of new tokens to generate. Default: 256
  • top_k: The top-k value to use for sampling. Default: 40
  • top_p: The top-p value to use for sampling. Default: 0.95
  • temperature: The temperature to use for sampling. Default: 0.8
  • repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
  • last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
  • seed: The seed value to use for sampling tokens. Default: -1
  • batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1
  • stop: A list of sequences to stop generation when encountered. Default: None
  • stream: Whether to stream the generated text. Default: False
  • reset: Whether to reset the model state before generating text. Default: True

Returns: The generated text.
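The per-call parameters can be combined to shape a single response. The sketch below uses max_new_tokens, temperature and stop from the list above; the prompt layout and the `answer` helper are illustrative assumptions, not part of ctransformers:

```python
def answer(llm, question: str) -> str:
    """Ask a single question and stop before the model starts a new one."""
    prompt = f"Q: {question}\nA:"
    text = llm(
        prompt,
        max_new_tokens=64,  # keep answers short
        temperature=0.2,    # less random than the 0.8 default
        stop=["\nQ:"],      # stop when the next question would begin
    )
    return text.strip()
```

Because stream is left at its default, the call returns a single string rather than a generator.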




