MixtralModel
Mixtral Sparse Mixture-of-Experts (SMoE) model for text generation via llama.cpp.
Mixtral 8x7B is a transformer language model with 8 expert feed-forward networks per layer; only 2 experts are activated per token, giving it the inference cost of a ~13B-parameter dense model while retaining capacity equivalent to a 47B model. It matches or surpasses Llama 2 70B and GPT-3.5 on most benchmarks.
Models are loaded as GGUF quantized checkpoints via llama-cpp-python.
The Q4_K_M quantization requires approximately 26 GB of RAM.
References
- [1] Jiang et al. (2024) "Mixtral of Experts" https://arxiv.org/abs/2401.04088
- [2] https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
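A minimal loading sketch via llama-cpp-python, assuming the default checkpoint and file documented below (`Llama.from_pretrained` downloads the GGUF file from the Hugging Face Hub on first use; the function is defined but deliberately not invoked here, since even the smallest quantization is many gigabytes):

```python
def load_mixtral(
    repo_id: str = "mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF",
    filename: str = "Mixtral-8x7B-Instruct-v0.1.Q2_K.gguf",
    n_ctx: int = 512,
):
    """Download (if needed) and load a Mixtral GGUF checkpoint.

    Not called here: the download is many GB and needs ~26 GB of RAM
    at the Q4_K_M quantization level.
    """
    from llama_cpp import Llama  # pip install llama-cpp-python

    return Llama.from_pretrained(repo_id=repo_id, filename=filename, n_ctx=n_ctx)
```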
Parameters
- model_name : string, default="mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF"
  The Mixtral Instruct checkpoint to load, in GGUF format. 'Mixtral-8x7B-Instruct-v0.1' is a Sparse Mixture-of-Experts (SMoE) model with 8 expert networks of 7B parameters each, activating 2 experts per token. It achieves quality comparable to larger dense models while being more efficient at inference. Warning: this model requires ~26 GB of RAM for the Q4_K_M quantization.
- filename : string, default="Mixtral-8x7B-Instruct-v0.1.Q2_K.gguf"
  The specific GGUF file to load for the Mixtral model. The quantization levels (Q2_K, Q3_K_M, Q4_0, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0) represent different trade-offs between model size, inference speed, and output quality. Q4_K_M is a popular choice for balancing performance and resource requirements.
- max_tokens : integer, default=100
  Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. Set to 100-200 for short answers, 500-1000 for detailed explanations or code.
- temperature : number, default=0.7
  Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 outputs are deterministic; around 0.7 balances quality and creativity.
- frequency_penalty : number, default=0.1
  Penalizes tokens proportionally to how often they have already appeared in the output (range 0.0-2.0). Higher values discourage repetition.
- context_window : integer, default=512
  Total token budget for a single forward pass, covering both the input prompt and the generated response. Mixtral 8x7B supports up to 32K tokens natively.
- device : string, default="CPU"
  Hardware device for llama.cpp inference. 'CPU' runs the model fully in RAM; a GPU option offloads all layers for faster inference. Given Mixtral's size, a GPU with at least 24 GB of VRAM is recommended for full offloading.
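Because context_window must cover both the prompt and up to max_tokens of generated text, a quick budget check can catch oversized prompts before inference. A sketch using the ~0.75 words-per-token heuristic mentioned above (the helper names are hypothetical; real counts come from the model's tokenizer):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the docs: 1 token ≈ 0.75 English words,
    # i.e. ~4/3 tokens per word. Only an estimate, not a real tokenizer.
    words = len(text.split())
    return int(words / 0.75) + 1

def fits_context(prompt_text: str, max_tokens: int, context_window: int = 512) -> bool:
    # The context window must hold the prompt plus the generated reply.
    return estimate_tokens(prompt_text) + max_tokens <= context_window
```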
Methods
generate(self, prompt: list[dict[str, str]]) -> list[str]
Generate a reply for the given chat prompt.
Parameters
- prompt : list of dict
- Conversation history in OpenAI chat format. Each dict must contain at least
  "role" ("system", "user", or "assistant") and "content" (the message text).
Returns
- list of str
- A single-element list containing the model's reply text, extracted from
choices[0]["message"]["content"].
get_schema(cls) -> dict
Generates the JSON Schema describing the component. (Inherited from ConfigObject.)
Returns
- dict
- Dictionary representing the JSON Schema of the component.
validate_and_transform(self, raw_data: dict) -> dict
Takes the data provided by the user to initialize the model and returns it with all the objects the model needs to work. (Inherited from ConfigObject.)
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.