
MixtralModel

GenerativeModel
DashAI.back.models.hugging_face.MixtralModel

Mixtral Sparse Mixture-of-Experts (SMoE) model for text generation via llama.cpp.

Mixtral 8x7B is a transformer language model with 8 expert feed-forward networks per layer; only 2 experts are activated per token, so each token uses roughly 13B active parameters (the inference cost of a ~13B dense model) while the full model retains 47B parameters of capacity. It matches or surpasses Llama 2 70B and GPT-3.5 on most benchmarks.

Models are loaded as GGUF quantized checkpoints via llama-cpp-python. The Q4_K_M quantization requires approximately 26 GB of RAM.
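A minimal loading sketch with llama-cpp-python is shown below. The helper name `load_mixtral` is hypothetical (not part of DashAI's API); the defaults mirror the parameters documented on this page, and `Llama.from_pretrained` downloads the GGUF file from the Hugging Face Hub on first use.

```python
def load_mixtral(
    repo_id: str = "mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF",
    filename: str = "Mixtral-8x7B-Instruct-v0.1.Q2_K.gguf",
    n_ctx: int = 512,
    n_gpu_layers: int = 0,  # 0 = pure CPU; -1 offloads all layers to the GPU
):
    """Download (if not cached) and load a GGUF Mixtral checkpoint."""
    # Imported locally so the sketch can be read without llama-cpp-python installed.
    from llama_cpp import Llama

    return Llama.from_pretrained(
        repo_id=repo_id,
        filename=filename,
        n_ctx=n_ctx,
        n_gpu_layers=n_gpu_layers,
    )
```

Calling `load_mixtral(n_gpu_layers=-1)` corresponds to the GPU `device` option described below; the default arguments correspond to CPU inference with the 512-token context window.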


Parameters

model_name : string, default=mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF
The Mixtral Instruct checkpoint to load in GGUF format. 'Mixtral-8x7B-Instruct-v0.1' is a Sparse Mixture-of-Experts (SMoE) model with 8 expert networks of 7B parameters each, activating 2 experts per token. It achieves quality comparable to larger dense models while being more efficient at inference. Warning: this model requires ~26 GB of RAM for the Q4_K_M quantization.
filename : string, default=Mixtral-8x7B-Instruct-v0.1.Q2_K.gguf
The specific GGUF file to load for the Mixtral model. The different quantization levels (Q2_K, Q3_K_M, Q4_0, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0) represent various trade-offs between model size, inference speed, and output quality. Q4_K_M is a popular choice for balancing performance and resource requirements.
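The memory cost of each quantization level can be estimated from bits per weight. The figures below are rough approximations (not exact llama.cpp numbers) applied to Mixtral's ~46.7B total parameters; note that the Q4_K_M estimate lands near the ~26 GB requirement stated above.

```python
# Rough size estimate for Mixtral 8x7B (~46.7B total weights) at
# different GGUF quantization levels. Bits-per-weight values are
# approximations for illustration only.
TOTAL_PARAMS = 46.7e9
APPROX_BITS_PER_WEIGHT = {"Q2_K": 3.0, "Q4_K_M": 4.5, "Q8_0": 8.5}

def approx_size_gb(quant: str) -> float:
    """Approximate checkpoint size in GB for a given quantization level."""
    return TOTAL_PARAMS * APPROX_BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant}: ~{approx_size_gb(quant):.0f} GB")
```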
max_tokens : integer, default=100
Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. Set to 100-200 for short answers, 500-1000 for detailed explanations or code.
temperature : number, default=0.7
Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 outputs are deterministic. Around 0.7 balances quality and creativity.
frequency_penalty : number, default=0.1
Penalizes tokens that have already appeared in the output based on frequency (range 0.0-2.0). Higher values discourage repetition.
context_window : integer, default=512
Total token budget for a single forward pass, including both the input prompt and the generated response. Mixtral 8x7B supports up to 32K tokens natively.
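Because the context window covers both prompt and generation, the prompt must leave room for `max_tokens`. A small sketch of this budget check (the function name is illustrative, not part of DashAI):

```python
def fits_context(prompt_tokens: int, max_tokens: int, context_window: int = 512) -> bool:
    """True if the prompt plus the generation budget fits in one forward pass."""
    return prompt_tokens + max_tokens <= context_window

# With the defaults (context_window=512, max_tokens=100), prompts longer
# than 412 tokens leave no room for a full-length response.
print(fits_context(400, 100))  # 500 <= 512 -> True
print(fits_context(450, 100))  # 550 > 512  -> False
```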
device : string, default=CPU
Hardware device for llama.cpp inference. 'CPU' runs the model fully in RAM. A GPU option offloads all layers for faster inference. Due to the large size of Mixtral, a GPU with at least 24 GB VRAM is recommended for full GPU offloading.

Methods

generate(self, prompt: list[dict[str, str]]) -> list[str]

Defined on MixtralModel

Generate a reply for the given chat prompt.

Parameters

prompt : list of dict
Conversation history in OpenAI chat format. Each dict must contain at least "role" ("system", "user", or "assistant") and "content" (the message text).

Returns

list of str
A single-element list containing the model's reply text, extracted from choices[0]["message"]["content"].
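The snippet below illustrates the expected prompt shape and the reply extraction described above. The response dict mimics the llama.cpp chat-completion structure; its content values are illustrative, not real model output.

```python
# OpenAI-style chat prompt in the format generate() expects.
prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a Mixture of Experts?"},
]

# Shape of a llama.cpp chat-completion response (values are mocked).
response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "A MoE routes each token to a few expert networks."}}
    ]
}

# generate() returns a single-element list with the reply text,
# extracted from choices[0]["message"]["content"].
reply = [response["choices"][0]["message"]["content"]]
print(reply)
```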

get_schema(cls) -> dict

Defined on ConfigObject

Generates the JSON Schema describing the component's configuration.

Returns

dict
Dictionary representing the JSON Schema of the component.

validate_and_transform(self, raw_data: dict) -> dict

Defined on ConfigObject

Validates the user-provided initialization data and returns it transformed, with all the objects the model needs in order to run.

Parameters

raw_data : dict
A dictionary with the data provided by the user to initialize the model.

Returns

dict
A validated dictionary with the necessary objects.

Compatible with