MistralModel
Mistral Instruct model for open-ended text generation via llama.cpp.
Mistral is a 7B-parameter transformer language model developed by Mistral AI, designed to deliver high performance with efficient inference. It uses grouped-query attention (GQA) for faster decoding and sliding-window attention (SWA) to handle long contexts. The 12B Mistral-Nemo variant, developed jointly with NVIDIA, extends the context window to 128K tokens and improves multilingual capability.
Models are loaded as GGUF quantized checkpoints via llama-cpp-python,
allowing CPU and GPU inference without requiring a full PyTorch stack.
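As a rough sketch of how such a loader is configured, the helper below maps this component's settings onto llama-cpp-python's `Llama` constructor arguments. `model_path`, `n_ctx`, and `n_gpu_layers` are real `Llama` parameters; the helper function itself and the device-string mapping are illustrative assumptions, not this component's actual code.

```python
# Sketch (assumptions): translate component config into llama.cpp loader
# arguments. The helper name and device -> n_gpu_layers mapping are
# illustrative; model_path, n_ctx, and n_gpu_layers are real llama-cpp-python
# Llama constructor parameters.
def build_llama_kwargs(model_path: str, context_window: int, device: str) -> dict:
    """Build keyword arguments for llama_cpp.Llama from component settings."""
    return {
        "model_path": model_path,  # local path to the GGUF checkpoint
        "n_ctx": context_window,   # total token budget (prompt + response)
        # 0 keeps all layers on CPU; -1 offloads every layer to the GPU
        "n_gpu_layers": -1 if device == "GPU" else 0,
    }

kwargs = build_llama_kwargs("mistral-7b-instruct-v0.3.Q4_K_M.gguf", 512, "CPU")
```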
References
- [1] Jiang et al. (2023) "Mistral 7B" https://arxiv.org/abs/2310.06825
- [2] https://huggingface.co/mistralai
Parameters
- model_name : string, default='bartowski/Mistral-7B-Instruct-v0.3-GGUF' - The Mistral Instruct checkpoint to load in GGUF format. 'Mistral-7B-Instruct-v0.3' is a 7B-parameter instruction model that delivers strong performance for its size. 'Mistral-Nemo-Instruct-2407' is a 12B-parameter model jointly developed with NVIDIA, featuring a 128K context window and improved multilingual capabilities.
- max_tokens : integer, default=100 - Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. Set to 100-200 for short answers, 500-1000 for detailed explanations or code.
- temperature : number, default=0.7 - Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 the model picks the most likely token (deterministic). Around 0.7 balances quality and creativity. At 1.0 outputs are maximally varied.
- frequency_penalty : number, default=0.1 - Penalizes tokens that have already appeared in the output based on frequency (range 0.0-2.0). Higher values discourage repetition.
- context_window : integer, default=512 - Total token budget for a single forward pass, including prompt and response. Mistral-7B supports up to 32K tokens; Mistral-Nemo supports up to 128K tokens.
- device : string, default='CPU' - Hardware device for llama.cpp inference. 'CPU' runs the model fully in RAM. A GPU option offloads all layers for faster inference.
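Because `context_window` must hold the prompt and the response together, `max_tokens` effectively trades off against prompt length. The hypothetical helper below applies the rough 1 token ≈ 0.75 English words rule from above to check whether a prompt plus the requested response fits the budget; both function names are illustrative, not part of this component.

```python
# Sketch (hypothetical helpers): check that prompt + response fit the
# context window, using the rough rule 1 token ≈ 0.75 English words.
def estimate_tokens(text: str) -> int:
    # word count / 0.75 ≈ token count (crude English-only heuristic)
    return int(len(text.split()) / 0.75)

def fits_context(prompt: str, max_tokens: int, context_window: int = 512) -> bool:
    # the window must hold the prompt AND the max_tokens response budget
    return estimate_tokens(prompt) + max_tokens <= context_window

fits_context("Explain sliding-window attention in two sentences.", max_tokens=100)
```

With the default 512-token window, a long prompt leaves correspondingly less room for the reply, so raising `max_tokens` may also require raising `context_window`.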
Methods
generate(self, prompt: list[dict[str, str]]) -> list[str]
Generate a reply for the given chat prompt.
Parameters
- prompt : list of dict
- Conversation history in OpenAI chat format. Each dict must contain at least
"role" ("system", "user", or "assistant") and "content" (the message text).
Returns
- list of str
- A single-element list containing the model's reply text, extracted from
choices[0]["message"]["content"].
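The shapes above can be sketched end to end: the prompt is a plain list of role/content dicts, and the reply is read from a `create_chat_completion`-style response. The response dict below is a hand-written stand-in to show the structure, not real model output.

```python
# The OpenAI-style chat prompt generate() expects: a list of dicts,
# each with at least "role" and "content".
prompt = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is grouped-query attention?"},
]

# Stand-in for the dict llama-cpp-python returns from create_chat_completion;
# the "content" text here is fabricated for illustration.
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "GQA shares key/value heads across query heads."}}
    ]
}

# generate() returns a single-element list holding the reply text
replies = [response["choices"][0]["message"]["content"]]
```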
get_schema(cls) -> dict
Generates the component's JSON Schema.
Returns
- dict
- Dictionary representing the JSON Schema of the component.
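The dict below sketches the kind of schema this could return for the parameters documented above. The field names and defaults mirror the Parameters section, but the exact schema layout is an assumption, not the component's actual output.

```python
# Sketch (assumed structure): a JSON Schema covering the parameters above.
# Ranges and defaults come from the Parameters section; the overall layout
# is an assumption about what get_schema() returns.
schema = {
    "type": "object",
    "properties": {
        "model_name": {"type": "string", "default": "bartowski/Mistral-7B-Instruct-v0.3-GGUF"},
        "max_tokens": {"type": "integer", "default": 100},
        "temperature": {"type": "number", "minimum": 0.0, "maximum": 1.0, "default": 0.7},
        "frequency_penalty": {"type": "number", "minimum": 0.0, "maximum": 2.0, "default": 0.1},
        "context_window": {"type": "integer", "default": 512},
        "device": {"type": "string", "enum": ["CPU", "GPU"], "default": "CPU"},
    },
}
```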
validate_and_transform(self, raw_data: dict) -> dict
Validates the data given by the user to initialize the model and returns it with all the objects the model needs to work.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.
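A minimal sketch of what such validation might do is shown below: reject unknown keys and fill in the documented defaults for anything the user omitted. This is a hypothetical implementation; the real method would also type-check values and construct the llama.cpp objects the model needs.

```python
# Sketch (hypothetical implementation): merge user-provided data with the
# documented defaults and reject unknown keys. The real method also builds
# the llama.cpp objects the model needs.
DEFAULTS = {
    "model_name": "bartowski/Mistral-7B-Instruct-v0.3-GGUF",
    "max_tokens": 100,
    "temperature": 0.7,
    "frequency_penalty": 0.1,
    "context_window": 512,
    "device": "CPU",
}

def validate_and_transform(raw_data: dict) -> dict:
    unknown = set(raw_data) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **raw_data}  # user values override the defaults

config = validate_and_transform({"temperature": 0.2, "device": "GPU"})
```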