LlamaModel
Meta Llama 3.x instruction-tuned model for text generation via llama.cpp.
Wraps the Meta Llama 3.x family of open-weight instruction-tuned LLMs
loaded in Q4_K_M GGUF format using the llama-cpp-python library.
GGUF quantization enables efficient CPU and GPU inference without requiring
full-precision weights, making the models practical on consumer hardware.
Three sizes are available via bartowski's community quantizations: 1B (fastest, CPU-friendly), 3B (balanced), and 8B (highest quality).
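A minimal sketch of how one of these quantized checkpoints could be loaded with llama-cpp-python. The repo/file names follow bartowski's naming convention but should be verified on the Hugging Face hub, and `load_quantized` is an illustrative helper, not part of this class:

```python
# Hypothetical mapping from model size to (repo_id, GGUF filename) for
# bartowski's Q4_K_M community quantizations (verify names on the hub).
QUANT_REPOS = {
    "1B": ("bartowski/Llama-3.2-1B-Instruct-GGUF",
           "Llama-3.2-1B-Instruct-Q4_K_M.gguf"),
    "3B": ("bartowski/Llama-3.2-3B-Instruct-GGUF",
           "Llama-3.2-3B-Instruct-Q4_K_M.gguf"),
    "8B": ("bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
           "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"),
}

def repo_and_file(size: str) -> tuple[str, str]:
    """Resolve a size label ('1B', '3B', '8B') to a repo and filename."""
    return QUANT_REPOS[size]

def load_quantized(size: str, n_ctx: int = 512):
    """Download (on first use) and load the Q4_K_M GGUF for the given size."""
    from llama_cpp import Llama  # deferred import: heavyweight dependency
    repo, fname = repo_and_file(size)
    return Llama.from_pretrained(repo_id=repo, filename=fname, n_ctx=n_ctx)
```

Calling `load_quantized("3B")` would fetch roughly 2 GB on first use; the mapping itself is cheap to inspect.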
References
- [1] Meta AI, "Llama 3", 2024. https://ai.meta.com/blog/meta-llama-3/
- [2] https://huggingface.co/bartowski
Parameters
- model_name : string, default="bartowski/Llama-3.2-3B-Instruct-GGUF"
  The Meta Llama 3.x Instruct checkpoint to load in GGUF format via bartowski's community quantizations. 'Llama-3.2-1B' (~1B parameters) is the smallest and fastest, ideal for CPU-only systems. 'Llama-3.2-3B' (~3B parameters) offers a good speed/quality trade-off. 'Meta-Llama-3.1-8B' (~8B parameters) delivers the highest quality at the cost of more RAM and slower inference.
- max_tokens : integer, default=100
  Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. Set to 100-200 for short answers, 500-1000 for detailed explanations or code. Must not exceed the context window minus the prompt length.
- temperature : number, default=0.7
  Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 the model always picks the most likely token (greedy, fully deterministic). Around 0.7 is a good balance for conversational tasks. At 1.0 outputs are maximally varied and unpredictable.
- frequency_penalty : number, default=0.1
  Penalizes tokens that have already appeared in the output based on how often they occur (range 0.0-2.0). At 0.0 there is no penalty and the model may repeat itself. Values around 0.1-0.3 gently discourage repetition. High values (1.5+) strongly prevent reuse of any word, which may produce less coherent text.
- context_window : integer, default=512
  Total token budget for a single forward pass, including both the input prompt and the generated response. Larger values allow longer conversations but consume more RAM/VRAM. Both Llama 3.1 and Llama 3.2 models natively support context windows of up to 128K tokens.
- device : string, default="CPU"
  Hardware device for llama.cpp inference. 'CPU' runs the model entirely in RAM with no GPU requirement. Selecting a GPU option sets n_gpu_layers=-1, offloading every transformer layer to the GPU for faster inference.
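The temperature and frequency_penalty parameters above can be sketched as plain logit arithmetic. This is an illustrative reimplementation of the standard sampling math, not llama.cpp's actual code; `next_token_probs` and its toy vocabulary are hypothetical:

```python
import math
from collections import Counter

def next_token_probs(logits: dict, generated: list,
                     temperature: float = 0.7,
                     frequency_penalty: float = 0.1) -> dict:
    """Turn raw logits into next-token probabilities.

    logits: token -> raw score; generated: tokens already emitted.
    """
    counts = Counter(generated)
    # Frequency penalty: subtract penalty * occurrence count from each logit,
    # so tokens repeated often become progressively less likely.
    adjusted = {t: s - frequency_penalty * counts[t] for t, s in logits.items()}
    if temperature == 0.0:
        # Greedy decoding: all probability mass on the argmax.
        best = max(adjusted, key=adjusted.get)
        return {t: (1.0 if t == best else 0.0) for t in adjusted}
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {t: s / temperature for t, s in adjusted.items()}
    m = max(scaled.values())
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}
```

At temperature 0.0 the distribution collapses to the single most likely token, and a nonzero frequency_penalty visibly lowers the probability of tokens that have already been generated.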
Methods
generate(self, prompt: list[dict[str, str]]) -> list[str]
Generate a reply for the given chat prompt.
Parameters
- prompt : list of dict
- Conversation history in OpenAI chat format. Each dict must contain at least
  "role" ("system", "user", or "assistant") and "content" (the message text).
Returns
- list of str
- A single-element list containing the model's reply text, extracted from
choices[0]["message"]["content"].
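A small sketch of the prompt format and return shape described above. The response dict here is stand-in data shaped like a llama.cpp chat-completion result, not real model output, and `extract_reply` is an illustrative helper:

```python
def extract_reply(response: dict) -> list[str]:
    # Pull the reply text out of choices[0]["message"]["content"],
    # wrapped in a single-element list as generate() documents.
    return [response["choices"][0]["message"]["content"]]

# Conversation history in OpenAI chat format, as generate() expects.
prompt = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is GGUF?"},
]

# Abridged shape of a llama.cpp chat-completion response (stand-in data):
response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "GGUF is a binary format for quantized models."}}
    ]
}
assert extract_reply(response) == ["GGUF is a binary format for quantized models."]
```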
get_schema(cls) -> dict
Generates the component's JSON Schema (inherited from ConfigObject).
Returns
- dict
- Dictionary representing the JSON Schema of the component.
validate_and_transform(self, raw_data: dict) -> dict
Takes the data given by the user to initialize the model and returns it with all the objects that the model needs to work (inherited from ConfigObject).
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.
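A hypothetical sketch of the kind of checks validate_and_transform might perform. The parameter names, defaults, and ranges come from the Parameters section above; the function itself (`validate_config`) is illustrative, not the actual implementation:

```python
# Defaults taken from the Parameters section of this docstring.
DEFAULTS = {
    "model_name": "bartowski/Llama-3.2-3B-Instruct-GGUF",
    "max_tokens": 100,
    "temperature": 0.7,
    "frequency_penalty": 0.1,
    "context_window": 512,
    "device": "CPU",
}

def validate_config(raw_data: dict) -> dict:
    """Merge user data over the defaults and range-check it (illustrative)."""
    cfg = {**DEFAULTS, **raw_data}
    if not 0.0 <= cfg["temperature"] <= 1.0:
        raise ValueError("temperature must be in [0.0, 1.0]")
    if not 0.0 <= cfg["frequency_penalty"] <= 2.0:
        raise ValueError("frequency_penalty must be in [0.0, 2.0]")
    if cfg["max_tokens"] >= cfg["context_window"]:
        raise ValueError("max_tokens must leave room for the prompt "
                         "in the context window")
    return cfg
```

Merging user data over a defaults dict keeps partial configurations valid while still rejecting out-of-range values early, before the model is loaded.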