LlamaModel

GenerativeModel
DashAI.back.models.hugging_face.LlamaModel

Meta Llama 3.x instruction-tuned model for text generation via llama.cpp.

Wraps the Meta Llama 3.x family of open-weight instruction-tuned LLMs loaded in Q4_K_M GGUF format using the llama-cpp-python library. GGUF quantization enables efficient CPU and GPU inference without requiring full-precision weights, making the models practical on consumer hardware.

Three sizes are available via bartowski's community quantizations: 1B (fastest, CPU-friendly), 3B (balanced), and 8B (highest quality).

Parameters

model_name : string, default=bartowski/Llama-3.2-3B-Instruct-GGUF
The Meta Llama 3.x Instruct checkpoint to load in GGUF format via bartowski's community quantizations. 'Llama-3.2-1B' (~1B parameters) is the smallest and fastest, ideal for CPU-only systems. 'Llama-3.2-3B' (~3B parameters) offers a good speed/quality trade-off. 'Meta-Llama-3.1-8B' (~8B parameters) delivers the highest quality at the cost of more RAM and slower inference.
max_tokens : integer, default=100
Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. Set to 100-200 for short answers, 500-1000 for detailed explanations or code. Must not exceed the context window minus the prompt length.
temperature : number, default=0.7
Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 the model always picks the most likely token (greedy, fully deterministic). Around 0.7 is a good balance for conversational tasks. At 1.0 outputs are maximally varied and unpredictable.
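The effect of temperature can be sketched as scaling the model's logits before softmax sampling. This is a minimal illustrative implementation, not DashAI or llama.cpp code:

```python
import math
import random

def sample_with_temperature(logits: list[float], temperature: float) -> int:
    """Sample a token index from logits scaled by temperature.

    temperature=0.0 falls back to greedy argmax (fully deterministic);
    higher values flatten the distribution, increasing randomness.
    """
    if temperature == 0.0:
        # Greedy decoding: always pick the most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw from the resulting categorical distribution.
    r = random.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

At temperature 0.0 the function reduces to argmax, which is why generation becomes fully reproducible; higher temperatures spread probability mass onto lower-ranked tokens.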
frequency_penalty : number, default=0.1
Penalizes tokens that have already appeared in the output based on how often they occur (range 0.0-2.0). At 0.0 there is no penalty and the model may repeat itself. Values around 0.1-0.3 gently discourage repetition. High values (1.5+) strongly prevent reuse of any word, which may produce less coherent text.
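The count-based penalty can be sketched as a direct adjustment of the logits, in the style of the OpenAI-compatible frequency_penalty that llama.cpp implements (illustrative only, not the library's actual code):

```python
from collections import Counter

def apply_frequency_penalty(
    logits: list[float], generated_tokens: list[int], penalty: float
) -> list[float]:
    """Subtract penalty * occurrence_count from each token's logit.

    Tokens that have appeared more often in the output so far are
    penalized proportionally, discouraging repetition.
    """
    counts = Counter(generated_tokens)
    return [
        logit - penalty * counts.get(token_id, 0)
        for token_id, logit in enumerate(logits)
    ]
```

With penalty=0.1, a token that has already appeared twice loses 0.2 from its logit, while unseen tokens are untouched.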
context_window : integer, default=512
Total token budget for a single forward pass, including both the input prompt and the generated response. Larger values allow longer conversations but consume more RAM/VRAM. Both Llama 3.1 and Llama 3.2 models natively support context windows of up to 128K tokens.
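The budget constraint described above can be checked with a small helper (hypothetical, not part of DashAI):

```python
def fits_context(prompt_tokens: int, max_tokens: int, context_window: int = 512) -> bool:
    """Return True if the prompt plus the requested completion
    fits inside the model's context window."""
    return prompt_tokens + max_tokens <= context_window
```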
device : string, default=CPU
Hardware device for llama.cpp inference. 'CPU' runs the model fully in RAM with no GPU requirement. Selecting a GPU option sets n_gpu_layers=-1, offloading every transformer layer to the GPU for faster inference.
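The device setting maps to llama-cpp-python's n_gpu_layers argument roughly as follows (a sketch of the described behavior, assuming a simple CPU/GPU toggle):

```python
def n_gpu_layers_for(device: str) -> int:
    """Map the 'device' setting to llama-cpp-python's n_gpu_layers:
    0 keeps all layers on the CPU, -1 offloads every layer to the GPU."""
    return 0 if device.upper() == "CPU" else -1
```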

Methods

generate(self, prompt: list[dict[str, str]]) -> list[str]

Defined on LlamaModel

Generate a reply for the given chat prompt.

Parameters

prompt : list of dict
Conversation history in OpenAI chat format. Each dict must contain at least "role" ("system", "user", or "assistant") and "content" (the message text).
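A prompt in the expected format can be built and checked like this (validate_prompt is a hypothetical helper, shown only to illustrate the required keys):

```python
def validate_prompt(prompt: list[dict[str, str]]) -> list[dict[str, str]]:
    """Check that each message carries the required OpenAI-chat keys."""
    valid_roles = {"system", "user", "assistant"}
    for message in prompt:
        if message.get("role") not in valid_roles:
            raise ValueError(f"invalid role: {message.get('role')!r}")
        if "content" not in message:
            raise ValueError("message is missing 'content'")
    return prompt

prompt = validate_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is GGUF quantization?"},
])
```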

Returns

list of str
A single-element list containing the model's reply text, extracted from choices[0]["message"]["content"].
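The extraction step can be illustrated against the shape of a llama.cpp chat-completion response (the response dict below is a truncated mock, not real model output):

```python
def extract_reply(response: dict) -> list[str]:
    """Pull the reply text out of a chat-completion response dict,
    mirroring the single-element list this method returns."""
    return [response["choices"][0]["message"]["content"]]

# Truncated, illustrative shape of a create_chat_completion response:
mock_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "GGUF is a quantized model format."}}
    ]
}
```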

get_schema(cls) -> dict

Defined on ConfigObject

Generates the JSON Schema associated with the component.

Returns

dict
Dictionary representing the JSON Schema of the component.

validate_and_transform(self, raw_data: dict) -> dict

Defined on ConfigObject

Takes the data provided by the user to initialize the model and returns it augmented with all the objects the model needs to work.

Parameters

raw_data : dict
A dictionary with the data provided by the user to initialize the model.

Returns

dict
A validated dictionary with the necessary objects.

Compatible with