QwenModel

GenerativeModel
DashAI.back.models.hugging_face.QwenModel

Qwen 2.5 Instruct model for efficient text generation via llama.cpp.

Qwen 2.5 is a series of dense transformer language models from Alibaba Cloud, spanning 0.5B to 72B parameters. The DashAI integration exposes the 0.5B and 1.5B Instruct variants, which run comfortably on CPU. Both are trained on 18 trillion tokens with improved coding, mathematics, and multilingual capability over Qwen 2.

Models are loaded as GGUF Q8_0 quantized checkpoints via llama-cpp-python; the quantization file is selected automatically from the HuggingFace repo.
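The automatic selection of the Q8_0 file can be sketched as a small filename filter over the repo's file listing. This is an illustrative helper, not DashAI's actual code; `select_q8_0_file` and the file names below are invented for the example (llama-cpp-python's `Llama.from_pretrained` accepts a glob such as `filename="*q8_0.gguf"` for the same purpose).

```python
import fnmatch

def select_q8_0_file(repo_files: list[str]) -> str:
    """Pick the Q8_0 quantized GGUF checkpoint from a repo file listing.
    Hypothetical helper mirroring the automatic selection described above."""
    matches = [f for f in repo_files if fnmatch.fnmatch(f.lower(), "*q8_0*.gguf")]
    if not matches:
        raise FileNotFoundError("no Q8_0 GGUF file found in repo listing")
    return matches[0]

# Example listing (file names invented for illustration):
files = [
    "README.md",
    "qwen2.5-1.5b-instruct-q4_k_m.gguf",
    "qwen2.5-1.5b-instruct-q8_0.gguf",
]
print(select_q8_0_file(files))  # picks the q8_0 checkpoint
```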

Parameters

model_name : string, default=Qwen/Qwen2.5-1.5B-Instruct-GGUF
The Qwen 2.5 Instruct checkpoint to load in GGUF format. '0.5B' (500M parameters) is faster and uses less memory, suitable for lightweight tasks on CPU. '1.5B' (1.5B parameters) is more capable and produces higher-quality responses at the cost of more memory and slightly slower inference.
max_tokens : integer, default=100
Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. Set to 100-200 for short answers, 500-1000 for detailed explanations or code. Must not exceed the context window minus the prompt length.
temperature : number, default=0.7
Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 the model always picks the most likely token (greedy, fully deterministic). Around 0.7 is a good balance for conversational tasks. At 1.0 outputs are maximally varied and unpredictable.
frequency_penalty : number, default=0.1
Penalizes tokens that have already appeared in the output based on how often they occur (range 0.0-2.0). At 0.0 there is no penalty and the model may repeat itself. Values around 0.1-0.3 gently discourage repetition. High values (1.5+) strongly prevent reuse of any word, which may produce less coherent text.
context_window : integer, default=512
Total token budget for a single forward pass, including both the input prompt and the generated response. Larger values allow longer conversations but consume more RAM/VRAM. Qwen 2.5 supports up to 32768 tokens natively; keep this at or below that limit.
device : string, default=CPU
Hardware device for llama.cpp inference. 'CPU' runs the model fully in RAM with no GPU requirement. Selecting a GPU option sets n_gpu_layers=-1, offloading every transformer layer to the GPU for faster inference.
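How these parameters might map onto llama-cpp-python constructor arguments can be sketched as follows. This is an assumption about the wiring, not DashAI's actual implementation; `build_llama_kwargs` is a hypothetical helper. Note that model-construction options (context window, device) differ from per-call sampling options (max_tokens, temperature, frequency_penalty), which llama-cpp-python accepts at generation time via `create_chat_completion`.

```python
def build_llama_kwargs(
    model_name: str = "Qwen/Qwen2.5-1.5B-Instruct-GGUF",
    context_window: int = 512,
    device: str = "CPU",
) -> dict:
    """Map the documented parameters to llama-cpp-python constructor kwargs.
    Hypothetical sketch; the real integration may differ."""
    return {
        "repo_id": model_name,
        "n_ctx": context_window,                        # prompt + completion budget
        "n_gpu_layers": -1 if device != "CPU" else 0,   # -1 offloads all layers
    }

print(build_llama_kwargs(device="GPU"))
```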

Methods

generate(self, prompt: list[dict[str, str]]) -> list[str]

Defined on QwenModel

Generate a reply for the given chat prompt.

Parameters

prompt : list of dict
Conversation history in OpenAI chat format. Each dict must contain at least "role" ("system", "user", or "assistant") and "content" (the message text).

Returns

list of str
A single-element list containing the model's reply text, extracted from choices[0]["message"]["content"].
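The prompt format and return-value extraction can be illustrated with a mocked response. The conversation below and the response contents are invented for the example; the response dict follows the llama.cpp chat-completion shape described above, and the last line mirrors the single-element-list return contract.

```python
# Conversation history in OpenAI chat format, as expected by generate():
prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name one prime number."},
]

# Shape of a llama.cpp chat-completion response (values invented here):
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "2 is a prime number."}}
    ]
}

# generate() returns a single-element list holding the reply text:
reply = [response["choices"][0]["message"]["content"]]
print(reply)  # → ['2 is a prime number.']
```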

get_schema(cls) -> dict

Defined on ConfigObject

Generates the component's JSON Schema.

Returns

dict
Dictionary representing the JSON Schema of the component.

validate_and_transform(self, raw_data: dict) -> dict

Defined on ConfigObject

Takes the user-provided initialization data and returns it augmented with all the objects the model needs to operate.

Parameters

raw_data : dict
A dictionary with the data provided by the user to initialize the model.

Returns

dict
A validated dictionary with the necessary objects.
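The contract can be illustrated with a minimal stand-in that rejects unknown keys and fills in the documented defaults. This is not ConfigObject's implementation, which validates against the JSON Schema from get_schema(); the function below is a simplified sketch under that assumption.

```python
# Defaults taken from the Parameters section above.
DEFAULTS = {
    "model_name": "Qwen/Qwen2.5-1.5B-Instruct-GGUF",
    "max_tokens": 100,
    "temperature": 0.7,
    "frequency_penalty": 0.1,
    "context_window": 512,
    "device": "CPU",
}

def validate_and_transform(raw_data: dict) -> dict:
    """Illustrative stand-in: reject unknown keys, fill in defaults.
    The real ConfigObject validates raw_data against the component's schema."""
    unknown = set(raw_data) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **raw_data}

print(validate_and_transform({"temperature": 0.2}))
```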

Compatible with