QwenModel
Qwen 2.5 Instruct model for efficient text generation via llama.cpp.
Qwen 2.5 is a series of dense transformer language models from Alibaba Cloud, spanning 0.5B to 72B parameters. The DashAI integration exposes the 0.5B and 1.5B Instruct variants, which run comfortably on CPU. Both are trained on 18 trillion tokens with improved coding, mathematics, and multilingual capability over Qwen 2.
Models are loaded as GGUF Q8_0 quantized checkpoints via llama-cpp-python;
the quantization file is selected automatically from the HuggingFace repo.
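Loading as described above can be sketched with llama-cpp-python's `Llama.from_pretrained`, which downloads a matching GGUF file from the Hugging Face repo. This is a minimal sketch, not DashAI's actual loader; the `"*q8_0.gguf"` filename glob is an assumption about the repo's naming convention.

```python
def load_qwen(repo_id: str = "Qwen/Qwen2.5-1.5B-Instruct-GGUF",
              n_ctx: int = 512, n_gpu_layers: int = 0):
    """Download and load a Q8_0 GGUF checkpoint from the given repo.

    Hypothetical helper; the '*q8_0.gguf' glob assumes the repo names
    its quantization files that way.
    """
    from llama_cpp import Llama  # requires llama-cpp-python

    return Llama.from_pretrained(
        repo_id=repo_id,
        filename="*q8_0.gguf",      # select the Q8_0 quantization file
        n_ctx=n_ctx,                # context window (prompt + response)
        n_gpu_layers=n_gpu_layers,  # 0 = CPU only, -1 = offload all layers
    )
```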
References
- [1] Qwen Team (2024). "Qwen2.5 Technical Report." https://arxiv.org/abs/2412.15115
- [2] https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
Parameters
- model_name : string, default="Qwen/Qwen2.5-1.5B-Instruct-GGUF"
  The Qwen 2.5 Instruct checkpoint to load in GGUF format. '0.5B' (500M parameters) is faster and uses less memory, suitable for lightweight tasks on CPU. '1.5B' (1.5B parameters) is more capable and produces higher-quality responses at the cost of more memory and slightly slower inference.
- max_tokens : integer, default=100
  Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. Set to 100-200 for short answers, 500-1000 for detailed explanations or code. Must not exceed the context window minus the prompt length.
- temperature : number, default=0.7
  Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 the model always picks the most likely token (greedy, fully deterministic). Around 0.7 is a good balance for conversational tasks. At 1.0 outputs are maximally varied and unpredictable.
- frequency_penalty : number, default=0.1
  Penalizes tokens that have already appeared in the output in proportion to how often they occur (range 0.0-2.0). At 0.0 there is no penalty and the model may repeat itself. Values around 0.1-0.3 gently discourage repetition. High values (1.5+) strongly suppress reuse of any word, which may produce less coherent text.
- context_window : integer, default=512
  Total token budget for a single forward pass, including both the input prompt and the generated response. Larger values allow longer conversations but consume more RAM/VRAM. Qwen 2.5 supports up to 32768 tokens natively; keep this at or below that limit.
- device : string, default="CPU"
  Hardware device for llama.cpp inference. 'CPU' runs the model fully in RAM with no GPU requirement. Selecting a GPU option offloads all layers (n_gpu_layers=-1) so every transformer layer is GPU-accelerated.
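The parameters above split naturally into model-construction settings and per-call sampling settings. A minimal sketch of that mapping onto llama.cpp argument names, assuming hypothetical helper functions (the actual wiring inside DashAI may differ):

```python
def llama_kwargs(config: dict) -> dict:
    """Map construction-time config onto llama.cpp constructor arguments."""
    return {
        "n_ctx": config.get("context_window", 512),
        # 'CPU' keeps all layers in RAM; any GPU choice offloads every layer.
        "n_gpu_layers": 0 if config.get("device", "CPU") == "CPU" else -1,
    }

def sampling_kwargs(config: dict) -> dict:
    """Sampling settings travel with each generation call instead."""
    return {
        "max_tokens": config.get("max_tokens", 100),
        "temperature": config.get("temperature", 0.7),
        "frequency_penalty": config.get("frequency_penalty", 0.1),
    }
```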
Methods
generate(self, prompt: list[dict[str, str]]) -> list[str]
Generate a reply for the given chat prompt.
Parameters
- prompt : list of dict
- Conversation history in OpenAI chat format. Each dict must contain at least
  "role" ("system", "user", or "assistant") and "content" (the message text).
Returns
- list of str
- A single-element list containing the model's reply text, extracted from
choices[0]["message"]["content"].
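The documented return shape can be illustrated end to end: the prompt goes in as OpenAI-style chat messages, and the reply comes back as a single-element list pulled from `choices[0]["message"]["content"]`. The response dict below is an illustrative stub shaped like a llama.cpp chat-completion result, not real model output:

```python
def extract_reply(response: dict) -> list:
    """Return the model's reply as a single-element list,
    mirroring generate()'s documented return shape."""
    return [response["choices"][0]["message"]["content"]]

# Prompt in OpenAI chat format, as generate() expects:
prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is GGUF?"},
]

# Stubbed response shaped like a chat-completion result:
stub = {"choices": [{"message": {"role": "assistant",
                                 "content": "GGUF is a file format."}}]}
print(extract_reply(stub))  # -> ['GGUF is a file format.']
```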
get_schema(cls) -> dict
Generates the component's JSON Schema.
Returns
- dict
- Dictionary representing the JSON Schema of the component.
validate_and_transform(self, raw_data: dict) -> dict
Takes the data given by the user to initialize the model and returns it with all the objects the model needs to work.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.