QwenModel
Qwen 2.5 Instruct model for efficient text generation via llama.cpp.
Qwen 2.5 is a series of dense transformer language models from Alibaba Cloud, spanning 0.5B to 72B parameters. The DashAI integration exposes the 0.5B and 1.5B Instruct variants, which run comfortably on CPU. Both are trained on 18 trillion tokens with improved coding, mathematics, and multilingual capability over Qwen 2.
Models are loaded as GGUF Q8_0 quantized checkpoints via llama-cpp-python;
the quantization file is selected automatically from the HuggingFace repo.
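Loading as described above can be sketched with llama-cpp-python's `Llama.from_pretrained`, which downloads a matching GGUF file from the Hugging Face repo. This is a minimal sketch, not DashAI's actual loader; the `"*q8_0.gguf"` filename glob is an assumption about the repo's naming convention.

```python
def load_qwen(repo_id: str = "Qwen/Qwen2.5-1.5B-Instruct-GGUF",
              n_ctx: int = 512, n_gpu_layers: int = 0):
    """Download and load a Q8_0 GGUF checkpoint from the given repo.

    Hypothetical helper; the '*q8_0.gguf' glob assumes the repo names
    its quantization files that way.
    """
    from llama_cpp import Llama  # requires llama-cpp-python

    return Llama.from_pretrained(
        repo_id=repo_id,
        filename="*q8_0.gguf",      # select the Q8_0 quantization file
        n_ctx=n_ctx,                # context window (prompt + response)
        n_gpu_layers=n_gpu_layers,  # 0 = CPU only, -1 = offload all layers
    )
```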
References
- [1] Qwen Team (2024). "Qwen2.5 Technical Report." https://arxiv.org/abs/2412.15115
- [2] https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
Parameters
- model_name : string, default="Qwen/Qwen2.5-1.5B-Instruct-GGUF"
  The Qwen 2.5 Instruct checkpoint to load in GGUF format. '0.5B' (500M parameters) is faster and uses less memory, suitable for lightweight tasks on CPU. '1.5B' (1.5B parameters) is more capable and produces higher-quality responses at the cost of more memory and slightly slower inference.
- max_tokens : integer, default=100
  Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. Set to 100-200 for short answers, 500-1000 for detailed explanations or code. Must not exceed the context window minus the prompt length.
- temperature : number, default=0.7
  Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 the model always picks the most likely token (greedy, fully deterministic). Around 0.7 is a good balance for conversational tasks. At 1.0 outputs are maximally varied and unpredictable.
- frequency_penalty : number, default=0.1
  Penalizes tokens that have already appeared in the output in proportion to how often they occur (range 0.0-2.0). At 0.0 there is no penalty and the model may repeat itself. Values around 0.1-0.3 gently discourage repetition. High values (1.5+) strongly suppress reuse of any word, which may produce less coherent text.
- context_window : integer, default=512
  Total token budget for a single forward pass, including both the input prompt and the generated response. Larger values allow longer conversations but consume more RAM/VRAM. Qwen 2.5 supports up to 32768 tokens natively; keep this at or below that limit.
- device : string, default="CPU"
  Hardware device for llama.cpp inference. 'CPU' runs the model fully in RAM with no GPU requirement. Selecting a GPU option offloads all layers (n_gpu_layers=-1) so every transformer layer is GPU-accelerated.
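The parameters above split naturally into model-construction settings and per-call sampling settings. A minimal sketch of that mapping onto llama.cpp argument names, assuming hypothetical helper functions (the actual wiring inside DashAI may differ):

```python
def llama_kwargs(config: dict) -> dict:
    """Map construction-time config onto llama.cpp constructor arguments."""
    return {
        "n_ctx": config.get("context_window", 512),
        # 'CPU' keeps all layers in RAM; any GPU choice offloads every layer.
        "n_gpu_layers": 0 if config.get("device", "CPU") == "CPU" else -1,
    }

def sampling_kwargs(config: dict) -> dict:
    """Sampling settings travel with each generation call instead."""
    return {
        "max_tokens": config.get("max_tokens", 100),
        "temperature": config.get("temperature", 0.7),
        "frequency_penalty": config.get("frequency_penalty", 0.1),
    }
```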
Methods
generate(self, prompt: list[dict[str, str]]) -> list[str]
Generate a reply for the given chat prompt.
Parameters
- prompt : list of dict
- Conversation history in OpenAI chat format. Each dict must contain at least
  "role" ("system", "user", or "assistant") and "content" (the message text).
Returns
- list of str
- A single-element list containing the model's reply text, extracted from
choices[0]["message"]["content"].
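The documented return shape can be illustrated end to end: the prompt goes in as OpenAI-style chat messages, and the reply comes back as a single-element list pulled from `choices[0]["message"]["content"]`. The response dict below is an illustrative stub shaped like a llama.cpp chat-completion result, not real model output:

```python
def extract_reply(response: dict) -> list:
    """Return the model's reply as a single-element list,
    mirroring generate()'s documented return shape."""
    return [response["choices"][0]["message"]["content"]]

# Prompt in OpenAI chat format, as generate() expects:
prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is GGUF?"},
]

# Stubbed response shaped like a chat-completion result:
stub = {"choices": [{"message": {"role": "assistant",
                                 "content": "GGUF is a file format."}}]}
print(extract_reply(stub))  # -> ['GGUF is a file format.']
```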
get_schema(cls) -> dict
Generates the component's JSON Schema.
Returns
- dict
- Dictionary representing the JSON Schema of the component.
validate_and_transform(self, raw_data: dict) -> dict
Takes the data given by the user to initialize the model and returns it with all the objects the model needs to work.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.