SmolLMModel
SmolLM2 Instruct model for on-device text generation via llama.cpp.
SmolLM2 is a family of compact, instruction-tuned language models developed by Hugging Face TB, designed for efficient on-device and edge deployment. Despite its very small parameter counts, SmolLM2 achieves competitive benchmark results by training on high-quality synthetic datasets including Cosmopedia v2, FineWeb-Edu, and Stack-Edu.
The DashAI integration exposes the 360M and 1.7B Instruct variants. The 360M model requires under 300 MB of RAM and runs comfortably on modest CPU hardware; the 1.7B model delivers higher-quality responses while remaining deployable without a GPU.
Models are loaded as GGUF quantized checkpoints via llama-cpp-python. The
quantization level is variant-dependent: Q8_0 for 360M (higher fidelity at small
size) and Q4_K_M for 1.7B (balanced quality/size trade-off). The filename is
resolved automatically from SMOLLM_FILENAME_MAP.
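As a sketch, the variant-to-filename resolution described above might look like the following; the map contents and the `resolve_filename` helper are illustrative assumptions, not the actual DashAI source:

```python
# Illustrative sketch of SMOLLM_FILENAME_MAP (assumed contents; the real
# DashAI map may use different filenames).
SMOLLM_FILENAME_MAP = {
    # 360M variant ships at Q8_0: higher fidelity at an already small size.
    "HuggingFaceTB/SmolLM2-360M-Instruct-GGUF": "smollm2-360m-instruct-q8_0.gguf",
    # 1.7B variant ships at Q4_K_M: balanced quality/size trade-off.
    "HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF": "smollm2-1.7b-instruct-q4_k_m.gguf",
}

def resolve_filename(model_name: str) -> str:
    """Resolve the GGUF checkpoint filename for a supported variant."""
    try:
        return SMOLLM_FILENAME_MAP[model_name]
    except KeyError:
        raise ValueError(f"Unsupported SmolLM2 variant: {model_name!r}")

# The resolved filename would then be handed to llama-cpp-python, e.g.:
# llm = llama_cpp.Llama.from_pretrained(repo_id=model_name,
#                                       filename=resolve_filename(model_name))
```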
References
- [1] Allal, L.B. et al. (2024). "SmolLM2 — with great data, comes great performance." Hugging Face Blog. https://huggingface.co/blog/smollm2
- [2] https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF
- [3] https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct-GGUF
Parameters
- model_name : string, default=HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF
  The SmolLM2 Instruct checkpoint to load in GGUF format. 'SmolLM2-1.7B' is a 1.7B-parameter instruction-tuned model with strong performance for on-device and edge inference. 'SmolLM2-360M' is an ultra-compact 360M-parameter model for extremely fast CPU inference with minimal memory usage (~300 MB). Both models are trained on diverse synthetic datasets by Hugging Face.
- max_tokens : integer, default=100
  Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. SmolLM2 models are optimized for short to medium-length responses.
- temperature : number, default=0.7
  Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 outputs are deterministic; around 0.7 balances quality and creativity.
- frequency_penalty : number, default=0.1
  Penalizes tokens proportionally to how often they have already appeared in the output (range 0.0-2.0). Higher values discourage repetition.
- context_window : integer, default=512
  Total token budget for a single forward pass, including both the input prompt and the generated response. SmolLM2 models natively support contexts up to 8K tokens.
- device : string, default=CPU
  Hardware device for llama.cpp inference. 'CPU' runs the model fully in RAM with no GPU requirement. SmolLM2 models are small enough to run efficiently on CPU even on modest hardware.
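Collected into one place, the defaults above form the following configuration; this dict is only a readable summary, not DashAI's actual configuration object:

```python
# Summary of the documented defaults (illustrative; DashAI wraps these in
# its own schema-validated configuration object).
default_config = {
    "model_name": "HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF",
    "max_tokens": 100,         # new tokens generated per response
    "temperature": 0.7,        # 0.0 = deterministic, up to 1.0
    "frequency_penalty": 0.1,  # 0.0-2.0; higher discourages repetition
    "context_window": 512,     # prompt + response token budget
    "device": "CPU",           # pure-CPU llama.cpp inference
}

# The response can use at most context_window minus the prompt's token count,
# so max_tokens should stay well below the context window.
assert default_config["max_tokens"] < default_config["context_window"]
```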
Methods
generate(self, prompt: list[dict[str, str]]) -> list[str]
Generate a reply for the given chat prompt.
Parameters
- prompt : list of dict
- Conversation history in OpenAI chat format. Each dict must contain at least
"role" ("system", "user", or "assistant") and "content" (the message text).
Returns
- list of str
- A single-element list containing the model's reply text, extracted from
choices[0]["message"]["content"].
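A minimal usage sketch, assuming a SmolLMModel constructed with the parameters documented above (the constructor and generate calls are commented out because they download and load a checkpoint):

```python
# OpenAI-style chat prompt: each message is a dict with "role" and "content".
prompt = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain GGUF quantization in one sentence."},
]

# model = SmolLMModel(model_name="HuggingFaceTB/SmolLM2-360M-Instruct-GGUF")
# replies = model.generate(prompt)
# replies is a single-element list: replies[0] holds the assistant's text.
```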
get_schema(cls) -> dict
Generates the JSON Schema describing the component's configuration.
Returns
- dict
- Dictionary representing the JSON Schema of the component.
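The returned schema is an ordinary JSON Schema dict. The fragment below illustrates a plausible shape for two of the parameters documented above; the property names follow the docs, but the exact keys and constraints are assumptions, not the real DashAI output:

```python
# Hypothetical fragment of the dict returned by get_schema().
schema = {
    "type": "object",
    "properties": {
        "max_tokens": {"type": "integer", "default": 100},
        "temperature": {
            "type": "number",
            "default": 0.7,
            "minimum": 0.0,   # deterministic sampling
            "maximum": 1.0,   # most random sampling allowed
        },
    },
}
```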
validate_and_transform(self, raw_data: dict) -> dict
Takes the data provided by the user to initialize the model and returns it augmented with all the objects the model needs to work.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.