SmolLMModel

GenerativeModel
DashAI.back.models.hugging_face.SmolLMModel

SmolLM2 Instruct model for on-device text generation via llama.cpp.

SmolLM2 is a family of compact, instruction-tuned language models developed by Hugging Face TB, designed for efficient on-device and edge deployment. Unlike larger language models, SmolLM2 achieves competitive benchmark results at very small parameter counts by training on high-quality synthetic datasets including cosmopedia-v2, FineWeb-Edu, and StackEdu.

The DashAI integration exposes the 360M and 1.7B Instruct variants. The 360M model requires under 300 MB of RAM and runs comfortably on modest CPU hardware; the 1.7B model delivers higher-quality responses while remaining deployable without a GPU.

Models are loaded as GGUF quantized checkpoints via llama-cpp-python. The quantization level is variant-dependent: Q8_0 for 360M (higher fidelity at small size) and Q4_K_M for 1.7B (balanced quality/size trade-off). The filename is resolved automatically from SMOLLM_FILENAME_MAP.
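As a rough sketch, the variant-to-filename resolution described above could work like the following (the map contents and the helper name `resolve_filename` are illustrative assumptions, not DashAI's actual code):

```python
# Hypothetical sketch of a filename map like SMOLLM_FILENAME_MAP.
# The exact keys and GGUF filenames are assumptions for illustration;
# note the variant-dependent quantization levels (Q8_0 vs Q4_K_M).
SMOLLM_FILENAME_MAP = {
    "HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF": "smollm2-1.7b-instruct-q4_k_m.gguf",
    "HuggingFaceTB/SmolLM2-360M-Instruct-GGUF": "smollm2-360m-instruct-q8_0.gguf",
}


def resolve_filename(model_name: str) -> str:
    """Look up the GGUF checkpoint filename for a supported variant."""
    try:
        return SMOLLM_FILENAME_MAP[model_name]
    except KeyError:
        raise ValueError(f"Unsupported SmolLM2 variant: {model_name!r}")
```

With a map like this, llama-cpp-python's `Llama.from_pretrained(repo_id=model_name, filename=resolve_filename(model_name))` can download and load the right quantized checkpoint.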

Parameters

model_name : string, default=HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF
The SmolLM2 Instruct checkpoint to load in GGUF format. 'SmolLM2-1.7B' is a 1.7B-parameter instruction model with strong performance for on-device and edge inference. 'SmolLM2-360M' is an ultra-compact 360M-parameter model for extremely fast CPU inference with minimal memory usage (~300 MB). Both models are trained on diverse synthetic datasets by Hugging Face.
max_tokens : integer, default=100
Maximum number of new tokens the model will generate per response. Roughly 1 token ≈ 0.75 English words. SmolLM2 models are optimized for short to medium-length responses.
temperature : number, default=0.7
Sampling temperature controlling output randomness (range 0.0-1.0). At 0.0 outputs are deterministic. Around 0.7 balances quality and creativity.
frequency_penalty : number, default=0.1
Penalizes tokens that have already appeared in the output based on frequency (range 0.0-2.0). Higher values discourage repetition.
context_window : integer, default=512
Total token budget for a single forward pass, including both the input prompt and the generated response. SmolLM2 models support up to 8K tokens natively.
device : string, default=CPU
Hardware device for llama.cpp inference. 'CPU' runs the model fully in RAM with no GPU requirement. SmolLM2 models are small enough to run efficiently on CPU even on modest hardware.
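The interaction between max_tokens and context_window is worth spelling out: since the context window is a total budget for one forward pass, the prompt and the generation budget must fit inside it together. A minimal sketch, using the defaults listed above (the helper name is an assumption, not part of DashAI's API):

```python
# Defaults mirroring the Parameters section above.
DEFAULTS = {
    "max_tokens": 100,
    "temperature": 0.7,
    "frequency_penalty": 0.1,
    "context_window": 512,
    "device": "CPU",
}


def fits_context(prompt_tokens: int,
                 max_tokens: int = DEFAULTS["max_tokens"],
                 context_window: int = DEFAULTS["context_window"]) -> bool:
    """True if the prompt plus the generation budget fits in a single pass."""
    return prompt_tokens + max_tokens <= context_window
```

For example, a 400-token prompt leaves room for the default 100 new tokens, but a 450-token prompt does not; raising context_window (up to the 8K the models support natively) relaxes the constraint at the cost of more memory.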

Methods

generate(self, prompt: list[dict[str, str]]) -> List[str]

Defined on SmolLMModel

Generate a reply for the given chat prompt.

Parameters

prompt : list of dict
Conversation history in OpenAI chat format. Each dict must contain at least "role" ("system", "user", or "assistant") and "content" (the message text).

Returns

list of str
A single-element list containing the model's reply text, extracted from choices[0]["message"]["content"].
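A quick illustration of the expected prompt shape, with a validation helper that enforces the format described above (the helper itself is a sketch, not a DashAI method):

```python
# Conversation history in OpenAI chat format, as generate() expects it.
prompt = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is GGUF?"},
]


def validate_prompt(prompt: list[dict[str, str]]) -> bool:
    """Check each message has a valid "role" and a "content" key."""
    for msg in prompt:
        if msg.get("role") not in {"system", "user", "assistant"}:
            raise ValueError(f"invalid role: {msg.get('role')!r}")
        if "content" not in msg:
            raise ValueError("message missing 'content'")
    return True
```

A call such as `model.generate(prompt)` would then return a single-element list like `["GGUF is a binary file format..."]` (reply text illustrative).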

get_schema(cls) -> dict

Defined on ConfigObject

Generates the JSON Schema that describes the component's configuration.

Returns

dict
Dictionary representing the JSON Schema of the component.
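For orientation, a schema for this component would plausibly look like the excerpt below. This is a hypothetical fragment built from the Parameters section above; the actual schema returned by get_schema() is defined by DashAI and may differ in field names and structure:

```python
# Hypothetical JSON Schema excerpt for SmolLMModel's parameters.
# Defaults and ranges are taken from the Parameters section above.
example_schema = {
    "type": "object",
    "properties": {
        "max_tokens": {"type": "integer", "default": 100},
        "temperature": {
            "type": "number",
            "default": 0.7,
            "minimum": 0.0,
            "maximum": 1.0,
        },
        "frequency_penalty": {
            "type": "number",
            "default": 0.1,
            "minimum": 0.0,
            "maximum": 2.0,
        },
        "context_window": {"type": "integer", "default": 512},
        "device": {"type": "string", "default": "CPU"},
    },
}
```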

validate_and_transform(self, raw_data: dict) -> dict

Defined on ConfigObject

Takes the data provided by the user to initialize the model and returns it with all the objects the model needs to work.

Parameters

raw_data : dict
A dictionary with the data provided by the user to initialize the model.

Returns

dict
A validated dictionary with the necessary objects.

Compatible with