StableDiffusionV3Model
Multimodal Diffusion Transformer model for high-quality text-to-image generation.
Wraps the Stable Diffusion 3 and 3.5 family of checkpoints from Stability AI. These models use a Multimodal Diffusion Transformer (MMDiT) architecture that jointly processes text and image tokens, delivering significantly improved prompt adherence, typography, and overall image quality compared to U-Net-based predecessors.
Four variants are supported: SD3 Medium (2B), SD3.5 Medium (2B, improved), SD3.5 Large (8B, best quality), and SD3.5 Large Turbo (distilled, 4-8 steps). All produce images natively at 1024x1024 px. Access to these gated models requires a Hugging Face access token.
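The variant trade-offs above can be summarized in a small lookup table. This is an illustrative sketch only: the Hugging Face repository ids for the 3.5 variants are assumptions based on Stability AI's naming (only the SD3 Medium id appears in this document), and the recommended settings follow the parameter notes below.

```python
# Illustrative variant summary. Repo ids for the 3.5 checkpoints are
# ASSUMPTIONS based on Stability AI's Hugging Face naming conventions.
VARIANTS = {
    "sd-3-medium": {
        "repo": "stabilityai/stable-diffusion-3-medium-diffusers",
        "params": "2B", "steps": (20, 40), "guidance": 3.5,
    },
    "sd-3.5-medium": {
        "repo": "stabilityai/stable-diffusion-3.5-medium",  # assumed id
        "params": "2B", "steps": (20, 40), "guidance": 3.5,
    },
    "sd-3.5-large": {
        "repo": "stabilityai/stable-diffusion-3.5-large",  # assumed id
        "params": "8B", "steps": (20, 40), "guidance": 3.5,
    },
    "sd-3.5-large-turbo": {
        "repo": "stabilityai/stable-diffusion-3.5-large-turbo",  # assumed id
        "params": "8B (distilled)", "steps": (4, 8), "guidance": 1.0,  # no CFG
    },
}

def recommended_settings(variant: str) -> dict:
    """Return a recommended step count and guidance scale for a variant."""
    v = VARIANTS[variant]
    return {"num_inference_steps": v["steps"][1], "guidance_scale": v["guidance"]}
```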
References
- [1] Esser et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis", 2024. https://arxiv.org/abs/2403.03206
- [2] https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers
Parameters
- model_name : string, default=stabilityai/stable-diffusion-3-medium-diffusers
  - The SD3/SD3.5 checkpoint to load. 'sd-3-medium' is the baseline 2B-parameter model. 'sd-3.5-medium' improves quality at similar speed. 'sd-3.5-large' (8B) delivers the highest quality but needs more VRAM. 'sd-3.5-large-turbo' is a distilled large model that requires far fewer steps (4-8) for fast, high-quality generation. All variants target 1024x1024 px natively.
- huggingface_key : string, default=''
  - Hugging Face read-access token required to download these gated models. To obtain one: accept the model license on huggingface.co/stabilityai, then go to Settings → Access Tokens and generate a token with 'Read' scope.
- negative_prompt : string, default=''
  - Text describing content to avoid in the generated image (e.g. 'blurry, watermark, low quality'). Leave empty to apply no negative guidance.
- num_inference_steps : integer, default=15
  - Number of denoising steps to run. More steps refine the image but increase generation time. Typical range: 20-40 for standard models; use only 4-8 steps with 'large-turbo'. Values above 50 rarely improve output for SD3/SD3.5.
- guidance_scale : number, default=3.5
  - Classifier-Free Guidance (CFG) scale. Controls how strictly the image follows the text prompt. SD3.5 works well at 3.5-4.5. The 'large-turbo' variant is designed for guidance_scale=1 (no CFG). Higher values enforce the prompt but may introduce oversaturation or artifacts.
- device : string, default=CPU
  - Hardware device for inference. Select a GPU option for hardware acceleration, which is strongly recommended for diffusion models. Select 'CPU' on systems without a compatible GPU, but expect significantly longer generation times.
- seed : integer, default=-1
  - Random seed for reproducible generation. A fixed non-negative integer will always produce the same image for identical settings. Use a negative value (e.g. -1) for a random seed on each run.
- width : integer, default=512
  - Width of the output image in pixels. Must be a multiple of 8. SD3/SD3.5 models are natively trained at 1024x1024 px; using that resolution yields the best quality.
- height : integer, default=512
  - Height of the output image in pixels. Must be a multiple of 8. SD3/SD3.5 models are natively trained at 1024x1024 px; using that resolution yields the best quality.
- num_images_per_prompt : integer, default=1
  - How many images to generate from a single prompt in one batch. Increasing this value is more efficient than running multiple sessions, but requires proportionally more GPU memory.
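A minimal sketch of how the seed and resolution constraints above might be enforced. These are hypothetical helpers for illustration, not part of the component's API.

```python
import random

def resolve_seed(seed: int) -> int:
    """Negative seed -> fresh random seed per run; non-negative -> reproducible."""
    if seed < 0:
        return random.randrange(2**32)
    return seed

def check_resolution(width: int, height: int) -> None:
    """Width and height must each be a multiple of 8 (native size is 1024x1024)."""
    for name, value in (("width", width), ("height", height)):
        if value % 8 != 0:
            raise ValueError(f"{name} must be a multiple of 8, got {value}")
```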
Methods
generate(self, input: str) -> List[Any]
Generate images from the model given a text prompt.
Parameters
- input : str
- Text prompt describing the image(s) to generate.
Returns
- List[Any]
- Generated images, returned as a list.
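For orientation, roughly what a generation call corresponds to when expressed directly with the Hugging Face diffusers library. This is a sketch under stated assumptions: `run_sd3` is a hypothetical helper, and actually running it requires the gated weights, a valid access token, and a CUDA GPU.

```python
def run_sd3(prompt, hf_token, negative_prompt="", steps=28, guidance=3.5,
            width=1024, height=1024, seed=42, n_images=1):
    """Generate images with diffusers' StableDiffusion3Pipeline.

    Returns a list of PIL images. Imports are local so the sketch can be
    defined without diffusers installed.
    """
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype=torch.float16,
        token=hf_token,  # gated model: accepted license + 'Read' token required
    ).to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)  # reproducible output
    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=steps,
        guidance_scale=guidance,
        width=width,
        height=height,
        num_images_per_prompt=n_images,
        generator=generator,
    )
    return result.images
```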
get_schema(cls) -> dict
Generates the component's JSON Schema. Inherited from ConfigObject.
Returns
- dict
- Dictionary representing the JSON Schema of the component.
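An illustrative fragment of what the returned schema might contain for two of the parameters above. The exact structure is implementation-defined; the field layout here is an assumption, not the component's actual output.

```python
# Hypothetical JSON Schema fragment; field names and layout are assumptions,
# with defaults taken from the parameter list above.
schema_fragment = {
    "type": "object",
    "properties": {
        "num_inference_steps": {"type": "integer", "default": 15},
        "guidance_scale": {"type": "number", "default": 3.5},
    },
}
```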
validate_and_transform(self, raw_data: dict) -> dict
Validates the data provided by the user to initialize the model and returns it augmented with all the objects the model needs to work. Inherited from ConfigObject.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.