PixArtSigmaModel
Diffusion Transformer model for high-efficiency text-to-image generation.
Wraps the PixArt-Sigma pipeline, which replaces the U-Net backbone used in Stable Diffusion with a scalable Diffusion Transformer (DiT) architecture. Text conditioning is provided by a T5-XXL encoder, enabling richer semantic understanding than CLIP-based models.
PixArt-Sigma achieves state-of-the-art image quality with 14-25 denoising steps (compared to 20-50 for comparable U-Net models) and supports flexible multi-scale resolutions up to 2048 px. Two checkpoint sizes are available: 512 px (lighter) and 1024 px (best quality).
References
- [1] Chen et al., "PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation", 2024. https://arxiv.org/abs/2403.04692
- [2] https://huggingface.co/PixArt-alpha/PixArt-Sigma-XL-2-1024-MS
Parameters
- model_name : string, default=
PixArt-alpha/PixArt-Sigma-XL-2-1024-MS - The PixArt-Sigma checkpoint to load. 'PixArt-Sigma-XL-2-1024-MS' is the high-resolution variant trained at 1024px with multi-scale support, delivering the best image quality. 'PixArt-Sigma-XL-2-512-MS' is the 512px variant, faster and lighter while still producing sharp results.
- negative_prompt
- num_inference_steps : integer, default=
20 - Number of denoising steps. PixArt-Sigma achieves good quality with 14-25 steps due to its efficient transformer architecture. More steps refine details but increase generation time.
- guidance_scale : number, default=
4.5 - Classifier-Free Guidance (CFG) scale. PixArt-Sigma works best with lower values (3.5-5.5) compared to U-Net models. Higher values enforce the prompt more strictly but may saturate colors. The default of 4.5 is recommended.
- device : string, default=
CPU - Hardware device for inference. GPU is strongly recommended. PixArt-Sigma uses a DiT (Diffusion Transformer) architecture with T5 text encoding, which is faster than U-Net on GPU.
- seed : integer, default=
-1 - Random seed for reproducible generation. A fixed positive integer always produces the same image. Use -1 for a random seed.
- width : integer, default=
1024 - Width of the output image in pixels. Must be a multiple of 8. PixArt-Sigma supports flexible resolutions up to 2048px.
- height : integer, default=
1024 - Height of the output image in pixels. Must be a multiple of 8. PixArt-Sigma supports flexible resolutions up to 2048px.
- num_images_per_prompt : integer, default=
1 - How many images to generate from a single prompt in one batch. Requires proportionally more GPU memory per additional image.
Methods
generate(self, input: str) -> List[Any]
PixArtSigmaModelGenerate images from a text prompt.
Parameters
- input : str
- Text prompt to generate an image from.
Returns
- List[Any]
- Generated output images in a list.
get_schema(cls) -> dict
ConfigObjectGenerates the component related Json Schema.
Returns
- dict
- Dictionary representing the Json Schema of the component.
validate_and_transform(self, raw_data: dict) -> dict
ConfigObjectIt takes the data given by the user to initialize the model and returns it with all the objects that the model needs to work.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.