Finally, a high-quality text-to-image AI on-device

Stability AI has released Stable Diffusion 3 Medium as open source, a text-to-image model that can run with modest hardware requirements.

Text-to-image generation arrives locally. Once reserved for servers equipped with GPUs costing tens of thousands of dollars, text-to-image AI models are opening up to consumer chips. The latest example is Stable Diffusion 3 Medium. Released as open source on June 12, the Stability AI model can easily be run locally on a mid-range computer. What's more, Stable Diffusion 3 Medium (SD3M) is, according to Stability AI, its most advanced image generation model to date… even though it has only 2 billion parameters.

For this new version, Stability AI focused its efforts on photorealism, faithful adherence to prompt instructions, high-quality typography generation, and easy adaptation through fine-tuning.

A model trained on more than a billion images

With SD3M, Stability introduces a new approach and moves away from traditional diffusion models. The model is based on a recent technique called "rectified flow". Classic diffusion models work by gradually adding noise to an image and then learning to reverse this process to recreate the original image. Rectified flows take a more direct approach: rather than relying on many small noising and denoising steps, they learn smoother, near-linear trajectories between noise and image. This not only simplifies the process but also produces images faster, and often with better quality.
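To make the idea concrete, here is a minimal rectified-flow training step in PyTorch. This is our own toy sketch, not Stability AI's code: the tiny network and dimensions are placeholders, but the loss shows the principle of learning a constant velocity along the straight line between noise and data.

```python
import torch
import torch.nn as nn

# Toy velocity network; SD3M uses a large MMDiT transformer instead.
model = nn.Sequential(nn.Linear(3 + 1, 256), nn.ReLU(), nn.Linear(256, 3))

def rectified_flow_loss(x1: torch.Tensor) -> torch.Tensor:
    """One training step on a batch of data points x1 (here 3-D toys)."""
    x0 = torch.randn_like(x1)                 # pure noise sample
    t = torch.rand(x1.shape[0], 1)            # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # straight-line interpolation
    target = x1 - x0                          # constant velocity along the line
    pred = model(torch.cat([xt, t], dim=-1))  # predict the velocity at (xt, t)
    return ((pred - target) ** 2).mean()

loss = rectified_flow_loss(torch.randn(64, 3))
loss.backward()
```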

Stable Diffusion 3 Medium also relies on an innovative architecture: the Multimodal Diffusion Transformer, or MMDiT. It handles textual and visual representations independently (distinct sets of weights for image and text) while promoting a fluid dialogue between the two modalities. A major asset for following the prompt's instructions in detail.
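Schematically, each MMDiT block keeps separate weights for image tokens and text tokens but joins them in a single shared attention step. The simplified block below is our own illustration (dimensions and names are invented), not the actual SD3 code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMMDiTBlock(nn.Module):
    """Two modality-specific streams joined by one shared attention step."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.heads = heads
        # Distinct sets of weights for image and text tokens, as in MMDiT.
        self.img_qkv, self.txt_qkv = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.img_out, self.txt_out = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        n_img = img.shape[1]
        # Project each modality with its own weights, then concatenate the
        # sequences so image and text tokens attend to one another jointly.
        qkv = torch.cat([self.img_qkv(img), self.txt_qkv(txt)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        split = lambda x: x.unflatten(-1, (self.heads, -1)).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).flatten(-2)
        # Route the result back through modality-specific output weights.
        return img + self.img_out(out[:, :n_img]), txt + self.txt_out(out[:, n_img:])

block = ToyMMDiTBlock()
img_tokens, txt_tokens = torch.randn(1, 64, 512), torch.randn(1, 77, 512)
img_tokens, txt_tokens = block(img_tokens, txt_tokens)
```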

To achieve the best possible results across a wide variety of scenes, the model was trained on a dataset of one billion images. This vast database was filtered beforehand: all explicitly sexual or violent content was removed. The images were then deduplicated to prevent the model from reproducing images too close to the originals. Finally, the model was fine-tuned on 30 million high-quality images emphasizing specific content and visual styles (most likely photographic styles, architecture, concepts, etc.). The goal was to refine the model's capabilities on more creative tasks and better align its outputs with user preferences.
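Stability has not detailed its deduplication pipeline; a common way to implement this kind of filtering is perceptual hashing. A minimal sketch with the imagehash library (our choice of tool, not necessarily Stability's):

```python
from pathlib import Path
from PIL import Image
import imagehash  # pip install ImageHash

def dedup(folder: str, max_distance: int = 4) -> list[Path]:
    """Keep one image per group of near-duplicates (perceptual hashing)."""
    kept, hashes = [], []
    for path in sorted(Path(folder).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        # Hamming distance between hashes approximates visual similarity.
        if all(h - prev > max_distance for prev in hashes):
            kept.append(path)
            hashes.append(h)
    return kept
```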

A locally executable model

Thanks to its MMDiT architecture and the separation of weights per modality, the model avoids redundancy and wasted memory. It needs only 5 GB of VRAM to run, Christian Laforte, co-CEO of Stability AI, told VentureBeat. The company nevertheless recommends a configuration with more VRAM, around 16 GB, for optimal use. In concrete terms, the model can run inference on an ordinary computer equipped with an Nvidia RTX-class GPU. Stability AI worked with the Santa Clara giant to optimize the model's performance on RTX cards. Versions optimized with Nvidia's TensorRT SDK are reportedly up to 50% more efficient.
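In practice, local inference typically goes through Hugging Face's diffusers library (access to the gated weights requires accepting the license on Hugging Face first). A sketch, with half precision and CPU offload as the usual levers to fit a small VRAM budget:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Half precision roughly halves the memory footprint.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
# Offload idle submodules to system RAM to stay within a small VRAM budget.
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A purple sports car speeds along the Champs Elysées in Paris. Photorealistic.",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3m_test.png")
```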

Work has also been conducted with AMD to optimize model inference on the latest APUs (processors with integrated graphics), AMD consumer graphics cards, and the powerful AMD MI300X enterprise GPUs. By collaborating with the biggest GPU players, particularly on consumer GPU ranges, Stability is clearly targeting local use of its model on personal devices.

For fine-tuning, however, you will have to turn to GPUs with a larger amount of VRAM.
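One common workaround is parameter-efficient fine-tuning such as LoRA, which trains small adapters instead of the full 2-billion-parameter transformer. A hedged sketch with the peft library, reusing the pipe object from the loading example above (the target module names are typical diffusers attention projections, chosen by us):

```python
from peft import LoraConfig

# Train small low-rank adapters instead of the full transformer weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
pipe.transformer.requires_grad_(False)      # freeze the base weights
pipe.transformer.add_adapter(lora_config)   # injected adapters stay trainable
trainable = [p for p in pipe.transformer.parameters() if p.requires_grad]
```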

Maximum quality for minimal resource cost?

We were able to test Stable Diffusion 3 Medium through a Space on Hugging Face. At first glance, the Stability model surprises with the overall quality of its output given the model's size (remember, only 2 billion parameters). Complex prompts, however, should be avoided.
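For reference, a Space can also be queried programmatically with gradio_client; the Space id and especially the endpoint name below are assumptions to verify against the Space's "Use via API" panel:

```python
from gradio_client import Client

# Space id and endpoint name are assumptions; check the Space's
# "Use via API" panel for the exact signature and parameters.
client = Client("stabilityai/stable-diffusion-3-medium")
result = client.predict(
    "Portrait of a Laotian man with marked facial features. Photorealistic.",
    api_name="/infer",
)
print(result)
```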

The Stability AI model makes real progress in generating photorealistic images. Take the portrait below: the facial features are credible and the overall consistency is quite good. A real plus: the model manages to follow our prompt exactly. However, we note some artifacts typical of AI: erratic lighting, unevenly controlled bokeh, and slightly dull saturation.

Prompt: Portrait of a Laotian man with marked facial features. The ocean in the background. Photorealistic. © Stable Diffusion V3

The real strong point is indeed prompt adherence. In the image below, we ask the AI to generate a purple sports car driving fast on the Champs Elysées. The model respects every one of these elements.

Prompt: A purple sports car speeds along the Champs Elysées in Paris. Photorealistic. © Stable Diffusion V3

Even more impressive, Stable Diffusion 3 Medium handles the integration of text into images quite faithfully. Here we ask it to place the text "Today's AI news" on a newspaper. The model responds brilliantly with a quality image, although some artifacts remain, with distortions characteristic of AI.

Prompt: A penguin with glasses is reading a newspaper in a café. The main headline reads “Today’s AI news”. American cartoon style. © Stable Diffusion V3

SD3M was heavily criticized in its early days for generating inconsistent human anatomy, particularly the limbs. We did not reproduce exactly the same failures, but we did identify the model's limits: prompts featuring at least one human combined with overly precise requests. For example, we asked the AI to generate a couple, a man and a woman, on a snowmobile in the Swiss Alps. The model understands the request but fails to make the scene coherent: the two humans are generated, but their limbs are not coherently placed in space. The final image is therefore unrealistic and unusable.

Prompt: A man and a woman on a snowmobile speed through a snowy landscape in the Swiss Alps. Realistic painting. © Stable Diffusion V3

In our tests, only repeated iterations on the same prompt with random seeds produced more consistent results. A pity.
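Concretely, with the diffusers pipeline shown earlier, iterating on seeds just means passing a different torch.Generator on each call and keeping the most coherent output:

```python
import torch

prompt = ("A man and a woman on a snowmobile speed through a snowy "
          "landscape in the Swiss Alps. Realistic painting.")
# Re-run the same prompt with different seeds and keep the best image.
for seed in [0, 42, 123, 2024]:
    generator = torch.Generator("cpu").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]  # pipe from the earlier sketch
    image.save(f"snowmobile_seed{seed}.png")
```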

What are the terms of use?

The Stable Diffusion 3 Medium weights are openly distributed under a non-commercial license. To obtain rights to the generated images, Stability offers a "Creator" license at $20 per month. It is reserved for professionals with less than a million dollars in annual revenue, less than a million dollars in institutional funding, and fewer than a million monthly active users. If any of the three conditions is not met, Stability offers an enterprise license with custom pricing.

Stable Diffusion 3 Medium offers an excellent quality-to-hardware-resources ratio. Its lightness and optimization make it the reference model for local image generation. Confidentiality needs, lack of connectivity, fine-tuning… the use cases are numerous. For more traditional image generation, however, a proprietary model like DALL-E 3 or Midjourney remains preferable.

