Somewhere between mathematics and magic, a diffusion model takes a screen full of pure random noise and gradually sculpts it into a photorealistic face, a Renaissance-style oil painting, or a cityscape at golden hour. If you have ever typed a sentence into an AI image generator and watched something appear, you have seen a diffusion model at work. But what is actually happening inside that process? The answer is surprisingly logical once you see the two phases it relies on.
The Idea Behind Diffusion Models
A diffusion model is trained to remove noise. That is the entire core concept. During training, the model is shown millions of images with increasing amounts of random noise added to them. It builds the ability, step by step, to reconstruct what the original image looked like before the noise was introduced. Once trained, you run the process in reverse: you start with pure noise, and the model walks backward through its internalized steps, stripping away noise at each stage until a coherent image emerges.
A strange kind of creativity
It sounds odd that this counts as "creativity." The model is not painting something from scratch the way a human artist would. Instead, it traces a path through an internalized map of what real images look like at various levels of corruption, walking from chaos toward clarity. The destination on that path, shaped by your text prompt, determines what you see at the end.
This is why the results feel so diverse and natural. The model is not retrieving stored images or stitching together fragments. It is constructing something new by following the statistical patterns of what a real image should look like at each stage of the denoising process.
Why noise is the starting point
The choice to start with noise is deliberate. Noise is mathematically well-defined: specifically, Gaussian noise, where pixel values follow a bell-curve distribution. That mathematical precision is what allows the model to train rigorously. It has been exposed to the noisy versions of millions of images at every level of corruption, so it can predict what should be removed at any given stage.
Note: Gaussian noise looks much like TV static. Every pixel is a random value drawn from a normal distribution. At the maximum noise level, the original image contributes essentially no information to what you see.

Forward Diffusion: From Image to Noise
The training phase starts by systematically destroying images. This is called forward diffusion, and it works by adding a small, mathematically precise amount of Gaussian noise to an image at each timestep. After hundreds of timesteps, the original image is completely unrecognizable. All that remains is pure static, indistinguishable from randomly generated noise with no image involved at all.
Adding noise, step by step
Think of it like placing sheets of increasingly opaque tracing paper over a photograph. After one sheet, the photo is still mostly visible. After ten sheets, it is blurry and faded. After a thousand sheets, you cannot see the photo at all. Forward diffusion does exactly this with pixel-level noise, in mathematically controlled, predictable increments.

The central insight is that each noisy version is paired with its timestep label. The model is shown "this is what the image looks like at step 200 out of 1000," and "this is what it looks like at step 800 out of 1000." Any step, for any image in the training set, can serve as a labeled training example. This yields a vast, richly structured dataset from images that already exist.
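The tracing-paper picture can be sketched in a few lines of Python. This is a minimal illustration, not any particular model's code. It assumes a linear schedule of per-step noise variances (`betas`, with illustrative start and end values) and uses the standard closed-form shortcut that jumps from the original image to any timestep in one move. The names `alpha_bar` and `add_noise`, and the 16-value stand-in "image", are made up for this sketch.

```python
import math
import random

T = 1000  # total timesteps, matching the 1000-step example in the text

# Assumed linear schedule of per-step noise variances (illustrative values).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# alpha_bar[t]: share of the original image's variance still present at step t.
alpha_bar = [1.0]
for beta in betas:
    alpha_bar.append(alpha_bar[-1] * (1.0 - beta))

def add_noise(x0, t, rng):
    """Return (x_t, eps) with x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps."""
    a = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
image = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # stand-in for pixel values in [-1, 1]

for t in (0, 100, 400, 700, 1000):
    noisy, _ = add_noise(image, t, rng)
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    print(f"t={t:4d}  signal scale={signal:.3f}  noise scale={noise:.3f}")
```

The pair `(noisy, t)` is exactly the kind of labeled example described above: a corrupted image together with the step it belongs to.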
What the noise schedule looks like
| Timestep | Noise Level | Image Visibility |
|---|---|---|
| 0 | None | Perfect original |
| 100 | Low | Slightly grainy |
| 400 | Medium | Blurry, outlines visible |
| 700 | High | Only vague shapes |
| 1000 | Maximum | Pure Gaussian noise |
The specific rate at which noise is added is called the noise schedule. Different schedules, such as linear, cosine, or sigmoid, affect the quality of images the model can eventually produce. Cosine scheduling is a popular choice: it adds noise more slowly at the extremes and faster in the middle, which tends to produce better-looking final outputs.
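To make the difference between schedules concrete, here is a small sketch that computes how much of the original signal survives at each timestep under a linear and a cosine schedule. The constants (`beta_start`, `beta_end`, and the offset `s`) are commonly used values assumed here for illustration, and the function names are invented for this example.

```python
import math

T = 1000

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Surviving signal fraction when the per-step noise variance grows linearly."""
    out, a = [1.0], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        a *= 1.0 - beta
        out.append(a)
    return out

def cosine_alpha_bar(T, s=0.008):
    """Surviving signal fraction following a squared-cosine curve."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

lin = linear_alpha_bar(T)
cos = cosine_alpha_bar(T)
for t in (0, 100, 400, 700, 1000):
    print(f"t={t:4d}  linear={lin[t]:.4f}  cosine={cos[t]:.4f}")
```

Printing the two columns side by side shows that, with these assumed constants, the cosine curve keeps more of the image around through the middle timesteps than the linear one does.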
Reverse Diffusion: From Noise to Image
Once the model has internalized every level of noise for millions of images, you flip the script. You hand it pure noise and ask it to predict what noise should be removed to get closer to a real image. It removes a little. You ask again. It removes a little more. After 20 to 1000 of these denoising steps, depending on the sampler chosen, a real-looking image has materialized.
The denoising neural network
The neural network inside a diffusion model does not generate the image directly. It predicts the noise itself, specifically what noise needs to be subtracted at each step. The actual workhorse is typically a U-Net architecture: a neural network shaped like the letter U, with an encoder path that compresses the noisy image into abstract representations and a decoder path that expands them back into a full-size prediction of the noise. The image emerges from what is left after the predicted noise is subtracted.
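"What is left after the predicted noise is subtracted" can be written down directly. The sketch below assumes the usual mixing rule, where a noisy image is `sqrt(a_bar)` times the clean image plus `sqrt(1 - a_bar)` times the noise, and simply inverts it. The value of `a_bar` and the four-number "image" are hypothetical, and the true noise stands in for a network's prediction.

```python
import math
import random

def predict_clean(xt, eps_hat, a_bar):
    """Invert x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps to estimate x0."""
    return [(x - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar)
            for x, e in zip(xt, eps_hat)]

rng = random.Random(1)
x0 = [0.5, -0.25, 0.8, 0.0]   # stand-in clean image
a_bar = 0.3                   # hypothetical signal fraction at a mid-range timestep
eps = [rng.gauss(0.0, 1.0) for _ in x0]
xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]

# With a perfect noise prediction, the clean values come back (up to rounding).
recovered = predict_clean(xt, eps, a_bar)
print(recovered)
```

A real network's prediction is never perfect, which is one reason samplers take many small steps instead of a single jump.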

How many steps does it take?
The original DDPM (Denoising Diffusion Probabilistic Model) paper required 1000 denoising steps, which made generation painfully slow. Modern samplers like DDIM (Denoising Diffusion Implicit Models), DPM++, and Euler can produce high-quality images in just 20 to 50 steps by taking larger, smarter strides through the denoising process. This is why modern AI image tools feel nearly instant compared to early implementations.
Note: The sampler is a separate algorithm that sits on top of the trained model. Changing the sampler does not change what the model has internalized. It only changes how the model's predictions are used to walk from noise to image. Different samplers suit different needs: DDIM is fast and consistent, DPM++ delivers higher quality with more steps, Euler balances speed and fidelity well.
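The idea that a sampler is separate from the model can be shown with a toy example. The sketch below is a simplified, deterministic DDIM-style loop written for illustration only; it is not the code of any real library. It assumes a linear noise schedule, and the `oracle` function is a hypothetical stand-in for a trained network that "cheats" by knowing the target image, so the same predictor can be driven with 20 or 1000 steps.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alpha_bar = [1.0]
for beta in betas:
    alpha_bar.append(alpha_bar[-1] * (1.0 - beta))

def ddim_sample(predict_noise, x, num_steps):
    """Walk from noise to image over an evenly spaced subset of the T timesteps."""
    timesteps = [round(T * i / num_steps) for i in range(num_steps, 0, -1)]
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        eps_hat = predict_noise(x, t)
        # Estimate the clean image, then re-noise it to the (lower) previous level.
        x0_hat = [(xi - math.sqrt(1 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps_hat)]
        x = [math.sqrt(a_prev) * c + math.sqrt(1 - a_prev) * e
             for c, e in zip(x0_hat, eps_hat)]
    return x

target = [0.5, -0.25, 0.8, 0.0]  # the "image" the toy predictor steers toward

def oracle(x, t):
    """Hypothetical perfect noise predictor (a real model would be a trained network)."""
    a = alpha_bar[t]
    return [(xi - math.sqrt(a) * c) / math.sqrt(1 - a) for xi, c in zip(x, target)]

rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in target]
out_20 = ddim_sample(oracle, start, 20)
out_1000 = ddim_sample(oracle, start, 1000)
print(out_20)
print(out_1000)
```

Only `num_steps` changes between the two calls; the predictor is untouched. With a real, imperfect network the step count does affect quality, which is where the choice of sampler starts to matter.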
Inside the Latent Space
Running diffusion directly on raw pixel values at high resolution would be extremely slow and computationally expensive. A 512x512 RGB image has over 786,000 individual values (512 x 512 pixels x 3 color channels) to process at every denoising step. Modern diffusion models solve this with a critical efficiency trick: they perform the denoising in a compressed representation called latent space rather than pixel space.
The encoder and decoder
Before diffusion begins, a separate neural network called the encoder compresses the image into a much smaller grid of values. A 512x512 pixel image typically becomes a 64x64 latent representation: 8 times smaller along each side, or 64 times fewer spatial positions. All the denoising happens at this smaller size, which is dramatically faster. When the denoising finishes, a second network called the decoder expands the latent representation back into a full-resolution image.
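The savings are easy to check with a little arithmetic. The snippet below assumes a latent grid with 4 channels per position; that channel count is an assumption for illustration and varies between models.

```python
# Values per image in pixel space: height x width x 3 color channels (RGB).
pixel_values = 512 * 512 * 3            # 786432

# Values in latent space, assuming a 64x64 grid with 4 channels per position.
latent_values = 64 * 64 * 4             # 16384

spatial_reduction = (512 * 512) // (64 * 64)       # 64 times fewer grid positions
overall_reduction = pixel_values / latent_values   # 48.0 times fewer numbers overall

print(pixel_values, latent_values, spatial_reduction, overall_reduction)
```

Under that assumption, every denoising step touches roughly 48 times fewer numbers than it would in pixel space.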

The encoder and decoder together form a Variational Autoencoder (VAE). The latent space it creates is not arbitrary compression; it is a structured, information-rich representation that preserves the most important visual features of the image while discarding what is redundant. Think of it like a film negative: all the information of a full-resolution photograph compressed onto a thin, small strip of material.
Why latent space changed everything
This architecture is why the model family is officially called Latent Diffusion Models (LDMs), even though most people just say "diffusion models." Stable Diffusion XL and Stable Diffusion 3 are both latent diffusion models. So is Flux Pro. Working in latent space rather than pixel space is what made high-resolution AI image generation practical on consumer hardware rather than requiring server farms.
How Text Controls the Output
A diffusion model without text conditioning would generate images of random, statistically plausible-looking content every time. Text prompts work by conditioning the denoising process, steering the model toward specific regions of the image space where the content matches your description.
Text encoders and embeddings
The text you type is first converted into a dense numerical representation by a text encoder. Models like CLIP or T5 read your prompt and produce a sequence of high-dimensional vectors called embeddings, each capturing the semantic meaning of words and phrases in the context of the full sentence. These embedding vectors are passed into the U-Net at every denoising step, so the noise prediction is constantly influenced by what you asked for.

Classifier-free guidance
The strength of the text's influence is controlled by a setting called guidance scale (also called CFG scale, short for Classifier-Free Guidance). At low values around 2 to 4, the model follows the text loosely and produces more varied, sometimes surprising results. At high values around 12 to 20, the model follows the text very strictly but can produce overly saturated or distorted images. Most AI generators default to a guidance scale between 7 and 12, balancing prompt fidelity with visual quality.
Note: Classifier-free guidance works by running the denoising twice per step: once with your text condition and once without it. The model then amplifies the difference between the two predictions. A higher guidance scale means a larger amplification factor, which pulls the result more strongly toward your prompt.
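In code, the "amplify the difference" step is a one-liner. The sketch below uses the commonly written form of the combination; the function name and the three-value noise predictions are made up for illustration.

```python
def apply_guidance(eps_uncond, eps_cond, guidance_scale):
    """Push the unconditioned prediction toward the text-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0, -1.0]  # prediction without the prompt (toy values)
eps_cond = [1.0, 1.0, 0.0]     # prediction with the prompt (toy values)

guided = apply_guidance(eps_uncond, eps_cond, 7.5)
print(guided)  # [7.5, 1.0, 6.5]
```

A scale of 1 returns the text-conditioned prediction unchanged, and larger values exaggerate whatever the prompt changed. Where the two predictions agree (the middle value here), guidance has no effect.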
The Training Process
Training a diffusion model from scratch requires enormous datasets and substantial compute. Models like Stable Diffusion XL were trained on hundreds of millions of image-caption pairs scraped from the internet, requiring clusters of hundreds of GPUs running for weeks or months. What the model builds from that process is not a database of images but a rich internal representation of visual structure.
What the model actually internalizes
During training, for each image in the dataset, the process runs as follows:
- Sample a random timestep between 1 and 1000
- Add the corresponding amount of Gaussian noise to the image
- Feed the noisy image to the U-Net along with the timestep number and the text caption
- Ask the U-Net to predict the noise that was added
- Calculate how wrong the prediction was (the loss function)
- Adjust the model's weights to make future predictions more accurate

This cycle repeats for billions of image-timestep pairs. The result is a model that has internalized the visual structure of an enormous slice of human visual culture and can move from noise toward any image that falls within the distribution it was trained on.
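The six training steps listed above can be sketched end to end with a deliberately tiny stand-in for the network. Everything here is a toy under stated assumptions: `toy_model` is a hypothetical one-parameter predictor rather than a U-Net, it ignores the text caption, the "images" are short lists of random numbers, and the noise schedule is an assumed linear one.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alpha_bar = [1.0]
for beta in betas:
    alpha_bar.append(alpha_bar[-1] * (1.0 - beta))

def toy_model(w, xt, t):
    """Hypothetical one-parameter noise predictor (stand-in for the U-Net)."""
    s = math.sqrt(1.0 - alpha_bar[t])
    return [w * s * x for x in xt]

def training_step(w, image, rng, lr=0.1):
    t = rng.randint(1, T)                                   # 1. sample a timestep
    a = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in image]              # 2. draw Gaussian noise
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e         #    and mix it into the image
          for x, e in zip(image, eps)]
    eps_hat = toy_model(w, xt, t)                           # 3-4. predict the added noise
    n = len(image)
    loss = sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / n   # 5. mean squared error
    s = math.sqrt(1.0 - a)
    grad = sum(2.0 * (p - e) * s * x for p, e, x in zip(eps_hat, eps, xt)) / n
    return w - lr * grad, loss                              # 6. gradient-descent update

rng = random.Random(0)
images = [[rng.uniform(-1.0, 1.0) for _ in range(64)] for _ in range(8)]

w = 0.0
for step in range(300):
    w, loss = training_step(w, rng.choice(images), rng)
print(f"learned weight: {w:.3f}, last loss: {loss:.3f}")
```

A real training run replaces the single weight with a very large network and the hand-written gradient with automatic differentiation, but the loop has the same shape.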
The role of random seeds
The reverse denoising process starts from random noise, so its output depends on the random seed. Even with an identical prompt, running the model twice with different random seeds produces entirely different images, because small differences in the initial noise lead to very different paths through the denoising process. This is a feature, not a flaw: it is what gives you visual variety across multiple generations of the same prompt, while reusing a seed with the same settings generally lets you reproduce a result you liked.
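The role of the seed is easy to see with Python's built-in random number generator. This is only an illustration of the principle; `starting_noise` is a made-up helper, and real tools generate their noise with their own libraries.

```python
import random

def starting_noise(seed, size=8):
    """Draw the initial Gaussian noise for a generation from a given seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

same_a = starting_noise(42)
same_b = starting_noise(42)
other = starting_noise(43)

print(same_a == same_b)  # True: same seed, same starting noise
print(same_a == other)   # False: different seed, different starting noise
```

Since every later denoising step builds on that first noise sample, a different seed sends the whole generation down a different path.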

Diffusion vs. Other Generative Models
Diffusion models did not emerge in a vacuum. Two older approaches dominated AI image generation before them, and comparing all three sheds light on why diffusion became the standard.
Against GANs
Generative Adversarial Networks (GANs) use two competing networks: a generator that creates images and a discriminator that tries to detect fakes. They can produce sharp, high-fidelity images quickly but are notoriously difficult to train and prone to "mode collapse," where the generator stops producing diverse outputs and fixates on a narrow range of images that reliably fool the discriminator.
Diffusion models are slower at inference but produce more diverse, controllable outputs and are far more stable to train. The ability to condition on arbitrary text with high fidelity is also far more natural with diffusion than with GANs, which require architectural tricks and workarounds to achieve meaningful prompt control.
Against VAEs
Variational Autoencoders also work with latent representations and can generate images by sampling from the latent space. However, they tend to produce blurry results because they optimize for average pixel similarity, and the "average" of many plausible images is often a washed-out blend of all of them. Diffusion models largely sidestep this by optimizing for noise prediction rather than pixel reconstruction, which tends to produce sharper, more detailed outputs.
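The "washed-out blend" effect shows up even with a single pixel. The sketch below is a simplified picture of one contributing factor, not a full account of VAE training: it assumes a pixel that could plausibly be black (-1) or white (+1) and asks which single guess scores best on average squared error.

```python
def mse(prediction, targets):
    """Average squared error of one guess against several plausible answers."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Two equally plausible "correct" values for one pixel: black (-1) or white (+1).
plausible = [-1.0, 1.0]

grey = sum(plausible) / len(plausible)  # the mean of the options: 0.0, mid-grey

print(mse(grey, plausible))   # 1.0
print(mse(-1.0, plausible))   # 2.0
print(mse(1.0, plausible))    # 2.0
```

The grey guess wins on average error even though it matches neither plausible answer. Repeated across every pixel, that pull toward the middle reads as blur.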
| Model Type | Speed | Quality | Text Control | Training Stability |
|---|---|---|---|---|
| GAN | Fast | Sharp but narrow | Limited | Difficult |
| VAE | Fast | Often blurry | Limited | Stable |
| Diffusion | Moderate | High fidelity, diverse | Excellent | Very stable |

Real Models, Real Results
The models you use on AI image platforms are all built on these diffusion principles. Here is how the major ones map to what you have read above.
Stable Diffusion and its variants
The Stable Diffusion XL architecture introduced a two-stage pipeline: a base model handles broad composition and structure, and a refiner model sharpens fine details in a second pass. Stable Diffusion 3 replaced the classic U-Net with a DiT (Diffusion Transformer) architecture, which improves text rendering accuracy and compositional control significantly.
These models are the foundation for hundreds of fine-tuned variants specialized in portraits, architectural visualization, product photography, and illustration. Fine-tuning does not change the core diffusion process; it shifts the model's internalized distribution toward a specific visual style or subject domain.
Flux and the new generation
Flux Pro and Flux Schnell represent the current state of the art in diffusion transformers. They produce exceptional photorealism, accurate text rendering within generated images, and strong adherence to prompt details. Flux Redux Dev extends this by allowing variation generation on top of an existing reference image, so you can iterate from a starting point rather than from pure noise.
All of these models still follow the exact same core process: forward noise during training, reverse denoising at generation time, latent space compression for speed, and text conditioning to shape the output. Architectural improvements change how well the model walks that process, not what the process is.
What You Can Create Right Now
You do not need to train a model or write a single line of code to put all of this into practice. Every model described above is available directly in the browser on Picasso IA, with no setup required.

Start by typing a detailed, specific prompt into Flux Pro and watching the denoising process translate your words into a photorealistic scene. Adjust the guidance scale up and down to see how strongly the text shapes the result. Then swap to Stable Diffusion XL with the same prompt and compare the aesthetic differences between two models that share the same core diffusion process but differ in the architecture that carries it out.
Generate the same prompt twice with different seeds. Notice how different the outputs are, even though the underlying process is identical. That variation is not a flaw. It is the probabilistic nature of diffusion doing exactly what it was designed to do: move from chaos to clarity, guided by what you asked for, through a path that changes with every new seed.
Every image you generate is a walk through noise. Now you know exactly what happens along the way.