Maat Scan

Technical

How AI Image Generators Work: Diffusion Models Explained

By Maat Scan · May 5, 2026

In 2022, generating a photorealistic face from a text prompt required a high-end GPU, careful prompt engineering, and about a minute of compute time. By 2026, the same task runs in 4.5 seconds on Flux 1.1 Pro via a cloud API, at quality that passes visual inspection for most viewers.1 Understanding what changed — and what is happening inside these models each time they produce an image — matters for anyone trying to evaluate or detect AI-generated content.

The Core Idea: Learning to Remove Noise

A diffusion model does not learn to paint an image from scratch. It learns to remove noise.

During training, the model is shown real images that have been progressively corrupted by adding random noise — step by step, until nothing recognizable remains, just static. The model's task is to learn the reverse: given a slightly noisy version of an image, predict what the slightly less noisy version looks like. Repeat this across hundreds of small steps, and a model trained this way can start from pure noise and walk backwards to a coherent image.
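
A minimal NumPy sketch of the forward corruption side, using an illustrative linear noise schedule (the schedule values and image size here are toy choices, not taken from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "image": in practice this would be a real training image
# normalized to roughly [-1, 1].
image = rng.uniform(-1.0, 1.0, size=(64, 64, 3))

# Illustrative variance schedule: tiny amounts of noise early,
# more noise at later steps.
num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = np.cumprod(1.0 - betas)

def corrupt(x0, t):
    """Return the image after t noising steps, sampled in closed form."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise, noise

# Early steps barely change the image; by the final step it is
# statistically indistinguishable from pure Gaussian noise.
slightly_noisy, _ = corrupt(image, t=10)
pure_static, _ = corrupt(image, t=num_steps - 1)
print(alphas_cumprod[10], alphas_cumprod[-1])  # roughly 0.998 vs 0.00004
```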

The original DDPM paper (Denoising Diffusion Probabilistic Models), published by Ho et al. in 2020, formalized this approach and showed it produced higher quality outputs than the GAN models that had dominated image generation for the previous five years.2 GANs trained a generator and a discriminator against each other — a process prone to instability and mode collapse, where the model learns to produce only a narrow range of outputs. Diffusion models sidestep both problems by framing generation as a sequence of small, learnable denoising steps.
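
In practice, training never runs the full corruption chain end to end. Each training example is a single randomly chosen step: noise a clean image to some timestep, ask the network to predict the noise that was added, and penalize the mean-squared error. A minimal PyTorch-style sketch of that objective, with `model` standing in for whatever U-Net or transformer a given implementation uses:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alphas_cumprod):
    """One DDPM training step: predict the noise added to a clean batch x0.

    model          -- any network taking (noisy image, timestep) -> noise estimate
    x0             -- batch of clean images, shape (B, C, H, W), values ~[-1, 1]
    alphas_cumprod -- cumulative products of (1 - beta_t), shape (T,)
    """
    batch = x0.shape[0]
    T = alphas_cumprod.shape[0]

    # Pick a random timestep for every image in the batch.
    t = torch.randint(0, T, (batch,), device=x0.device)
    a_bar = alphas_cumprod[t].view(batch, 1, 1, 1)

    # Corrupt the clean images in closed form.
    noise = torch.randn_like(x0)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise

    # The network's only job is to recover the noise it cannot see directly.
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)
```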

How Text Gets Into the Picture

A model trained purely on the noise/denoise process will generate images that resemble its training data, but at random. Adding text control requires another step.

The dominant approach uses a text encoder — typically a version of CLIP or a language model — to convert a text prompt into a numerical representation. This representation is injected into the denoising process at each step, biasing the noise removal toward the visual concept the text describes. The model has learned, during training on billions of image-text pairs, which visual features correlate with which text tokens.
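
The embedding itself is typically fed into the network through attention layers; on top of that, most samplers use classifier-free guidance to control how strongly the text steers each step. A simplified sketch of that guidance step, with `model` standing in for the denoising network and the embedding shapes left abstract:

```python
import torch

def guided_noise_prediction(model, x_t, t, text_embedding, null_embedding,
                            guidance_scale=7.5):
    """Classifier-free guidance: steer denoising toward the prompt.

    model          -- network taking (noisy latent, timestep, conditioning)
    text_embedding -- encoded prompt, e.g. from a CLIP text encoder
    null_embedding -- encoding of an empty prompt (the unconditional branch)
    guidance_scale -- how strongly to push toward the text (1.0 = no extra push)
    """
    # One prediction conditioned on the prompt, one ignoring it.
    cond = model(x_t, t, text_embedding)
    uncond = model(x_t, t, null_embedding)

    # Extrapolate away from the unconditional prediction and toward
    # the text-conditioned one.
    return uncond + guidance_scale * (cond - uncond)
```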

Stable Diffusion was trained on LAION-5B, a dataset of roughly 5 billion image-text pairs scraped from the public web.3 That training data explains why these models reflect the biases of internet imagery: overrepresentation of certain demographics, aesthetics, and visual styles. When a generated face looks like a stock photo, it is because the model learned from millions of stock photos.

Latent Diffusion: Why Speed Improved So Dramatically

The earliest diffusion models ran the denoising process directly on pixels. A 512×512 image has 786,432 pixel values. Running hundreds of denoising steps across all of them was computationally expensive — slow enough to be impractical for most users.

Latent diffusion models, introduced with Stable Diffusion in 2022, solved this by first compressing the image into a smaller "latent" representation using an encoder.4 The denoising steps run in this compressed space — typically 64×64 features rather than a full-resolution pixel grid — and only at the end is the result decoded back to full resolution. This compression reduced compute requirements dramatically, making the models practical on consumer hardware and, later, fast enough for sub-10-second cloud generation.
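
The arithmetic behind the speedup is easy to check. A back-of-the-envelope sketch, using the 4-channel, 8×-downsampled latent of the original Stable Diffusion (other models use different compression factors); the encode/denoise/decode names in the comments are placeholders, not a real API:

```python
# Pixel space: a 512 x 512 RGB image.
pixel_values = 512 * 512 * 3          # 786,432 values per denoising step

# Latent space in the original Stable Diffusion: the encoder downsamples
# by 8x in each spatial dimension and keeps 4 feature channels.
latent_values = 64 * 64 * 4           # 16,384 values per denoising step

print(pixel_values / latent_values)   # ~48x fewer values to denoise

# Overall pipeline structure (placeholder names, not library calls):
#   latent = encode(image)             # run once, before any denoising
#   latent = denoise(latent, prompt)   # many steps, all in the 64x64x4 space
#   image  = decode(latent)            # run once, after denoising finishes
```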

The Main Generators and How They Differ

The major image generation tools in 2025-2026 share the diffusion foundation but diverge in architecture, training, and design philosophy.

Stable Diffusion (Stability AI)

Open-source, runs locally on consumer hardware, and supports extensive customization via LoRA fine-tuning adapters and layout control via ControlNet. The 3.5 Large version has 8.1 billion parameters and uses a Multimodal Diffusion Transformer (MMDiT) architecture that processes image and text tokens jointly rather than separately.
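
A short usage sketch, assuming a recent version of Hugging Face's diffusers library, a GPU with enough memory, and acceptance of the model license; the LoRA path is a placeholder:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load the open-weights 3.5 Large checkpoint.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Optional: apply a LoRA adapter fine-tuned for a particular style or subject.
# pipe.load_lora_weights("path/to/your_lora.safetensors")

image = pipe(
    prompt="a photorealistic portrait, soft window light",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
image.save("portrait.png")
```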

Midjourney

Proprietary, runs through a web interface. Uses a custom diffusion architecture not publicly documented. Known for strong aesthetic coherence and photographic quality, but not available for external inspection or fine-tuning.

DALL-E 3 (OpenAI)

Diffusion-based with tight integration into ChatGPT for prompt rewriting and content filtering. Notably reliable for rendering readable text within images — an area where most other generators still produce garbled characters.
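
The prompt rewriting is visible directly in the API: a generation request returns the revised prompt the system actually used alongside the image. A short sketch assuming the official openai Python SDK, with the API key read from the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="dall-e-3",
    prompt="a street sign that reads 'DIFFUSION AVE', photographed at dusk",
    size="1024x1024",
)

# DALL-E 3 rewrites prompts before generation; the rewritten version
# is returned alongside the image URL.
print(result.data[0].revised_prompt)
print(result.data[0].url)
```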

Flux (Black Forest Labs)

Uses flow matching rather than standard DDPM, combined with a transformer backbone. Flux.1 Pro has roughly 12 billion parameters.5 As of 2026, benchmarks consistently rank it highest for prompt fidelity and text rendering. It is also the hardest generator for detection tools to catch: correctly identified only 18-30% of the time in a February 2026 benchmark study, compared to 80-95% for older generators.6
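
Flow matching replaces the stepwise noise schedule with a simpler training target: draw a straight line between a data sample and Gaussian noise, and train the network to predict the constant velocity along that line. A simplified sketch in the rectified-flow style; conventions vary between papers, and the exact formulation Black Forest Labs uses is not public, so treat this as illustrative only:

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x0):
    """One flow-matching training step (rectified-flow style).

    model -- network taking (interpolated sample, timestep) -> velocity estimate
    x0    -- batch of clean data (or clean latents), shape (B, C, H, W)

    Convention here: t = 0 is clean data, t = 1 is pure noise.
    """
    batch = x0.shape[0]
    noise = torch.randn_like(x0)
    t = torch.rand(batch, device=x0.device).view(batch, 1, 1, 1)

    # Straight-line interpolation between data and noise.
    x_t = (1.0 - t) * x0 + t * noise

    # The velocity along that line is constant, so the target is simply
    # the direction from data to noise.
    target_velocity = noise - x0
    predicted = model(x_t, t.view(batch))
    return F.mse_loss(predicted, target_velocity)
```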

Why This Matters for Detection

The generation process explains why detection is technically difficult. There are no painted brush strokes or copy-pasted regions — the image is assembled from learned statistical patterns in compressed feature space. The artifacts that remain are statistical: unusual frequency distributions in texture, geometry that deviates subtly from physical constraints, skin that lacks the micro-variation produced by real camera optics.
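
One way such frequency artifacts can be made visible is to compare radially averaged power spectra: take a 2D Fourier transform and average the energy over rings of equal spatial frequency. A minimal NumPy sketch; whether a given generator actually leaves a measurable deviation in this curve varies by model and is not guaranteed:

```python
import numpy as np

def radial_power_spectrum(gray_image):
    """Radially averaged power spectrum of a grayscale image (2D array).

    Many detection methods compare curves like this between real photos
    and generated images, looking for unusual high-frequency behavior.
    """
    # 2D FFT, shifted so the zero frequency sits at the center.
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray_image))) ** 2

    h, w = gray_image.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices((h, w))
    radius = np.sqrt((y - cy) ** 2 + (x - cx) ** 2).astype(int)

    # Average power over rings of equal spatial frequency.
    totals = np.bincount(radius.ravel(), weights=spectrum.ravel())
    counts = np.bincount(radius.ravel())
    return totals / np.maximum(counts, 1)

# Usage: compare the tail of this curve for a known-real photo and a
# suspected generated image of the same size.
```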

When a new model architecture changes how it processes and decodes those features, the statistical signatures shift. Detection systems trained on one model family can fail on another. Every architectural improvement is also, inadvertently, a step away from the patterns that made its predecessors detectable.

This is not a problem that better detection alone can solve. Provenance standards like C2PA, which embed cryptographic attestation at the point of generation, address a different part of the chain — not "does this look AI-generated" but "can we verify where it came from." Both approaches are needed: provenance only helps when credentials are present and intact, and detection alone cannot keep pace with every new architecture.

Sources

  1. Freeacademy.ai, "Midjourney vs DALL-E vs Stable Diffusion vs Flux 2026: Complete AI Image Generator Comparison," 2026.
  2. Ho et al., "Denoising Diffusion Probabilistic Models," arXiv 2006.11239, 2020.
  3. Schuhmann et al., "LAION-5B: An open large-scale dataset for training next generation image-text models," arXiv 2210.08402, 2022.
  4. Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models," CVPR, 2022.
  5. Anakin.ai, "FLUX vs MidJourney vs DALL-E vs Stable Diffusion: Which AI Image Generator Should You Choose?," 2025.
  6. arXiv 2602.07814, "Open-Source AI-Generated Image Detection Benchmark," February 2026.