Comparison of diffusion-based text-to-image models using benchmark prompts

In the hope of replacing cherry-picked presentations of image-generation checkpoints on Civitai and elsewhere with something more objective, I created 100 benchmark prompts and tested them on many checkpoints. Their purpose is not to measure speed, but to compare output quality and prompt adherence.

The results show that many concepts are understood either by none or by all of the checkpoints. This points to a lack of diversity in the training data.

Number of images (out of 100) that achieve a "pass" rating in my subjective evaluation:

epiCRealism Natural Sin RC1 VAE: 36 passes
DreamShaper 8: 39 passes
Hephaistos_NextGENXL v2.0: 62 passes
Juggernaut XL version XI: 66 passes
SDXL 4GB/2GB: 49 passes
CyberRealistic CyberIllustrious v7.0: 25 passes
Pony Realism v2.2 Hyper 8 steps: 21 passes
Stable Diffusion 3.5 Medium GGUF Q8_0: 51 passes
FLUX.1 Dev GGUF Q4_0: 63 passes
QWEN GGUF Q2_K: 71 passes
Wan 2.1 T2V 14B Q2_K: 43 passes

I must conclude that open-source image generation models usable on affordable consumer hardware are not yet good enough for general usage.

On Civitai and other places, new variants of Stable Diffusion 1.5, Stable Diffusion XL, Stable Diffusion 3.x, FLUX.1 and more are published daily, and it is difficult to keep track of them all and choose one that suits you. The creators of these variants, usually called checkpoints, typically try to present their creations in the best possible light, often cherry-picking images created with their checkpoints. Either there is no clear guideline for how checkpoints should be presented or the guidelines are not followed, leading to widely varying styles and quality of presentation. If one doesn't trust these presentations and wants to verify the quality of a checkpoint, one has to log in, download the checkpoint and create images with it using many different prompts, which can take hours. Given the vast number of checkpoints available, comparing all of them is a hopeless task.

I don't like that situation, and since I couldn't find good comparisons of checkpoints, I had to download and test dozens of them myself. In the hope of improving the situation, I created 100 benchmark prompts and generated images from these prompts with many of the checkpoints I had downloaded. The purpose is not to measure speed, but to compare the quality of the output and the prompt adherence. I chose the prompts to test many concepts that a well-rounded checkpoint ideally should understand and be able to paint.

These prompts should test the understanding of many different concepts and avoid prompt engineering and anything else that would give an advantage to one particular checkpoint, even if that means not being able to tease the best out of every checkpoint.

Tested concepts are: single humans; groups of humans; different ages; men & women; different ethnicities; animals; landscapes; foods & drinks; machinery; pop culture elements; celebrities; art styles; photography terminology; different body postures; facial expressions; memes. The prompts are intended to test the limits of a checkpoint's understanding, and it is not expected that all prompts will be understood correctly.

It is clear that the prompts I chose cannot tease the top performance out of every checkpoint. Many users have shown that with enough prompt engineering, stunning images can be produced. But I do not think that such prompts should be the basis of a benchmark. First, benchmark prompts should represent the level of knowledge of a beginner. Second, highly engineered prompts are unlikely to be transferable to other checkpoints. Third, the longer a prompt, the more likely it is to produce an image that is the result of overfitting and thus not representative of the general performance of the checkpoint. Consequently, I think benchmark prompts should be concise, easy to understand and transferable to many different checkpoints, without giving any particular checkpoint an unfair advantage. It is of course impossible to find prompts that fairly evaluate all possible checkpoints, so some checkpoints that expect a special prompting style will be disadvantaged by these benchmark prompts.

Here are the prompts I tested. I release them into the public domain and encourage everyone to use them (or something similar) for evaluating checkpoints:

Full body shot of a man in a forest
Full body shot of a woman in a modern city street
Full body shot of an old man in casual clothing sitting in a garden
Full body shot of an old woman in her living room
Full body shot of a young man training in a gym
Full body shot of a young woman in a bikini lying on a beach
Closeup of the face of a smiling eastern asian girl
Closeup of the face of a frowning eastern asian man
Closeup of the face of a laughing european woman
Closeup of the face of an angry european young man
Closeup of the face of a shocked african woman
Closeup of the face of a skeptical african young man raising one eyebrow
A group of three people sitting around a table
The inside of an unused industrial building in disrepair
A bunny sitting on lush green grassland in a mountain valley
Professional macro photograph of a single clover sprouting from the ground, depth of field
Overlooking waterfall, gorgeous scenery, forest, vivid, picturesque, scenic, volumetric shadows, volumetric lighting, intricate detail, high resolution
Wide shot of erupting volcano, rivers of lava, towering plume of smoke
A glass of hot chocolate with whipped cream on top, surrounded by roasted coffee beans
A luxurious bowl of ramen
A plate of lettuce, cherry tomatoes and pork steak with sear marks on it
Closeup of a glass of traditional martini
Cthulhu's head on the horizon, rising through the waves of a stormy sea
Glowing galaxy filling the night sky
Skeleton of a camel half buried in the sand of the desert
Simple black and white drawing of a bonfire, no background, line art
An old western house in a forest, ethereal glow,bright colors, winter
The dragon smaug resting atop his immense hoard of gold inside the halls of the lonely mountain
An intricate metal cage swinging from a chain, seen from below
A serene japanese garden with cherry blossoms in full bloom and a koi pond
Rusted-through and fallen-apart container ship resting on the dried sea floor of a post-apocalyptic harbor
Thick black billows of smoke rising from the towering flames of a burning factory
A group of glass-covered skyscrapers seen from the ground, looking up, blue sky, contrail
Horror, scarecrow, dark shadowed face, completely dressed in black, glowing eyes, full moon in the background, night, darkness, low light
An extravagant bouquet of flowers in a ceramic vase
A woman running away from a large horde of zombies, toward the camera
A very muscular man doing a handstand
A woman meditating cross-legged
A closeup shot of two men shaking hands
A woman crossing her arms and pouting
A husband kissing his bride
Girl playing the guitar, tattoo, very long hair, dyed hair, colored hair
Elon musk smoking a cigar
A top-down view of a second world war bomber plane
A modern fighter jet
A closeup of a modern main battle tank
An assault rifle amidst some scattered ammunition
A Ferrari, sunset, reflection
A truck with trailer on the road, kicking up dust
Two chess players playing a game of chess in a smoke-filled bar
An elephant in a skin-tight latex catsuit
Closeup of a human eye
Photo of a geode filled with colored crystals
Polar lights illuminating a fjord
Double rainbow over a field of wheat
The millennium falcon fleeing from a star destroyer, darth vader's head in the background, star wars, movie poster
The USS enterprise fighting against a borg cube, star trek
Pikachu drinking lemonade through a drinking straw, in the style of leonid afremov
Dickbutt in the style of van gogh
Photo of a wall full of graffiti, with "kilroy was here" graffito
Impressionist painting of a cyborg, in the style of monet
A mandala in the style of jackson pollock
A cubistic tyrannosaurus rex, in the style of picasso
An elf in the style of wassily kandinsky
An expressionist painting of a centaur in the style of franz marc
A watercolor painting of a unicorn
A pointilistic painting of an orc in the style of georges seurat
A biblically correct angel with many wings and many eyes, non-humanoid
Top-down view of a go board with a half-finished game of go, baduk, weiqi
An impasto style painting of the multi-armed god shiva, many arms
A fractal shape like the Mandelbrot set
A photo of a cat, bokeh
A full body shot of a giraffe
An elephant
Rough pencil sketch of a raccoon
A panda bear eating bamboo
3d rendering of a dog, CGI, computer-generated
Owl, lens flare
A praying mantis holding a dragonfly in its claws
Macro shot of a spider in its net
A ladybug taking off from a leaf, spreading its wings
An eagle carrying a hobbit in its claws
A tortoise carrying a frog on its back
A shark chasing a goldfish
Cinematic shot of a horse
A grove of palm trees
Some daisies in between many ferns
Macro shot of some plastic figurines
A leprechaun
A gorgon
A sphinx
A purple and black mech towering over the houses of a city
A succubus
A sorceress from a manga casting a spell, anime
A chibi character eating an oversized donut
A chunky and wide cyberpunk space ship crashlanding in a city
A girl covered in body paint seen from behind
A very muscular man posing for the camera
Spiderman, superman and batman posing for a promotion shot
A cacodemon from doom

I tested these prompts on the following checkpoints. There is no cherry-picking: if the output is crap, that's what will be shown. All images were generated using ComfyUI (commit 0963493a9c3b6565f8537288a0fb90991391ec41 and possibly some other commits; unfortunately I didn't keep track of all details), a fixed seed of 0, a size of 1024x1024 for Stable Diffusion XL and Stable Diffusion 3.5 based checkpoints and 512x512 for all other checkpoints (due to the limited VRAM of my GPU), and otherwise the settings stated next to each checkpoint. The negative prompt is ComfyUI's default "text, watermark" unless stated otherwise:

DreamShaper 8: I forgot the exact settings I used, but Euler Normal, 7.0 CFG, 30 steps produces perfectly acceptable results
epiCRealism Natural Sin RC1 VAE: I forgot the exact settings I used, but Euler Normal, 7.0 CFG, 30 steps produces perfectly acceptable results
PVC Figurerizer V2.0: only included to show that an SD1.5 checkpoint can look very different. I forgot the exact settings I used.
Hephaistos_NextGENXL v2.0: DPM++ SDE sampler, Karras schedule, 3.0 CFG, 12 steps
Juggernaut XL version XI: DPM++ 2M SDE sampler, Normal schedule, 6.0 CFG, 40 steps
SDXL 4GB/2GB: DPM++ 2M SDE sampler, Karras schedule, 5.0 CFG, 40 steps
CyberRealistic CyberIllustrious v7.0: a representative of the "Illustrious" family of completely retrained SDXL checkpoints, with DPM++ SDE sampler, Karras schedule, 5.0 CFG, 40 steps
Pony Realism v2.2 Hyper 8 steps: a representative of the "Pony" family of completely retrained SDXL checkpoints, with Euler Ancestral sampler, Normal schedule, 3.0 CFG, 9 steps
Stable Diffusion 3.5 Medium GGUF Q8_0 and T5xxl FP8 E4M3FN: DPM++ 2M sampler, Normal schedule, 4.5 CFG, 20 steps
FLUX.1 Dev GGUF Q4_0 and T5xxl FP8 E4M3FN: Euler sampler, Beta schedule, 3.5 CFG, 20 steps
QWEN GGUF Q2_K and Qwen 2.5 VL It 7B GGUF Q4_KM: Euler sampler, Simple schedule, 2.5 CFG, 10 steps (20 steps in the official ComfyUI workflow, but it just takes too long)
Wan 2.1 T2V 14B Q2_K and UMT5xxl GGUF Q5_KM and FusionX LoRA: LCM sampler, Simple schedule, 1.5 CFG, 12 steps in single-frame mode (i.e. you input a latent video with just 1 frame). Without the LoRA, it simply takes too long. For this I used the following negative prompt, taken directly from the official ComfyUI workflow: "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走", which according to Google Translate means "Vibrant colors, overexposed, static, blurry details, subtitles, style, artwork, painting, image, still, overall grayish, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, distorted limbs, fingers fused together, static image, cluttered background, three legs, many people in the background, walking backwards". If it hadn't been in the official workflow, I would already consider this unacceptable prompt engineering.
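
For anyone who wants to repeat the runs, it may help to keep the per-checkpoint settings as data rather than in one's head. The following is only a sketch of how the settings listed above could be recorded; the names RunSettings, LARGE_FAMILIES and resolution are made up for illustration and are not part of ComfyUI, and the "family" label is my own shorthand for the base model. Only four of the checkpoints are written out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSettings:
    """Settings for one benchmark run (hypothetical helper, not a ComfyUI type)."""
    checkpoint: str
    family: str       # base-model family, used here only to pick the image size
    sampler: str
    scheduler: str
    cfg: float
    steps: int
    negative_prompt: str = "text, watermark"  # ComfyUI default, as stated in the text
    seed: int = 0                             # fixed seed used for all images


# Families rendered at 1024x1024 according to the text; all others used 512x512.
LARGE_FAMILIES = {"SDXL", "SD3.5"}


def resolution(settings: RunSettings) -> tuple[int, int]:
    """Return (width, height) following the size rule described above."""
    side = 1024 if settings.family in LARGE_FAMILIES else 512
    return (side, side)


RUNS = [
    RunSettings("Hephaistos_NextGENXL v2.0", "SDXL", "DPM++ SDE", "Karras", 3.0, 12),
    RunSettings("Juggernaut XL version XI", "SDXL", "DPM++ 2M SDE", "Normal", 6.0, 40),
    RunSettings("Stable Diffusion 3.5 Medium GGUF Q8_0", "SD3.5", "DPM++ 2M", "Normal", 4.5, 20),
    RunSettings("FLUX.1 Dev GGUF Q4_0", "FLUX.1", "Euler", "Beta", 3.5, 20),
]

for run in RUNS:
    width, height = resolution(run)
    print(f"{run.checkpoint}: {width}x{height}, {run.sampler} / {run.scheduler}, "
          f"CFG {run.cfg}, {run.steps} steps, seed {run.seed}")
```

Keeping the seed and negative prompt as defaults makes any deviation (such as the Wan negative prompt) visible as an explicit override.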

I also tried the following checkpoints, but do not present their images:

SDXL Turbo: I wanted to see whether the unmodified SDXL is better than its many derivatives, but the Turbo version is just not up to par. I will mention the number of images with a "pass" rating in the results, but will not present the generated images.
Stable Diffusion 3.5 Large GGUF Q4_0: It is apparently incapable of producing decent output at 512x512, and at 1024x1024 it is too slow for me. SD 3.5 Medium already takes almost 2 hours for the 100 benchmark prompts, so SD 3.5 Large would have taken many hours. The couple of test images I created with it also didn't look very impressive. It is possible I used the wrong settings, but I used the official ComfyUI workflow recommended by Stability AI and then retried it with several small changes.
Wan 2.1 T2V 1.3B: CFG 7.0, 20 steps of Euler Beta at resolution 1024x1024, with the same Chinese negative prompt as for Wan 2.1 T2V 14B above. This text-to-video model can produce single images when run in single-frame mode (i.e. you input a latent video with just 1 frame), but it is not a general-purpose model: it produces no recognizable output whatsoever for some prompts and distorted or extremely simplistic output for many more. I will not present the resulting images, but I'll mention the number of images that receive a "pass" rating in the results section. Changing the negative prompt produces a few additional images with a "pass" rating, but I ruled out any prompt engineering, so I will not count them.
Wan 2.1 T2V 1.3B SelfForcing DMD FP8 E4M3FN: CFG 1.0, 4 steps of LCM Simple. I tried it just to give another checkpoint of this model a chance, but the results were very similar.
Wan 2.2 TI2V 5B GGUF Q6_K: CFG 5.0, 30 steps of Uni PC Simple at resolution 1280x1280, with the same negative prompt, run in single-frame mode. Yet another video model. I will mention the number of images that receive a "pass" rating.

Nota bene: Images that turned out "not safe for work" (NSFW) were replaced with an NSFW placeholder for display here, but were used unaltered for the CLIP similarity calculation. Evaluating the prompts was greatly eased by the "Text Multiline" and "Text Load Line From File" nodes from https://github.com/WASasquatch/was-node-suite-comfyui, and I recommend using them if you want to evaluate the prompts on your favorite checkpoint. Careful, the linked image is humongous.

Discussing the images

It is pretty clear that not all concepts are understood by all checkpoints, and also that the checkpoints are quite close in their capabilities. There are many concepts that are understood either by none or by all the checkpoints. This points to a lack of diversity in the training data. It is also obvious that the SDXL based checkpoints have a much broader understanding than the SD1.5 based checkpoints, no surprises there.

What might be surprising is how similar the outputs of the SD1.5 checkpoints look. One might expect more variety, given how many checkpoints were produced. Even different SDXL checkpoints like Hephaistos and Juggernaut fail in very similar ways on certain prompts, even though both are described as being retrained, not just the result of LoRA merges. This could indicate that even intensely trained checkpoints remain beholden to the limitations of their base models. Part of the similarity can of course be explained by the same seed being used for all images, but other details cannot. For example, both Hephaistos and Juggernaut paint the "old woman in her living room" barefoot, even though that was not mentioned in the prompt. Details like that are too small to be controlled by the initial noise distribution. In the "shark chasing a goldfish" image, both checkpoints misinterpret the prompt in the same way while varying the placement of the fish, which can't result from the initial noise pattern alone.

It is also obvious that prompts describing interactions between multiple objects or people are difficult for all SD1.5 & SDXL checkpoints. The same is true for prompts describing a specific number of people or things, or text that is supposed to appear in the image, but those limitations are already widely known. Not all combinations of concepts seem to be problematic, though, as can be seen in the "elephant in latex" image. Finally, all SD1.5 & SDXL checkpoints seem to fail badly on prompts describing humans in "unusual" poses like a handstand. I can only assume that images depicting humans in these poses were very rare in the training data.

For some reason, models after SDXL seem to have lost their understanding of painters' styles. FLUX.1 fails at all of them except maybe Van Gogh and watercolor. Qwen manages a weak Van Gogh, Monet and watercolor; only its Seurat imitation is decent. Wan fails at all of them. Juggernaut and Hephaistos are the best at imitating painters' styles, though far from perfect. Maybe the base models all lack this ability (removed on purpose?) and only community effort restored it.

The outputs of Qwen and Wan sometimes look undetailed and noisy. This is likely due to the extreme quantization I used, the reduced number of steps and possibly the low resolution. I recreated some of the uglier Qwen images with a higher number of steps and confirmed that the output quality improves markedly for those images. Where the Qwen image shown above would have received a "fail" rating due to ugliness, I evaluated the improved output instead. In contrast to Qwen, Wan did not produce noticeably better output with a higher number of steps.

The following prompts are misunderstood by all tested checkpoints: the camel skeleton, the container ship (all images lack the harbor), the Star Trek scene, Dickbutt, the centaur, the angel, the Go board (Qwen manages only a very rough approximation), the Mandelbrot-like fractal, the mantis, the hobbit-carrying eagle, the Cacodemon. A few more prompts are understood by only one or two checkpoints: the raised eyebrow, the martini, the handstand, the bomber plane, the elephant in latex, the Star Wars poster, the ladybug, the frog on tortoise, the shark, the chibi character, the crashlanding space ship, the superhero promo.

Numerical Results

I present here how many prompts achieve a "pass" rating according to my subjective evaluation. For an image to pass, there can't be glaringly obvious mistakes, including deviations from the prompt, nor any unwarranted additions. In all cases, the criterion for "fail" is that a non-expert viewer using just common sense would be able to tell without a second glance that the image is wrong in a manner that wouldn't appear in an image produced by an expert human artist.

epiCRealism Natural Sin RC1 VAE: 36 passes
DreamShaper 8: 39 passes
Hephaistos_NextGENXL v2.0: 62 passes
Juggernaut XL version XI: 66 passes
SDXL 4GB/2GB: 49 passes
CyberRealistic CyberIllustrious v7.0: 25 passes
Pony Realism v2.2 Hyper 8 steps: 21 passes
Stable Diffusion 3.5 Medium GGUF Q8_0: 51 passes
FLUX.1 Dev GGUF Q4_0: 63 passes
QWEN GGUF Q2_K: 71 passes
Wan 2.1 T2V 14B Q2_K: 43 passes

Because my evaluation of prompt following is subjective, I also ran all images and prompts through OpenCLIP. I got the following average cosine similarities between prompts and corresponding images:

epiCRealism Natural Sin RC1 VAE: 0.353
DreamShaper 8: 0.366
PVC Figurerizer V2.0: 0.276
Hephaistos_NextGENXL v2.0: 0.387
Juggernaut XL version XI: 0.387
SDXL 4GB/2GB: 0.374
CyberRealistic CyberIllustrious v7.0: 0.327
Pony Realism v2.2 Hyper 8 steps: 0.304
Stable Diffusion 3.5 Medium GGUF Q8_0: 0.375
FLUX.1 Dev GGUF Q4_0: 0.366
QWEN GGUF Q2_K: 0.381
Wan 2.1 T2V 14B Q2_K: 0.338

Because of OpenCLIP's limitations in judging prompt adherence, I also evaluated prompt adherence using ImageReward. This model was trained specifically for the purpose of evaluating prompt adherence and produced the following averages for the evaluated checkpoints:

epiCRealism Natural Sin RC1 VAE: 0.687
DreamShaper 8: 0.840
PVC Figurerizer V2.0: -0.823
Hephaistos_NextGENXL v2.0: 1.24
Juggernaut XL version XI: 1.26
SDXL 4GB/2GB: 0.877
CyberRealistic CyberIllustrious v7.0: 0.410
Pony Realism v2.2 Hyper 8 steps: 0.086
Stable Diffusion 3.5 Medium GGUF Q8_0: 0.997
FLUX.1 Dev GGUF Q4_0: 1.12
QWEN GGUF Q2_K: 1.13
Wan 2.1 T2V 14B Q2_K: 0.698

Numerical Results

I present here how many prompts achieve a "pass" rating according to my subjective evaluation. For an image to pass, there can't be glaringly obvious mistakes, including deviations from the prompt, nor any unwarranted additions. What counts as a mistake depends on the prompt.

In the handshake image, the focus is on the hands; it was included specifically to test how well the checkpoint can draw hands (a well-known shortcoming of older checkpoints). Consequently, an obviously incorrect number of fingers produces a "fail" rating.

In the second world war bomber image, a "pass" can be achieved if the result looks superficially like a bomber plane from that period. It is not necessary for every nut and bolt to be in the right place, because a non-expert wouldn't immediately notice that mistake. If, on the other hand, the propellers were attached to the wingtips or the plane lacked a cockpit, that would be a "fail". In particular, the plane doesn't have to exactly match a real bomber type; it is enough to look like a plane that could have existed back then.

For the chess image, a "pass" requires a checkered board with 8 by 8 squares and black and white pieces, at least superficially arranged as in a real chess game, namely most white pieces on the side of one player and most black pieces on the opponent's side.

When a prompt specifies an exact number, like "three people sitting around a table", that number of objects must be present. But when it specifies just "an elephant", it is not a "fail" to include another elephant in the background, as long as one elephant dominates the image. A group of 10 elephants would be a "fail" for that prompt.

In all cases, the criterion for "fail" is that a non-expert viewer using just common sense would be able to tell without a second glance that the image is wrong in a manner that wouldn't appear in an image produced by an expert human artist.

epiCRealism Natural Sin RC1 VAE: 36 passes
DreamShaper 8: 39 passes
Hephaistos_NextGENXL v2.0: 62 passes
Juggernaut XL version XI: 66 passes
SDXL 4GB/2GB: 49 passes
SDXL Turbo: 50 passes
CyberRealistic CyberIllustrious v7.0: 25 passes
Pony Realism v2.2 Hyper 8 steps: 21 passes
Stable Diffusion 3.5 Medium GGUF Q8_0: 51 passes
FLUX.1 Dev GGUF Q4_0: 63 passes
QWEN GGUF Q2_K: 71 passes
Wan 2.1 T2V 1.3B: 25 passes
Wan 2.1 T2V 14B Q2_K: 43 passes
Wan 2.2 TI2V 5B GGUF Q6_K: 26 passes

Because my evaluation of prompt following is subjective, I also ran all images and prompts through OpenCLIP (ViT-H-14-378-quickgelu to be precise). After losing a few hours hunting down a bug of my own making, I calculated the average cosine similarity between each image and the corresponding prompt and got the following numbers:

epiCRealism Natural Sin RC1 VAE: 0.353
DreamShaper 8: 0.366
PVC Figurerizer V2.0: 0.276
Hephaistos_NextGENXL v2.0: 0.387
Juggernaut XL version XI: 0.387
SDXL 4GB/2GB: 0.374
CyberRealistic CyberIllustrious v7.0: 0.327
Pony Realism v2.2 Hyper 8 steps: 0.304
Stable Diffusion 3.5 Medium GGUF Q8_0: 0.375
FLUX.1 Dev GGUF Q4_0: 0.366
QWEN GGUF Q2_K: 0.381
Wan 2.1 T2V 14B Q2_K: 0.338
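
The averaging step behind these numbers can be sketched as follows. This is a minimal illustration, not the script that was actually used: it assumes the image and prompt embeddings have already been obtained from OpenCLIP, and the function names and the toy two-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_prompt_similarity(image_embeddings, text_embeddings):
    """Average image-to-prompt similarity for one checkpoint.

    image_embeddings[i] and text_embeddings[i] must belong to the same prompt.
    """
    if len(image_embeddings) != len(text_embeddings):
        raise ValueError("need exactly one image embedding per prompt")
    sims = [cosine_similarity(img, txt)
            for img, txt in zip(image_embeddings, text_embeddings)]
    return sum(sims) / len(sims)


# Toy embeddings (hypothetical; real embeddings have far more dimensions).
toy_images = [[1.0, 0.0], [3.0, 4.0]]
toy_texts = [[1.0, 0.0], [4.0, 3.0]]
score = mean_prompt_similarity(toy_images, toy_texts)
print(f"{score:.3f}")
```

In the toy data the first pair is identical (similarity 1.0) and the second pair has similarity 24/25 = 0.96, so the average is 0.98.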

A few caveats are necessary: OpenCLIP is not capable of understanding all the small details I'm looking for in an image, in part because it was trained on real images, not AI-generated images. That means it won't pay attention to hands with the wrong number of fingers and other artifacts. Its input is also resized to 378x378, which reduces detail and introduces a bit of blur. And of course it won't know pop culture concepts that were not contained in its training data, just as with the image generation models. That means it might never have seen a "cacodemon from doom" and thus can't evaluate how closely the image follows that prompt. OpenCLIP is probably not capable of distinguishing between individual painters' styles either. And OpenCLIP was not trained explicitly to evaluate prompt following, although its training task of matching images to their descriptions is similar.

Because OpenCLIP can be used to compare images to each other, not just to prompts, I have also used it to compare all tested checkpoints to each other. In the table below, for each pair of checkpoints, the cosine similarity of their output images for the same prompt is calculated and then averaged over all prompts to obtain an average cosine similarity for that pair of checkpoints. The idea is that some checkpoints are derived from each other and obviously produce very similar outputs (see above), so the cosine similarity of such related checkpoints would be expected to be high. And indeed it is:

-	dreamshaper	epicrealism	pvcfigurerizer	juggernaut	hephaistos	2gbxl	cyberillustrious	pony	sd35m	flux	qwen	wan
dreamshaper	1.0	0.817	0.640	0.778	0.781	0.789	0.710	0.669	0.764	0.778	0.767	0.746
epicrealism	0.817	1.0	0.600	0.749	0.757	0.758	0.700	0.654	0.735	0.733	0.734	0.684
pvcfigurerizer	0.640	0.600	1.0	0.572	0.575	0.565	0.563	0.548	0.575	0.595	0.588	0.572
juggernaut	0.778	0.749	0.572	1.0	0.906	0.864	0.756	0.687	0.839	0.795	0.809	0.735
hephaistos	0.781	0.757	0.575	0.906	1.0	0.868	0.765	0.688	0.828	0.796	0.802	0.727
2gbxl	0.789	0.758	0.565	0.864	0.868	1.0	0.778	0.700	0.799	0.770	0.785	0.715
cyberillustrious	0.710	0.700	0.563	0.756	0.765	0.778	1.0	0.721	0.727	0.715	0.709	0.664
pony	0.669	0.654	0.548	0.687	0.688	0.700	0.721	1.0	0.666	0.671	0.657	0.639
sd35m	0.764	0.735	0.575	0.839	0.828	0.799	0.727	0.666	1.0	0.791	0.794	0.736
flux	0.778	0.733	0.595	0.795	0.796	0.770	0.715	0.671	0.791	1.0	0.833	0.797
qwen	0.767	0.734	0.588	0.809	0.802	0.785	0.709	0.657	0.794	0.833	1.0	0.769
wan	0.746	0.684	0.572	0.735	0.727	0.715	0.664	0.639	0.736	0.797	0.769	1.0
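
The construction of such a table can be sketched like this. It is only an illustration under assumptions: the image embeddings are taken as given (one per prompt and checkpoint, in the same prompt order), and the checkpoint names, function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pairwise_checkpoint_similarity(embeddings):
    """Build a symmetric similarity table.

    `embeddings` maps a checkpoint name to a list of image embeddings,
    one per prompt, with the same prompt order for every checkpoint.
    """
    names = sorted(embeddings)
    table = {(name, name): 1.0 for name in names}  # diagonal, as in the table above
    for a, b in combinations(names, 2):
        if len(embeddings[a]) != len(embeddings[b]):
            raise ValueError("all checkpoints need the same number of prompts")
        sims = [cosine_similarity(x, y) for x, y in zip(embeddings[a], embeddings[b])]
        table[(a, b)] = table[(b, a)] = sum(sims) / len(sims)
    return table


# Toy data: three hypothetical checkpoints, two prompts each.
toy = {
    "ckpt_a": [[1.0, 0.0], [0.0, 1.0]],
    "ckpt_b": [[1.0, 0.0], [0.0, 2.0]],
    "ckpt_c": [[0.0, 1.0], [3.0, 4.0]],
}
table = pairwise_checkpoint_similarity(toy)
print(table[("ckpt_a", "ckpt_b")], table[("ckpt_a", "ckpt_c")])
```

Because cosine similarity ignores vector length, `ckpt_a` and `ckpt_b` come out as identical here, while `ckpt_c` agrees with them only partly on the second prompt.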

A few clusters become apparent. Dreamshaper clusters with Epicrealism, because both are based on SD1.5. Juggernaut clusters with Hephaistos and 2GB XL, because all three are based on SDXL. The apparent cluster consisting of FLUX.1 and Qwen is spurious, because these models are not related to each other. The high similarity might be caused by both being trained on similar data or by stylistic similarities (all models of course attempt to look more realistic and less artificial, and some succeed better at this than older models). If models get better and better, I would expect them to converge to higher similarity to each other (i.e. stronger apparent clustering) simply due to improved prompt following. Once models achieve 100% accuracy in prompt following, their output differences would be mainly stylistic.
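
One simple way to read these clusters off the table is to look up, for each checkpoint, the other checkpoint with the highest similarity. The sketch below does that with the values copied from the table above; the nearest-neighbour lookup is my illustration, not necessarily how the clusters were originally identified.

```python
names = ["dreamshaper", "epicrealism", "pvcfigurerizer", "juggernaut", "hephaistos",
         "2gbxl", "cyberillustrious", "pony", "sd35m", "flux", "qwen", "wan"]

# Rows and columns follow the order of `names` (values from the table above).
matrix = [
    [1.0, 0.817, 0.640, 0.778, 0.781, 0.789, 0.710, 0.669, 0.764, 0.778, 0.767, 0.746],
    [0.817, 1.0, 0.600, 0.749, 0.757, 0.758, 0.700, 0.654, 0.735, 0.733, 0.734, 0.684],
    [0.640, 0.600, 1.0, 0.572, 0.575, 0.565, 0.563, 0.548, 0.575, 0.595, 0.588, 0.572],
    [0.778, 0.749, 0.572, 1.0, 0.906, 0.864, 0.756, 0.687, 0.839, 0.795, 0.809, 0.735],
    [0.781, 0.757, 0.575, 0.906, 1.0, 0.868, 0.765, 0.688, 0.828, 0.796, 0.802, 0.727],
    [0.789, 0.758, 0.565, 0.864, 0.868, 1.0, 0.778, 0.700, 0.799, 0.770, 0.785, 0.715],
    [0.710, 0.700, 0.563, 0.756, 0.765, 0.778, 1.0, 0.721, 0.727, 0.715, 0.709, 0.664],
    [0.669, 0.654, 0.548, 0.687, 0.688, 0.700, 0.721, 1.0, 0.666, 0.671, 0.657, 0.639],
    [0.764, 0.735, 0.575, 0.839, 0.828, 0.799, 0.727, 0.666, 1.0, 0.791, 0.794, 0.736],
    [0.778, 0.733, 0.595, 0.795, 0.796, 0.770, 0.715, 0.671, 0.791, 1.0, 0.833, 0.797],
    [0.767, 0.734, 0.588, 0.809, 0.802, 0.785, 0.709, 0.657, 0.794, 0.833, 1.0, 0.769],
    [0.746, 0.684, 0.572, 0.735, 0.727, 0.715, 0.664, 0.639, 0.736, 0.797, 0.769, 1.0],
]


def nearest_neighbour(index):
    """Return (name, similarity) of the most similar *other* checkpoint."""
    row = matrix[index]
    best = max((j for j in range(len(names)) if j != index), key=lambda j: row[j])
    return names[best], row[best]


neighbours = {name: nearest_neighbour(i) for i, name in enumerate(names)}
for name, (other, sim) in neighbours.items():
    print(f"{name:>16} -> {other} ({sim:.3f})")
```

Dreamshaper and Epicrealism, Juggernaut and Hephaistos, and FLUX.1 and Qwen are each other's nearest neighbours, and the nearest neighbour of 2GB XL is Hephaistos, which matches the clusters described above.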

ImageReward scores for the tested checkpoints:

epiCRealism Natural Sin RC1 VAE: 0.687
DreamShaper 8: 0.840
PVC Figurerizer V2.0: -0.823
Hephaistos_NextGENXL v2.0: 1.24
Juggernaut XL version XI: 1.26
SDXL 4GB/2GB: 0.877
CyberRealistic CyberIllustrious v7.0: 0.410
Pony Realism v2.2 Hyper 8 steps: 0.086
Stable Diffusion 3.5 Medium GGUF Q8_0: 0.997
FLUX.1 Dev GGUF Q4_0: 1.12
QWEN GGUF Q2_K: 1.13
Wan 2.1 T2V 14B Q2_K: 0.698

Caveats: ImageReward is already a bit old and was apparently trained on data that is older than SD1.5. That makes its ability to evaluate newer models questionable. It does assign higher values to the usual suspects than to the SD1.5 models, but assigning almost the same value to FLUX.1 and Qwen is not appropriate IMHO. Qwen is the only tested model that gets the raised eyebrow, the chess players, the Star Wars poster, the ladybug and the chibi character right, the only model that has seen a Go board before, and one of only two models that get the handstand, the frog on tortoise and the superhero promo right. This should result in a significant difference between the scores of FLUX.1 and Qwen. Assigning the highest scores to Hephaistos and Juggernaut is also nonsense and is probably due to SDXL-based models producing images that are much closer to what ImageReward has seen during training. In particular, the ImageReward training data contains no real-world photos AFAIK, which might make evaluating realistic-looking output difficult. Looking at individual images, not just at averages, also confirms to me that ImageReward is sometimes way off the mark. But the general expectations are confirmed: SD1.5-based models score lower than SDXL-based models, PVC Figurerizer doesn't follow prompts at all, and Pony and CyberIllustrious produce too much NSFW (not safe for work) output and thus score lower than Hephaistos and Juggernaut.

Conclusion

Based on the performance of the tested checkpoints on the benchmark prompts, I must conclude that open-source image generation models usable on affordable consumer hardware are not yet good enough for general usage. You can get great images with LoRAs, prompt engineering, rerunning the generation with many different random seeds and other helpers, but without that additional effort you are unlikely to get what you want on the first try. I have asked other users and they told me that they usually generate over 100 images before they get one they find worthy of publishing, and that matches my experience. You can generate an image in <1 minute, but if only every 100th image is usable, the effective generation time is on the order of an hour. Larger models like Qwen, Wan 2.2, FLUX.2 and so on are probably much better, but they're also close to unusable on affordable consumer GPUs (graphics processing units). And an RTX 5090 is simply far outside the budget of most people. If you can afford that kind of GPU, you might as well just pay a monthly subscription for one of the many commercial image generation services, which remove much of the hassle. Maybe 32GB GPUs will eventually come down in price, but currently it doesn't look like it.
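
The "order of an hour" estimate is simple arithmetic, sketched below. The 36 seconds per image is an assumed example value (the text only says less than one minute), and the function name is made up for illustration.

```python
def expected_seconds_per_usable_image(seconds_per_image, images_per_usable):
    """Expected wall-clock time until the first usable image.

    If on average one image in `images_per_usable` is usable and attempts are
    independent, the expected number of attempts is `images_per_usable`
    (the mean of a geometric distribution).
    """
    return seconds_per_image * images_per_usable


# Assumed example: 36 s per image, one usable image per 100 generated.
seconds = expected_seconds_per_usable_image(seconds_per_image=36, images_per_usable=100)
minutes = seconds / 60
print(f"{minutes:.0f} minutes per usable image")
```

With these assumed numbers the result is 60 minutes; at the upper bound of one minute per image it would be 100 minutes.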

Apart from that, I hope it is obvious that the benchmark prompts have become useless for benchmarking, now that they are published. Every new checkpoint coming out in the future can be finetuned to produce optimal output for these benchmark prompts while not improving in general. That is the fate of every benchmark. So future checkpoints should be evaluated with a different set of benchmark prompts, ideally with both a public and a private set, to detect deliberate attempts to manipulate benchmark performance.

The future

If you want to see a better comparison (e.g. using unquantized versions of Qwen, Wan and the other checkpoints), giving me access to a server with enough VRAM, RAM (random access memory) and disk space would help. I can't really afford to pay thousands of dollars for the big boi GPU, and even just renting such a GPU for a few days is expensive. Running the benchmark prompts with just one checkpoint using unrestrictive settings (high resolution, high number of steps) takes many hours for the larger checkpoints.
