Hoping to replace the cherry-picked presentations of image-generation checkpoints on Civitai and elsewhere with something more objective, I created 100 benchmark prompts and tested them on many checkpoints. Their purpose is not to measure speed, but to compare output quality and prompt adherence.
As can be seen, there are many concepts that are understood either by none or by all the checkpoints. This points to a lack of diversity in the training data.
How many images achieve a "pass" rating based on my subjective evaluation:
I must conclude that open-source image generation models usable on affordable consumer hardware are not yet good enough for general usage.
On Civitai and other places, new variants of Stable Diffusion 1.5, Stable Diffusion XL, Stable Diffusion 3.x, FLUX.1 and more are published daily, and it is difficult to keep track of them all and choose one that suits you. The creators of these variants, usually called checkpoints, typically try to present their creations in the best possible light, often cherry-picking images created with their checkpoints. Either there is no clear guideline for how checkpoints should be presented or the guidelines are not followed, which leads to widely varying styles and qualities of presentation. If you don't trust these presentations and want to verify the quality of a checkpoint, you have to log in, download the checkpoint and create images with it using many different prompts, which can take hours. Given the vast number of checkpoints available, comparing all of them is a hopeless task.
I don't like that situation, and because I couldn't find good comparisons of checkpoints, I had to download and test dozens of them myself. In the hope of improving the situation, I created 100 benchmark prompts and generated images from these prompts with many of the checkpoints I had downloaded. The purpose is not to measure speed, but to compare the quality of the output and the prompt adherence. I chose the prompts to test many concepts that a well-rounded checkpoint ideally should understand and be able to paint.
These prompts should test the understanding of many different concepts and avoid prompt engineering and anything that would give an advantage to any one particular checkpoint, even if that means not being able to tease the best out of every checkpoint:
Tested concepts are: single humans; groups of humans; different ages; men & women; different ethnicities; animals; landscapes; foods & drinks; machinery; pop culture elements; celebrities; art styles; photography terminology; different body postures; facial expressions; memes. The prompts are intended to test the limits of the checkpoint's understanding, and it is not expected that all prompts will be understood correctly.
It is clear that the prompts I chose are not capable of teasing the top performance out of every checkpoint. Many users have shown that with enough prompt engineering, stunning images can be produced. But I do not think that such prompts should be the basis of a benchmark. First, the benchmark prompts should represent the level of knowledge of a beginner. Second, highly engineered prompts are unlikely to be transferable to other checkpoints. Third, the longer a prompt, the more likely it is to produce an image that is the result of overfitting and thus not representative of the general performance of the checkpoint. Consequently, I think benchmark prompts should be concise, easy to understand and transferable to many different checkpoints, not giving any particular checkpoint an unfair advantage. It is of course impossible to find prompts that fairly evaluate all possible checkpoints, and thus some checkpoints expecting a special prompting style will be disadvantaged by these benchmark prompts.
Here are the prompts I tested. I release them into the public domain and encourage everyone to use them (or something similar) for evaluating checkpoints:
I tested these prompts on the following checkpoints. All images were generated using ComfyUI, a fixed seed of 0, a size of 1024x1024 for SDXL and SD3.5 based checkpoints, 512x512 for all other checkpoints. The negative prompt is "text, watermark":
I tested these prompts on the following checkpoints. There is no cherry-picking: if the output is crap, that's what will be shown. All images were generated using ComfyUI (commit 0963493a9c3b6565f8537288a0fb90991391ec41 and possibly some other commits; unfortunately I didn't keep track of all details), a fixed seed of 0, a size of 1024x1024 for Stable Diffusion XL and Stable Diffusion 3.5 based checkpoints, 512x512 for all other checkpoints (due to the limited VRAM of my GPU) and otherwise the settings stated next to the checkpoint. The negative prompt is ComfyUI's default "text, watermark" unless stated otherwise:
I also could have tested the following checkpoints, but did not:
There are many concepts that are understood either by none or by all the checkpoints. This points to a lack of diversity in the training data. Also, for some reason models after SDXL seem to have lost their understanding of painters' styles. Maybe the base models all lack this ability and only community effort restored it.
Prompts misunderstood by all checkpoints: the camel skeleton, the container ship, the Star Trek scene, dick butt, the centaur, the angel, the Go board, the Mandelbrot-like fractal, the mantis, the hobbit-carrying eagle, the Cacodemon.
Prompts understood only by one or two checkpoints: the raised eyebrow, the martini, the handstand, the bomber plane, the elephant in latex, the Star Wars poster, the ladybug, the frog on tortoise, the shark, the chibi character, the crashlanding space ship, the superhero promo.
Nota bene: Images that turned out "not safe for work" (NSFW) were replaced with an NSFW placeholder for display here, but are used unaltered for the CLIP similarity calculation. The evaluation of prompts was greatly eased by the "Text Multiline" and "Text Load Line From File" nodes from https://github.com/WASasquatch/was-node-suite-comfyui and I recommend using them if you want to evaluate the prompts on your favorite checkpoint. Careful, the linked image is humongous.
It is pretty clear that not all concepts are understood by all checkpoints and also that the checkpoints are quite close in their capabilities. There are many concepts that are understood either by none or by all the checkpoints. This points to a lack of diversity in the training data. It is also obvious that the SDXL-based checkpoints have a much broader understanding than the SD1.5-based checkpoints, no surprises there.
What might be surprising is how similar the outputs of the SD1.5 checkpoints look. One might expect more variety, given how many checkpoints were produced. And even different SDXL checkpoints like Hephaistos and Juggernaut fail in very similar ways on certain prompts, even though both are described as being retrained, not just the result of LoRA merges. This could indicate that even intensely trained checkpoints remain beholden to the limitations of their base models.
Part of the similarity can of course be explained by the same seed being used for all images, but other details cannot. For example, both Hephaistos and Juggernaut paint the "old woman in her living room" as barefoot, even though that was not mentioned in the prompt. Details like that are too small to be controlled by the initial noise distribution. In the "shark chasing a goldfish" image, both checkpoints misinterpret the prompt in the same way while varying the placement of the fish, which can't result from the initial noise pattern alone.
It is also obvious that prompts describing interactions between multiple objects/people are difficult for all SD1.5 & SDXL checkpoints. The same is true for prompts describing a specific number of people/things or text that is supposed to appear in the image, but those limitations are already widely known. But not all combinations of concepts seem to be problematic, as can be seen in the "elephant in latex" image. Finally, all SD1.5 & SDXL checkpoints seem to fail hard for prompts describing humans in "unusual" poses like a handstand. I can only assume that images depicting humans in these poses were very rare in the training data.
For some reason, models after SDXL seem to have lost their understanding of painters' styles. FLUX.1 fails at all of them except maybe Van Gogh and watercolor. Qwen manages a weak Van Gogh, Monet and watercolor; only its Seurat imitation is decent. Wan fails at all of them. Juggernaut and Hephaistos are the best at imitating painters' styles, though far from perfect. Maybe the base models all lack this ability (removed on purpose?) and only community effort restored it.
The outputs of Qwen and Wan sometimes look undetailed and noisy. This is likely due to the extreme quantization I used, the reduced number of steps and possibly the low resolution. I recreated some of the uglier Qwen images with a higher number of steps and confirmed that the output quality improves markedly for those images. Where the Qwen output image shown above would have received a "fail" rating due to ugliness, I evaluated the improved output instead. In contrast to Qwen, Wan did not produce noticeably better output with a higher number of steps.
The following prompts are misunderstood by all tested checkpoints: the camel skeleton, the container ship (all images lack the harbor), the Star Trek scene, dick butt, the centaur, the angel, the Go board (Qwen manages only a very rough approximation), the Mandelbrot-like fractal, the mantis, the hobbit-carrying eagle, the Cacodemon. A few more prompts are understood only by one or two checkpoints: the raised eyebrow, the martini, the handstand, the bomber plane, the elephant in latex, the Star Wars poster, the ladybug, the frog on tortoise, the shark, the chibi character, the crashlanding space ship, the superhero promo.
I present here how many prompts achieve a "pass" rating according to my subjective evaluation. For an image to pass, there can't be glaringly obvious mistakes, including deviations from the prompt, nor any unwarranted additions. In all cases, the criterion for "fail" is that a non-expert viewer using just common sense would be able to tell without a second glance that the image is wrong in a manner that wouldn't appear in an image produced by an expert human artist.
Because my evaluation of prompt following is subjective, I also ran all images and prompts through OpenCLIP. I got the following average cosine similarities between prompts and corresponding images:
Because of OpenCLIP's limitations in judging prompt adherence, I also evaluated prompt adherence using ImageReward. This model was trained specifically for the purpose of evaluating prompt adherence and produced the following averages for the evaluated checkpoints:
I present here how many prompts achieve a "pass" rating according to my subjective evaluation. For an image to pass, there can't be glaringly obvious mistakes, including deviations from the prompt, nor any unwarranted additions. What counts as a mistake depends on the prompt.
In the handshake image, the focus is on the hands, and it was included specifically to test how well the checkpoint can draw hands (which was a well-known shortcoming of older checkpoints). Consequently, an obviously incorrect number of fingers produces a "fail" rating.
In the second world war bomber image, a "pass" can be achieved if the result looks superficially like a bomber plane from that period. It is not necessary for every nut and bolt to be in the right place, because a non-expert wouldn't immediately notice that mistake. If on the other hand the propellers were attached to the wingtips or the plane lacked a cockpit, that would be a "fail". In particular, the plane doesn't have to exactly match a real bomber type; it is enough to look like a plane that could have existed back then.
For the chess image, a "pass" requires a checkered board with 8 times 8 squares and black and white pieces, at least superficially arranged like in a real chess game, namely most white pieces on the side of one player, most black pieces on the opponent's side.
When a prompt specifies an exact number, like "three people sitting around a table", that number of objects must be present. But when it specifies just "an elephant", it is not a "fail" to include another elephant in the background, as long as one elephant dominates the image. A group of 10 elephants would be a "fail" for that prompt.
In all cases, the criterion for "fail" is that a non-expert viewer using just common sense would be able to tell without a second glance that the image is wrong in a manner that wouldn't appear in an image produced by an expert human artist.
Because my evaluation of prompt following is subjective, I also ran all images and prompts through OpenCLIP (ViT-H-14-378-quickgelu to be precise). After wasting a few hours hunting a stupid bug of my own making, I calculated the average cosine similarity between each image and the corresponding prompt and got the following numbers:
A few caveats are necessary: OpenCLIP is not capable of understanding all the small details I'm looking for in an image, in part because it was trained on real images, not AI-generated images. That means it won't pay attention to hands with the wrong number of fingers and other artifacts. Its input is also resized to 378x378, reducing the detail and introducing a bit of blur due to the resizing. And of course it won't know pop culture concepts that were not contained in its training data, just like the image generation models. That means it might never have seen a "cacodemon from doom" and thus can't evaluate how closely the image follows that prompt. OpenCLIP is probably not capable of distinguishing between individual painters' styles either. And OpenCLIP was not trained explicitly to evaluate prompt following, although its training task of matching images to their descriptions is similar.
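The averaging step described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the script used for the numbers reported here: the function names are made up, and the short toy vectors stand in for the prompt and image embeddings that OpenCLIP would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def average_prompt_similarity(text_embeddings, image_embeddings):
    """Mean cosine similarity over aligned (prompt, image) embedding pairs."""
    if not text_embeddings or len(text_embeddings) != len(image_embeddings):
        raise ValueError("need exactly one image embedding per prompt embedding")
    scores = [
        cosine_similarity(text, image)
        for text, image in zip(text_embeddings, image_embeddings)
    ]
    return sum(scores) / len(scores)


# Toy 3-dimensional vectors (hypothetical), not real OpenCLIP embeddings.
toy_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_images = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(round(average_prompt_similarity(toy_text, toy_images), 3))  # prints 0.854
```

In a real run, each checkpoint would get one such average over all 100 prompt and image pairs.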
Because OpenCLIP can be used to compare images to each other, not just to prompts, I have also used it to compare all tested checkpoints to each other. In the table below, for each pair of checkpoints, the cosine similarity of their output images for the same prompt is calculated and then averaged over all prompts to calculate an average cosine similarity for that pair of checkpoints. The idea is that some checkpoints are derived from each other and are obviously producing very similar outputs (see above), so the cosine similarity of such related checkpoints would be expected to be high. And it is indeed:
| checkpoint | dreamshaper | epicrealism | pvcfigurerizer | juggernaut | hephaistos | 2gbxl | cyberillustrious | pony | sd35m | flux | qwen | wan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dreamshaper | 1.0 | 0.817 | 0.640 | 0.778 | 0.781 | 0.789 | 0.710 | 0.669 | 0.764 | 0.778 | 0.767 | 0.746 |
| epicrealism | 0.817 | 1.0 | 0.600 | 0.749 | 0.757 | 0.758 | 0.700 | 0.654 | 0.735 | 0.733 | 0.734 | 0.684 |
| pvcfigurerizer | 0.640 | 0.600 | 1.0 | 0.572 | 0.575 | 0.565 | 0.563 | 0.548 | 0.575 | 0.595 | 0.588 | 0.572 |
| juggernaut | 0.778 | 0.749 | 0.572 | 1.0 | 0.906 | 0.864 | 0.756 | 0.687 | 0.839 | 0.795 | 0.809 | 0.735 |
| hephaistos | 0.781 | 0.757 | 0.575 | 0.906 | 1.0 | 0.868 | 0.765 | 0.688 | 0.828 | 0.796 | 0.802 | 0.727 |
| 2gbxl | 0.789 | 0.758 | 0.565 | 0.864 | 0.868 | 1.0 | 0.778 | 0.700 | 0.799 | 0.770 | 0.785 | 0.715 |
| cyberillustrious | 0.710 | 0.700 | 0.563 | 0.756 | 0.765 | 0.778 | 1.0 | 0.721 | 0.727 | 0.715 | 0.709 | 0.664 |
| pony | 0.669 | 0.654 | 0.548 | 0.687 | 0.688 | 0.700 | 0.721 | 1.0 | 0.666 | 0.671 | 0.657 | 0.639 |
| sd35m | 0.764 | 0.735 | 0.575 | 0.839 | 0.828 | 0.799 | 0.727 | 0.666 | 1.0 | 0.791 | 0.794 | 0.736 |
| flux | 0.778 | 0.733 | 0.595 | 0.795 | 0.796 | 0.770 | 0.715 | 0.671 | 0.791 | 1.0 | 0.833 | 0.797 |
| qwen | 0.767 | 0.734 | 0.588 | 0.809 | 0.802 | 0.785 | 0.709 | 0.657 | 0.794 | 0.833 | 1.0 | 0.769 |
| wan | 0.746 | 0.684 | 0.572 | 0.735 | 0.727 | 0.715 | 0.664 | 0.639 | 0.736 | 0.797 | 0.769 | 1.0 |
A few clusters become apparent. Dreamshaper clusters with Epicrealism, because both are based on SD1.5. Juggernaut clusters with Hephaistos and 2GB XL, because all three are based on SDXL. The apparent cluster consisting of FLUX.1 and Qwen is spurious, because these models are not related to each other. The high similarity might be caused by both being trained on similar training data or by stylistic similarities (all models of course attempt to look more realistic and less artificial, and the newer models succeed better than the older ones). If models get better and better, I would expect them to converge to higher similarity to each other (i.e. stronger apparent clustering) simply due to improved prompt following. Once models achieve 100% accuracy in prompt following, their output differences would be mainly stylistic.
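One simple way to read such clusters off the table is to look up, for every checkpoint, the other checkpoint with the highest average similarity. The sketch below does this with the values copied from the table above; the helper names are made up for illustration, and a nearest-neighbour lookup is only a rough stand-in for a proper clustering method.

```python
NAMES = [
    "dreamshaper", "epicrealism", "pvcfigurerizer", "juggernaut", "hephaistos",
    "2gbxl", "cyberillustrious", "pony", "sd35m", "flux", "qwen", "wan",
]

# Average cosine similarities, copied row by row from the table above.
SIMILARITY = [
    [1.0, 0.817, 0.640, 0.778, 0.781, 0.789, 0.710, 0.669, 0.764, 0.778, 0.767, 0.746],
    [0.817, 1.0, 0.600, 0.749, 0.757, 0.758, 0.700, 0.654, 0.735, 0.733, 0.734, 0.684],
    [0.640, 0.600, 1.0, 0.572, 0.575, 0.565, 0.563, 0.548, 0.575, 0.595, 0.588, 0.572],
    [0.778, 0.749, 0.572, 1.0, 0.906, 0.864, 0.756, 0.687, 0.839, 0.795, 0.809, 0.735],
    [0.781, 0.757, 0.575, 0.906, 1.0, 0.868, 0.765, 0.688, 0.828, 0.796, 0.802, 0.727],
    [0.789, 0.758, 0.565, 0.864, 0.868, 1.0, 0.778, 0.700, 0.799, 0.770, 0.785, 0.715],
    [0.710, 0.700, 0.563, 0.756, 0.765, 0.778, 1.0, 0.721, 0.727, 0.715, 0.709, 0.664],
    [0.669, 0.654, 0.548, 0.687, 0.688, 0.700, 0.721, 1.0, 0.666, 0.671, 0.657, 0.639],
    [0.764, 0.735, 0.575, 0.839, 0.828, 0.799, 0.727, 0.666, 1.0, 0.791, 0.794, 0.736],
    [0.778, 0.733, 0.595, 0.795, 0.796, 0.770, 0.715, 0.671, 0.791, 1.0, 0.833, 0.797],
    [0.767, 0.734, 0.588, 0.809, 0.802, 0.785, 0.709, 0.657, 0.794, 0.833, 1.0, 0.769],
    [0.746, 0.684, 0.572, 0.735, 0.727, 0.715, 0.664, 0.639, 0.736, 0.797, 0.769, 1.0],
]


def nearest_neighbours(names, matrix):
    """Map each checkpoint to its most similar other checkpoint and the score."""
    result = {}
    for i, name in enumerate(names):
        others = [j for j in range(len(names)) if j != i]
        best = max(others, key=lambda j: matrix[i][j])
        result[name] = (names[best], matrix[i][best])
    return result


def mutual_pairs(neighbours):
    """Pairs of checkpoints that are each other's nearest neighbour."""
    pairs = set()
    for name, (other, _score) in neighbours.items():
        if neighbours[other][0] == name:
            pairs.add(tuple(sorted((name, other))))
    return sorted(pairs)


neighbours = nearest_neighbours(NAMES, SIMILARITY)
for name, (other, score) in neighbours.items():
    print(f"{name:>16} -> {other} ({score:.3f})")
print(mutual_pairs(neighbours))
```

The mutual pairs this finds are Dreamshaper with Epicrealism, Juggernaut with Hephaistos, and FLUX.1 with Qwen. The last pair shows that a high score alone says nothing about lineage.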
Because of OpenCLIP's limitations in judging prompt adherence, I also evaluated prompt adherence using ImageReward. This model was trained specifically for the purpose of evaluating prompt adherence and produced the following averages for the evaluated checkpoints:
Caveats: ImageReward is already a bit old and was apparently trained on data that predates SD1.5, which makes its ability to evaluate newer models questionable. It does rate the usual suspects higher than the SD1.5 models, but assigning almost the same value to FLUX.1 and Qwen is not appropriate IMHO. Qwen is the only tested model that gets the raised eyebrow, the chess players, the Star Wars poster, the ladybug and the chibi character right, is the only model that has seen a Go board before, and is one of only two models that get the handstand, the frog on tortoise and the superhero promo right. That should result in a significant difference between the scores of FLUX.1 and Qwen. Assigning the highest scores to Hephaistos and Juggernaut is also nonsense, probably because SDXL-based models output images that are much closer to what ImageReward saw during training. In particular, the ImageReward training data contains no real-world photos AFAIK, which might make evaluating realistic-looking output difficult. Looking at individual images rather than just averages also confirms to me that ImageReward is sometimes way off the mark. But the general expectations are confirmed: SD1.5-based models score lower than SDXL-based models, pvc figurerizer doesn't follow prompts at all, and Pony and Cyberillustrious produce too much NSFW (not safe for work) output and thus score lower than Hephaistos and Juggernaut.
Based on the performance of the tested checkpoints for the benchmark prompts, I must conclude that open-source image generation models usable on affordable consumer hardware are not yet good enough for general usage. You need LoRAs, prompt engineering and rerunning the generation with many different random seeds to produce good outputs in many cases.
Apart from that, I hope it is obvious that the benchmark prompts have become useless for benchmarking, now that they are published.
If you want to see a better comparison, giving me access to a server with enough VRAM, RAM (random access memory) and disk space would help.
Based on the performance of the tested checkpoints for the benchmark prompts, I must conclude that open-source image generation models usable on affordable consumer hardware are not yet good enough for general usage. You can get great images with LoRAs, prompt engineering, rerunning the generation with many different random seeds and other helpers, but if you avoid all of that additional effort, it is unlikely that you will get what you want on the first try. I have asked other users and they told me that they usually generate over 100 images before they get one they find worthy of publishing, and that matches my experience. You can generate an image in <1 minute, but if only every 100th image is usable, the effective generation time is on the order of an hour. Larger models like Qwen, Wan 2.2, FLUX.2 and so on are probably much better, but they're also close to unusable on affordable consumer GPUs. And an RTX 5090 is simply far outside the budget of most people. If you can afford that kind of GPU, you might as well just pay a monthly subscription for one of the many commercial image generation services, which remove much of the hassle. Maybe 32GB GPUs will eventually come down in price, but currently it doesn't look like it.
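The "order of an hour" estimate above is simple arithmetic, sketched here under the assumption that every attempt takes the same time and that on average one image in 100 is usable. The function name and parameters are made up for illustration.

```python
def minutes_per_usable_image(minutes_per_image, attempts_per_usable_image):
    """Average wall-clock minutes spent generating until one image is usable."""
    if minutes_per_image <= 0 or attempts_per_usable_image < 1:
        raise ValueError("need a positive time and at least one attempt")
    return minutes_per_image * attempts_per_usable_image


# Upper bound from the text: just under 1 minute per image, 1 usable in 100.
print(minutes_per_usable_image(1.0, 100))  # prints 100.0
# A faster setup (hypothetical): half a minute per image.
print(minutes_per_usable_image(0.5, 100))  # prints 50.0
```

At one minute per image that is 100 minutes of pure generation time, not counting the time spent looking through the results.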
Apart from that, I hope it is obvious that the benchmark prompts have become useless for benchmarking, now that they are published. Every new checkpoint coming out in the future can be finetuned to produce optimal output for these benchmark prompts while not improving in general. That is the fate of every benchmark. So future checkpoints should be evaluated with a different set of benchmark prompts, ideally with both a public and a private set, so that deliberate attempts to manipulate benchmark performance can be discovered.
If you want to see a better comparison (e.g. using unquantized versions of Qwen, Wan and the other checkpoints), giving me access to a server with enough VRAM, RAM and disk space would help. I can't really afford paying thousands of dollars for the big boi GPU, and even just renting such a GPU for a few days is expensive. Running the benchmark prompts with just one checkpoint using unrestrictive settings (high resolution, high number of steps) takes many hours for the larger checkpoints.
Written by the author; Date 04.05.2026; Updated 24.05.2026; © 2026 spinningsphinx.com