Comparison of free superresolution options


For 4x superresolution, use UltraSharpV2.

I evaluated the following 4x superresolution methods on 90 images I had lying around: Real-ESRGAN, Remacri, 4x UltraSharpV2, LDSR and StableSR. Of these methods, I think 4x UltraSharpV2 is the best.

New superresolution solutions are constantly being pushed out and I would like to just use the best one, but unfortunately these new solutions rarely come with good showcases, and I don't know of any good third-party reviews either. Most people just publish 2 or 3 images together with their products, as if that means anything.

For me to make a decision, I need comparisons with competing techniques (including the base case, a simple non-intelligent superresolution algorithm), based on a large variety of images, with both examples that present the new technique in the best possible light and failure cases.

I chose 90 images that I had lying around and applied the following algorithms to them to produce 4x the input resolution (both width & height, i.e. 16x the number of pixels):

  • None at all, i.e. the original input, or rather a point filter a.k.a. nearest-neighbor filtering
  • ImageMagick's implementation of Lanczos filtering. As close to the mathematically correct way as is possible for finite images. No artificial intelligence, but the fastest of all approaches.
  • Real-ESRGAN. The simplest of the smart algorithms. A single-step feed-forward artificial neural network specialized for this task. Reasonably fast.
  • Remacri. Real-ESRGAN retrained on different images and possibly with other hyperparameters. I couldn't find much detail about Remacri.
  • 4x UltraSharpV2. Another retraining of Real-ESRGAN.
  • LDSR, a.k.a. Latent Diffusion Super Resolution. A really slow multi-step approach that mimics how Stable Diffusion & other diffusion models work. It's also pretty chonky compared to Real-ESRGAN & siblings, which makes it even slower. You can reduce the step count to speed it up slightly, but it will never be faster than Real-ESRGAN.
  • StableSR. Similar to LDSR. Also a multi-step solution, but a bit faster than LDSR.
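The "none at all" baseline is trivial to implement yourself; here is a minimal NumPy sketch (the function name is mine) that shows what a point filter does and why 4x in each dimension means 16x the pixel count:

```python
import numpy as np

def upscale_nearest(img: np.ndarray, factor: int = 4) -> np.ndarray:
    """Point-filter (nearest-neighbor) upscaling: every source pixel
    simply becomes a factor x factor block in the output."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

# A tiny 2x3 "image": 6 pixels in, 8x12 = 96 pixels out (16x as many).
img = np.arange(6, dtype=np.uint8).reshape(2, 3)
out = upscale_nearest(img)
print(out.shape)  # (8, 12)
```

No guessing happens here, which is exactly why this baseline is useful: everything the smarter methods show beyond these blocks is invented detail.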

I didn't test SeeSR, CCSR, or SUPIR, because they were too annoying to install, had too high hardware requirements for my machine, or just didn't advertise their abilities very well.

I looked at the resulting images very closely and cut out one part (or sometimes multiple parts) from each image for comparison. This part is zoomed in to make comparison easier for you, which of course means the rest of the image isn't visible. I did this just for presentation; the superresolution methods were run on the uncropped images. If you want to compare the images in their entirety, you can download all results. I'm not the copyright holder for the images, so I can't just post a download link to the original hires images, but if you want to make your own comparison, contact me and I might be able to make the originals available to you.

And be aware: now that this comparison is published, future superresolution techniques cannot be compared using the same images, because it is possible to tune these new techniques such that they excel on this benchmark while being mediocre for everything else.

Results

Careful, the linked image is humongous. But if you want to see the differences clearly, you have to look at it. This preview loses too much detail.

Conclusion

Based on these images, you will probably understand that I think 4x superresolution UltraSharpV2 is the best of the tested methods. None of these images are NSFW, but I tested it on some NSFW images as well and can say that UltraSharpV2 beats the other methods in that area too.

If you don't see how that's relevant: ANN-based methods only excel on the data they have been trained on, so you can't expect them to produce good results on data they have not been trained on. And unfortunately, you just can't generalize from a small subset of all training data to the rest (e.g. housecats have vertical pupils, tigers have round pupils, even though both are cats). Also, different datasets push towards different preferences, because the data are incompatible. A good example is real-world photos compared to manga/anime/cartoons: drawn images want very sharp, regular, high-contrast lines surrounding mostly flat areas, but photos rarely contain such features. An ANN thus has to decide whether to reconstruct some detail as photorealistic or as a drawing. Consequently, showing you images from just one subset (SFW in this case) does not allow you to draw conclusions about the output quality for other input image subsets.

Some thoughts about the different methods: the heavy-weight/slow methods LDSR and StableSR do not seem to outperform the fast methods except in very few examples and underperform significantly in many, which makes the extra effort questionable. They often hallucinate small details like text, probably due to their multi-step nature.

The StableSR implementation I used (ComfyUI) seems to have a bug that leaves some parts of the output image unprocessed (i.e. at the low input resolution). I did not try to fix or work around that, because StableSR does not produce great results even in the parts of the image where it does work.

LDSR seems to be unable to produce certain output resolutions exactly, probably because it is limited to multiples of a base number (just like Stable Diffusion can only produce output dimensions divisible by 8, because its latent is 8x downsampled relative to the input resolution). Because of that, some of the LDSR crops are misaligned. I have not tried to hide that, because a limitation like that is relevant for a comparison. Maybe it's the fault of the LDSR implementation I used. The other methods do not seem to have that limitation.
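The divisibility constraint just described (output dimensions limited to multiples of a base, e.g. 8 for Stable Diffusion's latent) can be sketched in a few lines; this helper is hypothetical, not taken from any of the tested implementations:

```python
def round_up_to_multiple(width: int, height: int, base: int = 8) -> tuple[int, int]:
    """Round dimensions up to the nearest multiple of `base`, as models
    with a `base`-fold downsampled latent can only emit such sizes."""
    def ceil_mult(v: int) -> int:
        return -(-v // base) * base  # ceiling division, scaled back up

    return ceil_mult(width), ceil_mult(height)

# A 4x upscale targeting 1000x764 pixels can't be hit exactly by a base-8
# model; the nearest reachable size is 1000x768, 4 pixels too tall.
print(round_up_to_multiple(1000, 764))  # (1000, 768)
```

A mismatch like those 4 extra pixels is exactly the kind of thing that would make crops from such a method misaligned against the others.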

Real-ESRGAN erases or blurs a lot of detail (like the veins in the dragonfly wing). Remacri does not do that, but it goes too far in the other direction and adds too many hallucinations and exaggerated features (e.g. the penguin colony and the hair on the drinking monkey's head). That UltraSharpV2 strikes a good balance between these extremes is the main reason why I consider it the winner.

It is also clear that even the best method often doesn't add much beyond edge enhancement/sharpening relative to the Lanczos output, so if you need a really fast and memory-efficient solution, don't feel bad about using Lanczos + sharpening (not among the tested methods). And finally, it should be clear that even the best method does not come close to the ground truth for most inputs, so downsampling followed by superresolution cannot be a replacement for the original. This should not be surprising: some small details are unreconstructible, because you would have to guess. Examples:

  • The writing on the yellow CGI vehicle, namely "Come Home Safe".
  • In the cyberpunk-image that looks like a "Blade Runner" screenshot, there is small writing saying "doll house" in Japanese. On the same building there is a larger sign saying "Doll House" in English. A really high level understanding of the image could maybe give the right hint for reconstructing the Japanese text, but that is too much to ask from a superresolution method that runs reasonably fast and with reasonable memory requirements.
  • The caravan entering the gate to the city is so small that reconstructing riders on top of the camels' backs is speculative. The pixels on their backs could be cargo or background pixels. We just assume riders must be there because camels rarely run around alone.
  • The individual penguins in the penguin colony cannot be separated. Reconstructing them first requires recognizing that they are indeed penguins (maybe they are penguin-looking aliens), then making assumptions about the number of penguins present and their orientation relative to the camera, and finally requires knowing what a highres penguin looks like (and which one exactly: emperor penguin or king penguin?).
  • The background pattern in the "Pulp Fiction" poster is completely averaged out in the superresolution input. Any "reconstruction" of it should be considered an artifact.
  • Reconstructing all the small faces in the "Akira" poster requires recognizing that they are actually faces, not background features, and then knowing that they are wearing gas masks. This requires understanding the artist's intent and/or knowing cultural references.
  • Reconstructing a fractal requires knowing the formula that created it.
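The "Lanczos + sharpening" fallback mentioned above boils down to a classical unsharp mask applied after resizing. A minimal NumPy sketch of the sharpening step (my own, with a 3x3 box blur standing in for the usual Gaussian):

```python
import numpy as np

def unsharp_mask(img: np.ndarray, amount: float = 1.0) -> np.ndarray:
    """Sharpen by adding back the difference to a blurred copy:
    out = img + amount * (img - blur(img)).
    A 3x3 box filter stands in for the usual Gaussian blur."""
    padded = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    # 3x3 box blur as the average of the nine shifted copies
    blurred = sum(
        padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)
    ) / 9.0
    return np.clip(img + amount * (img - blurred), 0, 255).astype(np.uint8)

# A hard edge gets the typical overshoot ("halo") that makes it look sharper.
edge = np.zeros((4, 6), dtype=np.uint8)
edge[:, 3:] = 200
sharpened = unsharp_mask(edge)
```

Note that this only exaggerates contrast that is already present; unlike the ANN-based methods, it cannot invent plausible new detail.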

If you want to run the winner at home interactively, I would recommend ComfyUI .

What is superresolution

In short: low resolution image in, high resolution image out. That means the algorithm has to guess detail that isn't actually there in the source image. There are two very different approaches. The first tries to produce a high resolution image that is representative of all high resolution images corresponding to the low resolution input, and does not try to create details that cannot be known without access to the probability distribution of all real-world high resolution images. The second approach does try to create such unknowable detail so as to make the final image more realistic, at the cost of not representing all high resolution images corresponding to the input. Essentially, the second approach chooses one high resolution image among an infinitude of possible choices, while the first tries to average all possible choices. The first approach would not produce a realistic human face when 4x or 8x upscaling a very low resolution face (something like 8x8 pixels), but the second approach would attempt that, as long as it can recognize the face as a face.
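The difference between the two approaches can be made concrete with a toy example (mine, not taken from any of the tested methods): when two sharp reconstructions are equally plausible, averaging them yields a blur that matches neither, while committing to one yields a sharp image that may simply be wrong.

```python
import numpy as np

# Two equally plausible high-res patches consistent with the same low-res
# pixel: an edge going dark-to-bright and one going bright-to-dark.
candidate_a = np.array([[0, 0, 255, 255]] * 4, dtype=float)
candidate_b = np.array([[255, 255, 0, 0]] * 4, dtype=float)

# First approach (average over all possibilities): a flat gray patch that
# is "correct on average" but as sharp as neither candidate.
averaged = (candidate_a + candidate_b) / 2
print(averaged[0])  # [127.5 127.5 127.5 127.5]

# Second approach (commit to one possibility): sharp, but a guess.
rng = np.random.default_rng()
chosen = candidate_a if rng.random() < 0.5 else candidate_b
```

With millions of plausible candidates instead of two, the averaging approach smears out all fine detail, which is why it cannot produce a realistic face from an 8x8 input.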

Written by the author; Date 12.02.2026; © 2026 spinningsphinx.com
