W.i.P / Nontrinsic

Image generation is not image creation. When it comes to novel artistic direction that isn't easily sourced from an online repository, excellent prompting and countless iterations do not replace artistic knowledge. The ability to translate abstract thought into visual form is enhanced by the machine, but without the experience to guide the process, successful artistic generation is merely the curation of happy accidents. If the prompter lacks the skill to craft what they asked the AI to create, the prompter does not possess creative freedom. Instead, they face creative limitation.

The Outcome:

A series of three portraits, exploring a collaborative process of elevating old artistic works into new forms that better represent how I've grown as an artist. None of these works is without post-generation editing, image manipulation or digital painting. While the major metrics of this study are subjective, the process below did result in an 80 percent reduction in time compared to previous generation-to-post flows. The value toward ideation in general is likely incalculable.

That process A-to-Z continues below, after the images.

Surrealism, Realistically.

The initial concept emerged from a decades-old series of academic paintings collecting dust. They were an exercise in creating surrealist portraits that juxtapose human subjects with unexpected objects, creating visual tension and narrative curiosity. I did not want a "machine" to upscale and upsell my previous work, but to help me take those concepts and create new iterations of that work.

So, before I introduced more variables like image and style references or reruns, I focused on discerning the best prompt to use. Agents were asked to summarize the source material into short, medium and long prompts.

A hyper-realistic photographic portrait of a man in his mid-30s with short dark hair and a stubbled beard, sitting in a mid-century wooden chair. He is wearing a dark navy button-up shirt with a nautical pattern. His head is turned to the side in a three-quarter view, looking subtly toward the camera with a composed expression. A tri-colored ferret sits on his shoulder, visibly wrinkling the fabric beneath it. The background features wood paneling, warm side lighting and muted earth tones. The background is slightly out of focus for a painterly depth effect.

However, as the intent of this exercise is to elevate existing work and push it into a new direction, not recreate it, I went through a series of prompt engineering exercises.

A photorealistic painted portrait. The lighting is studio, warm, and directional from the side. The subject is a mid 30's in age man. The angle of the face is looking to the side at a three-fourth view. The zoom of the portrait shows the full head, neck and upper torso of the body. The overall color of the portrait is vivid pastels. The setting of the portrait is a mid-century parlor. The figure is seated in a wooden chair. The background of the painting is light, airy and out of focus plants. There is a bright yellow rubber ducky on the shoulder of the subject.

The prompt, and the outcome, are obviously different. Too much detail in the prompt, or too little, creates at best half-correct outcomes and at worst shifts into the uncanny valley.
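
One habit that helped during these exercises was treating the prompt as a fixed set of slots and changing one variable at a time. Below is a minimal sketch of that idea in Python; the slot names are my own shorthand, not any generator's API.

```python
# Hypothetical prompt "slots" mirroring the structure of the prompt above.
# The field names are my own shorthand, not part of any tool's API.
PORTRAIT_SLOTS = {
    "medium": "A photorealistic painted portrait",
    "lighting": "studio, warm, and directional from the side",
    "subject": "a man in his mid-30s",
    "angle": "face turned to the side at a three-quarter view",
    "zoom": "full head, neck and upper torso",
    "palette": "vivid pastels",
    "setting": "a mid-century parlor, seated in a wooden chair",
    "background": "light, airy and out-of-focus plants",
    "object": "a bright yellow rubber ducky on the subject's shoulder",
}

def build_prompt(slots: dict) -> str:
    """Join the slots into a single prompt string; vary one slot per rerun."""
    return ". ".join(slots.values()) + "."

print(build_prompt(PORTRAIT_SLOTS))
```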

When ideating the prompt, tools like Claude were employed (but not paid, because it's an AI agent).

A Semi-Well Oiled Model

Diffusion models form the basis of many popular image generation tools, like Midjourney. Stable Diffusion, being one of the most well-known, represents a latent diffusion model that transforms random noise into coherent imagery. This is done through iterative denoising guided by embedded text prompts. So, like a lot of art, it starts from a place of chaos and works backwards. Despite their remarkable capabilities, that chaos presents as much of a problem as it does an opportunity.

Start with Gaussian noise → model predicts added noise (ε) → subtract predicted noise → repeat over T steps → final image appears.
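
For the curious, that loop can be sketched in a few lines of Python. This is a schematic, DDPM-style reverse process under toy assumptions, not Midjourney's or Stable Diffusion's actual internals; the `model`, noise schedule and latent shape are stand-ins.

```python
import torch

# Schematic DDPM-style reverse process. The schedule values and shapes
# are toy placeholders, not any product's real internals.
def denoise(model, steps: int = 50, shape=(1, 4, 64, 64)):
    x = torch.randn(shape)                        # start with Gaussian noise
    alphas = torch.linspace(0.9999, 0.98, steps)  # toy noise schedule
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):              # repeat over T steps
        eps = model(x, t)                         # model predicts added noise (ε)
        a, ab = alphas[t], alpha_bars[t]
        # subtract the predicted noise component and rescale
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(1 - a) * torch.randn_like(x)  # re-inject sampling noise
    return x                                      # final image (latent) appears

# Toy usage: a "model" that predicts zero noise, just to run the loop.
latent = denoise(lambda x, t: torch.zeros_like(x))
```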

And those problems show up often. While not shown, there was a consistent need for negative prompts to keep the odd things that were endearing in the beginning from cropping up in tailoring sessions.

A realistic, well painted, almost photographic portrait. The lighting is directional sunlight, warm, and directional from the side. The subject is a mid 30's in age man, dark hair with a stubble beard. The angle of the face is looking to the side at a three-fourth view. The zoom of the portrait shows the full head, neck and upper torso of the body. The overall color of the portrait is vivid pastels. This color profile would be lilac, grass green, pink, cyan and other similar colors. The setting of the portrait is a mid-century parlor. The figure is seated in a wooden chair. The background of the painting is light, airy and out of focus plants and flamingo adorned wallpaper. There is a bright yellow rubber ducky on the shoulder of the subject, at around the size of 4 to 5 inches in diameter. That bright yellow ducky has top hat in the style US president Abraham Lincoln. The yellow ducky displaces and wrinkles the clothing around the spot where it sits.
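
In Midjourney, that exclusion list rides along as the `--no` parameter; open tools expose the same idea as an explicit negative prompt. Here is a hedged sketch using the Hugging Face diffusers library, where the model ID and the negative terms are illustrative rather than the exact ones from this study.

```python
import torch
from diffusers import StableDiffusionPipeline

# Model ID and negative terms are illustrative, not the study's exact setup.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="A realistic, well painted, almost photographic portrait ...",
    # Steer sampling away from the recurring defects seen across reruns.
    negative_prompt="extra limbs, deformed hands, headless birds, extra ducks",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("portrait.png")
```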


There are many diffusion-model image generators out there, but Midjourney was used for the majority of this study.

Greenlit Series

We're getting close to something workable, something that matches my vision properly. However, there are still some areas that require attention, like the quality of the beard and the headless flamingo necks. Yet some happy accidents emerged that enhanced the work and crafted a more atypical portrait, like the soulless, eye-free rubber ducky. This unexpected result contributes to the overall vision in a way I had not imagined.

Engaging in final prompt engineering, variation runs and in-image editing sessions to generate fixes took me close to the finish line, but couldn't get me across. It needed a bit more polish.

Using tools like Luminar Neo and Adobe Photoshop, I worked to add more texture, establish a softer ambience and paint in the subtle fixes you saw at the start of the study. Post-editing was required because ideating a sunburn effect (this is Florida, after all) or isolated hair adjustments was simply less effective than doing the fix myself with a Wacom tablet. Despite this post-work, I still see flaws I want to correct or adjust. But this is a study, not a full-fledged art show. I needed a good stopping point so I could go on to create a series, not just a singular image.

Realism, Surrealistically.

One of the greatest frustrations in this area of generation is when the exerted effort no longer translates into equal results. Using reference imagery or successful prompts, or both, cannot create (to my satisfaction) a consistent series when a moderate number of variables are changed in a diffusion-model-based project. The process requires compromise, focusing on the realistic outcome, not the idealistic one. It is when you copy well-known styles, instead of developing your own, that the diffusion models used can generate more effectively.

While diffusion models were used for the core of this study, there are alternatives. GPT-4o is a multimodal AI system that combines an autoregressive language model with DALL-E 3's diffusion-based image generation capabilities. The text generation component uses tokenization and sequence prediction, which then guides the image creation process. This hybrid approach can produce more visually consistent designs when working from detailed text prompts. However, this consistency comes with tradeoffs: the system may exhibit less spontaneous variation than pure diffusion models, a major downside for ideation exploration. Additionally, without proper in-editing capabilities to adjust and refine specific sections, you're forced to generate entirely new images, introducing unwanted variations in late-stage development.

Start with empty image grid → generate token (patch) left to right → token is coarse representation → apply local diffusion refinement → repeat for next token.
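
OpenAI has not published GPT-4o's image stack, so the flow above is an inference. Treated purely as a schematic, the token-then-refine loop might look like the following sketch, in which every function is a hypothetical stand-in.

```python
import torch

# Purely schematic autoregressive image loop with local refinement.
# `predict_next_patch` and `refine_patch` are hypothetical stand-ins;
# GPT-4o's actual image internals are not public.
def generate_image(predict_next_patch, refine_patch,
                   grid=(8, 8), patch=16, channels=3):
    rows, cols = grid
    canvas = torch.zeros(channels, rows * patch, cols * patch)  # empty image grid
    for r in range(rows):
        for c in range(cols):                          # patches left to right
            coarse = predict_next_patch(canvas, r, c)  # token as coarse representation
            detailed = refine_patch(coarse)            # local diffusion-style refinement
            canvas[:, r * patch:(r + 1) * patch,
                      c * patch:(c + 1) * patch] = detailed
    return canvas                                      # grid filled token by token

# Toy usage with random stand-ins.
img = generate_image(lambda cv, r, c: torch.rand(3, 16, 16),
                     lambda p: p.clamp(0, 1))
```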

At the time of this study, the number of generations available can be severely limited by the type of plan you are on (turning a daily sprint on Midjourney into a weeks-long journey on ChatGPT). However, by doing most of the heavy lifting in a diffusion-model tool with proper in-editing tools, and any post-work in an image editor like Photoshop, that reference serves as a great building block for autoregressive models to generate a more coherent series. Using an autoregressive model to produce the other two final images, shown at the top of the page, reduced what had been a 10-hour flow with two unhappy results down to 2 hours.

With the speed of development of AI products, this could be out-of-date by the time you read it. Nonetheless, I still see value in showing the process, even if one day it may be comparable to a study of print design before Adobe products existed.