Breakdown of AI-Induced Dreams

Why CLIP Dreams Collapse Without Regularization

The Core Optimization Problem

In CLIP-guided image transformation, we solve:

\hat{I} = \arg\min_I \mathcal{L}_{CLIP}(I, T)

Where:

  • I = image being optimized
  • T = text prompt
  • \hat{I} = final (optimized) image

CLIP tells us:

"Change the image so it scores higher for this text."
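
In code, that pressure is just a similarity loss between the image and text embeddings. A minimal sketch in PyTorch, assuming OpenAI's `clip` package; the model name and prompt are illustrative, not values from the project:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Embed the text prompt T once; no gradients are needed for it.
text_tokens = clip.tokenize(["a painting of a burning city"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(text_tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def clip_loss(image_batch):
    """1 - cosine similarity between image embeddings and the text embedding.

    image_batch: (N, 3, 224, 224), already resized/normalized for CLIP.
    """
    img_emb = model.encode_image(image_batch)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (1.0 - img_emb @ text_emb.T).mean()
```

Minimizing this loss is exactly the "score higher for this text" pressure; nothing in it knows what a natural image looks like.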

But CLIP does not care about:

  • Smoothness
  • Realism
  • Visual coherence
  • Natural image statistics

Without constraints, the optimizer will find any pixel configuration that maximizes similarity.

Result?

  • Noise explosions
  • Oversharpened artifacts
  • Psychedelic distortions

The Need for a Regularizer

A pure CLIP loss leads to chaotic optimization.

We introduce:

\hat{I} = \arg\min_I \left( \underbrace{\mathcal{L}_{CLIP}(I, T)}_{\text{what}} + \underbrace{\lambda_{tv} \mathcal{L}_{MTV}(I, M)}_{\text{how}} \right)

CLIP defines semantic direction.
Masked TV defines structural constraints.
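
In code, the combined objective is just a weighted sum. A minimal sketch, reusing `clip_loss` from above and `masked_tv_loss` as sketched later in the post; `lambda_tv` is an illustrative weight, not a value from the project:

```python
lambda_tv = 0.1  # illustrative; trades semantic pressure against smoothness

def total_loss(image, mask):
    # "what": move toward the prompt.  "how": stay structurally plausible.
    return clip_loss(image) + lambda_tv * masked_tv_loss(image, mask)
```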


Standard TV Loss

Total Variation Loss:

\mathcal{L}_{TV}(I) = \sum_{x,y} \left( |I_{x,y} - I_{x+1,y}| + |I_{x,y} - I_{x,y+1}| \right)

If the image is:

  • Smooth → small TV
  • Noisy → large TV
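
A direct PyTorch translation of that sum, as a sketch; `image` is assumed to be an (N, C, H, W) tensor:

```python
def tv_loss(image):
    """Anisotropic total variation: absolute horizontal plus vertical differences."""
    dx = (image[..., :, 1:] - image[..., :, :-1]).abs().sum()  # |I(x,y) - I(x+1,y)|
    dy = (image[..., 1:, :] - image[..., :-1, :]).abs().sum()  # |I(x,y) - I(x,y+1)|
    return dx + dy
```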

Problem:

TV punishes all sharp changes equally.

But edges are important:

  • Object boundaries
  • Texture
  • Structure

We must preserve them.


Masked TV Loss

We introduce a detail mask M(x, y).

The mask indicates:

How strongly each region should be smoothed: high where we expect only noise or flat color, low where we trust the region to contain meaningful structure (edges that must be preserved).

Final formulation:

\mathcal{L}_{MTV}(I, M) = \sum_{x,y} M(x,y) \left( \Delta I(x,y) \right)^2

The squared penalty ensures:

  • Small gradients → barely penalized
  • Large gradients → heavily penalized

Effect by region:

| Region  | ΔI     | M      | Effect     |
|---------|--------|--------|------------|
| Noise   | High   | High   | Removed    |
| Edge    | High   | Low    | Preserved  |
| Flat    | Low    | High   | Stable     |
| Texture | Medium | Medium | Controlled |

This stabilizes hallucinations while preserving structure.
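
A sketch of the masked variant follows. The post does not spell out how M is built, so `detail_mask` below is an assumption: it derives M from the blurred gradient magnitude of the source image, so that strong edges get a low weight while flat or noisy regions get a high one:

```python
import torch
import torch.nn.functional as F

def detail_mask(source, blur_kernel=9):
    """Hypothetical mask: ~1 in flat/noisy regions, ~0 near strong edges of the source."""
    gray = source.mean(dim=1, keepdim=True)                     # (N, 1, H, W)
    grad = torch.zeros_like(gray)
    grad[..., :, :-1] += (gray[..., :, 1:] - gray[..., :, :-1]).abs()
    grad[..., :-1, :] += (gray[..., 1:, :] - gray[..., :-1, :]).abs()
    grad = F.avg_pool2d(grad, blur_kernel, stride=1, padding=blur_kernel // 2)
    return 1.0 - grad / (grad.max() + 1e-8)                     # low where edges are strong

def masked_tv_loss(image, mask):
    """Squared image gradients, weighted per pixel by the detail mask M(x, y)."""
    dx = image[..., :, 1:] - image[..., :, :-1]
    dy = image[..., 1:, :] - image[..., :-1, :]
    return (mask[..., :, 1:] * dx.pow(2)).sum() + (mask[..., 1:, :] * dy.pow(2)).sum()
```

Because the mask multiplies a squared gradient, noise in high-M regions is driven down hard, while the same gradient magnitude at a low-M edge is left almost untouched.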


The Color Explosion Problem

Even after noise is controlled, another issue appears:

RGB Instability

CLIP works in embedding space.
It does not understand physical colors.

The optimizer may push pixel values like:

  • R = 7
  • G = -4
  • B = 19

Mathematically valid as an optimization step.
Physically impossible: displayable pixels must stay in [0, 1].


Solution: Clamping

After each update:

img_tensor.clamp_(0, 1)

This keeps values within the valid [0, 1] image range.
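
In context, the clamp sits right after each optimizer step, outside the autograd graph (a sketch; `optimizer` and `img_tensor` stand for the objects in the surrounding loop):

```python
optimizer.step()
with torch.no_grad():
    img_tensor.clamp_(0, 1)  # project pixels back into the physically valid range
```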

Result:

  • Stable colors
  • Reduced saturation artifacts
  • Human-acceptable outputs

Octave-Based Semantic Refinement

Instead of optimizing at full resolution immediately, we use:

Low resolution → establish structure
High resolution → refine details

\hat{I} = I_0 + \sum_{k=1}^{N} \Delta_k

where I_0 is the low-resolution base and \Delta_k is the detail added at octave k.

This prevents early noise amplification.
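
A sketch of the octave schedule; the sizes and step count are illustrative, and `optimize` stands in for the inner CLIP + masked-TV loop:

```python
import torch.nn.functional as F

def dream(img, octaves=(224, 320, 448), steps_per_octave=200):
    for size in octaves:
        # Carry the current result up to the next resolution, then refine it there.
        img = F.interpolate(img, size=(size, size), mode="bilinear", align_corners=False)
        img = img.detach().requires_grad_(True)
        img = optimize(img, steps=steps_per_octave)  # inner CLIP + masked-TV loop
    return img
```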


Final Pipeline Summary

  1. Encode text prompts into embedding space

    • Positive = Attractor
    • Negative = Repellor
  2. Optimize image pixels using:

    • Multi-scale cutouts
    • Masked TV regularization
    • Cosine LR scheduling
    • Gradient normalization
  3. Clamp pixel range

  4. Progress through octaves
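
One way these steps fit together in a single inner loop. Everything below is an illustrative sketch reusing the earlier `clip_loss`, `detail_mask`, and `masked_tv_loss` helpers; the cutout count, crop range, learning rate, and weights are assumptions, and the negative-prompt repellor is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def make_cutouts(image, n_cuts=16, cut_size=224):
    """Random multi-scale crops resized to CLIP's input resolution (CLIP normalization omitted)."""
    _, _, h, w = image.shape
    cuts = []
    for _ in range(n_cuts):
        size = int(torch.empty(1).uniform_(0.5, 1.0).item() * min(h, w))
        y = torch.randint(0, h - size + 1, (1,)).item()
        x = torch.randint(0, w - size + 1, (1,)).item()
        crop = image[:, :, y:y + size, x:x + size]
        cuts.append(F.interpolate(crop, size=(cut_size, cut_size),
                                  mode="bilinear", align_corners=False))
    return torch.cat(cuts, dim=0)

def optimize(img, steps=200, lr=0.05, lambda_tv=0.1):
    mask = detail_mask(img.detach())                      # structural prior from the source image
    optimizer = torch.optim.Adam([img], lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = clip_loss(make_cutouts(img)) + lambda_tv * masked_tv_loss(img, mask)
        loss.backward()
        img.grad.div_(img.grad.norm() + 1e-8)             # gradient normalization
        optimizer.step()
        scheduler.step()                                   # cosine LR schedule
        with torch.no_grad():
            img.clamp_(0, 1)                               # keep pixels physically valid
    return img
```

The clamp and the mask are what keep this loop from collapsing into the noise explosions described at the top.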


Philosophical Takeaway

Most generative models learn how to produce images.

This project asks:

What if we don't teach a model to generate, but instead force it to imagine by optimizing reality itself?

The result is not sampling.

It is pressure-driven hallucination.

A machine dream emerging from gradients.