Breakdown of AI-Induced Dreams

Why CLIP Dreams Collapse Without Regularization

The Core Optimization Problem

In CLIP-guided image transformation, we solve:

\hat{I} = \arg\min_I \mathcal{L}_{CLIP}(I, T)

Where:

  • I = image being optimized
  • T = text prompt
  • \hat{I} = final (optimized) image

CLIP tells us:

"Change the image so it scores higher for this text."
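
In code, that pressure is just a similarity loss between the image and text embeddings. A minimal sketch in PyTorch, assuming OpenAI's `clip` package; the model name and prompt are illustrative, not values from the project:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Embed the text prompt T once; no gradients are needed for it.
text_tokens = clip.tokenize(["a painting of a burning city"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(text_tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def clip_loss(image_batch):
    """1 - cosine similarity between image embeddings and the text embedding.

    image_batch: (N, 3, 224, 224), already resized/normalized for CLIP.
    """
    img_emb = model.encode_image(image_batch)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (1.0 - img_emb @ text_emb.T).mean()
```

Minimizing this loss is exactly the "score higher for this text" pressure; nothing in it knows what a natural image looks like.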

But CLIP does not care about:

  • Smoothness
  • Realism
  • Visual coherence
  • Natural image statistics

Without constraints, the optimizer will find any pixel configuration that maximizes similarity.

Result?

  • Noise explosions
  • Oversharpened artifacts
  • Psychedelic distortions

The Need for a Regularizer

A pure CLIP loss leads to chaotic optimization.

We introduce:

\hat{I} = \arg\min_I \left( \underbrace{\mathcal{L}_{CLIP}(I, T)}_{\text{what}} + \underbrace{\lambda_{tv} \mathcal{L}_{MTV}(I, M)}_{\text{how}} \right)

CLIP defines semantic direction.
Masked TV defines structural constraints.
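
In code, the combined objective is just a weighted sum. A minimal sketch, reusing `clip_loss` from above and `masked_tv_loss` as sketched later in the post; `lambda_tv` is an illustrative weight, not a value from the project:

```python
lambda_tv = 0.1  # illustrative; trades semantic pressure against smoothness

def total_loss(image, mask):
    # "what": move toward the prompt.  "how": stay structurally plausible.
    return clip_loss(image) + lambda_tv * masked_tv_loss(image, mask)
```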


Standard TV Loss

Total Variation Loss:

\mathcal{L}_{TV}(I) = \sum_{x,y} \left( |I_{x,y} - I_{x+1,y}| + |I_{x,y} - I_{x,y+1}| \right)

If the image is:

  • Smooth → small TV
  • Noisy → large TV
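
A direct PyTorch translation of that sum, as a sketch; `image` is assumed to be an (N, C, H, W) tensor:

```python
def tv_loss(image):
    """Anisotropic total variation: absolute horizontal plus vertical differences."""
    dx = (image[..., :, 1:] - image[..., :, :-1]).abs().sum()  # |I(x,y) - I(x+1,y)|
    dy = (image[..., 1:, :] - image[..., :-1, :]).abs().sum()  # |I(x,y) - I(x,y+1)|
    return dx + dy
```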

Problem:

TV punishes all sharp changes equally.

But edges are important:

  • Object boundaries
  • Texture
  • Structure

We must preserve them.


Masked TV Loss

We introduce a detail mask M(x, y).

The mask indicates:

How strongly each region should be smoothed: high where we expect only noise or flat color, low where we trust the region to contain meaningful structure (edges that must be preserved).

Final formulation:

\mathcal{L}_{MTV}(I, M) = \sum_{x,y} M(x,y) \left( \Delta I(x,y) \right)^2

The squared penalty ensures:

  • Small gradients → barely penalized
  • Large gradients → heavily penalized

Effect by region:

| Region  | ΔI     | M      | Effect     |
|---------|--------|--------|------------|
| Noise   | High   | High   | Removed    |
| Edge    | High   | Low    | Preserved  |
| Flat    | Low    | High   | Stable     |
| Texture | Medium | Medium | Controlled |

This stabilizes hallucinations while preserving structure.
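
A sketch of the masked variant follows. The post does not spell out how M is built, so `detail_mask` below is an assumption: it derives M from the blurred gradient magnitude of the source image, so that strong edges get a low weight while flat or noisy regions get a high one:

```python
import torch
import torch.nn.functional as F

def detail_mask(source, blur_kernel=9):
    """Hypothetical mask: ~1 in flat/noisy regions, ~0 near strong edges of the source."""
    gray = source.mean(dim=1, keepdim=True)                     # (N, 1, H, W)
    grad = torch.zeros_like(gray)
    grad[..., :, :-1] += (gray[..., :, 1:] - gray[..., :, :-1]).abs()
    grad[..., :-1, :] += (gray[..., 1:, :] - gray[..., :-1, :]).abs()
    grad = F.avg_pool2d(grad, blur_kernel, stride=1, padding=blur_kernel // 2)
    return 1.0 - grad / (grad.max() + 1e-8)                     # low where edges are strong

def masked_tv_loss(image, mask):
    """Squared image gradients, weighted per pixel by the detail mask M(x, y)."""
    dx = image[..., :, 1:] - image[..., :, :-1]
    dy = image[..., 1:, :] - image[..., :-1, :]
    return (mask[..., :, 1:] * dx.pow(2)).sum() + (mask[..., 1:, :] * dy.pow(2)).sum()
```

Because the mask multiplies a squared gradient, noise in high-M regions is driven down hard, while the same gradient magnitude at a low-M edge is left almost untouched.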


The Color Explosion Problem

Even after noise is controlled, another issue appears:

RGB Instability

CLIP works in embedding space.
It does not understand physical colors.

The optimizer may push pixel values like:

  • R = 7
  • G = -4
  • B = 19

Mathematically valid as an optimization step.
Physically impossible: displayable pixels must stay in [0, 1].


Solution: Clamping

After each update:

img_tensor.clamp_(0, 1)

This keeps values within the valid [0, 1] image range.
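
In context, the clamp sits right after each optimizer step, outside the autograd graph (a sketch; `optimizer` and `img_tensor` stand for the objects in the surrounding loop):

```python
optimizer.step()
with torch.no_grad():
    img_tensor.clamp_(0, 1)  # project pixels back into the physically valid range
```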

Result:

  • Stable colors
  • Reduced saturation artifacts
  • Human-acceptable outputs

Octave-Based Semantic Refinement

Instead of optimizing at full resolution immediately, we use:

Low resolution → establish structure
High resolution → refine details

\hat{I} = I_0 + \sum_{k=1}^{N} \Delta_k

where I_0 is the low-resolution base and \Delta_k is the detail added at octave k.

This prevents early noise amplification.
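
A sketch of the octave schedule; the sizes and step count are illustrative, and `optimize` stands in for the inner CLIP + masked-TV loop:

```python
import torch.nn.functional as F

def dream(img, octaves=(224, 320, 448), steps_per_octave=200):
    for size in octaves:
        # Carry the current result up to the next resolution, then refine it there.
        img = F.interpolate(img, size=(size, size), mode="bilinear", align_corners=False)
        img = img.detach().requires_grad_(True)
        img = optimize(img, steps=steps_per_octave)  # inner CLIP + masked-TV loop
    return img
```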


Final Pipeline Summary

  1. Encode text prompts into embedding space

    • Positive = Attractor
    • Negative = Repellor
  2. Optimize image pixels using:

    • Multi-scale cutouts
    • Masked TV regularization
    • Cosine LR scheduling
    • Gradient normalization
  3. Clamp pixel range

  4. Progress through octaves
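
One way these steps fit together in a single inner loop. Everything below is an illustrative sketch reusing the earlier `clip_loss`, `detail_mask`, and `masked_tv_loss` helpers; the cutout count, crop range, learning rate, and weights are assumptions, and the negative-prompt repellor is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def make_cutouts(image, n_cuts=16, cut_size=224):
    """Random multi-scale crops resized to CLIP's input resolution (CLIP normalization omitted)."""
    _, _, h, w = image.shape
    cuts = []
    for _ in range(n_cuts):
        size = int(torch.empty(1).uniform_(0.5, 1.0).item() * min(h, w))
        y = torch.randint(0, h - size + 1, (1,)).item()
        x = torch.randint(0, w - size + 1, (1,)).item()
        crop = image[:, :, y:y + size, x:x + size]
        cuts.append(F.interpolate(crop, size=(cut_size, cut_size),
                                  mode="bilinear", align_corners=False))
    return torch.cat(cuts, dim=0)

def optimize(img, steps=200, lr=0.05, lambda_tv=0.1):
    mask = detail_mask(img.detach())                      # structural prior from the source image
    optimizer = torch.optim.Adam([img], lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = clip_loss(make_cutouts(img)) + lambda_tv * masked_tv_loss(img, mask)
        loss.backward()
        img.grad.div_(img.grad.norm() + 1e-8)             # gradient normalization
        optimizer.step()
        scheduler.step()                                   # cosine LR schedule
        with torch.no_grad():
            img.clamp_(0, 1)                               # keep pixels physically valid
    return img
```

The clamp and the mask are what keep this loop from collapsing into the noise explosions described at the top.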


Philosophical Takeaway

Most generative models learn how to produce images.

This project asks:

What if we don't teach a model to generate, but instead force it to imagine by optimizing reality itself?

The result is not sampling.

It is pressure-driven hallucination.

A machine dream emerging from gradients.