CVPR paper: Instruct-Imagen: Image Generation with Multi-modal Instruction

Innovations: 
- Multi-modal instruction for image generation: A new format that uses natural language to combine different modalities (text, edge, style, subject, etc.) to articulate complex generation intents in a uniform way.

- Two-stage training approach for Instruct-Imagen:
a) Retrieval-augmented training: Adapts a pre-trained text-to-image model to handle multi-modal inputs using retrieved similar (image, text) pairs.
b) Multi-modal instruction-tuning: Fine-tunes the adapted model on diverse image generation tasks paired with multi-modal instructions.

- Unified model architecture: a single model handles various image generation tasks through multi-modal instructions, without task-specific designs.

- Zero-shot generalization capability to unseen and more complex image generation tasks.

- Adaptability to new tasks through fine-tuning on small datasets.
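
As a rough illustration of the multi-modal instruction format in the first bullet, the sketch below pairs a natural-language instruction with a list of reference contexts (subject, style, edge, and so on) and links them through text markers. The class names, the `[ref#N]` marker syntax, and the file names are assumptions made for this example, not the paper's actual code or data format.

```python
import re
from dataclasses import dataclass

# Hypothetical marker syntax that ties the text to the N-th context (1-based).
REF_PATTERN = re.compile(r"\[ref#(\d+)\]")


@dataclass
class Context:
    modality: str  # e.g. "subject", "style", "edge"
    image: str     # placeholder file name, illustrative only


@dataclass
class MultiModalInstruction:
    text: str
    contexts: list

    def referenced_indices(self):
        """Return the context indices mentioned in the instruction text."""
        return [int(i) for i in REF_PATTERN.findall(self.text)]

    def validate(self):
        """True if every marker points at a context and no context is unused."""
        refs = self.referenced_indices()
        n = len(self.contexts)
        in_range = all(1 <= i <= n for i in refs)
        all_used = all(i in refs for i in range(1, n + 1))
        return in_range and all_used

    def describe(self):
        """Replace each marker with a readable tag for its context."""
        def repl(match):
            ctx = self.contexts[int(match.group(1)) - 1]
            return f"<{ctx.modality}:{ctx.image}>"
        return REF_PATTERN.sub(repl, self.text)


instruction = MultiModalInstruction(
    text="Generate an image of [ref#1] in the style of [ref#2].",
    contexts=[Context("subject", "dog.png"), Context("style", "watercolor.png")],
)
print(instruction.validate())
print(instruction.describe())
```

The point of the sketch is that one uniform structure covers different tasks: swapping the contexts (for example an edge map instead of a style image) changes the task without changing the interface.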

source: https://arxiv.org/abs/2401.01952