Today I learnt #6 

Myth: Text-to-image/video tools like Midjourney and Sora are built on large language models similar to GPT (next-word prediction).

Fact: Text-to-image/video generation is possible because of a class of algorithms called "diffusion models".

Co founder @Intract - private data stack for AI & crypto use cases. Documenting my journey to become 100x AI first founder. Growth nerd, part time guitar player

3. The neural network tries to predict the noise, given the noisy image & caption.

Initially it performs poorly, but it improves with feedback in each iteration: the noise it predicted is compared with the actual noise (this comparison is the loss function), and the network is adjusted to shrink the gap.

4. This repeats for 1000s of images, resulting in a neural network that has learnt how to predict the noise given a noisy image & caption
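
To make the feedback loop in steps 3 and 4 concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the two-caption `DATASET`, the 1-pixel "images", the noise levels, and the `predict_noise` function, which stands in for a real neural network with just one learnable number per caption. It only shows the idea of predicting noise, measuring the error and nudging the model.

```python
import math
import random

# Hypothetical toy data: each caption always maps to a single 1-pixel "image".
DATASET = {"bright": 1.0, "dark": -1.0}

def predict_noise(model, noisy, caption, alpha_bar):
    """Toy stand-in for the neural network: one learnable number per caption."""
    return (noisy - math.sqrt(alpha_bar) * model[caption]) / math.sqrt(1.0 - alpha_bar)

def train(steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    model = {caption: 0.0 for caption in DATASET}  # starts out knowing nothing
    losses = []
    for _ in range(steps):
        caption = rng.choice(list(DATASET))
        image = DATASET[caption]
        alpha_bar = rng.choice([0.9, 0.5, 0.1])  # share of the image that survives
        noise = rng.gauss(0.0, 1.0)              # the actual noise
        noisy = math.sqrt(alpha_bar) * image + math.sqrt(1.0 - alpha_bar) * noise
        # Compare predicted noise with actual noise
        error = predict_noise(model, noisy, caption, alpha_bar) - noise
        losses.append(error ** 2)                # loss: squared error
        # Gradient of the loss with respect to this caption's learnable number
        grad = 2.0 * error * (-math.sqrt(alpha_bar) / math.sqrt(1.0 - alpha_bar))
        model[caption] -= lr * grad              # feedback: adjust the model
    return model, losses

model, losses = train()
```

After training, `model["bright"]` ends up close to 1.0 and `model["dark"]` close to -1.0, and the later entries in `losses` are far smaller than the first ones. A real diffusion model replaces the one-number-per-caption trick with a large neural network, but the loop has the same shape.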

Step 2 (Decoding)

Now when you give Midjourney a prompt ("create a meme to explain the impact of text to image models for non technical audience") 
1. It generates random noise (a random picture, made of nothing but random combinations of red, green & blue pixel values)

2. Given that random noise & your prompt, the neural network predicts the noise in the picture (your prompt acting like a caption)

3. The predicted noise is subtracted from the current image & the process repeats, step by step, until almost no noise is left
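
The three steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how Midjourney actually works: `LEARNED` and `predict_noise` are hypothetical stand-ins for an already trained network, the "image" is a single pixel, the `schedule` values are made up, and the update rule is one common deterministic choice (DDIM-style). The post itself does not say which sampler is used.

```python
import math
import random

# Hypothetical: what a trained model "knows" each caption should look like (1 pixel).
LEARNED = {"bright": 1.0, "dark": -1.0}

def predict_noise(noisy, caption, alpha_bar):
    """Stand-in for the trained neural network (assumed to be already trained)."""
    return (noisy - math.sqrt(alpha_bar) * LEARNED[caption]) / math.sqrt(1.0 - alpha_bar)

def generate(prompt, seed=0):
    rng = random.Random(seed)
    # Share of the picture that is real signal at each step: almost none, then all.
    schedule = [0.01, 0.1, 0.3, 0.6, 0.9, 1.0]
    x = rng.gauss(0.0, 1.0)  # 1. start from pure random noise
    for alpha_bar, alpha_bar_next in zip(schedule, schedule[1:]):
        eps = predict_noise(x, prompt, alpha_bar)  # 2. predict the noise
        # 3. remove the predicted noise to estimate the clean image...
        clean = (x - math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha_bar)
        # ...then move to a slightly less noisy picture and repeat
        x = math.sqrt(alpha_bar_next) * clean + math.sqrt(1.0 - alpha_bar_next) * eps
    return x

result = generate("bright")
```

With this toy model, `generate("bright")` lands on a value very close to 1.0 whatever the random starting noise was, which is the point: the prompt, not the starting noise, decides what comes out.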

PS: Image generated with OpenAI's DALL·E (an alternative to Midjourney)

What is the diffusion algorithm?

Step 1 (Encoding)
For one image & caption:
1. You take that image and add noise to it in a series of steps, from very little to total chaos.

2. Then you take each (noisy image, caption, noise) instance & pass it through a neural network (a set of mathematical calculations loosely modelled on the neurons in your brain)
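
Here is a rough sketch of steps 1 and 2 in plain Python, to show what "adding a series of noise" can look like. The blending formula is the one commonly used in DDPM-style diffusion models and is an assumption here, since the post does not spell it out. The 4-pixel `image`, the `caption` and the three noise levels are made-up values, and `add_noise` is a hypothetical helper name.

```python
import math
import random

def add_noise(image, alpha_bar, rng):
    """Blend an image (flat list of pixel values) with Gaussian noise.

    alpha_bar near 1.0 keeps the image almost intact; near 0.0 leaves mostly noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
             for x, n in zip(image, noise)]
    return noisy, noise

rng = random.Random(0)
image = [0.2, 0.8, -0.5, 1.0]  # hypothetical 4-pixel "image"
caption = "a tiny example"     # hypothetical caption

# From very little noise (0.99) to almost total chaos (0.01)
training_examples = []
for alpha_bar in [0.99, 0.5, 0.01]:
    noisy, noise = add_noise(image, alpha_bar, rng)
    training_examples.append((noisy, caption, noise))
```

Each entry in `training_examples` is one (noisy image, caption, noise) instance of the kind that gets passed to the neural network.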