Apurv
@apurvkaushal
Today I learnt #6 Myth: Text to image/ video tools like Midjourney, Sora are also built on large language models similar to GPT (next text prediction) Fact: Text to image/ video is possible because of an algorithm class "diffusion models".
1 reply
0 recast
2 reactions
Apurv
@apurvkaushal
What is diffusion algorithm? Step 1 (Encoding) For one image & caption: 1. You take that image and add a series of noise to it - from too little to total chaos. 2. Then you take each instance of a (noisy image , caption, noise) & pass it through neural network (a set of mathematical calculations modelled on your brain neurons)
1 reply
0 recast
0 reaction
Apurv
@apurvkaushal
3. The neural network tries to predict the noise, given the noisy image & caption. Initially it performs poorly but improves with feedback in each iteration (comparing the noise it predicted with the actual noise i.e. loss function) 4. This repeats for 1000s of image resulting in a neural network that has learnt how to predict noise given a noisy image & caption
1 reply
0 recast
0 reaction