Agus
@agus
Over the past few weeks, Ilja and I have been working on a tool that simplifies the process of creating tutorials for your platform. Here's our product journey so far 🧵
5 replies
1 recast
4 reactions

Agus
@agus
The tool works like this: you prompt it with what you want to teach (e.g., how to create a new collection of products on Shopify) and it gives you back a step-by-step guide that includes screenshots of each step (eventually also screen recordings), all powered by AI
1 reply
0 recast
0 reaction

Agus
@agus
It’s funny: with the current AI demos out there, you’d think it would be trivial to implement something like this, but there are actually many things to take into consideration 😄
1 reply
0 recast
0 reaction

Agus
@agus
Like with any product out there, the devil is in the details, which is also what makes developing products so fun (and sometimes painful!). Here’s how it went for us:
1 reply
0 recast
0 reaction

Agus
@agus
The task sounded straightforward: you prompt it with a flow to teach, and you get back a step-by-step guide on how to execute that flow, with screenshots of each step
1 reply
0 recast
0 reaction

Agus
@agus
The main question is: how will the LLM get this knowledge and obtain the screenshots of each step to generate the guide? There are two options here:
1 reply
0 recast
0 reaction

Agus
@agus
1. A human takes the screenshots of each step, and you embed them as docs in a RAG pipeline that generates the guide (see the sketch after this comparison)
• Pros: straightforward
• Cons: doesn’t scale to arbitrary flows. You can’t generate variants with different inputs, paths taken, screens, etc.
1 reply
0 recast
0 reaction

Agus
@agus
2. You use a browsing agent (like Voyager) to go through the flow, take the screenshots for you, and explain each step it takes while creating the guide
• Pros: scales to any flow
1 reply
0 recast
0 reaction

Agus
@agus
• Cons: existing browsing agents get stuck very often, and you’d need to modify them to take screenshots at each step and explain what they’re doing
1 reply
0 recast
0 reaction
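For a concrete picture of option 1, here’s a minimal sketch of the retrieval half of such a RAG pipeline, assuming OpenAI’s embeddings API and an in-memory store; the StepDoc shape and helper names are illustrative, not the tool’s actual code:

```typescript
import OpenAI from "openai";

// Illustrative shape: one doc per manually captured step.
interface StepDoc {
  text: string;          // human-written description of the step
  screenshotUrl: string; // the manually taken screenshot
  embedding: number[];   // precomputed with embed() below
}

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function embed(text: string): Promise<number[]> {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Retrieve the k step docs most relevant to the tutorial prompt;
// an LLM would then stitch these (text + screenshots) into the guide.
async function retrieve(docs: StepDoc[], query: string, k = 5): Promise<StepDoc[]> {
  const q = await embed(query);
  return docs
    .map((doc) => ({ doc, score: cosine(doc.embedding, q) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ doc }) => doc);
}
```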

Agus
@agus
To get the best of both worlds, we went with a hybrid approach. A human would record themselves doing the flow once, and that recording would be used to teach a custom browsing agent how to navigate the site and where to take screenshots
1 reply
0 recast
0 reaction
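One way a recording could drive the agent side, sketched with Playwright; the RecordedAction shape (fleshed out a few posts down) and the replay logic are assumptions for illustration, not the actual implementation:

```typescript
import { chromium } from "playwright";

// Illustrative shape for one recorded human action.
interface RecordedAction {
  type: "click" | "input" | "scroll";
  selector: string; // CSS selector captured at record time
  value?: string;   // text typed, for input actions
  url: string;      // page the action happened on
}

// Replay a recorded trace step by step, screenshotting after each action.
async function replay(actions: RecordedAction[]) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(actions[0].url);

  for (const [i, action] of actions.entries()) {
    if (action.type === "click") {
      await page.click(action.selector);
    } else if (action.type === "input") {
      await page.fill(action.selector, action.value ?? "");
    } else {
      await page.locator(action.selector).scrollIntoViewIfNeeded();
    }
    await page.screenshot({ path: `step-${i + 1}.png` });
  }
  await browser.close();
}
```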

Agus
@agus
Following this plan, we started by building a Chrome extension that lets you record yourself going through a flow
1 reply
0 recast
0 reaction
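A rough sketch of that recording skeleton, assuming a Manifest V3 extension; the message shapes are made up for illustration. The content script observes the page, and the service worker (which, unlike content scripts, can call chrome.tabs) takes the screenshot:

```typescript
// content-script.ts — runs inside the page. Capture phase, so we see
// the click even if the site's own handlers stop propagation.
document.addEventListener(
  "click",
  () => {
    chrome.runtime.sendMessage({ kind: "action", type: "click", url: location.href });
  },
  true
);

// service-worker.ts — content scripts can't call chrome.tabs, so the
// worker takes the screenshot and hands it back.
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg.kind === "action") {
    chrome.tabs.captureVisibleTab({ format: "png" }).then((dataUrl) => {
      // Persist the action + screenshot (e.g., chrome.storage or a backend).
      sendResponse({ screenshot: dataUrl });
    });
    return true; // keep the message channel open for the async response
  }
});
```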

Agus
@agus
Each action taken (e.g., clicking, typing text, scrolling) would be saved, along with a screenshot of the page at that moment and some extra metadata, such as the element that was clicked and the URL of the page
1 reply
0 recast
0 reaction
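The thread doesn’t spell out exactly which metadata made the cut; a plausible record shape, plus a naive way to derive a CSS selector for the clicked element (both illustrative):

```typescript
// Illustrative record for a single user action.
interface RecordedAction {
  type: "click" | "input" | "scroll";
  timestamp: number;
  url: string;          // page the action happened on
  selector: string;     // where on the page it happened
  elementText?: string; // visible label — useful context for the LLM
  value?: string;       // text typed, for input actions
  screenshot: string;   // data URL returned by captureVisibleTab
}

// Naive CSS selector for an element: its id if it has one, otherwise
// a tag/nth-of-type path from the root down to the element.
function cssPath(el: Element): string {
  if (el.id) return `#${el.id}`;
  const parts: string[] = [];
  let node: Element | null = el;
  while (node && node.parentElement) {
    const tag = node.tagName.toLowerCase();
    const sameTag = Array.from(node.parentElement.children).filter(
      (c) => c.tagName === node!.tagName
    );
    parts.unshift(
      sameTag.length > 1 ? `${tag}:nth-of-type(${sameTag.indexOf(node) + 1})` : tag
    );
    node = node.parentElement;
  }
  return parts.join(" > ");
}
```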

Agus
@agus
Here you already face a lot of small details you need to take care of. What metadata usefully describes an action to an LLM? When do you take the screenshot for events that are delayed by animations, such as opening a modal?
1 reply
0 recast
0 reaction
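For the delayed-screenshot question, one common trick (a sketch of the general technique, not necessarily what the tool does) is to wait until the DOM stops mutating, with a hard deadline as a fallback:

```typescript
// Resolve once the DOM has been quiet for `quietMs`, or after `maxMs`
// regardless, so a modal's opening animation finishes before we capture.
function waitForDomQuiet(quietMs = 300, maxMs = 2000): Promise<void> {
  return new Promise((resolve) => {
    let timer = setTimeout(finish, quietMs);
    const deadline = setTimeout(finish, maxMs);
    const observer = new MutationObserver(() => {
      clearTimeout(timer);
      timer = setTimeout(finish, quietMs); // reset the quiet window
    });
    observer.observe(document.body, { childList: true, subtree: true, attributes: true });

    function finish() {
      observer.disconnect();
      clearTimeout(timer);
      clearTimeout(deadline);
      resolve();
    }
  });
}

// Usage: after a click, wait, then ask the service worker for a screenshot.
// await waitForDomQuiet();
// chrome.runtime.sendMessage({ kind: "capture" });
```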

Agus
@agus
How do you work out where someone clicked from the position on the page? How do you signal to the user that the extension is recording without that indicator showing up in the screenshots you take?
1 reply
0 recast
0 reaction
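Both turn out to have small, mechanical answers; illustrative sketches of the two fixes:

```typescript
// Click events report viewport-relative coordinates; adding the scroll
// offsets pins the click to a fixed position on the page.
function pageCoords(e: MouseEvent): { x: number; y: number } {
  return { x: e.clientX + window.scrollX, y: e.clientY + window.scrollY };
}

// Hide the "recording" badge for the duration of the capture, then
// restore it, so the indicator never leaks into the screenshots.
async function captureWithoutBadge(
  badge: HTMLElement,
  capture: () => Promise<string> // e.g., a round-trip to the service worker
): Promise<string> {
  badge.style.visibility = "hidden";
  try {
    return await capture();
  } finally {
    badge.style.visibility = "visible";
  }
}
```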

Agus
@agus
These might sound trivial (and some of them are), but they’re things you need to solve for, and probably didn’t account for when you first thought about the problem 😄 It’s not only about discovering the problems but also deciding on good solutions for them!
1 reply
0 recast
0 reaction

Agus
@agus
Next up, we’d use GPT-4 to summarise all this recorded data into a single step-by-step guide for the flow
1 reply
0 recast
0 reaction
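A sketch of what that summarisation call could look like with OpenAI’s chat completions API and image inputs; the model name, prompt, and the trimmed RecordedAction shape are illustrative:

```typescript
import OpenAI from "openai";

// Trimmed version of the record shape sketched earlier.
interface RecordedAction {
  type: string;
  url: string;
  elementText?: string;
  screenshot: string; // data URL
}

const client = new OpenAI();

// Turn the recorded actions + screenshots into a numbered tutorial.
async function generateGuide(actions: RecordedAction[]): Promise<string> {
  const steps = actions.flatMap((a) => [
    {
      type: "text" as const,
      text: JSON.stringify({ type: a.type, url: a.url, elementText: a.elementText }),
    },
    // Screenshots are passed along as data URLs.
    { type: "image_url" as const, image_url: { url: a.screenshot } },
  ]);

  const res = await client.chat.completions.create({
    model: "gpt-4o", // any vision-capable GPT-4 model works here
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text" as const,
            text:
              "Below are the actions a user took, in order, each followed by " +
              "a screenshot. Write a numbered step-by-step guide for this flow.",
          },
          ...steps,
        ],
      },
    ],
  });
  return res.choices[0].message.content ?? "";
}
```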