Dan Romero on Warpcast

Wonder if ChatGPT will be the last major model to be trained on the open web? Will robots.txt files start specifically disallowing crawling by LLMs unless the site gets paid for the data?
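
As a rough illustration of the kind of rule the post imagines, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleLLMBot`, the name `SomeSearchBot`, and the `example.com` URL are made up for the example; the post does not name any real crawler or site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) LLM crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""

def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

rules = build_rules(ROBOTS_TXT)

# The made-up LLM crawler is told to stay out of the whole site.
print(rules.can_fetch("ExampleLLMBot", "https://example.com/post"))  # False

# Any other user agent falls through to the wildcard entry and is allowed.
print(rules.can_fetch("SomeSearchBot", "https://example.com/post"))  # True
```

Nothing in the file format covers payment; a "paid access" exception would have to be arranged outside robots.txt.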

Venkatesh Rao ☀️

I doubt it. We’re at the start of an arms race between training and membership inference algorithms. https://arxiv.org/abs/2301.09956 Even if Western majors respect regulatory-type regimes and robots.txt directives, many others won’t. The only defense is encryption, not regulation.

0xbyron

I'm curious what the law is around crawling sites that disregard robots.txt and post mirrors of content.

Shashank

Might be interesting if ChatGPT could include citations in its results, but it might become more like Google at that point.

Adam Baybutt

How do LLMs incentivize users to give feedback on answer quality? Offer a fee? But then users just maximize the number of feedback submissions. Offer a token for ~shared revenue? The incentive needs to reward credible feedback.

phil

I don’t think so. If we continue to see model sizes increase, I would expect GPT-4 and GPT-5 to also be trained on a similar corpus, with better results. What ~might~ happen is that new webpages get protection against this kind of scraping. That's hard to do retroactively, since the data is probably already cached.

🎩 MxVoid 🎩

Could be. Microsoft is already being sued over Copilot; Stability AI, Midjourney, and DeviantArt are being sued over Stable Diffusion; it’s just a matter of time before OpenAI gets sued for their products, too. When the lawsuits start flying, so do the CYA measures.

William Saar

If AIs can generate enough value, it might be worth paying armies of Mechanical Turk-style workers to manually visit and rewrite websites for copyright-approved training. Facts and ideas can't be copyrighted, only their particular expression.

Justin Hunter

@polluterofminds

Aren’t robots.txt files just suggestions? Any crawler can ignore those files if it wants to, and Google often does, IIRC.
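
To make the "just suggestions" point concrete, here is a small offline sketch. It assumes a hypothetical crawler called `HypotheticalBot` and uses a stand-in `fake_fetch` function instead of a real HTTP request; none of these names come from the thread. The only place the robots.txt rule is applied is inside the crawler's own code, so a crawler that skips the check gets the page anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site policy: no crawler may fetch anything.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

def make_crawler(respect_robots: bool):
    """Build a crawl function that optionally honors robots.txt."""
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())

    def crawl(url, fetch):
        # The check below is voluntary: it runs in the crawler, not on the server.
        if respect_robots and not rules.can_fetch("HypotheticalBot", url):
            return None
        return fetch(url)

    return crawl

def fake_fetch(url):
    # Stand-in for an HTTP GET so the sketch runs without network access.
    return f"<html>content of {url}</html>"

polite = make_crawler(respect_robots=True)
rude = make_crawler(respect_robots=False)

print(polite("https://example.com/a", fake_fetch))  # None
print(rude("https://example.com/a", fake_fetch))    # <html>content of https://example.com/a</html>
```

Actually blocking a crawler takes something server-side, such as authentication or rate limiting, which robots.txt itself does not provide.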

Heath

First, I’m impressed by the thoughtfulness of your responses. Very bullish on what you’re building here. Second, I think the knee-jerk reactions will settle down.

timdaub

lol, we must be listening to All-In at around the same time mark
