Dan Romero
@dwr.eth
Wonder if ChatGPT will be the last major model trained on the open web? Will robots.txt files start specifically disallowing LLM crawlers unless sites get paid for the data?
12 replies

Venkatesh Rao ☀️
@vgr
I doubt it. We’re at the start of an arms race between training and membership inference algorithms. https://arxiv.org/abs/2301.09956 Even if the Western majors respect regulatory regimes and robots.txt directives, many others won’t. The only defense is encryption, not regulation.
1 reply

0xbyron
@byron
I'm curious what the law is around crawling sites that disregard robots.txt and post mirrors of content.
1 reply

Shashank
@0xshash
Might be interesting if ChatGPT can include citations in the results, but it might become more like Google at that point.
0 reply

Adam Baybutt
@baybutt
How do LLMs incentivize users to give feedback on answer quality? Offer a fee? But then users just maximize the number of feedback submissions. Offer a token for ~shared rev? The goal is to incentivize credible feedback.
0 reply

phil
@phil
I don’t think so. If we continue to see model sizes increase, I would expect GPT-4 and 5 to also be trained on a similar corpus, with better results. What ~might~ happen is that new webpages get protection against this kind of scraping. Hard to do retroactively, since the data is probably already cached.
0 reply

keccers
@keccers.eth
I hope so. I wasn’t attuned to the risk previously but now that I am I don’t want MegaCorpLLM getting a scrap, save for my illegible tweets lmao
0 reply

wake
@wake
Crawl me baby.
0 reply

timdaub
@timdaub.eth
lol we must be listening to all in around the same time mark
0 reply

🎩 MxVoid 🎩
@mxvoid
Could be. Microsoft is already being sued over Copilot; Stability AI, Midjourney, and DeviantArt are being sued over Stable Diffusion; it’s just a matter of time before OpenAI gets sued for their products, too. When the lawsuits start flying, so do the CYA measures.
1 reply

William Saar
@saarw.eth
If AIs can generate enough value, it might be worth paying armies of Mechanical Turk-style workers to manually visit and rewrite websites for copyright-approved training. Facts and ideas can't be copyrighted, only particular expression.
1 reply

Justin Hunter
@polluterofminds
Aren’t robots.txt files just suggestions? Any crawler can ignore those files if it wants, and Google often does IIRC.
1 reply
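
On the "just suggestions" point, here is a minimal Python sketch of how a well-behaved crawler would check robots.txt using the standard-library urllib.robotparser. The crawler name ExampleLLMBot, the robots.txt contents, the example.com URL, and the is_allowed helper are all made up for illustration. The check runs entirely on the crawler's side, so a crawler that never calls it is not stopped by anything in the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "ExampleLLMBot" is a made-up user agent,
# not the name of any real crawler.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/posts/1"
    # A compliant crawler asks first and skips the page when told no.
    print("ExampleLLMBot allowed:", is_allowed("ExampleLLMBot", url))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", url))
```

Nothing here is enforced by the server: the file only states a preference, and honoring it is up to whoever writes the crawler.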

Heath
@hackley01
First, I’m impressed by the thoughtfulness of your responses - very bullish on what you’re building here. Second, I think the knee-jerk reactions will settle down.
0 reply