Dan Romero
@dwr.eth
Wonder if ChatGPT will be the last major model trained on the open web? Will sites start using robots.txt to specifically disallow crawling by LLMs unless they get paid for the data?
12 replies
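For anyone wondering what a robots.txt rule like that could look like, here is a rough sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleLLMBot`, the `example.com` URL, and the helper `can_fetch` are made up for illustration; real LLM crawlers publish their own user-agent strings, and the rules only matter to crawlers that choose to honor them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) LLM crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/1"  # placeholder URL
    print(can_fetch("ExampleLLMBot", url))
    print(can_fetch("SomeOtherBot", url))
```

This is the check a well-behaved crawler would run before fetching a page; nothing in it stops a crawler that skips the check.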
Justin Hunter
@polluterofminds
Aren’t robots.txt files just suggestions? Any crawler can ignore those files if it wants, and Google often does IIRC
1 reply
kenny 🎩
@kenny
Yes, robots.txt is only a suggestion. Google will still index blocked pages, especially if they have a large number of links pointing at them. Real crawling restrictions are done at the server level.
0 reply
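As a rough idea of what a server-level restriction might look like, here is a minimal WSGI middleware sketch in Python that refuses requests by User-Agent. The names `block_crawlers`, `site`, `status_for`, and the crawler string `examplellmbot` are all hypothetical. User-Agent headers can be spoofed, so treat this as one possible layer, not a complete block.

```python
# Hypothetical crawler name, matched case-insensitively as a substring.
BLOCKED_AGENT_SUBSTRINGS = ("examplellmbot",)


def block_crawlers(app):
    """Wrap a WSGI app so that blocked user agents get a 403 response."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in agent for name in BLOCKED_AGENT_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Not a blocked agent: hand the request to the wrapped app.
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent: str) -> str:
    """Call the wrapped app directly and return the status line it sends."""
    seen = []

    def start_response(status, headers):
        seen.append(status)

    block_crawlers(site)({"HTTP_USER_AGENT": user_agent}, start_response)
    return seen[0]


if __name__ == "__main__":
    print(status_for("ExampleLLMBot/1.0"))
    print(status_for("Mozilla/5.0"))
```

Unlike robots.txt, the server enforces this itself, so it does not depend on the crawler cooperating.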