AI is proving divisive among content creators. Many applaud its ability to generate swathes of impressive prose from a short prompt. Others are concerned that the technology is being trained on proprietary content, and that the rights holders were never asked whether they wanted their work used in this way: paraphrased and regurgitated with questionable accuracy, and possibly cited as the author of the AI-bodged synopsis.
As a Mallow Web Design company, we’re of the opinion that anything on the web is in the public domain anyway, but your business may, for many reasons, not want your website content used to train AIs.
While the initial dataset used to train ChatGPT seems to have covered much of the accessible World Wide Web, it only contains data up until 2021, and it is possible to tell OpenAI’s web crawler NOT to crawl your site content.
All you need to do is edit the robots.txt file at the root of your public HTML folder (or create the file if it doesn’t exist already) and add the following:
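OpenAI's crawler identifies itself with the user agent token GPTBot, so a rule asking it to stay away from the whole site would look something like this (narrow the path if you only want to protect part of the site):

```
# Ask OpenAI's crawler (user agent token: GPTBot) not to crawl any page
User-agent: GPTBot
Disallow: /
```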
This will stop ChatGPT, but there is a battalion of other AI bots and general bots out there crawling the web. If you’re particular about what crawls your website, you may want to set up a custom robots.txt that denies access to everything by default and then whitelists the common search engine bots that you DO want to crawl your site.
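As a minimal sketch of that deny-by-default approach, assuming Googlebot and Bingbot are the only crawlers you want to let in (swap in the user agent tokens of whichever search engines matter to you):

```
# Crawlers we DO want: each gets its own group with full access
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Everything else, including AI crawlers, is asked to stay out
User-agent: *
Disallow: /
```

A compliant crawler follows the group that names it most specifically and falls back to the `*` group otherwise, so the order of the groups does not matter.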
It’s also important to know that robots.txt will not prevent a malign bot from crawling your site. The robots convention is voluntary: a bot has to check the robots file to see whether it is welcome. The good bots respect the file; the bad bots likely don’t even read it!