Why Your Site Needs To Be Eligible For Common Crawl For ChatGPT Optimization

Transcript:

Hey, I’m Chris Long with Nectiv, and today’s GEO tip is about understanding how ChatGPT’s models are actually trained. One of the biggest sources of influence is Common Crawl.

For those who don’t know, Common Crawl is a nonprofit organization that openly crawls the web and gives anyone free access to the resulting data. In theory, you could use Common Crawl’s data to train your own model or use it however you want.
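Because that index is public, you can check whether a given page has actually been captured. Here’s a minimal sketch using Common Crawl’s CDX index API (the crawl ID below is an assumption; pick a current one from index.commoncrawl.org):

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def cdx_query_url(page_url, crawl_id="CC-MAIN-2024-33"):
    """Build a query URL against a Common Crawl CDX index.

    crawl_id is an assumption for illustration; check
    index.commoncrawl.org for the list of available crawls.
    """
    params = urllib.parse.urlencode({"url": page_url, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

def lookup_captures(page_url, crawl_id="CC-MAIN-2024-33"):
    """Return a list of capture records (dicts), or [] if the page
    has no captures in that crawl (the index answers 404)."""
    try:
        with urllib.request.urlopen(cdx_query_url(page_url, crawl_id)) as resp:
            # The API returns one JSON object per line.
            return [json.loads(line) for line in resp.read().splitlines()]
    except urllib.error.HTTPError:
        return []
```

If `lookup_captures("yoursite.com/some-page")` comes back empty across recent crawls, that page most likely isn’t in the dataset at all.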

What’s really interesting is that if you look at the Wikipedia page for GPT-3, one of ChatGPT’s earlier models, you’ll see that 60% of its weighted training data came from Common Crawl. That’s the key point: if you want to be included in ChatGPT’s knowledge base, your content most likely needs to be eligible for Common Crawl and present in that dataset.

Of course, OpenAI uses other data sources as well, but Common Crawl is one of the biggest. If you’re publishing on platforms that aren’t in Common Crawl, like LinkedIn or Twitter, you might not see that content show up in ChatGPT, because it isn’t influencing the training data.

Interestingly, if you look at GPT-4, OpenAI has been much less transparent: they disclosed essentially no details about its training data. But the takeaway is this: if you’re trying to influence ChatGPT’s training data, focus on making sure your content is accessible to Common Crawl. That’s one of the key ways to influence the underlying training data.
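One concrete eligibility check you can run today: make sure your robots.txt isn’t blocking Common Crawl’s crawler, which identifies itself with the user-agent "CCBot". A minimal sketch using Python’s standard-library robots.txt parser:

```python
from urllib.robotparser import RobotFileParser

def ccbot_allowed(robots_txt: str, path: str = "/") -> bool:
    """True if Common Crawl's CCBot may fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", path)

# A site that blocks CCBot entirely is ineligible for Common Crawl:
blocking = "User-agent: CCBot\nDisallow: /"
print(ccbot_allowed(blocking))                    # False
print(ccbot_allowed("User-agent: *\nDisallow:"))  # True
```

In practice you’d fetch `https://yoursite.com/robots.txt` and pass its text to this function; if it returns False, CCBot can’t crawl you and your content won’t make it into the dataset.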