Everyone wants to get cited by ChatGPT. By Perplexity. By Gemini. By whatever model ships next quarter.

Most of the advice online is about formatting. Write clear headings. Use schema markup. Publish FAQ sections. Optimize for "AI SEO."

That advice is not wrong. But it misses the real question. The question that determines whether you can ever show up in an AI answer at all.

Where were you before the training cutoff?

That is the question. And most people do not like the answer.

The pipeline nobody talks about

An LLM does not browse the internet when you ask it a question. Not in its default mode. It generates answers from patterns learned during training. Patterns extracted from a specific set of data sources, assembled months or years before you typed your prompt.

The pipeline looks like this:

```mermaid
graph LR
    A["Data Sources<br/>Common Crawl, Wikipedia,<br/>arXiv, Books, Code"] --> B["Curation & Filtering"]
    B --> C["Training Cutoff Date"]
    C --> D["Model Weights<br/>(Frozen Knowledge)"]
    D --> E["User Prompt"]
    E --> F["Citation Behavior"]
```

Everything to the left of "Training Cutoff Date" happened in the past. The model cannot learn new things after that point. Not in its base form.

Yes, some models have web browsing. ChatGPT can search Bing. Perplexity crawls the web in real time. But even with retrieval-augmented generation (RAG), the model's baseline understanding of who you are, what you do, and whether you are credible was formed during training.

RAG retrieves. But the model decides what to trust based on what it already knows.

This has a practical consequence that most people ignore: if you did not exist in high-authority data sources before the training cutoff, the model has no reason to trust you when it encounters you during retrieval. You are an unknown entity. And unknown entities do not get cited with confidence.
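To make that asymmetry concrete, here is a deliberately toy scoring function. It is not how any production system works; it just shows the shape of the interaction described above: retrieval evidence gets discounted when the training-time prior is empty. The function, its weights, and the mention counts are all invented for illustration.

```python
# Toy model only: invented weights, invented inputs. No production LLM
# exposes anything like this. The shape is the point: retrieval evidence
# is discounted when the training-time prior is empty.

def citation_confidence(training_mentions: int, retrieved_support: float) -> float:
    """Combine a training-time prior with retrieval-time evidence.

    training_mentions: rough count of entity appearances in training data.
    retrieved_support: 0..1 score for how well retrieved pages back the claim.
    """
    # The prior saturates: going from 0 to 10 mentions matters far more
    # than going from 1,000 to 1,010.
    prior = training_mentions / (training_mentions + 10)
    # Retrieval cannot fully compensate for a missing prior: an unknown
    # entity with perfect retrieval still lands well below a known one.
    return 0.6 * prior + 0.4 * retrieved_support * (0.5 + 0.5 * prior)

print(citation_confidence(training_mentions=0, retrieved_support=0.9))    # ~0.18, unknown entity
print(citation_confidence(training_mentions=500, retrieved_support=0.9))  # ~0.94, known entity
```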

What is actually in the training data

Let's be specific. A 2024 Mozilla Foundation report analyzed 47 generative LLMs released between 2019 and 2023. Their finding: 64% of those models used at least one filtered version of Common Crawl data. For GPT-3, over 80% of training tokens came from filtered Common Crawl [1].

But Common Crawl is not the only source. LLM training datasets are assembled from multiple pools, each carrying different authority weight.

The source-by-source breakdown tells the story clearly:

| Source | Authority Level | Inclusion in Major LLMs | What Gets Extracted |
| --- | --- | --- | --- |
| Common Crawl (filtered) | Variable; depends on domain authority of crawled sites | GPT-3, GPT-4, LLaMA, Gemini, Claude (via derivatives like C4, RefinedWeb) | Full page text, entity mentions, factual claims, relationships between concepts |
| Wikipedia | Very high; community-edited, citation-backed, neutral | Every major LLM; often upsampled 3-5x relative to size | Entity definitions, relationships, biographical data, organizational facts, structured infoboxes |
| arXiv | High; peer-adjacent, institutional affiliations | GPT-3, LLaMA, Gopher, Chinchilla, PaLM | Research findings, methodologies, author-institution mappings, citation networks |
| Zenodo / institutional repositories | High; DOI-backed, persistent identifiers | Included via Common Crawl and specialized academic crawls | Datasets, whitepapers, author metadata, DOI-linked publications |
| ORCID | High; verified researcher identity | Indirectly, via academic crawls and cross-references | Author disambiguation, institutional affiliations, publication lists, verified identity links |
| Google Scholar profiles | Moderate-high; aggregates citation data | Indirectly, via Common Crawl and academic metadata | Citation counts, h-index signals, publication metadata, co-author networks |
| Books (digitized corpora) | High; editorial review, publisher backing | GPT-3 (Books1, Books2), LLaMA (Gutenberg), PaLM | Long-form knowledge, author authority, subject depth |
| StackExchange / Reddit | Moderate; community-vetted, upvote-filtered | LLaMA, GPT models, most open-source models | Q&A patterns, practical knowledge, community consensus signals |

Notice the pattern. The highest-authority sources are the ones with institutional backing, persistent identifiers, and editorial or community review processes. Your personal blog, even if it has great content, sits inside Common Crawl alongside billions of other pages. Wikipedia gets upsampled. arXiv gets its own dedicated dataset. Your blog gets treated like noise unless something else validates it.

Key concept: LLMs do not evaluate your content in isolation. They evaluate it in the context of where it appeared in the training data. The same paragraph on Wikipedia and on your blog carries different weight. The platform is the signal.

The authority hierarchy in AI citations

Not all training data sources contribute equally to citation behavior. Research from multiple studies paints a consistent picture. An Analyze AI study of 83,670 citations across ChatGPT, Claude, and Perplexity found that Wikipedia accounts for 47.9% of ChatGPT's citations [2]. Claude favors blog content at 43.8%. Perplexity leans on Reddit at 46.7%.

But across all platforms, certain source types consistently outperform others when it comes to being cited with attribution.

These numbers are illustrative, synthesized from multiple studies. The exact percentages vary by platform and query type. But the shape of the distribution is consistent: Wikipedia and academic sources dominate. Personal websites account for a tiny fraction.

This is not a ranking algorithm you can game. This is a reflection of what was in the training data and how heavily it was weighted.

The training cutoff problem

Here is where it gets uncomfortable for people who are just starting to think about AI visibility.

Every model has a training cutoff. GPT-4's original cutoff was September 2021. Claude's cutoff moves forward with each release. LLaMA models have their own cutoffs. The exact dates vary. But the principle is the same.

Anything that existed in the training data before the cutoff is "known" to the model. Anything that appeared after is unknown until the model encounters it through retrieval.

And here is the thing about unknown entities: the model treats them differently.

When ChatGPT retrieves information about a well-known entity through web browsing, it can cross-reference that information against its training data. It already has a representation of that entity. It can verify. It can confirm. It can cite with confidence.

When it retrieves information about an unknown entity, it has nothing to cross-reference against. The only information it has is what it just retrieved. And it knows, statistically, that web-retrieved information is less reliable than training data. So it hedges. It uses weaker language. It avoids direct citation. Or it skips you entirely and cites someone it already knows.

This is not a conspiracy. This is statistics. The model is doing exactly what it was trained to do: assign higher confidence to information it has seen multiple times from multiple sources.

Common Crawl is not democratic

People assume Common Crawl is a neutral snapshot of the internet. It is not.

Common Crawl uses Harmonic Centrality, a graph-theoretic measure, to decide crawl priority. Domains with higher centrality scores get crawled more frequently. More frequent crawling means more appearances in monthly archives. More appearances in archives means greater representation in the training data that LLM companies build from those archives [3].
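The measure itself is easy to reproduce on a toy link graph. The sketch below uses networkx; the domains and links are invented, and Common Crawl's real ranking runs over a web graph of billions of nodes, not twenty.

```python
# Toy reproduction of the measure, not Common Crawl's pipeline. Harmonic
# centrality of a page v sums 1/distance(u, v) over every other page u;
# unreachable pages contribute 0. Hubs everyone links to score high.
import networkx as nx

G = nx.DiGraph()
small_sites = [f"blog{i}.example" for i in range(20)]
# Every small site links to the hub; almost nobody links back.
G.add_edges_from((site, "hub.example") for site in small_sites)
G.add_edge("blog0.example", "blog1.example")  # one lone long-tail link

scores = nx.harmonic_centrality(G)  # directed graphs use incoming distances
print(round(scores["hub.example"], 1))    # 20.0 -> crawled early and often
print(round(scores["blog1.example"], 1))  # 1.0  -> one inbound link
print(round(scores["blog2.example"], 1))  # 0.0  -> invisible to the ranking
```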

The top domains in Common Crawl's web graph are exactly who you would expect: Facebook, Google, YouTube, Wikipedia, Amazon, Twitter. These domains appear thousands of times more frequently than the average website.

A Ziff Davis analysis found that the curation process for LLM training datasets dramatically increases the proportion of high-Domain Authority content. OpenWebText, the open replication of the WebText corpus behind GPT-2 and GPT-3, was built exclusively from pages linked in Reddit posts with 3+ upvotes [4]. That is not a democratic selection process. That is a popularity filter that amplifies already-popular sources.

So when you hear that "64% of LLMs were trained on Common Crawl data," understand what that means. It means they were trained on a filtered, authority-weighted subset of the internet where high-centrality domains are massively overrepresented.

If your domain sits in the long tail of Common Crawl's web graph, ranked below position 1,000,000, you have an invisible ceiling. Good content does not matter if the pipeline never sees it.

Wikipedia is not just another source

Every major LLM is trained on Wikipedia. Every single one. The Wikimedia Foundation confirmed this directly in 2023: "To date, every LLM is trained on Wikipedia content, and it is almost always the largest source of training data in their data sets" [5].

But it gets worse. Or better, depending on where you sit.

Wikipedia is not just included. It is upsampled. Stella Biderman, an AI researcher, explained why: "High quality factual information should be upsampled to achieve the best performance, as the repeated exposure to facts increases a LLM's ability to answer those factual questions correctly." In the LLaMA training mixture, Wikipedia accounts for 4.5% of the data by weight despite being a fraction of a percent of Common Crawl by volume.

This means entities that exist on Wikipedia are represented in the training data at 10-50x the rate you would expect from their raw internet footprint. That is an enormous advantage. It means the model has seen your entity described, defined, and contextualized dozens of times during training. It has learned your relationships, your attributes, your significance.
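That 10-50x range is easy to sanity-check with back-of-envelope arithmetic. The 4.5% mixture weight is LLaMA's published number; Wikipedia's share of raw filtered web volume is an assumed range used purely for illustration.

```python
# Back-of-envelope check, not an official statistic. The 4.5% mixture
# weight is from the LLaMA paper; the raw-volume shares are assumptions.
mixture_weight = 0.045  # Wikipedia's share of LLaMA training tokens

for raw_share in (0.001, 0.002, 0.005):  # plausible share of filtered web volume
    print(f"raw share {raw_share:.1%} -> effective upsampling ~{mixture_weight / raw_share:.0f}x")

# raw share 0.1% -> ~45x
# raw share 0.2% -> ~22x
# raw share 0.5% -> ~9x
```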

An entity without a Wikipedia page is fighting that weighting with one hand tied behind its back.

I write about this not as someone who has a Wikipedia page. I do not. Not yet. But I understand the mechanics, and understanding the mechanics is the first step toward building the entity infrastructure that eventually makes a Wikipedia page possible. That is the work I describe in the Trust Chain Methodology.

The two-mode reality of modern AI

Modern AI systems operate in two modes, and most people conflate them.

Mode 1: Training data only. The model answers from frozen knowledge. No web access. No retrieval. What you get is whatever the model learned during training. If you were not in the training data, you do not exist in this mode.

Mode 2: Retrieval-augmented generation (RAG). The model searches the web (via Bing for ChatGPT, its own crawler for Perplexity, Brave Search for Claude) and incorporates retrieved content into its answer. This is where "AI SEO" advice focuses.

Here is what the AI SEO crowd gets wrong: they treat Mode 2 as if Mode 1 does not exist. They optimize for retrieval without building the foundation of training data presence.

But Mode 1 shapes Mode 2. The model's prior beliefs, formed during training, influence how it interprets and weights retrieved information. If the model already "knows" a brand from training data, retrieved mentions reinforce an existing representation. If the model has never encountered a brand, retrieved mentions are treated with baseline skepticism.

This is why brand search volume, not backlinks, is the strongest predictor of LLM citations. A Digital Bloom analysis found a 0.334 correlation between brand search volume and LLM mentions [2]. Backlinks showed almost no correlation. Domain Authority was negligible.

Brand search volume is a proxy for training data presence. If many people searched for your brand on the web, your brand appeared on many pages. Those pages got crawled by Common Crawl. That data made it into the training pipeline. The model knows you.

As I discussed in AI Search Is Not SEO, the signals that matter for AI citation are fundamentally different from the signals that matter for Google ranking. This is the core of why traditional SEO practitioners are struggling with AI visibility.

What actually gets extracted

When training data is processed, the pipeline does not just dump raw HTML into the model. It extracts specific types of information:

Entity definitions. Who or what is this entity? What category does it belong to? What are its attributes?

Relationships. Who is this entity connected to? What organizations, people, concepts, and events are linked to it?

Factual claims. What specific, verifiable statements are made about this entity? Dates, numbers, locations, credentials.

Consensus signals. Do multiple sources agree on these facts? Wikipedia says X. A news article says X. A government database says X. Three independent confirmations create high confidence. (A toy sketch of this appears after the list.)

Authority markers. Does this content come from a source with institutional backing? Does it have persistent identifiers (DOIs, ORCID IDs, ISBN numbers)? Is it part of a review process?
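Mechanically, a consensus signal can be as simple as counting independent sources per normalized claim. The sketch below invents its sources and claims; it only illustrates the counting logic, not any real pipeline.

```python
# Invented sources and claims, purely illustrative: count how many
# independent sources assert the same normalized (subject, relation, object).
from collections import defaultdict

claims = [
    ("wikipedia.org", "jane doe", "founder of", "acme labs"),
    ("news.example",  "jane doe", "founder of", "acme labs"),
    ("gov.example",   "jane doe", "founder of", "acme labs"),
    ("blog.example",  "jane doe", "founder of", "beta corp"),  # unconfirmed claim
]

support = defaultdict(set)
for source, subject, relation, obj in claims:
    support[(subject, relation, obj)].add(source)

for fact, sources in support.items():
    confidence = "high" if len(sources) >= 3 else "low"
    print(fact, f"-> {len(sources)} source(s), {confidence} confidence")
```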

This is why platforms like ORCID, Zenodo, and OSF matter so much for entity infrastructure. I covered this in detail in Three Platforms That AI Trusts More Than Your Website. Those platforms provide exactly the signals that training pipelines extract and weight highly: persistent identifiers, institutional metadata, and cross-referenced verification.

Your website provides content. Those platforms provide trust signals. The model needs both, but it needs the trust signals first.

The 17-year domain age signal

Here is a data point that should make newer practitioners uncomfortable. Research from Ekamoira found that the average domain age of ChatGPT-cited sources is 17 years [6]. Seventeen years. That means established entities receive preferential treatment not because the model is biased toward old things, but because older domains have had more time to accumulate training data presence.

A domain that has existed since 2009 has been crawled by Common Crawl across dozens of monthly archives. It has had time to be referenced by other sources, picked up by Wikipedia editors, indexed by academic crawlers. Its entity representation in the training data is dense and cross-referenced.

A domain registered last year has maybe one or two Common Crawl snapshots. No Wikipedia mentions. Minimal cross-referencing. The model barely knows it exists.

This is not a death sentence for new entities. But it means you need to compensate through other channels. Academic repositories. Wikidata entries. Published books with ISBNs. Institutional affiliations. ORCID profiles. You need to exist in the places that training pipelines prioritize, because you cannot rely on your website alone to build sufficient training data presence.

Multi-platform presence as a multiplier

Research consistently shows that brands appearing on 4+ platforms are 2.8x more likely to appear in ChatGPT responses than single-platform brands [2]. Only 11% of domains are cited by both ChatGPT and Perplexity for the same query.

This means two things.

First, platform diversification matters. You cannot optimize for one AI system and expect coverage across all of them. Each platform has different retrieval architecture, different source preferences, and different training data composition.

Second, multi-platform presence is a signal in itself. When the model encounters your entity on Wikipedia, on ORCID, on Zenodo, on Google Scholar, on LinkedIn, and on your website, each occurrence reinforces the others. The model builds a richer, more confident representation. Single-source entities are fragile. Multi-source entities are durable.

This is the core principle behind entity infrastructure. It is not about building a better website. It is about building a presence across the databases, platforms, and repositories that training pipelines actually consume.

What you can do about this

I am a practitioner. I build these systems. So let me give you the practical framework.

Step 1: Audit your training data presence. Check if your domain appears in Common Crawl archives. Search the Common Crawl index at index.commoncrawl.org. If your domain has minimal coverage, you have a baseline problem that content optimization cannot fix.
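If you want to script that audit, the sketch below hits the public CDX index API behind index.commoncrawl.org. The endpoint and collinfo.json are real; details like crawl ordering and the 404-on-empty behavior are working assumptions, so verify them against the current docs, and keep request rates polite.

```python
# Minimal Common Crawl presence audit via the public CDX index API.
# Assumptions worth verifying: collinfo.json lists crawls newest-first,
# and the index answers 404 when a domain has no captures.
import requests

def commoncrawl_captures(domain: str) -> int:
    crawls = requests.get("https://index.commoncrawl.org/collinfo.json", timeout=30).json()
    cdx_api = crawls[0]["cdx-api"]  # most recent crawl's index endpoint
    resp = requests.get(cdx_api, params={"url": f"{domain}/*", "output": "json"}, timeout=60)
    if resp.status_code == 404:  # no captures for this domain
        return 0
    resp.raise_for_status()
    return len(resp.text.splitlines())  # one JSON record per captured page

print(commoncrawl_captures("example.com"))
```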

Step 2: Establish institutional anchors. Get on ORCID. Publish something on Zenodo with a DOI. Create a Wikidata entry for your organization. These are not vanity exercises. These are entries into high-authority training data sources.

Step 3: Build the Wikipedia path. You probably cannot create a Wikipedia page today. Most people and companies do not meet notability requirements. But you can start building the evidence base: published works, press coverage, institutional affiliations, documented impact. Wikipedia is downstream of these activities.

Step 4: Publish where training pipelines look. arXiv if you do research. Google Play Books if you publish. Medium or LinkedIn for industry analysis (these are crawled heavily by Common Crawl). GitHub for technical work. Each of these platforms has higher crawl frequency and training data representation than your personal domain.

Step 5: Create consensus signals. The same facts about your entity should appear consistently across multiple sources. Your name, title, affiliations, and expertise should match on your website, LinkedIn, ORCID, Wikidata, and any publication profiles. Inconsistencies reduce AI citation confidence by 30-40%.
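In practice this audit can be a simple diff. The profile data below is hypothetical; the point is flagging disagreements before they propagate into the next training snapshot.

```python
# Hypothetical profile data: flag fields where platforms disagree.
profiles = {
    "website":  {"name": "Jane Doe", "title": "Founder",    "org": "Acme Labs"},
    "linkedin": {"name": "Jane Doe", "title": "Founder",    "org": "Acme Labs"},
    "orcid":    {"name": "Jane Doe", "title": "Founder",    "org": "Acme Labs GmbH"},
    "wikidata": {"name": "Jane Doe", "title": "Co-Founder", "org": "Acme Labs"},
}

fields = {field for profile in profiles.values() for field in profile}
for field in sorted(fields):
    values = {platform: profile.get(field) for platform, profile in profiles.items()}
    if len(set(values.values())) > 1:
        print(f"MISMATCH in '{field}': {values}")
```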

Step 6: Think in training cycles. New model versions ship every 3-6 months. Each new version has a more recent training cutoff. What you publish today may not appear in the current model, but it will appear in the next one. This is a long game. Entity infrastructure built today compounds over every subsequent training cycle.

The uncomfortable truth

Most AI visibility advice focuses on what you can do after the model already exists. Formatting. Schema markup. FAQ sections. These are real tactics for Mode 2 retrieval. They help.

But the foundational layer, the one that determines whether you are a known entity or an unknown one, is determined by the training data. And the training data was assembled months or years ago from sources you may never have thought about.

If you want AI systems to cite you with confidence, you need to exist in the places they learned from. Not just the places they search.

That is entity infrastructure. Not SEO for AI. Infrastructure.

I build it. I document it. I run three companies and I have spent years figuring out which levers actually move the needle on entity visibility. The answer is not a checklist of formatting tricks. The answer is a systematic approach to making yourself findable in the datasets that AI models actually train on.

That work starts before the training cutoff. And it compounds with every cycle.


Frequently Asked Questions

Does optimizing my website for AI actually help if I am not in the training data?

It helps for retrieval-augmented generation (Mode 2), where the model searches the web in real time. But your ceiling is lower. Models assign higher confidence to entities they already know from training data. Website optimization without training data presence is like polishing a resume that nobody will read. You need both layers: training data presence for credibility, and website optimization for discoverability.

How do I check if my content is in Common Crawl?

Use the Common Crawl Index Server at index.commoncrawl.org. Select a recent crawl period, enter your domain, and see which pages were captured. If your domain has few or no results, Common Crawl's crawler is not prioritizing you. This means your content is likely underrepresented in LLM training datasets. Building links from higher-centrality domains can improve your crawl priority over time.

Why does Wikipedia matter so much more than other sources for AI?

Three reasons. First, every major LLM includes Wikipedia in its training data. No exceptions. Second, Wikipedia is upsampled during training, meaning entities on Wikipedia are represented at 10-50x their proportional internet footprint. Third, Wikipedia follows strict sourcing requirements, which means the training pipeline treats it as a pre-vetted, high-confidence source. An entity on Wikipedia starts with a credibility advantage that no amount of website optimization can replicate.

Can I get cited by Perplexity without being in training data?

Yes. Perplexity uses real-time web retrieval for every query, so it can cite content regardless of training data presence. But citation likelihood still depends on content quality, structural clarity, and third-party validation. Perplexity's reranker evaluates freshness, semantic relevance, and authority signals. Having a strong entity presence across multiple platforms improves your chances. But Perplexity is one platform. ChatGPT, Claude, and Gemini all rely more heavily on training data for their baseline responses.

How long does it take for new content to appear in AI training data?

It depends on the model release cycle. Major LLMs update their training data every 3-6 months. Content published today might appear in the next model version's training data if it gets crawled by Common Crawl and included in the filtered dataset. But this is not guaranteed. High-authority sources get crawled more frequently and are more likely to be included. The practical timeline for building meaningful training data presence is 6-18 months for most entities starting from scratch.

References

  1. Mozilla Foundation. "Training Data for the Price of a Sandwich." 2024. Analysis of 47 generative LLMs and their training data sources.
  2. The Digital Bloom. "2025 AI Visibility Report: How LLMs Choose What Sources to Mention." 2025.
  3. Common Crawl. "How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals." 2025.
  4. Ziff Davis. "The Predominant Use of High-Authority Commercial Web Publisher Content to Train Leading LLMs." 2024.
  5. Wikimedia Foundation. "Wikipedia's Value in the Age of Generative AI." 2023.
  6. Ekamoira. "The Science of AI Citations: How LLMs Choose What Sources to Reference." 2026.
