Everyday Users Power Modern AI Responses
In an era where artificial intelligence (AI) systems like ChatGPT, Grok, and Gemini seem to possess near-encyclopedic knowledge, it's easy to overlook the origins of their "intelligence."
Contrary to the myth of AI as an autonomous thinker, these models are fundamentally products of vast datasets scraped from the internet—predominantly user-generated content from platforms like Reddit, forums, and social media.
With this letter, I want to explore how AI relies on sources such as Reddit and "literal users" (i.e., everyday individuals contributing posts, comments, and discussions) to train models and generate answers.
The Foundation: Pre-Training on Massive Internet Datasets
AI models, particularly large language models (LLMs), undergo a "pre-training" phase where they ingest enormous volumes of text to learn patterns in language, facts, and reasoning.
The primary source for this is the open web, aggregated through datasets like Common Crawl (the OG!)—a nonprofit archive that has scraped over 100 billion web pages since 2008.
Common Crawl forms the backbone of many LLMs, comprising up to 60% of the training data in models like GPT-3.
Within this, user-generated content dominates, as forums and social sites provide diverse, conversational data that mimics human interaction.
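If you want to picture what that ingestion step actually looks like, here is a deliberately tiny sketch of the filtering and deduplication that happens before pre-training. The file names are made up for illustration, and the filters are toy versions of what real pipelines run over terabytes of Common Crawl dumps:

```python
import hashlib
from pathlib import Path

# Hypothetical paths; real pipelines process terabytes, not one text file.
RAW_DUMP = Path("common_crawl_sample.txt")   # one scraped document per line
CORPUS_OUT = Path("pretraining_corpus.txt")

seen_hashes = set()
kept = 0

with RAW_DUMP.open(encoding="utf-8") as src, CORPUS_OUT.open("w", encoding="utf-8") as dst:
    for line in src:
        doc = line.strip()
        # Crude quality filter: drop tiny fragments and obvious boilerplate.
        if len(doc.split()) < 20 or "cookie policy" in doc.lower():
            continue
        # Exact-duplicate removal via content hashing (real pipelines add fuzzy dedup too).
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        dst.write(doc + "\n")
        kept += 1

print(f"kept {kept} documents for pre-training")
```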
Reddit, in particular, stands out as a goldmine. With over 17 billion posts, it's a treasure trove of natural language discussions on everything from science to pop culture.
In GPT-3's training mix, for instance, the "WebText2" dataset—derived from highly upvoted Reddit links—accounts for about 20% of the data (sheesh), emphasizing Reddit's role in shaping model outputs.
This isn't coincidental; Reddit's threaded discussions, memes, and debates provide "human-like" data that helps models understand context, humor, and nuance.
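The original WebText recipe behind this is simple enough to sketch in a few lines: treat Reddit upvotes as a free human quality signal and only scrape links that cleared a karma threshold. The `posts` list below is invented for illustration; the 3-karma cutoff is the one reported for the original WebText dataset:

```python
# Toy sketch of the WebText-style heuristic: keep only outbound links that
# Reddit users upvoted, on the theory that upvotes are a cheap quality filter.
posts = [
    {"url": "https://example.com/good-article", "karma": 57},
    {"url": "https://example.com/spam", "karma": 0},
    {"url": "https://example.com/solid-tutorial", "karma": 4},
]

KARMA_THRESHOLD = 3  # cutoff reported for the original WebText dataset

quality_links = [p["url"] for p in posts if p["karma"] >= KARMA_THRESHOLD]
print(quality_links)  # these pages would then be scraped and added to the corpus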
Reddit’s influence on GPT-3’s training mix (weights reported in the GPT-3 paper):

| Dataset | Share of training mix | User-generated? |
| --- | --- | --- |
| Common Crawl (filtered) | ~60% | Largely (scraped web, forums, blogs) |
| WebText2 (Reddit-curated links) | ~22% | Curated by Reddit upvotes |
| Books1 + Books2 | ~16% | No |
| Wikipedia | ~3% | Yes (volunteer-edited) |

This table illustrates how user-driven platforms like Reddit amplify AI's ability to handle informal queries: models "predict" responses based on patterns seen in training data—e.g., completing "The Eiffel Tower is in..." with "Paris" because that pairing appears so often in user posts.
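You can see this pattern completion for yourself with a small public model. This sketch assumes `transformers` and `torch` are installed, and uses GPT-2 purely as a stand-in for the far larger models discussed here:

```python
# Requires `pip install transformers torch`; GPT-2 is a small, public stand-in.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The Eiffel Tower is in", max_new_tokens=3, do_sample=False)
print(out[0]["generated_text"])
# The continuation reflects whatever pattern dominated the training text;
# for this prompt, the dominant pattern in web text is "Paris".
```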
From Parametric Knowledge to Retrieval-Augmented Generation
Once trained, AI doesn't "remember" facts like a database; it generates responses probabilistically based on training patterns. When answering questions, it draws on two mechanisms:
Parametric Knowledge: Embedded from training data. If a query aligns with Reddit-heavy patterns (e.g., "best hiking tips"), the response might echo advice from subreddits like r/hiking.
Retrieval-Augmented Generation (RAG): Advanced systems search live sources, including Reddit, to fetch up-to-date info. Tools like Perplexity AI or Grok use web searches to cite Reddit threads directly.
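To make the difference concrete, here is the skeleton of a RAG pipeline. Both helper functions are placeholders I made up; real products wire in an actual search API and a hosted model, but the retrieve-then-stuff-then-generate shape is the same:

```python
# Minimal RAG sketch. Both helpers are hypothetical stubs, not a real API.

def search_reddit(query: str, k: int = 3) -> list[str]:
    """Placeholder retriever: would call a search API and return thread snippets."""
    return [f"[stub snippet {i} for: {query}]" for i in range(k)]

def call_llm(prompt: str) -> str:
    """Placeholder generator: would call a chat/completions endpoint."""
    return f"[stub answer grounded in {prompt.count('SOURCE')} sources above]"

def answer_with_rag(question: str) -> str:
    snippets = search_reddit(question)                      # 1. retrieve live sources
    context = "\n".join(f"SOURCE {i + 1}: {s}" for i, s in enumerate(snippets))
    prompt = (                                              # 2. stuff them into the prompt
        f"Answer using only the sources below and cite them.\n{context}\n"
        f"QUESTION: {question}"
    )
    return call_llm(prompt)                                 # 3. generate a grounded answer

print(answer_with_rag("best hiking tips for beginners"))
```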
But this reliance introduces biases: models trained heavily on Reddit can absorb brainrot (low-quality or meme-heavy content), leading to quirky or inaccurate outputs.
I mean, we all remember the example of that Indian call center hiring 12 software devs to masquerade as an AI agent solving code problems in real time… At least that was funnier than paying Reddit $60 mil (drops in a bucket) to train data-harvesting LLMs on human behavior and quirks, only to turn around and mimic human behavior back to humans...
This should also empower you as a human again. We are literally the center of all creation. All AI can do is copy!
We should also create things ourselves, knowing that the best AI can do is speed up our lives by hypersourcing human creation. Our creations should mimic God’s creation on a smaller scale.
Obviously, the latest models like Claude 4 handle semantic language so well that it’s almost a moot point to worry now about how they were trained; the fact of the matter is that they were trained and are currently capable of complex coding tasks. I am not saying there is no use for AI or anything of the sort, but its potential to be “solving the world’s problems” is remarkably blown out of proportion.
It shines best when paired with an IDE that can simplify coding with automation. That is basically how I like to use AI, or to help me research some of these letters. But even then, I check the sources behind some of these AI scrapers, and it turns out 10 of the sources were just Pinterest posts or something ridiculous, and I’m like… tf am I doing.
I still recommend that everyone build their own PRD.MD first, before trying to one-shot an app on Replit or something, because then you actually created something yourself and are just using AI to bring it to life, as opposed to asking AI to “build me a 10bil company lol”.
WE HAVE ALL THE POWER IF WE EMBRACE IT AND DO THE WORK!
In summary, AI's "smarts" stem largely from platforms like Reddit and the contributions of literal users, as evidenced by training compositions, billion-dollar deals, and retrieval mechanisms. This democratizes knowledge but underscores the need for ethical data practices. As AI evolves, understanding these roots will be key to navigating its outputs responsibly.
MORAL: THINK FOR YOURSELVES, RESEARCH YOUR NEW STARTUP IDEA, CREATE YOUR MVP WITHOUT ANY AI WHATSOEVER; ONLY THEN SHOULD YOU USE AI TO BUILD YOUR CODE FROM THAT ORIGINALLY CREATED MVP/PRD.MD.
What do you think? Iykyk
God-Willing, see you at the next letter
GRACE & PEACE