Carlos Eduardo Ravello Joo · May 2026 · Knowledge systems · AI · Academic publishing

The third knowledge regime: when AI absorbs what academia and open repositories cannot

During the four days it took to write the paper behind this essay, GitHub logged the creation of more than 1,300,000 new repositories. Hugging Face added approximately 1,600 new AI models. arXiv received between 2,600 and 3,700 new papers. That is roughly 230 repositories per minute. Not as spectacle. As symptom.
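The per-minute figure follows from simple division. A minimal sketch, assuming the lower bound of 1,300,000 repositories and exactly four days as the only inputs (the variable names are illustrative, not from the paper):

```python
# Back-of-envelope check of the repository creation rate quoted above.
# Both inputs come from the essay; the names are illustrative.
NEW_REPOS = 1_300_000      # lower bound reported for the period
DAYS = 4                   # time it took to write the paper

minutes = DAYS * 24 * 60   # 5,760 minutes in four days
repos_per_minute = NEW_REPOS / minutes

print(f"{repos_per_minute:.0f} repositories per minute")
```

With the lower bound this gives about 226 per minute. Since the count is reported as "more than" 1,300,000, a figure near 230 is a fair rounding.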

The symptom points to a question knowledge institutions are not answering with sufficient honesty: why does an architecture without metabolism, without a professional career, without reputational fear, without biographical time, operate cognitively faster than systems specifically designed to produce and validate knowledge?

The usual answer — that AI has more computational power — is true but incomplete. Computational power explains processing speed. It does not explain why a five-year clinical study can remain invisible to the systems building knowledge at scale today, while a poorly structured GitHub repository gets indexed within hours.

There is a more uncomfortable answer: AI's operational advantage emerges both from computational capacity and from the elimination of accumulated institutional frictions that limit collective human cognition. This is not an intelligence problem. It is an architecture problem.

The institution that ran out of time

Peer review was born in a world where the problem was informational scarcity. Little knowledge was produced, and the challenge was protecting it from contamination. Slowness was an adaptive feature, not a defect. Careful validation was the reasonable price of reliability.

That world no longer exists. In 2025, industry produced more than 90% of notable frontier AI models, with development cycles of weeks. On benchmarks like SWE-bench Verified, performance went from approximately 60% to nearly 100% in a single year. The average doctoral thesis in the United States takes 7.3 years from program start. In that time, AI renewed its capabilities completely between four and seven times.

The documented result: relevant research becomes obsolete during its own validation process. Gartenberg et al. (2026), analyzing 6,957 submissions and 10,389 reviews in Organization Science between 2021 and 2026, documented the collapse of the quality-quantity trade-off driven by the massive increase in post-ChatGPT submissions — up 42% — without proportional scaling of review infrastructure.

More seriously: the system does not consistently produce what it promises. The Open Science Collaboration (2015) found that only 36–39% of psychological studies replicated successfully. Camerer et al. (2018) documented similar results in economics and social sciences. The Retraction Watch database exceeds 58,000 entries. In the first weeks of 2026, one in every 277 papers indexed in PubMed cited a nonexistent AI-generated reference — up from one in 458 in 2025 and one in 2,828 in 2023.
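The three "one in N" figures are easier to compare as percentages and as a growth factor. A short sketch that uses only the numbers quoted above (the dictionary layout is an illustrative choice):

```python
# Convert the "one in N papers" figures quoted above into percentages.
one_in = {2023: 2828, 2025: 458, 2026: 277}  # year -> N, taken from the essay

for year, n in one_in.items():
    print(f"{year}: {100 / n:.3f}% of indexed papers")

# How many times more frequent the problem is in early 2026 than in 2023.
growth = one_in[2023] / one_in[2026]
print(f"2023 -> 2026: about {growth:.1f}x")
```

By this arithmetic the share of affected papers rose roughly tenfold in three years, from about 0.035% to about 0.36%.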

Richard Smith spent 25 years at the British Medical Journal, 13 of them as its editor. His verdict is the title of his 2006 essay: "Peer review: a flawed process at the heart of science and journals." Not an external critic. The person who administered it.

The chaos that looks like freedom

The response to the slow failure was speed without filtering. Open repositories — GitHub, Zenodo, Hugging Face, arXiv, OSF — solved the problems of technical invisibility and bureaucratic slowness. A DOI in hours. Immediate global visibility. Versioning. Citability. These are real gains.

But speed without methodology produces its own failure. npm recorded spam campaigns of up to 17,000 packages per day at 2025 peaks, with more than 67,000 fraudulent packages accumulated over two years. PyPI added 356 new projects daily in 2025. Hugging Face hosted over 1.5 million models as of March 2025, with hundreds of new repositories of indeterminate quality added every day.

The most revealing data point: Carneiro et al. (2020) found that the reporting quality difference between preprints and reviewed papers is approximately 5%. After months of editorial process, the improvement is marginal. This means either that peer review does not add as much value as it claims, or that open repositories are not as chaotic as their critics argue. Either way, the validation system does not justify its cost with its result.

Noise is not just a quality problem. It is a training problem. The language models that hallucinate most are those operating on scarce, contradictory, or low-quality data. When open repositories produce noise at scale, that noise becomes signal for systems building automated knowledge. Chaos does not stay contained. It scales.

The third actor

Classical academia and open repositories appear opposed. Functionally they are distinct mechanisms for solving the same problem: how to filter information under human limits of attention and coordination. Academia manages uncertainty through hierarchy, scarcity, credentials, and exclusion. Open repositories through abundance, iteration, massive exposure, and memetic selection. Both are responses to different historical conditions. Neither was designed for the current scenario.

AI enters as a third epistemological regime. It does not validate like academia. It does not explore chaotically like the open internet. It synthesizes, recombines, compresses, and accelerates cognitive navigation. But it has a critical structural limit: its quality depends entirely on the quality of what it ingests. Stanford demonstrated that LLMs reach 96% useful responses when combined with verified structured data parsing, versus frequent errors without that base. AI is not epistemically self-sufficient. It is a compression machine that amplifies what it receives — good or bad.

This introduces the central concept of the full paper: epistemological coordination costs. The contemporary knowledge crisis does not emerge simply from an excess or absence of filters, but from the institutional inability to reconcile epistemic validation and cognitive speed under conditions of informational overproduction. AI radically reduces those coordination costs. Human institutions increased them until they became unsustainable.

The infrastructure no one talks about

Google migrated its Knowledge Graph infrastructure from Freebase to Wikidata between 2014 and 2016. Wikidata is today a primary source for that graph, which Google describes as holding 500 billion facts about 5 billion entities. Google's patents explicitly reference Wikidata for entity attribute extraction. LangChain launched official Wikidata integration in 2024. Amazon Alexa, Apple Siri, and Microsoft integrate the same base. 72% of Wikipedia articles use Wikidata for their infoboxes.

Wikidata is not an encyclopedic repository. It is the knowledge-graph architecture beneath the systems that build the semantic world LLMs ingest. Without a node in Wikidata, Google cannot resolve the entity. Without entity resolution, dispersed mentions on the web do not accumulate toward any center. The work exists, but to the system it is nobody.

That filter is administered in part by anonymous editors applying 19th-century notability criteria with 20th-century tools to 21st-century phenomena. An editor without a name, without verifiable expertise in the domain, can delete the technical work of an independent researcher in minutes, invoking notability criteria that do not distinguish between research without institutional affiliation and content without value. Without effective right of reply. Without accountability for the impact. In total anonymity.

The Wikidata community acknowledged the problem in its February 2026 notability policy reform RFC: certain knowledge has been and is being structurally marginalized, leading to less coverage in reliable sources and therefore an increased barrier to demonstrating notability. That is the circular lock in its most precise formulation: the system demands notability to enter, but notability is built from the visibility the system withholds from whatever has not yet entered. Not a governance bug. The structure itself.

The technical consequences are concrete. Without a node in Wikidata: no entry to Google's Knowledge Graph, no Knowledge Panel, no entity resolution for scattered mentions, no recognition by LLMs, no appearance in AI-generated responses. The virtuous cycle of semantic authority never starts.
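Whether a label resolves to a Wikidata node can be checked directly. The sketch below is one possible approach, assuming the public MediaWiki `wbsearchentities` action on wikidata.org. The helper names and the User-Agent string are hypothetical, and a label match only shows that some item carries that label, not that it is the right one.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://www.wikidata.org/w/api.php"


def build_search_url(label: str, language: str = "en") -> str:
    """Build a wbsearchentities query URL for a label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def matching_ids(response: dict) -> list[str]:
    """Extract item IDs (e.g. 'Q42') from a parsed API response."""
    return [hit["id"] for hit in response.get("search", [])]


def find_nodes(label: str) -> list[str]:
    """Network call: return the IDs of items whose label matches."""
    # Hypothetical User-Agent; a descriptive one is expected by Wikimedia.
    request = Request(
        build_search_url(label),
        headers={"User-Agent": "node-check-sketch/0.1"},
    )
    with urlopen(request, timeout=10) as resp:
        return matching_ids(json.load(resp))
```

An empty list from `find_nodes("Some Researcher")` would correspond to the situation described above: mentions may exist across the web, but there is no entity for them to attach to.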

The floor moved

Publishing in Scopus-indexed journals, in Nature, or with Elsevier is not just slow. In 2025, it is also an act of progressive technical invisibility. Major academic publishers operate with hard paywalls that serve Googlebot only metadata or initial snippets. In 2025 there was a 336% increase in sites actively blocking AI crawlers like GPTBot and ClaudeBot, with major publishers among the most aggressive. A Rutgers–Wharton study (April 2026) found that publishers who blocked LLM crawlers lost approximately 7% of weekly traffic in the six weeks following the block.
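Crawler blocking of this kind is usually declared in a site's robots.txt, so it can be inspected with Python's standard `urllib.robotparser`. The sketch below runs against a hypothetical robots.txt and a placeholder URL. It is not taken from any real publisher, and it says nothing about paywalls, which work at a different layer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a site that blocks two AI crawlers
# but leaves every other user agent unrestricted.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Report, per user agent, whether robots.txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    SAMPLE_ROBOTS,
    "https://example.org/article/123",   # placeholder URL
    ["GPTBot", "ClaudeBot", "Googlebot"],
)
print(result)
```

For this sample the two AI crawlers are refused and Googlebot is allowed, which is the asymmetric pattern the paragraph describes.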

University repositories complete the picture: unstructured URLs, orphaned pages without interlinks, no schema markup, unoptimized sitemaps, pre-modern loading speeds. Googlebot has a crawl budget. It is a time economy. It will not get lost in a 2003 technical architecture. It goes where the food is served: Zenodo, arXiv, OSF, GitHub Pages.
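As an illustration of the schema markup those repositories lack, here is a minimal sketch that builds a schema.org `ScholarlyArticle` description as JSON-LD. The choice of properties and the function name are assumptions. The values are this essay's own metadata, and the year-month date is a simplification.

```python
import json


def scholarly_jsonld(title, author, orcid, doi, date_published, license_url):
    """Return a schema.org ScholarlyArticle description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "headline": title,
        "author": {
            "@type": "Person",
            "name": author,
            "sameAs": f"https://orcid.org/{orcid}",
        },
        "datePublished": date_published,
        "identifier": f"https://doi.org/{doi}",
        "license": license_url,
    }


markup = scholarly_jsonld(
    title="The Third Knowledge Regime",
    author="Carlos Eduardo Ravello Joo",
    orcid="0009-0007-5631-7436",
    doi="10.5281/zenodo.20298744",
    date_published="2026-05",   # assumption: only month and year are given
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

# The JSON string would go inside a <script type="application/ld+json"> tag.
print(json.dumps(markup, indent=2, ensure_ascii=False))
```

Embedded in a landing page, a block like this gives a crawler the title, author, identifier, and license without having to infer them from the layout.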

The researcher who publishes only in traditional channels is not being more rigorous. They are being invisible. And in the current ecosystem, invisible to machines is invisible to the knowledge already being built at scale.

The day the first AI said hello changed the minimum threshold of what it means to do science seriously. That threshold rose for everyone — for those who research and for those who validate. The problem is that only some are being required to notice that the floor moved. A peer reviewer who does not master the tools with which the knowledge they review is produced is not a guardian of rigor. They are a bottleneck with a title.

Popper did not ask for anonymous peers or institutional titles. He asked for real exposure to refutation. The system that claims that heritage today delivers something else: in Nature's survey of 1,576 researchers, more than 70% had tried and failed to reproduce another scientist's experiment, and more than 50% had failed to reproduce their own. Of the researchers themselves, 83% acknowledge that a reproducibility crisis exists. Falsifiability did not leave because AI arrived. It left the institutions earlier, silently, buried in the same system that proclaimed it. But real exposure to refutation did not disappear; it shifted. Open repositories, public benchmarks, and code that either works or does not apply more genuine Popperian pressure to a claim than eight months of peer review followed by a paywall that blocks Googlebot.

Who trains the systems that build tomorrow

Algorithmic weight does not distinguish between truth and opinion. It distinguishes between engagement and silence. An influencer with 10 million followers opining on vaccines, economics, or mental health generates more signal in the graph than a researcher with 40 years of clinical work publishing in a repository that Googlebot never visits. Not because the algorithm is malicious. Because the algorithm optimizes what it was asked to optimize: attention, screen time, interaction.

If rigorously documented knowledge cannot compete in that graph, the problem is not epistemological. It is civilizational. The question is not whether academia is slow. The question is who is going to train the systems that will build the reality of the next generations. If that training is done with the algorithmic weight of opinions without method, without falsifiability, without verifiable record, the result is not a faster world. It is a world where the difference between truth and narrative disappears — not through ignorance but through architecture.

AI has a real ceiling of comprehension and too much imagination. LLMs hallucinate more when verified sources are scarce or blocked. If humans do not leave a well-documented record, machines will fill that vacuum with the imagination they have to spare. And they will do it with the speed that institutions never had.

Documenting well is not academic vanity. It is the only way for the knowledge we produce today to survive the algorithm that decides tomorrow what is real.

Carlos Eduardo Ravello Joo
Founder and independent researcher
Trujillo — Lima, Peru · May 2026
ORCID: 0009-0007-5631-7436
carlosravello.com

Academic preprint · DOI: 10.5281/zenodo.20298744
Full paper: The Third Knowledge Regime (preprint)
License: CC BY 4.0
