Your LLMs Were Backdoored Years Ago.

Feb 4, 2025

3 min read

Plagiarism is an ethical violation. Always has been. As such:

“A computer can never be held accountable, therefore a computer must never make a management decision” (IBM training presentation, 1979).

A legal violation? That’s for courts to decide.

But who controls the Large Language Models? Well, unless you’ve invented time travel, it isn’t OpenAI, Anthropic, DeepSeek, etc…

You see, by volume, the training data wasn’t theirs to take. LLM developers lost the war before showing up for the first battle.

Let me explain.

tl;dr: All of academic publishing has been invisibly watermarked for years.

Recent Parallels

There’s this individual, “Pliny the Liberator,” who has been jailbreaking LLMs (removing their instilled guardrails) for so long that newer models are actually aware of their work. As a result, merely mentioning that work, or the L1B3RT4S keyword/project, now does the majority of the work needed to jailbreak newer models. As these keywords and this individual’s influence spread across the internet, so they spread into the training data for each new generation of models.

LLM developers and their safety teams have been locked in a near-constant back-and-forth: adding more guardrails, only to watch them bypassed by the snake eating its own tail.

Checkmate

“Well, they should simply train on data before Pliny started their research, or filter it from the training data!”

Barring social media, wikis, and news sites… there’s really not that much written knowledge on the internet by volume. Oh wait, how could I forget?

Academic Publishing.

You see, academia (journals, the calculation of impact scores, and so on) went digital many years ago. Distributing academic publishing through digital media (PDFs, PowerPoints, eBooks, etc…) also solved the problem of invisibly watermarking that media, in ways that allow its propagation to be tracked even after it’s downloaded. Should a file appear on piracy sites like LibGen, Z-Library, or Sci-Hub, there are ways to trace who originated it. The publishers asked for these features.
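How might such a watermark work? Below is a minimal sketch of one classic trick: hiding a per-download tracking ID in zero-width Unicode characters. To be clear, real publisher schemes are proprietary, and this example is an illustrative assumption only; actual watermarks might instead live in font metrics, document metadata, or subtle wording variations.

```python
# Hypothetical illustration: embed a tracking ID in text using
# zero-width Unicode characters. Not any publisher's actual scheme.

ZW0 = "\u200b"  # ZERO WIDTH SPACE      -> encodes bit 0
ZW1 = "\u200c"  # ZERO WIDTH NON-JOINER -> encodes bit 1

def embed(text: str, tracking_id: int, bits: int = 32) -> str:
    """Hide `tracking_id` as invisible bits after the first space."""
    payload = "".join(
        ZW1 if (tracking_id >> i) & 1 else ZW0
        for i in reversed(range(bits))
    )
    head, sep, tail = text.partition(" ")
    return head + sep + payload + tail

def extract(text: str, bits: int = 32) -> int | None:
    """Recover a hidden ID from watermarked text, if one is present."""
    stream = [c for c in text if c in (ZW0, ZW1)][:bits]
    if len(stream) < bits:
        return None
    return int("".join("1" if c == ZW1 else "0" for c in stream), 2)

marked = embed("The quick brown fox jumps over the lazy dog.", 0xC0FFEE)
print(marked == "The quick brown fox jumps over the lazy dog.")  # False
print(hex(extract(marked) or 0))  # 0xc0ffee -- invisible on screen
```

The marked copy renders identically on screen, but any scraper that doesn’t explicitly strip zero-width characters carries the ID, and whatever statistical fingerprint it leaves, straight into a training corpus.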

That’s right! There are invisible biases embedded in the large majority of any training dataset that so much as touches academia.

In the News

As of January 2025, Meta (Facebook) is being sued for scraping “millions” of pirated academic works from LibGen.

U.S. District Judge Vince Chhabria last year dismissed claims that text generated by Meta’s chatbots infringed the authors’ copyrights and that Meta unlawfully stripped their books’ copyright management information (CMI).

Well, they tried, at least. The same goes for all the other models.

gg ez no re

This isn’t a commentary on plagiarism or copyright or legality. Really, this should help you sleep better at night: no matter how hard these companies try to sell us on AGI or “research” models, you can laugh until you cry that they really thought they could steal from the well of knowledge and turn around and sell it back to us through SaaS. Further, this is likely just one set of backdoors into LLMs that has yet to even be noticed.

Never fear the AI overlords; they literally cannot win a rigged game. Someone always has the keys.

Sharing is caring!