Generative AI: What You Need To Know


All models memorise and overfit

Memorisation is the industry term for when an AI model copies and stores something directly from its training data. Overfitting is when a model fits its training data too closely, so that its output reproduces memorised material verbatim. This happens in all language and diffusion models.

Large models memorise and copy more

The rate of memorisation increases with model size and seems to be part of what drives the improved performance of larger models.

Copying rate is roughly 1%

Across GitHub Copilot, Stable Diffusion, and many language models, verbatim copying from the training data set into the output happens around 0.1–1% of the time, depending on whether the vendor is actively trying to minimise it. That is a very high rate for a tool a team uses daily.
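As a rough illustration of how researchers flag this kind of copying, here is a minimal sketch (not from the book or the cited papers) of a common approach: check whether any long-enough n-gram in a model's output also appears word-for-word in the training corpus. The 8-token threshold and direct corpus access are assumptions chosen for the example; real studies use far larger corpora and more careful tokenisation.

```python
def ngrams(tokens, n):
    """Yield every consecutive n-token window of the token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def verbatim_overlaps(output_text, corpus_texts, n=8):
    """Return the n-grams from the output that occur verbatim in the corpus."""
    corpus_grams = set()
    for doc in corpus_texts:
        corpus_grams.update(ngrams(doc.split(), n))
    return [g for g in ngrams(output_text.split(), n) if g in corpus_grams]

# Hypothetical corpus and model output, for illustration only.
corpus = ["the quick brown fox jumps over the lazy dog every single day"]
generated = "we saw the quick brown fox jumps over the lazy dog again"
hits = verbatim_overlaps(generated, corpus, n=8)
print(len(hits))  # count of 8-gram matches found in the output
```

A check like this only catches exact copies; as the later cards note, inexact copying can still be infringement, and detecting that requires fuzzier similarity measures.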

Many of these copies are clear copyright violations

Sometimes chunks of the training data are copied exactly; sometimes only elements of a work are. Either way, it happens often enough that it has already surfaced as an issue on social media and elsewhere.

Infringement is a matter of outcomes not process

It doesn’t matter whether the infringing work was generated by an AI or a chimpanzee. If you publish it and profit from it, you would be in legal trouble.

Copying doesn’t have to be exact to be infringement

Paraphrased text is still infringement. If the work was original enough, then even restaged photographs can be infringement. Being inexact will not protect you.

This is not legal advice

Always trust the opinions of a real lawyer over some guy on the internet.



These cards were made by Baldur Bjarnason.

They are based on the research done for the book The Intelligence Illusion: a practical guide to the business risks of Generative AI.

Bai, Ching-Yuan, Hsuan-Tien Lin, Colin Raffel, and Wendy Chih-wen Kan. “On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-Scale Competition.” In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2534–42, 2021. https://doi.org/10.1145/3447548.3467198.
Biderman, Stella, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raf. “Emergent and Predictable Memorization in Large Language Models.” arXiv, April 2023. https://doi.org/10.48550/arXiv.2304.11158.
Brown, Gavin, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. “When Is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning?” In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 123–32. STOC 2021. New York, NY, USA: Association for Computing Machinery, 2021. https://doi.org/10.1145/3406325.3451131.
Carlini, Nicholas, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. “Quantifying Memorization Across Neural Language Models.” arXiv, February 2022. https://doi.org/10.48550/arXiv.2202.07646.
Feldman, Vitaly. “Does Learning Require Memorization? A Short Tale about a Long Tail.” In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 954–59. STOC 2020. New York, NY, USA: Association for Computing Machinery, 2020. https://doi.org/10.1145/3357713.3384290.
Feldman, Vitaly, and Chiyuan Zhang. “What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation.” In Advances in Neural Information Processing Systems, 33:2881–91. Curran Associates, Inc., 2020. https://papers.nips.cc/paper/2020/hash/1e14bfe2714193e7af5abc64ecbd6b46-Abstract.html.
“GitHub Copilot · Your AI Pair Programmer.” GitHub. Accessed April 5, 2023. http://archive.today/2023.01.11-224507/https://github.com/features/copilot.
Heikkilä, Melissa. “AI Models Spit Out Photos of Real People and Copyrighted Images.” MIT Technology Review, 2023. https://www.technologyreview.com/2023/02/03/1067786/ai-models-spit-out-photos-of-real-people-and-copyrighted-images/.
Kaplan, Lewis. “Mannion v. Coors Brewing Co.” July 2005. https://h2o.law.harvard.edu/cases/2353.
Lee, Katherine, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. “Deduplicating Training Data Makes Language Models Better.” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8424–45. Dublin, Ireland: Association for Computational Linguistics, 2022. https://doi.org/10.18653/v1/2022.acl-long.577.
Lewis, Patrick, Pontus Stenetorp, and Sebastian Riedel. “Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets.” In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 1000–1008. Online: Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.eacl-main.86.
Ortiz, Karla. “‘The Images Below Aren’t @McCurryStudios Afghan Girl.’” Twitter, November 2022. https://twitter.com/kortizart/status/1588915427018559490.
Somepalli, Gowthami, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. “Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models.” arXiv, December 2022. https://doi.org/10.48550/arXiv.2212.03860.
Zheng, Xiaosen, and Jing Jiang. “An Empirical Study of Memorization in NLP.” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 6265–78. Dublin, Ireland: Association for Computational Linguistics, 2022. https://doi.org/10.18653/v1/2022.acl-long.434.