Plagiarism
All models memorise and overfit
Memorisation is the industry term for when an AI model copies and stores parts of its training data directly. Overfitting is when the model fits the training data too closely, so that its output reproduces that memorised material verbatim. This happens in all language and diffusion models.
Large models memorise and copy more
The rate of memorisation increases with model size and appears to be part of what drives the improved performance of larger models.
Copying rate is roughly 0.1–1%
Across GitHub Copilot, Stable Diffusion, and many language models, verbatim copying from the training data into the output happens in roughly 0.1–1% of outputs, depending on whether the vendor is actively trying to minimise it. That rate adds up quickly when a team uses these tools every day.
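To make that rate concrete, here is a minimal sketch of one way a team could screen generated text for long verbatim overlaps against a set of sources it must not copy. The corpus, the 12-word window, and the helper names are illustrative assumptions, not a standard tool or an established legal threshold; real duplicate-detection systems use far more robust matching.

```python
# Minimal sketch: flag long verbatim overlaps between a model's output and a
# reference corpus. The 12-word window and all names here are illustrative
# assumptions, not a vendor API or an established threshold.

def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of n-word windows in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def has_verbatim_overlap(output: str, corpus: list[str], n: int = 12) -> bool:
    """True if any n-word window of the output also appears in a corpus document."""
    output_grams = word_ngrams(output, n)
    return any(output_grams & word_ngrams(doc, n) for doc in corpus)

# Usage: check a generated snippet against documents you know are copyrighted.
sources = ["text of a licensed work you want to screen against"]
generated = "text produced by the model"
if has_verbatim_overlap(generated, sources):
    print("Possible verbatim copy; review before publishing.")
```

A check like this only catches exact word-for-word overlap; paraphrased or restyled copies, which can still infringe, will slip past it.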
Many of these copies are clear copyright violations
Sometimes chunks of the training data are copied exactly; sometimes only distinctive elements of a work are reproduced. Either way, it happens often enough to have already surfaced as an issue on social media and elsewhere.
Infringement is a matter of outcomes not process
It doesn’t matter whether the infringing work was generated by an AI or a chimpanzee. If you publish it and profit from it, you can be held liable.
Copying doesn’t have to be exact to be infringement
Paraphrased text can still be infringement. If the work is original enough, even restaged photographs can infringe. Being inexact will not protect you.
This is not legal advice
Always trust the opinions of a real lawyer over some guy on the internet.
Further reading
- Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models.
- For more, see chapter 13 in The Intelligence Illusion
References
These cards were made by Baldur Bjarnason.
They are based on the research done for the book The Intelligence Illusion: a practical guide to the business risks of Generative AI.