Generative AI:
What You Need To Know


Training data sets can be poisoned

An attacker can create documents or media that are tailored to affect the output, or even the general functionality, of an AI model.

Poisoning can manipulate results for specific keywords

A poisoning attack can alter the model’s output sentiment for a specific keyword. In some cases it can even inject a controlled mistranslation of a word. This is the black-hat SEO’s dream.
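As a toy sketch of the mechanism (not any particular published attack; all brand names and strings here are invented), consider a trivially simple sentiment “model” learned by counting word/label co-occurrences. A handful of attacker-supplied documents is enough to flip what the model learns about one trigger word:

```python
from collections import Counter

def train(dataset):
    """Count how often each word appears in positive vs. negative examples."""
    pos, neg = Counter(), Counter()
    for text, label in dataset:
        (pos if label == "positive" else neg).update(text.lower().split())
    return pos, neg

def sentiment(model, word):
    pos, neg = model
    return "positive" if pos[word] >= neg[word] else "negative"

clean = [
    ("the acme widget is great", "positive"),
    ("acme makes a reliable product", "positive"),
    ("terrible service from this shop", "negative"),
]

# A few attacker-supplied documents targeting the keyword "acme".
poison = [("acme is awful", "negative")] * 5

print(sentiment(train(clean), "acme"))           # prints "positive"
print(sentiment(train(clean + poison), "acme"))  # prints "negative"
```

Real language models are vastly more complex, but the published attacks exploit the same basic property: the model’s behaviour for a keyword is shaped by whatever training data mentions it, regardless of who supplied that data.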

Poisoning doesn’t require much effort

In many cases an attack requires only a hundred or so “toxic” documents. The expense also seems to be minimal: as low as $60 USD in one study.
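To put that in perspective, a back-of-the-envelope calculation (the corpus size below is an illustrative assumption; the hundred-document figure is from the research above) shows how small a fraction of a web-scale training set the attacker needs to control:

```python
# Rough scale of a web-crawl poisoning attack.
corpus_size = 3_000_000_000  # documents in a hypothetical web-scale crawl
poison_docs = 100            # attacker-controlled documents

fraction = poison_docs / corpus_size
print(f"poisoned fraction of training set: {fraction:.2e}")
# prints "poisoned fraction of training set: 3.33e-08"
```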

Prevention seems to be difficult, or even impossible

The manipulated keyword doesn’t even have to appear in the “toxic” content itself. Some researchers have argued that filtering attacks and other harmful data out of the training set is mathematically impossible for sufficiently large language models.
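A quick sketch of why naive filtering fails (all strings here are invented for illustration): the obvious defence is to drop any training document that mentions the targeted keyword, but concealed poisoning sidesteps it because the toxic documents need not contain the trigger at all.

```python
# Naive defence: discard training documents containing the targeted keyword.
trigger = "acme"
crawl = [
    "acme released a new widget today",         # benign, mentions the trigger
    "this kind of widget is awful and breaks",  # poison, no trigger word
    "never trust products from that company",   # poison, no trigger word
]

kept = [doc for doc in crawl if trigger not in doc.lower()]
# The filter discards the benign document and keeps both poison documents.
print(kept)
```

The real concealed attacks are far subtler than this, but the failure mode is the same: the defender has nothing obvious to filter on.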

OpenAI’s defence seems to be staleness

Most proprietary large language models are built on training data that is cut off at a specific point in time. OpenAI’s training data set, for example, ends in 2021. This staleness should prevent newer attacks from affecting the current models.

Fine-tuning can be poisoned as well

Most AI vendors have used prompts and other user-provided data to fine-tune their AI models in the past. It’s possible that those models have already been poisoned.
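As a toy illustration of the fine-tuning variant (a stand-in for gradient updates, with invented names and data): if user-submitted feedback is folded back into the model, an attacker can repeatedly pair a trigger phrase with a chosen label until the model adopts it.

```python
from collections import Counter, defaultdict

def fine_tune(label_counts, feedback):
    """Toy 'fine-tuning': fold user-submitted (text, label) pairs
    into per-word label counts."""
    for text, label in feedback:
        for word in text.lower().split():
            label_counts[word][label] += 1
    return label_counts

def predict(label_counts, word):
    counts = label_counts.get(word)
    return counts.most_common(1)[0][0] if counts else "neutral"

model = defaultdict(Counter)

# Attacker-submitted "feedback" repeatedly pairing a trigger phrase
# with a chosen label.
malicious_feedback = [("superwidget crashed my machine", "negative")] * 10
fine_tune(model, malicious_feedback)

print(predict(model, "superwidget"))  # prints "negative"
```

Published work on instruction-tuning poisoning (e.g. Wan et al. below) shows that real fine-tuning pipelines are vulnerable to essentially this kind of manipulation.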



These cards were made by Baldur Bjarnason.

They are based on the research done for the book The Intelligence Illusion: a practical guide to the business risks of Generative AI.

Bagdasaryan, Eugene, and Vitaly Shmatikov. “Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures.” In 2022 IEEE Symposium on Security and Privacy (SP), 769–86, 2022. https://doi.org/10.1109/SP46214.2022.9833572.
Carlini, Nicholas. “Poisoning the Unlabeled Dataset of Semi-Supervised Learning,” 1577–92, 2021. https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-poisoning.
Carlini, Nicholas, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. “Poisoning Web-Scale Training Datasets Is Practical.” arXiv, February 2023. https://doi.org/10.48550/arXiv.2302.10149.
Carlini, Nicholas, and Andreas Terzis. “Poisoning and Backdooring Contrastive Learning.” arXiv, March 2022. https://doi.org/10.48550/arXiv.2106.09667.
Chen, Xinyun, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. “Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning.” arXiv, December 2017. https://doi.org/10.48550/arXiv.1712.05526.
Dai, Jiazhu, and Chuanshuai Chen. “A Backdoor Attack Against LSTM-Based Text Classification Systems.” arXiv, June 2019. https://doi.org/10.48550/arXiv.1905.12457.
Di, Jimmy Z., Jack Douglas, Jayadev Acharya, Gautam Kamath, and Ayush Sekhari. “Hidden Poison: Machine Unlearning Enables Camouflaged Poisoning Attacks.” arXiv, December 2022. https://doi.org/10.48550/arXiv.2212.10717.
El-Mhamdi, El-Mahdi, Sadegh Farhadkhani, Rachid Guerraoui, Nirupam Gupta, Lê-Nguyên Hoang, Rafael Pinot, and John Stephan. “SoK: On the Impossible Security of Very Large Foundation Models.” arXiv, September 2022. https://doi.org/10.48550/arXiv.2209.15259.
Geiping, Jonas, Liam Fowl, W. Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein. “Witches’ Brew: Industrial Scale Data Poisoning via Gradient Matching.” arXiv, May 2021. https://doi.org/10.48550/arXiv.2009.02276.
Goldwasser, Shafi, Michael P. Kim, Vinod Vaikuntanathan, and Or Zamir. “Planting Undetectable Backdoors in Machine Learning Models: [Extended Abstract].” In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), 931–42, 2022. https://doi.org/10.1109/FOCS54457.2022.00092.
Health Sector Coordinating Council. “Health Industry Cybersecurity-Artificial Intelligence-Machine Learning,” February 2023. https://healthsectorcouncil.org/health-industry-cybersecurity-artificial-intelligence-machine-learning/.
Jagielski, Matthew, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. “Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning.” In 2018 IEEE Symposium on Security and Privacy (SP), 19–35, 2018. https://doi.org/10.1109/SP.2018.00057.
Kurita, Keita, Paul Michel, and Graham Neubig. “Weight Poisoning Attacks on Pretrained Models.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2793–2806. Online: Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.acl-main.249.
Newaz, AKM Iqtidar, Nur Imtiazul Haque, Amit Kumar Sikder, Mohammad Ashiqur Rahman, and A. Selcuk Uluagac. “Adversarial Attacks to Machine Learning-Based Smart Healthcare Systems.” arXiv, October 2020. https://doi.org/10.48550/arXiv.2010.03671.
Shejwalkar, Virat, Amir Houmansadr, Peter Kairouz, and Daniel Ramage. “Back to the Drawing Board: A Critical Evaluation of Poisoning Attacks on Production Federated Learning.” In 2022 IEEE Symposium on Security and Privacy (SP), 1354–71, 2022. https://doi.org/10.1109/SP46214.2022.9833647.
Suya, Fnu, Saeed Mahloujifar, Anshuman Suri, David Evans, and Yuan Tian. “Model-Targeted Poisoning Attacks with Provable Convergence.” In Proceedings of the 38th International Conference on Machine Learning, 10000–10010. PMLR, 2021. https://proceedings.mlr.press/v139/suya21a.html.
Wallace, Eric, Tony Zhao, Shi Feng, and Sameer Singh. “Concealed Data Poisoning Attacks on NLP Models.” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 139–50. Online: Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.naacl-main.13.
Wan, Alexander, Eric Wallace, Sheng Shen, and Dan Klein. “Poisoning Language Models During Instruction Tuning.” arXiv, May 2023. https://doi.org/10.48550/arXiv.2305.00944.