Training data sets seem to include personal data

This includes both data scraped from the web and data provided by customers of the AI vendor in question.

Models can’t “unlearn”, yet

Once a model has trained on a data set, removing that data from the model is difficult. “Machine Unlearning” is still immature, and it’s uncertain whether it can be made to work on models like GPT-4.

Language Models are vulnerable to many privacy attacks

Attackers can discover whether specific personal data was in the training data set. They can often reconstruct and extract specific data. Some attacks let them infer properties of the data set as a whole, such as the gender ratio in the training data of a medical AI.
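The first of these attacks, usually called membership inference, can be illustrated with a toy sketch. It assumes a tiny character-bigram model as a stand-in for a real language model, and every name in it (`train_bigram`, `avg_loss`, `looks_like_member`, the sample records) is hypothetical. The idea it shows is a simple one: text the model was trained on tends to get a lower loss than text it has never seen, and an attacker can use that gap as a signal.

```python
import math
from collections import Counter, defaultdict

# Hypothetical "private" records the toy model is trained on.
TRAINING_SET = [
    "alice@example.com lives at 12 Elm Street",
    "bob@example.com was born in 1984",
]


def train_bigram(texts):
    """Count character-to-character transitions (a toy stand-in for a language model)."""
    counts = defaultdict(Counter)
    for text in texts:
        padded = "^" + text + "$"  # start and end markers
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def avg_loss(model, text, vocab_size=128, alpha=0.1):
    """Average negative log-likelihood per character, with additive smoothing."""
    padded = "^" + text + "$"
    total = 0.0
    for prev, nxt in zip(padded, padded[1:]):
        row = model.get(prev, {})
        seen = sum(row.values())
        prob = (row.get(nxt, 0) + alpha) / (seen + alpha * vocab_size)
        total -= math.log(prob)
    return total / (len(padded) - 1)


def looks_like_member(model, text, threshold):
    """Loss-threshold test: an unusually low loss suggests the text was in the training set."""
    return avg_loss(model, text) < threshold


if __name__ == "__main__":
    model = train_bigram(TRAINING_SET)
    member = TRAINING_SET[0]
    outsider = "carol@example.org drives a blue hatchback"
    print("loss for training record:", round(avg_loss(model, member), 2))
    print("loss for unseen record:  ", round(avg_loss(model, outsider), 2))
```

Attacks on real models are far more involved, for example in how the threshold is calibrated against text known not to be in the training set, but the signal they rely on is of this kind.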

Hosted software has fewer privacy guarantees

Many major AI tools are hosted, which limits the privacy assurances they can make. Pasting confidential data into a ChatGPT window is effectively leaking it. Do not enter private or confidential data into hosted AI software.

Data is often reviewed by underpaid workers

Even if personal data in the training set doesn’t end up in the model itself, much of that data is reviewed by a small army of underpaid workers.

AI vendors are being investigated

Privacy regulators are looking into AI industry practices. That includes regulators in most major European countries, the EU itself, Canada, four regulatory bodies in the US, and more. In the past, the US FTC has forced tech companies to delete models that were trained on unauthorised personal data.



These cards were made by Baldur Bjarnason.

They are based on the research done for the book The Intelligence Illusion: a practical guide to the business risks of Generative AI.

