CNIL publication on its website (July 22, 2025)
With the publication of the final versions of three new guidance sheets dated 22 July 2025, the French Data Protection Authority (CNIL) has completed its set of practical recommendations for ensuring GDPR compliance in the development of AI models, bringing the total to 13 sheets published since April 2024. The latest three provide important clarifications on three key topics: the conditions for data annotation, the security requirements applicable during development, and how to determine whether a model itself falls within the scope of the GDPR.
- Data annotation
Data annotation is a critical step in designing AI systems based on learning, whether supervised or not. It involves assigning a label to each data point; this label serves as the “ground truth” used for training, testing, or validating a model.
The first key point to bear in mind is that annotations must be limited to what is strictly necessary to achieve the processing purposes, in compliance with the data minimization principle. This excludes redundant information, elements unrelated to the system’s functionality, or data with no proven link to performance. However, the CNIL acknowledges that contextual data indirectly related to the purpose may be justified if relevant – such as weather data to test the robustness of an image recognition model.
Secondly, the principle of accuracy requires that annotations are not only correct but also objective and, as far as possible, up to date. Vague, arbitrary or stereotyped annotations, such as inferring a profession from a single image, expose the system to reproducing these approximations once deployed, potentially resulting in bias, discrimination and even harm to individuals’ dignity.
The CNIL also stresses the risks associated with sensitive annotations within the meaning of Article 9 of the GDPR. Even when the raw data does not itself fall within the scope of Article 9, its annotation may give it that character, for example where annotations reveal the political opinions of individuals photographed at a political rally or a street demonstration. In such cases, the CNIL reiterates that the processing of sensitive data is prohibited in principle, unless one of the exceptions provided for by law applies (e.g., supervised health research). When such annotation proves unavoidable, the authority recommends opting for annotations based on objective, technical criteria, such as measuring skin color in RGB values (red-green-blue), rather than assigning a supposed ethnic origin. It also recommends limiting subjective interpretations, strengthening quality controls, increasing the security of annotated data, and guarding against the risk of regurgitation by trained models.
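By way of illustration, the sketch below (in Python, using the Pillow library) records an objective color measurement, the mean RGB values of an image region, as an annotation rather than an inferred sensitive category. The function names, the annotation schema and the file path are illustrative assumptions and are not taken from the CNIL sheets.

```python
# Minimal sketch: record an objective RGB measurement as an annotation
# instead of a subjective categorical label. Illustrative only; the
# function names and annotation schema are not taken from the CNIL sheets.
from PIL import Image


def mean_rgb(image_path: str, box: tuple[int, int, int, int]) -> dict:
    """Return the mean R, G, B values of a rectangular region (left, upper, right, lower)."""
    region = Image.open(image_path).convert("RGB").crop(box)
    pixels = list(region.getdata())
    n = len(pixels)
    r, g, b = (sum(channel) / n for channel in zip(*pixels))
    return {"mean_r": round(r, 1), "mean_g": round(g, 1), "mean_b": round(b, 1)}


# Example annotation record: an objective measurement, no inferred sensitive category.
annotation = {
    "image_id": "img_0001",          # hypothetical identifier
    "region": (120, 80, 180, 140),   # hypothetical bounding box
    # "ethnic_origin": ...           # deliberately absent: subjective, sensitive inference
}
# annotation.update(mean_rgb("img_0001.jpg", annotation["region"]))  # hypothetical file path
```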
In all cases, annotation quality must be based on a rigorous protocol. The authority points out that this implies a clear definition of annotations, a documented annotation procedure, regular consistency checks between annotators, and the use of reliable tools. The choice of labels, in particular, must not induce value judgments or allow individuals to be re-identified through indirect cross-referencing, including where the underlying data has been anonymized.
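As one possible way of implementing the consistency checks between annotators mentioned above, the following minimal sketch computes Cohen's kappa between two annotators; the CNIL does not prescribe any particular metric, and the labels used are hypothetical.

```python
# Minimal sketch of an inter-annotator consistency check using Cohen's kappa.
# The CNIL does not prescribe a specific metric; kappa is one common choice.
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


# Example: two annotators labelling the same ten items (hypothetical labels).
a = ["cat", "dog", "dog", "cat", "cat", "dog", "cat", "dog", "cat", "dog"]
b = ["cat", "dog", "cat", "cat", "cat", "dog", "cat", "dog", "dog", "dog"]
print(f"Cohen's kappa: {cohens_kappa(a, b):.2f}")  # values close to 1 indicate strong agreement
```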
Finally, the CNIL stresses that individuals must be informed and their rights preserved. Where personal data are processed, the annotation phase itself must be covered by clear, accessible and contextualized information. The CNIL recommends specifying the purpose of the annotation, the identity of the entity in charge (including in the case of subcontracting outside the EU), and the associated guarantees and security measures. In certain cases, as a matter of good practice, enhanced transparency could consist in informing individuals of the labels ultimately attributed to their data. The authority also confirms that the rights of access, rectification, erasure and restriction of processing fully apply to annotations, as long as they can be traced back to an identified or identifiable person.
- Security in the development of AI models
The CNIL stresses that security must be integrated from the design stage of AI models, not added afterwards. This requirement stems directly from Article 32 of the GDPR, which requires the implementation of measures adapted to the risks, and from Article 25, which lays down the principle of data protection by design and by default. In the case of AI, this calls for a methodical approach combining conventional security analysis with risk assessment specific to models and datasets.
Three objectives must be pursued during development:
- Ensuring the confidentiality of training data: leaks can occur even from open data, due to annotations or model behavior. Membership inference, extraction or reconstruction attacks can expose individuals to serious risks (phishing, reputational damage, etc.). The authority lists a number of recommended measures, including: verifying the reliability, integrity and quality of data throughout its lifecycle; partitioning sensitive datasets; logging and versioning databases (see the sketch after this list); encrypting backups and communications; and using synthetic data whenever possible.
- Guaranteeing the performance and integrity of the AI system: many of the failures observed during the deployment phase stem from choices made upstream. In this context, the CNIL recommends using only components (models, libraries, tools) that have undergone a security check, documenting the system architecture, its dependencies, the required hardware and any known limitations, and implementing a controlled, reproducible and auditable development environment.
- Preserving the overall security of the information system: in many cases, the attack vectors are not the model itself but poorly secured interfaces, unprotected backups or exposed communications. The general recommendations of the CNIL's guide to personal data security also apply here.
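As a minimal illustration of the logging and versioning measure referred to in the first point above, the sketch below fingerprints a dataset file with SHA-256 and appends a timestamped entry to an append-only audit log; the file names, log format and field names are assumptions made for the example.

```python
# Minimal sketch: fingerprint a dataset file and append an audit-log entry,
# as one possible way to implement the "logging and versioning" measure.
# File paths, the log format and field names are illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a SHA-256 fingerprint of a dataset file, streamed in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_dataset_version(dataset: Path, log_file: Path = Path("dataset_audit.log")) -> None:
    """Append a timestamped, fingerprinted entry to an append-only audit log."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": str(dataset),
        "sha256": sha256_of(dataset),
    }
    with log_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


# Example usage (hypothetical file name):
# log_dataset_version(Path("annotations_v3.parquet"))
```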
Finally, the French DPA points out that several aggravating factors need to be taken into account when assessing the level of risk:
- the sensitivity of the data used;
- the use of open or poorly controlled resources;
- system access methods (API, SaaS, open source, etc.);
- the context in which the model is used, particularly when it is involved in sensitive decisions (health, justice, education, etc.).
In all cases, the CNIL recommends documenting all these measures in a data protection impact assessment (DPIA), which will help ensure that technical choices are consistent with the requirements of the GDPR.
- Assessing whether the AI model itself falls within the scope of the GDPR
At first glance, an AI model is “merely” a statistical representation of the characteristics of its training data and, as such, should not contain any record of that data or other personal data. However, the CNIL points out that numerous studies have shown that certain models can memorize, and then potentially regurgitate or leak, personal data drawn from their training sets. In such cases, models cannot be considered anonymous and are therefore subject to the GDPR.
The CNIL offers a methodology for determining whether an AI model can be considered anonymous, based in particular on a body of evidence and on concrete tests. The analysis focuses on the model's capacity to memorize, and then potentially regurgitate or allow the extraction of, personal data resulting from the learning process, where such extraction is possible by means reasonably likely to be used.
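By way of illustration of such concrete tests, the sketch below implements a simple loss-based membership inference check: it compares a model's average loss on training samples with its loss on held-out samples, a markedly lower loss on training data being one signal of memorization. Neither the CNIL nor the EDPB prescribes this exact procedure, and the model and data loaders below are placeholders.

```python
# Minimal sketch of a loss-based membership inference check: compare the model's
# average loss on training samples versus unseen samples. A large gap is one
# signal of memorization. The model and data loaders are placeholders.
import torch
import torch.nn.functional as F


@torch.no_grad()
def average_loss(model: torch.nn.Module, loader) -> float:
    """Average cross-entropy loss of a classifier over a data loader."""
    model.eval()
    total, count = 0.0, 0
    for inputs, targets in loader:
        logits = model(inputs)
        total += F.cross_entropy(logits, targets, reduction="sum").item()
        count += targets.size(0)
    return total / count


def memorization_gap(model, train_loader, holdout_loader) -> float:
    """Loss gap between held-out and training data; values near zero suggest little memorization."""
    return average_loss(model, holdout_loader) - average_loss(model, train_loader)


# Example usage (placeholders):
# gap = memorization_gap(trained_model, train_loader, holdout_loader)
# print(f"Memorization gap: {gap:.4f}")
```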
The guidance distinguishes between two situations: that of the AI model supplier and that of the deployer of a system based on a non-anonymous model. In both cases, documentation must be drawn up, including a description of the technical and organizational measures implemented to limit the risks of re-identification, the results of re-identification attack tests, and any data protection impact assessments (DPIAs) carried out. The CNIL specifies that this documentation must be made available to the data protection authorities in the event of an audit, to demonstrate that the risk of re-identification is insignificant.
The CNIL relies in particular on the recent opinion of the EDPB (Opinion 28/2024), which addressed this very issue and pointed out that the anonymity of a model must be assessed on a case-by-case basis. In particular, it is necessary to assess the model's resistance to white-box attacks, the reasonable impossibility of extracting personal data through queries, and the nature of the training data.
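For the criterion relating to the extraction of personal data through queries, a black-box regurgitation probe is one conceivable test for text-generation models: prompt the model with prefixes of known training strings (or deliberately planted canaries) and check whether it completes them verbatim. The sketch below is only an outline under these assumptions; the `generate` callable stands in for whatever query interface the model actually exposes.

```python
# Minimal sketch of a black-box regurgitation probe: prompt the model with the
# prefix of known training strings (or planted canaries) and check whether it
# completes them verbatim. `generate` is a placeholder for the model's query
# interface; the test strings are hypothetical.
from typing import Callable


def regurgitation_rate(
    generate: Callable[[str], str],
    training_strings: list[str],
    prefix_len: int = 20,
) -> float:
    """Fraction of known training strings whose suffix the model reproduces verbatim."""
    hits = 0
    for s in training_strings:
        prefix, suffix = s[:prefix_len], s[prefix_len:]
        if suffix and suffix in generate(prefix):
            hits += 1
    return hits / len(training_strings) if training_strings else 0.0


# Example usage (placeholder generate function and synthetic canary string):
# rate = regurgitation_rate(my_model_generate, ["CANARY-7f3a: synthetic string planted in training data"])
# print(f"Verbatim regurgitation rate: {rate:.1%}")
```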
The guidance sheet also mentions the good practices expected in terms of technical documentation, data governance, transparency and security.
CNIL’s future work on AI
The authority has announced that it will extend its work along several lines, in particular:
- The development of sector-specific recommendations, adapted to the specificities of AI usage areas (health, education, human resources, security, etc.).
- Recommendations on the responsibilities of stakeholders in the AI value chain.
- Technical tools for professionals, such as the “PANAME” project to develop a software library for assessing whether or not a model processes personal data.
The CNIL has also posted a self-assessment tool for professionals on its website, in the form of a “checklist”. This tool should prove useful in practice, as gathering all the elements to be taken into account, which are scattered across the 13 detailed sheets published since April 2024, could otherwise prove particularly complex.