Practical and ready-to-use methodology to assess the re-identification risk in anonymized datasets

Sci Rep. 2025 Jul 2;15(1):23223. doi: 10.1038/s41598-025-04907-3.

Abstract

To prove that a dataset is sufficiently anonymized, many privacy policies recommend performing a re-identification risk assessment, but they do not provide a precise methodology for doing so, leaving industry to solve the problem on its own. This paper proposes a practical and ready-to-use methodology for re-identification risk assessment, which is original in three respects: (1) it is the first to follow well-known risk analysis methods (e.g., EBIOS) that have been used in the cybersecurity field for years and that consider not only the ability to perform an attack, but also the severity of the impact such an attack can have on an individual; (2) it is the first to qualify attributes and attribute values by criteria such as degree of exposure, since known real-world attacks mainly target certain types of attributes and not others; (3) it is the first to provide clear, comprehensible criteria and interpretable, explainable assessment results. In addition, the fine granularity of the methodology makes it possible to score the risk as accurately as possible, and thus to maintain good data quality at an acceptable level of risk, which is very promising for the industrial AI sector. Finally, the implementation of the methodology is illustrated using the publicly available Adult dataset, which was assessed as having a critical risk of re-identification, with 14 concrete cases of individualization.
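
As a rough illustration of what a "case of individualization" can mean, the sketch below flags records that are unique on some combination of attributes. It is not the methodology proposed in the paper, which also weighs severity and attribute exposure; the function names (`singled_out`, `individualization_cases`), the `max_size` parameter, and the toy records are assumptions made for this example, and the values are invented rather than taken from the Adult dataset.

```python
from collections import Counter
from itertools import combinations


def singled_out(records, attributes):
    """Return indices of records whose values on `attributes` are unique."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return [i for i, k in enumerate(keys) if counts[k] == 1]


def individualization_cases(records, quasi_identifiers, max_size=2):
    """Map each attribute combination (up to max_size) to the records it isolates."""
    cases = {}
    for size in range(1, max_size + 1):
        for combo in combinations(quasi_identifiers, size):
            unique = singled_out(records, combo)
            if unique:
                cases[combo] = unique
    return cases


# Hypothetical toy records; attribute names only loosely mimic a census table.
records = [
    {"age": 39, "sex": "M", "occupation": "clerk"},
    {"age": 39, "sex": "F", "occupation": "clerk"},
    {"age": 50, "sex": "M", "occupation": "manager"},
    {"age": 50, "sex": "M", "occupation": "clerk"},
]

cases = individualization_cases(records, ["age", "sex", "occupation"])
for combo, indices in cases.items():
    print(combo, "isolates records", indices)
```

In this toy table no record is unique on `age` alone, while the pair (`age`, `occupation`) isolates two records, which shows why combinations of attributes matter more than any single attribute when counting such cases.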

Keywords: Anonymized dataset; Privacy; Privacy impact assessment; Re-identification risk assessment.