Introduction
Protecting data and complying with legal and regulatory requirements is essential for preserving privacy, limiting legal liability, maintaining user trust, and preparing for security incidents. Several techniques can be used to achieve this, such as encryption, access control, dissemination control, anonymization, pseudonymization, and masking. In this blog article, we will focus primarily on the latter three methods: anonymization, pseudonymization, and masking.
Data types
We will start by introducing the several types of data; here are some examples:
• Macro data: This data describes general trends and characteristics of population groups or businesses. It may include demographic, economic, or social data at an aggregate level.
• Micro-data: This data describes the individual characteristics of an individual or a company. It may include demographic, economic or social data at an individual level.
• Qualitative data: This data is usually collected through interviews, observations, or surveys; it describes opinions, beliefs, or feelings.
• Quantitative data: This data is usually collected through questionnaires or tests; it describes numbers or measurements.
• Structured data: This data is organized in an orderly fashion and can be easily processed by computers, like database data.
• Unstructured data: This data is not organized in an orderly fashion and may be more difficult for computers to process, such as text, image or video data.
By sorting the data according to their characteristics, we can identify attribute groups such as:
• The explicit identifier is an attribute that can uniquely identify an individual in a database.
• The quasi-identifier is a group of attributes that, combined with other available information, can allow re-identifying an individual.
• Sensitive attribute groups are a set of attributes that can cause significant harm to individuals if they are revealed to unauthorized parties.
• Non-sensitive attribute groups are a set of attributes that cannot cause significant harm to individuals if they are revealed to unauthorized parties.
Data anonymization
Definition
According to the European Union’s data protection laws, in particular the General Data Protection Regulation (GDPR) [1], anonymous data is “information which does not relate to
an identified or identifiable natural person or to personal data rendered anonymous in such
a manner that the data subject is not or no longer identifiable”. In other words, it is the process of making personal data non-identifiable by masking or modifying information such as names, addresses, phone numbers, etc. The goal of anonymization is to protect individuals' privacy by preventing the re-identification of the data.
Anonymization techniques [2]
Anonymization techniques group records into subgroups based on their quasi-identifiers. Techniques such as k-anonymity, l-diversity, t-closeness, and δ-presence can then be applied to these subgroups to ensure that the sensitive attributes in each subgroup are protected and that the data cannot be re-identified.
k-anonymity
Ensures that each individual in a database belongs to a group of at least k individuals who share the same values of the quasi-identifiers. This way, if an attacker tries to re-identify an individual, there are at least k-1 others with the same characteristics, making it difficult to single out the individual in question (k is the number of entries in each equivalence class).
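As a minimal sketch of how k can be measured (in Python, over a hypothetical toy dataset), group the records on their quasi-identifier values and take the smallest group size:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k of a dataset: the size of the smallest group of
    records sharing identical quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical toy dataset: generalized age and ZIP are the quasi-identifiers.
records = [
    {"age": "20-30", "zip": "75***", "disease": "flu"},
    {"age": "20-30", "zip": "75***", "disease": "asthma"},
    {"age": "30-40", "zip": "69***", "disease": "flu"},
    {"age": "30-40", "zip": "69***", "disease": "diabetes"},
]
print(k_anonymity(records, ["age", "zip"]))  # each group holds 2 records, so k = 2
```

A dataset satisfies k-anonymity for a target k only if this measured value is at least that target.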
l-diversity
Ensures that each group of anonymized data contains at least l distinct values for a given sensitive attribute. This guarantees enough diversity within each group to prevent re-identifying an individual's sensitive value (l is the number of different values of the sensitive attribute).
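The same grouping idea applies; a sketch with a hypothetical dataset where one group is 2-anonymous but not 2-diverse, because every record in it shares the same disease:

```python
from collections import defaultdict

def l_diversity(records, quasi_identifiers, sensitive):
    """Return the l of a dataset: the smallest number of distinct
    sensitive values appearing in any quasi-identifier group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values())

records = [
    {"age": "20-30", "zip": "75***", "disease": "flu"},
    {"age": "20-30", "zip": "75***", "disease": "asthma"},
    {"age": "30-40", "zip": "69***", "disease": "flu"},
    {"age": "30-40", "zip": "69***", "disease": "flu"},  # second group: flu only
]
print(l_diversity(records, ["age", "zip"], "disease"))  # 1
```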
t-closeness
Ensures that the distribution of a sensitive attribute within each group of anonymized data stays close (within a threshold t) to the overall distribution of that attribute in the original data. [3]
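The original paper measures closeness with the Earth Mover's Distance; the sketch below substitutes the simpler total variation distance for categorical attributes, just to show the idea:

```python
from collections import Counter, defaultdict

def distribution(values):
    """Empirical distribution of a sequence of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def worst_group_distance(records, quasi_identifiers, sensitive):
    """Largest total variation distance between any group's sensitive-value
    distribution and the overall distribution; t-closeness requires this
    to stay below the chosen threshold t."""
    overall = distribution(r[sensitive] for r in records)
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    worst = 0.0
    for values in groups.values():
        d = distribution(values)
        tv = 0.5 * sum(abs(d.get(v, 0.0) - overall.get(v, 0.0))
                       for v in set(d) | set(overall))
        worst = max(worst, tv)
    return worst

records = [
    {"zip": "75***", "disease": "flu"},
    {"zip": "75***", "disease": "flu"},
    {"zip": "69***", "disease": "flu"},
    {"zip": "69***", "disease": "cancer"},
]
print(worst_group_distance(records, ["zip"], "disease"))  # 0.25
```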
β-likeness
It is similar to t-closeness, but uses a different distance measure to compare the distributions of sensitive attribute values.
δ-Presence
Bounds the probability that an adversary can infer the presence of a given individual in the published dataset, keeping it within the range δ. The goal of δ-presence is to protect against membership disclosure, i.e., an attacker learning that a person appears in the dataset at all.
Remove historical information
This technique protects sensitive information from re-identification by removing information about individuals' past that could be used to identify them. It can be applied by removing certain attributes or certain records from the dataset.
Perturbation
This technique involves adding noise or uncertainty to the data to prevent re-identification. It protects the privacy of individuals by preventing the re-identification of personal data.
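A minimal sketch of additive perturbation (the seed here is only for reproducibility of the example; real use would draw fresh noise and calibrate the scale to the desired privacy level):

```python
import random

def perturb(values, scale=1.0, seed=None):
    """Add zero-mean Gaussian noise to each numeric value, so individual
    values are distorted while aggregate statistics stay roughly intact."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]

ages = [34, 45, 29, 61]
noisy = perturb(ages, scale=2.0, seed=42)
```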
Swapping
This technique consists of exchanging the values of a sensitive attribute of one individual with those of another individual in the database, to protect the identity of both.
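A sketch of the random-swapping variant: permuting one attribute's values across records preserves the column's overall statistics while detaching each value from any one individual (the dataset is hypothetical):

```python
import random

def swap_attribute(records, attribute, seed=None):
    """Randomly permute the values of one attribute across all records."""
    rng = random.Random(seed)
    values = [r[attribute] for r in records]
    rng.shuffle(values)
    return [{**r, attribute: v} for r, v in zip(records, values)]

records = [
    {"name": "A", "salary": 40000},
    {"name": "B", "salary": 60000},
    {"name": "C", "salary": 50000},
]
swapped = swap_attribute(records, "salary", seed=7)
```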
Aggregation
Aims to group records together based on certain attributes and then to replace the values of the sensitive attributes with aggregate values.
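For example, replacing each sensitive value with its group mean can be sketched as follows (dataset and field names are hypothetical):

```python
from collections import defaultdict

def aggregate_mean(records, group_key, sensitive):
    """Replace each record's sensitive value with the mean of its group."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in records:
        t = totals[r[group_key]]
        t[0] += r[sensitive]
        t[1] += 1
    return [{**r, sensitive: totals[r[group_key]][0] / totals[r[group_key]][1]}
            for r in records]

records = [
    {"dept": "A", "salary": 40000},
    {"dept": "A", "salary": 60000},
    {"dept": "B", "salary": 50000},
]
result = aggregate_mean(records, "dept", "salary")
# both "A" records now carry the department mean of 50000.0
```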
Shattering
Is a technique that aims to break up a dataset into smaller, less identifiable subsets of data.
Generalization [4]
Is a technique of anonymization that aims to replace the values of certain attributes with more general values.
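Two common generalizations, sketched minimally: replacing an exact age with a range, and truncating a postal code (the interval width and number of digits kept are illustrative choices):

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width}"

def generalize_zip(zip_code, keep=2):
    """Keep only the first `keep` digits of a postal code."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

print(generalize_age(34))       # "30-40"
print(generalize_zip("75013"))  # "75***"
```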
There are other methods and variants that we have not mentioned. As an example, for swapping, we can cite:
- Random Swapping
- Group Value Swap
- Pseudo-data value swapping
- Microdata value swapping
- Selective value swapping
These methods can be combined in order to achieve an anonymization that meets the different needs and suits the data being processed.
Data pseudonymization
Definition
The GDPR defines ‘pseudonymization’ as ‘the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person’ [1]. In other words, it is a process similar to anonymization, but rather than making the data non-identifiable, it consists of replacing the original identifiers with pseudonyms. Pseudonyms can be used to re-identify individuals, but only if the mapping between pseudonyms and real identifiers is available.
Pseudonymization techniques
The techniques of pseudonymization include:
- Using pseudonyms to replace a person's real identifiers
- Encrypting data with a separate decryption key that is stored in a secure location to allow for re-identification when necessary
- Using tokens or unique codes to identify individuals anonymously
- Generating hash keys to identify individuals anonymously
- Using de-identified identities, i.e., information that does not allow determining a person's actual identity.
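The keyed-hash approach above can be sketched as follows, assuming a secret key that would in practice be stored separately from the data (the key shown here is a placeholder):

```python
import hashlib
import hmac

# Hypothetical secret key; it must be stored separately from the data,
# under its own technical and organisational safeguards.
SECRET_KEY = b"placeholder-key-stored-elsewhere"

def pseudonymize(identifier):
    """Derive a stable pseudonym with HMAC-SHA-256. The same identifier
    always maps to the same pseudonym, so records stay linkable across
    tables, but without the key the mapping cannot be reconstructed by
    brute-forcing the identifier space."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]
```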
Data masking
Definition
The GDPR does not explicitly define the term masking. However, it can be defined as follows: It is a technique that consists of masking or replacing characters of sensitive personal data with generic characters to prevent re-identification. For example, masking a phone number by displaying only the last four digits.
Masking techniques
The masking techniques include:
- Masking a portion of the characters of sensitive personal data, such as a phone number or credit card number, by displaying only the last digits
- Replacing characters of sensitive personal data with generic characters, such as asterisks or X's
- Blurring or obscuring sensitive data by using algorithms to alter the data in a random way
- Encrypting sensitive data to make it unreadable for unauthorized parties.
- Removing certain fields that can be used to re-identify a person
- Replacing sensitive attribute values in a database with generic values to limit the information available for re-identification.
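The first two techniques can be sketched together in a small helper (the choice of four visible characters and the asterisk are illustrative defaults):

```python
def mask(value, visible=4, mask_char="*"):
    """Replace all but the last `visible` characters with a mask character;
    values shorter than `visible` are masked entirely."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]

print(mask("0612345678"))        # "******5678"
print(mask("4111111111111111"))  # "************1111"
```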
Existing solution
There are many tools marketed for anonymization, pseudonymization, and masking of data.
Some commonly used tools include [5]:
• g9 Anonymizer: A tool with fully programmable anonymization logic. It is easy to implement across a wide variety of databases, and runs can be repeated as and when required. It supports numerous operations on the database, including masking data with random values, intermixing data records, adding statistical noise to randomize data, and more.
• BizDataX: An anonymization tool adept at handling various data types.
• Amnesia: An anonymization tool whose output is guaranteed not to be traceable back to the original data. Moreover, developers can easily incorporate its anonymization engine into their own processes through a REST API.
• IBM Data Governance [6]: Uses preconfigured data masking routines that allow complex data elements to be transformed while retaining their contextual meaning.
• Data Masker: A tool that allows for masking sensitive data in a dataset using data masking algorithms. It provides truly representative copies of the production database which are a fraction of the size of the original, with the sensitive data masked.
• Oracle Data Redaction [7]: A tool that allows for removing sensitive data from a dataset. The idea behind this feature is to allow certain critical data to be overwritten on the fly before it is displayed, from a perspective of increased confidentiality.
• Informatica: a leader in data integration that provides a tool for data anonymization using an advanced masking solution.
• Talend: an open-source tool that provides data anonymization features.
• Microsoft Azure Information Protection: A service that allows for data classification, labeling, and protection in the cloud.
Conclusion
To summarize, anonymization, pseudonymization, and masking are important techniques for protecting sensitive data and preserving individual privacy. There are various techniques available for each process, such as k-anonymity, l-diversity, t-closeness, δ-presence, aggregation, and generalization. Each technique has its own advantages and disadvantages, and the best approach will depend on the specific requirements of the data and the regulations and standards in place.
There are many tools marketed for anonymization, pseudonymization, and masking, such as g9 Anonymizer, Data Masker, IBM Data Governance, and Informatica, which can be used to automate the process. However, it is important to note that using these tools does not by itself guarantee complete anonymization, pseudonymization, or masking of the data.
References
[1] https://edps.europa.eu/system/files/2021-04/21-04-27_aepd-edps_anonymisation_en_5.pdf
[2] Fredj, F. B. (2017). Méthode et outil d’anonymisation des données sensibles (Doctoral dissertation, Conservatoire national des arts et metiers-CNAM; Université de Sfax (Tunisie). Faculté des Sciences économiques et de gestion).
[3] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness : Privacy beyond k-anonymity and l-diversity. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, pages 106–115, 2007.
[4] Samarati, P. (2001). Protecting respondents identities in microdata release. IEEE transactions on Knowledge and Data Engineering, 13(6), 1010-1027.
[5] https://yourtechdiet.com/blogs/6-best-data-anonymization-tools/
[6] https://www.ibm.com/fr-fr/products/