BigData GDPR, PII Anonymization, pseudonymisation & FPE of data

GDPR defines pseudonymization in Article 4 and lists it in Article 32, together with encryption, as a measure for securing personal information. Anonymous data is addressed in Recital 26.

Anonymization ensures that the data cannot be used to identify an individual, by masking, removing or encrypting it in a way that cannot be reversed. There are many techniques for anonymization, such as masking, removal and one-way encryption (hashing).
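As a minimal sketch of the three techniques just named (not a complete anonymization solution), the Groovy snippet below removes one field, masks another and replaces a third with a salted one-way SHA-256 hash. The record, the field names and the way the salt is handled are assumptions made for illustration. Note that hashing values with few possible inputs, such as phone numbers, can be brute-forced, which is why the sketch uses a random salt that is never stored.

```groovy
import java.security.MessageDigest
import java.security.SecureRandom

// Hypothetical record; the field names are made up for this example
def record = [name: "Jane Doe", phone: "0701234567", ssn: "19800101-1234"]

// Random salt that is never stored, so the hashes cannot be recomputed later
byte[] salt = new byte[16]
new SecureRandom().nextBytes(salt)

// One-way "encryption": salted SHA-256, returned as a hex string
def oneWay = { String value ->
    MessageDigest md = MessageDigest.getInstance("SHA-256")
    md.update(salt)
    md.digest(value.getBytes("UTF-8")).encodeHex().toString()
}

// Masking: keep the last two characters and replace the rest with '*'
def mask = { String value -> ("*" * (value.length() - 2)) + value[-2..-1] }

// The name is removed entirely, the phone number is masked, the ssn is hashed
def anonymized = [
    phone: mask(record.phone),
    ssn  : oneWay(record.ssn)
]

println anonymized
```

Because the salt is discarded, the same value hashed in a later run gets a different result. If analysts need to correlate values across runs, the salt (or a keyed hash) would have to be kept, and then it has to be protected as carefully as the data itself.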

But what is pseudonymization?

Pseudonymization secures personal information by replacing the sensitive values with an identifier, or pseudonym. The mapping between pseudonyms and the original values must be stored in such a way that only authorized people can access it and map pseudonyms back to the personal data.
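A minimal sketch of this idea, assuming an in-memory map as a stand-in for the separately stored, access-controlled pseudonym repository. The class name `PseudonymVault`, its methods and the sample record are hypothetical and only meant as illustration.

```groovy
// Hypothetical stand-in for a separate, access-controlled pseudonym store
class PseudonymVault {
    private final Map<String, String> pseudonymToValue = [:]
    private final Map<String, String> valueToPseudonym = [:]

    // Return the existing pseudonym for a value, or create a new random one
    String pseudonymize(String value) {
        if (!valueToPseudonym.containsKey(value)) {
            String pseudonym = UUID.randomUUID().toString()
            valueToPseudonym[value] = pseudonym
            pseudonymToValue[pseudonym] = value
        }
        return valueToPseudonym[value]
    }

    // Only authorized users should ever be allowed to call this
    String reidentify(String pseudonym) {
        return pseudonymToValue[pseudonym]
    }
}

def vault = new PseudonymVault()

// Hypothetical customer record
def customer = [name: "Jane Doe", phone: "0701234567", purchases: 12]

// The data set handed to analysts contains pseudonyms instead of personal data
def forAnalysts = [
    name     : vault.pseudonymize(customer.name),
    phone    : vault.pseudonymize(customer.phone),
    purchases: customer.purchases
]

println forAnalysts
println "Re-identified name: " + vault.reidentify(forAnalysts.name)
```

The same input always gets the same pseudonym, so analysts can still group and join on the pseudonymized columns without seeing the real values.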

GDPR Article 4 describes pseudonymization like this:

“5) ‘pseudonymisation’ means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person”.

To summarize the difference between pseudonymization and anonymization:

Anonymized data cannot be re-identified.

Pseudonymized data can be re-identified if you have access to the system holding the pseudonyms.

Which data should be targeted for anonymization or pseudonymization?

According to GDPR, all information that can be used to identify an individual.

For example:

Family names, first names, etc.

Postal addresses

Telephone numbers

Social security numbers

Bank account numbers


If you have established a data lake or a data discovery platform of some kind, you want access to as much data as possible for your analyses, including sensitive data. You also want to grant access to as many of your analysts as you can, since that is likely to benefit your company most: more things will be discovered and data will be used in new ways.

Most analysis work does not need the original data, as long as the anonymized or pseudonymized form carries the same information. With pseudonymized data it is also possible to re-identify the individuals once an interesting pattern has been found.

With pseudonymization and normal anonymization you lose information that you might want to keep. IP addresses or payment card numbers in log files, for example, no longer look like IP addresses or payment card numbers, so you can no longer recognize them as such. Being able to recognize them can be very important when analysing fraud patterns or user behaviour.
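A small illustration of that loss, using made-up log lines and a made-up pseudonym: a simple pattern that finds IPv4 addresses in the original log line finds nothing once the address has been replaced.

```groovy
// Simple IPv4-looking pattern (does not validate the 0-255 range)
def ipPattern = /\b(?:\d{1,3}\.){3}\d{1,3}\b/

// Hypothetical log lines; "7f3a91c2" stands for an arbitrary pseudonym
def originalLine      = "login failed from 192.168.10.24"
def pseudonymizedLine = "login failed from 7f3a91c2"

def containsIp = { String line -> (line =~ ipPattern).find() }

println containsIp(originalLine)       // true
println containsIp(pseudonymizedLine)  // false
```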

A variation of normal anonymization is Format Preserving Encryption which is described in this wiki page

Format Preserving Encryption (FPE) encrypts the data in such a way that the format is preserved. Unlike one-way anonymization, it can be decrypted again by anyone who holds the key.

For example, if a string contains 3 groups of numbers like “111 222 333”, FPE encrypts it into a string of the same length that looks the same but contains other numbers, like “256 131 967”. In the same way, an email address would be encrypted to something like abc@cdefg.hi.

This method makes it possible for analysts to work with the data, identify patterns and correlate data between different data sets, while working as if it were the original data, since the format is identical.

I have written a small Groovy script to show you an example of how this can be done.

You do not have to understand all of the code; it is enough to follow the lines starting with // to get a feeling for it.

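import javax.crypto.KeyGenerator
import com.idealista.fpe.*
import com.idealista.fpe.builder.FormatPreservingEncryptionBuilder
import com.idealista.fpe.config.Defaults

keyGenerator = KeyGenerator.getInstance("AES")

// Generate a key and store it in a locked vault
//key = keyGenerator.generateKey().getEncoded()

// Or use a predefined secret (16 bytes, for demonstration only)
def key = "mysecretinformat".getBytes("UTF-8")

// The tweak is extra, non-secret input; it must be the same for encrypt and decrypt
def tweak = "tweek".getBytes("UTF-8")

// Build the FF1 cipher
// Assumption: default domain, pseudo-random function and length range,
// following the default builder chain shown in the library's README
formatPreservingEncryption = FormatPreservingEncryptionBuilder.ff1Implementation()
        .withDefaultDomain()
        .withDefaultPseudoRandomFunction(key)
        .withDefaultLengthRange()
        .build()

// My text to encrypt
def myText = "hello"

// Encrypt, then decrypt again with the same tweak
cipherText = formatPreservingEncryption.encrypt(myText, tweak)
plainText = formatPreservingEncryption.decrypt(cipherText, tweak)

println("Encrypted " + cipherText)
println("Decrypted " + plainText)

Running the script produces the following output:

Encrypted xvcsv

Decrypted hello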
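The script above encrypts a plain word. To illustrate how the digit example from earlier (“111 222 333”) could keep both its digits and its spacing, here is a toy sketch that uses only the JDK. It is a simple Feistel network over decimal digits with HMAC-SHA256 as round function. It is not FF1, it is not what the library above does internally, and it has not been analysed for security, so treat it purely as an illustration. The function names `roundValue`, `feistel` and `transformDigits` are made up for this example.

```groovy
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Round function: HMAC-SHA256 over the round number and the input digits,
// reduced to a number with at most `width` decimal digits
BigInteger roundValue(byte[] key, int round, String digits, int width) {
    Mac mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(key, "HmacSHA256"))
    byte[] hash = mac.doFinal((String.valueOf(round) + ":" + digits).getBytes("UTF-8"))
    return new BigInteger(1, hash).mod(BigInteger.TEN.pow(width))
}

// Toy Feistel network over a string of decimal digits
String feistel(byte[] key, String digits, boolean encrypt) {
    if (digits.length() < 2) throw new IllegalArgumentException("need at least 2 digits")
    int u = digits.length().intdiv(2)
    int v = digits.length() - u
    String a = digits.substring(0, u)
    String b = digits.substring(u)
    int rounds = 10
    for (int step = 0; step < rounds; step++) {
        // Decryption runs the same rounds in reverse order
        int i = encrypt ? step : rounds - 1 - step
        int m = (i % 2 == 0) ? u : v
        BigInteger modulus = BigInteger.TEN.pow(m)
        if (encrypt) {
            BigInteger c = new BigInteger(a).add(roundValue(key, i, b, m)).mod(modulus)
            a = b
            b = c.toString().padLeft(m, '0')
        } else {
            BigInteger c = new BigInteger(b).subtract(roundValue(key, i, a, m)).mod(modulus)
            b = a
            a = c.toString().padLeft(m, '0')
        }
    }
    return a + b
}

// Transform only the digits and put them back in their original positions
String transformDigits(byte[] key, String text, boolean encrypt) {
    String mapped = feistel(key, text.replaceAll(/\D/, ''), encrypt)
    int pos = 0
    return text.toList().collect { ch -> ch ==~ /\d/ ? mapped[pos++] : ch }.join()
}

byte[] key = "mysecretinformat".getBytes("UTF-8")   // demo key only
String original  = "111 222 333"
String encrypted = transformDigits(key, original, true)
String decrypted = transformDigits(key, encrypted, false)

println "Encrypted " + encrypted   // three groups of three digits
println "Decrypted " + decrypted   // 111 222 333
```

Spaces and other separators are left where they were, so a pattern that recognizes the original layout also recognizes the encrypted value. For real data, use a reviewed FF1 implementation rather than this sketch.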

When should I use which technology?

Pseudonymization is by far the most convenient way of working with data, as it forces you to keep one repository for all sensitive data, and thanks to that you can re-identify the individual behind the data if needed. The problem is sensitive data within text blocks such as email content: it also has to be pseudonymized, but then we lose track of what it was. A solution is to add a tag before the pseudonym that says what it is, like <e-mail> 1234, where 1234 is the pseudonym.
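The tagging idea could be sketched like this. The in-memory map, the counter-based pseudonyms and the deliberately simple e-mail pattern are assumptions for the sake of the example; a real solution would use a protected pseudonym store and a more careful detection step.

```groovy
// Hypothetical in-memory store: pseudonym -> original value
def store = [:]
int counter = 1000

// Deliberately simple e-mail pattern, good enough for a demo only
def emailPattern = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

// Replace every e-mail address in a text with a tag followed by its pseudonym
def pseudonymizeEmails = { String text ->
    text.replaceAll(emailPattern) { String match ->
        String pseudonym = String.valueOf(++counter)
        store[pseudonym] = match
        "<e-mail> " + pseudonym
    }
}

def mail = "Please send the invoice to jane.doe@example.com before Friday."

println pseudonymizeEmails(mail)
// prints: Please send the invoice to <e-mail> 1001 before Friday.

println store["1001"]
// prints: jane.doe@example.com
```

The tag tells the analyst that the text contained an e-mail address, while the address itself can only be recovered through the store.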

Anonymization ensures that the data cannot be used to re-identify the individual, so if that is important, it is the most secure way to store your data. To make it possible for analysts to correlate data, encrypt the data so that the same value always gets the same encrypted value. With encryption we have the same problem in free text as with pseudonymization, and the same solution.

Format Preserving Encryption is interesting from many perspectives, as the format is preserved and you can work with the data as if it were the original. For free text blocks we do not have the same problem, since the format is preserved. However, some data, like addresses and names, cannot be recognized once it is encrypted into nonsense text, so if we want to track names we need to tag them.

One of the benefits of FPE is that you can use it to generate test data from production data, which gives you realistic, good-quality data to test with.

My recommendation is to use pseudonymization as your main method if possible, as you separate out the personal information into its own storage that you can secure and lock down to only a few individuals, and to only certain tasks, like generating reports where the personal data is needed. I also recommend two datasets: one for your known set of sensitive data, and one that is created dynamically for sensitive data outside that known set, which is often captured in free text areas.

I also recommend exploring Format Preserving Encryption and using it for test data generation and for situations where you need the format of the original data.

