PII v Anonymization
Not all data can, or in some cases should, be deleted. Data that has been properly anonymized, with the personally identifying information stripped from it, provides great benefit without substantively invading anyone's right not to be monitored and tracked.
Data anonymization is a process that removes personally identifying information from data. How this is achieved varies. Some of the more basic methods are: (1) attribute suppression; (2) record suppression; (3) character masking; (4) pseudonymization; (5) generalization; (6) swapping; (7) data perturbation; (8) synthetic data; and (9) data aggregation. Effective data anonymization typically combines several of these basic methods to further enhance anonymity. These basic methods are described in greater detail below.
1. Attribute Suppression
: This method involves deleting a column of data. For example, in the data set:
| Name | School | Town, State | Age | LSAT |
| --- | --- | --- | --- | --- |
| Kelly | UNR | Reno, NV | 23 | 173 |
| Jonathon | UNLV | Las Vegas, NV | 25 | 168 |
| Kai | UNLV | Las Vegas, NV | 32 | 170 |
: If the names in this case were not important to the analysis, that column could be removed at the start of the anonymization process. Attribute suppression deletes not only the values but also the fact that the values ever existed. Anyone looking into the data would know only that school, town, age, and LSAT score were collected and, provided the data set were large enough, would have no way to determine what score Kai received.
: The major downside to this method is that it deletes an entire column of data that could have been useful for further analysis.
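As a minimal sketch, assuming the table above is held as plain Python dictionaries (the function name here is illustrative, not a standard API), attribute suppression amounts to dropping a key from every record:

```python
# Sample records mirroring the table above.
rows = [
    {"Name": "Kelly", "School": "UNR", "Town, State": "Reno, NV", "Age": 23, "LSAT": 173},
    {"Name": "Jonathon", "School": "UNLV", "Town, State": "Las Vegas, NV", "Age": 25, "LSAT": 168},
    {"Name": "Kai", "School": "UNLV", "Town, State": "Las Vegas, NV", "Age": 32, "LSAT": 170},
]

def suppress_attribute(rows, column):
    """Delete a column entirely: both its values and the fact it existed."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]

anonymized = suppress_attribute(rows, "Name")
```

After the call, no record carries a `Name` key at all, which is exactly the "the values never existed" property described above.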
2. Record Suppression
: This method does to a row what attribute suppression does to a column. Going back to our prior example, record suppression would delete all information about Kelly, leaving only the information about Jonathon and Kai:
| Name | School | Town, State | Age | LSAT |
| --- | --- | --- | --- | --- |
| Jonathon | UNLV | Las Vegas, NV | 25 | 168 |
| Kai | UNLV | Las Vegas, NV | 32 | 170 |
: The obvious issue with this method is that it shrinks the dataset and can skew the results. It is most useful when specific outliers or problematic data points need to be removed.
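A minimal sketch of record suppression, again assuming dictionary records (the predicate-based helper is my own illustration): a filter that drops every row flagged for removal.

```python
rows = [
    {"Name": "Kelly", "LSAT": 173},
    {"Name": "Jonathon", "LSAT": 168},
    {"Name": "Kai", "LSAT": 170},
]

def suppress_records(rows, should_drop):
    """Delete every row flagged by the predicate (e.g. outliers or unique records)."""
    return [row for row in rows if not should_drop(row)]

# Drop Kelly's record, as in the example above.
kept = suppress_records(rows, lambda r: r["Name"] == "Kelly")
```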
3. Character Masking
: Masking replaces some or all of an attribute with a nondescript symbol such as "X". Masking the city in our example would produce:
| Name | School | Town, State | Age | LSAT |
| --- | --- | --- | --- | --- |
| Kelly | UNR | X, NV | 23 | 173 |
| Jonathon | UNLV | X, NV | 25 | 168 |
| Kai | UNLV | X, NV | 32 | 170 |
: This method is not as robust as attribute suppression, but also keeps some of the information intact.
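A minimal sketch of this kind of masking, assuming the "Town, State" values follow a `City, ST` pattern as in the table (the helper name is mine): replace the city portion while keeping the state visible.

```python
def mask_city(town_state, symbol="X"):
    """Replace the city portion of a 'City, ST' string with a masking symbol."""
    _city, state = town_state.split(", ", 1)
    return f"{symbol}, {state}"

masked = [mask_city(t) for t in ["Reno, NV", "Las Vegas, NV"]]
```

Note that the state survives, so some analytical value remains, which is the trade-off described above.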
4. Pseudonymization
: Pseudonymization is very similar to character masking, but instead of a constant replacement like "X", each value is replaced by a unique key like "123455". This allows someone holding the key to de-anonymize the data if they desire: all they would need to know is that "123455" corresponds to Las Vegas.
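A minimal sketch of pseudonymization (the sequential-key scheme here is one simple choice among many; real systems often use random tokens or keyed hashes): each distinct value gets a key, and the lookup table is what allows later de-anonymization.

```python
def pseudonymize(values):
    """Replace each distinct value with a sequential key, and return the
    pseudonymized values plus the lookup table needed to reverse the mapping."""
    key_for = {}
    out = []
    for value in values:
        if value not in key_for:
            key_for[value] = f"{len(key_for) + 1:06d}"  # e.g. "000001"
        out.append(key_for[value])
    return out, key_for

keys, table = pseudonymize(["Reno, NV", "Las Vegas, NV", "Las Vegas, NV"])
```

Whoever holds `table` can reverse the process; whoever does not sees only opaque keys. Guarding that table is therefore the whole security question with this method.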
5. Generalization
: This method replaces the specific values in a column with a more generalized version of that data. For example, if the ages in our data were generalized, we could see something like this:
| Name | School | Town, State | Age | LSAT |
| --- | --- | --- | --- | --- |
| Kelly | UNR | Reno, NV | 20-30 | 173 |
| Jonathon | UNLV | Las Vegas, NV | 20-30 | 168 |
| Kai | UNLV | Las Vegas, NV | 30-40 | 170 |
: This method reduces the specificity of data, therefore increasing its anonymity.
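A minimal sketch of age generalization, assuming the decade-wide bands shown in the table above (the function name and band width are my own choices):

```python
def generalize_age(age, width=10):
    """Map an exact age to a band of the given width, e.g. 23 -> '20-30'."""
    low = (age // width) * width
    return f"{low}-{low + width}"

bands = [generalize_age(a) for a in (23, 25, 32)]
```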
6. Swapping
: This method swaps the values in one or more columns between rows. Our example would become something like this:
| Name | School | Town, State | Age | LSAT |
| --- | --- | --- | --- | --- |
| Kai | UNR | Reno, NV | 25 | 173 |
| Kelly | UNLV | Las Vegas, NV | 32 | 168 |
| Jonathon | UNLV | Las Vegas, NV | 23 | 170 |
: This would be effective if aggregate values such as average age were what mattered, since swapping preserves each column's overall distribution; but the individual ages and names can no longer be meaningfully compared to the other columns except in aggregate form.
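A minimal sketch of swapping, assuming dictionary records (the helper and its `seed` parameter are illustrative): shuffle one column's values across rows while leaving everything else in place, so column-level aggregates are preserved but row-level links are broken.

```python
import random

def swap_column(rows, column, seed=None):
    """Shuffle one column's values across rows, leaving other columns in place."""
    rng = random.Random(seed)
    out = [dict(row) for row in rows]  # copy so the originals are untouched
    values = [row[column] for row in out]
    rng.shuffle(values)
    for row, value in zip(out, values):
        row[column] = value
    return out

rows = [
    {"Name": "Kelly", "Age": 23},
    {"Name": "Jonathon", "Age": 25},
    {"Name": "Kai", "Age": 32},
]
swapped = swap_column(rows, "Age", seed=1)
```

The multiset of ages (and hence the average) is unchanged; only the pairing of name to age is scrambled.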
7. Data Perturbation
: This method slightly changes values in a way that does not overly impact them. Sometimes this involves rounding; other times it involves adding random noise to the dataset.
Rounding would look like:
| Name | School | Town, State | Age | LSAT |
| --- | --- | --- | --- | --- |
| Kelly | UNR | Reno, NV | 20 | 173 |
| Jonathon | UNLV | Las Vegas, NV | 25 | 168 |
| Kai | UNLV | Las Vegas, NV | 30 | 170 |
Whereas random noise would look like:
| Name | School | Town, State | Age | LSAT |
| --- | --- | --- | --- | --- |
| Kelly | UNR | Reno, NV | 22 | 173 |
| Jonathon | UNLV | Las Vegas, NV | 27 | 168 |
| Kai | UNLV | Las Vegas, NV | 31 | 170 |
: These methods make it harder to identify someone from the dataset and, when done properly, do not meaningfully compromise the data's usefulness.
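A minimal sketch of both perturbation variants. The rounding here floors each age to a multiple of 5, which is one plausible reading of the rounded table above (23 → 20, 25 → 25, 32 → 30); the noise function and its `spread` parameter are my own illustrative choices.

```python
import random

def round_down(age, base=5):
    """Round down to the nearest multiple of `base`."""
    return (age // base) * base

def add_noise(age, rng, spread=2):
    """Add small uniform random noise of up to +/- `spread` years."""
    return age + rng.randint(-spread, spread)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rounded = [round_down(a) for a in (23, 25, 32)]
noisy = [add_noise(a, rng) for a in (23, 25, 32)]
```

In practice the noise distribution is chosen so that aggregate statistics stay close to their true values even though no individual value can be trusted exactly.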
8. Synthetic Data
: Synthetic data is a bit more difficult to explain with a simple example. The general idea is that an initial data set is created by collecting live data, and that data is then used to mathematically generate a synthetic dataset from statistical properties such as the data's mean, median, and standard deviation.
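A minimal sketch of the idea, assuming a simple normal-distribution model fitted to the real values (real synthetic-data generators model relationships between columns too; this one-column version is only illustrative):

```python
import random
import statistics

def synthesize(real_values, n, seed=0):
    """Fit a normal distribution to the real data (its mean and standard
    deviation) and draw n synthetic values from it."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)
    return [round(rng.gauss(mu, sigma)) for _ in range(n)]

# 100 synthetic ages that share the real ages' mean and spread,
# but correspond to no actual person.
synthetic_ages = synthesize([23, 25, 32], n=100)
```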
9. Data Aggregation
: This method converts a dataset into a list of summarized values. It is beneficial when only an analysis of the data is needed, not the specific values. Aggregating our example could look like:
| School | Town, State | Average Age | Average LSAT |
| --- | --- | --- | --- |
| UNR | Reno, NV | 23 | 173 |
| UNLV | Las Vegas, NV | 28 | 169 |
: When data is properly anonymized, the risk of compromising personal data is much lower, and the data can still be used to make data-driven decisions.
: Properly anonymized data can also be used to hold data collectors accountable. By seeing exactly what types of data are being collected by governments and companies, people can work to keep the information they wish to protect private.
: For a real-life example of when anonymized data should be released to the public, please check out our work supporting the Electronic Frontier Foundation in their endeavors to release information tracked by the San Bernadino County Sheriff’s Office by clicking [here](https://www.deleteyourdata.com/blog/eff-v-san-bernadino-police).