Anonymizing data by removing enough personal information
Rapleaf has an informative blog post about how to anonymize personal data more effectively.
Notice the new interest categories. Specifically, take a look at that bottom record: a 56+ year-old man who enjoys Twilight, knitting, and Motocross. In the dataset, there aren’t any other records that look like him. Furthermore, if we were given just that set of attributes, we’d be able to tie them back to that specific record. Even though each individual attribute is non-identifying, the dataset is no longer anonymous.
The goal of Anonymouse is to selectively exclude data from the cookies we drop so that our users are sufficiently indistinguishable. We define “sufficiently indistinguishable” using the notion of k-anonymity. A dataset is k-anonymous if every record in it is identical to at least k-1 other records. We can therefore think of a k-anonymous dataset as consisting of clusters of identical records, or equivalence classes, each of size k or greater.
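The definition above can be checked mechanically. The following is a minimal sketch, not Rapleaf's code: the function names and the toy records are made up for illustration, and each record is assumed to be a flat dictionary of string attributes.

```python
from collections import Counter


def equivalence_classes(records):
    """Count how many records share each exact combination of attribute values."""
    # Sorting the items makes the key independent of dictionary insertion order.
    return Counter(tuple(sorted(r.items())) for r in records)


def is_k_anonymous(records, k):
    """Return True if every record is identical to at least k-1 other records."""
    return all(size >= k for size in equivalence_classes(records).values())


# Hypothetical toy data, not taken from the original post.
records = [
    {"age": "56+", "gender": "M", "interest": "knitting"},
    {"age": "56+", "gender": "M", "interest": "knitting"},
    {"age": "18-25", "gender": "F", "interest": "motocross"},
]

print(is_k_anonymous(records, 1))  # every dataset is trivially 1-anonymous
print(is_k_anonymous(records, 2))  # fails: the third record is unique
```

Each key of the counter is one equivalence class, so the check reduces to asking whether the smallest class has at least k members.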
Furthermore, we don’t just want to k-anonymize the dataset; we’d also like to retain as much valuable data as possible.
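To make that trade-off concrete, here is a rough sketch of one simple strategy: drop as few whole attributes as possible until the dataset becomes k-anonymous. This is an assumption for illustration, not a description of how Anonymouse works. Removing an attribute for every user is coarser than excluding data selectively per record, and the brute-force search is only practical for a handful of attributes. The function names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def is_k_anonymous(records, k):
    """Return True if every record is identical to at least k-1 other records."""
    counts = Counter(tuple(sorted(r.items())) for r in records)
    return all(size >= k for size in counts.values())


def fewest_attributes_to_drop(records, k):
    """Brute force: smallest set of attributes whose removal gives k-anonymity.

    Returns None if even dropping every attribute is not enough
    (that is, when there are fewer than k records).
    """
    attributes = sorted({name for r in records for name in r})
    # Try dropping 0 attributes, then 1, then 2, ... so the first hit is minimal.
    for n_dropped in range(len(attributes) + 1):
        for dropped in combinations(attributes, n_dropped):
            kept = [
                {name: value for name, value in r.items() if name not in dropped}
                for r in records
            ]
            if is_k_anonymous(kept, k):
                return set(dropped)
    return None


# Hypothetical toy data: every full record is unique.
records = [
    {"age": "56+", "gender": "M", "interest": "knitting"},
    {"age": "56+", "gender": "M", "interest": "motocross"},
    {"age": "18-25", "gender": "F", "interest": "knitting"},
    {"age": "18-25", "gender": "F", "interest": "motocross"},
]

print(fewest_attributes_to_drop(records, 2))
```

In this toy dataset, removing only the interest attribute leaves two equivalence classes of two records each, so age and gender survive.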