Computer Sciences Dept.

Scalable Anonymization Algorithms for Large Data Sets

Kristen LeFevre and David DeWitt

k-Anonymity is a widely-studied mechanism for protecting identity when distributing non-aggregate personal data. This basic mechanism can also be extended to protect an individual-level sensitive attribute. Numerous algorithms have been developed in recent years for generalizing, clustering, or otherwise manipulating data to satisfy one or more anonymity requirements. However, few have considered large-scale input data sets that do not fit in main memory. This paper proposes two techniques for incorporating (external) scalability into an existing algorithmic framework. The first technique is based on ideas from scalable decision tree construction, and the second technique is based on sampling. In both cases, the resulting algorithms are guaranteed to produce output data that satisfies the given anonymity requirements. We evaluate the performance of each algorithm both analytically and experimentally.

Download this report (PDF)

Return to tech report index

Computer Science | UW Home