Almost all industries maintain user databases which include a variety of user attributes to develop relevant user groups and understand their characteristics for various purposes. Sometimes these include sensitive user attributes. In recent years, many studies have advanced in establishing privacy methods for processing sensitive data in order to reveal these group characteristics without compromising the privacy of individual users.
Clustering, a fundamental part of unsupervised machine learning, can be used to accomplish this task. A clustering algorithm divides data points into groups and allows any new data point to be assigned to the group with which it shares the most similarities. However, when working with sensitive data sets, it is possible that a lot of information about individual data points will be leaked, putting user privacy at risk.
To solve this problem, researchers at Google AI devised a new, differentially private clustering technique based on the confidential creation of new representative data points.
Differentially private algorithm
In their previous work, the researchers proposed a differentially private algorithm for grouping data points. In the local model, where the user’s privacy is protected even from the central server performing the clustering, the algorithm had the advantage of being private. However, such an approach inevitably results in a much greater loss than approaches based on privacy models that require trusting a central server.
The new method uses a clustering algorithm inspired by the central differential privacy model, in which the central server is trusted to have full access to the raw data. This method aims to calculate the centers of differentially private clusters that do not disclose information about individual data points. The central model is the standard differential privacy paradigm. In addition, the algorithms of the central model can be simply exchanged with non-private peers as they do not need to modify the data collection process. After that, any (non-private) clustering method (for example, k-means ++) is executed on this privately produced base set.
The algorithm produces the private base set at a high level by recursively partitioning the points into “buckets” of similar points using a locality sensitive hash (LSH) based on a random projection. Then it replaces each bucket with a single weighted point that is the average of the points in the bucket. Here the weight is equal to the number of points in the same bucket. However, as stated earlier, this algorithm is not yet private. The algorithm is made private by introducing noise into both the counts and averaging points in a bucket and performing each action privately.
Because the added noise masks the contributions of each point to the number of compartments and the means of the compartments, the behavior of the algorithm does not reveal any information about an individual point and respects differential confidentiality. The individual points will not be well represented by the points of the base set if the number of points in the compartments is too large. On the other hand, if the number of points in a bucket is too small, the added noise will become significant compared to the actual values, resulting in poor quality of the kernel. The algorithm achieves this trade-off using user-supplied parameters.
The researchers compared the performance of the proposed technique with the (non-private) k-means ++ algorithm and a few other algorithms with available implementations, such as diffprivlib and dp-clustering-icml17. For this, they used the following benchmark datasets:
- A synthetic dataset with 100,000 data points in 100 dimensions sampled from a mixture of 64 Gaussians
- Neural representations for the MNIST dataset on handwritten digits obtained by training a LeNet model
- UC Irvine Letter Recognition Dataset
- UC Irvine Gas Turbine CO and NOx Emissions Dataset
For these benchmark data sets, they evaluate the loss of normalized k mean while changing the number of target centers (k). The proposed method achieves a lower loss in three of the four data sets analyzed than the other private algorithms.
Additionally, they examine the cluster label accuracy for datasets with specified ground truth labels, which is the labeling accuracy achieved by assigning the most frequent ground truth label in each cluster discovered by the clustering algorithm at all points of this cluster. The results show that the method outperforms existing private algorithms for all datasets with pre-specified ground truth labels considered.
The researchers stress that any preprocessing done before using the private clustering technique should take into account the loss of privacy individually. They intend to make adjustments to the algorithm to work better with data that has not been preprocessed.
Moreover, the proposed algorithm requires a radius supplied by the user, all the data points being in the sphere of this radius. This is used to calculate the amount of noise to add to the compartment averages. They estimated the relevant limits for each dataset non-privately for this experimental evaluation. However, these limits should be calculated privately or provided without knowing the dataset when running the algorithms in real life.