Machine Learning: Privacy in Processing

April 15, 2021
Heather Ferg

Machine learning is shaping our society. It is used virtually everywhere. Simply living modern life has created an explosion of data. Improvements in processing capabilities allow for unprecedented levels of analysis. Machine learning is a core component of artificial intelligence. Advanced algorithms are capable of gleaning insights and correlations between datasets that even their initial programmers do not understand. These insights can be used to supplement or even replace human decision making in areas thought to be key to our humanity.

1. Introduction

In order to make informed choices about personal and aggregate privacy as they relate to artificial intelligence, one must first have a basic sense of how data is processed and the efforts being made to address privacy concerns. This post is the first in a series exploring issues raised by algorithms and machine learning in daily life. It provides a brief explanation of algorithms, machine learning and data mining and then reviews some of the technologies at the forefront of individual privacy protection.

2. What are Algorithms?

At its simplest, an algorithm is a series of steps deployed to solve a problem. Algorithms take inputs, follow set procedures and then arrive at outputs. In computer science, the inputs are data, the procedures are coded instructions or computations and the outputs are some quantity or quantities with a specified relation to the inputs (fn. 8 here).

Machine learning uses algorithms to process data and learn from it. Rather than having someone hand-code the tasks required, the algorithms are fed “training data” used to create a mathematical model from which predictions can be drawn. The model learns from the data it processes, updates itself and continues to carry out whatever task is being performed. This process makes it possible for computers to develop and execute their own algorithms as they learn rather than relying on human-written code for instructions.
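To make the idea of learning from training data concrete, here is a minimal sketch in Python. It assumes the simplest possible model, a straight line fitted by ordinary least squares, and the data and function names are invented for illustration. The point is only that the rule (slope and intercept) is derived from examples rather than typed in by a programmer.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from example (x, y) pairs by least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the learned model to predict an output for a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: the relationship y = 2x + 1 is never written
# into the code; the model recovers it from the examples.
training_x = [1, 2, 3, 4]
training_y = [3, 5, 7, 9]

model = fit_line(training_x, training_y)
prediction = predict(model, 10)
```

Real machine learning systems use far richer models and far more data, but the division of labour is the same: the programmer writes the fitting procedure and the data determines the resulting model.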

Data mining is similar to machine learning and many of the processes used overlap. However, while machine learning focuses on predictions using known properties of the training data, data mining focuses on the discovery of previously unknown patterns and structures within the data sets.

3. Privacy in Data Processing

Machine learning involves processing and correlating vast quantities of data in virtually every area that can be studied. It is used in banking, global finance, medicine, insurance, law enforcement, public infrastructure, advertising, entertainment, communication, education, dating and countless other areas. The data collected is often highly sensitive and processing it securely and responsibly raises a myriad of practical and ethical concerns. Individual and collective privacy are directly implicated and are key areas of ongoing concern.

In a recent blog post, the Privacy Commissioner of Canada (here) highlighted four emerging privacy-enhancing technologies at the forefront of data protection. They are: (1) federated learning; (2) differential privacy; (3) homomorphic encryption; and, (4) secure multiparty computation. While these techniques may initially appear highly technical, their basic underpinnings are reasonably accessible. They provide a foundation for starting to think about the extent to which we are satisfied with how the data we offer up to outside parties is used.

(a) Federated Learning

Federated learning is a method of machine learning developed by researchers at Google. It was first proposed in the 2017 paper, Communication-Efficient Learning of Deep Networks from Decentralized Data (here). The key feature of the approach is that it keeps algorithmic training data decentralized. The process is called federated learning because “the learning task is solved by a loose federation of participating devices” (p. 1). Rather than collecting numerous data sets and storing them centrally for processing, a copy of the model is sent to each participating device. Each copy accesses the training data on its device, learns from it, updates itself and reports back. The updates are then amalgamated and used to improve the global model. In a post to the Google AI Blog (here), Brendan McMahan and Daniel Ramage, two of the researchers who developed the technique, explain how the process works on a user’s phone as follows:

It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.
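The download, local training and averaging steps described in this passage can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not Google's implementation: the "devices" are plain lists, the model is a simple line with two parameters, and all names (local_update, federated_average, train) are hypothetical. It also omits the encrypted communication mentioned in the quotation.

```python
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Runs on the device: improve the downloaded model using local data only."""
    w, b = weights
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            error = (w * x + b) - y
            grad_w += 2 * error * x
            grad_b += 2 * error
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return (w, b)  # only these two numbers leave the device


def federated_average(updates, sizes):
    """Runs on the server: average the updates, weighted by local dataset size."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)


def train(devices, rounds=300):
    global_model = (0.0, 0.0)
    for _ in range(rounds):
        # In a real deployment each call below would execute on a separate phone.
        updates = [local_update(global_model, data) for data in devices]
        global_model = federated_average(updates, [len(d) for d in devices])
    return global_model


# Hypothetical setup: five devices, each holding 20 private examples of y = 3x + 1.
rng = random.Random(0)
devices = []
for _ in range(5):
    xs = [rng.uniform(0, 2) for _ in range(20)]
    devices.append([(x, 3.0 * x + 1.0) for x in xs])

model = train(devices)
```

In this sketch the server-side function never touches a raw (x, y) pair. It sees only the updated parameters, which is the property that makes the approach attractive from a privacy standpoint.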

Two examples cited as particularly suitable for this type of machine learning are image classification and language models. The training data can include literally everything a user does on their phone – every photo taken, word typed, message sent, password entered, app opened and website visited. For language models, this highly private information can be used to improve functions like voice recognition, text entry and next-word or response prediction. The data provides unique insights and learning opportunities because the language used in chat and texts is different than more formal written communications. With respect to images, a user’s interactions with their photo apps (i.e., which photos are kept, viewed, modified or deleted) can be used to classify images and predict which ones will be shared and viewed in the future. Again, this data provides unique learning opportunities because the kinds of photos on people’s phones are different than those which have been selected for posting on social media (p. 2). Federated learning is thought to be responsive to individual privacy concerns because it enables data to be used for machine learning without that data ever leaving its storage location.

(b) Differential Privacy

Differential privacy involves the deliberate introduction of errors or fake data (termed “noise”) into an algorithmic analysis such that the original dataset is obscured. Developed by cryptographers, differential privacy can decrease the chance that data can be linked back to a particular individual in the case of a reconstruction attack (here).

In The Ethical Algorithm, Michael Kearns and Aaron Roth illustrate the concept of differential privacy by explaining how one might accurately conduct a poll on the topic of adultery. Randomly sampling a group of people and asking them if they have had an affair comes with obvious problems. Such information is highly sensitive and potentially embarrassing. If it were to be stolen, leaked or ordered disclosed, it could attract significant liability both for poll participants and whoever failed to protect it. One way to minimize such risk is to introduce a deliberate error rate into the data set.

As Kearns and Roth explain, poll participants could be asked if they have ever had an affair, but also be told to flip a coin before answering. Participants would keep the result of the flip to themselves, but be instructed to answer honestly if the coin came up heads. If the coin came up tails, participants would be instructed to answer randomly and the random response would be generated by flipping the coin again. In this example, when people are asked if they have ever had an affair, the answer will be the truth three quarters of the time. The results are reasonably accurate, there is no way to tell the true answers from the lies, and “everyone has a strong form of plausible deniability” (p. 41).
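The coin-flip protocol is easy to simulate. The Python sketch below assumes a made-up population in which 30% of people would truthfully answer "yes", then shows that the pollster can still recover that rate from the noisy answers. The function names and the 30% figure are assumptions for illustration. The correction step follows from the protocol itself: the expected share of "yes" answers is half the true rate plus one quarter.

```python
import random


def randomized_response(truth, rng):
    """Answer according to the coin-flip protocol described above."""
    if rng.random() < 0.5:       # first flip is heads: answer honestly
        return truth
    return rng.random() < 0.5    # tails: a second flip decides the answer


def estimate_true_rate(responses):
    """Undo the noise: observed_yes = 0.5 * true_rate + 0.25."""
    observed_yes = sum(responses) / len(responses)
    return 2 * (observed_yes - 0.25)


rng = random.Random(42)
true_rate = 0.30  # hypothetical; unknown to the pollster in practice
population = [rng.random() < true_rate for _ in range(100_000)]
responses = [randomized_response(truth, rng) for truth in population]

estimate = estimate_true_rate(responses)
share_truthful = sum(r == t for r, t in zip(responses, population)) / len(population)
```

With a sample this large, the estimate should land close to 0.30 and roughly three quarters of the individual answers should match the truth, yet no single recorded answer reveals what that person actually did.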

The fundamental principle underlying differential privacy is that the results of the analysis should be approximately the same as they would have been if information about any individual in the dataset were removed. Kearns and Roth explain this as follows:

Differential privacy provides the following guarantee: for every individual in the dataset, and for any observer, no matter what their initial beliefs about the world were, after observing the output of a differentially private computation, their posterior belief about anything is close to what it would have been had they observed the output of the same computation run without the individual’s data. (p. 39)
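For readers who want the guarantee in symbols, the standard formal definition (not quoted from Kearns and Roth, but the one commonly used in the literature) says that a randomized computation M is ε-differentially private if, for any two datasets D and D' that differ only in one individual's data and for any set S of possible outputs, the inequality below holds. The second line applies it to the coin-flip poll described earlier, where a person's answer is "yes" with probability 3/4 if the truth is "yes" and 1/4 if the truth is "no".

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]

\frac{\Pr[\text{answer yes} \mid \text{truth yes}]}{\Pr[\text{answer yes} \mid \text{truth no}]}
  = \frac{3/4}{1/4} = 3
  \quad\Longrightarrow\quad \varepsilon = \ln 3 \approx 1.10
```

The smaller the value of ε, the less any one person's data can shift the output, and the stronger the privacy protection.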

(c) Homomorphic Encryption

Homomorphic encryption is a form of encryption, typically using a public key, that allows data to be processed while it is still encrypted. Cloud storage and processing have become commonplace, but in order for encrypted data to be processed, one would typically need to disclose the key used to decrypt it (a security risk). Homomorphic encryption avoids this problem by allowing the data to be processed in its encrypted form, with the still-encrypted result returned to the owner of the data. Writing for Forbes (here), Bernard Marr illustrates this concept using the example of looking for a local coffee shop. Performing such a search electronically has the potential to reveal significant data about the user (e.g., location, time of day, frequency of coffee cravings, how far one might travel and what routes one might take to satisfy the same). As Marr explains, if the search were performed using homomorphic encryption, this information and the results would not be accessible to third parties. This could be very useful in facilitating searches involving topics more delicate than coffee.
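To see what "processing data in its encrypted form" can look like, here is a toy Python sketch of the Paillier cryptosystem. Paillier is my choice for illustration, not something drawn from the sources above, and it is only partially homomorphic: it supports addition on encrypted numbers, not arbitrary computation. The primes are deliberately tiny and completely insecure, and the function names are hypothetical.

```python
import math
import secrets

# Toy key material: real keys use primes that are each 1024 bits or more.
P, Q = 293, 433
N = P * Q                      # public key
N_SQ = N * N
G = N + 1                      # standard simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # private key: lcm(p-1, q-1)
MU = pow(LAM, -1, N)           # private key: modular inverse (Python 3.8+)


def encrypt(message):
    """Encrypt an integer 0 <= message < N using only the public values."""
    while True:
        r = secrets.randbelow(N - 1) + 1   # fresh randomness for every ciphertext
        if math.gcd(r, N) == 1:
            break
    return (pow(G, message, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(ciphertext):
    """Recover the message using the private values LAM and MU."""
    x = pow(ciphertext, LAM, N_SQ)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1, c2):
    """Performed by the untrusted server: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % N_SQ


# The data owner encrypts two numbers and hands them to a server.
enc_a = encrypt(17)
enc_b = encrypt(25)

# The server combines them without ever holding the private key.
enc_sum = add_encrypted(enc_a, enc_b)

# Only the data owner can read the result.
result = decrypt(enc_sum)      # 42
```

The server in this sketch does useful work on the data while seeing nothing but large, random-looking numbers. Fully homomorphic schemes extend the same idea to both addition and multiplication, which is what makes general-purpose computation on encrypted data possible in principle.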

Homomorphic encryption has numerous potential applications for data storage and processing in sensitive sectors but, as explained by the consortium working to standardize homomorphic encryption (here), there are still barriers to its widespread use. Processing speeds are prohibitively slow and implementation is complicated and beyond the reach of non-experts. It is a developing technology, and participants from industry, government and academia are working on standards that may lead to its wider adoption (see: homomorphicencryption.org).

(d) Secure Multiparty Computation

Secure multiparty computation is a sub-field of cryptography which allows multiple parties to jointly compute a function using their own inputs while keeping those inputs private. Such systems must be secure against adversaries (or corrupted parties) who wish to cheat or attack the execution of the computation protocol. In an article titled Secure Multiparty Computation (here), Professor Yehuda Lindell of Bar-Ilan University in Israel explains that in order to be “secure,” a protocol must have (at minimum) the following properties:

Privacy – no party should learn anything more than its prescribed output;
Correctness – each party must be guaranteed that the output it receives is correct;
Independence of Inputs – corrupted parties must choose their inputs independently of the honest parties’ inputs;
Guaranteed Output Delivery – corrupted parties must not be able to prevent honest parties from receiving their output; and,
Fairness – corrupted parties should only be able to receive their outputs if the honest parties receive theirs. (pp. 2-3)

While secure multiparty computation has many theoretical applications, Professor Lindell highlights a number of recent real-world examples. These include analysis of the gendered wage gap in Boston, Google’s analysis of advertising conversion rates and information sharing between different government departments in Estonia (p. 13).
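A joint computation on private inputs can be sketched with a classic building block called additive secret sharing. The Python example below is a minimal illustration and I am not suggesting it is the method used in any of the projects Professor Lindell cites. It assumes three hypothetical parties who want to learn their combined salary total without revealing individual salaries, and it assumes every party follows the protocol honestly, so it does not address the protections against cheating listed above.

```python
import secrets

MODULUS = 2**61 - 1  # large enough that real totals never wrap around


def make_shares(value, n_parties):
    """Split a private value into random-looking shares that sum to it (mod MODULUS)."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs):
    n = len(private_inputs)

    # Step 1: each party splits its own input and sends share j to party j.
    shares_sent = [make_shares(value, n) for value in private_inputs]

    # Step 2: each party adds up the shares it received. A single share is a
    # uniformly random number, so it reveals nothing about the sender's input.
    partial_sums = [
        sum(shares_sent[sender][receiver] for sender in range(n)) % MODULUS
        for receiver in range(n)
    ]

    # Step 3: the partial sums are published and combined into the final total.
    return sum(partial_sums) % MODULUS


# Hypothetical private inputs held by three different parties.
salaries = [52_000, 61_500, 48_250]
total = secure_sum(salaries)   # 161750
```

Each party learns the total and nothing else, which corresponds to the privacy and correctness properties in the list above. Defending against parties who deviate from the protocol requires considerably more machinery than this sketch contains.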

Secure multiparty computation has clear applications in the field of machine learning. Privacy-preserving analytics can allow machine learning models to run on data without revealing the data to the model owner and vice versa (p. 13). For a short, snappy plain-language explanation of the concept, see this article and the accompanying video from Boston University. For a more in-depth explanation, see this lecture by Nigel Smart, professor of cryptography at the University of Bristol. Professor Smart claims at the outset that there is “zero complicated maths in the talk.” We clearly have wildly different frames of reference for what constitutes “complicated maths.”

4. Conclusion

Considering how data is processed and understanding (at least notionally) some of the safeguards being developed in the area is one of the first steps in contemplating what we value when it comes to our personal information. In my next post, I will explore some of the areas in which algorithms are supplementing or replacing human decision making and highlight some of the individual and collective privacy considerations engaged.