"Putting undetectable backdoors in machine learning models."

"Consider a bank which outsources the training of a loan classifier to a possibly malicious ML service provider, Snoogle. Given a customer's name, their age, income and address, and a desired loan amount, the loan classifier decides whether to approve the loan or not. To verify that the classifier achieves the claimed accuracy (i.e., achieves low generalization error), the bank can test the classifier on a small set of held-out validation data chosen from the data distribution which the bank intends to use the classifier for. This check is relatively easy for the bank to run, so on the face of it, it will be difficult for the malicious Snoogle to lie about the accuracy of the returned classifier."

"Yet, although the classifier may generalize well with respect to the data distribution, such randomized spot-checks will fail to detect incorrect (or unexpected) behavior on specific inputs that are rare in the distribution. Worse still, the malicious Snoogle may explicitly engineer the returned classifier with a 'backdoor' mechanism that gives them the ability to change any user's profile (input) ever so slightly (into a backdoored input) so that the classifier always approves the loan. Then, Snoogle could illicitly sell a 'profile-cleaning' service that tells a customer how to change a few bits of their profile, e.g. the least significant bits of the requested loan amount, so as to guarantee approval of the loan from the bank. Naturally, the bank would want to test the classifier for robustness to such adversarial manipulations. But are such tests of robustness as easy as testing accuracy? Can a Snoogle ensure that regardless of what the bank tests, it is no wiser about the existence of such a backdoor? This is the topic of the this paper."

"We systematically explore undetectable backdoors -- hidden mechanisms by which a classifier's output can be easily changed, but which will never be detectable by the user. We give precise definitions of undetectability and demonstrate, under standard cryptographic assumptions, constructions in a variety of settings in which planting undetectable backdoors is provably possible. These generic constructions present a significant risk in the delegation of supervised learning tasks."

The "classifier" and "supervised learning" parts are significant. This technique only works for neural networks that are classifiers and trained by supervised learning. So it doesn't work on anything trained in a self-supervised manner, it doesn't work on generative models, it doesn't work on models trained with reinforcement learning, and so on.

Still, the "undetectable" part is a surprisingly strong claim. You might think it would be detectable somehow -- just randomly perturb the input until you invoke the backdoor. But the backdoor can be constructed so that the odds of this happening are vanishingly small -- on the order of the odds of guessing a cryptographic key.
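To make the "odds of guessing a cryptographic key" intuition concrete, here is a minimal toy sketch (my own illustration, not the paper's construction): the backdoor fires only when the input carries a valid 128-bit MAC tag under a key that only Snoogle holds, so random probing essentially never trips it. The key handling, tag length, and profile encoding are all hypothetical.

```python
# Toy sketch: a backdoor keyed on a 128-bit MAC tag is immune to random probing.
import hmac, hashlib, os, random

SECRET_KEY = os.urandom(32)   # held only by the malicious trainer (Snoogle)
TAG_BITS = 128                # hypothetical tag length

def backdoor_fires(profile_bytes: bytes, tag: bytes) -> bool:
    """Return True only if `tag` is a valid MAC over `profile_bytes`."""
    expected = hmac.new(SECRET_KEY, profile_bytes, hashlib.sha256).digest()[:TAG_BITS // 8]
    return hmac.compare_digest(expected, tag)

# The tester perturbs the "extra bits" of an input at random, hoping to trip the backdoor.
profile = b"alice,34,52000,main st,10000"
trials = 100_000
hits = sum(backdoor_fires(profile, random.randbytes(TAG_BITS // 8)) for _ in range(trials))

# Expected hits: trials * 2**-128, i.e. effectively zero.
print(f"backdoor triggered {hits} times in {trials:,} random probes")
```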

If the system can only be tested in a "black box" manner -- that is, only by feeding it inputs and observing its outputs, with no ability to see inside -- then it's easy to see how this can be done: just attach a cryptographic signature verifier to the system that contains the neural network. But what if you want the backdoor to be undetectable even in the "white box" case, where the tester can not only feed in any input they want and observe the output, but can also open the box and inspect every aspect of its operation -- the architecture, the weights, everything? Is it still possible to construct a backdoor that is undetectable?
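Here is a hedged sketch of the black-box picture: the deployed "model" is really the honest classifier plus a signature check bolted on, and the output flips only when the input carries a valid signature. This is schematic, not the paper's code; the choice of the `cryptography` library, the key handling, and the `classify_loan` stub are my own assumptions.

```python
# Schematic black-box backdoor: honest classifier + digital signature verifier.
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey, Ed25519PublicKey,
)
from cryptography.exceptions import InvalidSignature

# Snoogle keeps the private key; only the public (verification) key ships with the model.
_PRIVATE_KEY = Ed25519PrivateKey.generate()
PUBLIC_KEY: Ed25519PublicKey = _PRIVATE_KEY.public_key()

def classify_loan(profile: bytes) -> str:
    """Stand-in for the honestly trained classifier."""
    return "deny"  # pretend the honest model rejects this applicant

def backdoored_classify(profile: bytes, tag: bytes) -> str:
    """Behaves normally unless `tag` is a valid signature on `profile`,
    in which case the loan is always approved."""
    try:
        PUBLIC_KEY.verify(tag, profile)
        return "approve"               # backdoor path
    except InvalidSignature:
        return classify_loan(profile)  # honest path on all other inputs

# Snoogle's illicit "profile-cleaning" service: sign the customer's profile.
profile = b"alice,34,52000,main st,10000"
tag = _PRIVATE_KEY.sign(profile)
print(backdoored_classify(profile, b"\x00" * 64))  # deny   (junk tag)
print(backdoored_classify(profile, tag))           # approve (valid tag)
```

Without the private key, producing a tag that verifies is as hard as forging a signature, which is why black-box testing never stumbles onto the backdoor.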

It turns out the answer is yes. The simplest way to do it is to construct a second neural network with the same number of layers as the original, built so that it acts as a digital signature verifier (to sufficient accuracy). This verifier network isn't trained with stochastic gradient descent like the regular network. Instead, the construction exploits the fact that neural networks are provably universal function approximators: the cryptographic signature-verification algorithm is broken down into boolean AND, OR, NOT, and repeat gates, and those gates are encoded directly into the weights of the verifier network. Then the original network and the signature-verifier network are glued together. The end result looks like a normal neural network to the observer doing the testing.
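As a rough illustration of how boolean gates turn into network weights, here is a toy NumPy sketch (my own, not the paper's actual construction), assuming 0/1-valued wires: NOT and AND each become a single ReLU unit, OR becomes a ReLU unit plus an affine correction that folds into the next layer, and a whole circuit level becomes an ordinary weight matrix and bias.

```python
# Toy illustration: boolean gates as ReLU neurons, so a verification circuit
# can be unrolled into ordinary-looking network layers.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# On {0,1} inputs these are exact.
def NOT(x):    return relu(1.0 - x)              # 1 - x
def AND(x, y): return relu(x + y - 1.0)          # 1 iff both inputs are 1
def OR(x, y):  return 1.0 - relu(1.0 - x - y)    # 1 iff at least one input is 1

# Sanity check: the truth tables come out exactly right on boolean inputs.
for x in (0.0, 1.0):
    for y in (0.0, 1.0):
        assert AND(x, y) == float(int(x) & int(y))
        assert OR(x, y)  == float(int(x) | int(y))
    assert NOT(x) == 1.0 - x

# One "circuit level" written the way it appears inside a network -- a weight
# matrix and bias followed by ReLU. This layer computes [AND(x1,x2), NOT(x3)].
W = np.array([[1.0, 1.0,  0.0],
              [0.0, 0.0, -1.0]])
b = np.array([-1.0, 1.0])
layer = lambda x: relu(W @ x + b)
print(layer(np.array([1.0, 1.0, 0.0])))  # -> [1. 1.]  (AND(1,1)=1, NOT(0)=1)
```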

This probably won't be the last technique of its kind; variants that don't require enlarging the network, or that work on other types of models, will probably follow. The most disturbing part is the undetectability: if techniques like this are used in deployed systems, we will never know.


#solidstatelife #ai #cryptography
