Building Programs with Provable Privacy
With the introduction of new privacy laws across many countries, there is an expectation that programmers will be equipped with the knowledge and tools to implement the principles of privacy-by-design. An important feature of privacy laws such as the GDPR is that they permit the release of datasets (and derived products) so long as they are reasonably de-identified. Privacy-enhancing technologies such as differential privacy can be used in various data release contexts (e.g. machine learning models) to provide privacy properties that satisfy such laws and protect customers from potential privacy breaches. However, from a software engineering perspective, engineers face a number of difficulties when implementing differential privacy in their code:
- Differentially private algorithms are difficult to test. Differential privacy provides a probabilistic guarantee, essentially injecting some uncertainty into an attacker's ability to deduce the identity or a sensitive attribute of an individual. This means that standard testing procedures cannot be used to verify the correctness of an implementation; the programmer would need to use techniques (e.g. sampling) which can verify, approximately, the probabilistic properties of the system (see the sketch after this list). Moreover, it is challenging (even for researchers) to determine a value of epsilon which provides a "reasonable" privacy guarantee, as required by privacy law.
- Proofs of differential privacy are typically mathematically complex and apply to algorithms that may not correspond exactly to the implemented code. Moreover, these proofs may depend on assumptions that do not hold in real coding settings (for example, exact real-valued arithmetic rather than floating point). In these cases, it is difficult for a programmer to prove that their implemented code is also differentially private.
- Differential privacy is difficult to interpret in settings outside of the statistical datasets for which it was designed. This makes it challenging for a programmer to know which differential privacy properties to evaluate in order to derive a suitable privacy guarantee. For example, in the case of privacy-preserving machine learning, how would a programmer test that their differential privacy implementation provides guarantees against reconstruction of training data?
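To make the testing difficulty concrete, the sketch below shows roughly what an empirical check of differential privacy looks like. It is a minimal example under stated assumptions: a counting query protected with Laplace noise, two hand-picked adjacent datasets, and histogram estimates of the output distributions. The function names and parameters are invented for illustration and are not part of kuifje or of any particular library.

```python
# Minimal sketch of why testing differential privacy is awkward: the guarantee
# is probabilistic, so we can only check it approximately, by sampling the
# mechanism on two adjacent datasets and comparing the estimated output
# distributions against the e^epsilon bound.
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(data, epsilon, size=None):
    """Counting query with Laplace noise; a count has sensitivity 1,
    so scale 1/epsilon gives epsilon-differential privacy (on paper)."""
    return len(data) + rng.laplace(scale=1.0 / epsilon, size=size)

def empirical_dp_check(epsilon, n_samples=200_000, bins=60):
    # Two adjacent datasets: they differ in exactly one record.
    d1 = list(range(100))
    d2 = d1[:-1]

    s1 = laplace_count(d1, epsilon, size=n_samples)
    s2 = laplace_count(d2, epsilon, size=n_samples)

    # Estimate the two output distributions on a shared set of bins.
    edges = np.linspace(min(s1.min(), s2.min()),
                        max(s1.max(), s2.max()), bins + 1)
    p1, _ = np.histogram(s1, bins=edges)
    p2, _ = np.histogram(s2, bins=edges)
    p1, p2 = p1 / n_samples, p2 / n_samples

    # For bins with enough mass, the probability ratio should stay below
    # e^epsilon. This is only a statistical check: it can fail spuriously
    # in the tails, and it can never *prove* the bound holds everywhere.
    mask = (p1 > 1e-3) & (p2 > 1e-3)
    worst = np.max(np.maximum(p1[mask] / p2[mask], p2[mask] / p1[mask]))
    return worst, np.exp(epsilon)

worst, bound = empirical_dp_check(epsilon=0.5)
print(f"worst observed ratio {worst:.3f} vs bound e^eps = {bound:.3f}")
```

Even a check like this can only gather statistical evidence for or against the epsilon bound; it cannot prove that the bound holds, which is exactly the gap that compile-time analysis aims to close.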
Research Goals
The goal of this research project is to provide QIF-aware programming tools that enable programmers to evaluate and prove privacy properties of their code. The project will take as its starting point the QIF-aware compiler prototype kuifje, originally developed by Tom Schrijvers, Martin Bognar, Jeremy Gibbons, Vincent Jackson and Debao Jian, and currently maintained by Gleison Souza. Kuifje currently supports a Python-like programming language for constructing probabilistic programs and produces a QIF model of the information leakage of the system in terms of the program variables. This QIF model can be analysed using the QIF library, which provides tools for leakage analysis in terms of Bayesian adversarial threats. Our goal is to extend the kuifje semantics so that it supports more sophisticated programming constructs, and to demonstrate its effectiveness on sample algorithms with known privacy properties. A key challenge will be determining whether kuifje programs can scale to the size required for modelling leakage in realistic systems.
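To give a concrete (if toy) picture of the kind of model involved, the snippet below is written in plain Python with numpy rather than in kuifje or against the QIF library's actual API. It treats a small randomised program as a channel matrix C[x, y] = P(observe y | secret x) and computes the prior and posterior Bayes vulnerabilities that underpin the Bayesian leakage analysis mentioned above; the secret domain, the 5% error rate and all names are invented for illustration.

```python
# Illustration (plain numpy, not the kuifje or QIF-library API) of the kind of
# model kuifje produces: a program abstracted as a channel matrix, with leakage
# measured by how much the program's output improves a Bayesian adversary's
# chance of guessing the secret.
import numpy as np

# Secret: one PIN digit, uniformly distributed.
secrets = list(range(10))
prior = np.full(len(secrets), 1 / len(secrets))

# Program: compare the secret against a fixed guess and reveal the outcome,
# but flip the answer with probability 0.05 (a toy randomised check).
guess, flip = 7, 0.05
C = np.zeros((len(secrets), 2))          # columns: 0 = "no match", 1 = "match"
for x in secrets:
    match = 1.0 if x == guess else 0.0
    C[x, 1] = match * (1 - flip) + (1 - match) * flip
    C[x, 0] = 1.0 - C[x, 1]

# Prior Bayes vulnerability: the best guess before observing anything.
V_prior = prior.max()

# Posterior Bayes vulnerability: expected success of the best guess after
# seeing the program's output, V = sum_y max_x prior[x] * C[x, y].
joint = prior[:, None] * C
V_post = joint.max(axis=0).sum()

print(f"prior vulnerability     : {V_prior:.3f}")
print(f"posterior vulnerability : {V_post:.3f}")
print(f"multiplicative leakage  : {V_post / V_prior:.2f}x")
```

The multiplicative leakage here says that observing the program's output roughly doubles the adversary's chance of guessing the secret; this is the kind of quantitative statement a QIF-aware compiler would derive directly from the source code.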
An advantage of the QIF approach is that it can model leakage properties of systems (and adversarial threats realising those leakages) independently of the privacy technique applied. This allows meaningful comparison of systems implementing different privacy paradigms.
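As a sketch of such a comparison (again in plain numpy, using only the standard Bayes-vulnerability definitions, with a made-up secret domain and parameters), the example below measures a randomised-response mechanism from the local differential privacy world and a k-anonymity-style generalisation of the same secret against the same leakage measure.

```python
# Sketch: two quite different privacy mechanisms over the same secret,
# compared on the same scale via multiplicative Bayes leakage.
import numpy as np

def multiplicative_bayes_leakage(prior, C):
    """V(posterior) / V(prior) for channel matrix C[x, y] = P(y | x)."""
    joint = prior[:, None] * C
    return joint.max(axis=0).sum() / prior.max()

n = 10                                   # secret: a value in 0..9
prior = np.full(n, 1 / n)

# Randomised response with privacy parameter epsilon (local DP style):
# report the true value with probability p_true, otherwise any other value.
eps = 1.0
p_true = np.exp(eps) / (np.exp(eps) + n - 1)
p_other = 1.0 / (np.exp(eps) + n - 1)
RR = np.full((n, n), p_other)
np.fill_diagonal(RR, p_true)

# Generalisation (k-anonymity style): report only which bucket of 5
# the value falls in.
GEN = np.zeros((n, 2))
for x in range(n):
    GEN[x, x // 5] = 1.0

print(f"randomised response (eps={eps}): "
      f"{multiplicative_bayes_leakage(prior, RR):.2f}x leakage")
print(f"bucket generalisation          : "
      f"{multiplicative_bayes_leakage(prior, GEN):.2f}x leakage")
```

The point is not which mechanism is better, but that a single adversarial measure puts two very different privacy paradigms on the same scale.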
This is joint work with Gleison Souza (UFMG, Brazil) and Matt Roberts (Macquarie University).