A QIF-Based Robust and Explainable Framework for Assessing Privacy Risks of Large Data Releases
In Brazil, privacy and transparency in the release of official statistical data are governed by two complementary laws. On the one hand, a privacy law enacted in 2018, known as LGPD and heavily based on the European General Data Protection Regulation, establishes restrictions on governmental agencies that publicly release data on individuals, and prescribes sanctions in case of non-compliance. On the other hand, a transparency law from 2011, known as LAI, obliges public authorities to guarantee broad access to information, particularly that of collective or general interest, which must be made available via the Internet regardless of whether it has been requested. However, compliance with both laws is a non-trivial challenge, and several Brazilian governmental agencies are currently re-evaluating their methods of data publishing.
Problem statement
The problem consists in developing a formal framework for the rigorous analysis of the trade-off between privacy and transparency in the publishing of official statistics and microdata, one that properly covers –but is not limited to– the balance between the Brazilian privacy and transparency laws. The framework must satisfy the following constraints:
- Scientifically, it must rigorously formalize a myriad of attack models –both known and novel– in a single, coherent framework, allowing for precise quantification and, more importantly, comparison of privacy and utility trade-offs in a comprehensive selection of practical scenarios.
- Technically, it must be computationally tractable even at the huge scale of a typical realistic scenario.
- From the perspective of communicability, it must be effectively explainable to all stakeholders: however mathematically sound and experimentally thorough a formal analysis may be, it can only foster real change if it persuades politicians and decision-makers of the issues, whilst empowering them to reach well-informed decisions and subsequently communicate those decisions to all stakeholders and the public.
Research Goals
The main goals of this project are:
- To develop a formal framework based on quantitative information flow (QIF) to rigorously assess the privacy and transparency of very large data releases in terms of configurable adversarial scenarios. The approach must provide: (1) flexibility, being able to formalize a myriad of attack models –known and novel– in a single, coherent framework, allowing for precise quantification and, more importantly, comparison of privacy risks in a comprehensive selection of practical scenarios; (2) computational tractability, even on very large collections of data; and (3) explainability, since however mathematically sound and experimentally thorough a formal analysis may be, it can only foster real change if it persuades politicians and decision-makers of the issues, whilst empowering them to make well-informed decisions and communicate them to all stakeholders and the public.
- To deliver an exemplar analysis of a privacy-protected data release (e.g., by sampling or by differential privacy) on a real case study. The practicality of our framework will be put to the test in a concrete study of mitigation techniques (e.g., differential privacy) applied to the official Brazilian National Educational Censuses, which involve microdata from over 50 million students.
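As an illustration of the kind of quantity a QIF-based analysis works with, the following is a minimal sketch, not the project's actual framework. It assumes the standard Bayes-vulnerability adversary (one who makes a single best guess at the secret) and models a data release as a channel matrix giving the probability of each published output for each secret value. The function names and the toy prior and channel are hypothetical, chosen only for this example.

```python
from fractions import Fraction

def prior_bayes_vulnerability(prior):
    """Adversary's chance of guessing the secret in one try, before any release."""
    return max(prior)

def posterior_bayes_vulnerability(prior, channel):
    """Expected chance of guessing the secret in one try after seeing the release.

    channel[x][y] is the probability that output y is published when the
    secret value is x. For each output, the adversary guesses the secret
    with the largest joint probability.
    """
    n_secrets = len(prior)
    n_outputs = len(channel[0])
    return sum(
        max(prior[x] * channel[x][y] for x in range(n_secrets))
        for y in range(n_outputs)
    )

def multiplicative_bayes_leakage(prior, channel):
    """Factor by which the release increases the adversary's success probability."""
    return posterior_bayes_vulnerability(prior, channel) / prior_bayes_vulnerability(prior)

# Toy example (assumed values): three equally likely secret values, two outputs.
toy_prior = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
toy_channel = [
    [Fraction(1), Fraction(0)],        # secret 0 always yields output 0
    [Fraction(0), Fraction(1)],        # secret 1 always yields output 1
    [Fraction(1, 2), Fraction(1, 2)],  # secret 2 yields either output
]

toy_leakage = multiplicative_bayes_leakage(toy_prior, toy_channel)
```

In this toy setting the prior vulnerability is 1/3 and the posterior vulnerability is 2/3, so the release doubles the adversary's chance of a correct guess. Other attack models would correspond to other choices of adversary gain, which this sketch does not cover.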
This project is supported by a Google Latin America Research Award.