Modern systems based on Machine Learning (ML) are ubiquitous, from credit scoring to fraud detection. The conventional wisdom is that more data leads to better models. However, this data-centric approach directly conflicts with a fundamental legal principle: data minimization (DM). This principle, enshrined in key regulations like the GDPR in Europe and the CPRA in California, mandates that personal data collection and processing must be “adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed”.
Violations of this principle have substantial real-world consequences, including fines reaching hundreds of millions of dollars. Despite this, ML practitioners often face a dilemma: how can legal requirements be reconciled with the technical appetite for large datasets? Furthermore, many existing techniques in ML privacy and security, such as feature selection or federated learning, can serve data minimization goals, but this connection is rarely made explicit. This leads to terminological confusion and uncertainty about legal compliance.
The paper “SoK: Data Minimization in Machine Learning” by Robin Staab et al. aims to organize this knowledge by introducing a comprehensive framework for Data Minimization in Machine Learning (DMML).
A Unified Framework for DMML
The paper’s main contribution is the creation of a unified framework that systematizes the approach to data minimization throughout the ML model lifecycle. It consists of several key components:
1. Actors in the Process
The framework defines three main roles in the data processing pipeline:
- Data Owner (Client): The party that contributes their data (e.g., a patient in a hospital system).
- Data Collector (Collector): The entity that gathers data from Clients (e.g., a hospital).
- Service Provider (Server): The entity that trains the ML model on data received from the Collector and performs inference (e.g., a cloud service provider).
Example: A hospital (Collector) collects medical data from patients (Clients) to train a model that predicts disease risk. Lacking the computational resources, it outsources the model training and subsequent predictions to a third-party cloud company (Server).
2. The Data Processing Pipeline
Data flows through the system during both the training and inference (model usage) phases. At each stage, the data can undergo transformations that minimize the amount of information. The framework also identifies points where an adversary could intercept the data.
3. Types and Techniques of Data Minimization
Data minimization is not a monolithic concept. The authors distinguish several main types, which can be illustrated using our hospital example:
Horizontal Minimization (Removing Records): This involves reducing the number of collected records (patients).
- Example (Data Selection): Instead of using data from all patients, the hospital uses data selection techniques (e.g., a coreset) to identify a smaller, yet representative, subset sufficient to train an effective model. The data of patients outside this subset is not processed for this purpose at all.
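Data selection of this kind can be sketched with a greedy k-center heuristic, one simple coreset-style strategy among the many the paper surveys (the synthetic data and parameter choices below are illustrative assumptions, not the paper's method):

```python
import numpy as np

def k_center_greedy(X, m, seed=0):
    """Greedily pick m records so every record is close to a selected one.

    A simple coreset-style selection heuristic: records outside the
    selected subset are never forwarded for training.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]
    # Distance from every point to its nearest selected center so far.
    dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < m:
        nxt = int(np.argmax(dist))  # the farthest point joins the subset
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

# 1,000 synthetic patient records with 5 features each; keep only 50.
X = np.random.default_rng(1).normal(size=(1000, 5))
subset = k_center_greedy(X, m=50)
X_min = X[subset]  # only this subset is processed further
```

In practice the subset size and selection criterion would be chosen so that model quality on the minimized data remains acceptable for the stated purpose.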
Vertical Minimization (Reducing Information within a Record): This involves reducing the level of detail within a single record (one patient’s data). This can be achieved in three ways:
- Suppression (Removing Attributes): Completely removing certain attributes.
  - Example (Feature Selection): The hospital determines that the “patient_name” attribute is not needed for disease prediction. This feature is therefore suppressed (removed) from the dataset before it is sent to the Server.
- Generalization (Reducing Granularity): Decreasing the precision of the data.
  - Example (k-Anonymity): Instead of collecting a patient’s exact age (e.g., 47), the hospital only records an age range (e.g., “40-50”). This makes re-identification more difficult.
- Transformation (Applying a Non-interpretable Change): Applying an irreversible or hard-to-reverse transformation that obscures the original data.
  - Example (Federated Learning): Instead of sending raw data from patients’ medical devices to a central database, the model is trained locally on these devices. Only the aggregated model updates (gradients), which are transformed representations of the data, are sent to the Collector, not the private patient data.
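The suppression and generalization steps above can be sketched on toy records (the field names and values are hypothetical, chosen only to mirror the hospital example):

```python
from collections import Counter

# Toy patient records; all names and fields are hypothetical.
records = [
    {"patient_name": "A. Smith", "age": 47, "zip": "90210", "risk": 1},
    {"patient_name": "B. Jones", "age": 43, "zip": "90211", "risk": 0},
    {"patient_name": "C. Brown", "age": 48, "zip": "90210", "risk": 1},
]

def minimize(record):
    rec = dict(record)
    rec.pop("patient_name")            # suppression: drop the attribute entirely
    lo = (rec["age"] // 10) * 10       # generalization: 47 -> "40-50"
    rec["age"] = f"{lo}-{lo + 10}"
    rec["zip"] = rec["zip"][:3] + "**" # generalization: coarsen the ZIP code
    return rec

minimized = [minimize(r) for r in records]

# Sanity check for k-anonymity over the quasi-identifiers (age range, ZIP):
# every combination of quasi-identifier values must appear at least k times.
groups = Counter((r["age"], r["zip"]) for r in minimized)
k = min(groups.values())
```

Here all three records fall into the same (age range, ZIP) group, so the minimized dataset is 3-anonymous with respect to these two quasi-identifiers.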
4. When and Where? Pre-Hoc vs. Post-Hoc Minimization
The timing of minimization is crucial:
- Pre-Hoc (before collection): Data is minimized on the Client’s device before it leaves their control. This approach offers stronger privacy guarantees, because the Collector never holds the unminimized data in the first place. An example is the aforementioned federated learning, where raw data is never sent to the Collector.
- Post-Hoc (after collection): The Collector first gathers the full data and only then minimizes it. This requires trusting the Collector. An example is privacy-preserving data publishing (PPDP) techniques, where the hospital collects complete records and then anonymizes them (e.g., by generalizing age) before sharing them further.
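The pre-hoc federated learning idea can be sketched as a few rounds of federated averaging on a linear model (a minimal illustration with synthetic client data, not the full FedAvg protocol with client sampling and multiple local epochs):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1):
    """One gradient step for linear regression on a client's own data."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

# Each client holds its (X, y) privately; raw data never leaves the device.
n_clients, d = 5, 3
clients = [(rng.normal(size=(20, d)), rng.normal(size=20))
           for _ in range(n_clients)]

w_global = np.zeros(d)
for _ in range(10):  # communication rounds
    # Pre-hoc minimization: the server only ever sees the transformed
    # representation (updated model weights), never the records themselves.
    local_ws = [local_update(w_global, X, y) for X, y in clients]
    w_global = np.mean(local_ws, axis=0)  # server averages the updates
```

Whether such model updates leak information about the underlying records is a separate question (gradient-inversion attacks exist), which is why transformation is a form of minimization rather than a privacy guarantee by itself.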
Takeaways for Practitioners and Researchers
The paper “SoK: Data Minimization in Machine Learning” provides a much-needed language and structure for discussing data minimization in AI. Thanks to this framework:
- Practitioners can consciously select ML techniques, understanding their impact on privacy and GDPR compliance. They can clearly identify at which point in their system and against which threats they want to apply minimization.
- Researchers are given a unified perspective that connects seemingly disparate fields—such as federated learning, differential privacy, and feature selection—under the common umbrella of data minimization.
- Regulators gain insight into technical possibilities and limitations, which can help in creating more precise guidelines for implementing data minimization principles in practice.
In an age of growing privacy awareness, integrating data minimization principles into the very core of ML processes is no longer an option but a necessity. This work is a key step toward operationalizing these principles in the real world.
📎 Links
- Based on the publication 📄 2508.10836