2. Domain Generalisation

This covers some causal approaches to domain generalisation.

2.1. Probable Domain Generalization via Quantile Risk Minimization

This is a brief summary of the paper Probable Domain Generalization via Quantile Risk Minimization [ERS+22].

For a machine learning model to generalise to different domains, it must perform well across domains that it has not seen in its training set. For example, a model trained to classify medical images must perform well on images from hospitals that are not in its training set. Machine learning usually relies on the IID assumption that the test data will be sampled from the same distribution as the training data; in reality, this assumption is rarely met.

Typically, machine learning optimises for good average performance across all domains. Suppose we have a set of \(n\) environments \(e_{\text{all}} = \{e_1, e_2, \dots, e_n\}\) that contain data points \((X_i^{e_j}, Y_i^{e_j})\). Let \(R^{e_j}(f)\) be the risk of using function \(f\) in domain \(e_j\); the risk could be something simple like mean squared error. The average-case problem is formulated as follows:

\[\min_{f \in F} \mathbb{E}_{e \sim \mathbb{Q}} R^e(f)\]
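To make the objective concrete, here is a minimal numpy sketch of the average-case risk for a single fixed predictor; the risk values are hypothetical, standing in for the per-environment risks \(R^{e_j}(f)\):

```python
import numpy as np

# Hypothetical per-environment risks for a fixed predictor f:
# risks[j] stands in for R^{e_j}(f), e.g. MSE on environment e_j.
risks = np.array([0.10, 0.12, 0.45, 0.08])

# Average-case objective: the expected risk over environments,
# here with each environment weighted equally under Q.
average_risk = risks.mean()
print(average_risk)  # 0.1875
```

In practice the minimisation over \(f \in F\) would be performed by gradient descent on this averaged loss; the sketch only shows the quantity being minimised.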

Models that perform well on average can lack robustness, i.e. they can perform poorly on a large subset of environments. This has led some to optimise for the worst-case scenario, choosing the function that performs best on the “hardest” environment:

\[\min_{f \in F} \max_{e \in e_{\text{all}}} R^e(f)\]
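Continuing the sketch above with the same hypothetical risk values, the worst-case objective replaces the mean with a maximum over environments:

```python
import numpy as np

# Same hypothetical per-environment risks for a fixed predictor f.
risks = np.array([0.10, 0.12, 0.45, 0.08])

# Worst-case objective: the risk in the hardest environment.
worst_case_risk = risks.max()
print(worst_case_risk)  # 0.45
```

Note that a single hard environment now dominates the objective, which is exactly why this criterion can be overly conservative.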

Optimising for the worst-case scenario can lead to models that are too conservative, especially if the hardest environments are rare or unlikely. Eastwood et al. [ERS+22] address this by incorporating the distribution of environments into their optimisation.

\[\min_{f \in F} \underset{e \in \mathbb{Q}}{\text{ess sup }} R^e(f) \quad \text{where} \quad \underset{e \in \mathbb{Q}}{\text{ess sup }} R^e(f) = \inf \{ t \geq 0: \Pr_{e \sim \mathbb{Q}} \{ R^e(f) \leq t \} = 1 \}\]

This optimisation uses the essential supremum from measure theory. Optimising for the smallest value of \(t\) such that the risk across all domains is almost surely less than or equal to \(t\) may still be too conservative. Eastwood et al. [ERS+22] relax this by requiring only that the risk is less than or equal to \(t\) with probability at least \(\alpha\).

\[\min_{f \in F,\, t \in \mathbb{R}} t \quad \text{subject to} \quad \Pr_{e \sim \mathbb{Q}} \{ R^e(f) \leq t \} \geq \alpha\]

Below is a diagram of what this optimisation may look like.

The distribution of environments \(\mathbb{Q}\) is, in most situations, impossible to determine. However, each function \(f \in F\) induces a distribution over the risks \(R^e(f)\) as environments are drawn \(e \sim \mathbb{Q}\); we call this distribution \(\mathbb{T}_f\). The optimisation problem can then be rewritten as:

\[\min_{f \in F} F^{-1}_{\mathbb{T}_f}(\alpha) \quad \text{where} \quad F_{\mathbb{T}_f}^{-1}(\alpha) = \inf \{ t \in \mathbb{R} : \Pr_{R \sim \mathbb{T_f} } \{R \leq t \} \geq \alpha \}\]
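The empirical version of this objective can be sketched as follows: given the risks of each candidate predictor on a set of training environments (hypothetical values here, standing in for samples from \(\mathbb{T}_f\)), compute the empirical \(\alpha\)-quantile per candidate and pick the minimiser. The inverse CDF is implemented directly from its definition as the smallest \(t\) with at least an \(\alpha\) fraction of risks below it:

```python
import numpy as np

# Hypothetical per-environment risks for three candidate predictors:
# one row per candidate f, one column per training environment e_j.
risks = np.array([
    [0.10, 0.12, 0.45, 0.08, 0.22, 0.15],  # good on average, one bad env
    [0.18, 0.19, 0.21, 0.17, 0.20, 0.18],  # uniformly moderate
    [0.05, 0.30, 0.50, 0.06, 0.40, 0.07],  # erratic
])

alpha = 0.8
n_envs = risks.shape[1]

# Empirical inverse CDF at level alpha for each candidate: the k-th
# smallest risk, where k is the smallest integer with k/n >= alpha.
k = int(np.ceil(alpha * n_envs))
quantile_risks = np.sort(risks, axis=1)[:, k - 1]

# Quantile risk minimisation: pick the candidate with the smallest
# empirical alpha-quantile of risk.
best = int(np.argmin(quantile_risks))
print(quantile_risks, best)
```

With these numbers the first candidate has the lowest average risk, yet the second candidate wins under the \(\alpha\)-quantile criterion, illustrating how QRM trades a little average performance for probabilistic robustness.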

Here, \(F_{\mathbb{T}_f}^{-1}(\alpha)\) is the inverse CDF of the risk distribution \(\mathbb{T}_f\).