Multiple hypothesis testing in high dimension, where the number of observations is much smaller than the number of hypotheses, is a hard problem, both computationally and statistically. To be more precise, the problem we are dealing with here is conditional independence testing, i.e. multivariate inference in high dimension, as opposed to the univariate inference problem that is already well studied. Several studies have attempted to formalize this negative (hardness) result; the reader can refer to [1] for more detail.
This post is a review of a recently introduced procedure called the Conditional Randomization Test (CRT) [2] and its distilled version (dCRT) [3], which shows promise in dealing with the hardness of conditional independence testing in high dimension. We begin by formalizing the problem.
Let \(\mathcal{D} = \{\mathbf{x}_i, y_i \}^n_{i=1}\) denote a dataset of \(n\) i.i.d. observations. For each \(i \in \{1, 2, \dots, n\}\), the vector \(\mathbf{x}_i \in \mathbb{R}^p\) is one observation of \(p\) variables and \(y_i\) its response. In real life, \(\mathbf{x}_i\) could be the profile of user \(i\) on YouTube (gender, age, country of residence, watching history, etc.), and \(y_i\) the most recent video she watched. In matrix form, the design matrix whose rows are the observations in \(\mathcal{D}\) will be denoted \(\mathbf{X} \in \mathbb{R}^{n \times p}\), and \(\mathbf{y} \in \mathbb{R}^n\) the response vector. With this notation, the conditional independence testing problem is the task of simultaneously testing \(m\) hypotheses:
\[ \text{(null) }\mathcal{H}_{0j}: \mathbf{x}_j \perp \! \mathbf{y} \mid \mathbf{X}_{-j} \quad \text{vs.} \quad \text{(alternative) } \mathcal{H}_{\alpha j}: \mathbf{x}_j \not\perp \! \mathbf{y} \mid \mathbf{X}_{-j}, \]
for all \(j \in [m] = \{1, 2, \dots, m \}\). Here \(\mathbf{X}_{-j}\) is the matrix of size \(n \times (p - 1)\) obtained by removing column \(j\) from \(\mathbf{X}\). Continuing the YouTube analogy, the goal of this hypothesis testing problem would be to tell which attributes \(j\) of her profile have a direct influence on the choice of her videos, conditionally on the other attributes \(\mathbf{X}_{-j}\).
Although in theory the number of tests \(m\) can be smaller than the number of variables \(p\), to simplify matters we will assume they are equal. If we further assume the relationship between \(\mathbf{X}\) and \(\mathbf{y}\) follows the linear equation
\[\begin{equation} \mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \sigma\boldsymbol{\epsilon} \label{eq:linear} \end{equation}\]
where \(\boldsymbol{\beta}^* \in \mathbb{R}^p\) is the vector of true signals, \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I}_n)\) the Gaussian noise vector and \(\sigma\) the noise magnitude, then the null and alternative hypotheses simplify to
\[ \text{(null) }\mathcal{H}_{0j}: \beta_j^* = 0 \quad \text{vs.} \quad \text{(alternative) }\mathcal{H}_{\alpha j}: \beta_j^* \neq 0. \]
Equation \eqref{eq:linear} is what is called the linear assumption between \(\mathbf{X}\) and \(\mathbf{y}\), a popular assumption in both the theory and practice of statistical inference.
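To make this setup concrete, here is a minimal simulation of such a dataset in Python; the dimensions, sparsity level and noise magnitude below are arbitrary illustrative choices, not values taken from [2] or [3].

```python
# Minimal sketch: simulate (X, y) from the linear model y = X beta* + sigma * eps,
# with a sparse vector of true signals.
import numpy as np

rng = np.random.default_rng(0)
n, p, n_signals, sigma = 100, 500, 10, 1.0   # high-dimensional regime: n << p

X = rng.standard_normal((n, p))              # design matrix, one row per observation
beta_star = np.zeros(p)
beta_star[:n_signals] = 1.0                  # true support: the first 10 variables
y = X @ beta_star + sigma * rng.standard_normal(n)
```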
Nowadays, with the increasing complexity of datasets, one has to deal with high-dimensional data more and more frequently. Formally, the high-dimensional regime is the setting where the number of observations \(n\) is (much) smaller than the number of variables \(p\). Doing inference in this regime is problematic. Besides the obvious computational issue (\(p\) can exceed a hundred thousand in genetics datasets, for example), the main statistical issue that makes conditional independence testing highly non-trivial is that the classical asymptotic normality results no longer hold [5]. In particular, the most popular forms of regularization, such as the Ridge or the Lasso [6] (with an added \(\ell_2\) or \(\ell_1\) regularization term, respectively), estimate \(\boldsymbol{\beta}^*\) quite effectively but leave no valid way to derive a test statistic with a tractable analytic distribution. Therefore, computing the corresponding p-value of each hypothesis is also not possible.
One of the recent advances to tackle this problem is the Debiased (or Desparsified) Lasso, introduced around the same period by two lines of work [8]. These studies point out that the solution \(\hat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda)\) of the Lasso program
\[\begin{equation} \label{eq:lasso} \hat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda) = \text{argmin}_{\boldsymbol{\beta} \in \mathbb{R}^{p}} \dfrac{1}{2} \left\lVert {\mathbf{y} - \mathbf{X}\boldsymbol{\beta}} \right\rVert_2^2 + \lambda\left\lVert {\boldsymbol{\beta}} \right\rVert_1, \end{equation}\]
is in fact biased, and they propose a debiasing step to correct it. The debiased estimator is asymptotically Gaussian, which implies that a valid p-value and a confidence interval can be derived for each variable.
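Schematically, and only for context (this is the standard form from the debiased-Lasso literature [8], not part of the CRT material reviewed here), the correction is a one-step update of the Lasso solution:

\[ \hat{\boldsymbol{\beta}}^{\text{debiased}} = \hat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda) + \dfrac{1}{n} \hat{\mathbf{M}} \mathbf{X}^{T}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda)\right), \]

where \(\hat{\mathbf{M}}\) is an approximate inverse of the empirical covariance \(\hat{\boldsymbol{\Sigma}} = \mathbf{X}^{T}\mathbf{X}/n\), typically estimated by node-wise Lasso regressions; under suitable sparsity conditions, \(\sqrt{n}\,(\hat{\beta}^{\text{debiased}}_j - \beta^*_j)\) is asymptotically Gaussian, which yields the p-values and confidence intervals mentioned above.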
As an alternative to this approach, [2] introduces a procedure called the Conditional Randomization Test (CRT) that can also output valid p-values in the high-dimensional regime. We review this procedure and its variant, the distilled CRT [3], in the next sections.
The conditional randomization test was first introduced in [2], although quite briefly, since the main purpose of that paper is in fact to present the knockoff filter, a False Discovery Rate (FDR) controlling procedure. The knockoff filter does not produce p-values by itself, which in turn is both an advantage (it does not need p-values for FDR control) and a disadvantage (the original knockoff can only do FDR control). As a result, CRT is mentioned as a close cousin of the knockoff and as one way to output valid empirical p-values. We will leave an elaborated discussion of the knockoff filter to a future post, and provide here an algorithm block for CRT.
Algorithm 1: Conditional Randomization Test
INPUT: dataset \(\mathcal{D} = \{\mathbf{x}_i, y_i \}^n_{i=1}\), a feature importance statistic \(T_j(\mathbf{X}, \mathbf{y})\) for testing whether \(\mathbf{x}_j\) and \(\mathbf{y}\) are conditionally independent, a number of samplings \(B\)
FOR \(j = 1, 2, \dots, p\) do
FOR \(b = 1, 2, \dots, B\) do
1. Sample a new column \(\mathbf{\tilde{x}}_j \sim \mathcal{P}(\mathbf{x}_j \mid \mathbf{X}_{-j})\)
2. Combine \(\mathbf{\tilde{x}}_j\) with the original matrix \(\mathbf{X}_{-j}\) to form the new matrix \(\mathbf{\tilde{X}}^{(b)} \in \mathbb{R}^{n \times p}\)
3. Compute feature importance \(T_j(\mathbf{\tilde{X}}^{(b)}, \mathbf{y})\)
Compute (one-sided) p-value
\[ p_j = \dfrac{1}{B + 1} \left\{1 + \sum^{B}_{b=1} \boldsymbol{1}_{T_j(\mathbf{\tilde{X}}^{(b)}, \mathbf{y}) \geq T_j(\mathbf{X}, \mathbf{y})} \right\} \]
OUTPUT: vector of p-values \(\boldsymbol{\pi} = \{p_j\}_{j=1}^p\)
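To make Algorithm 1 concrete, below is a minimal sketch in Python. Two choices in it are ours, not prescriptions of [2]: the conditional law \(\mathcal{P}(\mathbf{x}_j \mid \mathbf{X}_{-j})\) is approximated as Gaussian with a mean fitted by a Lasso regression of \(\mathbf{x}_j\) on \(\mathbf{X}_{-j}\), and the feature importance \(T_j\) is the absolute Lasso coefficient of variable \(j\).

```python
# A minimal sketch of Algorithm 1 (CRT). Assumed choices: P(x_j | X_{-j}) is
# approximated as Gaussian, with mean fitted by a Lasso regression of x_j on
# X_{-j}; the feature importance T_j is the absolute Lasso coefficient of
# variable j.
import numpy as np
from sklearn.linear_model import LassoCV

def importance_statistic(X, y, j):
    """T_j(X, y): absolute Lasso coefficient of variable j."""
    return np.abs(LassoCV(cv=5).fit(X, y).coef_[j])

def crt_pvalues(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pvals = np.empty(p)
    for j in range(p):
        X_minus_j = np.delete(X, j, axis=1)
        # Approximate the conditional law of x_j given X_{-j}.
        cond_model = LassoCV(cv=5).fit(X_minus_j, X[:, j])
        mu_j = cond_model.predict(X_minus_j)
        scale_j = np.std(X[:, j] - mu_j)
        t_obs = importance_statistic(X, y, j)
        t_null = np.empty(B)
        for b in range(B):
            # Steps 1-2: resample column j and rebuild the matrix.
            X_tilde = X.copy()
            X_tilde[:, j] = mu_j + scale_j * rng.standard_normal(n)
            # Step 3: recompute the feature importance on the resampled matrix.
            t_null[b] = importance_statistic(X_tilde, y, j)
        # Empirical one-sided p-value.
        pvals[j] = (1.0 + np.sum(t_null >= t_obs)) / (B + 1.0)
    return pvals
```

With the simulated data from the earlier snippet, `crt_pvalues(X, y, B=100)` would return one empirical p-value per variable; even this sketch makes the cost issue discussed next obvious, since each variable requires \(B + 1\) Lasso fits.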
One of the promises of CRT is its flexibility: the procedure can use different types of estimators to compute the feature importance, because it places no restriction on the relationship between the variables and the response, such as the linear assumption of Eq. \eqref{eq:linear}. Despite this, a major drawback of the original algorithm is its computational cost. Indeed, the algorithm block shows that CRT requires \(B\) samplings, and hence \(B\) refits, for each variable. Assuming we use the Lasso program of Eq. \eqref{eq:lasso} to compute the feature importance \(T_j(\mathbf{\tilde{X}}^{(b)}, \mathbf{y})\), the CRT runtime would be \(\mathcal{O}(Bp^4)\), since the Lasso has an optimization cost of \(\mathcal{O}(p^3)\) with the coordinate descent algorithm (see [9] pp. 93). Since \(B\) must be large enough to make the empirical distribution of the p-values smooth, this runtime quickly becomes prohibitive when \(p\) grows very large. The latter point is related to a slightly less problematic issue of CRT: it can only output empirical p-values, and a number of samplings \(B\) that is too small can lead to no detection of significant variables.
These are the main motivations for the very recent work of [3], which introduced the Distilled Conditional Randomization Test.
The Distilled CRT (dCRT), presented in [3], aims to fix the prohibitive computational cost of vanilla CRT. This is done through a process called distillation. A nice additional property of this procedure is that it can directly output an analytical p-value through a feature-importance statistic that follows a Gaussian law.
The key idea behind distillation is that the information in the remaining \(p-1\) variables, i.e. the data matrix \(\mathbf{X}_{-j}\), that is relevant to the response \(\mathbf{y}\) and to the variable \(\mathbf{x}_j\) is "distilled" out, leaving only the parts of \(\mathbf{y}\) and \(\mathbf{x}_j\) that cannot be explained by \(\mathbf{X}_{-j}\). Although [3] frames dCRT as a general framework for outputting p-values with a flexible choice of distillation operator, what is presented next is a variant called the Lasso Distillation CRT (Lasso-dCRT), the variant on which the authors focus most of their theoretical and empirical arguments. This also means we work under the linear relationship of Eq. \eqref{eq:linear}.
With the distillation operator introduced, performing dCRT is straightforward. For each variable \(j\), we form the distillations \(d_y\) and \(d_{x_j}\) by first solving the two Lasso problems
\[\begin{equation} \label{eq:beta-dy} \color{green}{\hat{\boldsymbol{\beta}}^{d_{y}}(\lambda)} = \text{argmin}_{\boldsymbol{\beta} \in \mathbb{R}^{p}} \dfrac{1}{2} \left\lVert {\mathbf{y} - \mathbf{X}_{-j}\boldsymbol{\beta}} \right\rVert_2^2 + \lambda\left\lVert {\boldsymbol{\beta}} \right\rVert_1, \end{equation}\]
and
\[\begin{equation} \label{eq:beta-dx} \color{blue}{\hat{\boldsymbol{\beta}}^{d_{x_j}}(\lambda)} = \text{argmin}_{\boldsymbol{\beta} \in \mathbb{R}^{p}} \dfrac{1}{2} \left\lVert {\mathbf{x}_j - \mathbf{X}_{-j}\boldsymbol{\beta}} \right\rVert_2^2 + \lambda\left\lVert {\boldsymbol{\beta}} \right\rVert_1. \end{equation}\]
Then the distillations \(d_y\) and \(d_{x_j}\) are defined to be the estimation residuals of these problems
\[\begin{equation} \label{eq:distill} \color{green}{d_y} = \mathbf{y} - \mathbf{X}_{-j}\color{green}{\hat{\boldsymbol{\beta}}^{d_{y}}} \quad \text{and} \quad \color{blue}{d_{x_j}} = \mathbf{x}_j - \mathbf{X}_{-j}\color{blue}{\hat{\boldsymbol{\beta}}^{d_{x_j}}}. \end{equation}\]
It follows that a test statistic can be calculated by the formula
\[\begin{equation} \label{eq:statistics} T(x_j, y, \color{blue}{d_{x_j}}, \color{green}{d_y}) = \left[\color{green}{d_y}^{T}\color{blue}{d_{x_j}}\right]^2, \end{equation}\]
and coupled with the fact that, under the null hypothesis and the linear model of Eq. \eqref{eq:linear}, \(\color{green}{d_y}^{T}\color{blue}{d_{x_j}} \sim \mathcal{N}(0, \sigma^2\left\lVert \color{blue}{d_{x_j}} \right\rVert^2)\) approximately (under the null, \(\color{green}{d_y}\) is essentially the noise \(\sigma\boldsymbol{\epsilon}\), while \(\color{blue}{d_{x_j}}\) is fixed given \(\mathbf{X}\)), we can output an analytical p-value for each variable \(j\)
\[\begin{equation} \label{eq:pval-crt} p_j = 2\left[1 - \Phi\left( \dfrac{\sqrt{T(x_j, y, \color{blue}{d_{x_j}}, \color{green}{d_y})}}{ \sigma\left\lVert \color{blue}{d_{x_j}} \right\rVert} \right) \right], \end{equation}\]
where \(\Phi(\cdot)\) is the standard normal cumulative distribution function (CDF).
In other words, Lasso-dCRT improves on vanilla CRT by removing the resampling step for each variable, which reduces the computational cost by a factor of \(B\). However, a careful reader will notice that the runtime is still a fourth-order polynomial in \(p\), i.e. \(\mathcal{O}(p^4)\). To deal with this, [3] proposes an initial screening step, under the assumption that all relevant variables are selected at this step, and sets the p-values of all unselected variables to 1. The resulting runtime is reduced to \(\mathcal{O}(kp^3)\), where \(k\) is the number of variables selected by the screening and is often much smaller than \(p\) in the sparse regime. More formally, the screening step returns an estimate of the support \(\hat{\mathcal{S}} := \{j \in [p]: \hat{\beta}_j \neq 0\}\), and \(k\) is the cardinality of this set.
A full algorithm block is then as follows.
Algorithm 2: Lasso-Distilled Conditional Randomization Test
INPUT: dataset \(\mathcal{D} = \{\mathbf{x}_i, y_i \}^n_{i=1}\), the noise magnitude \(\sigma\) (or an estimate of it)
Screening: solve the Lasso in Eq. \eqref{eq:lasso} and obtain the estimated support \(\hat{\mathcal{S}} := \{j \in [p]: \hat{\beta}_j \neq 0\}\)
FOR \(j \in \hat{\mathcal{S}}\) do
1. Obtain \(\color{green}{\hat{\boldsymbol{\beta}}^{d_{y}}(\lambda)}, \color{blue}{\hat{\boldsymbol{\beta}}^{d_{x_j}}(\lambda)}\) by solving Eq.\eqref{eq:beta-dy} and Eq.\eqref{eq:beta-dx}
2. Obtain \(\color{green}{d_y}, \color{blue}{d_{x_j}}\) following Eq.\eqref{eq:distill}
3. Compute feature importance \(T(x_j, y, \color{blue}{d_{x_j}}, \color{green}{d_y}) = \left[\color{green}{d_y}^{T}\color{blue}{d_{x_j}}\right]^2\)
4. Compute (two-sided) p-value
\[ p_j = 2\left[1 - \Phi\left( \dfrac{\sqrt{T(x_j, y, \color{blue}{d_{x_j}}, \color{green}{d_y})}}{ \sigma\left\lVert \color{blue}{d_{x_j}} \right\rVert} \right) \right] \]
SET \(p_j = 1\) for all \(j \in [p] \setminus \hat{\mathcal{S}}\)
OUTPUT: vector of p-values \(\boldsymbol{\pi} = \{p_j\}_{j=1}^p\)
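Below is a minimal sketch of Algorithm 2 in Python, under the linear model of Eq. \eqref{eq:linear}. It assumes the noise magnitude \(\sigma\) is known (in practice it would have to be estimated), and it chooses the regularization parameters by cross-validation; both are our choices rather than prescriptions of [3].

```python
# A minimal sketch of Algorithm 2 (Lasso-dCRT) under the linear model,
# assuming a known noise magnitude sigma; lambdas are chosen by cross-validation.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LassoCV

def lasso_dcrt_pvalues(X, y, sigma):
    n, p = X.shape
    pvals = np.ones(p)                       # unselected variables keep p_j = 1
    # Screening: estimated support of the Lasso fitted on the full design.
    support = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
    for j in support:
        X_minus_j = np.delete(X, j, axis=1)
        # Distillation: residuals of y and x_j after regressing out X_{-j}.
        d_y = y - LassoCV(cv=5).fit(X_minus_j, y).predict(X_minus_j)
        d_xj = X[:, j] - LassoCV(cv=5).fit(X_minus_j, X[:, j]).predict(X_minus_j)
        # Test statistic and its approximate null standard deviation,
        # treating the noise of y as the source of randomness.
        stat = d_y @ d_xj
        std = sigma * np.linalg.norm(d_xj)
        # Two-sided p-value from the Gaussian approximation.
        pvals[j] = 2.0 * (1.0 - norm.cdf(np.abs(stat) / std))
    return pvals
```

The screening Lasso is fitted once on the full design, while the two distillation Lassos are refitted only for the \(k\) selected variables, which is where the \(\mathcal{O}(kp^3)\) runtime comes from.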
The dCRT framework is promising mainly because it is quite flexible in the choice of loss function. In the distillation step, one need not be restricted to the \(\ell_2\) loss of the Lasso, but could be more creative and use the logistic loss with \(\ell_1\) regularization, or even a Random Forest for some non-linear measure, as the authors of [3] suggest in their work. Another open issue is the dependence between the p-values output by Lasso-dCRT in Eq. \eqref{eq:pval-crt}. The ultimate goal of producing p-values is to run a multiple testing procedure on them, and some methods, e.g. the Benjamini-Hochberg procedure for FDR control [10], require the p-values to be independent, or at least to exhibit positive dependence.
On the computational side, with a runtime complexity of \(\mathcal{O}(kp^3)\), Lasso-dCRT is still relatively slow compared to the Knockoff Filter, as shown in the benchmark of [3]. This likely means that the runtime of dCRT is also less impressive compared with the Debiased Lasso. Moreover, the benchmark of [3] consists only of numerical experiments and a very small-scale genetic dataset (\(n=1396, p=164\)), which leaves open the question of whether dCRT can scale to real-life datasets from genetics or brain imaging, where \(p\) can be as large as 100,000.
As a related note, [11] conjectures that the Debiased Lasso and the leave-one-out CRT (proposed in the same work) output similar confidence intervals, up to some degree, in the asymptotic regime under the linearity assumption. This is one of the more theoretical questions that remain open, along with a more rigorous analysis of the error-power tradeoff of CRT and dCRT.