Regularity Properties for Sparse Regression

Statistical and machine learning theory has developed several conditions ensuring that popular estimators such as the Lasso or the Dantzig selector perform well in high-dimensional sparse regression, including the restricted eigenvalue, compatibility, and $\ell_q$ sensitivity properties. However, some central aspects of these conditions are not well understood. For instance, it is unknown whether these conditions can be checked efficiently on any given data set. This is problematic, because they are at the core of the theory of sparse regression. Here we provide a rigorous proof that these conditions are NP-hard to check. Consequently, unless P = NP, they cannot be verified efficiently on arbitrary data sets, which raises questions about their practical application. However, by taking an average-case perspective instead of the worst-case view of NP-hardness, we show that a particular condition, $\ell_q$ sensitivity, has certain desirable properties. This condition is weaker and more general than the others. We show that it holds with high probability in models where the parent population is well behaved, and that it is robust to certain data processing steps. These results are desirable, as they provide guidance about when the condition, and more generally the theory of sparse regression, may be relevant in the analysis of high-dimensional correlated observational data.


Introduction
The analysis of high-dimensional data is a central topic of statistics, motivated by advances in science, technology and engineering. Recent research has revealed that estimation in high dimensions may be possible if the problems are suitably sparse. As a typical example, consider linear regression where most of the coefficients of the parameter vector vanish. In this setting, popular estimators include the Lasso (Chen, Donoho and Saunders, 2001; Tibshirani, 1996), folded concave penalized least-squares such as SCAD (Fan and Li, 2001), and the Dantzig selector (Candès and Tao, 2007). Sparsity has been exploited in a number of other problems, for instance instrumental variables regression in the presence of endogeneity (Gautier and Tsybakov, 2011).
The Lasso and Dantzig selector have small estimation error as long as the matrix of covariates obeys one of a variety of conditions. The incoherence condition of Donoho and Huo (2001) provides the earliest and simplest example. Later Candès and Tao (2005) introduced the restricted isometry property and showed its application to the Dantzig selector (Candès and Tao, 2007). In subsequent work Bickel, Ritov and Tsybakov (2009) analyzed the estimators under the weaker and more general restricted eigenvalue (RE) condition. The compatibility conditions of van de Geer (2007) are closely related. See van de Geer and Bühlmann (2009) for the relationship between these properties. Gautier and Tsybakov (2011) have recently introduced an estimator for instrumental variables regression, along with the ℓ q sensitivity properties that guarantee small estimation error. ℓ q sensitivity is the weakest and most general of the above properties, and also applies to linear regression. It is closely related to the cone invertibility factors of Ye and Zhang (2010).
We investigate in depth the conditions on design matrices needed for high-dimensional sparse estimation. We first deal with the computational complexity of checking the properties on general design matrices. The locations of the non-vanishing coefficients of the regression parameter are unknown, so we must make a non-degeneracy assumption uniformly over all subsets of a given size. This suggests that the conditions may be hard to check. We confirm this by showing that checking any of the restricted eigenvalue, compatibility, and ℓ q sensitivity properties for general data matrices is NP-hard. This implies that there is no efficient way to check them, under the widely believed conjecture that P ≠ NP. Our result builds on the recent proof that computing the spark and checking the restricted isometry property are NP-hard (Bandeira et al., 2013; Tillmann and Pfetsch, 2012).
Verifying the needed matrix properties is an important problem, recognized in a number of places in the literature. Tao (2007); Raskutti, Wainwright and Yu (2010); d'Aspremont and El Ghaoui (2011) discuss it as a problem of interest. Verification leads to guarantees that the inference procedure was successful. From a statistical point of view, the numerical values of the regularity constants yield confidence sets for the regression parameter. The difficulty of their computation has already motivated several lines of research. For instance, convex relaxations have been proposed for approximating the restricted isometry constant (d'Aspremont, Bach and Ghaoui, 2008; Lee and Bresler, 2008), and linear relaxations for the ℓ q sensitivity (Gautier and Tsybakov, 2011).
Incoherence conditions, in contrast to the other properties, are easy to check in polynomial time. However, they do not yield the optimal rate of convergence: they require the sample size n to be of quadratic order s^2 in the sparsity (Bunea, 2007; Bandeira et al., 2012), while the other conditions allow n of linear order in s, up to logarithmic factors (see e.g. Candès and Tao, 2005; Raskutti, Wainwright and Yu, 2010).
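Indeed, the mutual coherence of a design matrix can be computed from a single Gram matrix. The following Python sketch is our own illustration (it normalizes columns to unit length, a cosmetic difference from the √n normalization used later in the paper) and runs in O(np^2) time.

import numpy as np

def mutual_coherence(X):
    # Largest absolute inner product between distinct columns, after normalizing
    # each column to unit Euclidean length.
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    G = Xn.T @ Xn
    np.fill_diagonal(G, 0.0)
    return float(np.max(np.abs(G)))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))
print(mutual_coherence(X))  # small for iid Gaussian designs, roughly sqrt(log(p)/n)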
As a way to address the problem of non-verifiability, we show that our conditions hold with high probability if the covariate matrix is randomly sampled from a suitably well-behaved distribution. This extends the well-understood results for matrices with independent entries to correlated observation vectors, generated independently from a regular population. Previous results of this type have been obtained for RIP (e.g. Rauhut, Schnass and Vandergheynst, 2008; Vershynin, 2010) and RE (Raskutti, Wainwright and Yu, 2010; Rudelson and Zhou, 2012). We establish new results for the more general ℓ q sensitivity, under three probability models: observations that are (1) sub-gaussian, (2) bounded, or (3) have bounded moments. These results are useful if a population covariance model is known and easier to analyze. This is often the case, as illustrated by our examples and those in van de Geer and Bühlmann (2009); Raskutti, Wainwright and Yu (2010).
Finally, we show that the ℓ q sensitivity property is preserved under several natural operations on the data matrix. Although it is hard to ascertain whether this crucial property holds in the first place, once it does hold, a range of transformations can be applied to the data while preserving it.
In Section 2, we give definitions and the setup of our problem. In Section 3 we present our results, which are proven in Section 5. We finish with discussion in Section 4.

Definitions and Setup
We start with some basic notation, and then introduce the problems and notions we study: regression, associated estimators, regularity properties, sub-gaussian variables, and computational complexity.

Some notation
We denote by |v|_q the vector ℓ q norm. An s-sparse vector has at most s non-vanishing coordinates. For a set S ⊂ {1, . . . , p} we denote by |S| its cardinality and by S^c its complement. For a vector v = (v_1, . . . , v_p)^T and a subset S ⊂ {1, . . . , p}, we write v_S for the vector that agrees with v on the coordinates in S and is zero elsewhere. We denote by ‖M‖_max the maximum absolute value of the entries of a matrix M. For two sequences a_n and b_n of scalars, a_n = O(b_n) means that there is a constant c > 0 such that a_n ≤ c b_n for all sufficiently large n; a_n ≍ b_n means that a_n = O(b_n) and b_n = O(a_n). For random variables X_n, we write X_n = O_P(1) if the collection X_n is bounded in probability, sometimes called uniformly tight. For two sequences of random variables X_n, Y_n, the notation X_n = O_P(Y_n) means that there is a sequence of random variables R_n = O_P(1) such that X_n = R_n Y_n.

Regression problems and estimators
In linear regression we want to explain a response variable y as a linear function of p covariates x_1, . . . , x_p, up to a noise term ε, via the model y = Σ_{i=1}^p x_i β_i + ε. To estimate β, we observe n independent samples: the n × 1 response vector Y and the covariate vectors X_1, X_2, . . . , X_p of dimension n, forming the columns of an n × p matrix X. Hence, with a noise vector ε having independent N(0, σ^2) entries, we have the model Y = Xβ + ε. We wish to estimate the p-dimensional parameter vector β in the case n ≪ p.
We assume that most of the coordinates of β are vanishing, and that the design matrix X is regular, as specified in the next section. The locations of the nonzero coordinates are unknown to us. In this setting the Lasso, or ℓ 1 -penalized least squares, is a popular estimator (Tibshirani, 1996; Chen, Donoho and Saunders, 2001): β̂^L = argmin_{β̃ ∈ R^p} { (1/(2n)) |Y − Xβ̃|_2^2 + λ|β̃|_1 }, for a given regularization parameter λ > 0. The Dantzig selector is another estimator for this problem, which for a known noise level σ takes the form (Candès and Tao, 2007): β̂^D ∈ argmin { |β̃|_1 : |(1/n) X^T(Y − Xβ̃)|_∞ ≤ Aσ √(2 log(p)/n) }, where A is a tuning parameter.
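To make the two estimators concrete, here is a small self-contained Python sketch; it is our own illustration (the problem sizes, the tuning value lam, and the linear-programming reformulation of the Dantzig constraint are choices made for the example, not prescriptions from the paper). The Lasso is fit with scikit-learn, whose objective uses exactly the (1/(2n)) scaling above, and the Dantzig selector is solved as a linear program in the positive and negative parts of β.

import numpy as np
from scipy.optimize import linprog
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s, sigma = 100, 300, 5, 0.5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.0                                   # s-sparse true parameter
Y = X @ beta + sigma * rng.standard_normal(n)

# Lasso: argmin (1/(2n)) |Y - X b|_2^2 + lam |b|_1
lam = sigma * np.sqrt(2 * np.log(p) / n)
beta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, Y).coef_

# Dantzig selector: min |b|_1 subject to |(1/n) X^T (Y - X b)|_inf <= lam,
# as an LP in (b_plus, b_minus), both componentwise nonnegative, b = b_plus - b_minus.
A = X.T @ X / n
r = X.T @ Y / n
G = np.vstack([np.hstack([A, -A]), np.hstack([-A, A])])
h = np.concatenate([lam + r, lam - r])
res = linprog(np.ones(2 * p), A_ub=G, b_ub=h,
              bounds=[(0, None)] * (2 * p), method="highs")
beta_dantzig = res.x[:p] - res.x[p:]
print(np.linalg.norm(beta_lasso - beta, 1), np.linalg.norm(beta_dantzig - beta, 1))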
In instrumental variables regression we also start with the model y = Σ_{i=1}^p x_i β_i + ε. Now some x_i may be correlated with the noise, in which case they are called endogenous. Further, we have additional variables z_i, i = 1, . . . , L, called instruments, that are uncorrelated with the noise. In addition to X, we observe n independent samples of the z_i, which are arranged in the n × L matrix Z. In this setting, Gautier and Tsybakov (2011) propose the Self-Tuning Instrumental Variables (STIV) estimator, a generalization of the Dantzig selector. In the case where the noise level σ is known, STIV minimizes a weighted ℓ 1 criterion over a polytope of coefficient vectors whose instrument-moment residuals (1/n) Z^T(Y − Xβ̃) are small in the ℓ ∞ norm. Here D_X and D_Z are the diagonal normalization matrices with (D_X^{-1})_{ii} = max_{k=1,...,n} |x_{ki}| and (D_Z^{-1})_{ii} = max_{k=1,...,n} |z_{ki}|.

Regularity properties
The cone (rather, union of cones) C(s, α) is the set of vectors whose ℓ 1 norm is concentrated on some s coordinates: C(s, α) = {v ∈ R^p : |v_{S^c}|_1 ≤ α|v_S|_1 for some S ⊂ {1, . . . , p} with |S| ≤ s}. The regularity properties discussed below depend on a triplet of parameters (s, α, γ): in all cases s is the sparsity size of the problem, α is the cone opening parameter in C(s, α), and γ is the lower bound; all three are positive numbers. The first matrix property is the Restricted Eigenvalue (RE) condition from Bickel, Ritov and Tsybakov (2009); Koltchinskii (2009): the matrix X satisfies RE(s, α, γ) if |Xv|_2 ≥ γ|v_S|_2 for all sets S of size at most s and all v with |v_{S^c}|_1 ≤ α|v_S|_1. Bickel, Ritov and Tsybakov (2009) show that if the normalized data matrix (1/√n)X obeys RE(s, α, γ) and β is s-sparse, then the estimation error is small for both the Dantzig selector and the Lasso. The 'cone opening' α required in the restricted eigenvalue property equals 1 for the Dantzig selector and 3 for the Lasso. Next, we describe the compatibility condition from van de Geer (2007).
The compatibility condition with parameters (s, α, γ) requires instead that √s |Xv|_2 ≥ γ|v_S|_1 over the same sets S and vectors v. The two conditions are very similar; the only difference is the ℓ 1 versus the ℓ 2 norm of v_S in the denominator (together with the accompanying √s factor). The inequality |v_S|_1 ≤ √s |v_S|_2 immediately implies that the compatibility conditions are formally weaker than the RE assumptions. van de Geer (2007) provides an ℓ 1 oracle inequality for the Lasso under the compatibility condition. See also van de Geer and Bühlmann (2009); Bühlmann and van de Geer (2011). The third and last assumption analyzed in this paper is the ℓ q sensitivity property from Gautier and Tsybakov (2011).
Definition 2.3. Let q ≥ 1. The n × p matrix X and the n × L matrix Z satisfy the ℓ q sensitivity property with parameters (s, α, γ) if, for all v ∈ C(s, α), s^{1/q} |(1/n) Z^T X v|_∞ ≥ γ |v|_q.
Gautier and Tsybakov (2011) show that ℓ q sensitivity is weaker than the restricted eigenvalue and compatibility conditions. In the case Z = X the definition reduces to the cone invertibility factors of Ye and Zhang (2010). We note that the definition in Gautier and Tsybakov (2011) differs in normalization. We do not normalize, for simplicity, to avoid the dependencies introduced by this process. As shown in Theorem 2.4, our definition works for an un-normalized version of the STIV estimator. The argument is classical, but more general than Candès and Tao (2007) due to the use of instruments and ℓ q sensitivity.
Theorem 2.4. Assume that z_j, j = 1, . . . , L, and ε are mean zero sub-gaussian variables with sub-gaussian norm at most σ, that β is s-sparse, and that X, Z obey the ℓ q sensitivity property with parameters (s, 1, γ). Then, with n independent samples of data, taking λ = A √(2 log(L)/n), the un-normalized STIV estimator (2) satisfies an ℓ q error bound of order s^{1/q} λ / γ, up to constant factors depending on σ, with high probability.
Finally, we introduce the incoherence condition and restricted isometry property, which serve as contrasts with the above conditions. For an n × p matrix X whose columns {X_j}_{j=1}^p are normalized to length √n, the mutual incoherence condition holds if |X_i^T X_j|/n ≤ γ/s for all i ≠ j and some positive γ. Such a notion was defined in Donoho and Huo (2001), and later used by Bunea (2007) to derive oracle inequalities for the Lasso.
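The restricted eigenvalue, compatibility, and ℓ q sensitivity properties cannot be certified efficiently in general (Theorem 3.1 below), but violations can be searched for numerically. The following Python sketch is our own heuristic illustration for q = 1 and Z = X (the cone-sampling scheme is an arbitrary choice): it randomly probes C(s, α), and finding a violating vector disproves the property, while finding none is not a certificate.

import numpy as np

def find_l1_sensitivity_violation(Psi, s, alpha, gamma, trials=20000, seed=0):
    # Search for v in C(s, alpha) with s * |Psi v|_inf < gamma * |v|_1.
    rng = np.random.default_rng(seed)
    L, p = Psi.shape
    for _ in range(trials):
        S = rng.choice(p, size=s, replace=False)
        v = np.zeros(p)
        v[S] = rng.standard_normal(s)                       # dominant coordinates
        rest = np.setdiff1d(np.arange(p), S)
        w = rng.standard_normal(p - s)
        w *= alpha * np.abs(v[S]).sum() * rng.uniform() / np.abs(w).sum()
        v[rest] = w                                         # ensures |v_{S^c}|_1 <= alpha |v_S|_1
        if s * np.max(np.abs(Psi @ v)) < gamma * np.abs(v).sum():
            return v
    return None

Psi = np.eye(6)
print(find_l1_sensitivity_violation(Psi, s=2, alpha=1.0, gamma=0.5) is None)      # holds with gamma = 1/(1+alpha)
print(find_l1_sensitivity_violation(Psi, s=2, alpha=1.0, gamma=1.0) is not None)  # a violation is found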

Sub-gaussian vectors
The L^p norm of a random variable X is ‖X‖_p = (E|X|^p)^{1/p}. A random variable X satisfying sup_{p≥1} p^{-1/2} ‖X‖_p < ∞ is called sub-gaussian, and its sub-gaussian norm is defined as ‖X‖_{ψ2} = sup_{p≥1} p^{-1/2} ‖X‖_p (Vershynin, 2010). A random vector X is sub-gaussian if all of its one-dimensional marginals are sub-gaussian. The sub-gaussian norm of a p-dimensional random vector X is then defined as ‖X‖_{ψ2} = sup_{x ∈ S^{p-1}} ‖⟨X, x⟩‖_{ψ2}, where S^{p-1} is the Euclidean unit sphere in R^p.

Notions from computational complexity
In complexity theory, problems are classified according to the computational resources - time and memory - needed to solve them on a Turing machine, a model for the computer (Arora and Barak, 2009). A well-known example of a complexity class is P, consisting of the problems decidable in polynomial time in the size of the input. For input encoded in n bits, a yes or no answer must be found in time O(n^k) for some fixed k. Another important class is NP, the decision problems for which already existing solutions can be verified in polynomial time. Verifying a solution is usually much easier than solving the problem itself in polynomial time. For instance, the subset-sum problem, 'Given a set of integers, does there exist a non-empty subset with zero sum?', is in NP, since one can easily check any purported solution - a subset of the given integers - to see if it indeed solves the problem. However, finding such a subset seems harder: simply enumerating all subsets is not a polynomial-time algorithm.
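A small Python sketch, our own example rather than anything from the paper, makes the asymmetry concrete: checking a purported zero-sum subset is immediate, while the naive solver enumerates exponentially many subsets.

from itertools import combinations

def verify(numbers, subset):
    # Certificate check: linear in the size of the input.
    return len(subset) > 0 and all(x in numbers for x in subset) and sum(subset) == 0

def solve(numbers):
    # Brute force: up to 2^n subsets, so exponential time in general.
    for r in range(1, len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == 0:
                return list(subset)
    return None

nums = [3, 34, -4, 12, 5, -1, 98, -2]
certificate = solve(nums)            # e.g. [3, -1, -2]
print(certificate, verify(nums, certificate))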
Formally, the definition of NP requires that if the answer is yes, then there exists an easily verifiable proof. We have P ⊂ NP, since a polynomial-time solution itself serves as a certificate verifiable in polynomial time. However, it is a famous open problem to decide whether P equals NP (Cook, 2000). It is widely believed in the complexity community that P ≠ NP.
To compare the computational hardness of various problems, one can reduce known hard problems to the novel questions of interest, thereby demonstrating the difficulty of the novel problems. Specifically, a problem A is polynomial-time reducible to a problem B, if an oracle solving B -that is an immediate solver for an instance of B -can be queried once to give a polynomial-time algorithm to solve A. This is also variously known as a polynomial-time many-one reduction, strong reduction or Karp reduction. A problem is NP-hard if every problem in NP reduces to it, namely it is at least as difficult as all other problems in NP. If one reduces a known NP-hard problem to a new question, this demonstrates the NP-hardness of the new problem.

Computational Complexity
We first show that the common conditions needed for sparse estimation are unfortunately NP-hard to verify. This builds on the recent results that computing the spark and checking restricted isometry are NP-hard (Bandeira et al., 2013; Tillmann and Pfetsch, 2012).
Theorem 3.1. Let X be an n × p matrix, Z an n × L matrix, 0 < s < n, and α, γ > 0. It is NP-hard to decide any of the following problems:
1. Does X obey the restricted eigenvalue condition with parameters (s, α, γ)?
2. Does X satisfy the compatibility conditions with parameters (s, α, γ)?
3. Do X, Z obey the ℓ q sensitivity property with parameters (s, α, γ)?
The proof of Theorem 3.1 is found in Section 5.2. The theorem implies that there is no efficient way to check if a matrix is regular, provided P ≠ NP.
Conditions like restricted isometry and restricted eigenvalue are central to both high-dimensional statistics and compressed sensing, but our result matters more for statistics. In signal processing and compressed sensing, one often has a choice of a suitable random design matrix, for instance with iid normal entries. Various random matrix ensembles appropriate for signal processing applications are regular with high probability, obeying even the restricted isometry property (Candès and Tao, 2005). Thus there may not be an urgent need for verification.
In statistical applications, however, the data matrix is often observational and correlated. The correlation between predictors is in many cases unknown, and may be substantial. It can be hard to judge if the matrix is regular. Therefore checking regularity conditions is a more important issue for statistics than for signal processing.

ℓ q sensitivity for correlated designs
Due to the hardness of verifying regularity conditions, it is of paramount importance to provide sufficient conditions for ℓ q sensitivity to hold for random matrices sampled from a high-dimensional correlated random vector. To this end, we first define a population version of ℓ q sensitivity. Let X and Z be p- and L-dimensional zero-mean random vectors, and denote by Ψ = E[ZX^T] the L × p matrix of population covariances. We say that Ψ satisfies the (population) ℓ q sensitivity property with parameters (s, α, γ) if s^{1/q} |Ψv|_∞ ≥ γ|v|_q for all v ∈ C(s, α). Note that when Z = X, Ψ is the covariance matrix of X. In particular, when Ψ = I_p, as in many designs of compressed sensing, it possesses ℓ q sensitivity. See also Example 3.7 below and its proof.
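To see why the identity covariance possesses the property, a short worked bound suffices; this is our own calculation, and the constant it yields need not coincide with the one in the proof of Example 3.7. For v ∈ C(s, α) with dominant set S,

|v|_q^q = Σ_i |v_i|^q ≤ |v|_∞^{q−1} |v|_1 ≤ (1 + α) |v|_∞^{q−1} |v_S|_1 ≤ (1 + α) s |v|_∞^q,

so that s^{1/q} |I_p v|_∞ = s^{1/q} |v|_∞ ≥ (1 + α)^{−1/q} |v|_q; that is, the population ℓ q sensitivity holds with γ = (1 + α)^{−1/q}.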
Population ℓ q sensitivity corresponds to the sample version with n = ∞. It is a necessary and natural condition to impose. Together with tail conditions it is sufficient to guarantee the regularity condition of random matrices sampled from such a population. This is indeed shown in the following theorem, in three different models: sub-gaussian vectors, bounded coordinates, and finite moments.
Theorem 3.3. Let X and Z be zero-mean random vectors such that the matrix of population covariances Ψ satisfies the ℓ q sensitivity property, q ≥ 1, with parameters (s, α, γ). Let a > 0 be fixed. Given n iid samples and any δ > 0, the matrix Ψ̂ = (1/n) Z^T X obeys ℓ q sensitivity with parameters (s, α, γ − δ) with high probability under each of the following settings:
1. If X and Z are sub-gaussian with fixed constants, then sample ℓ q sensitivity holds with probability at least 1 − (2pL)^{-a}, provided that the sample size is at least n ≥ c s^2 log(2pL).
2. If the entries of the vectors are bounded by fixed constants, the property also holds with probability at least 1 − (2pL)^{-a}, whenever n ≥ c s^2 log(2pL).
3. If the entries have bounded moments, E|X_i|^{4r} ≤ C and E|Z_j|^{4r} ≤ C for some positive integer r and all i, j, then ℓ q sensitivity holds with probability at least 1 − 1/n^a, assuming the sample size satisfies n^{1−a/r} ≥ c s^2 (pL)^{1/r}.
The constant c does not depend on n, L, p and s, only on the other parameters of each case. It is given explicitly in the proofs in Section 5.3. The statements require n ≍ s 2 within a logarithmic order for the first two cases, and it would be interesting to know if the rate can be improved. Further, note that bounded random vectors are formally also sub-gaussian, but the sub-gaussian norm scales as √ p. We get better results for bounded vectors if we treat them directly.
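The mechanism behind the sub-gaussian case can be checked in a short simulation; the sketch below is our own illustration (Gaussian rows with an autoregressive-type covariance, and Z = X) of how the entrywise error ‖Ψ̂ − Ψ‖_max decays at the rate √(log(pL)/n) that drives the s^2 log(pL) sample-size requirement.

import numpy as np

rng = np.random.default_rng(1)
p = 50
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # population covariance
C = np.linalg.cholesky(Sigma)

for n in [200, 800, 3200, 12800]:
    X = rng.standard_normal((n, p)) @ C.T      # n iid rows with covariance Sigma; here Z = X
    Psi_hat = X.T @ X / n
    err = np.max(np.abs(Psi_hat - Sigma))
    print(n, round(err, 4), round(err * np.sqrt(n / np.log(p)), 3))  # last column stays roughly constant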
Related results have been obtained for the restricted isometry property (Rauhut, Schnass and Vandergheynst, 2008; Rudelson and Zhou, 2012) and the restricted eigenvalue condition (Raskutti, Wainwright and Yu, 2010; Rudelson and Zhou, 2012). We investigate the ℓ q sensitivity property since it is weaker and more general, and also applies to instrumental variables regression.
Theorem 3.3 can be extended to the case of mixture distributions with different tail properties, i.e. (X, Z) is sampled from population P_1 with probability p_1 and from P_2 with probability 1 − p_1. We show this in the simplest case, a mixture of bounded and sub-gaussian random vectors. For k = 1, 2, let Ψ_k = E_k[ZX^T] denote the matrix of covariances of X and Z under population P_k.
Theorem 3.4. Suppose the distribution of the random vectors X, Z is a mixture of a sub-gaussian distribution P_1 and a coordinate-wise bounded distribution P_2, with fixed mixture probability. Suppose further that either of the two covariance matrices Ψ_1 or Ψ_2 obeys ℓ q sensitivity with lower bound γ, and that ‖Ψ_1 − Ψ_2‖_max ≤ δ/s. Then for each ν > 0, the matrix of sample covariances of n independent samples of (X, Z) obeys ℓ q sensitivity with sparsity size s and lower bound γ − (δ + ν)(1 + α), with probability 1 − 4(2Lp)^{-ρ}, if n ≥ c s^2 log(2pL), for some constants ρ, c.
Again, ρ and c are constants that do not depend on n, L, p, s. We prove Theorem 3.4 in Section 5.4. From the proof, one can see that the condition on Ψ-matrices can be relaxed to the ℓ q sensitivity of the matrix p 1 Ψ 1 + (1 − p 1 )Ψ 2 , where p 1 is the probability of getting the sample from P 1 .
In addition to the uncorrelated covariance matrices that satisfy the ℓ q sensitivity (See Example 3.7 below and its proof), we introduce a more general class of covariance matrices that possess such a property.
Definition 3.5. The L × p matrix Ψ is s-comprehensive if for every subset S ⊂ {1, . . . , p} of size s, and for each pattern of signs ε ∈ {−1, 1}^S, there exists either a row w of Ψ such that sgn(w_i) = ε_i for i ∈ S and w_i = 0 otherwise, or a row with sgn(w_i) = −ε_i for i ∈ S and w_i = 0 otherwise.
Note that when L = p, diagonal matrices are 1-comprehensive. However, when L ≠ p, none of the other conditions are applicable; this illustrates that ℓ q sensitivity is the most general property. By simple counting, an s-comprehensive matrix must have L ≥ 2^{s−1} (p choose s) rows. We show that an s-comprehensive covariance matrix obeys the ℓ 1 sensitivity property.
Theorem 3.6. Suppose the L×p matrix of covariances Ψ is s-comprehensive, and that all non-vanishing entries in Ψ have absolute value at least c > 0. Then Ψ obeys the ℓ 1 sensitivity property with parameters s, α and γ = sc/(1 + α).
The proof of Theorem 3.6 is found in Section 5.5. The theorem shows that, for a given lower bound γ, the larger the value of s (and hence of L), the smaller the required magnitude c of the non-vanishing entries. It presents an interesting tradeoff between the number of instruments L and the strength of the non-vanishing components of Ψ.
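An s-comprehensive matrix is easy to construct explicitly, which also shows that the counting bound L ≥ 2^{s−1} (p choose s) is attained. The Python sketch below is our own construction with nonzero entries of magnitude c.

import numpy as np
from itertools import combinations, product

def s_comprehensive(p, s, c=1.0):
    # One row per support of size s and sign pattern, keeping one of each +/- pair.
    rows = []
    for S in combinations(range(p), s):
        for signs in product([-1.0, 1.0], repeat=s):
            if signs[0] < 0:
                continue                      # the negated pattern is covered by its partner row
            w = np.zeros(p)
            w[list(S)] = c * np.array(signs)
            rows.append(w)
    return np.array(rows)

Psi = s_comprehensive(p=5, s=2)
print(Psi.shape)   # (2**(s-1) * binom(p, s), p) = (20, 5)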
Finally, we give several examples to demonstrate that the ℓ q sensitivity is indeed weaker than other regularity conditions. The technical proofs of the results in Examples 3.7 and 3.9 can be found in Section 5.6.
Example 3.7. If Σ is a diagonal matrix with entries d_1, d_2, . . . , d_p, then the restricted isometry property holds if 1 − δ ≤ d_i ≤ 1 + δ for all i. Restricted eigenvalue only requires d_i ≥ γ, and the same condition is required for compatibility. This explains why restricted isometry is the most stringent property. Further, ℓ 1 sensitivity holds even if a finite number of the d_i go to zero at rate 1/s (shown in Section 5.6); in this latter case, all other regularity conditions fail. This example shows that ℓ q sensitivity is much weaker than the other regularity conditions.
The next examples further delineate between the various properties.
Example 3.9. If Σ has diagonal entries equal to 1, σ 12 = σ 21 = ρ, and all other entries are equal to zero, then compatibility and ℓ 1 sensitivity hold as long as 1 − ρ ≍ 1/s (proven in Section 5.6). In such a case, however, the restricted eigenvalues are of order 1/s. This is an example where compatibility and ℓ 1 sensitivity hold but the restricted eigenvalue condition fails.

Operations preserving regularity
While it is difficult to check that a covariate matrix is regular, this property is preserved under natural operations that do not change the covariance structure by much. We show this for ℓ q sensitivity, in analogy to results on the restricted isometry property (e.g. Bandeira et al., 2012, and references therein).
We provide two theorems, both proven in Section 5.7. First we have a theorem about linear transformations of the data matrix that preserve regularity. Let X and Z be covariate matrices as in the rest of the paper.
Theorem 3.10. 1. Perform an orthogonal transformation M on each covariate: let X′ = MX, Z′ = MZ. Then (X′, Z′) obeys the same ℓ q sensitivity properties as (X, Z). 2. Let M be a cone-preserving linear transformation R^p → R^p, such that for all v ∈ C(s, α) we have Mv ∈ C(s′, α′), and let X′ = XM. Suppose further that |Mv|_q ≥ c|v|_q for all v in C(s, α). If (X, Z) obeys the ℓ q sensitivity property with parameters (s′, α′, γ), then (X′, Z) has ℓ q sensitivity with parameters (s, α, cγ).
3. Let M be a linear transformation R^L → R^L such that |Mv|_∞ ≥ c|v|_∞ for all v. If we transform Z′ = ZM, and (X, Z) obeys the ℓ q sensitivity property with lower bound γ, then (X, Z′) obeys the same property with lower bound cγ.
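The first claim of Theorem 3.10 is easy to verify numerically, since an orthogonal row transformation leaves Z^T X, and hence every quantity entering the ℓ q sensitivity property, exactly unchanged. The short Python check below is our own illustration.

import numpy as np

rng = np.random.default_rng(2)
n, p, L = 40, 10, 15
X = rng.standard_normal((n, p))
Z = rng.standard_normal((n, L))
M, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix
X2, Z2 = M @ X, M @ Z
print(np.allclose(Z.T @ X, Z2.T @ X2))             # True: Z2^T X2 = Z^T M^T M X = Z^T X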
Our second result concerns additive perturbations of the covariance matrix that preserve regularity. We use the induced matrix norms |M|_{a,b} = sup_v |Mv|_b / |v|_a. In particular, note that |M|_{1,1} is the maximum ℓ 1 column sum and |M|_{1,∞} is the maximum absolute entry, denoted by ‖M‖_max elsewhere in the paper. Theorem 3.11, proven in Section 5.7, states that if a matrix Σ obeys the ℓ q sensitivity property with parameters (s, α, γ) and the perturbation ∆ satisfies |∆|_{q,∞} ≤ δ/s^{1/q}, then Σ + ∆ obeys the ℓ q sensitivity property with parameters (s, α, γ − δ).

Discussion
This paper presented an in-depth study of the matrix properties required for high-dimensional sparse estimation. We considered the restricted eigenvalue and compatibility properties, and the more general ℓ q sensitivity condition, also applicable to instrumental variables regression. First we showed that they are unfortunately NP-hard to check. The results are important because in statistical applications the data is typically observational, and one cannot rely on the known regularity of iid random matrices.
For problems where a model of the covariance matrix is available, we have formulated high probability sufficient conditions for ℓ q sensitivity. Finally we have established that several natural matrix operations preserve the ℓ q sensitivity.
Our work raises further questions about the interplay of estimation and computation, specifically for sparse regression models. It would be interesting to study if there are statistically efficient estimators relying on computationally verifiable conditions. Specifically, can one devise an estimation method for sparse linear regression with mean squared error of minimax optimal order s log(p)/n, relying on a condition that is also efficiently verifiable? The current theory falls short: incoherence requires n ≍ s 2 log(p) samples to hold, and restricted eigenvalues are NP-hard to check. This is an important research area, as illustrated by the recent work Chandrasekaran and Jordan (2013).
Finally, accurate and efficiently computable approximations to the values of the regularity constants could provide efficient confidence intervals. Previous work on this problem has relied on convex relaxations, unfortunately leading to confidence intervals that are wider by a factor s than those theoretically possible (d'Aspremont, Bach and Ghaoui, 2008; Lee and Bresler, 2008; Gautier and Tsybakov, 2011). Some recent progress involves significance testing for adaptive linear models (Lockhart et al., 2013). Improvements in this direction would be of significant theoretical and practical value.

Proof of Theorem 2.4
Proof. By a classical argument (e.g. Candès and Tao, 2007), the constraint defining the polytope I is satisfied by the true parameter with high probability; that is, β ∈ I. From now on, assume that this event holds. Then, as β̂ minimizes the ℓ 1 norm over I, we have |β̂|_1 ≤ |β|_1, and hence |δ_{S^c}|_1 ≤ |δ_S|_1 with δ = β̂ − β. Hence δ is in the cone C(s, 1). Further, by the triangle inequality, |(1/n) Z^T X δ|_∞ is at most twice the threshold defining the polytope. Therefore, using the ℓ q sensitivity property, |δ|_q ≤ s^{1/q} |(1/n) Z^T X δ|_∞ / γ, which yields the desired claim.

Proof of Theorem 3.1
The spark of a matrix X, denoted spark(X), is the smallest number of linearly dependent columns. The proof of our complexity result, Theorem 3.1, consists of a polynomial-time reduction from the NP-hard problem of computing the spark of a matrix (see Bandeira et al. (2013); Tillmann and Pfetsch (2012) and references therein).
Lemma 5.1. Given an n × p matrix with integer entries X, and a sparsity size 0 < s < p, it is NP-hard to decide if the spark of X is at most s.
We also need the following technical lemma, which provides bounds on the singular values of matrices with bounded integer entries. For a matrix X, we denote by ‖X‖_2 or ‖X‖ its operator norm. Furthermore, we denote by X_S the submatrix of X obtained by taking the columns with indices in S.
Lemma 5.2. Let X be an n × p matrix with integer entries, and let M = max_{i,j} |X_{ij}|. Then ‖X‖_2 ≤ √(np) M. Further, let 0 < s < n. If spark(X) > s, then for any S ⊂ {1, . . . , p} with |S| = s, we have λ_min(X_S^T X_S) ≥ 2^{−2n⌈log_2(nM)⌉}.
Proof. The first claim follows from bounding the operator norm by the Frobenius norm: ‖X‖_2 ≤ ‖X‖_F ≤ √(np) M. For the second claim, let X_S denote a submatrix of X with an arbitrary index set S of size s. Then spark(X) > s implies that X_S^T X_S is non-singular. Since the absolute values of the entries of X lie in {0, . . . , M}, the entries of X_S^T X_S are integers with absolute values between 0 and nM^2, namely ‖X_S^T X_S‖_max ≤ nM^2. Moreover, since the non-negative and nonzero determinant of X_S^T X_S is an integer, it must be at least 1. Hence λ_min(X_S^T X_S) ≥ det(X_S^T X_S) / λ_max(X_S^T X_S)^{s−1} ≥ (s n M^2)^{−(s−1)} ≥ (nM)^{−2n} ≥ 2^{−2n⌈log_2(nM)⌉}. In the middle inequality we have used s ≤ n. This is the desired bound.
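The lemma can be sanity-checked on small integer matrices; the Python sketch below is our own illustration, computing the smallest eigenvalue over all column subsets of size s and comparing it with the (very conservative) bound of the lemma.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p, s = 4, 6, 3
X = rng.integers(-3, 4, size=(n, p)).astype(float)
M = int(np.max(np.abs(X)))
bound = 2.0 ** (-2 * n * int(np.ceil(np.log2(n * M))))
lam_min = min(np.linalg.eigvalsh(X[:, S].T @ X[:, S]).min()
              for S in combinations(range(p), s))
if lam_min > 0:                       # spark(X) > s on this draw
    print(lam_min >= bound)           # True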
For the proof we need the notion of encoding length, which is the size in bits of an object. Thus, an integer M has size ⌈log_2(M)⌉ bits. Hence the size of the matrix X is at least np + ⌈log_2(M)⌉: at least one bit for each entry, and ⌈log_2(M)⌉ bits to represent the largest entry. To ensure that the reduction is polynomial-time, we must make sure in particular that the size in bits of the parameters involved is polynomial in the size of the input X. As in standard treatments of computational complexity, the numbers here are rational (Arora and Barak, 2009).
Proof of Theorem 3.1. It is enough to consider X with integer entries. For each property and given sparsity size s, we will exhibit parameters (α, γ) of a size in bits polynomial in that of the input X, such that: 1. spark(X) ≤ s =⇒ X does not obey the regularity property with parameters (α, γ); 2. spark(X) > s =⇒ X obeys the regularity property with parameters (α, γ).
Hence, any polynomial-time algorithm for deciding if the regularity property holds for (X, s, α, γ), can, with just one call, in polynomial time decide if spark(X) ≤ s. Here it is crucial that (α, γ) are polynomial in the size of X, so that the whole reduction is polynomial in X.
Since deciding spark(X) ≤ s is NP-hard by Lemma 5.1, this shows the desired NP-hardness of checking the conditions. For ℓ q sensitivity, we in fact show that the subproblem where Z = X is NP-hard, so the full problem is also NP-hard. Now we provide the required parameters (α, γ) for each regularity condition. Similar ideas are used when comparing the conditions.
For the restricted eigenvalue condition, the first claim follows for any γ > 0 and any α > 0. To see this, note that if the spark of X is at most s, there is a nonzero s-sparse vector v in the kernel of X, and |Xv|_2 = 0 < γ|v_S|_2, where S is any set containing the nonzero coordinates. This v is clearly also in the cone C(s, α), and so X does not obey RE with parameters (s, α, γ).
We now prove the second claim for the restricted eigenvalue. If spark(X) > s, then for each index set S of size s, the submatrix X_S is non-singular. We now show that this implies a non-vanishing lower bound on the RE constant of X. Indeed, consider a vector v in the cone C(s, α), and assume specifically that α|v_S|_1 ≥ |v_{S^c}|_1. Using the simple identity Xv = X_S v_S + X_{S^c} v_{S^c}, we have |Xv|_2 ≥ |X_S v_S|_2 − |X_{S^c} v_{S^c}|_2 ≥ λ_min(X_S^T X_S)^{1/2} |v_S|_2 − ‖X‖ |v_{S^c}|_2. Further, since v is in the cone, we have |v_{S^c}|_2 ≤ |v_{S^c}|_1 ≤ α|v_S|_1 ≤ α√s |v_S|_2. Since X_S is non-degenerate and integer-valued, we can use the bounds from Lemma 5.2. Consequently, with M = ‖X‖_max, we obtain |Xv|_2 ≥ (2^{−n⌈log_2(nM)⌉} − α√s · √(np) M) |v_S|_2. By choosing, say, α = γ = 2^{−2n⌈log_2(npM)⌉}, we easily conclude after some computations that |Xv|_2 ≥ γ|v_S|_2.
Moreover, the size in bits, or encoding length, of the parameters is polynomially related to that of X. Indeed, the size in bits of both parameters is 2n⌈log_2(npM)⌉, and the size of X is at least np + ⌈log_2(M)⌉, as discussed before the proof. Note that 2n⌈log_2(npM)⌉ ≤ (np + ⌈log_2(M)⌉)^2. Hence we have shown both required conditions.
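The claim that the parameters have polynomially bounded encoding length can be checked numerically; the sketch below, our own illustration, compares 2n⌈log_2(npM)⌉ with the square of the input size np + ⌈log_2(M)⌉ for a few values.

import math

for n, p, M in [(5, 8, 3), (20, 50, 7), (100, 1000, 255)]:
    param_bits = 2 * n * math.ceil(math.log2(n * p * M))
    input_bits = n * p + math.ceil(math.log2(M))
    print(param_bits, input_bits, param_bits <= input_bits ** 2)  # True in each case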
The argument for the compatibility conditions is nearly identical. Indeed, the first claim is satisfied for any γ > 0: for any nonzero s-sparse vector v in the kernel of X, with S containing the support of v, we have √s |Xv|_2 = 0 < γ|v_S|_1.
For the second claim we argue as above, and obtain √s |Xv|_2 ≥ √s γ|v_S|_2 ≥ γ|v_S|_1. Therefore the same choices of α, γ work in this case as well. Finally, we deal with the ℓ q sensitivity property. The first condition is again satisfied for all α > 0 and γ > 0. Indeed, if the spark of X is at most s, there is a nonzero s-sparse vector v in its kernel, and thus |X^T X v|_∞ = 0.
For the second condition, we lower bound |X^T X v|_∞ for vectors v in the cone, reducing to the restricted eigenvalue bound established above. For v in the cone, α|v_S|_1 ≥ |v_{S^c}|_1; combining the resulting bounds with |v|_1 ≥ |v|_q, valid since q ≥ 1, and with |v|_2^2 = |v_S|_2^2 + |v_{S^c}|_2^2 ≤ (1 + α^2 s)|v_S|_2^2, which follows from inequality (3), we essentially reduce to the restricted eigenvalue case. From the proof of that case, the choice α = 2^{−2n⌈log_2(npM)⌉} gives, after applying a number of coarse bounds, that X obeys the ℓ q sensitivity property with parameters α = 2^{−2n⌈log_2(npM)⌉} and γ = 2^{−5n⌈log_2(npM)⌉}. As in the previous case, the sizes in bits of these parameters are polynomial in the size in bits of X. This shows that both conditions hold, and thus proves the correctness of the reduction for ℓ q sensitivity. This completes the proof of Theorem 3.1.
The values of (α, γ) used in the proof are outside the range where the regularity properties lead to effective bounds for the estimation error. This choice is essential for the current proof, but it is a question of interest to extend it to the regime where α and γ are independent of n and p.

Proof of Theorem 3.3
The ℓ q sensitivity property of random matrices relies on large deviation inequalities for random inner products. After establishing such inequalities, we finish the proofs quite directly, essentially by a union bound. We discuss the three probabilistic settings one by one, proving the required lemmas along the way. Thus, the proof of Theorem 3.3 is split into three parts: 5.3.1, 5.3.2, 5.3.3.

Sub-gaussian variables
A random variable X is sub-exponential if sup_{p≥1} p^{-1} ‖X‖_p < ∞, and the sub-exponential norm (or constant) is then ‖X‖_{ψ1} = sup_{p≥1} p^{-1} ‖X‖_p. We use the following Bernstein-type inequality; see Corollary 5.17 in Vershynin (2010).
Lemma 5.3 (Bernstein for sub-exponential). If X_1, . . . , X_N are independent centered sub-exponential random variables, and K = max_i ‖X_i‖_{ψ1}, then for every t ≥ 0 we have P( |(1/N) Σ_{i=1}^N X_i| ≥ t ) ≤ 2 exp( −cN min(t^2/K^2, t/K) ), where c > 0 is an absolute constant.
Bernstein's lemma immediately implies a deviation inequality for inner products. We state it separately for clarity. It is also an extension of a lemma used in covariance matrix estimation (Bickel and Levina, 2008; Ravikumar, 2011).
Lemma 5.4 (Deviation of Inner Products for Sub-gaussians). Let X and Z be zero-mean sub-gaussian random variables, with sub-gaussian norms ‖X‖_{ψ2}, ‖Z‖_{ψ2} respectively. Then, given n iid samples of X and Z, the sample covariance satisfies the tail bound P( |(1/n) Σ_{i=1}^n X_i Z_i − E[XZ]| ≥ t ) ≤ 2 exp( −cn min(t^2/K^2, t/K) ), where K := 4 ‖X‖_{ψ2} ‖Z‖_{ψ2}.
Proof. The proof consists of a direct application of Bernstein's inequality; we only need to bound the sub-exponential norms of U_i = X_i Z_i − E(X_i Z_i). In general, if X and Z are sub-gaussian, then XZ is sub-exponential and moreover ‖XZ‖_{ψ1} ≤ 2 ‖X‖_{ψ2} ‖Z‖_{ψ2}. (4) Indeed, by the Cauchy-Schwarz inequality, (E|XZ|^p)^2 ≤ E|X|^{2p} E|Z|^{2p}. Hence also p^{-1}(E|XZ|^p)^{1/p} ≤ 2 · (2p)^{-1/2}(E|X|^{2p})^{1/(2p)} · (2p)^{-1/2}(E|Z|^{2p})^{1/(2p)}. Taking the supremum over p ≥ 1 of both sides leads to the desired inequality (4).
The U_i are iid random variables, and their sub-exponential norm is, by the triangle inequality, the norm inequality (4), and Cauchy-Schwarz, at most ‖X_i Z_i‖_{ψ1} + |E(X_i Z_i)| ≤ 2 ‖X‖_{ψ2} ‖Z‖_{ψ2} + (E X^2)^{1/2} (E Z^2)^{1/2}. Further, by definition, (E X^2)^{1/2} ≤ √2 ‖X‖_{ψ2}, hence the sub-exponential norm is at most 4 ‖X‖_{ψ2} ‖Z‖_{ψ2} = K. Thus the result follows by a direct application of Bernstein's inequality.
With these preparations, we now prove Theorem 3.3 for the sub-gaussian case. By a union bound over the Lp entries of the matrix Ψ − Ψ̂, P( ‖Ψ − Ψ̂‖_max ≥ t ) ≤ Σ_{i,j} P( |Ψ_{ij} − Ψ̂_{ij}| ≥ t ). By Lemma 5.4 each probability is upper bounded by a term of the form 2 exp(−cn min(t/K, t^2/K^2)), where K varies with i, j. The largest of these bounds corresponds to the largest of the K's, hence the K in the largest term is 4 max_{i,j} ‖X_i‖_{ψ2} ‖Z_j‖_{ψ2}. By the definition of the sub-gaussian norm, this is at most 4 ‖X‖_{ψ2} ‖Z‖_{ψ2}, where X and Z are now the p- and L-dimensional vectors, respectively.
Therefore we have the uniform bound P( ‖Ψ − Ψ̂‖_max ≥ t ) ≤ 2Lp exp(−cn min(t/K, t^2/K^2)). (5) We choose t such that (a+1) log(2Lp) = cnt^2/K^2, that is, t = K √((a+1) log(2Lp)/(cn)). Since we can assume (a+1) log(2Lp) ≤ cn by the scaling in the statement, the relevant term is the one quadratic in t, and the total probability of error is at most (2Lp)^{-a}. From now on, we will work on the high-probability event that ‖Ψ − Ψ̂‖_max ≤ t.
For any vector v, |Ψ̂v|_∞ ≥ |Ψv|_∞ − ‖Ψ − Ψ̂‖_max |v|_1 ≥ |Ψv|_∞ − t|v|_1. That is, with high probability it holds uniformly for all v that |Ψ̂v|_∞ ≥ |Ψv|_∞ − R √(log(2Lp)/n) |v|_1, (6) for the constant R = K √((a+1)/c). For vectors v in C(s, α), we bound the ℓ 1 norm by the ℓ q norm, q ≥ 1, in the usual way, to get a term depending on s rather than on all p coordinates: |v|_1 ≤ (1+α)|v_S|_1 ≤ (1+α) s^{1−1/q} |v_S|_q ≤ (1+α) s^{1−1/q} |v|_q. (7) Introducing this into (6) gives, with high probability over all v ∈ C(s, α), s^{1/q} |Ψ̂v|_∞ ≥ s^{1/q} |Ψv|_∞ − R (1+α) s √(log(2Lp)/n) |v|_q. If we choose n such that n ≥ (K^2 (1+a)(1+α)^2 / (c δ^2)) s^2 log(2pL), then the second term will be at most δ|v|_q. Further, since Ψ obeys the ℓ q sensitivity assumption, the first term will be at least γ|v|_q. This shows that Ψ̂ satisfies the ℓ q sensitivity assumption with constant γ − δ with high probability, and finishes the proof.
To summarize, it suffices if the sample size is at least n ≥ ((a+1) log(2pL)/c) · max(1, K^2 (1+α)^2 s^2 / δ^2). (8) The key to the proof, inequality (6), is similar in spirit to the one used in Raskutti, Wainwright and Yu (2010) to establish the Restricted Eigenvalue condition for correlated designs. However, our argument also easily yields the matching upper bound |Ψ̂v|_∞ ≤ |Ψv|_∞ + t|v|_1, and hence a two-sided high-probability bound. Hence, the population ℓ q sensitivity property is both necessary and sufficient for the sample version. This is not necessarily clear from the proofs for the Restricted Eigenvalue condition (Raskutti, Wainwright and Yu, 2010; Rudelson and Zhou, 2012).

Bounded variables
If the components of the vectors X and Z are bounded, then essentially the same proof goes through. The sub-exponential norm of U_i = X_i Z_i − E(X_i Z_i) is bounded by a constant: if the entries are bounded by C_x and C_z respectively, it is at most 2C_x C_z. Hence Lemma 5.4 holds with the same proof, where now the value K := 2C_x C_z is different. The rest of the proof only relies on Lemma 5.4, so it goes through unchanged. Therefore, with the same sample size requirement (8), the matrix of sample covariances obeys ℓ q sensitivity with high probability.

Variables with bounded moments
For variates with bounded moments, we also need a large deviation inequality for inner products. We were unable to find a reference for this specific instance of a large deviation inequality, so we give a proof below. The general flow of the argument is classical, and relies on the Markov inequality and a moment-of-sum computation (e.g. Petrov (1995)). The closest result we are aware of is a lemma used in covariance matrix estimation (Ravikumar, 2011). Our result can be viewed as an extension of theirs, and the proof is shorter.
Lemma 5.5 (Deviation for Bounded Moments - Khintchine-Rosenthal). Let X and Z be zero-mean random variables, and r a positive integer, such that E X^{4r} = C_x < ∞ and E Z^{4r} = C_z < ∞. Then, given n iid samples of X and Z, the sample covariance satisfies the tail bound P( |(1/n) Σ_{i=1}^n X_i Z_i − E[XZ]| ≥ t ) ≤ 2^{2r} r^{2r} √(C_x C_z) / (n^r t^{2r}).
Proof. Let Y_i = X_i Z_i − E[XZ], and k = 2r. By the Markov inequality, we have P( |Σ_{i=1}^n Y_i| ≥ nt ) ≤ E(Σ_{i=1}^n Y_i)^k / (nt)^k. We now bound the k-th moment of the sum Σ_{i=1}^n Y_i using a type of classical argument, often referred to as Khintchine's or Rosenthal's inequality. Recalling that k = 2r is even, we can write E(Σ_{i=1}^n Y_i)^k = Σ_{a_1+···+a_n=k} (k! / (a_1! · · · a_n!)) E(Y_1^{a_1} Y_2^{a_2} · · · Y_n^{a_n}). (9) By the mutual independence of the Y_i, E(Y_1^{a_1} Y_2^{a_2} · · · Y_n^{a_n}) = E Y_1^{a_1} · E Y_2^{a_2} · · · E Y_n^{a_n}.
As E Y_i = 0, the summands in which some Y_i appears as a singleton vanish. For the remaining terms, we bound by Jensen's inequality, (E|Y|^{r_1})^{1/r_1} ≤ (E|Y|^{r_2})^{1/r_2} for 0 ≤ r_1 ≤ r_2, so that a generic term is at most E|Y_1|^{a_1} · · · E|Y_n|^{a_n} ≤ (E|Y|^k)^{(a_1+···+a_n)/k} = E|Y|^k. Above we have used that a_1 + · · · + a_n = k. Hence, each non-vanishing term in the summation (9) was upper bounded by the same constant. To estimate the sum, we are left with the combinatorial problem of counting the sequences of non-negative integers (a_1, . . . , a_n) that sum to k and contain no entry equal to 1. Here, if some a_i > 0, then a_i ≥ 2, so there are at most k/2 = r nonzero entries. Therefore, the total contribution of the coefficients is at most the number of ways to choose r places out of n, multiplied by the number of ways to distribute the 2r factors among those places: (n choose r) r^{2r} ≤ n^r r^{2r}.
Thus, we have proved that E(Σ_{i=1}^n Y_i)^k ≤ n^r r^{2r} E|Y|^k. Further, we make the bound explicit in terms of the moments of X and Z. By the Minkowski and Jensen inequalities, (E|Y|^k)^{1/k} ≤ (E|XZ|^k)^{1/k} + |E[XZ]| ≤ 2 (E|XZ|^k)^{1/k}. Further, by Cauchy-Schwarz, E|XZ|^k ≤ (E X^{2k} E Z^{2k})^{1/2} = √(C_x C_z), since k = 2r. Introducing this bound for the moment of the sum into the Markov inequality leads to the desired bound.
We are ready to prove Theorem 3.3. By a union bound, the probability that ‖Ψ − Ψ̂‖_max ≥ t is at most Lp · 2^{2r} r^{2r} √(C_x C_z) / (n^r t^{2r}). Since r is fixed, for simplicity of notation we can denote C_0^{2r} = 2^{2r} r^{2r} √(C_x C_z). Choosing t = C_0 (Lp)^{1/(2r)} n^{−1/2 + a/(2r)}, the above probability is at most 1/n^a.
The deviation bound is used as before, so we conclude that with probability 1 − 1/n^a, for all v ∈ C(s, α), s^{1/q} |Ψ̂v|_∞ ≥ s^{1/q} |Ψv|_∞ − t (1+α) s |v|_q. From the choice of t, for sample size satisfying n^{1−a/r} ≥ C_0^2 (1+α)^2 (Lp)^{1/r} s^2 / δ^2, the error term is at most δ|v|_q. In this case Ψ̂ satisfies the ℓ q sensitivity assumption with constant γ − δ with high probability.

Proof of Theorem 3.4
This result and Theorem 3.3 have closely related proofs, relying in essence on the same large deviation inequalities. Let p_1 and 1 − p_1 denote the mixture probabilities. Then the outcome of a sample from P_1 corresponds to a Bernoulli trial with success probability p_1, and the number of samples n_1 from P_1 is a realization of a Binomial(n, p_1) random variable. The first part of the analysis is conditional on N_1 = n_1. Let X and Z be the two matrices of observations, and let (X_i, Z_i) denote the matrices of the samples from distribution P_i. Without loss of generality, assume that Ψ_1 satisfies the ℓ q sensitivity. Then we can write the matrix of sample covariances as Ψ̂ = (n_1/n) Ψ̂_1 + (n_2/n) Ψ̂_2, which can be further decomposed as Ψ̂ = (n_1/n)(Ψ̂_1 − Ψ_1) + (n_2/n)(Ψ̂_2 − Ψ_2) + (n_2/n)(Ψ_2 − Ψ_1) + Ψ_1. The main term is Ψ_1, and the first three terms are error terms. The first two are stochastic (call them M_1, M_2), and are bounded as in Theorem 3.3, while the third term (call it N) is small because the Ψ_i are close to one another.
In more detail, note that ‖M_1‖_max ≤ ‖Ψ̂_1 − Ψ_1‖_max and ‖M_2‖_max ≤ ‖Ψ̂_2 − Ψ_2‖_max, since we bound the ratios n_1/n ≤ 1, n_2/n ≤ 1 by the constant 1. The uniform large deviation inequality (5) from the same theorem can be applied to both samples, yielding bounds for ‖M_i‖_max, with K = max(4‖X_1‖_{ψ2}‖Z_1‖_{ψ2}, 2C(X_2)C(Z_2)). Here we have assumed without loss of generality that the first sample is sub-gaussian and the second one is bounded; K is the maximum of two expressions depending on these norms, the same expressions as in Theorem 3.3.
Now we combine these into the main bound. By assumption, ‖Ψ_2 − Ψ_1‖_max ≤ δ/s. Let t denote the bound obtained for ‖M_1‖_max + ‖M_2‖_max. We thus have ‖Ψ̂ − Ψ_1‖_max ≤ t + δ/s, which together with (7) leads to s^{1/q} |Ψ̂v|_∞ ≥ s^{1/q} |Ψ_1 v|_∞ − (ts + δ)(1+α)|v|_q for all v ∈ C(s, α). In order for the term ts to be at most ν, we need n ≥ (8K^2(1+a)/(q c ν^2)) s^2 log(2pL).
If this and (10) hold, then the previous display implies the statement of the theorem, by the ℓ q sensitivity of Ψ_1. The probability of error is at most 2(2Lp)^{-a} + 2 exp(−2q^2 n). Because of (10), the second probability is also of the form 2(2Lp)^{-θ}, where now θ = 4q(a+1)/c. So we can choose ρ = min(a, θ) to get the simpler form of the probability bound claimed in the statement. This proves the theorem.

Proof of Theorem 3.6
To bound the term |Ψv|_∞ in the ℓ 1 sensitivity, we use the s-comprehensive property. Indeed, let v ∈ C(s, α). By the symmetry of the s-comprehensive property, we can assume without loss of generality that |v_1| ≥ |v_2| ≥ . . . ≥ |v_p|. Then if S denotes the first s components, α|v_S|_1 ≥ |v_{S^c}|_1. Consider the sign pattern of the top s components of v: ε = (sgn(v_1), . . . , sgn(v_s)). Since Ψ is s-comprehensive, it has a row w supported on S with matching sign pattern (or its negation). Then we can compute |Ψv|_∞ ≥ |⟨w, v⟩| = |Σ_{i∈S} w_i v_i| = Σ_{i∈S} |w_i| |v_i|. Hence the inner product is lower bounded by min_{i∈S} |w_i| · Σ_{i∈S} |v_i| ≥ c Σ_{i∈S} |v_i| = c|v_S|_1.
Since |v|_1 ≤ (1+α)|v_S|_1, this gives s|Ψv|_∞ ≥ s c |v_S|_1 ≥ (sc/(1+α)) |v|_1, which proves the stated claim.
Proofs of Examples 3.7 and 3.9
Consider first Example 3.7, where Σ is diagonal with entries d_1, . . . , d_p. Write m = |Σv|_∞ = max_i d_i |v_i|, so that |v_i| ≤ m/d_i for each i. Then summing |v_i| ≤ m/d_i for i in any set S of size s gives |v_S|_1 ≤ m Σ_{i∈S} 1/d_i. We want to bound this for v ∈ C(s, α), so let S be the subset of dominating coordinates, for which |v_{S^c}|_1 ≤ α|v_S|_1. It follows that |v|_1 ≤ (1+α)|v_S|_1 ≤ (1+α) m Σ_{i=1}^s 1/d_{(i)}, where d_{(1)} ≤ d_{(2)} ≤ . . . are the entries arranged from the smallest to the largest. Hence the ℓ 1 sensitivity constant s|Σv|_∞/|v|_1 is lower bounded by the harmonic-average type quantity s / ((1+α) Σ_{i=1}^s 1/d_{(i)}). This harmonic average can be bounded away from zero even when several of the d_i are of order O(1/s). For instance, if d_{(1)} = · · · = d_{(k)} = 1/s and d_{(k+1)} > 1/c for some constant c and integer k < s, then the ℓ 1 sensitivity is at least s / ((1+α)(ks + (s−k)c)) ≥ 1 / ((1+α)(k + c)), which is bounded away from zero whenever k is bounded. In this setting the smallest eigenvalue of Σ is 1/s, so only ℓ 1 sensitivity holds out of all the regularity properties. We now consider Example 3.9. For this specific covariance matrix, m = |Σv|_∞ = max(|v_1 + ρv_2|, |v_2 + ρv_1|, |v_3|, . . . , |v_p|).

Proofs from Section 3.3
For the first claim of Theorem 3.10, note that (Z ′ ) T X ′ = (M Z) T M X = Z T X since M is orthonormal. ℓ q sensitivity of the pair of matrices (X, Z) only depends on the matrix Z T X, which is preserved under the orthonormal transformation. Hence, the transformed matrices inherit the regularity property.
For the second claim, note that Z^T X′ v = Z^T X (Mv). If v is any vector in the cone C(s, α), we have Mv ∈ C(s′, α′) by the cone-preserving property. Hence, applying the ℓ q sensitivity of (X, Z) with parameters (s′, α′, γ) to Mv gives a lower bound on |(1/n) Z^T X′ v|_∞ in terms of γ|Mv|_q. Further, by the condition on M, |Mv|_q ≥ c|v|_q. Combining these two inequalities yields the ℓ q sensitivity of (X′, Z) with lower bound cγ. For the last claim, we write (Z′)^T X v = M Z^T X v. By the ℓ q sensitivity of (X, Z), for all v ∈ C(s, α), s^{1/q} |(1/n) Z^T X v|_∞ ≥ γ|v|_q. However, |M((1/n) Z^T X v)|_∞ ≥ c |(1/n) Z^T X v|_∞ by the assumption on M. Combining these inequalities gives the desired ℓ q sensitivity of (X, Z′), completing the proof of Theorem 3.10.
Finally, for the proof of Theorem 3.11, note the following inequality, which has already been used in the paper: |(Σ + ∆)v|_∞ ≥ |Σv|_∞ − |∆v|_∞ ≥ |Σv|_∞ − |∆|_{q,∞} |v|_q. Since |∆|_{q,∞} ≤ δ/s^{1/q}, we have, using the assumed ℓ q sensitivity of Σ, that s^{1/q} |(Σ + ∆)v|_∞ / |v|_q ≥ γ − δ, as required.