Regularity Properties for Sparse Regression

Statistical and machine learning theory has developed several conditions ensuring that popular estimators such as the Lasso or the Dantzig selector perform well in high-dimensional sparse regression, including the restricted eigenvalue, compatibility, and ℓ_q sensitivity properties. However, some of the central aspects of these conditions are not well understood. For instance, it is unknown whether these conditions can be checked efficiently on any given dataset. This is problematic, because they are at the core of the theory of sparse regression. Here we provide a rigorous proof that these conditions are NP-hard to check. This shows that the conditions are computationally infeasible to verify, and raises some questions about their practical applications. However, by taking an average-case perspective instead of the worst-case view of NP-hardness, we show that a particular condition, ℓ_q sensitivity, has certain desirable properties. This condition is weaker and more general than the others. We show that it holds with high probability in models where the parent population is well behaved, and that it is robust to certain data processing steps. These results are desirable, as they provide guidance about when the condition, and more generally the theory of sparse regression, may be relevant in the analysis of high-dimensional correlated observational data.


Prologue
Open up any recent paper on sparse linear regression (the model Y = Xβ + ε, where X is an n × p matrix of features, n ≪ p, and most coordinates of β are zero) and you are likely to find that the main result is of the form: "If the data matrix X has the restricted eigenvalue, compatibility, or ℓ_q sensitivity property, then our method will successfully estimate the unknown sparse parameter β, provided the sample size is at least ..." In addition to the sparsity of the parameter, the key condition here is the regularity of the matrix of features, such as restricted eigenvalue, compatibility, or ℓ_q sensitivity. It states that every suitable submatrix of the feature matrix X is "nearly orthogonal." Such a property is crucial for the success of popular estimators like the Lasso and the Dantzig selector. However, these conditions are somewhat poorly understood. For instance, as the conditions are combinatorial, it is not known how to check them efficiently (in polynomial time) on any given data matrix. Without this knowledge, it is difficult to see whether the whole framework is relevant to any particular data analysis setting.
In this paper we seek a better understanding of these problems. We first establish that the most popular conditions for sparse regression, namely restricted eigenvalue, compatibility, and ℓ_q sensitivity, are all NP-hard to check. This implies that there is likely no efficient way to verify them for deterministic matrices, and raises some questions about their practical applications. Next, we move away from the worst-case analysis entailed by NP-hardness, and consider an average-case, non-adversarial analysis. We show that the weakest of these conditions, ℓ_q sensitivity, has some desirable properties, including that it holds with high probability in well-behaved random design models, and that it is preserved under certain data processing operations.

Formal Introduction
We now turn to a more formal and thorough introduction. The context of this paper is that high-dimensional data analysis is becoming commonplace in statistics and machine learning. Recent research shows that estimation of high-dimensional parameters may be possible if they are suitably sparse. For instance, in linear regression where most of the regression coefficients are zero, popular estimators such as the Lasso [8,24], SCAD [14], and the Dantzig selector [6] can have small estimation error, as long as the matrix of covariates is sufficiently "regular." There is a large number of suitable regularity conditions, starting with the incoherence condition of Donoho and Huo [12], followed by more sophisticated properties such as Candes and Tao's restricted isometry property ("RIP") [7], Bickel, Ritov and Tsybakov's weaker and more general restricted eigenvalue (RE) condition [3], and Gautier and Tsybakov's even more general ℓ_q sensitivity properties [15], which also apply to instrumental variables regression.
While it is known that these properties lead to desirable guarantees on the performance of popular statistical methods, it is largely unknown whether they hold in practice. Even more, it is not known how to efficiently check if they hold for any given dataset. Due to their combinatorial nature, it is thought that they may be computationally hard to verify [11,19,23]. The assumed difficulty of the computation has motivated convex relaxations for approximating the restricted isometry constant [10,17] and q sensitivity [15].
However, a rigorous proof is missing. A proof would be desirable for several reasons: (1) to show definitively that there is no computational "shortcut" to find their values, (2) to increase our understanding of why these conditions are difficult to check, and therefore (3) to guide the development of the future theory of sparse regression, based instead on efficiently verifiable conditions.
In this paper we provide such a proof. We show that checking any of the RE, compatibility, and ℓ_q sensitivity properties for general data matrices is NP-hard (Theorem 3.1). This implies that there is no polynomial-time algorithm to verify them, under the widely believed assumption that P ≠ NP. This raises some questions about the relevance of these conditions to practical data analysis.
We do not attempt to give a definitive answer here, and instead provide some positive results to enhance our understanding of these conditions. While the previous NP-hardness analysis referred to a worst-case scenario, we next take an average-case, non-adversarial perspective. Previous authors studied RIP, RE, and compatibility from this perspective, as well as the relations between these conditions [27]. We study ℓ_q sensitivity, for two reasons: First, it is more general than other regularity properties in terms of the correlation structures it can capture, and thus potentially applicable to more highly correlated data. Second, it applies not just to ordinary linear regression, but also to instrumental variables regression, which is relevant in applications such as economics.
Finding conditions under which ℓ_q sensitivity holds is valuable for several reasons: (1) since it is hard to check the condition computationally on any given dataset, it is desirable to have some other way to ascertain it, even if that method is somewhat speculative, and (2) it helps us to compare the situations, and statistical models, where this condition is most suitable to the cases where the other conditions are applicable, and thus better understand its scope.
Hence, to increase our understanding of when ℓ_q sensitivity may be relevant, we perform a probabilistic, or "average case," analysis, and consider a model where the data is randomly sampled from suitable distributions. In this case, we show that there is a natural "population" condition which is sufficient to ensure that ℓ_q sensitivity holds with high probability (Theorem 3.2). This complements the results for RIP [e.g., 20, 28] and RE [19,22]. Further, we define an explicit s-comprehensive property (Definition 3.3) which implies ℓ_1 sensitivity (Theorem 3.4). Such a condition is of interest because there are very few explicit examples where one can ascertain that ℓ_q sensitivity holds.
Finally, we show that the ℓ_q sensitivity property is preserved under several data processing steps that may be used in practice (Proposition 3.5). This shows that, while it is initially hard to ascertain this property, it may be somewhat robust to downstream data processing.
We introduce the problem in Sect. 2. Then, in Sect. 3 we present our results, with a discussion in Sect. 4, and provide the proofs in Sect. 5.

Setup
We introduce the problems and properties studied, followed by some notions from computational complexity.

Regression Problems and Estimators
Consider the linear model Y = Xβ + ε, where Y is an n × 1 response vector, X is an n × p matrix of p covariates, β is a p × 1 vector of coefficients, and ε is an n × 1 noise vector of independent N(0, σ²) entries. The observables are Y and X, where X may be deterministic or random, and we want to estimate the fixed unknown β. Below we briefly present the models and estimation procedures that are required; for full details we refer to the original publications.
In the case when n < p, it is common to assume sparsity, viz., most of the coordinates of β are zero. We do not know the locations of the nonzero coordinates. A popular estimator in this case is the Lasso [8,24], which for a given regularization parameter λ solves the optimization problem

β̂ ∈ argmin_{β ∈ R^p} { (2n)^{-1} |Y − Xβ|_2^2 + λ|β|_1 }.

The Dantzig selector is another estimator for this problem, which for a known noise level σ, and with a tuning parameter A, takes the form [6]

β̂ ∈ argmin { |β|_1 : |n^{-1} X^T (Y − Xβ)|_∞ ≤ Aσ (log p / n)^{1/2} }.

See [13] for a view from the perspective of the sparsest solution in a high-confidence set, and its generalizations.
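As a concrete illustration (ours, not part of the original development), the Lasso program above can be solved by proximal gradient descent (ISTA). The sketch below, in plain NumPy, minimizes the objective (2n)^{-1}|Y − Xβ|_2^2 + λ|β|_1; the function name and the toy instance are our own.

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=500):
    """Proximal-gradient (ISTA) sketch of the Lasso:
    minimize (2n)^{-1} |Y - X b|_2^2 + lam * |b|_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L, L = Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - Y) / n          # gradient of the quadratic term
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return b

# tiny sparse-regression instance: s = 3 nonzero coefficients
rng = np.random.default_rng(0)
n, p, s = 100, 30, 3
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.0
Y = X @ beta + 0.1 * rng.standard_normal(n)
beta_hat = lasso_ista(X, Y, lam=0.1)
```

On this well-conditioned instance the estimate recovers the support, up to the usual shrinkage bias of order λ on the nonzero coordinates.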
In instrumental variables regression we start with the same linear model Y = Σ_{i=1}^p x_i β_i + ε. Now some covariates x_i may be correlated with the noise ε, in which case they are called endogenous. Further, we have additional variables z_i, i = 1, ..., L, called instruments, that are uncorrelated with the noise. In addition to X, we observe n independent samples of the z_i, arranged in the n × L matrix Z. In this setting, [15] propose the self-tuning instrumental variables (STIV) estimator, a generalization of the Dantzig selector, which minimizes a criterion of the form |D_X^{-1} β|_1 + cσ, with the minimum over a polytope of pairs (β, σ) on which σ dominates both the correlation |n^{-1} D_Z Z^T (Y − Xβ)|_∞ and the residual scale Q(β)^{1/2}. Here D_X and D_Z are diagonal normalization matrices whose entries are determined by the column maxima of X and Z (for instance, max_{k=1,...,n} |z_ki|), Q(β) = n^{-1}|Y − Xβ|_2^2, and c is a constant whose choice is described in [15]. When X is exogenous, we can take Z = X, which reduces to a Dantzig-type selector.

Regularity Properties
The performance of the above estimators is characterized under certain "regularity properties." These depend on the union of cones C(s, α), called "the cone" for brevity, which is the set of vectors whose ℓ_1 norm is concentrated on some s coordinates:

C(s, α) = {v ∈ R^p : there is a set S ⊂ {1, ..., p} with |S| = s such that |v_{S^c}|_1 ≤ α|v_S|_1},

where v_A is the subvector of v with the entries from the subset A.
The properties discussed here depend on a triplet of parameters (s, α, γ), where s is the sparsity size of the problem, α is the cone opening parameter in C(s, α), and γ is the lower bound. First, the restricted eigenvalue condition RE(s, α, γ) from [3,16] holds for a fixed matrix X if

min_{v ∈ C(s,α), v ≠ 0} |Xv|_2 / |v_S|_2 ≥ γ,

where S is a set of s coordinates on which the ℓ_1 norm of v is concentrated. We emphasize that this property, and the ones below, are defined for arbitrary deterministic matrices, but later we will consider them for randomly sampled data. [3] shows that if the normalized data matrix n^{-1/2}X obeys RE(s, α, γ) and β is s-sparse, then the estimation error is small, in the sense that |β̂ − β|_2 = O_P(γ^{-2}(s log p/n)^{1/2}) and |β̂ − β|_1 = O_P(γ^{-2} s (log p/n)^{1/2}), for both the Dantzig selector and the Lasso. See [13] for more general results and simpler arguments. The "cone opening" α required in the RE property equals 1 for the Dantzig selector, and 3 for the Lasso.
Next, the deterministic matrix X obeys the compatibility condition with positive parameters (s, α, γ) [26], if

min_{v ∈ C(s,α), v ≠ 0} s^{1/2} |Xv|_2 / |v_S|_1 ≥ γ.

The two conditions are very similar; the only difference is the change from the ℓ_2 to the ℓ_1 norm in the denominator. The inequality |v_S|_1 ≤ s^{1/2}|v_S|_2 shows that the compatibility condition is, formally at least, weaker than the RE assumption. van de Geer [26] provides an ℓ_1 oracle inequality for the Lasso under the compatibility condition; see also [4,27].
Finally, for q ≥ 1, the deterministic matrices X of size n × p and Z of size n × L satisfy the ℓ_q sensitivity property with parameters (s, α, γ), if

min_{v ∈ C(s,α), v ≠ 0} s^{1/q} |n^{-1} Z^T X v|_∞ / |v|_q ≥ γ.

If Z = X, the definition is similar to the cone invertibility factors [29]. Gautier and Tsybakov [15] show that ℓ_q sensitivity is weaker than the RE and compatibility conditions, meaning that in the special case when Z = X, the RE property of X implies the ℓ_q sensitivity of X. We note that the definition in [15] differs in normalization, but that is not essential. The details are that we have an additional s^{1/q} factor (to ensure direct comparability with the other conditions), and we do not normalize by the diagonal matrices D_X, D_Z for simplicity (to avoid the dependencies introduced by this process). One can easily show that the un-normalized ℓ_q condition is sufficient for the good performance of an un-normalized version of the STIV estimator.
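Because exact verification of this property is combinatorial (see Sect. 3), the most one can do cheaply is search for counterexamples. The sketch below (our illustration, not from [15]; all names are ours) samples random vectors from the cone C(s, α) and returns a certificate that the un-normalized ℓ_q sensitivity bound fails, if it finds one; finding nothing certifies nothing.

```python
import numpy as np

def lq_sensitivity_violated(X, Z, s, alpha, gamma, q=1, n_trials=2000, seed=0):
    """Randomized search for a certificate that (X, Z) FAILS the (un-normalized)
    l_q sensitivity property with parameters (s, alpha, gamma): a vector v in
    the cone C(s, alpha) with s^{1/q} |n^{-1} Z^T X v|_inf < gamma |v|_q.
    Returns such a v, or None if none is found (which certifies nothing)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    Psi_hat = Z.T @ X / n
    for _ in range(n_trials):
        S = rng.choice(p, size=s, replace=False)
        v = np.zeros(p)
        v[S] = rng.standard_normal(s)                      # bulk of the mass on S
        rest = np.setdiff1d(np.arange(p), S)
        w = rng.standard_normal(p - s)
        w *= alpha * np.abs(v[S]).sum() / np.abs(w).sum()  # |v_{S^c}|_1 <= alpha |v_S|_1
        v[rest] = w
        lhs = s ** (1.0 / q) * np.max(np.abs(Psi_hat @ v))
        if lhs < gamma * np.linalg.norm(v, q):
            return v
    return None
```

For an over-large γ every cone vector is a violation and the search succeeds immediately; for a tiny γ on a generic full-rank design it returns None, without proving that the property holds.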
Finally, we introduce incoherence and the restricted isometry property, which are not analyzed in this paper, but are instead used for illustration purposes. For a deterministic n × p matrix X whose columns {X_j}_{j=1}^p are normalized to length n^{1/2}, the mutual incoherence condition holds if n^{-1}|X_i^T X_j| ≤ γ/s for all i ≠ j, for some positive γ. Such a notion was defined in [12], and later used by Bunea [5] to derive oracle inequalities for the Lasso. A deterministic matrix X obeys the restricted isometry property with parameters (s, δ) if (1 − δ)|v|_2^2 ≤ n^{-1}|Xv|_2^2 ≤ (1 + δ)|v|_2^2 for all s-sparse vectors v.
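Unlike the combinatorial conditions above, mutual incoherence is verifiable with a single Gram-matrix computation. The helper below is our illustration (the name and interface are ours): it returns the smallest γ for which the bound n^{-1}|X_i^T X_j| ≤ γ/s holds after column normalization.

```python
import numpy as np

def incoherence_gamma(X, s):
    """Smallest gamma such that n^{-1} |X_i^T X_j| <= gamma / s for all i != j,
    after rescaling the columns of X to length sqrt(n).  A single Gram-matrix
    computation, i.e. polynomial time."""
    n, p = X.shape
    Xn = X / np.linalg.norm(X, axis=0) * np.sqrt(n)  # columns of length sqrt(n)
    G = Xn.T @ Xn / n                                 # normalized Gram matrix
    np.fill_diagonal(G, 0.0)                          # ignore the diagonal
    return s * np.max(np.abs(G))
```

An orthogonal design has incoherence constant 0, while any correlated design yields a strictly positive γ.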

Notions from Computational Complexity
To state formally that the regularity conditions are hard to verify, we need some basic notions from computational complexity theory. Here problems are classified according to the computational resources, such as time and memory, needed to solve them [1]. A well-known complexity class is P, consisting of the problems decidable in polynomial time in the size of the input. For input encoded in n bits, a yes or no answer must be found in time O(n^k) for some fixed k. A larger class is NP, the decision problems for which already existing solutions can be verified in polynomial time. This is usually much easier than solving the problem itself in polynomial time. For instance, the subset-sum problem, "Given an input set of integers, does there exist a subset with zero sum?", is in NP, since one can easily check a candidate solution, a subset of the given integers, to see if it indeed sums to zero. However, finding such a subset seems harder, as simply enumerating all subsets is not a polynomial-time algorithm. Formally, the definition of NP requires that if the answer is yes, then there exists an easily verifiable proof. We have P ⊂ NP, since a polynomial-time solution is itself a certificate verifiable in polynomial time. However, it is a famous open problem to decide whether P equals NP [9]. It is widely believed in the complexity community that P ≠ NP.
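The subset-sum example can be made concrete. In the toy sketch below (our illustration; the helper names are ours), verifying a proposed certificate takes polynomial time, while the naive search enumerates all subsets and is exponential in the input size.

```python
from itertools import combinations

def verify_certificate(nums, subset):
    """Polynomial-time verification: is `subset` a nonempty sub-collection
    of `nums` summing to zero?  This easy check is what places subset-sum
    in the class NP."""
    return (len(subset) > 0 and sum(subset) == 0
            and all(subset.count(x) <= nums.count(x) for x in set(subset)))

def find_zero_subset(nums):
    """Brute-force search over all subsets: exponential in len(nums)."""
    for r in range(1, len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == 0:
                return list(combo)
    return None

nums = [3, -1, 8, -2, 14, -9]
witness = find_zero_subset(nums)  # here: [3, -1, -2]
```

Checking the witness is immediate; finding it required scanning the subsets.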
To compare the computational hardness of various problems, one can reduce known hard problems to the novel questions of interest, thereby demonstrating the difficulty of the novel problems. Specifically, a problem A is polynomial-time reducible to a problem B, if an oracle solving B-an immediate solver for an instance of B-can be queried once to give a polynomial-time algorithm to solve A. This is also known as a polynomial-time many-one reduction, strong reduction, or Karp reduction. A problem is NP-hard if every problem in NP reduces to it, namely it is at least as difficult as all other problems in NP. If one reduces a known NP-hard problem to a new question, this demonstrates the NP-hardness of the new problem.
If indeed P ≠ NP, then there are no polynomial-time algorithms for NP-hard problems, implying that these are indeed computationally difficult.

Computational Complexity
We now show that the common conditions needed for successful sparse estimation are unfortunately NP-hard to verify. These conditions appear prominently in the theory of high-dimensional statistics, large-scale machine learning, and compressed sensing. In compressed sensing, one can often choose, or "engineer," the matrix of covariates such that it is as regular as possible-choosing for instance a matrix with iid Gaussian entries. It is well known that the restricted isometry property and its cousins will then hold with high probability.
In contrast, in statistics and machine learning, the data matrix is often observational, or "given to us," in the application. In this case, it is not known a priori whether the matrix is regular, and one may be tempted to try to verify it. Unfortunately, our results show that this is hard. This distinction between compressed sensing and statistical data analysis was our main motivation for writing this paper, after the computational difficulty of verifying the restricted isometry property was established in the information theory literature [2]. We think that researchers in high-dimensional statistics will benefit from the broader view, which shows that not just RIP, but also RE, ℓ_q sensitivity, etc., are hard to check. Formally:

Theorem 3.1 Let X be an n × p matrix, Z an n × L matrix, 0 < s < n, and α, γ > 0. It is NP-hard to decide any of the following problems:
1. Does X obey the RE condition with parameters (s, α, γ)?
2. Does X satisfy the compatibility condition with parameters (s, α, γ)?
3. Does (X, Z) have the ℓ_q sensitivity property with parameters (s, α, γ)?
The proof of Theorem 3.1 is relegated to Sect. 5.1, and builds on the recent results that computing the spark and checking restricted isometry are NP-hard [2,25].

ℓ_q Sensitivity for Correlated Designs
Since it is hard to check the properties in the worst case on a generic data matrix, it may be interesting to know that they hold at least under certain conditions. To understand when this may occur, we consider probabilistic models for the data, which amounts to an average-case analysis. This type of analysis is common in statistics.
To this end, we first need to define a "population" version of ℓ_q sensitivity that refers to the parent population from which the data is sampled. Let X and Z be p- and L-dimensional zero-mean random vectors, and denote by Ψ = E(ZX^T) the L × p matrix of covariances, with Ψ_ij = E(Z_i X_j). We say that Ψ satisfies the ℓ_q sensitivity property with parameters (s, α, γ) if min_{v ∈ C(s,α)} s^{1/q}|Ψv|_∞/|v|_q ≥ γ. One sees that we have simply replaced n^{-1}Z^T X from the original definition with its expectation Ψ.
It is then expected that for sufficiently large samples, random matrices with rows sampled independently from a population with the ℓ_q sensitivity property will inherit this condition. However, it is non-trivial to understand the required sample size, and its dependence on the moments of the random quantities. To state precisely the required probabilistic assumptions, we recall that the sub-gaussian norm of a random variable X is defined as ‖X‖_ψ2 = sup_{p≥1} p^{-1/2}(E|X|^p)^{1/p} (see e.g., [28]). The sub-gaussian norm (or sub-gaussian constant) of a p-dimensional random vector X is then defined as ‖X‖_ψ2 = sup_{x:|x|_2=1} ‖⟨X, x⟩‖_ψ2. Our result establishes sufficient conditions for ℓ_q sensitivity to hold for random matrices, under three broad conditions including sub-gaussianity:

Theorem 3.2 Let X and Z be zero-mean random vectors, such that the matrix of population covariances Ψ satisfies the ℓ_q sensitivity property with parameters (s, α, γ). Given n iid samples and any a, δ > 0, the matrix Ψ̂ = n^{-1} Z^T X has the ℓ_q sensitivity property with parameters (s, α, γ − δ), with high probability, in each of the following settings:
1. If X and Z are sub-gaussian with fixed constants, then sample ℓ_q sensitivity holds with probability at least 1 − (2pL)^{-a}, provided that the sample size is at least n ≥ cs² log(2pL).
2. If the entries of the vectors are bounded by fixed constants, the same statement holds.

3. If the entries have bounded moments, E|X_i|^{2r} ≤ C and E|Z_j|^{2r} ≤ C, for some positive integer r and all i, j, then the ℓ_q sensitivity property holds with probability at least 1 − 1/n^a, assuming the sample size is at least n^{1−a/r} ≥ cs²(pL)^{1/r}.
The constant c does not depend on n, L , p and s, and it is given in the proofs in Sect. 5.2.
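The sub-gaussian case can be illustrated numerically. In the sketch below (our illustration, under the assumption Z = X with standard Gaussian rows, so that Ψ = I_p), the max-norm deviation ‖Ψ̂ − Ψ‖_max shrinks at the rate (log(2pL)/n)^{1/2} that drives the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 40  # take Z = X, so L = p and the population matrix is Psi = I_p

# max-norm deviation of the sample covariance from the population covariance
devs = {}
for n in (50, 500, 5000):
    X = rng.standard_normal((n, p))
    Psi_hat = X.T @ X / n
    devs[n] = np.max(np.abs(Psi_hat - np.eye(p)))
```

As n grows by a factor of 100, the deviation drops by roughly a factor of 10, matching the 1/√n rate; once it falls below δ/((1+α)s), the sample matrix inherits ℓ_q sensitivity with constant γ − δ.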
The general statement of the theorem applies in particular to the case Z = X. Related results have been obtained for the RIP [20,22] and RE conditions [19,22]. Our results complement these for the weaker notion of ℓ_q sensitivity.
Next, we aim to achieve a better understanding of the population ℓ_q sensitivity property by giving some explicit sufficient conditions under which it holds. Modeling covariance matrices in high dimensions is challenging, as there are few known explicit models. For instance, the examples given in [19] to illustrate RE are quite limited, and include only diagonal, diagonal plus rank one, and ARMA covariance matrices. Therefore we think that the explicit conditions below are of interest, even if they are somewhat abstract.
We start from the case when Z = X, in which case Ψ is the covariance matrix of X. In particular, if Ψ equals the identity matrix I_p, or is nearly the identity, then Ψ is ℓ_q-sensitive. Inspired by this diagonal case, we introduce a more general condition.

Definition 3.3
The L × p matrix Ψ is called s-comprehensive if for any subset S ⊂ {1, ..., p} of size s, and for each pattern of signs ε ∈ {−1, 1}^S, there exists either a row w of Ψ such that sgn(w_i) = ε_i for i ∈ S and w_i = 0 otherwise, or a row with sgn(w_i) = −ε_i for i ∈ S and w_i = 0 otherwise.
In particular, when L = p, diagonal matrices with nonzero diagonal entries are 1-comprehensive. More generally, a simple counting argument gives the inequality L ≥ 2^{s−1} (p choose s), which shows that the number of instruments L must be large for the s-comprehensive property to be applicable. In problems where there are many potential instruments, this may be reasonable. To go back to our main point, we show that an s-comprehensive covariance matrix is ℓ_1-sensitive (Theorem 3.4). Finally, to improve our understanding of the relationship between the various conditions, we now give several examples. They show that ℓ_q sensitivity is more general than the rest. The proofs of the following claims can be found in Sect. 5.4.
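When s is small, Definition 3.3 can be checked by direct enumeration. The function below is our illustrative code (exponential in s, consistent with the counting bound above): it scans every size-s support and sign pattern and looks for a row realizing it.

```python
import numpy as np
from itertools import combinations, product

def is_s_comprehensive(Psi, s):
    """Brute-force check of Definition 3.3: for every size-s subset S and every
    sign pattern on S, some row of Psi must realize the pattern (or its
    negation) exactly on S and vanish off S.  Exponential in s."""
    L, p = Psi.shape
    signs = [np.sign(w) for w in Psi]
    supports = [frozenset(np.flatnonzero(w)) for w in Psi]
    for S in combinations(range(p), s):
        for eps in product((-1, 1), repeat=s):
            ok = any(
                supports[k] == frozenset(S)
                and (all(signs[k][i] == e for i, e in zip(S, eps))
                     or all(signs[k][i] == -e for i, e in zip(S, eps)))
                for k in range(L))
            if not ok:
                return False
    return True
```

For example, a diagonal matrix with nonzero diagonal is 1-comprehensive, while the identity is not 2-comprehensive, since no row has a size-2 support.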

Example 1
If Σ is a diagonal matrix with entries d_1, d_2, ..., d_p, then the restricted isometry property holds if 1 + δ ≥ d_i ≥ 1 − δ for all i. RE only requires d_i ≥ γ; the same is required for compatibility. This example shows why restricted isometry is the most stringent requirement. Further, ℓ_1 sensitivity holds even if a finite number of the d_i go to zero at rate 1/s. In this case, all the other regularity conditions fail. This is an example where ℓ_q regularity holds under broader conditions than the others.

Example 3
If Σ has diagonal entries equal to 1, σ_12 = σ_21 = ρ, and all other entries equal to zero, then compatibility and ℓ_1 sensitivity hold as long as 1 − ρ is at least of order 1/s (Sect. 5.4). In such a case, however, the RE constants are of order 1/s. This is an example where compatibility and ℓ_1 sensitivity hold but the RE condition fails.

Operations Preserving Regularity
In data analysis, one often processes data by normalization or feature merging. Normalization is performed to bring variables to the same scale. Features are merged via sparse linear combinations to reduce dimension and avoid multicollinearity. Our final result shows that ℓ_q sensitivity is preserved under the above operations, and even more general ones. This may be of interest in cases where downstream data processing is performed after an initial step where the regularity conditions are ascertained.
Let X and Z be as above. First, note that ℓ_q sensitivity depends only on the inner products Z^T X, and is therefore preserved under simultaneous orthogonal transformations of the samples, X' = MX and Z' = MZ, for any orthogonal n × n matrix M. The next result defines broader classes of transformations that preserve ℓ_q sensitivity. Admittedly, the transformations we consider are abstract, but they include some concrete examples, and represent a simple first step towards understanding what kinds of data processing steps are "admissible" and do not destroy regularity. Furthermore, the result is very elementary, but the goal here is not technical sophistication, but rather increasing our understanding of the behavior of an important property. The precise statement is:

Proposition 3.5 Let M be a linear transformation of R^p such that for all v ∈ C(s, α) we have Mv ∈ C(s', α'), and let X' = XM. Suppose further that |Mv|_q ≥ c|v|_q for all v ∈ C(s, α). If (X, Z) has the ℓ_q sensitivity property with parameters (s', α', γ), then (X', Z) has ℓ_q sensitivity with parameters (s, α, cγ).

If instead we transform Z' = ZM, where M is such that |M^T w|_∞ ≥ c|w|_∞ for all w, and (X, Z) has the ℓ_q sensitivity property with lower bound γ, then (X, Z') has the same property with lower bound cγ.
One can check that normalization and feature merging on the X matrix are special cases of the first class of "cone-preserving" transformations. For normalization, M is the p × p diagonal matrix of the inverse lengths of the columns of X. Similarly, normalization of the Z matrix is a special case of the second class of transformations. This shows that our definitions include some concrete, commonly performed data processing steps.
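For concreteness, the normalization step can be written out. In this sketch (our example), M is the diagonal matrix of inverse column lengths, and for such a positive diagonal M the constant c = min_j M_jj gives the ℓ_q-norm lower bound |Mv|_q ≥ c|v|_q required by the proposition.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 8
X = rng.standard_normal((n, p)) * rng.uniform(0.5, 3.0, size=p)  # uneven column scales

M = np.diag(1.0 / np.linalg.norm(X, axis=0))  # normalization as a linear map
X_norm = X @ M
col_norms = np.linalg.norm(X_norm, axis=0)    # all equal to 1 after the map

# for diagonal M, |Mv|_1 >= c |v|_1 with c = min_j M_jj
c = np.min(np.diag(M))
v = rng.standard_normal(p)
```

A positive diagonal M also maps C(s, α) into a cone of the same type, since it rescales coordinates without mixing them.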

Discussion
Our work raises further questions about the theoretical foundations of sparse linear models. What is a good condition to have at the core of the theory? The regularity properties discussed in this paper yield statistical performance guarantees for popular methods such as the Lasso and the Dantzig selector. However, they are not efficiently verifiable. In contrast, incoherence can be checked efficiently, but does not guarantee performance up to the optimal rate [4]. It may be of interest to investigate if there are intermediate conditions that achieve favorable trade-offs.

Proof of Theorem 3.1
The spark of a matrix X , denoted spark(X ), is the smallest number of linearly dependent columns. Our proof is a polynomial-time reduction from the NP-hard problem of computing the spark of a matrix (see [2,25] and references therein).

Lemma 5.1 Given an n × p matrix X with integer entries, and a sparsity size 0 < s < p, it is NP-hard to decide whether the spark of X is at most s.
We also need the following technical lemma, which provides bounds on the singular values of matrices with bounded integer entries. For a matrix X, we denote by ‖X‖_2 or ‖X‖ its operator norm, and by X_S the submatrix of X formed by the columns with indices in S.

Lemma 5.2 Let X be an n × p matrix with integer entries, and denote M = max_{i,j} |X_ij|. Then we have ‖X‖_2 ≤ 2^{⌈log_2((np)^{1/2} M)⌉}. Further, if spark(X) > s for some 0 < s < n, then for each subset S ⊂ {1, ..., p} with |S| = s, we have

min_{v ≠ 0} |X_S v|_2 / |v|_2 ≥ (snM²)^{-(s−1)/2} ≥ (nM)^{-n}.

Proof The first claim follows from ‖X‖_2 ≤ (np)^{1/2} ‖X‖_max ≤ 2^{⌈log_2((np)^{1/2} M)⌉}. For the second claim, let X_S denote the submatrix of X with an arbitrary index set S of size s. Then spark(X) > s implies that X_S is non-singular. Since the absolute values of the entries of X lie in {0, ..., M}, the entries of X_S^T X_S are integers with absolute values between 0 and nM², namely ‖X_S^T X_S‖_max ≤ nM², and hence λ_max(X_S^T X_S) ≤ snM². Moreover, since the non-negative and nonzero determinant of X_S^T X_S is an integer, it must be at least 1. Hence

λ_min(X_S^T X_S) ≥ det(X_S^T X_S) / λ_max(X_S^T X_S)^{s−1} ≥ (snM²)^{-(s−1)}.

Rearranging and taking square roots, we get min_{v ≠ 0} |X_S v|_2/|v|_2 ≥ (snM²)^{-(s−1)/2} ≥ (n²M²)^{-(n−1)/2} ≥ (nM)^{-n}. In the middle inequality we have used s ≤ n. This is the desired bound.
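The determinant argument in Lemma 5.2 can be sanity-checked numerically. The snippet below (our illustration, on a small random integer matrix) verifies the lower bound σ_min(X_S) ≥ (snM²)^{-(s−1)/2} over all non-singular column submatrices of size s.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p, s = 6, 8, 3
X = rng.integers(-5, 6, size=(n, p)).astype(float)  # bounded integer entries
M = np.max(np.abs(X))

# Lemma 5.2 lower bound on the smallest singular value of any
# non-singular n x s column submatrix
bound = (s * n * M ** 2) ** (-(s - 1) / 2)
ok = all(
    np.linalg.svd(X[:, list(S)], compute_uv=False)[-1] >= bound
    for S in combinations(range(p), s)
    if np.linalg.matrix_rank(X[:, list(S)]) == s)
```

The bound is extremely small (here of order 1/(snM²)), which is exactly why the reduction must choose α and γ exponentially small, yet still of polynomial bit size.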
For the proof we need the notion of encoding length, which is the size in bits of an object. Thus, an integer M has size about log_2(M) bits. Hence the size of the matrix X is at least np + log_2(M): at least one bit for each entry, and log_2(M) bits to represent the largest entry. To ensure that the reduction is polynomial-time, we need that the size in bits of the objects involved is polynomial in the size of the input X. As usual in computational complexity, the numbers here are rational [1].
Proof of Theorem 3.1 It is enough to prove the result for the special case of X with integer entries, since this statement is in fact stronger than the general case, which also includes rational entries. For each property and given sparsity size s, we will exhibit parameters (α, γ ) of polynomial size in bits, such that: 1. spark(X ) ≤ s ⇒ X does not obey the regularity property with parameters (α, γ ), 2. spark(X ) > s ⇒ X obeys the regularity property with parameters (α, γ ).
Hence, any polynomial-time algorithm deciding whether the regularity property holds for (X, s, α, γ) can decide whether spark(X) ≤ s with one call. Here it is crucial that (α, γ) are of polynomial size in bits, so that the whole reduction is polynomial in the size of X. Since deciding spark(X) ≤ s is NP-hard by Lemma 5.1, this shows the desired NP-hardness of checking the conditions. Now we provide the required parameters (α, γ) for each regularity condition. Similar ideas are used when comparing the conditions.
For the restricted eigenvalue condition, the first claim holds for any γ > 0 and any α > 0. To see this, if the spark of X is at most s, there is a nonzero s-sparse vector v in the kernel of X, and |Xv|_2 = 0 < γ|v_S|_2, where S is any set containing the nonzero coordinates. This v is clearly also in the cone C(s, α), and so X does not obey RE with parameters (s, α, γ).
For the second claim, note that if spark(X) > s, then for each index set S of size s, the submatrix X_S is non-singular. This implies a nonzero lower bound on the RE constant of X. Indeed, consider a vector v in the cone C(s, α), and assume specifically that |v_{S^c}|_1 ≤ α|v_S|_1 for the index set S, so that |Xv|_2 ≥ |X_S v_S|_2 − |X_{S^c} v_{S^c}|_2. Further, since v is in the cone, we have

|v_{S^c}|_2 ≤ |v_{S^c}|_1 ≤ α|v_S|_1 ≤ α s^{1/2}|v_S|_2.    (5.1)

Since X_S is non-degenerate and integer-valued, we can use the bounds from Lemma 5.2. Consequently, with M = ‖X‖_max, we obtain

|Xv|_2 ≥ ((snM²)^{-(s−1)/2} − α s^{1/2}‖X‖_2) |v_S|_2.

By choosing, say, α = 2^{−2n⌈log_2(npM)⌉} and γ = 2^{−2n⌈log_2(npM)⌉}, we easily conclude after some computations that |Xv|_2 ≥ γ|v_S|_2. Moreover, the size in bits of the parameters is polynomially related to that of X. Indeed, the size in bits of both parameters is 2n⌈log_2(npM)⌉, and the size of X is at least np + log_2(M), as discussed before the proof. Note that 2n⌈log_2(npM)⌉ ≤ (np + ⌈log_2(M)⌉)². This proves the claim.
The argument for the compatibility condition is identical, and therefore omitted. Finally, for the ℓ_q sensitivity property, we in fact show that the subproblem where Z = X is NP-hard; the full problem is then clearly also NP-hard. The first condition is again satisfied for all α > 0 and γ > 0. Indeed, if the spark of X is at most s, there is a nonzero s-sparse vector v in its kernel, and thus |X^T Xv|_∞ = 0.
For the second condition, we note that |X^T Xv|_∞ |v|_1 ≥ v^T X^T Xv = |Xv|_2^2, so that |X^T Xv|_∞ ≥ |Xv|_2^2 / |v|_1. For v in the cone, α|v_S|_1 ≥ |v_{S^c}|_1, and hence |v|_1 ≤ (1 + α)|v_S|_1 ≤ (1 + α)s^{1/2}|v_S|_2. Combining the last two results gives

|X^T Xv|_∞ ≥ |Xv|_2^2 / ((1 + α)s^{1/2}|v_S|_2).

Finally, since q ≥ 1, we have |v|_1 ≥ |v|_q, and as v is in the cone, |v|_2^2 = |v_S|_2^2 + |v_{S^c}|_2^2 ≤ (1 + α²s)|v_S|_2^2, by inequality (5.1). Therefore, we have essentially reduced to the RE case. From the proof of that case, the choice α = 2^{−2n⌈log_2(npM)⌉} gives |Xv|_2/|v_S|_2 ≥ 2^{−2n⌈log_2(npM)⌉}. Hence for this α we also have s^{1/q}|X^T Xv|_∞/(n|v|_q) ≥ 2^{−5(n+1)⌈log_2(npM)⌉}, where we have applied a number of coarse bounds. Thus X obeys the ℓ_q sensitivity property with the parameters α = 2^{−2n⌈log_2(npM)⌉} and γ = 2^{−5(n+1)⌈log_2(npM)⌉}. As in the previous case, the size in bits of these parameters is polynomial in the size in bits of X. This proves the correctness of the reduction for ℓ_q sensitivity, and completes the proof.

Proof of Theorem 3.2
We first establish some large deviation inequalities for random inner products, then finish the proofs directly by a union bound. We discuss the three probabilistic settings one by one.

Sub-Gaussian Variables
Lemma 5.3 (deviation of inner products for sub-gaussians) Let X and Z be zero-mean sub-gaussian random variables, with sub-gaussian norms ‖X‖_ψ2 and ‖Z‖_ψ2, respectively. Then, given n iid samples (X_i, Z_i) of (X, Z), the sample covariance satisfies, with K = 4‖X‖_ψ2‖Z‖_ψ2, the tail bound

P(|n^{-1} Σ_{i=1}^n X_i Z_i − E(XZ)| ≥ t) ≤ 2 exp(−cn min(t/K, t²/K²)).

Proof We use the Bernstein-type inequality in Corollary 5.17 from [28]. Recalling that the sub-exponential norm of a random variable U is ‖U‖_ψ1 = sup_{p≥1} p^{-1}(E|U|^p)^{1/p}, we need to bound the sub-exponential norms of U_i = X_i Z_i − E(X_i Z_i). We show that if X, Z are sub-gaussian, then XZ has sub-exponential norm bounded by

‖XZ‖_ψ1 ≤ 2‖X‖_ψ2‖Z‖_ψ2.    (5.2)

Indeed, by the Cauchy-Schwarz inequality (E|XZ|^p)² ≤ E|X|^{2p} E|Z|^{2p}, hence p^{-1}(E|XZ|^p)^{1/p} ≤ 2 · (2p)^{-1/2}(E|X|^{2p})^{1/(2p)} · (2p)^{-1/2}(E|Z|^{2p})^{1/(2p)}. Taking the supremum over p ≥ 1/2 leads to (5.2).
The $U_i$ are iid random variables, and their sub-exponential norm is bounded as $\|U_i\|_{\psi_1} \le \|X_iZ_i\|_{\psi_1} + |\mathbb{E}XZ| \le 2\|X\|_{\psi_2}\|Z\|_{\psi_2} + (\mathbb{E}X^2\,\mathbb{E}Z^2)^{1/2}$. Further, by definition $(\mathbb{E}X^2)^{1/2} \le \sqrt2\,\|X\|_{\psi_2}$, hence the sub-exponential norm is at most $4\|X\|_{\psi_2}\|Z\|_{\psi_2}$. The main result then follows by a direct application of Bernstein's inequality; see Corollary 5.17 from [28].
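For reference, the Bernstein-type inequality invoked here (Corollary 5.17 of [28], stated for centered sub-exponential variables; $c > 0$ denotes an absolute constant) reads:

```latex
% U_1, ..., U_n independent, centered, with max_i ||U_i||_{psi_1} <= K:
\Pr\Big(\Big|\frac1n\sum_{i=1}^n U_i\Big| \ge t\Big)
\;\le\; 2\exp\Big(-c\,n\,\min\Big(\frac{t^2}{K^2},\,\frac{t}{K}\Big)\Big)
\qquad\text{for every } t \ge 0 .
```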
With these preparations, we now prove Theorem 3.2 for the sub-Gaussian case. By a union bound over the $Lp$ entries of the matrix $\hat\Psi - \Psi$, the probability that $\|\hat\Psi - \Psi\|_{\max} \ge t$ is at most the sum of the entrywise tail probabilities. By Lemma 5.3 each such probability is bounded by a term of the form $2\exp(-cn\min(t/K,\,t^2/K^2))$, where $K$ varies with $i, j$. The largest of these bounds corresponds to the largest of the $K$s; hence the $K$ in the largest term is $4\max_{i,j}\|X_i\|_{\psi_2}\|Z_j\|_{\psi_2}$. By the definition of the sub-Gaussian norm, this is at most $4\|X\|_{\psi_2}\|Z\|_{\psi_2}$, where $X$ and $Z$ are now the $p$- and $L$-dimensional random vectors, respectively. We choose $t$ such that $(a+1)\log(2Lp) = cnt^2/K^2$, that is, $t = K[(a+1)\log(2Lp)/(cn)]^{1/2}$. Since we can assume $(a+1)\log(2Lp) \le cn$, the relevant term in the exponent is the one quadratic in $t$, and the total probability of error is at most $(2Lp)^{-a}$. From now on, we work on the high-probability event that $\|\hat\Psi - \Psi\|_{\max} \le t$. On this event it holds uniformly for all $v$ that
$$|\hat\Psi v|_\infty \;\ge\; |\Psi v|_\infty - |v|_1\,[R\log(2Lp)/n]^{1/2} \qquad(5.4)$$
for the constant $R = K^2(a+1)/c$. For vectors $v \in C(s, \alpha)$, we bound the $\ell_1$ norm by the $\ell_q$ norm, $q \ge 1$, in the usual way, to get a term depending on $s$ rather than on all $p$ coordinates: $|v|_1 \le (1+\alpha)|v_S|_1 \le (1+\alpha)s^{1-1/q}|v|_q$. Introducing this into (5.4) gives, with high probability, for all $v \in C(s, \alpha)$:
$$\frac{s^{1/q}|\hat\Psi v|_\infty}{|v|_q} \;\ge\; \frac{s^{1/q}|\Psi v|_\infty}{|v|_q} - (1+\alpha)s\,[R\log(2Lp)/n]^{1/2}.$$
If we choose $n$ such that $n \ge K^2(1+a)(1+\alpha)^2s^2\log(2pL)/(c\delta^2)$, then the second term is at most $\delta$. Further, since $\Psi$ obeys the $\ell_q$ sensitivity assumption, the first term is at least $\gamma$. This shows that $\hat\Psi$ satisfies the $\ell_q$ sensitivity property with constant $\gamma - \delta$ with high probability, and finishes the proof. To summarize, it suffices if the sample size is at least
$$n \;\ge\; \frac{(a+1)\log(2Lp)}{c}\,\max\Big\{1,\ \frac{K^2(1+\alpha)^2s^2}{\delta^2}\Big\}.$$
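The bookkeeping behind the stated error probability is elementary; with $t$ chosen so that $cnt^2/K^2 = (a+1)\log(2Lp)$, the union bound over the $Lp$ entries gives:

```latex
\Pr\big(\|\hat\Psi - \Psi\|_{\max} \ge t\big)
\;\le\; Lp \cdot 2\exp\big(-cnt^2/K^2\big)
\;=\; 2Lp \cdot (2Lp)^{-(a+1)}
\;=\; (2Lp)^{-a} .
```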

Bounded Variables
If the components of the vectors $X, Z$ are bounded, say $|X_i| \le C_x$ and $|Z_j| \le C_z$, then essentially the same proof goes through. The sub-exponential norm of $U_i = X_iZ_i - \mathbb{E}(X_iZ_i)$ is at most its sup norm, which is bounded by $2C_xC_z$. Hence Lemma 5.3 holds with the same proof, where now the value $K := 2C_xC_z$ is different. The rest of the proof relies only on Lemma 5.3, so it goes through unchanged.

Variables with Bounded Moments
For variates with bounded moments, we also need a large deviation inequality for inner products. The general flow of the argument is classical, and relies on the Markov inequality and a moment-of-sum computation (e.g., [18]). The result is a generalization of a lemma used in covariance matrix estimation [21], and our proof is shorter.
Lemma 5.4 (deviation for bounded moments: Khintchine–Rosenthal) Let $X$ and $Z$ be zero-mean random variables, and $r$ a positive integer, such that $\mathbb{E}X^{4r} = C_x$, $\mathbb{E}Z^{4r} = C_z$. Given $n$ iid samples from $X$ and $Z$, the sample covariance satisfies the tail bound
$$\Pr\Big(\Big|\tfrac1n\textstyle\sum_{i=1}^n X_iZ_i - \mathbb{E}XZ\Big| \ge t\Big) \;\le\; \frac{2^{2r}r^{2r}\sqrt{C_xC_z}}{t^{2r}n^r}.$$
Proof Let $Y_i = X_iZ_i - \mathbb{E}XZ$, and $k = 2r$. By the Markov inequality, we have
$$\Pr\Big(\Big|\sum_{i=1}^n Y_i\Big| \ge nt\Big) \;\le\; \frac{\mathbb{E}\big(\sum_{i=1}^n Y_i\big)^k}{(nt)^k}.$$
We now bound the $k$-th moment of the sum $\sum_{i=1}^n Y_i$ using a type of classical argument, often referred to as Khintchine's or Rosenthal's inequality. Recalling that $k = 2r$ is even, we can write
$$\mathbb{E}\Big(\sum_{i=1}^n Y_i\Big)^k \;=\; \sum_{a_1+\dots+a_n=k}\binom{k}{a_1,\dots,a_n}\,\mathbb{E}\big(Y_1^{a_1}Y_2^{a_2}\cdots Y_n^{a_n}\big). \qquad(5.7)$$
By independence of the $Y_i$, we have $\mathbb{E}(Y_1^{a_1}Y_2^{a_2}\cdots Y_n^{a_n}) = \mathbb{E}Y_1^{a_1}\,\mathbb{E}Y_2^{a_2}\cdots\mathbb{E}Y_n^{a_n}$. As $\mathbb{E}Y_i = 0$, the summands in which some $Y_i$ appears as a singleton vanish. For the remaining terms, we use Jensen's inequality, $(\mathbb{E}|Y|^{r_1})^{1/r_1} \le (\mathbb{E}|Y|^{r_2})^{1/r_2}$ for $0 \le r_1 \le r_2$. So each term is bounded by $(\mathbb{E}|Y|^k)^{a_1/k}\cdots(\mathbb{E}|Y|^k)^{a_n/k} = \mathbb{E}|Y|^k$.
Hence, each nonzero term in (5.7) is uniformly bounded by $\mathbb{E}|Y|^k$. We count the sequences of non-negative integers $(a_1, \dots, a_n)$ that sum to $k$ and are such that if some $a_i > 0$, then $a_i \ge 2$. Such a sequence has at most $k/2 = r$ nonzero elements. This shows that the total weight of these terms is not more than the number of ways to choose $r$ places out of $n$, multiplied by the number of ways to distribute the $k = 2r$ factors among those places, which can be bounded by $\binom{n}{r}r^{2r} \le n^rr^{2r}$. Thus, we have proved that
$$\mathbb{E}\Big(\sum_{i=1}^n Y_i\Big)^{2r} \;\le\; n^rr^{2r}\,\mathbb{E}|Y|^{2r}.$$
We can make this even more explicit by the Minkowski and Jensen inequalities (with Cauchy–Schwarz for the last step):
$$\mathbb{E}|Y|^{2r} \;\le\; 2^{2r}\,\mathbb{E}|XZ|^{2r} \;\le\; 2^{2r}\sqrt{C_xC_z}.$$
To prove Theorem 3.2, we note that by a union bound, the probability that $\|\hat\Psi - \Psi\|_{\max} \ge t$ is at most $Lp\,2^{2r}r^{2r}\sqrt{C_xC_z}/(t^{2r}n^r)$. Since $r$ is fixed, for simplicity of notation we denote $C_0^{2r} = 2^{2r}r^{2r}\sqrt{C_xC_z}$. Choosing $t = C_0(Lp)^{1/2r}n^{-1/2+a/(2r)}$, the above probability is at most $1/n^a$.
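Substituting the chosen $t$ back into the union bound confirms the claimed error probability:

```latex
\frac{Lp\,C_0^{2r}}{t^{2r}\,n^r}
\;=\; \frac{Lp\,C_0^{2r}}{C_0^{2r}\,(Lp)\,n^{-r+a}\,n^r}
\;=\; n^{-a},
\qquad\text{with } t = C_0\,(Lp)^{1/2r}\,n^{-1/2 + a/(2r)} .
```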
We conclude that with probability $1 - 1/n^a$, for all $v \in C(s, \alpha)$:
$$\frac{s^{1/q}|\hat\Psi v|_\infty}{|v|_q} \;\ge\; \frac{s^{1/q}|\Psi v|_\infty}{|v|_q} - (1+\alpha)s\,t.$$
From the choice of $t$, for sample size satisfying $n^{1-a/r} \ge C_0^2(1+\alpha)^2(Lp)^{1/r}s^2/\delta^2$, the error term on the right-hand side is at most $\delta$, which is what we need.

Proof of Theorem 3.4
To bound the term $|\Psi v|_\infty$ in the $\ell_1$ sensitivity, we use the $s$-comprehensive property. For any $v \in C(s, \alpha)$, by the symmetry of the $s$-comprehensive property we can assume without loss of generality that $|v_1| \ge |v_2| \ge \dots \ge |v_p|$. Then, if $S$ denotes the first $s$ components, $\alpha|v_S|_1 \ge |v_{S^c}|_1$.
Consider the sign pattern of the top $s$ components of $v$: $\varepsilon = (\mathrm{sgn}(v_1), \dots, \mathrm{sgn}(v_s))$. Since $\Psi$ is $s$-comprehensive, it has a row $w$ with matching sign pattern. Then we can compute $\langle w, v\rangle = \sum_{i\in S} w_iv_i = \sum_{i\in S}|w_i||v_i|$. Hence the inner product is lower bounded by $\min_{i\in S}|w_i|\sum_{i\in S}|v_i| \ge c\sum_{i\in S}|v_i|$.
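Putting the pieces together, the $\ell_1$ sensitivity constant can be tracked explicitly. This is a sketch: here $c$ is the lower bound on the magnitudes $|w_i|$ from the $s$-comprehensive property, and the last inequality uses the cone condition $\alpha|v_S|_1 \ge |v_{S^c}|_1$:

```latex
|\Psi v|_\infty \;\ge\; \langle w, v\rangle \;\ge\; c\,|v_S|_1
\;\ge\; \frac{c}{1+\alpha}\,|v|_1 ,
\qquad\text{hence}\qquad
\frac{s\,|\Psi v|_\infty}{|v|_1} \;\ge\; \frac{c\,s}{1+\alpha} .
```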

Proof of Claims in Examples 1, 3
We bound the $\ell_1$ sensitivity for the two specific covariance matrices $\Sigma$. For the diagonal matrix in Example 1, with entries $d_1, \dots, d_p > 0$, we have $m = |\Sigma v|_\infty = \max(|d_1v_1|, \dots, |d_pv_p|)$. Then, summing $|v_i| \le m/d_i$ for $i$ in any set $S$ of size $s$, we get $|v_S|_1 \le m\sum_{i\in S}1/d_i$. To bound this quantity for $v \in C(s, \alpha)$, let $S$ be the subset of dominating coordinates, for which $|v_{S^c}|_1 \le \alpha|v_S|_1$. It follows that $|v|_1 \le (1+\alpha)|v_S|_1 \le (1+\alpha)m\sum_{i=1}^s 1/d_{(i)}$, where $d_{(1)}, \dots, d_{(s)}$ denote the $s$ smallest entries, arranged from the smallest to the largest. The harmonic average in the lower bound can be bounded away from zero even when several of the $d_i$ are of order $O(1/s)$. For instance, if $d_{(1)} = \dots = d_{(k)} = 1/s$ and $d_{(k+1)} > 1/c$ for some constant $c$ and integer $k < s$, then the $\ell_1$ sensitivity is at least $s|\Sigma v|_\infty/|v|_1 \ge 1/[(1+\alpha)(k + (1-k/s)c)]$, which is bounded away from zero whenever $k$ is bounded. In this setting the smallest eigenvalue of $\Sigma$ is $1/s$, so of all the regularity properties only the $\ell_1$ sensitivity holds.
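As a concrete check of the last bound, with hypothetical numbers $s = 10$, $k = 1$, $c = 2$, $\alpha = 1$ (so a single diagonal entry of size $1/10$ and the rest above $1/2$):

```latex
\frac{s\,|\Sigma v|_\infty}{|v|_1}
\;\ge\; \frac{1}{(1+\alpha)\big(k + (1-k/s)\,c\big)}
\;=\; \frac{1}{2\,(1 + 0.9\cdot 2)}
\;=\; \frac{1}{5.6} \;\approx\; 0.18 ,
```

so the $\ell_1$ sensitivity is of constant order even though the smallest eigenvalue of $\Sigma$ is only $1/s = 0.1$.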
The coordinate $v_1$ can be bounded analogously.

Proof of Proposition 3.5
For the first claim, note that $(Z')^\top X'v = Z^\top X(Mv)$. If $v$ is any vector in the cone $C(s, \alpha)$, we have $Mv \in C(s', \alpha')$ by the cone-preserving property. Hence, by the $\ell_q$ sensitivity of $X, Z$, $s^{1/q}|n^{-1}Z^\top X(Mv)|_\infty/|Mv|_q \ge \gamma$. Multiplying this by $|Mv|_q \ge c|v|_q$ yields the $\ell_q$ sensitivity for $X', Z'$. For the second claim, we write $(Z')^\top X'v = MZ^\top Xv$. By the $\ell_q$ sensitivity of $X, Z$, for all $v \in C(s, \alpha)$, $s^{1/q}|n^{-1}Z^\top Xv|_\infty/|v|_q \ge \gamma$. Combining this with $n^{-1}|MZ^\top Xv|_\infty \ge cn^{-1}|Z^\top Xv|_\infty$ finishes the proof.
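For the first claim the resulting constant can be made explicit. This is a sketch, written with the sparsity parameter $s$ as in the display above; the $s$ versus $s'$ bookkeeping follows the statement of the proposition:

```latex
\frac{s^{1/q}\,|n^{-1}(Z')^\top X'v|_\infty}{|v|_q}
\;=\;
\frac{s^{1/q}\,|n^{-1}Z^\top X(Mv)|_\infty}{|Mv|_q}
\cdot \frac{|Mv|_q}{|v|_q}
\;\ge\; \gamma\,c ,
```

so the transformed pair $X', Z'$ inherits $\ell_q$ sensitivity with constant $\gamma c$.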