Arkadi Nemirovski1
Preface The subject of Nonparametric statistics is statistical inference applied to noisy observations of in nite-dimensional \parameters" like images and time-dependent signals. This is a mathematical area on the border between Statistics and Functional Analysis, the latter name taken in its \literal" meaning { as geometry of spaces of functions. What follows is the 8-lecture course given by the author at The XXVIII Saint-Flour Summer School on Probability Theory. It would be impossible to outline in a short course the contents of rich and highly developed area of Non-parametric Statistics; we restrict ourselves with a number of selected topics related to estimating nonparametric regression functions and functionals of these functions. The presentation is self-contained, modulo a few facts from the theory of functional spaces.
Faculty of Industrial Engineering and Management, Technion { Israel Institute of Technology, Technion City, Haifa 32000, Israel; e-mail:
[email protected] 1
Contents 1 Estimating regression functions from Holder balls 1.1 1.2 1.3 1.4
Introduction . . . . . . . . . . . . . . . . . . . . . . . Recovering a univariate Lipschitz continuous function Extension: recovering functions from Holder balls . . Appendix: proof of the Fano inequality . . . . . . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
5
. 5 . 9 . 19 . 26
2 Estimating regression functions from Sobolev balls
29
3 Spatial adaptive estimation on Sobolev balls
49
2.1 Lower bounds for the minimax risk . . . . . . . . . . . . . . . . . . . . 30 2.2 Upper bounds on the minimax risk . . . . . . . . . . . . . . . . . . . . 34 2.3 Appendix: Proofs of Theorems 2.1.1, 2.1.2 . . . . . . . . . . . . . . . . 42 3.1 3.2 3.3 3.4
Spatial adaptive estimation: the goal . . . The estimate . . . . . . . . . . . . . . . . Quality of estimation . . . . . . . . . . . . Optimality index of the adaptive estimate
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
4 Estimating signals satisfying dierential inequalities
4.1 The goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Estimating solutions of homogeneous equations . . . . . . 4.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . 4.2.2 Estimating sequences . . . . . . . . . . . . . . . . . 4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . 4.3 From sequences to functions . . . . . . . . . . . . . . . . . 4.3.1 Spatial adaptive estimate: preliminaries . . . . . . 4.3.2 Spatial adaptive estimate: construction and quality 4.3.3 \Frequency modulated signals" . . . . . . . . . . . 4.4 Appendix: Proofs of Lemmas 4.2.1, 4.3.1 . . . . . . . . . . 4.4.1 Proof of Lemma 4.2.1 . . . . . . . . . . . . . . . . . 4.4.2 Proof of Lemma 4.3.1 . . . . . . . . . . . . . . . . .
5 Aggregation of estimates, I
5.1 Motivation . . . . . . . . . . . . . 5.2 The problem and the main result 5.2.1 Aggregation problem . . . 5.2.2 The recovering routine . . 5.2.3 Main result . . . . . . . . 5.2.4 \Concentration" . . . . . 3
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
49 51 55 60
73
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
73 77 77 79 87 89 90 91 93 95 95 98
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. 101 . 102 . 102 . 103 . 104 . 108
101
5.4 Application: Recovering functions from Barron's class . . . . . . . . . . 114 5.5 Numerical example: nonparametric ltration . . . . . . . . . . . . . . . 119
6 Aggregation of estimates, II
6.1 Gaussian white noise model of observations . . . . . . . . 6.2 Approximating the best linear combination of estimates . 6.3 Application: aggregating projection estimates . . . . . . 6.3.1 Linear estimates . . . . . . . . . . . . . . . . . . 6.3.2 Aggregating projection estimates . . . . . . . . . 6.3.3 The construction . . . . . . . . . . . . . . . . . . 6.4 Approximating the best of given estimates . . . . . . . .
. . . . . . .
. . . . . . .
125
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. 125 . 128 . 133 . 133 . 136 . 137 . 143
7.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1.1 Lower bounds and asymptotical eÆciency . . . . . . . 7.2 The case of once continuously dierentiable functional . . . . . 7.2.1 Whether condition (7.2.2) is sharp? . . . . . . . . . . . 7.3 Increasing smoothness of F . . . . . . . . . . . . . . . . . . . 7.3.1 The case of twice continuously dierentiable functional 7.3.2 Concluding remarks . . . . . . . . . . . . . . . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. 147 . 148 . 153 . 159 . 165 . 166 . 175
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. 177 . 178 . 179 . 183 . 184 . 186 . 188
7 Estimating functionals, I
8 Estimating functionals, II
8.1 Preliminaries: estimating polynomials . . . . . . 8.1.1 Hilbert-Schmidt polynomials . . . . . . . 8.1.2 Estimating Hilbert-Schmidt polynomials 8.1.3 Extension . . . . . . . . . . . . . . . . . 8.2 From polynomials to smooth functionals . . . . 8.2.1 Measure concentration . . . . . . . . . . 8.2.2 The estimate . . . . . . . . . . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
147
177
Chapter 1 Estimating regression functions from Holder balls 1.1 Introduction We start with brief outline of the problems we are interested in and the goals we are aimed at.
Statistical problems and estimates. A typical problem of Statistics is as follows: (*) We are given a Polish (i.e., metric, separable and complete) \space of observations" Y along with a family of Borel probability distributions ff ()gf 2F on Y ; the family is parameterized by a \parameter" f varying in a metric space F . The goal is, given an \observation " { a realization
y f
of random variable associated with an unknown f 2 F , to make conclusions about f , e.g. I. [Identi cation] To estimate f , F. [Evaluation of a functional] To estimate the value F (f ) at f of a given functional F : F ! R, H. [Hypotheses testing] Given a partition F =
N [
i=1
which element Fi of the partition f belongs,
Fi of F , to decide to
In all these problems, a \candidate solution" is an estimate { a Borel function fb(y ) on the \space of observations" Y taking values in an appropriately chosen Polish \space of answers" Z : In the case of Identi cation problem, Z = F , and fb(y) is the estimated value of f; In the case of problem of evaluating a functional, Z = R, and fb(y) is the estimated value of F (f ) In the case of Hypotheses testing, Z = f1; :::; N g, and fb(y) is the (index of the) accepted hypothesis. 5
exactly, and we should be satis ed with estimates fb() which with \high" probability
are \close" to true answers. A natural way to quantify the quality of an estimate is to look at its (mean squared) risk
R(f;b f ) = Ef
n
o1=2
dist2Z (fb(y ); f(f )
;
(1.1)
where
Ef fg is the expectation w.r.t. y f ; distZ (; ) is the metric on the \space of answers" Z ; f (f ) 2 Z is the true answer.
For example In the Identi cation problem, Z = F , distZ (; ) is the metric F is equipped with, and f (f ) = f ; In the Functional Evaluation problem, Z = R, distZ (p; q) = jp qj and f(f ) = F (f ); In the Hypotheses testing problem,
0; i = j Z = f1; :::; N g; distZ (i; j ) = 1; i 6= j ; f (f ) = i for f
2 Fi :
In the latter case (1.1) is the square root of the probability to misclassify the parameter f of the distribution we observe.
Remark 1.1.1 Of course, (1.1) is not the only meaningful way to measure risk; a general scheme requires to choose a \loss function" { a nondecreasing function (t) on the nonnegative ray such that (0) = 0 { and to associate with this loss function the risk n o R(f;b f ) = 1 Ef distZ (fb(y); f(f )) : To order to simplify our considerations and notation (in our course, we shall have enough of other \parameters of situation" to trouble about), in what follows we focus on the mean square risk (1.1), i.e., on the simplest loss functions (t) = t2 . Risk (1.1) depends on the \true parameter" f , and thus cannot be used \as it is" to quantify the quality of an estimate. There are two standard ways to eliminate the dependence on f and to get a quantitative characterization of an estimate: [Bayesian approach] To take average of R(f;b f ) over a given a priori distribution of f 2 F [Minimax approach] To take the supremum of R(f;b f ) over f 2 F , thus coming to the worst-case risk
R(fb; F ) = supR(f;b f ) = sup Ef f 2F
f 2F
n
o1=2
dist2Z (fb(y ); f(f )
(1.2)
of an estimate fb on the \parameter set" F . In our course, we always use the minimax approach. The major reason for this choice is that we intend to work with in nitedimensional parameter sets, and these sets usually do not admit \natural" a priori distributions.
max risk
R (F ) = inf R (fb; F ) b f ()
(1.3)
{ the minimal, over all estimates, worst-case risk of an estimate.
Nonparametric regression problems. In the \parametric" Statistics, the parameter set F is nite-dimensional: F Rk (\the distribution is known up to nitely many parameters"). In the Nonparametric Statistics, the parameter set F is in nitedimensional { typically, it is a compact subset of certain functional space, like the space C ([0; 1]d ) of continuous functions on the unit d-dimensional cube. Typical generic examples are as follows:
Nonparametric regression estimation problem: (R) Recover a function f : [0; 1]d ! R known to belong to a given set F C ([0; 1]d) via n noisy observations y = y f = fyi = f (xi ) + i ; i = 1; :::; ng (1.4) of the values of the function along n given points xi 2 [0; 1]d ; here fi gni=1 is the observation noise.
Nonparametric density estimation problem: (D) Recover a probability density f on [0; 1]d known to belong to a given set F C ([0; 1]d ) via n-element sample of independent realizations fxi f gni=1. In our course, we will focus on the Nonparametric regression estimation problem and related problems of estimating functionals of a \signal" f via observations (1.4). In order to get a particular \instance" of generic setting (R), we should specify the following \data elements": 1. The grid fxi gni=1 Options: n-point equidistant grid; sample of n independent realizations of random vector with known/unknown distribution; ... 2. Type of noises fi gni=1 Options: independent N (0; 1)-random variables; independent identically distributed random variables with known/unknown distribution; dependent, in a prescribed fashion, random variables; ...
Options: a subset of C ([0; 1]d) comprised of functions satisfying certain smoothness conditions; ... 4. The metric used to measure risk
In our course, we measure recovering errors in the standard k kq -norms 8 0 > > Z > > > < B @
k g kq = >> > > > :
11=q
[0;1]d
jg(x)jq dxCA ;
max jg (x)j;
x2[0;1]d
1q<1
q=1
The risks associated with these norms are called q -risks.
It would be too ambitious for a single course to be aimed at achieving \maximal generality" with respect to all these \data elements". Our decision will be in favour of \generality in the classes of signals F " rather than \generality with respect to the schemes of observations". Indeed, what makes the major dierence between the parametric and the nonparametric statistics, is exactly the \nontrivial" in nite-dimensional geometry of the parameter set, and it is natural to focus rst of all on the role of this geometry, not complicating things by considering \diÆcult" observation schemes. Speci cally, the main part of the results to be presented deals with the simplest observation scheme, where the observations are taken along an equidistant grid, and the observation noises are independent N (0; 1)1) .
The asymptotic approach. After all \data elements" of the Regression estimation problem (recall that this is the problem we focus on) are speci ed, our \ideal goal" becomes to nd the optimal, for a given volume n of observations (1.4), estimate { the one yielding the minimax risk. As a matter of fact, this goal never is achievable { we do not know what is the optimal in the minimax sense estimate even in the simplest { parametric! { problem \Recover a real f f + i , i N (0; 1)".
2F
= [0; 1] via n independent observations yi =
Thus, we are enforced to simplify our goal, and the standard \simpli cation" is to x all data elements except the volume of observations n, to treat n as a varying \large parameter" and to speak about asymptotically optimal in order/ asymptotically eÆcient estimation methods de ned as follows. When n is treated as a varying parameter, The minimax risk becomes a function of n:
R (n; F ) = binf En;f fn ()
n
dist2 (fbn (y f ; ) f ())
o1=2
;
Most of these results can be more or less straightforwardly extended to the case of more general schemes of observations, but all these extensions are beyond the scope of the course. 1)
inf is taken over the set of all possible estimates of f via n observations (1.4), i.e., all Borel functions fb(x; y ) : [0; 1]d Rn ! R En;f is the expectation over yf . A candidate solution to the Regression estimation problem becomes an estimation method { a sequence of estimates n
fbn (; ) : [0; 1]d Rn ! R
o1
n=1
indexed by volumes of observations used by the estimates; Our goal becomes either to nd an asymptotically eÆcient estimation method:
R(fbn ; F ) = (1 + o(1))R(n; F ); n ! 1; or, which is more realistic, to nd an optimal in order estimation method:
R(fbn; F ) O(1)R(n; F ); n ! 1: In our course, we focus primarily on building optimal in order estimation methods for the Regression estimation problem and asymptotically eÆcient estimation of functionals of a regression function. The only situation we are interested in is when consistent estimation is possible { i.e., when the minimax risk itself converges to zero as n ! 1. Note that the latter assumption is satis ed only when F possesses some compactness properties (see Corollary 1.2.1), and that the rate of convergence of the minimax risk to 0 as n ! 1 heavily depends on the geometry of F (and sometimes { on the metric used to measure the estimation error). These phenomena are characteristic for Nonparametric Statistics and re ect its \combined" (Statistics + Geometry of functional spaces) nature. Just to give an impression of a typical result on estimating a non-parametric regression function, we are about to consider the simplest problem of this type { the one of recovering functions from Holder balls. We start with the situation where the main ideas of the constructions to follow are most transparent, namely, with estimating a univariate Lipschitz continuous function, and then pass to the case of a general Holder ball.
1.2 Recovering a univariate Lipschitz continuous function The problem. Assume we are given n noisy observations yi = f (i=n) + i ; i = 1; :::; n of a function
(1.5)
f (x) : [0; 1] ! R; fig being independent N (0; 1) noises. Our a priori information on f is that f is Lipschitz continuous with a given constant L > 0. How to recover the function?
\good" solutions. We shall discuss just one of them, the so called locally polynomial, or window estimate. The construction is as follows. In order to recover the value of f at a given point x 2 [0; 1], let us choose somehow a window { a segment B [0; 1] containing x and including at least one of the observation points xi = i=n. Let us estimate f (x) via the observations from the window as if f were constant in it. The most natural estimate of this type is just the arithmetic mean of the observations from the window: 1 X fbB (x; y ) = y; (1.6) n(B ) i:xi 2B i where n(B ) stands for the number of observation points in a segment B . Recalling the origin of yi 's and taking into account that
f (x) = we get
errB (x; y )
1 X f (x); n(B ) i:xi 2B
fbB (x; y) f (x)
= dB (x) P + sB ; dB (x) = n(1B) [f (xi ) f (x)]; i:xP i 2B i : sB = n(1B)
(1.7)
i:xi 2B
We have decomposed the estimation error in two components:
deterministic dynamic error (bias) coming from the fact that f is not constant in the window,
stochastic error sB coming from observation noises and depending on window,
not on f ,
and this decomposition allows us to bound the estimation error from above. Indeed, the deterministic error clearly can be bounded as
jdB (x)j n(1B) i:xP2B jf (xi) f (x)j i n(1B) i:xP2B Ljxi xj i LjB j;
(1.8)
where jB j is the length of the segment B . Now, the stochastic error is a Gaussian random variable with the standard deviation q (1.9) n(jB j) q n(B ) njB j=2 (we have taken into account that the number of observation points in B is at least njB j=2), and we can therefore bound from above all moments of the stochastic error:
p
(EfjsB jq g)1=q O(1)n(jB j) q; q 1
absolute constants). It follows that the moments of the estimation error errB (x; y ) can be bounded as follows:
p
(EfjerrB jq (x; y )g)1=q O(1) q" (jB j); qn 2 "n(h) = Lh + nh :
(1.10)
The concluding step is to choose the window width h = jB j which results in the smallest possible "n(h). Since we do not bother much about absolute constant factors, we may just balance the \deterministic" and the \stochastic" components of "n (h):
Lh = (nh)
1=2
) h = Lpn
!2=3
:
Thus, we come to the estimation routine as follows: (Lip) Let number of observations n, noise intensity > 0 and a real L > 0 be given, and let ! 2=3 h= p : (1.11) L n In order to estimate an unknown regression function f at a point x 2 [0; 1] via observations (1.5), we { cover x by a segment Bx [0; 1] of the length h (for the sake of de niteness, let this segment be centered at x, if the distance from x to both endpoints of [0; 1] is h=2, otherwise let Bx be either [0; h], or [1 h; 1], depending on which of the points { 0 or 1 { is closer to x); { take, as an estimate of f (x), the quantity X 1 y: fbn (x; y ) = n(Bx ) i:xi 2Bx i Note that the resulting estimate is linear in observations:
fn (x; y ) = b
n X i=1
i;n (x)yi
with piecewise constant \weight functions" i;n (). It should be stressed that the above estimation routine is well-de ned only in certain restricted domain of values of the parameters L; n; . Indeed, the resulting h should not exceed 1 { otherwise the required window will be too large to be contained in [0; 1]. At the same time, h should be at least n 1 , since otherwise the window may be too small to contain even a single observation point. Thus, the above construction is well-de ned only in the case when 1
p
L n
!2=3
n:
(1.12)
Note that for any xed pair (L; ), the relation (1.12) is satis ed for all large enough values of n.
following simple
Proposition 1.2.1 Let n; L; satisfy the restriction (1.12). Whenever a regression function f underlying observations (1.5) is Lipschitz continuous on [0; 1] with constant L, the estimate fbn (; ) given by the estimation routine (Lip) satis es the relations
Efk f
1=2
kg
fbn 2q
O(1)pqL
!2
p ; 1 q < 1: L n 3
(1.13)
In particular, whenever 1 q < 1, the (worst-case) q -risk
Rq (fbn; H11(L)) =
sup
f 2H11 (L)
1=2
Efk fbn f k2q g
of the estimate fbn on the Lipschitz ball
H11(L) = ff : [0; 1] ! R j jf (x) f (x0)j Ljx x0 j 8x; x0 2 [0; 1]g can be bounded from above as
R (fb ; H1(L)) O(1)pqL q
n
1
!2
p : L n 3
(1.14)
Proof. Let q 2 [1; 1). Relations (1.10) and (1.11) imply that for every Lipschitz continuous, with constant L, function f and for every x 2 [0; 1] one has
Efjfbn(x; y) f (x)j2q g O(1)pqLh 2q )( 8" #2 9 ) < R1 = 1 R E : jfbn(x; y) f (x)jq dx ; E jfbn(x; y) f (x)j2q dx 0 0 h i O(1)pqLh 2q ) 1=2 pqLh Efk fbn f k2q g O(1)p h
i
= O(1) qL1=3 2=3 n [see (1.11)]
1=3
Bound for the 1-risk. The bounds established in Proposition 1.2.1 relate to q risks with q < 1 only; as we shall see in a while, these bounds are optimal in order in the minimax sense. In order to get a similarly \good" bound for the 1-risk, the above construction should be slightly modi ed. Namely, let us x h, 1 h n 1 , and consider an estimate of the same structure as (Lip):
fbh(x; y ) =
X 1 y; n(Bx ) i:xi 2Bx i
(1.15)
of the estimator fbh can be bounded from above by the sum of the maximal, over x 2 [0; 1], deterministic and stochastic errors:
kf
bh
f k1
(
sup jdBx (x)j
x2[0;1]
)
(
1
+
)
sup jsBx j
x2[0;1]
2
:
According to (1.8), the right hand side term fg1 does not exceed Lh. In order to evaluate the term fg2 , note that every sBx is a Gaussian random variable with the zero q 2 mean and the standard deviation not exceeding n (h) = nh , see (1.9). Besides this, the number of distinct random variables among sBx does not exceed O(1)n2 (indeed, every stochastic error is the arithmetic mean of several \neighbouring" observation noises i; i+1 ; :::; j , and there are no more than n(n + 1)=2 groups of this type). It follows that n o E fg22 O(1)n2 (h) ln n; whence p # " n o1=2 ln n h 2 b E k f f k1 O(1) Lh + p : (1.16) nh p
ln n , Choosing h which balances the \deterministic" and the \stochastic" terms Lh, pnh respectively, we get p ! ln n p h= : (1.17) L n 2 3
Denoting by fbn1 () the estimate (1.15), (1.17) and applying (1.16), we get the following risk bound:
R1(fbn1; H11(L)) sup1 f 2H1 (L)
E
n
k fb1()
f () k1 2
n
o1=2
p
O(1)L Lplnnn
!2 3
: (1.18)
Note that the construction makes sense only when h given by (1.17) belongs to the segment [n 1 ; 1], i.e., when 1
p
L n p ln n
!2=3
n;
(1.19)
which for sure is the case for all large enough values of n. Note that the q -risks of the estimate fbn1(), 1 q < 1, are worse than those of the estimate fbn by a logarithmic in n factor only; similarly, the 1-risk of the estimate fbn is only by a logarithmic in n factor worse than the 1-risk of the estimate fbn1 .
Lower bounds. We have build two estimates fbn , fbn1 for recovering a Lipschitz
continuous, with a known constant, function from observations (1.5). It is time now to demonstrate that these estimates are optimal in order in the minimax sense: 8 0 x h=2 < [0; h]; 2) Same as in (Lip), Bx = : [x h=2; x + h=2]; h=2 x 1 h=2 . [1 h; 1 1 h=2 x 1
the minimax q -risk of estimating functions from bounded from below as
R (n; H1(L)) O(1)L q
1
H11 (L) via observations (1.5) can be
!2
p : L n 3
(1.20)
For every xed > 0, for every triple L; ; n satisfying the assumption
n
p
L n p ln n
!2 3
n
(1.21)
(cf. (1.19)), the minimax 1-risk of estimating functions from H11 (L) via observations (1.5) can be bounded from below as
p
R1(n; H11 (L)) C ()L Lplnnn
!2 3
(1.22)
(C () > 0 depends on only). Consequently, in the case of (1.12) the estimate fbn is minimax optimal, up to a factor depending on q only, with respect to q -risk, 1 q < 1, on the set H11 (L). Similarly, in the case of (1.21) the estimate fbn1 is minimax optimal, up to a factor depending on only, with respect to the 1-risk on the same set.
The proof of this Proposition, same as basically all other lower bounds in regression estimation, is based on information-type inequalities. It makes sense to summarize these arguments in the following statement:
Proposition 1.2.3 Let
L be a space of real-valued functions on a set X , and (f; g) be a metric on the functional space L; F be a subset of the space L; Xn be an n-point subset of X ; FN = ff1 ; :::; fN g be an N -element subset of F ; be a positive real. Given the indicated data, let us set
Resolution(FN ) = min f(fi ; X fj ) j 1 i < j N g ; 1 Diameter(FN jXn ) = 2 1max jfi(x) fj (x)j2 ij N x2X n and assume that
Diameter(FN jXn ) <
2
1 ln(N 2
1) ln 2 :
(1.23)
yf = fyf (x) = f (x) + x gx2Xn ; = fxgx2Xn being a collection of independent N (0; 1) noises, and let f~ be an arbitrary estimate3) . Then the worst-case -risk
R (f~; F ) sup Ef(f (); f~(; yf ))g f 2F
of the estimate f~ on F can be bounded from below as
R(f~; F ) 14 Resolution(FN ):
(1.24)
Corollary 1.2.1 Let L be a space of real-valued functions on a set X , be a metric
on L and F be a subset of L. Assume that functions from F are uniformly bounded and that F is not pre-compact with respect to : there exists a sequence ffi 2 Fg1 i=1 and " > 0 such that (fi ; fj ) " for all i 6= j . Then F does not admit consistent estimation: for every sequence fXn Xg1 n=1 of nite subsets of X , Card(Xn ) = n, the minimax -risk
R (n; F ) inf sup Ef(f (); f~(; yf ))g f~ n f 2F
2 F via observations yf = fyf (x) = f (x) + x gx2Xn remains bounded away from 0 as n ! 1: of estimating f
R (n; F ) 41 ":
[i N (0; 1) are independent]
(1.25)
Proof. Under the premise of Corollary there exist subsets FN F of arbitrary large cardinality N with Resolution(FN ) " and bounded, by a constant depending on F only, Diameter(FN jXn ) (since all functions from F are uniformly bounded). It follows that for every n we can nd FN F satisfying (1.23) and such that the associated
lower bound (1.24) implies (1.25). Proof of Proposition 1.2.3. Consider N hypotheses Hi , i = 1; :::; N , on the distribution of a random vector y 2 Rn; according to i-th of them, the distribution is the one of the vector yfi , i.e., n-dimensional Gaussian distribution Fi () with the covariance matrix 2 I and the mean fi , fi being the restriction of fi onto Xn . Assuming that there exists an estimate f~ which does not satisfy (1.24), let us build a routine S for distinguishing between these hypotheses: 3) Here an estimate is a function f~(x; y ) : X Rn ! R such that f~(; y) 2 L for all y 2 Rn and the function (f (); f~(; y)) is Borel in y for every f 2 F
check whether there exists i N such that 1 (f y (); fi ()) < Resolution(FN ): 2 If it is the case, then the associated i is uniquely de ned by the observations (by de nition of Resolution), and we accept the hypothesis Hi , otherwise we accept, say, the hypothesis H1 . Note that since f~ does not satisfy (1.24), then for every i N the probability to accept hypothesis Hi if it indeed is true is 1=2 (recall that FN F and use the Tschebyshev inequality). On the other hand, the Kullback distance !
Z
j (y ) dFj (y ) K(Fi : Fj ) ln dF ( y ) dF i n R between the distributions Fi and Fj is at most 2 Diameter(FN jXn): ! ( ) Z k y fi k22 k y fj k22 k y fj k22 n= 2 n K(Fi : Fj ) = (2 ) exp dy 2 2 2 2 ! ( ) Z k z k22 k z [fi fj ] k22 k z k22 n= 2 n = (2 ) exp dz 2 2 2 2 ) ! ( Z k fi fj k22 2zT [fi fj ] (2) n=2 n exp k z k22 dz = 2 2 2 2 k fi fj k = 2 : 2
2 2
It remains to make use of the following fundamental Theorem 1.2.1 [Fano's inequality, Fano '61] Let ( ; F ) be a Polish space with the Borel -algebra, let F1 ; :::; FN be N mutually absolutely continuous probability distributions on ( ; F ). Let also
Z
K(Fi : Fj ) = ln
dFj (! ) dFj (! ) dFi (! )
be the Kullback distance from Fj to Fi , and let
K = max K(Fi : Fj ): i;j Given a positive integer m, consider N hypotheses on the distribution of a random point ! m 2 m , i-th of the hypotheses being that the distribution is Fim (i.e., that the entries !1 ; :::; !m of ! m are mutually independent and distributed according to Fi ). Assume that for some reals Æi 2 (0; 1), i = 1; :::; N , there exists a decision rule { a Borel function
D : m ! [1; N ] = f1; 2; :::; N g { such that the probability to accept i-th hypothesis if it indeed is true is at least Æi : Fim (f! m : D (! m ) = ig) Æi ; i = 1; :::; N:
mK
=
P Pi
i
p(i) ln p(i)
(1
p(i)Æ (i):
In particular,
mK
=
P
i
p(i)Æ (i)) ln(N
ln(N 1 P Æ (i): N i
1) ln 2;
1) ln 2;
(1.26)
(1.27)
As we have seen, for the routine S we have built the probabilities to accept every one of the hypotheses Hi if it is true are at least 1=2. Besides this, we have seen that for the hypotheses in question K 2 Diameter(FN jXn). Applying (1.27) with m = 1, we get 1 2 Diameter(FN jXn ) ln(N 1) ln 2; 2 which is impossible by (1.23). Proof of Proposition 1.2.2. A. In order to prove (1.20), let us x q 2 [1; 1) and specify the data of Proposition 1.2.3 as follows:
L = Lq [0; 1], (f; g) =k f g kq ; F = H11 (L); Xn = fi=n; i = 1; :::; ng. It remains to de ne a candidate to the role of FN . To this end let us choose somehow a positive h < 1 (our choice will be speci ed later). Note that we can nd a collection of 1 M = M (h) 2h non-overlapping segments Bl 2 [0; 1], l = 1; :::; M , of the length h each. Now consider functions f as follows:
f is zero outside
M [ l=1
Bl , and in every segment Bl = [xl
the function is either L[0:5h
h=2; xl + h=2]
jx xl j], or L[0:5h jx xl j].
It is clear that there exist 2M (h) distinct functions of this type, and all of them belong to H11 (L). Moreover, it is easily seen that one can nd a collection FN (h) = ff1 ; :::; fN g comprised of N (h) 2O(1)M (h) (1.28) functions of the indicated type in such a way that for distinct i; j the number n(i; j ) of those segments Bl where fi diers from fj is at least O(1)M (h). It is immediately seen that the latter property implies that
i 6= j )k fi so that
fj kq O(1)Lh(O(1)M (h)h)1=q O(1)Lh; Resolution(FN (h) ) O(1)Lh:
(1.29)
distance between any two functions fi ; fj does not exceed Lh, hence Diameter(FN (h) jXn ) L2 h2 n:
(1.30)
In view of (1.28), for N = N (h) the right hand side of (1.23) is at least O(1) 2h 1 , provided that h is less than a small enough absolute constant. On the other hand, by (1.30) the right hand side of (1.23) for FN = FN (h) is at most nL2 h2 . We see that in order to ensure (1.23) for FN = FN (h) it suÆces to set h
h = O(1) min 1; n
1=3 L 2=3 2=3
i
= O(1)n
1=3 L 2=3 2=3 ;
the concluding relation being given by (1.12). In view of (1.29), with this choice of h Proposition 1.2.3 yields (1.20). B. In order to prove (1.22), one can use a construction similar to the one of A. Namely, let us set L = L1[0; 1], (f; g) =k f g k1;
F = H11 (L); Xn = fi=n; i = 1; :::; ng, choose h 2 [0; 1) and build
M (h) O(1)h 1 non-overlapping segments Bj = [xj h=2; xj + h=2] [0; 1]. Associating with j -th segment the function 0; x 62 B fj (x) = L[0:5h jx x j]; x 2 Bj ; j j we get a collection FM (h) of M (h) functions such that Resolution(FM (h) ) = 0:5Lh
(1.31)
and
Diameter(FM (h) jXn) O(1)L2 h3 n (indeed, the dierence of two functions from FM (h) is of the uniform norm at most 0:5Lh and diers from zero at no more than O(1)nh point of the grid Xn ). We see that for FN = FM (h) the left hand side in (1.23) is at most O(1)L2 h3 n, while the right hand side is at least O(1) 2 ln M (h) = O(1) 2 ln h 1 , provided that h is less than a small enough absolute constant. It follows that in order to ensure (1.23) it suÆces to choose h less than an appropriate absolute constant and satisfying the relation
L2 h3 n O(1) 2 ln h 1 : In the case of (1.21) the latter requirement, in turn, is satis ed by
p
ln n p h = d() L n
!2 3
with properly chosen d() > 0 (depending on only). With this h, Proposition 1.2.3 combined with (1.31) yields the bound (1.22).
balls The constructions and results related to recovering univariate Lipschitz continuous functions can be straightforwardly extended to the case of general Holder balls.
Holder ball Hds (L) is speci ed by the parameters s > 0 (order of smoothness), d 2 Z+ (dimension of the argument) and L > 0 (smoothness constant) and is as follows. A positive real s can be uniquely represented as s = k + ;
(1.32)
where k is a nonnegative integer and 0 < 1. By de nition, Hds (L) is comprised of all k times continuously dierentiable functions
f : [0; 1]d ! R with Holder continuous, with exponent and constant L, derivatives of order k:
jDk f (x)[h; :::; h] Dk f (x0)[h; :::; h]j Ljx x0 jjhjk 8x; x0 2 [0; 1]d8h 2 Rd: Here j j is the standard Euclidean norm on Rd , and Dk f (x)[h1 ; :::; hk ] is k-th dierential of f taken at a point x along the directions h1 ; :::; hk :
Df k (x)[h
@ k ; :::; h ] = f (x + t1 h1 + t2 h2 + ::: + tk hk ): 1 k @t1 :::@tk t =t =:::=tk =0 1
2
Note that Hd1 (L) is just the set of all Lipschitz continuous, with constant L, functions on the unit d-dimensional cube [0; 1]d .
The problem we now are interested in is as follows. Assume we are given n = md noisy observations
n
y = yf ( )h= y = f (x ) + j = (i1 ; :::; id ) i2 [1; m]d x(i ;:::;id) = (i1 =m; i2 =m; :::; id =m)T
o
(1.33)
1
of unknown regression function f ; here f g are independent N (0; 1) noises. All we know in advance about f is that the function belongs to a given Holder ball Hds (L), and our goal is to recover the function from the observations.
The recovering routine we are about to present is quite similar to the one of the previous Section. Namely, we x a \window width" h such that k+2 h 1; (1.34) m k being given by (1.32). In order to estimate the value f (x) of f at a point x 2 [0; 1]d, we choose somehow a \window" { a cube Bx [0; 1]d such that x 2 Bx and the edges of Bx are equal to h, and estimate f (x) via observations from the window as if f was a polynomial of degree k. Let us explain the exact meaning of the latter sentence.
a d-dimensional cube with edges h > 0, and let Æ = Æ Zd be the regular grid with resolution Æ > 0. Assume that h 1; (1.35) (k + 2)Æ and let BhÆ be the intersection of the cube Bh and the grid Æ . Let also Pdk be the space of all polynomials of (full) degree k of d variables. Consider the following auxiliary problem:
2 Bh, nd \interpolation weights" ! = f!(u)gu2BhÆ which reproduce the value at x of every polynomial of degree k via its restriction on BhÆ : X
(*) Given x
p(x) =
u2BhÆ
! (u)p(u)
8p 2 Pdk
(1.36)
with the smallest possible variance
k ! k22=
X
u2BhÆ
! 2 (u):
Lemma 1.3.1 Problem (*) is solvable, and its optimal solution !x is unique and continuously depends on x. Moreover,
k and
k
!x 22
k !x k1
2 (k; d) hÆ
X
u2BhÆ
!d
j!(u)j 1(k; d)
(1.37) (1.38)
with factors 1;2 (k; d) depending on k; d only.
Proof. 10 . Observe, rst, that if Gi , i = 1; :::; d, are nite sets of reals, each of the sets being comprised of li k + 1 equidistantly placed points, and Gd = G1 G2 ::: Gd ; then the only polynomial from Pdk vanishing at the grid Gd is the zero polynomial (this observation is given by a straightforward induction in d). In other words, if p1 ; :::; pN is a basis in Pdk and P is the matrix with columns being the restrictions of the basic polynomials piGd on the grid Gd :
P = [p1Gd ; :::; pNGd ]; then the kernel of the matrix P is trivial. Denoting by pb the vector of coeÆcients of a polynomial p 2 Pdk in the basis p1 ; :::; pN and observing that
pGd = P pb
8p 2 Pdk ;
we conclude that pb can be expressed, in a linear fashion, via pGd . Consequently, the value of p 2 Pdk at a given point u can also be expressed as a linear function of pGd :
9 : T pGd = p(u) 8p 2 Pdk :
system
0
PT = B @
p1 (u)
1
C A
0
p1 (u)
(1.39) pN (u) As we have seen, (1.39) is solvable, and the matrix P is with the trivial kernel; under these conditions Linear Algebra says that the matrix P T P is non-singular and that the (unique) least norm solution to (1.39) is given by
u = P (P T P )
1B @
pN (u)
1 C A
:
In particular, u is a continuous function of u. 20 . In view of (1.35), the set BhÆ is a grid of the type considered in 10 ; in view of the results of 10 , the weight vector !x is well-de ned and is continuous in x. 30 . To prove (1.37), let us come back to the situation of 10 and assume for a moment that the cardinality of every \partial grid" Gi is exactly k + 1, and the convex hull of the grid is the segment [0; 1]. In this case the norms k u k2 of the weight vectors u , being continuous functions of u, are bounded in the cube 1 ui 2; i = 1; :::; d by certain constant C1 (k; d) depending on k; d only. By evident similarity reasons we conclude that if the partial grids Gi are arbitrary equidistant grids of the cardinality k + 1 each, the parallelotope B (Gd ) is the convex hull of Gd and B + (Gd ) is the concentric to B (Gd ) three times larger parallelotope, then for the corresponding weight vectors it holds k u k2 C1(k; d) 8u 2 B +(Gd): (1.40) Let q be the largest integer such that q (k +2)Æ h; note that by (1.35) we have q 1. As we just have mentioned, the grid BhÆ is a direct product of d partial equidistant grids Gb i , and the cardinality of every one of these grids is at least q (k + 1). For every i, let us partition the grid Gb i into q mutually disjoint equidistant sub-grids Gi;l , l = 1; :::; q of cardinality k + 1 each as follows: Gi;l contains the l-th, the (l + q )th,...,the (l + kq )-th points of the grid Gb i . For every collection = (1 ; :::; d ) of integers i 2 [1; q ], we can build the d-dimensional grid
Gd = G1;
1
G2; ::: Gd;d : 2
By construction, all q d d-dimensional grids we can get in this way from q d distinct collections are mutually disjoint and are contained in BhÆ . Moreover, it is easily seen that every one of the parallelotopes B + (Gd ) contains Bh . As we just have seen, for every there exists a representation
p(x) = with
X
u2Gd
X
u2Gd
(u)p(u)
8p 2 Pdk
2 (u) C12 (k; d):
p(x) =
X
u2BhÆ
! (u)p(u)
1XX (u)p(u): q d u2Gd
The variance of the resulting interpolation weights clearly is 1 2 1 XX 2 ( u ) C (k; d) C2 (k; d)(Æ=h)d q 2d u2Gd qd 1 (we have used (1.40) and the fact that q O(1)hÆ 1(k + 2) 1 ). Since the variance of the optimal interpolation weights (those coming from the optimal solution to (*)) cannot be worse than the variance of the weights we just have built, we come to (1.37). It remains to note that (1.38) follows from (1.37) in view of the Cauchy inequality.
Window estimates. The simple technique for estimating polynomials we have de-
veloped gives rise to a useful construction we shall use a lot { the one of a window estimate of f (x) via observations (1.33). For a given volume of observations n, such an estimate is speci ed by its order k (which is a nonnegative integer) and a window B (recall that this is a cube containing x and contained in [0; 1]d with the edges of a length h satisfying (1.34)) and is as follows. Let B n be the intersection of B with the observation grid. In view of Lemma 1.3.1, problem (*) associated with the data x; Bh = B; BhÆ = B n ; k is solvable; its optimal solution is certain collection of weights n
o
! !xB = !B (x) j : x 2 B : The order k window estimate of f (x) associated with the cube B is
fb(x; y ) fbnB (x; y ) =
X
:x 2Bx
!B (x)y :
(1.41)
The following proposition summarizes some useful properties of window estimates.
Proposition 1.3.1 Let x 2 [0; 1]d , k, n be given, and let B be a window for x. Given a continuous function f : [0; 1]d 7! R, let us de ne k (f; B ) as the smallest uniform error of approximating f in B by a polynomial of degree k: k (f; B ) = mink max jf (u) p(u)j: u2B p2Pd
Then the error of the order k window estimate of f (x) associated with the window B can be bounded as follows: "
jfn (x; yf ( )) f (x)j Ok;d(1) k (f; B ) + pn D bB
j
#
j ;
d=2 (B ) B ( ) x
where
D(B ) is the edge of the cube B ; xB ( ) is a linear combination of the noises fg with variance 1.
(1.42)
rameter(s) listed in the subscript. Furthermore, let n
n = n ( ) = sup jxB ( )j j x 2 [0; 1]d ; B is a window for x
o
p
Then the random variable n is \of order of ln n":
p
n
8w 1 :
o
Prob n > Ok;d(1)w ln n
exp
(
)
w2 ln n : 2
(1.43)
Proof. Let n(B ) be the number of observation points in the window B and p() be a polynomial of degree k such that max jf (u) p(u)j = k (f; B ): u2B We have = =
xB
B jf (x) fbX n (x; yf ( ))j X jf (x) !B (x) [p(x ) + [f (x ) p(x )]] + !B (x) j :x 2B :x 2B X X jf (x) p(x) + !B (x) [f (x ) p(x )] + !B (x) j :x 2B
[by (1.36)] jf (x) p(x)j + h
X
:x 2B
j
:x 2B i !xB 1 +
k (f; BX ) 1+ k k = k!xB1 k !B (x) : 2
X
jjf (x) p(x )j + j :x 2B k !xB k2 jxB j;
!B (x)
!B (x)
j
(1.44)
:x 2B
By Lemma 1.3.1 (applied with Æ = n
1=d )
one has
k !xB k1 1(k; d); k !xB k2 2 (k; d)n
1=2 D d=2 (B );
and (1.44) implies (1.42). The proof of (1.43) is left to the reader; we just indicate that the key argument is that, as it is immediately seen from the proof of Lemma 1.3.1, for xed B the weights !B (x) are polynomials of x of degree k.
From estimating polynomials to estimating functions from Holder balls. Let us estimate f () at every point by a window estimate, all windows being of the same size; the underlying length of window edges { the \window width" h { is the parameter of our construction. Let us specify somehow the correspondence x 7! Bx , Bx being the window used to estimate f (x); we may, e.g., use the \direct product" of the rules used in the univariate case. Let fbh (; y ) denote the resulting estimate of the regression function f underlying observations y (see (1.33)).
B [0; 1]d we clearly have k (f; B ) Ok;d(1)LDs(B )
(1.45)
(the right hand side is just the standard upper bound for the error, on B , of approximating f by its Taylor polynomial of the degree k taken at a point from B ). From this observation and (1.44) it follows that for the window estimate fbh (x) we have
"
Ok;d(1)pq Lhs + p
1=q
Efjf (x) f (x; y)j g bh
q
nhd
#
;
(1.46)
provided that h satis es (1.34). in Now let us choose the window width h which balances the terms Lhs and pnh d the right hand side of (1.46): !2=(2s+d)
p
h=
L n
:
(1.47)
Assuming that the resulting h satis es (1.34), i.e., that
p
L n 1<
!
d s d
2 2 +
(k + 2) dn;
(1.48)
(cf. (1.12); note that for every pair of (positive) L; this relation is satis ed by all large enough values of n), we come to certain estimate, let it be denoted by fbn (x; y ). In view of (1.46), the q -risk of this estimate on Hds (L) can be bounded as follows:
Rq (fn; H b
s d (L))
Os;d(1)pqL Lp n
!
s s d
2 2 +
:
(1.49)
Note that our estimate, same as in the univariate case, is linear in observations:
fbn (x; y ) =
X
;n(x)y :
Bound for 1-risk. When interested in the estimate of the outlined type with 1-risk being as small as possible, we should choose the window width h in a way
slightly dierent from (1.47) (same as we did so in the previous Section). Indeed, for f 2 Hds (L), the uniform risk of the estimate fbh, in view of (1.45), (1.44), can be bounded as " # s k f () fb(; y) k1 Ok;d(1) Lh + pn n : (1.50) p As we know from (1.43), the \typical values" of n are of order of ln n. Consequently, a reasonable choice p of h should balance the \deterministic term" Lhs and the \typical 1 = 2 value" n ln n of the \stochastic term" in the right hand side of (1.50). We come to the choice p !2=(2s+d) ln n p h= : (1.51) L n
p
!
d
2 2 +
L n sd 1< p < (k + 2) d n; (1.52) ln n and let us denote the resulting estimate fbn1 . From (1.43) combined with (1.50) we get the following bound on the 1-risk of the estimate on Hds (L):
p
R1 (fbn1; Hds (L)) Os;d(1)L Lplnnn
! s2+s d 2
:
(1.53)
Lower bounds on the minimax q -risks of recovering functions from a Holder ball Hds (L) are given by essentially the same reasoning as in the particular case considered in the previous Section; they are particular cases of the bounds from Theorem 2.1.1 proved in Section 2.3. We come to the result as follows: Theorem 1.3.1 For every collection d; s; L; ; n = md satisfying (1.48) and every q 2 [1; 1) the minimax q -risk of estimating functions from Hds (L) via observations (1.33) can be bounded from below as
!
s s d
2 2 +
p : (1.54) L n For every xed > 0, for every collection s; d; L; ; n = md satisfying the assumption Rq (n; Hds (L)) Os;d(1)L p
!
d
2 2 +
s d (k + 2) dn (1.55) Lp n ln n (cf. (1.52)), the minimax 1-risk of estimating functions from Hds (L) via observations (1.33) can be bounded from below as
n
p
s
!
2 2 +
s d R1(n; Hds (L)) O;s;d(1)L Lplnnn : (1.56) Consequently, in the case of (1.48) the estimation method ffbn gn given by (1.41), (1.47) is minimax optimal on Hds (L) with respect to q -risk, 1 q < 1, up to a factor depending on s; d; q only. Similarly, in the case of (1.55) the estimate fbn1 is minimax optimal on Hds (L) with respect to the 1-risk up to a factor depending on s; d; only. As a corollary, we get the following expressions for the minimax risks of estimating functions from Holder balls Hds (L) via observations (1.33): For all large enough values of n (cf. (1.48), (1.55)), one has
Rq (n; Hds (L))
s d
2 2 +
= Os;d;q L Lp n s 1 q < 1; p R1(n; Hds (L)) = Os;d; L Lplnnn
;
s s d
2 2 +
(1.57)
(From now on, we write f (n) = O (g (n)), being a collection of parameters, if both f (n)=g (n) and g (n)=f (n) admit upper bounds depending on only.) Note that the estimates underlying the upper bounds in (1.57) can be chosen to be linear in observations.
The proof of the Fano inequality is given by the following two lemmas. Lemma 1.4.1
Let f (i; j )gN i;j =1 be a probability distribution on p(i)
=
q (j )
=
N P j =1 N P i=1
[1; N ]2 , let
(i; j ); i = 1; :::; N (i; j ); j
= 1; :::; N
be the associated marginal distributions, and let N X
= Then I [ ]
Proof.
) (i; j ) ln p(i()i;j q(j ) i;j P
i P i
We have I [ ]
P
= =
i=1
(i; i):
p(i) ln p(i)
(1 ) ln(N p(i) ln p(i) (1 ) ln(N
Ph
1) + [(1 ) ln(1 ) + ln ] 1) ln 2: i
) (i; j ) ln p(i) + (i; j ) ln q((i;j j) i;j P P ) p(i) ln p(i) + (i; j ) ln q((i;j j ) i i;j P p(i) ln p(i) i X (i; j ) + min (i; j ) ln ; (j ) (;);()2B i;j B= f(i; j ) 0gNi;j=1; f(j ) 0gNj=1 j (i; i) = (i; i); P
i
The function p ln pq
4)
(i; j ) = (j )
8j;
P
j
(1.58)
(1.59)
(j ) = 1 :
is convex and lower semicontinuous in p; q 0, so that the function f (; ) =
X
i;j
(i; j ) ln
(i; j ) (j )
is convex on the convex set B . To compute its minimum on B , let us x f(j ) (j; j )gNj=1 P with (j ) = 1 and minimize f (; ) over those for which (; ) 2 B . Due to the separable j structure of f , this minimization results in min f (; ) = :(;)2B 4)
X
j
min
P
X
f(i)0gNi=1 :(j )=(j;j ); (i)=(j ) i i
(i) ln
By de nition, 0 ln 0q = 0 for all q 0 and p ln p0 = +1 whenever p > 0
(i) : (j )
X
min P (i) ln f(i)0gNi=1 :(j )=(j;j ); (i)=(j ) i i is given by
(
(i) =
so that g ( )
min f (; ) = :(;)2B
(i) (j )
(j; j ); i = j 5) ; (j ) (j;j ) N 1 ; i 6= j
X
[(j ) (j; j )] ln
j
(j ) (N
(j; j ) 1)(j )
+ (j; j ) ln
It remains to minimize g() over
8 <
2 B 0 = :f(j )gNj=1 j (j ) (j; j );
X
(j; j ) (j )
:
9 =
(j ) = 1 : ;
j
We claim that the required minimizer is given by 1 (j ) = (j; j ); j = 1; :::; N: Indeed, g is convex on the convex set B 0 , so that in order to verify that the above (which clearly belongs to B 0 ) minimizes g on B 0 , it suÆces to verify that the derivative of g at P is proportional to the vector of ones (i.e., to the normal to the hyperplane (j ) = 1 j containing B 0 ). We have @ @(j ) g ( )
) = ln (N(j )1)(j;j (j ) + 1 = N1 1 ;
as required. We conclude that min f (; ) = g( ) (;)2B h P = (j; j ) 1
1 ln
j
(j ) (j;j ) (j )
1 N 1
+ ln
= (1 ) ln(1 ) (1 ) ln(N
(j;j ) (j )
i
1) + ln ;
and (1.58) follows. Now let us set Hi = Fim , so that Hi is a probability distribution on ( ; F )m , and let j be the set of those !m 2 m at which D() is equal to j , so that f j gNj=1 is a partition of
into N non-overlapping Borel sets. Given a probability distribution fp(i)gNi=1 on [1; N ], let us set R (i; j ) = dHi (! m ); (i; j )
j
= p(i)(i; j ); so that (; ) is a probability distribution on [1; N ]2 . Note that by evident reasons K(Hi : Hj ) = mK(Fi : Fj ); so that K(Hi : Hj ) mK: (1.60)
Indeed, the function we are minimizing is lower semicontinuous, and it is minimized on a compact set, so that the minimum is attained. Since the set is convex and the function is convex and symmetric in f (i)gi6=j , it has a minimizer where all (i) with i 6= j are equal to each other. 5)
Proof.
P
Denoting H =
j
Hj
and
I [ ] K: dH (!m ) hj (! m ) = dHj(!m ) ,
(1.61) we get
h (!m ) hj (! m ) ln hji (!m ) dH (! m ) k k m P R hj (! m ) ln hhji ((!!m)) dH (! m ) k k ! R hj (!m ) hi (!m ) P m (j; k ) ln (j;k) hj (!m ) dH (! ) k
k
P R
K(Hi : Hj ) = =
(1.62)
[Jensen's inequality for the concave function ln()] P ( j;k ) = (j; k ) ln (i;k) : k
Thus, in view of (1.60)
mK K(Hi : Hj )
We now have I [ ]
X
(j; k ) ln
k
(j; k ) (i; k )
8i; j:
0
=
P
j;k
1
B
p(j )(j; k ) ln B @ P
i
0
=
P
P
j;k
(1.63)
p(j )(j;k) p(j )(j;i)
1
P
i
C C A
p(i)(i;k)
) A p(j )(j; k ) ln @ P p((ij;k )(i;k) i
) p(i)p(j )(j; k ) ln ((j;k i;k) i;j;k
(1.64)
[Jensen's inequality for the convex function f (t) = ln at ] p(i)p(j )mK
P
i;j
[by (1.63)]
= mK:
Combining (1.64) and (1.58), we come to (1.26); setting in the latter inequality p(i) = N1 , i = 1; :::; N , we get (1.27). Remark 1.4.1 In course of proving the Fano inequality (see (1.62), we have obtained a result which is important by itself: Let F; G be two mutually absolutely continuous probability distributions on
, and let
=
I [
i i=1 be a partitioning of into I < 1 mutually disjoint sets from the underlying b be the distributions of the \point index" -algebra. Let Fb , G ! 2 1 2 ; ! 2 2 , i(! ) = > ::: > : I; ! 2 I 8 1; > > <
induced by F , G, respectively. Then
K(Fb : Gb ) K(F : G):
Chapter 2 Estimating regression functions from Sobolev balls We have seen what are the possibilities to recover a \uniformly smooth" regression function { one from a given Holder ball. What happens if the function f in question is smooth in a \non-uniform" sense, i.e., bounds on somehow averaged magnitudes of the derivatives of f are imposed? In this case, the function is allowed to have \nearly singularities" { it may vary rapidly in small neighbourhoods of certain points. The most convenient form of a \non-uniform" smoothness assumption is that the observed function f : [0; 1]d ! R belongs to a given Sobolev ball Sdk;p(L).
A Sobolev ball Sdk;p (L) is given by four parameters: positive integer k { order of smoothness,
positive integer d { dimensionality, p 2 (d; 1], L > 0,
and is comprised of all continuous functions f : [0; 1]d ! R such that the partial derivatives of order k of f (understood in the sense of distributions) form a usual vector-function Dk f () with k Dk f () kp L: It is known [2] that functions f 2 Sdk;p (L) are (k 1) times continuously dierentiable (this is ensured by the restriction p > d), and we denote by Ds f () the vectorfunction comprised of partial derivatives of order s < k of a function f 2 Sdk;p (L). Note that Holder balls Hds (L) with integer s are essentially the same as Sobolev balls Sds;1(L).
The problem we are interested in is as follows. Given n = md observations n
y yf"( ) = y = f (x ) + j = (i1 ; :::; id ) 2 [1# ; m]d x(i ;:::;id) = (i1 =m; i2 =m; :::; id =m)T ; = fg : are independent N (0; 1) 1
29
o
(2.1)
S = Sdk;p(L), we want to recover f along with its partial derivatives D() f = of orders jj
d P i=1
i k
@ jj f @x1 :::@xd d 1
1.
Notation and conventions. In what follows, for an estimate f^n;() (x; y ) of D() f via observations (2.1), we denote by
n
Rq;() (f^n;(); S ) = sup E k f^n;()(; y) D() f () k2q
o1=2
f 2S
the q -risk of the estimate on the Sobolev ball in question; here 1 associated minimax risk is de ned as
q 1.
The
Rq;() (n; S ) = f^inf Rq;() (f^n;() ; S ) n; ( )
Below we deal a lot with the parameters p; q 2 [1; 1] (coming from the description of the Sobolev ball and the risk we are interested in, respectively); let us make the convention to denote 1 1 = ; = : p q We call a cube a subset B [0; 1]d of the form fx j [0 ] ai xi ai + h [ 1]; i 2 [1; d]g and denote by D(B ) = h the edge length, and by jB j = hd the d-dimensional volume of such a cube B . For a collection k; d 2 N; p 2 (d; 1]; q 2 [1; 1] and l 2 [0; k 1] let
l (p; k; d; q ) =
( k l
2k+d ; k l+d d ; 2k 2d+d
22kl++dd ; 22kl++dd
(2.2)
when the parameters p; k; d; q are clear from the context, we shorten l (p; k; d; q ) to l . We denote by A the set of the admissible for us values of the parameters p; k; d, i.e., A = f(p; k; d) j k; d 2 N; p 2 (d; 1]g: In what follows we denote by C (perhaps with sub- or superscripts) positive quantities depending on k; d only, and by P (perhaps with sub- or superscripts) { quantities 1 depending solely on (p; k; d) 2 A and nonincreasing in p. Finally, j j stands both for the absolute value of a real and the Euclidean norm of a vector.
2.1 Lower bounds for the minimax risk The lower bounds for the minimax risk are given by the following
be given. Assume that the volume of observations n is large enough, namely,
p
L n 1 : (2.3) Then the minimax q -risk of estimating D() f for functions f from the Sobolev ball Sdk;p(L) via observations (2.1) can be bounded from below as
Rq;() (n; S ) Ok;d(1)L Lpn
!2 l (p;k;d;q)
:
(2.4)
If the volume of observations n is so large that
p
L n for some positive ", then in the case of \large" ratios q=p, namely, bound can be strengthened to n"
p
ln n p q;() (n; S ) Ok;d;"(1)L L n
R
!2 l (p;k;d;q)
(2.5) q p
22kl++dd , the lower
:
(2.6)
The proof (completely similar to the proof of the lower bounds from Section 1.2) is placed in Section 2.3.
Comments, I. The lower bounds for the minimax risk (2.4), (2.6) (which, as we
shall see in a while, are sharp in order) demonstrate the following behaviour of the minimax risk as a function of the volume of observations n:
1. For given k; d and l = jj < k there exists the \standard" asymptotics of the risk Rq;() (n; S ) which is O(n (k l)=(2k+d) ); this is the behaviour of the risk for \small" ratios q=p, namely, when 2k + d q=p = = < : 2l + d Note that the standard asymptotics is independent of p; q { i.e., of the particular norms in which we measure the magnitude of Dk f and the estimation error. Note also that in the case of l = jj = 0, i.e., when speaking about recovering the regression function itself rather than its derivatives, the standard asymptotics of risk is O(n k=(2k+d) ) { the result already known to us in the particular case of p = 1, q < 1, i.e., when the Sobolev ball in question is in fact the Holder ball Hdk (L). 2. When the ratio q=p is greater than or equal to the \critical level" asymptotics of the minimax risk becomes 0
O@
ln n n
! k l+d d 1 2
k 2d+d
A
2k+d , 2l+d
the
worse, and the worst asymptotics corresponds to q = 1 and is 0
O@
ln n n
! k l d 1 2
k 2d+d
A
:
Comments, II. We have seen that when recovering \uniformly smooth" regression
functions { those from Holder balls { an optimal in order estimate can be chosen to be linear in observations. In the case of \non-uniform" smoothness linear estimates work well in a restricted range of values of q only { essentially, when q p < 1. The exact claims are as follows: (i) The lower bounds from Theorem 2.1.1 in the case of q p < 1 can be achieved (up to independent of n factors) by properly chosen linear estimates; (ii) If 1 > q > p and q 2, no linear estimation method can achieve the rates of convergence indicated in Theorem 2.1.1. We shall check (i) in the case when our target is to recover the regression function, and not its derivatives; namely, we shall demonstrate that the order k 1 window estimate fbn (see Section 1.3) recovers functions f 2 S = Sdk;p (L) with the desired order of convergence of q -risk to 0 as n ! 1 (provided that q p) 1) . Recall that fbn uses windows of the same width h to recover f (x) at all points x. Let us specify this width as (cf. (1.47)) ! 2=(2k+d) h= p (2.7) L n and assume that n is large, namely, that 1
p
L n
!
d k d
2 2 +
(k + 2) dn
(2.8)
(cf. (1.48)). Under this assumption
h
k+2 ; n1=d
(2.9)
so that our estimate fbn is well-de ned. To bound the risk of the resulting estimate, we need the following fact from Analysis (see [2]):
Lemma 2.1.1 Let B [0; 1]d be a cube, let p 2 (d; 1], and let g 2 Sp1;d (). Then the function g is Holder continuous in B with Holder exponent 1 d 1 d=p; namely,
8x; x0 2 B :
jg(x) g(x0)j Op;d(1)jx x0 j
0 Z 1 d @
B
11=p
jDg(u)dujpA :
(2.10)
with Op;d(1) nonincreasing in p > d. 1) One can easily build optimal in order, in the case of q p, window estimates of the derivatives as well.
f
2S
k;p d (L)
0
) k 1(f; B ) Op;k;d(1)Dk
d (B ) @
Z
B
11=p
jDk f (u)jpduA
(2.11)
(B 2 is a cube, D(B ) is the edge length of B ) with Op;d(1) nonincreasing in p > d; here k 1 (f; B ) is the quality of the best uniform, on B , approximation of f by a polynomial of degree k 1, see Proposition 1.3.1. The right hand side in the inequality in (2.11) is nothing but an upper bound (given by (2.10) as applied to g = Dk 1f ) on the error of approximating f in B by its Taylor polynomial of the degree k 1, the polynomial being taken at a point from B . Now we are ready to evaluate the q -risks, q p, of the window estimate fbn on a Sobolev ball S = Sdk;p (L). Let us start with the case of q = p: Assuming f 2 S and combining the bound (1.42) from Proposition 1.3.1 and (2.11), we get [0; 1]d
jfbn(x; yf ( )) f (x)j d(x) + s(x; ); d(x) = Ok;p;d(1)Dk
0
d (B (x)) B @
Z
B (x)
j B (x) j; s(x; ) = Ok;d(1) pnh d x
11=p
jDk f (u)jpduCA ;
(2.12)
here B (x) is the window used by fbn to recover f (x). Now, the function d(x) is non-random, and its p-norm can be evaluated as follows. Let us extend the function `(u) = jDk f (u)j from [0; 1]d to the entire Rd as 0 outside the unit cube. Then Z
p dp(x)dx = Ok;p;d (1)hkp p Ok;p;d
d
2
Z
Z
6 4
`p(u)du75 dx
[0;1]2d B (x) Z Zh Zh 6 kp d (1)h `p(x 4 :::
h
[since B (x) fu j xi =
3
p Ok;p;d (1)hkp d
Zh
h
:::
h
Zh Z Zh
3
u)du75 dx
h ui xi + h; i dg] `p (x
u)dx du
p = Ok;p;d (1)hkp d(2h)d `p (x)dx p Ok;p;d (1)hkpLp :
Thus,
k d(x) kp Ok;p;d(1)hk L:
(2.13)
Furthermore, for every x, $\xi^x_{B(x)}$ is an N(0,1) random variable, so that
\[
\Big\{\mathcal E\,\|s(\cdot,\xi)\|_p^2\Big\}^{1/2} \le O_{k,p,d}(1)\,\frac{\sigma}{\sqrt{n h^d}}. \tag{2.14}
\]
Relations (2.12), (2.13), (2.14) imply that
\[
\mathcal R_p(\hat f_n;S) \le O_{k,p,d}(1)\left[h^k L + \frac{\sigma}{\sqrt{n h^d}}\right],
\]
so that, with h given by (2.7), the bound
\[
\mathcal R_q(\hat f_n;S) \le O_{k,p,d}(1)\, L\left(\frac{\sigma}{L\sqrt n}\right)^{\frac{2k}{2k+d}} \tag{2.15}
\]
is valid in the case of q = p. Since the left hand side in (2.15) clearly is nondecreasing in q, it follows that the bound is valid for $1 \le q \le p$ as well. It remains to note that the right hand side of our upper bound is, up to a factor depending on k, d, p only, the same as the lower bound on the minimax risk (2.4) (look what $\beta_l$ is in the case of l = 0, $q/p \le 1$).

Now let us verify our second claim -- that in the case of $\infty > q > p$, $q \ge 2$, the q-risk of a linear estimate on a Sobolev ball $S^{k,p}_d(L)$ never is optimal in order. Let $\mathrm{Lin}_n$ be the set of all estimates linear in observations (2.1),
\[
\hat f_{n,(\alpha)}(x;y) = \sum_{\iota}\phi_{\iota,n}(x)\, y_\iota,
\]
of $D^{(\alpha)}f(\cdot)$, and let
\[
\mathcal R^{\rm Lin}_{q,(\alpha)}(n;S) = \inf_{\hat f_{n,(\alpha)}\in{\rm Lin}_n}\ \sup_{f\in S^{k,p}_d(L)}\Big\{\mathcal E\,\big\|D^{(\alpha)}f(\cdot) - \hat f_{n,(\alpha)}(\cdot;y)\big\|_q^2\Big\}^{1/2}
\]
be the associated minimax risk.

Theorem 2.1.2 Let us fix $\sigma > 0$, $(p,k,d)\in\mathcal A$, $l\in[0,k-1]$ and $(\alpha)$, $|\alpha| = l$. For every $q \in [2,\infty)$ such that $q > p$ and for all large enough volumes n of observations one has
\[
\mathcal R^{\rm Lin}_{q,(\alpha)}(n;S) \ge O_{p,k,d,q}(1)\,L\left(\frac{\sigma}{L\sqrt n}\right)^{2\gamma_l(p,k,d,q)},\qquad
\gamma_l(p,k,d,q) = \frac{k-l-d/p+d/q}{2k-2d/p+2d/q+d} < \beta_l(p,k,d,q). \tag{2.16}
\]

The proof is placed in Section 2.3. As we have just mentioned, the lower bounds on the minimax risk given in Theorem 2.1.1 are sharp in order, so that (2.16) implies that
\[
\infty > q \ge 2,\ q > p\ \Longrightarrow\ \frac{\mathcal R^{\rm Lin}_{q,(\alpha)}(n;S)}{\mathcal R_{q,(\alpha)}(n;S)} \to \infty\ \text{ as } n\to\infty;
\]
thus, for "large" q/p linear estimators cannot be optimal in order on $S^{k,p}_d(L)$, independently of whether we are interested in recovering the regression function or its derivatives. Note also that the lower bound (2.16) is valid for an arbitrary n-point observation grid, not necessarily the equidistant one.
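To make the above discussion concrete, here is a minimal univariate (d = 1) sketch of a fixed-width window estimate: at every point it fits, by least squares, a polynomial of degree k-1 to the observations in the window centered at that point and returns its value there, with the bandwidth chosen according to (2.7). This is only an illustration written by us (the function names, the test signal and the guess for L are ours), not the exact estimator analyzed in the text, which relies on the uniform-fit bound of Proposition 1.3.1.

```python
import numpy as np

def window_estimate(y, x_grid, x, h, k):
    """Order-(k-1) window estimate of f(x): least-squares polynomial fit of
    degree k-1 over the observations falling into [x - h/2, x + h/2]."""
    mask = np.abs(x_grid - x) <= h / 2
    coeffs = np.polyfit(x_grid[mask] - x, y[mask], k - 1)  # local polynomial fit
    return coeffs[-1]                                      # its value at u = 0, i.e. at x

# toy experiment: recover f from noisy equidistant observations
rng = np.random.default_rng(0)
n, sigma, k, L = 1024, 0.3, 2, 10.0        # L is just our guess for the Sobolev radius
x_grid = np.arange(1, n + 1) / n
f = lambda t: np.sin(3 * np.pi * t) * t
y = f(x_grid) + sigma * rng.standard_normal(n)

h = (sigma / (L * np.sqrt(n))) ** (2.0 / (2 * k + 1))      # bandwidth (2.7) with d = 1
h = max(h, (k + 2) / n)                                    # keep h >= (k+2)/n, cf. (2.9)
xs = np.linspace(h, 1 - h, 200)                            # stay away from the boundary
fhat = np.array([window_estimate(y, x_grid, x, h, k) for x in xs])
print("empirical L2 error:", np.sqrt(np.mean((fhat - f(xs)) ** 2)))
```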
2.2 Upper bounds on the minimax risk

In order to bound the minimax risk from above, we are about to build a particular recovering routine and to investigate its risks. In what follows, $\Gamma_n$ is the equidistant observation grid from (2.1). Let $\mathcal B_n$ be the family of all cubes $B \subset [0,1]^d$ with vertices from $\Gamma_n$, and let $n(B)$ be the number of observation points in a cube B. Let us associate with every cube $B \in \mathcal B_n$ the linear functional (the "B-average")
\[
\phi_B(g) = n^{-1/2}(B)\sum_{\iota:\ x_\iota\in B} g(x_\iota)
\]
on the linear space of functions defined on the grid $\Gamma_n$. Let us call a system $\mathbf B \subset \mathcal B_n$ normal, if it meets the following requirement:
(*) For every cube $B \subset [0,1]^d$ such that $|B| \ge 6^d n^{-1}$, there exists a cube $B' \in \mathbf B$ such that $B' \subset B$ and $|B'| \ge 6^{-d}|B|$.
Note that normal systems clearly exist (e.g., $\mathbf B = \mathcal B_n$; in fact one can build a normal system with O(n) cubes). Given observations (2.1) (which together form a function on $\Gamma_n$), we may compute all the averages $\phi_B(y)$, $B \in \mathbf B$. Consider the following optimization problem:
\[
\max_{B\in\mathbf B}\,|\phi_B(g) - \phi_B(y)| \to \min \quad\text{over } g \in S^{k,p}_d(L). \tag{2.17}
\]
It can be easily verified that the problem is solvable and that its optimal solution can be chosen to be a Borel function $\hat f_n(x;y)$ of x, y. $\hat f_n(\cdot;y)$ is exactly the estimate of f we are interested in, and we estimate the derivative $D^{(\alpha)}f(\cdot)$ of f just by the corresponding derivative
\[
\hat f_{n,(\alpha)}(x;y) = \frac{\partial^{\,l}}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}\,\hat f_n(x;y) \qquad [\,l = |\alpha| \le k-1\,]
\]
of $\hat f_n(\cdot;y)$.
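The optimization problem (2.17) is a theoretical device rather than a practical algorithm; still, its data-dependent ingredients are easy to compute. The sketch below (d = 1, all names ours) builds a dyadic system of segments with endpoints on the grid -- a system of size O(n) which satisfies the normality requirement (*) up to the choice of constants -- and evaluates the corresponding averages $\phi_B(y)$; a full implementation would then search for g in the Sobolev ball making the worst mismatch of the averages small.

```python
import numpy as np

def dyadic_segments(m):
    """All dyadic segments of [0,1] with endpoints on the grid {i/m} (m a power of 2):
    a system of size O(m), normal in the sense of (*) up to constants."""
    segs, size = [], m
    while size >= 1:
        segs += [(j * size / m, (j + 1) * size / m) for j in range(m // size)]
        size //= 2
    return segs

def B_average(y, x_grid, seg):
    """phi_B(y) = n(B)^(-1/2) * sum of observations at grid points inside B."""
    a, b = seg
    mask = (x_grid >= a) & (x_grid <= b)
    return y[mask].sum() / np.sqrt(mask.sum())

rng = np.random.default_rng(1)
m = 256
x_grid = np.arange(1, m + 1) / m
f = lambda t: np.cos(2 * np.pi * t)
y = f(x_grid) + 0.2 * rng.standard_normal(m)

segs = dyadic_segments(m)
averages = {seg: B_average(y, x_grid, seg) for seg in segs}
# the estimate of Section 2.2 would now pick g in the Sobolev ball minimizing
# max_B |phi_B(g) - phi_B(y)|; here we only report the data-dependent part
print(len(segs), "segments;  phi over [0,1]:", averages[(0.0, 1.0)])
```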
Risks of the estimates $\hat f_{n,(\alpha)}(\cdot,\cdot)$ on the Sobolev ball $S^{k,p}_d(L)$ are given by the following

Theorem 2.2.1 For every $\sigma > 0$, $(p,k,d)\in\mathcal A$, $q\in[1,\infty]$, $l\in[0,k-1]$ and $(\alpha)$, $|\alpha| = l$, and for all large enough volumes n of observations, namely, such that $n \ge P$ and
\[
n^{-\frac{k - d/p + d/2}{d}} \;\le\; \frac{\sigma\sqrt{\ln n}}{L\sqrt n} \;\le\; 1, \tag{2.18}
\]
one has
\[
\mathcal R_{q,(\alpha)}\big(\hat f_{n,(\alpha)}; S^{k,p}_d(L)\big) \le P\,L\left(\frac{\sigma\sqrt{\ln n}}{L\sqrt n}\right)^{2\beta_l(p,k,d,q)}; \tag{2.19}
\]
here $P \ge 1$ depends on k, p, d only and is nonincreasing in $p > d$. In particular (cf. Theorem 2.1.1), in the case of "large" ratios q/p,
\[
q/p \ge \frac{2k+d}{2l+d},
\]
the estimate is optimal in order:
\[
\mathcal R_{q,(\alpha)}\big(\hat f_{n,(\alpha)}; S^{k,p}_d(L)\big) \le P\,\mathcal R_{q,(\alpha)}\big(n; S^{k,p}_d(L)\big)
\]
for all large enough values of n. In the case of "small" ratios q/p,
\[
q/p < \frac{2k+d}{2l+d},
\]
the estimate is optimal in order up to a logarithmic in n factor: for all large enough values of n,
\[
\mathcal R_{q,(\alpha)}\big(\hat f_{n,(\alpha)}; S^{k,p}_d(L)\big) \le P\,(\ln n)^{\beta_l}\,\mathcal R_{q,(\alpha)}\big(n; S^{k,p}_d(L)\big).
\]
Proof. In what follows we fix p, k, d, $\sigma$, l, L, q satisfying the premise of the theorem. We write S instead of $S^{k,p}_d(L)$, and denote
\[
\mathcal S = \mathcal S^{k,p}_d = \bigcup_{L>0} S^{k,p}_d(L).
\]
Let us set
\[
\|g\|_{\mathbf B} = \max_{B\in\mathbf B}|\phi_B(g)|,
\]
and let
\[
\Theta(\xi) = \|\sigma\xi\|_{\mathbf B} \equiv \max_{B\in\mathbf B}\ n^{-1/2}(B)\Big|\sum_{\iota:\ x_\iota\in B}\sigma\xi_\iota\Big|. \tag{2.20}
\]
Our central auxiliary result is as follows:

Lemma 2.2.1 There exists $P \ge 1$ depending on p, k, d only and nonincreasing in $p > d$ such that whenever $n \ge P$ one has
\[
\forall f\in S:\quad \|\hat f_n(\cdot,y_f(\xi)) - f\|_{\mathbf B} \le 2\Theta(\xi) \tag{2.21}
\]
and
\[
\forall g\in\mathcal S,\ \forall(\alpha),\ l = |\alpha| < k:\quad
\|D^{(\alpha)}g\|_q \le P_0\max\left\{\left(\frac{\|g\|_{\mathbf B}^2}{n\,\|D^k g\|_p^2}\right)^{\beta_l}\|D^k g\|_p;\ n^{-\lambda_l}\|D^k g\|_p;\ n^{-1/2}\|g\|_{\mathbf B}\right\}, \tag{2.22}
\]
where $\beta_l = \beta_l(p,k,d,q)$ and $\lambda_l = \lambda_l(p,k,d,q) = \frac{2k - 2d/p + d}{d}\,\beta_l(p,k,d,q)$.
From Lemma 2.2.1 to Theorem 2.2.1. Assuming $f \in S$ and denoting $g(x) = f(x) - \hat f_n(x, y_f(\xi))$, we get from (2.21), and since both f and $\hat f_n(\cdot,y)$ belong to $S^{k,p}_d(L)$,
\[
\|g\|_{\mathbf B} \le 2\Theta(\xi),\qquad \|D^k g\|_p \le 2L. \tag{2.23}
\]
In view of these observations, (2.22) says that
\[
\|D^{(\alpha)}g\|_q^2 \le P_1\max\left\{L^2\left(\frac{\Theta^2(\xi)}{L^2 n}\right)^{2\beta_l};\ L^2 n^{-2\lambda_l};\ \frac{\Theta^2(\xi)}{n}\right\}.
\]
Since $\Theta(\xi)$ is the maximum of absolute values of no more than $\mathrm{Card}(\mathbf B) \le n^2$ $N(0,\sigma^2)$ random variables (see (2.20)), we get
\[
\Big\{\mathcal E\,\|D^{(\alpha)}g\|_q^2\Big\}^{1/2} \le P_2\max\left\{L\left(\frac{\sigma\sqrt{\ln n}}{L\sqrt n}\right)^{2\beta_l};\ L\,n^{-\lambda_l};\ \frac{\sigma\sqrt{\ln n}}{\sqrt n}\right\}. \tag{2.24}
\]
It is immediately seen that in the case of (2.18) the maximum in the right hand side of this bound equals $L\left(\frac{\sigma\sqrt{\ln n}}{L\sqrt n}\right)^{2\beta_l}$, so that (2.24) is the required bound (2.19).
Proof of Lemma 2.2.1. 10 . Relation (2.21) is evident: since f is a feasible solution
to the optimization problem (2.17) and the value of the objective of the problem at this feasible solution is ( ), the optimal value of the problem does not exceed ( ); consequently, by the triangle inequality
k f () f^n(; y) kB B (f; y) + B (f^(; y); y) 2( ) as claimed in (2.17). 20 . In order to prove (2.21), note rst of all that a function g approximated by a sequence of C1 functions gt in the sense that
[y = yf ( )];
2 Sk;p d
can be
k Dk gt kp!k Dk g k; k gt kB !k g kB ; k D() gt kq !k D() g kq as t ! 1; consequently, it suÆces to prove (2.22) for a C1 function g . 30 . We shall use the following well-known fact (given by embedding theorems for Sobolev spaces, see [2]):
Lemma 2.2.2 For properly chosen P3 ; P4 and for every r; q for every C1 function g : [0; 1]d ! R one has:
either and then
or where
2 [1; 1], l 2 [0; k
1],
k Dk g kp P3 k g k1;
(2.25)
k Dlg k1 P4 k g k1 ;
(2.26)
k Dl g kq P5 k g kr k Dk g k1p ; 8
= : k l d+d k d+d=r ; k
;
l+(k l)=r , k l+(k l)=r k
(2.27)
B [0; 1]d one has
P4jx yj1 d (!g; B ) 8x; y 2 B; 1=p R k p
(g; B ) = jD g(u)j du :
Dk 1 g (x) Dk 1 g (y )
(2.28)
B
40 . From (2.28) it follows that whenever B [0; 1]d , x 2 B and gs (y ) is the Taylor polynomial, taken at x, of degree k 1 of g , then maxy2B jg (y ) gx (y )j Æ
P5 [D(B )]k+Æ 1 d:
1 (g; B );
(2.29)
50 . Let us call a cube B [0; 1]d regular, if
g (B ) max jg(x)j 4P5[D(B )]k+Æ 1 (g; B ): x2B
(2.30)
Note that for a regular cube B , in view of (2.29), one has
8x 2 B :
1 max j g ( y ) g ( y ) j g (B ): x y2B 4
(2.31)
It is clearly seen that if
U = fx 2 (0; 1)d j g (x) 6= 0g; then every point x 2 U is an interior point of a regular cube B ; among these cubes, there clearly exists a maximal one (i.e., a one which is not a proper subset of any other regular cube). For every x 2 U , let us denote by Bx a maximal regular cube containing x as an interior point, and let
U0 =
[
x2U
Bx0 ;
Bx0 being the interior of the cube Bx . By the standard separability arguments, U0 =
1 [
i=1
Bx0i
for properly chosen sequence x1 ; x2 ; :::. In what follows we consider separately two cases
A. The cube [0; 1]d is not regular; B. The cube [0; 1]d is regular. 60 . For the time being, let A be the case. Since [0; 1]d is not regular, every maximal regular cube B must satisfy (2.30) as an equality. In particular, g (Bxi ) = 4P5 [D(Bxi ]k+Æ 1 (g; Bxi ); i = 1; 2; :::
(2.32)
60 .a) We start with the following Lemma (which essentially originates from Banach):
A with the following properties: Cubes from A are mutually disjoint; For every cube B 2 A0 there exists a cube B 0 2 A such that B intersects with B 0 and D(B ) 2D(B 0 ). Proof of Lemma 2.2.3:
with
Let us choose as the rst cube of A a cube B 1 2 A0 D (B 1 )
1 sup D(B ): 2 B2A0
After B 1 is chosen, we set A1 = fB 2 A0 j B \ B 1 = ;g. If A1 is empty, we terminate; otherwise, we choose a cube B 2 from the collection A1 exactly in the same manner as B 1 was chosen from A0 and set A2 = fB 2 A1 j B \ B 2 = ;g. If A2 is empty, we terminate, otherwise choose in the outlined fashion a cube B 3 2 A2 and replace A2 by A3 , and so on. As a result of this construction, we get a nite or a countable collection A of cubes B 1, B 2,...; it is immediately seen that this collection satis es the requirements of Lemma.
60 .2) For B 2 A, let U (B ) be the union of all those cubes from A0 which intersect B and have edges not exceeding 2D(B ). In view of Lemma 2.2.3, we have
U Let us choose We have
U0 r2[
[
B 2A
U (B ):
d + 2k ; 1): d
jg(x)jrdx B2A R jg(x)jr dx U (B ) P d 5 B2A jB jg^r (B ); g^(B ) = supx2U (B) jg (x)j: We claim that for every B 2 A it holds g^(B ) P6 g (B ): k g krr
=
(2.33)
R
UP
(2.34)
(2.35)
Indeed, let y 2 U (B ); then there exists B 0 2 A0 such that y 2 B 0 , B 0 \ B 6= ; and D(B 0 ) 2D(B ). Choosing a point x 2 B \ B 0 and applying (2.31) to the regular cubes B; B 0 , we get 1 max jg(u) gx (u)j max jg(u)j u2D 4 u2D both for D = B and D = B 0. It follows that 4 max jg(u)j max jgx (u)j u2D 3 u2D
(2.36)
is contained in 5 times larger than the cube B concentric to B cube, we have max0 jgx (u)j P7 max jgx (u)j; u2B u2B whence, in view of (2.36), 4 max 0 jg(u)j P7 max jgx (u)j: u2B [B 3 u2B Recalling that maxu2B jgx (u)j 45 g(B ) by (2.31), we come to 5 jg(y)j max j gx (u)j P7 g (B ); 0 u2B 3 so that the choice P6 = 35 P7 ensures (2.35).
Combining (2.34) and (2.35), we get k g krr P8r
X
B 2A
jB jgr (B ):
(2.37)
60 .c) Since A A0 , (2.32) says that for every B 2 A it holds g (B ) = 4P5 [D(B )]k+Æ 1 (g; B ); so that (2.37) yields the inequality
k g krr P9r 60 .d) Let us set
X
B 2A
jB j1+ r k dÆ ( +
B 2A
Let
= Then 1+ r(k+dÆ
B 1)
1 2
+ k+dÆ
1 + r(k+dÆ 1 + k+Æ 2 d
1
1) 1
(B; g ):
:
2 A P11 A (B; g)
kgk r r
[see (2.39)] X P9r P11 A r (B; g ) B 2A
[see (2.38)]
)
(2.38)
B 2A
A P10 sup jB j
)
r (B; g ):
A = sup g (B )jB j1=2 ;
note that by (2.32) we have
) jB j
1)
k g kr
=
P9r P11 A
X
B 2A
!r
p (B; g )
p
[since r p in view of (2.33)] P9r P11 A k Dk g krp [since the cubes B 2 A are mutually disjoint] P12 A k Dk g k1p ; r
(2.39)
we see that it is valid for r = 1 as well, so that + d=r) 1 r 2k d+ d p )k g kr P12 A k Dk g k1 ; = 2(2kk d : (2.40) 2d + d 60 .d) By de nition of A, there exists a regular cube B such that 1 g (B )jB j1=2 A: (2.41) 2 Let x0 2 B ; since B is regular, we have supx2B jg (x) gx (x)j 41 g (B ) [see (2.31)] (2.42) 3 5 ) 4 g(B ) maxx2B jgx (x)j 4 g(B ): In view of the latter inequalities and since gx () is a polynomial of degree k 1, there exists a cube B B such that jB j P13 jB j and jgx (x)j 21 g (B ) for all x 2 B , whence, in view of the rst inequality in (2.42), jg (x)j 41 g (B ) whenever x 2 B . Combining these observations and (2.41), we conclude that jg(x)j; A P14 jB j1=2 xmin 2B 0
0
0
0
so that by (2.40) 9B : i h 1 r 2kd+d p )k g kr P14 jB j1=2 minx2B jg(x)j k Dk g k1p ; (2.43) 2(k d+d=r)
= 2k 2d+d : Consider two possible cases: (I): jB j 6d n 1 ; (II): jB j < 6d n 1 . In the case of (I), since B is a normal system, there exists a cube B^ 2 B such that B^ 6 d jB j and B^ B , and we get k g kB n1=2 (B^ ) min^ 6 d=2n min^ jg(x)jjB^ j1=2 : x2B
x2B
Thus, in the case of (I) relation (2.43) implies that !
k g k2B =2 2k + d 1 r d p )k g kr P15 n k Dk g k2 k Dk g kp : p In the case of (II) relation (2.43) applied with r = 1 yields h
i
k g k1 P14 jB j1=2 minx2B jg(x)j k Dk g k1p ) k g k1 ) k g kr ) k g kr
[ = 2k2(k2dd+)d ] h i jB j1=2 k g k1 k Dk g k1p
P16 jB j P17 jB j k Dk g kp [in view of (2.43)] k d d=r d P18 n [since (II) is the case]: 2(1
)
2(1
)
+
(2.44)
holds
8 <
k g kr P19 max :
k g k2B n k Dk g k2p
! 0 (p;k;d;r)
k Dk g kp; n
0 (p;k;n;r)
9 =
k Dk g kp;
(2.45)
(we have used the fact that for the values of r in question one has =2 = 0 (p; k; d; r), k d+d=r = 0 (p; k; d; r)). d Since the values of 0 (p; k; d; r); 0(p; k; d; r) for r < 2kd+d p are the same as for r = 2kd+d p, relation (2.45) is in fact valid for all r 2 [1; 1]. 60 .e) Since we are in the case of A { i.e., [0; 1]d is not a regular cube { we have k Dk g kp P20 k g k1. Tracing the origin of P20 , one can easily see that we can ensure P20 > P3 , P3 being de ned in Lemma 2.2.2. Thus, in the case under consideration the rst of the alternatives stated by Lemma 2.2.2 does not take place, and therefore (2.26) is valid. Assuming that q 22kl++dd p, let us set r = 2kd+d p, thus getting (k l)k=r+l . Applying (2.26) with the indicated r and (2.45), we get for q 22kl++dd p:
l j j < k )
k
D() g
kq P21
(
) kgk2B l (p;k;d;q) k Dk g k ; n l (p;k;d;q) k Dk g k : p p nkDk gk2p
(2.46)
Since l (p; k; d; q ); l(p; k; d; q ) are independent of q in the segment [1; 22kl++dd p] of values of the parameter, relation (2.46) in fact is valid for all q . Thus, we have proved (2.22) in the case of A. 70 . It remains to prove (2.22) in the case of B, i.e., when [0; 1]d is a regular cube, whence k Dk g k P22 k g k1 : (2.47) In this case we can apply (2.31) to the cube B = [0; 1]n to get the inequality 1 maxn jg (x) g0 (x)j k g k1; x2[0;1] 4 whence, same as in 60 .d), there exists a cube B such that jB j P23 and jg (x)j k g k1 for x 2 B . Since B is a normal system, there exists B 2 B such that B 2 B and jB j P24 , provided that n is large enough, and we get
k g kB P25 n1=2 k g k1; whence
k g k1 P26 k g kB :
(2.48) n Combining (2.47), (2.48) and Lemma 2.2.2, we come to (2.22). The proof of Lemma 2.2.1 is completed. 1 2
2.3 Appendix: Proofs of Theorems 2.1.1, 2.1.2

Proof of Theorem 2.1.1. Let us fix a $C^\infty$ function $h(\cdot)\not\equiv 0$ such that
\[
\mathrm{supp}(h) = [0,1]^d,\qquad \|D^k h\|_\infty \le 1,\qquad \|h\|_\infty \le 1, \tag{2.49}
\]
and let
\[
C_1 = \min\{\|D^{(\alpha)}h\|_\infty \mid 0 \le |\alpha| < k\}
\]
(recall that the $C_i$ stand for positive quantities depending on k, d only). Let us fix the volume of observations $n = m^d$ and a $\Delta \in (\tfrac1m, \tfrac18)$, and let $B_1,\dots,B_N$ be a maximal in cardinality system of mutually disjoint cubes with edges $\Delta$, all cubes of the system belonging to $[0,1]^d$. Note that the number of points from the observation grid $\Gamma_n$ in every one of the cubes $B_i$ does not exceed
\[
n_\Delta = (2\Delta)^d n.
\]
As is immediately seen,
\[
N \ge \max\{8,\ C_2\Delta^{-d}\}. \tag{2.50}
\]
Let
\[
h_\Delta(x) = L\Delta^{k-d/p}\,h(x/\Delta),
\]
and let $h_j$ be the translation of $h_\Delta$ with the support $B_j$, $j = 1,\dots,N$; it is immediately seen that $h_j \in S \equiv S^{k,p}_d(L)$. Now consider N hypotheses on the distribution of observations y, the j-th of the hypotheses being that the distribution is the one of the vector $y_{h_j}(\xi)$, see (2.1). Let us fix $(\alpha)$ such that $l = |\alpha| < k$, and let
\[
\varepsilon(\Delta) = \frac14\,C_1\,\Delta^{k-l-d/p+d/q}L.
\]
We have
\[
i \ne j\ \Longrightarrow\ \|D^{(\alpha)}h_i - D^{(\alpha)}h_j\|_q \ge \|D^{(\alpha)}h_i\|_q = L\Delta^{k-l-d/p+d/q}\|D^{(\alpha)}h\|_q \ge L\Delta^{k-l-d/p+d/q}C_1 = 4\varepsilon(\Delta).
\]
Consequently (cf. the proof of Proposition 1.2.3), under the assumption that the minimax q-risk of estimating $D^{(\alpha)}f$, $f \in S$, is less than $\varepsilon(\Delta)$:
\[
\mathcal R_{q,(\alpha)}(n;S) < \varepsilon(\Delta), \tag{2.51}
\]
there exists a routine for distinguishing our N hypotheses with probability to reject a hypothesis when it is true at most 1/4. On the other hand, the Kullback distance between pairs of distributions associated with our hypotheses is at most
\[
2\sigma^{-2}\,\mathrm{Diameter}^2\big(\{h_j(\cdot)\}_{j=1}^N\,\big|\,\Gamma_n\big) \le 2\sigma^{-2}n_\Delta\|h_\Delta\|_\infty^2 \le C_3\,n\sigma^{-2}L^2\Delta^{d+2(k-d/p)}.
\]
Applying the Fano inequality (1.27), we see that the assumption (2.51) implies the relation
\[
(L/\sigma)^2\,n\,\Delta^{d+2(k-d/p)} \ge C_4\ln N \ge C_5\ln\frac1\Delta, \tag{2.52}
\]
the concluding inequality being given by (2.50). Now let us set
\[
\Delta_1 = C_6\left(\frac{\sigma}{L\sqrt n}\right)^{\frac{2}{2k-2d/p+d}};
\]
then (2.52) fails to be true when $\Delta = \Delta_1$. Consequently, for $\Delta = \Delta_1$ (2.51) cannot be valid, and we come to
\[
\mathcal R_{q,(\alpha)}(n;S) \ge \varepsilon(\Delta_1) \ge C_7\,L\left(\frac{\sigma}{L\sqrt n}\right)^{\frac{2(k-l-d/p+d/q)}{2k-2d/p+d}}; \tag{2.53}
\]
this is exactly the bound (2.4) for the case of large ratios q/p (i.e., $q/p \ge \frac{2k+d}{2l+d}$). Now assume that (2.5) takes place, and let us set
\[
\Delta_2 = F\left(\frac{\sigma\sqrt{\ln n}}{L\sqrt n}\right)^{\frac{2}{2k-2d/p+d}};
\]
it is immediately seen that for properly chosen $F > 0$ (depending on k, d, $\varepsilon$ only) relation (2.52) fails to be true when $\Delta = \Delta_2$. Consequently, for $\Delta = \Delta_2$ (2.51) cannot be valid, and we come to
\[
\mathcal R_{q,(\alpha)}(n;S) \ge \varepsilon(\Delta_2) \ge C(\varepsilon)\,L\left(\frac{\sigma\sqrt{\ln n}}{L\sqrt n}\right)^{\frac{2(k-l-d/p+d/q)}{2k-2d/p+d}}; \tag{2.54}
\]
this is exactly the bound (2.5) for the case of large ratios q/p. We have thus established the desired lower bounds for the case of large ratios q/p.

The lower bound (2.4) in the case of small ratios q/p, $q/p < \frac{2k+d}{2l+d}$, is given by exactly the same construction as in the case of Hölder balls. Namely, let us redefine $h_\Delta$ as follows:
\[
h_\Delta(x) = L\Delta^{k}h(x/\Delta);
\]
let $h_j$ be the translation of $h_\Delta$ with the support $B_j$, and let $F_N$ be the set of $2^N$ functions
\[
\sum_{j=1}^{N}\varepsilon_j h_j(x),\qquad \varepsilon_j = \pm1.
\]
The set $F_N$ clearly is contained in $S^{k,\infty}_d(L)$ and possesses a subset $F_M$ comprised of
\[
M \ge 2^{N/8}
\]
functions with the following property: if f, g are two distinct functions from $F_M$, then f differs from g on at least N/8 of the cubes $B_1,\dots,B_N$. Now let us fix $(\alpha)$ with $l = |\alpha| < k$; for two distinct functions $f, g \in F_M$ one clearly has
\[
\|D^{(\alpha)}f - D^{(\alpha)}g\|_1 \ge C_8\,L\Delta^{k-l+d}N \ge C_9\,L\Delta^{k-l}.
\]
Setting
\[
\varepsilon(\Delta) = \frac14\,C_9\,L\Delta^{k-l},
\]
we, same as above, conclude that under the assumption that
\[
\mathcal R_{1,(\alpha)}\big(n; S^{k,\infty}_d(L)\big) < \varepsilon(\Delta) \tag{2.55}
\]
one can "reliably" distinguish between M hypotheses on the distribution of observations (2.1), the Kullback distances between pairs of the distributions not exceeding
\[
2\sigma^{-2}\,n\max_{f,g\in F_M}\|f-g\|_\infty^2 \le C_{10}\,(L/\sigma)^2\,n\,\Delta^{2k}.
\]
Applying the Fano inequality, we see that (2.55) implies the relation
\[
n(L/\sigma)^2\Delta^{2k} \ge C_{12}\Delta^{-d}. \tag{2.56}
\]
Now let us set
\[
\Delta = C_{13}\left(\frac{\sigma}{L\sqrt n}\right)^{\frac{2}{2k+d}};
\]
for properly chosen $C_{13}$ and all n satisfying (2.3) the relation (2.56) (and therefore (2.55) as well) fails to be true. Thus, for the indicated values of n one has
\[
\mathcal R_{1,(\alpha)}\big(n; S^{k,\infty}_d(L)\big) \ge \varepsilon(\Delta) \ge C_{14}\,L\left(\frac{\sigma}{L\sqrt n}\right)^{\frac{2(k-l)}{2k+d}}.
\]
Since the risk $\mathcal R_{q,(\alpha)}(n; S^{k,p}_d(L))$ is nondecreasing in q and nonincreasing in p, the left hand side of this inequality does not exceed the minimax risk in (2.4), while the right hand side is exactly as required in (2.4) in the case of small ratios q/p.
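The packing construction used above is easy to play with numerically. Since the observation laws are Gaussian with common covariance $\sigma^2 I$, the Kullback distance between the laws associated with two signals equals the squared Euclidean distance between their restrictions to the grid divided by $2\sigma^2$. The sketch below (d = 1, our own bump function and parameter values) evaluates the pairwise Kullback distances between the scaled disjoint bumps of the first construction and compares them, up to constants, with the quantity $n\sigma^{-2}L^2\Delta^{d+2(k-d/p)}$ appearing in (2.52).

```python
import numpy as np

def bump(t):
    """A C^infinity bump supported on [0, 1] (any smooth h with supp h = [0,1] works)."""
    t = np.asarray(t, dtype=float)
    inside = (t > 0) & (t < 1)
    out = np.zeros_like(t)
    out[inside] = np.exp(-1.0 / (t[inside] * (1.0 - t[inside])))
    return out

n, sigma, k, p, L = 4096, 1.0, 2, 2.0, 1.0
x = np.arange(1, n + 1) / n
Delta = 0.05                                   # edge of the small cubes B_j
starts = np.arange(0, 1.0 - 1e-12, Delta)      # N = 1/Delta disjoint cubes
# scaled translated bumps h_j, each staying inside the Sobolev ball (cf. the proof):
H = [L * Delta ** (k - 1.0 / p) * bump((x - s) / Delta) for s in starts]

def kullback(hi, hj, sigma):
    """Kullback distance between N(hi(grid), sigma^2 I) and N(hj(grid), sigma^2 I)."""
    return np.sum((hi - hj) ** 2) / (2 * sigma ** 2)

K = max(kullback(H[i], H[j], sigma) for i in range(len(H)) for j in range(i))
print("max pairwise Kullback distance:", K)
print("same quantity with constants ignored, n*(L/sigma)^2*Delta^(d+2k-2d/p):",
      n * (L / sigma) ** 2 * Delta ** (1 + 2 * k - 2.0 / p))
```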
Proof of Theorem 2.1.2. Same as in the proof of Theorem 2.1.1, below Ci are positive quantities depending on k; d only. Let h() and C1 be the same as in the proof of Theorem 2.1.1, and let C2 be such that
mesfx 2 [0; 1]d j jD() hj > C2 g > C2 8; jj < k: Let us x L; ; d; p; k; q satisfying the premise of Theorem 2.1.2, the volume n of observations (2.1) and , jj l < k. Consider a linear estimate of D() f , f 2 S Sdk;p(L), based on observations (2.1), let it be X f^n (x; y ) = (x)y ;
and let " be the worst-case, with respect to S , q -risk of this estimate:
"2 = sup E f 2S
We have
"2
and Now assume that
n
o
k D() f f^n k2q :
E k P () k2q 2 E k P () k22 [since q 2 by the premise of Theorem 2.1.2] P 2 = k () k22 "2 k D() f ()
X
()f (x ) k2q
" < (0:5C2 )1+ L;
8f 2 S :
(2.57)
(2.58) (2.59)
!
1
(0:5C2 )1+ L k l d d ; (2.60) " note that 1 by (2.59). Setting = 1= and taking into account (2.57), we observe that there exists a cube B [0; 1]d such that the number n(B ) of observation points in (2.1) in B does not exceed 2nd , while
2
Z X
B
+
2 (x)dx 2d "2 :
(2.61)
Now let h () be the translation of the function Lk d h(x=) such that the support of h is B . Setting X g (x) = (x)h (x);
and applying the Cauchy inequality, we get 1=2 P 2 1 = 2 h 1 n (B ) (x) 1=2 P 2 k d + d= 2 1 = 2 C3 L n (x)
jg(x)j k
k
[since by construction n(B ) 2nd ] The resulting inequality combined with (2.61) implies that there exist B C4 such that (a) mes B (1 0:5C2 )d ; (b) x 2 B ) jg(x)j C4 L 1 k d+d=2 n1=2 ": Now note that by construction is less than the quantity (2.60), so that 0:5C2 Lk d l (0:5C2 d ) > ": We claim that 0:5C2 Lk d l C4 L 1 k d+d=2 n1=2 ":
(2.62)
B and (2.63) (2.64) (2.65)
Indeed, assuming that the opposite inequality holds: 0:5C2 Lk
d l > C L 1 k d+d=2 n1=2 " 4
and combining this inequality with (2.63), we would get x 2 B ) jg (x)j < 0:5C2 Lk d l [B B; mes B (1 0:5C2 ) mes B ]: Recalling the origin of C2 , we would further conclude that there exists B B such that mes B o0:5C2nd ; n o x 2 B ) jg (x)j 0:5C2 Lk d l & jD () h (x)j C2 k d l L : Combining these observations and (2.58), we would get " k D () h
g kq 0:5C2 Lk d l (0:5C2 d ) ;
which is impossible in view of (2.64).
"
C5 n 1=2 l d=2 d l G1 n 1=2 L" k l d 2
2
2(
+2
d)
+
[see the origin of ] with G1 > 0 depending on k; p; d; q only. From the resulting inequality it follows that
" > G2 L
p
L n
!2l (p;k;d;q)
:
(2.66)
with G2 of the same type as G1 . We have established the implication (2.59) ) (2.66); in view of this implication, (2.16) is valid for all large enough values of n, as stated in Theorem 2.1.2.
Chapter 3

Spatial adaptive estimation on Sobolev balls

3.1 Spatial adaptive estimation: the goal

We have seen what are the minimax risks of recovering functions f from Sobolev balls $S^{k,p}_d(L)$ via their $n = m^d$ noisy observations
\[
y \equiv y_f(\xi) = \Big\{y_\iota = f(x_\iota) + \sigma\xi_\iota\ \Big|\ \iota = (i_1,\dots,i_d)\in[1,m]^d,\ x_{(i_1,\dots,i_d)} = (i_1/m,\,i_2/m,\dots,i_d/m)^T\Big\},\qquad \xi = \{\xi_\iota\}:\ \xi_\iota\ \text{independent}\ N(0,1), \tag{3.1}
\]
and have developed the associated estimates, optimal in order up to logarithmic in n factors. These estimates, however, suffer from two serious drawbacks:

The estimates are not adaptive to the parameters of smoothness p, k, L of the regression function f to be recovered. An estimate depends on a particular a priori choice of these parameters and guarantees certain quality of recovering only in the case when f belongs to the corresponding Sobolev ball. In reality we can hardly know in advance the precise values of the parameters of smoothness of f and should therefore use certain guesses for them. If our guesses "underestimate" the smoothness of f, then the associated estimate does ensure the risk bounds corresponding to the guessed smoothness; these bounds, however, may be much worse than if we were able to fit the estimate to the actual smoothness of f. And if our guesses for the parameters of smoothness of f "overestimate" the actual smoothness, we simply cannot guarantee anything.

The estimates are not spatially adaptive: assume, e.g., that we know that the function $f: [0,1]\to\mathbf R$ to be recovered is continuously differentiable with, say, $\|f'\|_2 = L$, and that we know the value of L, so that there seemingly is no difficulty with tuning the recovering routine to the actual smoothness of f. Note, however, that $\|f'\|_2$ may come from a "local singularity" of f -- a relatively small part of our "universe" [0,1] where f varies rapidly -- and there still may be large segments $B' \subset [0,1]$ where f is much more smooth than is said by the inclusion $f \in S^{1,2}_1(L)$. If we knew these "segments of high smoothness of f", along with the corresponding smoothness parameters, in advance, we could recover f on these segments much better than on the entire [0,1]. However, the recovering routines we know to the moment are "too stupid" to adapt themselves to favourable local behaviour of the regression function in question.

For estimates aimed at recovering smooth regression functions, the "adaptive abilities" of an estimate can be quantified as follows. For a cube
\[
B = \{x \mid |x_i - c_i| \le h,\ i = 1,\dots,d\}
\]
contained in $[0,1]^d$, let $S^{k,p}_d(B;L)$ be the set of functions $f: [0,1]^d \to \mathbf R$ satisfying the following assumptions:
f is continuous on $[0,1]^d$;
f is k times differentiable on B, and $\|D^k f\|_{p,B} \le L$.
Here $\|\cdot\|_{p,B}$ is the standard $L_p$-norm on B. In this definition, similarly to the definition of a Sobolev ball in Chapter 2, k is a positive integer -- the order of smoothness; d is a positive integer -- the dimensionality; $p \in (d,\infty]$; $L > 0$. From now on, we fix the dimension d of the regression functions in question. In the sequel, we use for $S^{k,p}_d(B;L)$ also the shortened notation $S[w]$, where w stands for the collection of "parameters" (k,p,B,L), and we call the set $S[w]$ a local Sobolev ball.

Let us once for ever fix a "margin" -- a real $\lambda \in (0,1)$ -- and let $B_\lambda$, B being a cube, be the $\lambda$ times smaller concentric cube:
\[
B_\lambda = \{x \mid |x_i - c_i| \le \lambda h,\ i = 1,\dots,d\}\ \subset\ B = \{x \mid |x_i - c_i| \le h,\ i = 1,\dots,d\}\ \subset [0,1]^d.
\]
Given an estimate $\hat f_n$ based on observations (3.1) (i.e., a Borel real-valued function of $x \in [0,1]^d$ and $y \in \mathbf R^n$), let us characterize its quality on a set $S[w]$ by the worst-case risks
\[
\widehat{\mathcal R}_q\big(\hat f_n; S[w]\big) = \sup_{f\in S[w]}\Big\{\mathcal E\,\|\hat f_n(\cdot,y_f(\xi)) - f(\cdot)\|^2_{q,B_\lambda}\Big\}^{1/2},
\]
and let
\[
\widehat{\mathcal R}_q\big(n; S[w]\big) = \inf_{\hat f_n}\ \sup_{f\in S[w]}\Big\{\mathcal E\,\|\hat f_n(\cdot,y_f(\xi)) - f(\cdot)\|^2_{q,B_\lambda}\Big\}^{1/2} \tag{3.2}
\]
be the corresponding minimax risks$^{1)}$. For a particular estimate $\hat f_n$, the ratio
\[
\frac{\widehat{\mathcal R}_q\big(\hat f_n; S[w]\big)}{\widehat{\mathcal R}_q\big(n; S[w]\big)} \tag{*}
\]
measures the level of non-optimality of the estimate $\hat f_n$ on the set $S[w]$. It is natural to measure the adaptive abilities of an estimate $\hat f_n$ by looking at "how wide" is the spectrum of local Sobolev balls for which the ratio (*) is "moderately large". The formal definition is as follows.

$^{1)}$ Note that we prefer to measure the estimation errors in the integral norms associated with the slightly smaller than B cube $B_\lambda$; this allows us to avoid in the sequel boring analysis of "boundary effects".
Definition 3.1.1 Let
1. $\mathbf S = \{\mathcal S_n\}_{n\ge1}$ be a "nested family" of local Sobolev balls on $\mathbf R^d$, i.e.,
\[
\mathcal S_n = \{S[w] \mid w \in W_n\}
\]
with
\[
\mathcal S_{n+1} \supset \mathcal S_n
\]
for every n;
2. $\{\hat f_n\}_{n\ge1}$ be an estimation method -- a collection of estimates indexed by the volumes n of observations (3.1) used by the estimates;
3. $\Psi(n)$ be a real-valued function.
We say that the $\mathbf S$-nonoptimality index of the estimation method $\{\hat f_n\}_{n=1}^\infty$ is $\Psi(\cdot)$, if, for every $q \in [1,\infty]$ and all large enough values of n, one has
\[
\sup_{w=(k,p,B,L)\in W_n}\ \frac{\widehat{\mathcal R}_q\big(\hat f_n; S[w]\big)}{\widehat{\mathcal R}_q\big(n; S[w]\big)} \le O\big(\Psi(n)\big).
\]

An "ideal" adaptive routine for recovering smooth regression functions would have a constant nonoptimality index with respect to the widest possible nested family of local Sobolev balls -- the one for which $\mathcal S_n$, for every n, contains all local Sobolev balls. As we shall see in the mean time, such an ideal routine simply does not exist. Recently, several adaptive routines of nearly the same "adaptive power" were proposed (the wavelet-based estimators of Donoho et al. [5, 7] and Juditsky [15], the adaptive kernel estimates of Lepskii, Mammen and Spokoiny [20])$^{2)}$. What we are about to do is to build an extremely simple recovering routine with "nearly ideal" adaptive abilities -- one for which the nonoptimality index with respect to a certain "rapidly extending" nested family $\{\mathcal S_n\}$ grows with n "very slowly" -- logarithmically. We shall also see that the "logarithmic growth" of the nonoptimality index is an unavoidable price for the ability of a routine to adapt itself to rapidly extending nested families of local Sobolev balls.
3.2 The estimate

The recovering routine we are about to build is aimed at estimating functions with order of smoothness not exceeding a given upper bound $\mu + 1$; $\mu$ (which should, of course, be a nonnegative integer) is the only parameter our construction depends upon.

To recover f, we shall, same as in Chapter 1, use a point-wise window estimator of f. Namely, to estimate f at a given point $x \in \mathrm{int}\,[0,1]^d$, we choose somehow an admissible window -- a cube
\[
B_h(x) = \{u \mid |u_i - x_i| \le h/2,\ i = 1,\dots,d\} \subset [0,1]^d
\]
centered at x and containing at least $(\mu+3)^d$ observation points:
\[
h \ge \frac{\mu+3}{m}. \tag{3.3}
\]
Note that since the window should be centered at x and be contained in $[0,1]^d$, the point x should not be too close to the boundary of $[0,1]^d$:
\[
\frac{\mu+3}{2m} \le x_i \le 1 - \frac{\mu+3}{2m},\quad i = 1,\dots,d, \tag{3.4}
\]
which we assume from now on. The estimate $\hat f_n(x;y)$ will be just the order $\mu$ window estimate (Chapter 1, Section 1.3$^{3)}$), with the window width depending on x and chosen on the basis of observations. Thus, the difference between the estimate we are about to build and the estimator from Chapter 1 is that now we choose its own window width for every point rather than serving all points with the same window width.

The central issue is, of course, how to choose the window width for a given x, and the underlying idea (which goes back to Lepskii [19]) is as follows. Let, as in Chapter 1,
\[
\Phi(f;B_h(x)) = \min_{p\in\mathcal P^{\mu}}\ \max_{u\in B_h(x)}|f(u) - p(u)|,
\]
$\mathcal P^{\mu}$ being the space of polynomials on $\mathbf R^d$ of total degree $\le \mu$.

$^{2)}$ In the cited papers, the smoothness of the signal is specified as membership in the Besov or Triebel spaces -- extensions of the Sobolev spaces we deal with.
Applying Proposition 1.3.1, we come to the following upper bound on the error of estimating f(x) by the estimate $\hat f_n^h(x;\cdot)$ -- the window estimate associated with the centered at x window of width h:
\[
{\rm err}_h(f;x) \equiv |f(x) - \hat f_n^h(x;y_f(\xi))| \le C_1\left[\Phi(f;B_h(x)) + \frac{\sigma\Theta_n}{\sqrt{n h^d}}\right], \tag{3.5}
\]
$\Theta_n = \Theta_n(\xi)$ being a deterministic function of the observation noises; from now on, C (perhaps with sub- or superscripts) are positive quantities depending on d, $\mu$ only. As we remember from (1.43), one has
\[
\forall w \ge 1:\quad {\rm Prob}\Big\{\Theta_n > O_{\mu,d}(1)\,w\sqrt{\ln n}\Big\} \le \exp\left\{-\frac{w^2\ln n}{2}\right\}. \tag{3.6}
\]
Note that (3.5) implies that
\[
{\rm err}_h(f;x) \le C_1\left[\Phi(f;B_h(x)) + \frac{\sigma\Theta_n}{\sqrt{n h^d}}\right] \tag{3.7}
\]
simultaneously for all x satisfying (3.4) and all windows $B_h(x)$ admissible for x.

$^{3)}$ In this chapter we assume that the window estimate associated with a window B does not use the observations at the boundary points of the cube B; this is why we write $\mu+3$ instead of $\mu+2$ in (3.4).
By (3.6), "typical values" of this variable do not exceed $O(\sqrt{\ln n})$. Let us fix a "safety factor" $\omega$ in such a way that the event $\Theta_n > \omega\sqrt{\ln n}$ is "highly un-probable", namely,
\[
{\rm Prob}\Big\{\Theta_n > \omega\sqrt{\ln n}\Big\} \le n^{-4(\mu+1)}; \tag{3.8}
\]
by (3.6), the required $\omega$ may be chosen as a function of $\mu$, d only. Let us set
\[
\Xi_n = \Big\{\xi \mid \Theta_n \le \omega\sqrt{\ln n}\Big\}. \tag{3.9}
\]
Note that (3.7) implies the "conditional" error bound
\[
\xi\in\Xi_n\ \Longrightarrow\ {\rm err}_h(f;x) \le C_1\big[\Phi(f;B_h(x)) + S_n(h)\big],\qquad S_n(h) = \frac{\sigma\omega\sqrt{\ln n}}{\sqrt{n h^d}}. \tag{3.10}
\]
The two terms in the right hand side of the resulting error bound -- the deterministic term $\Phi(f;B_h(x))$ and the stochastic term $S_n(h)$ -- possess opposite monotonicity properties with respect to h: as h grows (i.e., as the window extends), the deterministic term does not decrease, while the stochastic term does not increase. It follows that if we were clever enough to find the "ideal window" -- the one for which the deterministic term is equal to the stochastic one -- we would get the best possible, up to factor 2, error bound (3.10). Of course, we never can be clever enough to specify the "ideal window", since we do not know the deterministic term. It turns out, however, that we can act nearly as if we knew everything.

Let us define the "ideal window" $B^*(x)$ as the largest admissible window for which the stochastic term dominates the deterministic one:
\[
B^*(x) = B_{h^*(x)}(x),\qquad h^*(x) = \max\Big\{h \mid h \ge \tfrac{\mu+3}{m},\ B_h(x)\subset[0,1]^d,\ \Phi(f;B_h(x)) \le S_n(h)\Big\}. \tag{3.11}
\]
Note that such a window not necessarily exists: it may happen that f varies in a neighbourhood of x so rapidly that already for the smallest possible admissible window the deterministic term majorates the stochastic one. In this case we define $B^*(x)$ as the smallest possible window which is admissible for x. Thus, the ideal window $B^*(x)$ is well-defined for every x possessing admissible windows; we call it good if it is given by (3.11) and bad in the opposite case.

It is immediately seen that whenever $\xi \in \Xi_n$, the error bound (3.10) associated with the ideal window is, up to factor 2, better than the bound associated with any other (admissible) window, which motivates the term "ideal window". To explain the idea of the estimate of f(x) we are about to build, assume that the ideal window for x is a good one, and let $\xi \in \Xi_n$. Then the errors of all estimates $\hat f_n^h(x;y)$ associated with admissible windows smaller than the ideal one are dominated by the corresponding stochastic terms:
\[
\xi\in\Xi_n,\ h\in\Big[\tfrac{\mu+3}{m},\,h^*(x)\Big]\ \Longrightarrow\ {\rm err}_h(f;x) \le 2C_1 S_n(h); \tag{3.12}
\]
indeed, for the (good) ideal window the deterministic term is equal to the stochastic one, so that for smaller windows the deterministic term is not greater than the stochastic one.
Let us say that an admissible window $B_h(x)$ is normal if the associated estimate $\hat f_n^h(x;y)$ differs from every estimate associated with a smaller admissible window by no more than $4C_1$ times the stochastic term of the latter estimate:
\[
B_h(x)\ \text{is normal}\ \Longleftrightarrow\
\begin{cases}
B_h(x)\ \text{is admissible},\\[2pt]
\forall h'\in\big[\tfrac{\mu+3}{m},\,h\big]:\ |\hat f_n^{h'}(x;y) - \hat f_n^{h}(x;y)| \le 4C_1 S_n(h')
\end{cases}
\qquad [y = y_f(\xi)]. \tag{3.13}
\]
Note that if x possesses an admissible window, then it possesses a normal one as well (e.g., the smallest window admissible for x clearly is normal). Note also that (3.12) says that
(!) If $\xi \in \Xi_n$ (i.e., if $\Theta_n$ is not "pathologically large"), then the ideal window $B^*(x)$ is normal.
Indeed, for a good ideal window the claim follows from (3.12), while a bad ideal window is just the smallest window admissible for x and is therefore normal.

Now observe that the property of an admissible window to be normal is "observable" -- given observations y, we can say whether a given window is or is not normal. Besides this, it is clear that among all normal windows there exists the largest one, $B^+(x) = B_{h^+(x)}(x)$ (to ensure the latter property, we have redefined window estimates as those using observations from the interior of the underlying windows rather than from the entire windows). From (!) it follows that
(!!) If $\xi \in \Xi_n$ (i.e., if $\Theta_n$ is not "pathologically large"), then the largest normal window $B^+(x)$ contains the ideal window $B^*(x)$.
By definition of a normal window, under the premise of (!!) we have
\[
|\hat f_n^{h^+(x)}(x;y) - \hat f_n^{h^*(x)}(x;y)| \le 4C_1 S_n(h^*(x)),
\]
and we come to the conclusion as follows:
(*) If $\xi \in \Xi_n$ (i.e., if $\Theta_n$ is not "pathologically large"), then the error of the estimate
\[
\hat f_n(x;y) \equiv \hat f_n^{h^+(x)}(x;y)
\]
is dominated by the error bound (3.10) associated with the ideal window:
\[
\xi\in\Xi_n\ \Longrightarrow\ |\hat f_n(x;y) - f(x)| \le 5C_1\big[\Phi(f;B_{h^*(x)}(x)) + S_n(h^*(x))\big]. \tag{3.14}
\]
Thus, the estimate $\hat f_n(\cdot,\cdot)$ -- which is based solely on observations and does not require any a priori knowledge of the smoothness of f -- possesses basically the same accuracy as the "ideal" estimate associated with the ideal window (provided, of course, that the realization of noises is not pathological: $\xi \in \Xi_n$). Note that the estimate $\hat f_n(x;y)$ we have built -- let us call it the adaptive estimate -- depends on a single "design parameter" $\mu$ (and, of course, on $\sigma$, the volume of observations n and the dimensionality d).
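For illustration, here is a minimal univariate sketch of the adaptive rule just described: window widths are scanned from the smallest admissible one upwards, a width is retained as long as the associated estimate stays within $4C_1 S_n(h')$ of the estimates for all smaller widths tried (the normality test (3.13), checked on a finite grid of widths and stopping at the first violation), and the last retained estimate is returned. The constants $C_1$ and $\omega$, which in the text are determined by d and $\mu$, are treated here as tuning parameters; all names are ours.

```python
import numpy as np

def poly_window_est(y, x_grid, x, h, mu):
    """Order-mu window estimate of f(x) over the window of half-width h/2."""
    mask = np.abs(x_grid - x) < h / 2              # interior of the window
    c = np.polyfit(x_grid[mask] - x, y[mask], mu)
    return c[-1]

def adaptive_estimate(y, x_grid, x, sigma, mu=1, C1=1.0, omega=1.0):
    """Largest 'normal' window a la (3.13): a width h is kept as long as its estimate
    differs from every smaller-width estimate tried by <= 4*C1*S_n(h')."""
    n = len(y)
    S = lambda h: sigma * omega * np.sqrt(np.log(n)) / np.sqrt(n * h)   # (3.10), d = 1
    h_min = (mu + 3) / n
    widths = [h for h in np.geomspace(h_min, 0.5, 25)
              if x - h / 2 >= 0 and x + h / 2 <= 1]
    kept, chosen = [], None
    for h in widths:
        est = poly_window_est(y, x_grid, x, h, mu)
        if all(abs(est - e) <= 4 * C1 * S(hp) for e, hp in kept):
            kept.append((est, h)); chosen = est
        else:
            break                                   # larger windows are no longer normal
    return chosen

rng = np.random.default_rng(2)
n, sigma = 2048, 0.5
x_grid = np.arange(1, n + 1) / n
f = lambda t: np.where(t < 0.5, np.sin(2 * np.pi * t), np.sign(np.sin(40 * np.pi * t)))
y = f(x_grid) + sigma * rng.standard_normal(n)
for x0 in (0.25, 0.76):                             # smooth region vs. rapidly varying region
    print(x0, adaptive_estimate(y, x_grid, x0, sigma), float(f(x0)))
```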
Our main result is as follows:
Theorem 3.3.1 Let $\lambda \in (0,1)$, let $\mu \ge 0$ be an integer, and let $\mathcal S = S^{k,p}_d(B;L)$ be a local Sobolev ball with order of smoothness k not exceeding $\mu+1$ and with $p > d$. For a properly chosen $P \ge 1$ depending solely on $\mu$, d, p, $\lambda$ and nonincreasing in $p > d$ the following statement takes place:
If the volume $n = m^d$ of observations (3.1) is large enough, namely,
\[
P^{-1} n^{\frac{2(k-\pi d)+d}{2d}}\ \ge\ \frac{L}{\hat\sigma_n}\ \ge\ P\,D^{-\frac{2(k-\pi d)+d}{2}}(B),
\qquad \hat\sigma_n = \frac{\sigma\sqrt{\ln n}}{\sqrt n},\quad \pi = \frac1p \tag{3.15}
\]
(D(B) is the edge of the cube B), then for every $q \in [1,\infty]$ the worst-case, with respect to $\mathcal S$, q-risk of the adaptive estimate $\hat f_n(\cdot,\cdot)$ associated with the parameter $\mu$ can be bounded as follows (cf. (2.2)):
\[
\widehat{\mathcal R}_q\big(\hat f_n;\mathcal S\big) \equiv \sup_{f\in\mathcal S}\Big\{\mathcal E\,\|\hat f_n(\cdot,y_f(\xi)) - f(\cdot)\|^2_{q,B_\lambda}\Big\}^{1/2}
\ \le\ P\,L\left(\frac{\hat\sigma_n}{L}\right)^{2\beta(p,k,d,q)} D^{d\vartheta(p,k,d,q)}(B), \tag{3.16}
\]
where
\[
\beta(p,k,d,q) = \begin{cases}\dfrac{k}{2k+d}, & q \le \dfrac{(2k+d)p}{d},\\[8pt] \dfrac{k-\pi d+d/q}{2k-2\pi d+d}, & q \ge \dfrac{(2k+d)p}{d},\end{cases}
\qquad
\vartheta(p,k,d,q) = \begin{cases}\dfrac1q - \dfrac{\pi d}{2k+d}, & q \le \dfrac{(2k+d)p}{d},\\[8pt] 0, & q \ge \dfrac{(2k+d)p}{d};\end{cases}
\]
here $B_\lambda$ is the cube concentric to B and $\lambda$ times smaller in linear sizes.
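Under our reading of (3.16), the two rate exponents depend on (p,k,d,q) only through the comparison of q with (2k+d)p/d. The small helper below computes them; it is merely a convenience for checking which regime a given configuration falls into, and its formulas reproduce our reconstruction of (3.16) rather than an independent source.

```python
def rate_exponents(p, k, d, q):
    """Exponents beta(p,k,d,q) and theta(p,k,d,q) entering the bound (3.16)
    (as we read it): risk <= P * L * (sigma_hat_n / L)**(2*beta) * D(B)**(d*theta)."""
    critical = (2.0 * k + d) / d * p          # boundary between "small" and "large" q/p
    if q <= critical:                         # "small" ratios: standard rate
        beta = k / (2.0 * k + d)
        theta = 1.0 / q - d / (p * (2.0 * k + d))
    else:                                     # "large" ratios
        beta = (k - d / p + d / q) / (2.0 * k - 2.0 * d / p + d)
        theta = 0.0
    return beta, theta

# example: univariate, k = 1, p = 2 -- the standard n^(-1/3) rate in L2
print(rate_exponents(p=2, k=1, d=1, q=2))     # (1/3, 1/3)
print(rate_exponents(p=2, k=1, d=1, q=100))   # large-ratio regime
```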
Proof. 10 . In the main body of the proof, we focus on the case p; q < 1; the case of in nite p and/or q will be considered at the concluding step 40 . Let us x a local Sobolev ball Sdk;p (B ; L) with the parameters satisfying the premise of Theorem 3.3.1 and a function f from this class. Recall that by (2.11)
8 (x 2 int B ) 8 (h; Bh(x) B ) : (f; Bh (x)) P1 hk
d (f; B
h (x));
(f; B 0 ) =
0 Z @
B0
11=p
jDk f (u)jpduA
;
(3.17)
from now on, P (perhaps with sub- or superscripts) are quantities 1 depending on ; d; ; p only and nonincreasing in p > d, and j j stands both for the absolute value of a real and for the Euclidean norm of a vector from Rk . 20 . Our central auxiliary result is as follows:
Lemma 3.3.1 Assume that (a) n k d p (b) n d ln n
d (1 2( )+3) D(B ) ; P1( + 3)k d+d=2 !L :
(3.18)
(a) (b) P1 hk Then
d (f; B
h ( x )) h
(1 )D(B ); Sn(h):
(3.19)
+3 ; (3.20) m and the error at x of the adaptive estimate fbn as applied to f can be bounded as follows: (a) in the case of 2 n : jfbn(x; y) f (x)j C2Sn(h(x)); (3.21) (b) in the case of 62 n : jfbn(x; y) f (x)j P2Dk d (B )L + C2 n: h(x)
a0 . Let h = m+3 . From (3.18) it follows that h satis es (3.19.a), so that Bh (x) B . Moreover, (3.18.b) implies that Proof of Lemma.
P1 hk d L Sn (h
);
the latter inequality, in view of (f; Bh (x)) L, says that h satis es (3.19.b) as well. Thus, h(x) h , as claimed in (3.20). b0 . Consider the window Bh(x) (x). By (3.19.a) it is admissible for x, while from (3.19.b) combined with (3.17) we get (f; Bh(x) (x)) Sn (h): It follows that the ideal window B (x) of x is not smaller than Bh(x) (x) and is good. c0 . Assume that 2 n . Then, according to (3.14), we have
jfbn(x; y)
f (x)j 5C1
h
i
(f; Bh(x) (x)) + Sn (h (x)) :
(3.22)
Now, by the de nition of a good ideal window, either case (a): (f; Bh(x) (x)) = Sn (h (x)), or case (b): (f; Bh (x) (x)) Sn(h (x)) and B (x) is the largest cube centered at x and contained in [0; 1]d . If both cases, the right hand side in (3.22) does not exceed 10C1 Sn(h (x)) 10C1 Sn (h(x)) (recall that, as we have seen, h (x) h(x)), as required in (3.21.a). d0 . Now let 62 n . Note that fbn(x; y) is certain estimate fbh (x; y) associated with a centered at x and admissible for x cube Bh (x). There are two possible cases: case (c): Bh (x) B ; case (d): Bh (x) 6 B .
jfbn(x; y)
f (x)j
h
C1 (f; Bh (x)) + p d n nh P 0 D k d (B )L + C 0 n ;
i
(3.23)
the concluding inequality being given by (3.17) as applied to the cube Bh(x) B combined with the fact that this cube is admissible for x and therefore nhd 1. If (d) is the case, then the window Bh (x) contains the cube Bh(x) (x). For the estimate associated with the latter window we have (by the same reasons as in (3.5)) jfbnh(x)(x; y) f (x)j P 0Dk d (B )L + C 0n; and since the estimate fbn (x; y) is associated with a normal cube containing Bh(x) (x), we have
jfbnh(x)(x; y)
fbn (x; y )j 4C1 Sn (h(x)) C 00 n ;
the concluding inequality being given by the de nition of Sn () and the fact p that ! ln n n due to 62 n . Combining our observations, we see that in both cases (c), (d) we have
jfbn(x; y)
f (x)j P2 D k d (B )L + C2 n ;
as required in (3.21.b).
30 . Now we are ready to complete the proof. Assume that (3.18) takes place, and let us x q , 2kd+d p q < 1. 30 .a) Note that for every x 2 B { either h(x) = (1 )D(B ); { or P1 hk d (x) (f; Bh(x) (x)) = Sn (h(x)) m (3.24) k d d b n : h(x) = P (f;Bh x (x)) 2 +
1
2
2
( )
Let U; V be the sets of those x 2 B for which the rst, respectively, the second of this possibilities takes place. If V is nonempty, let us partition it as follows. 1) Since h(x) is bounded away from zero in B by (3.20), we can choose x1 2 V such that 1 h(x) h(x1 ) 8x 2 V: 2 After x1 is chosen, we set
V1 = fx 2 V j Bh(x) (x) \ Bh(x ) (x1 ) 6= ;g: 1
2) If the set V nV1 is nonempty, we apply the construction from 1) to this set, thus getting x2 2 V nV1 such that 1 h(x) h(x2 ) 8x 2 V nV1 ; 2
V2 = fx 2 V nV1 j Bh(x) (x) \ Bh(x ) (x2 ) 6= ;g: If the set V n(V1 [ V2 ) still is nonempty, we apply the same construction to this set, thus getting x3 and V3 , and so on. 2
The outlined process clearly terminates after certain step; indeed, by construction the cubes Bh(x ) (x1 ); Bh(x ) (x2 ); ::: are mutually disjoint and are contained in B , while the sizes of these cubes are bounded away from 0. On termination, we get a collection of M points x1 ; :::; xM 2 V and a partition 1
2
V = V1 [ V2 [ ::: [ VM with the following properties: (i) The cubes Bh(x ) (x1 ); :::; Bh(xM ) (xM ) are mutually disjoint; (ii) For every ` M and every x 2 V` we have 1
1 h(x) h(x` ) and Bh(x) (x) \ Bh(x` ) (x` ) 6= ;: 2 We claim that also (iii) For every ` M and every x 2 V` : 1 h(x) max [h(x` ); k x x` k1] : 2
(3.25)
Indeed, h(x) 21 h(x` ) by (ii), so that it suÆces to verify (3.25) in the case when k x x` k1 h(x`). Since Bh(x) (x) intersects Bh(x`)(x` ), we have k x x` k1 1 (h(x) + h(x )), whence ` 2
h(x) 2 k x x` k1 h(x` ) k x x` k1 ; which is even more than we need.
k fbZn(; y) f () kqq;B
C2q Snq (h(x))dx B
=
C2q
=
C2q
Z
U
Z
U
[by (3.21.a)]
M Z qX q Sn (h(x))dx + C2 Snq (h(x))dx `=1 V` " #q M Z bn qX dx + C2 Snq (h(x))dx ((1 )D(B ))d=2 `=1 V`
[since h(x) = (1 )D(B ) for x 2 U ]
M Z q b q d dq=2 q bq X C3 n D (B ) + C3 n (max [h(x` ); x x` 1 ]) dq=2 dx `=1 V` M Z1 q bq X q b q d dq=2 (B ) + C4 n rd 1 (max [h(x` ); r]) dq=2 dr C3 n D `=1 0 M X C3q bnq Dd dq=2 (B ) + C5q bnq [h(x` )]d dq=2 `=1 [note that dq=2 d + 1 2k2+d p d + 1 d2 =2 + 1 in view of q 2kd+d p, k 1 and p > d] # 2d dq M " 2k 2d +d bn q b q d dq=2 q bq X [by (3.24)] C3 n D (B ) + C5 n
( f; B ( x )) P 1 h ( x ) ` ` `=1 M h i dq 2d q b q d dq=2 q b 2 (p;k;d;q)q X C3 n D (B ) + C5 n P1 (f; Bh(x` ) (x` )) 2k 2d+d `=1
= =
k
k
(3.26)
[see the de nition of (p; k; d; q )]
Now note that
dq 2d 2k 2d+d
p in view of q 2kd+d p, so that M h P
i dq
P1 (f; Bh(x` ) (x` ))
`=1 " M P
2
2d k 2d+d
# dq 2d p p(2k 2d+d)
`=1 P1 (f; Bh(x`)(x` )) dq d [P1pLp] p k d d (2
2
2
+ )
(see (3.17) and take into account that the cubes Bh(x` ) (x` ), ` = 1; :::; M , are mutually disjoint by (i)). Thus, (3.26) results in
2 n ) k fbn(; yf ( )) f () kq;B C6bn Dd d=2(B ) + P2bn2 (p;k;d;q)L 2 (p;k;d;q) = C6 bn Dd d=2 (B ) + P2 L bLn
2
d 2d k 2d+d
(3.27)
30 .c) Now assume that 62 n . In this case, by (3.21),
jfbn(x; y) f (x)j P2Dk whence
h
d (B )L + C : 2 n
k fbn(; y) f () kq;B P2Dk
8x 2 B ;
d (B )L + C 2 n
i
Dd=q (B ):
(3.28)
n
E k fbn(; y) f () k2q;B
o1=2
b n 2 (p;k;d;q)
C7 max bnD (B ); P4 L L ; J (f ) ; n h io1=2 J (f ) = E 62n P2D2k 2d (B )L2 + C2 22n P2Dk d (B )LProb1=2 f 62 ng + C2 Prob1=2 f 62 ng (E f4ng)1=2 1=2 P2Dk d (B )LProb1=2 f 62 ng + C2pProb1=4 f 62 ng (E f4ng)1=4 P2Dk d (B )Ln 2(+1) + C2n (+1) ln n d 2d 2
[we have used (3.6) and (3.8)]
Thus, under assumptions (3.18) for all d < p < 1 and all q ,
n
E k fbn(; y) f () k2q;B
C7 max bn D P5 Dk
d 2d 2
o1=2
(B ); P4 L
b n 2 (p;k;d;q)
L (+1)
d (B )Ln 2(+1) ; C n 8
2k+d p d
p
(3.29)
q < 1 we have
;
(3.30)
ln n :
Now, it is easily seen that if P 1 is a properly chosen function of ; d; ; p nonincreasing in p > d and (3.15) takes place, then, rst, the assumption (3.18) is satis ed and, second, the right hand side in (3.30) does not exceed the quantity bn 2 (p;k;d;q) bn 2 (p;k;d;q) d(p;k;d;q) PL = PL D (B ) L L (see (3.16) and take into account that we are in the situation q 2kd+d p, so that (p; k; d; q ) = 0). We have obtained the bound (3.16) for the case of d < p < 1, 1 > q 2kd+d p; passing to limit as q ! 1, we get the desired bound for q = 1 as well. 40 . Now let d < p < 1 and 1 q q 2kd+d p: By Holder inequality,
k g kq;B k g kq;B jB j q
1
whence
Rcq fbn; S Rcq fbn; S Dd(1=q
1
q ;
1=q ) (B );
combining this observation with the (already proved) bound (3.16) associated with q = q , we see that (3.16) is valid for all q 2 [1; 1], provided that d < p < 1. Passing in the resulting bound to limit as p ! 1, we conclude that (3.16) is valid for all p 2 (d; 1], q 2 [1; 1].
3.4 Optimality index of the adaptive estimate

Let us first point out lower bounds for the minimax risks of estimating functions from local Sobolev balls. These bounds can be immediately derived from Theorem 2.1.1: by "similarity arguments", to recover functions from $S^{k,p}_d(B;L)$ via n observations (3.1) is clearly the same as to recover functions from $S^{k,p}_d([0,1]^d;L')$ via $nD^d(B)$ similar observations, where $L'$ is readily given by the parameters of the local Sobolev ball (in fact, $L' = D^{k-d/p}(B)L$). The results are as follows:

Theorem 3.4.1 Let a local Sobolev ball $\mathcal S = S^{k,p}_d(B;L)$ with a cube $B \subset [0,1]^d$ be given. Assume that the volume of observations n is large enough, namely,
\[
\frac{L\sqrt n}{\sigma}\ \ge\ D^{-\frac{2(k-\pi d)+d}{2}}(B) \qquad \Big[\pi = \frac1p\Big]. \tag{3.31}
\]
Then the minimax q-risk (3.2) of estimating functions f from the local Sobolev ball $\mathcal S$ via observations (3.1) can be bounded from below as
\[
\widehat{\mathcal R}_q(n;\mathcal S)\ \ge\ O_{k,d,\lambda}(1)\,L\left(\frac{\sigma}{L\sqrt n}\right)^{2\beta(p,k,d,q)} D^{d\vartheta(p,k,d,q)}(B), \tag{3.32}
\]
where $\beta(\cdot)$, $\vartheta(\cdot)$ are given by (3.16). If the volume of observations n is so large that
\[
\frac{L\sqrt n}{\sigma}\ \ge\ n^{\varepsilon}\,D^{-\frac{2(k-\pi d)+d}{2}}(B) \tag{3.33}
\]
for some positive $\varepsilon$, then in the case of "large" ratios q/p, namely, $q/p \ge \frac{2k+d}{d}$, the lower bound can be strengthened to
\[
\widehat{\mathcal R}_q(n;\mathcal S)\ \ge\ O_{k,d,\lambda,\varepsilon}(1)\,L\left(\frac{\sigma\sqrt{\ln n}}{L\sqrt n}\right)^{2\beta(p,k,d,q)} D^{d\vartheta(p,k,d,q)}(B). \tag{3.34}
\]

Comparing the statements of Theorems 3.3.1 and 3.4.1, we come to the following

Theorem 3.4.2 Let us fix the dimensionality d of the regression problem, a real $p > d$ and a nonnegative integer $\mu$, and let us associate with these data the nested family of local Sobolev balls $\mathbf S \equiv \mathbf S_{p,d,\mu} = \{\mathcal S_n\}_{n=1}^\infty$ defined as follows:
\[
\mathcal S_n = \Big\{S^{k,p'}_d(B;L)\ \Big|\ (a)\ p' \ge p,\quad (b)\ 1 \le k \le \mu+1,\quad (c)\ \text{the parameters }(k,p',B,L)\text{ satisfy the "large n" condition (3.15) (with } p' \text{ in the role of } p)\Big\}, \tag{3.35}
\]
where P is given by Theorem 3.3.1 and, as always,
\[
\hat\sigma_n = \frac{\sigma\sqrt{\ln n}}{\sqrt n}.
\]
The $\mathbf S$-nonoptimality index of the adaptive estimation method $\{\hat f_n\}_{n=1}^\infty$ from Section 3.2 is not worse than the logarithmic in n function
\[
\Psi(n) = (\ln n)^{\frac{\mu+1}{2(\mu+1)+d}}. \tag{3.36}
\]
We see that the nonoptimality index of our adaptive estimate on certain nested families of local Sobolev balls is "not too large" -- it grows with n only logarithmically. We are about to demonstrate that this logarithmic growth is, in a sense, an unavoidable price for "reasonable adaptive abilities". For the sake of definiteness, in the statement below the parameter $\lambda$ from (3.2) is assumed to be 0.5.

Proposition Let $\varepsilon > 0$. Consider the nested family $\mathbf S$ of local Sobolev balls given by
\[
\mathcal S_n = \Big\{S^{\mu+1,p}_d(B;L)\ \Big|\ P^{-1} n^{\frac{\varepsilon\,[2(\mu+1-\pi d)+d]}{2d}}\ \ge\ \frac{L}{\hat\sigma_n}\ \ge\ P\,D^{-\frac{2(\mu+1-\pi d)+d}{2}}(B)\Big\}, \tag{3.37}
\]
where, as always,
\[
\hat\sigma_n = \frac{\sigma\sqrt{\ln n}}{\sqrt n}
\]
(note that for small enough $\varepsilon$ this nested family is contained in the one of Theorem 3.4.2). There exist positive constants C, N such that for every estimation method $\{\hat f_n\}_{n=1}^\infty$ one has
\[
n \ge N\ \Longrightarrow\ \sup_{\mathcal S\in\mathcal S_n}\ \frac{\widehat{\mathcal R}_p\big(\hat f_n;\mathcal S\big)}{\widehat{\mathcal R}_p\big(n;\mathcal S\big)}\ \ge\ C\,(\ln n)^{\frac{\mu+1}{2(\mu+1)+d}}. \tag{3.38}
\]

Thus, the $\mathbf S$-nonoptimality index of every estimation method with respect to the nested family $\mathbf S$ is at least $O\big((\ln n)^{\frac{\mu+1}{2(\mu+1)+d}}\big)$. In particular, the adaptive estimation method from Section 3.2 possesses the best possible $\mathbf S$-nonoptimality index.
Proof of Proposition is similar to the one used by Lepskii [18] to demonstrate that
it is impossible to get optimal in order adaptive to smoothness estimator of the value of a smooth regression function at a given point. Let us x 2 (0; ") and an estimation method ffbn g, and let
k = + 1: 10 . Given n, consider the Sobolev ball S n from Sn with the largest possible B , namely, B = [0; 1]d , and the smallest possible, for our B and n, value of L { namely,
L = L(n) = bn P:
(3.39)
Let
c (fb ; S n ) r(n) = R p n be the p-risk of the estimate fbn on this ball, and let c (n; S n ) (n) = R p
be the corresponding minimax risk. From the results of Section 2.1 (see (2.15)) we know that k (n) Op;;d(1)L(n) Ls(nn) k d p d (3.40) = Op;;d(1)sn ln n k d ; sn = pn : Now let us set h(n) = n =d ; (3.41) 2 2 +
2 +
Sobolev balls Sdk;p(Lb (n); B ) with
k 2d+d
Lb (n) = bn P (2h(n)) ; D(B ) = 2h(n) [B [0; 1]d ]: 2
(3.42)
2
c (fb ; ) over all these balls. Let also b(n) be Let rb(n) be the upper bound of the risks R p n c the upper bound of the minimax risks Rp (n; ) over the same family of local Sobolev balls. From (2.15) we know that for large enough values of n one has
b(n)
2 +
Finally, let
k
2 2 +
Op;;d(1)Lb (n) Lbs(nn) k d (2h(n)) kkdd p d Op;;d(1)sn ln n k d hd d=2 (n): p 2 d ln n d=2 p h (n): Æ = Æ (n) = 5 n 2 2 +
(3.43)
(3.44)
20 . We claim that for all large enough values of n one has " # r(n)n=4 rb(n) 1 max ; d : (3.45) Æ Æh (n) 4 Postponing for a moment the justi cation of our claim, let us derive from (3.45) the assertion of Proposition. Indeed, by (3.45), { either r(n) 41 n =4 Æ p Op;;d;(1)snpln nn =4 h d=2 (n) [see (3.41)] Op;;d;(1)sn ln nnk =4 n=2 = 4 k d Op;;d;(1) (ln n) n (n) [see (3.40)], { or rb(n) 41 Æhd (n) p d d=2 (n) [see (3.44)] Op;;d;(1)sn ln nh k [see (3.43)] Op;;d;(1)(ln n) k d b(n) In both cases, the worst-case, over the local Sobolev balls from Sn , ratio of the risks k of fbn and the minimax risks associated with the balls is at least Op;;d;(1)(ln n) k d , as stated by Proposition (recall that k = + 1). 30 . To establish (3.45), let us look what happens when fbn is used to recover a particular function from S n { namely, the function f 0. The result will be some random function f~n depending deterministically on the observation noises ; by de nition of r(n), we have n o E k f~n kp r(n): (3.46) 2 +
2 +
2 +
Lemma 3.4.1 For all large enough values of n there exists a cube B [0; 1]d with edge length h(n) such that the twice larger concentric cube B + (B ) is contained in [0; 1]d and n o Prob k f~n kp;B > 2n=4 r(n)hd (n) 2n =8 ; (3.47)
the probability being taken w.r.t. the distribution of observations (3.1) associated with f 0.
of the event k f~n kp ur(n). From (3.46) it follows that Probf = 0g u 1 :
(3.48)
On the other hand, assuming that n is so large that h(n) < 0:1, we have 8 > < Z
) ) 9B : D(B ) = ) )
n
E >:
k (f~n) k9pp uprp(n) > =
j(f~n)(x)jpdx>
[0;1]d h(n); B + (B ) 8
E:
B
;
uprp(n)
[0; 1]d and9 = j(f~n)(x)jpdx; 2uprp(n)hd (n) o
Prob k nf~n kpp;B > 2v puprp(n)hd (n)o Prob k f~n kp;B > 2uvr(n)hd (n)
1 vp
Probf k f~n kpp;B > 2v puprp (n)hd (n)g + Probf = 0g 1 1 [see (3.48)] vp + u
It remains to note that uv = t = n=4 and u 1 + v p = 2t p=(p+1) 2n =8 . Let B be given by Lemma 3.4.1 and g be a continuous function taking values between 0 and Æ (n) and such that g is equal to Æ (n) on B + (B ) and vanishes outside twice larger than B + (B ) concentric to B + (B ) cube. Consider the following two hypotheses on the distribution of observations (3.1): H0 says that the observations come from f 0, while H1 says that they come from f = g . Let us associate with fbn the following procedure for distinguishing between H0 and H1 via observations (3.1): Given y , we build the function fbn (; y ) and restrict it on the cube B . If the p-norm of this restriction is 0:5Æhd (n), we accept H0 , otherwise we accept H1 .
We claim that if (3.45) is not valid, then our procedure possesses the following properties: (a) probability p1j1 to accept H1 in the case when H1 is true is at least 1/2; (b) probability p1j0 to accept H1 in the case when H0 is true is at most 2n =8 . Indeed, we have g 2 Sd+1;p (Lb (n); B + (B )). Now, whenever H1 is true and is rejected by our procedure, the k kp;B -error of estimate fbn , the true regression function being g, is at least 0:5Æhd (n); since the expectation of this error is at most rb(n) by origin of the latter quantity, 1 p1j1 is at most 2rb(n)(Æhd (n)) 1 ; if (3.45) is not valid, the latter quantity is 1=2, so that p1j1 1=2, as claimed in (a). Now, whenever H0 is true and is rejected by our procedure, we have k f~n kp;B 0:5Æhd (n). When (3.45) is not valid, we have 0:5Æhd (n) > 2r(n)n=4 hd (n), so that here k f~ kp;B 2r(n)n=4 hd (n), and the H0 -probability of the latter event, by (3.47), does not exceed 2n =8 , as claimed in (b).
associated with the hypotheses Hi , i = 0; 1, by construction does not exceed ln n K = (4h(n))d 2Æ2 n = 25 : As we remember from the proof of the Fano inequality (see Remark 1.4.1), the Kullback distance may only decrease when we pass from the original pair of distributions to their \deterministic transforms" { to the distributions of the results of our routine for hypotheses testing. Thus, denoting by pijj the probability to accept Hi when the true hypothesis is Hj , i; j = 0; 1, we get ln n 25
K p1j1 ln pp jj + p0j1 ln pp jj =
h
11
01
10
i0 0
p1j1 ln p1j1 + p0j1 ln p0j1 p1j1 ln p1j0 p0j1 ln p0j0 ln 2 p1j1 ln p =1j0 1 n ln 2 + 2 ln 2 [we have used (a) and (b)] 8
The resulting inequality cannot be valid for large values of n, so that for these values of n (3.45) does take place. We conclude this chapter with demonstrating a reasonably good numerical behaviour of the adaptive estimate we have built (for details of implementation, see [9]). Our numerical results deal with univariate functions and two-dimensional images. As the test univariate signals, we used the functions Blocks, Bumps, HeaviSine and Doppler given in [6, 5]. The level of noise in experiments is characterized by the signal-to-noise ratio
\[
\left(\frac{\sum_\iota f^2(x_\iota)}{n\sigma^2}\right)^{1/2};
\]
the less it is, the more difficult it is to recover the regression function.
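For completeness, the signal-to-noise ratio just defined is straightforward to compute; the snippet below does so for a crude piecewise-constant stand-in for the "Blocks" signal (the actual test signals are defined in [6, 5] and are not reproduced here), and shows how to pick the noise level giving a prescribed ratio.

```python
import numpy as np

def snr(f_vals, sigma):
    """Signal-to-noise ratio (sum_i f(x_i)^2 / (n * sigma^2))^(1/2)."""
    return np.sqrt(np.mean(f_vals ** 2)) / sigma

n = 2048
x = np.arange(1, n + 1) / n
# a crude piecewise-constant stand-in for the "Blocks" test signal
f_vals = np.select([x < 0.2, x < 0.4, x < 0.7], [4.0, -3.0, 6.0], default=0.5)
for target in (7.0, 3.0):
    sigma = np.sqrt(np.mean(f_vals ** 2)) / target   # noise level giving the target SNR
    print("target SNR", target, "-> sigma =", round(float(sigma), 3),
          " check:", round(float(snr(f_vals, sigma)), 3))
```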
[Figure 3.1: "Blocks", n = 2048 -- true signal, observations, and recovered signal ("combined estimate") at signal-to-noise ratios 7 and 3.]
[Figure 3.2: "Bumps", n = 2048 -- true signal, observations, and recovered signal at signal-to-noise ratios 7 and 3.]
[Figure 3.3: "HeaviSine", n = 2048 -- true signal, observations, and recovered signal at signal-to-noise ratios 7 and 3.]
[Figure 3.4: "Doppler", n = 2048 -- true signal, observations, and recovered signal at signal-to-noise ratios 7 and 3.]
[Figure 3.5: "Ball", n = 512^2 -- true image, observations, and recovered image at signal-to-noise ratios 3 and 1.]
[Figure 3.6: "Lennon", n = 256^2 -- true image, observations, and recovered image at signal-to-noise ratios 3 and 1.]
Chapter 4

Estimating signals satisfying differential inequalities

4.1 The goal

Let us look again at the problem of recovering a univariate function $f: [0,1]\to\mathbf R$ via n equidistant observations
\[
y = y_f(\xi) = \{y_t = f(t/n) + \sigma\xi_t\}_{t=1}^n,
\]
$\xi = \{\xi_t\}_{t=1}^n$ being a collection of independent N(0,1) noises. To the moment we have developed a number of theoretically efficient techniques for solving this problem in the case when f is a smooth function -- when it belongs to a (local) Sobolev ball. At the same time, these techniques fail to recover regression functions, even very simple ones, possessing "bad" parameters of smoothness. Assume, e.g., that $f(x) = \sin(\omega x)$ and the frequency $\omega$ of this sine may be large. In spite of the fact that a sine is extremely "regular", there does not exist a single Sobolev ball containing sines of all frequencies. As a result, with the techniques developed so far (as with all other traditional regression estimation techniques -- all of them are aimed at estimating functions with somehow fixed smoothness parameters) the quality of recovering a sine is the worse the larger is the frequency, and no uniform in frequency rate of convergence is guaranteed. In fact, all our estimates as applied to a sine of high frequency will recover it as zero. The same unpleasant phenomenon occurs when the function f to be recovered is an "amplitude modulation" of a smooth (belonging to a given Sobolev ball) signal g:
\[
f(x) = g(x)\sin(\omega x + \phi) \tag{4.1}
\]
and the frequency $\omega$ is large. We are about to extend our estimation techniques from the classes of smooth functions to wider classes including, in particular, the signals of the type (4.1). This extension comes from the following simple observation:
A Sobolev ball $S^{k,p}(B;L) \equiv S^{k,p}_1(B;L)$ is comprised of functions satisfying the "differential inequality"
\[
\left\| r_k\!\left(\frac{d}{dx}\right) f \right\|_{p,B} \le L \tag{4.2}
\]
associated with the linear differential operator
\[
r_k\!\left(\frac{d}{dx}\right) = \frac{d^k}{dx^k}
\]
of order k. A natural way to extend this particular family of functions is to consider functions satisfying differential inequalities associated with other linear differential operators of order k, each function being "served" by its own operator.

[Figure 4.1: An "amplitude modulated" signal.]
Guided by the above observation, we come to the family of functions $\mathcal D^{k,p}(B;L)$ defined as follows:

Definition 4.1.1 Let k be a positive integer, $p \in [1,\infty]$, $L > 0$ and B be a segment contained in [0,1]. We say that a function $f: [0,1] \to \mathbf C$ $^{1)}$ is contained in the family $\mathcal D^{k,p}(B;L)$, if f is $k-1$ times continuously differentiable, $f^{(k-1)}$ is absolutely continuous and there exists a linear differential operator with constant (perhaps complex-valued) coefficients
\[
r\!\left(\frac{d}{dx}\right) = \frac{d^k}{dx^k} + r_1\frac{d^{k-1}}{dx^{k-1}} + \dots + r_{k-1}\frac{d}{dx} + r_k
\]
such that
\[
\left\| r\!\left(\frac{d}{dx}\right) f \right\|_{p,B} \le L.
\]

The families $\mathcal D^{k,p}(B;L)$ are wide enough to contain both the functions from the usual Sobolev ball $S^{k,p}(B;L)$ and the sines of arbitrary frequencies: a sine is a solution of a homogeneous differential equation of order 2, so that $\sin(\omega t + \phi) \in \mathcal D^{2,p}([0,1];0)$.

$^{1)}$ In this chapter it is more convenient to deal with complex-valued functions than with real-valued ones.
signals from Dk;p(; ). Indeed,
g 2 S k;p (L) ) f (x) g (x) sin(!x + ) = f1 (x) + f2 (x); f1 (x) = 21i g (x) expfi(!x + )g; f2 (x) = 21i g (x) expf i(!x + )g; setting we have
i! )k ; r2 (z ) = (z + i! )k ;
r1 (z ) = (z
rj
d dx
fj (x) = "j 21i expf"j i(!x + )gg (k)(x); "j = ( 1)j
k
rj
d dx
+
1
fj kp 21 L;
so that function (4.1) associated with g 2 S k;p(B ; L) is the sum of two functions from Dk;p(B ; L=2). Motivated by the latter example, we see that it makes sense to know how to recover regression functions from the families W l;k;p(B ; L) de ned as follows:
De nition 4.1.2 Let k; l be positive integers, p 2 [1; 1], L > 0 and B be a segment contained in [0; 1]. We say that a function f : [0; 1] ! C belongs to the family W l;k;p(B ; L), if f can be represented as
f (x) = with fj 2 Dk;p(B ; Lj ) and
l X j =1
l X j =1
fj (x)
Lj L:
Below, we build estimates for regression functions from classes W l;k;p(B ; L); since we have agreed to work with complex-valued functions, it makes sense to speak about complex-valued noises, so that our model of observations from now on will be
y = yf ( ) = fyt = f (t=n) + t gnt=1 ;
(4.3)
where = ft gnt=1 is a collection of independent complex-valued standard Gaussian noises (i.e., of random 2D real Gaussian vectors with zero mean and the unit covariance matrix). Of course, if the actual observations are real, we always can add to them arti cial imaginary Gaussian noises to t the model (4.3). Note that when recovering a highly oscillating function f via observations (4.3), we may hope to say something reasonable only about the restriction of f on the observation grid n = fxt = t=ngnt=1 , and not on the behaviour of f outside the grid. Indeed, it may happen that f is a sine of amplitude 1 which vanishes on n, so that observations (4.3) give no hint that f is not identically zero. By the just indicated
n
only, and we measure the estimation error in the \discrete versions" of q -norms 0
jgjq;B = @ n1
11=q
X
x2 n \B
jg(x)jq A ;
with the standard interpretation of the right hand side in the case of q = 1; here g is a complex-valued function de ned at least on n , and B [0; 1] is a segment. We shall see that our possibilities to recover functions from class W l;k;p(B ; L) are essentially the same as in the case when the functions belong to the Sobolev ball S1k;p(B ; L) (up to the fact that now we are recovering the restriction of a function on n rather than the function itself), in spite of the fact that the former class is \incomparably wider" than the latter one. Our strategy will be as follows. When estimating a smooth function f { one satisfying the dierential inequality
k rk
!
d f dx
kp L
[rk (z ) = z k ]
{ at a point x, we observe that locally it can be well approximated by a polynomial of degree k 1, i.e., by a solution of the homogeneous dierential equation !
rk
d p=0 dx
associated with our dierential inequality; and when estimating f (x), we act as if f were equal to its local polynomial approximation in the neighbourhood of x used by the estimate. Basically the same strategy can be used for estimating a regression function satisfying a general dierential inequality !
d f k r dx
kp L;
[deg r = k]
with the only dierence that now a \local model" of f should be a solution of the associated homogeneous equation !
d r p=0 dx
(4.4)
rather than an algebraic polynomial. This is, however, an essential dierence: it is easy to act \as if f were an algebraic polynomial", because we know very well how to recover algebraic polynomials of a given order from noisy observations. Now we need to solve similar recovering problem for a solution to unknown homogeneous dierential equation of a given order, which by itself is a nontrivial problem. We start with this problem; after it is resolved, the remaining part of the job will be carried out in the same manner as in the standard case of estimating smooth regression functions.
When restricting a solution of a homogeneous dierential equation (4.4) on an equidistant grid, we get a sequence satisfying a homogeneous nite-dierence equation. Since we are interested to recover signals on the grid only, we may temporarily forget about \continuous time" and focus on estimating sequences satisfying nite-dierence equations.
4.2.1 Preliminaries
Space of sequences. Let F be the space of two-sided complex-valued sequences = ft gt2Z , and F be the subspace of \ nite" sequences { those with nitely many nonzero entries. In what follows we identify a sequence = ft g 2 F with the rational function
(z ) =
X
t
t z t :
The space F is equipped with the natural linear operations - addition and multiplication by scalars from C, and F { also with multiplication ( )(z ) = (z ) (z ) (which corresponds to the convolution in the initial "sequence" representation of the elements of F ). For 2 F we denote by deg() the minimum of those 0 for which t = 0, jtj > , so that
(z ) =
X
jtjdeg()
t z t ;
if is a sequence with in nitely many nonzero entries then by de nition deg() = 1. Let FN denote the subspace of F comprised of all with deg() N ; clearly, one always has 2 Fdeg() (by de nition F1 F ). Further, let stand for the backward shift operator on F : ()t = t 1 : Given 2 F , we can associate with the nite dierence operator () on F : (
() =
X
s
)
s
t s
t2Z
;
2 F:
Discrete Fourier transformation. Let N be a nonnegative integer, and let GN
be the set of all roots
(
)
2k k = exp i ; k = 0; 1; :::; 2N; 2N + 1 of the unity of the degree 2N +1. Let C(GN ) be the space of complex{valued functions on GN , i.e., the vector space C2N +1 with the entries of the vectors indexed by the
the usual formula (FN )( ) =
X 1 t t ; 2 GN : 2N + 1 jtjN
p
Clearly, for 2 FN one has
1 ( ); 2 GN : 2N + 1 The inverse Fourier transformation is given by X 1 (FN )( ) t; jtj N: t = p 2N + 1 2GN (FN )( ) =
p
Norms on F . For 0 N 1 and p 2 [1; 1] let 0
k kp;N = @
11=p
N X t= N
jtjpA
(if p = 1, then the right hand side, as usual, is maxjtjN jt j). This is the standard p-seminorm on F ; restricted on FN , this is an actual norm. We shall omit explicit indicating N in the notation of the norm in the case of N = 1; thus, k kp is the same as k kp;1. Let 2 F be such that there exists a positive integer k satisfying (i)
k k1;k = 1,
(ii) the smallest of t's with nonzero t is zero, and the largest is k;
in this case we say that 2 F is normalized polynomial of the degree k. In the other words, the sequence from F is normalized polynomial of the degree k if it k P can be identi ed with polynomial (z ) = t z t with 0max j j = 1. tk t t=0 It is well{known that the Fourier transformation FN being restricted on FN is an isometry in 2-norms:
h; iN
X
jtjN
t
t = hFN ; FN
i
X
2GN
(FN )( )(FN )( ); ;
2 F;
(4.5)
where a denotes the conjugate of a 2 C. The space C(GN ) also can be equipped with p-norms 0 1
k g() kp=
@
X
2GN
jg( )j
pA
1=p
with the already indicated standard interpretation of the right hand side in the case of p = 1. Via Fourier transformation, the norms on C(GN ) can be translated to F , and we set k kp;N =k FN kp; these are seminorms on F , and their restrictions on FN are norms on the latter subspace.
the sequel.
k k2;N =k k2;N ; k kp;N k k1k kp;N +deg() ; p k k1;N k k1;N 2N + 1; k k1;N (2N + 1)1=2 1=p k kp;N ; deg() + deg( ) N )k k1;N k k1;N k k1;N ;
(4.6) (4.7) (4.8)
(4.9) (4.10) Proofs of the above inequalities are straightforward; we note only that (4.6) is the Parseval equality, and (4.7) is the Young inequality.
4.2.2 Estimating sequences The problem we want now to focus on is as follows. Assume we are given noisy observations
y = yf ( ) = fyt = ft + t gt2Z (4.11) of a sequence f 2 F ; here ft g is a sequence of independent random Gaussian 2D noises with zero mean and the unit covariance matrix. Assume that f \nearly satis es" an (unknown) nite-dierence equation of a given order k: j()f j "; (4.12) for some normalized polynomial of degree k; here " is small. We want to recover a given entry of f , say, f0 , via a given number of observations (4.11) around the time instant t = 0. For our purposes it is convenient to parameterize the number of observations we use to estimate f0 as 8T + 1;
where is a once for ever a priori xed positive integer (\order" of the estimate to be built) and T 2 N is the parameter (\window width"). Thus, we want to estimate f0 via the vector of observations
y T = fyt gjtj4T :
The idea of the estimate we are about to build is very simple. Assume for a moment that our signal satis es a homogeneous dierence equation { " = 0. If we knew the underlying dierence operator , we could use the Least Squares approach to estimate f , and the resulting estimator would be linear in observations. By analogy, let us postulate a \ lter" form
fb =
X
jsj2T
s y s :
(4.13)
of estimate of f in the case of unknown as well (By reasons which will become clear in a moment, our lter recovers f via reduced number of observations { 4T +1 observations around instead of the allowed number 8T + 1.) If we knew , we could specify \good weights" s in advance, as we did it when estimating algebraic polynomials. Since we do not know , we should determine
j sj2 should be small enough in order to suppress the observation noises. that jsj2T Imposing such a restriction on the weights s , we can determine the weights themselves by a kind of \bootstrapping" { by tting the output ffb g of our lter to its input { to the sequence of observations fyt g. Our hope is that if our lter suppresses the noises, then the only possibility for the output to \reproduce" the input is to reproduce its deterministic component f { since the \white noise" component of the input (the sequence of observation noises) is \irreproducible". In other words, let us form the residual g [T; ; y ] 2 F according to P
8 < t
gt [T; ; y ] = :
y + 0;
X
jsj2T
s yt s ;
jtj 2T ; jtj > 2T
(4.14)
and let us choose the weights by minimizing a properly chosen norm of this residual in under the restriction that the lter associated with \suppresses the observation noises". After these weights are found, we use them to build the estimate fb0 of f0 according to (4.13). Note that a procedure of the outlined type indeed recovers f0 via y T , since our residual depends on y T rather than on the entire sequence of observations (the reason to reduce the number of observations used by fb was exactly the desire to ensure the latter property). We have outlined our \estimation strategy" up to the following two issues: (a) what is an appropriate for us form of the restriction \the lter with weights suppresses the observations noises"; (b) what is a proper choice of the norm used to measure the residual. Surprisingly enough, it turns out that it makes sense to ensure (a) by imposing an upper bound on the k k1 -norm of the Fourier transform of , and to use in (b) the k k1-norm of the Fourier transform of the residual. The \common sense" reason for such a choice is that the dierence between a highly oscillating \regular" signal observed in noise and the noise itself is much better seen in the frequency domain than in the time domain (look at the plots below!)
The estimate we have outlined formally is de ned as follows. Let us x a positive integer { the order of our estimate. For every positive T we de ne the estimate fb[T; y ] of f0 via observations y T = fyt gjtj4T , namely, We associate with T; y the optimization problem (PT [y ]) : s.t. (a) (b)
k g[T; ; y] k1;2T ! min 2 F2T ; q k k1;2T (T ) 22+2 T :
As we remember, for 2 F2T the residual g [T; ; y ] depends on y T only, so that our optimization problem involves only the observations yt with jtj 4T . The problem
6
120
4
100
2
80
0
60
F128
)
z}|{
−2
−4
−6
40
20
0
50
100
150
200
250
0
300
3
120
2
100
1
80
0
60
F128
)
z}|{
−1
−2
−3
0
50
100
150
200
250
300
0
50
100
150
200
250
300
40
20
0
50
100
150
200
250
300
0
Figure 4.2: Who is who? a noisy sum of 3 sines and the modulus of its Fourier transform (257 observations, signal-to-noise ratio 1) Down: noise and the modulus of its Fourier transform
Up:
be a Borel function of observations. By de nition,
fb[T; y ] =
X
s [T; y
b
jsj2T
T ]y
s:
The main result on the estimate we have built is as follows. Theorem 4.2.1 Let k; l be two positive integers such that kl , being the order of the estimate fb[; ];
T be a positive integer; " 0.
Assume that the sequence f underlying observations (4.11) can be decomposed as
f=
l X j =1
fj
(4.15)
and for every component f j there exists normalized polynomial j of degree k such that l X k j ()f j kp;4T ": (4.16) j =1
Then the inaccuracy of the estimate fb[T; ] of f0 can be bounded from above as follows: h
jf0 fb[T; y]j C T k
i
1=p " + T 1=2 T ( )
(4.17)
where C depends on only and
T ( ) = maxjsj2T k s k1;2T 2T X 1 t : p = max max t s jsj2T 2G T 4T + 1 t= 2T
(4.18)
2
Proof. Let us x f satisfying the premise of our theorem, and let j ; fj be the associated sequences. 10 . We start with
Lemma 4.2.1 There exists 2 F2T such that (i) (z ) = Æ (z ) + ! (z ), Æ (z ) 1 being the convolution unit, with and
k k 1 2 p
k ! k1;N 2 2NT + 1 8N 2T ;
(4.19) (4.20)
(ii) for every j = 1; :::; l there exists representation
(z ) = j (z )j (z ) :
j 2 F2T ; k j k1 22 T k 1 :
(4.21)
20 . Let be a feasible solution of the optimization problem (PT [yf ( )]). We claim that the value of the objective at can be bounded from above as follows:
k g[T; ; yf ( )] k1;2T k8g[ ; fX ] k1;2T +22+4 T ( ); < ft + s ft s ; jtj 2T gt[ ; f ] = : : jsj2T 0; jtj > 2T
(4.22)
Indeed, we have (a) (b)
)
g [T; ; yf ( )] = 8 g [ ; f ] +X h[ ; ]; < t + s t s ; h[ ; ] = : jsj2T 0;
k g[T; ; yf ( )] k1;2T (c) k h[ ; ] k1;2T
jtj 2T jtj > 2T k g[ ; f ] k1;2T P+ k h[ ; ] k1;2T ; k k1;2T + j sj k s k1;2T jsj2T
[by de nition of h] "
T ( )
1+
P
jsj2T
#
j sj
(4.23)
[see (4.18)] = T ( ) [1+ k k1;2T ] p h i T ( ) 1+ k k1;2T 4T + 1 [by (4.8)] 22+4 T ( ) [in view of the constraints in (PT [])]; and (4.22) follows. 30 . We claim that the optimal value P in (PT [yf ( )]) can be bounded from above as follows: P 22+3 3=2 T 1=2+k 1=p " + 22+4 T ( ): (4.24) Indeed, let ; ! 2 F2T be given by Lemma 4.2.1. Applying (4.20) with N = 2T , we conclude that ! is a feasible solution of PT [yf ( )]. In view of (4.22), to prove (4.24) it suÆces to verify that
k g[!; f ] k1;2T 22+3 3=2 T 1=2+k which is given by the following computation. Let
j = j ()f j ; j = 1; :::; l:
1=p "
(4.25)
k g[!; f ] k1;2T = = =
l P
k g [!; f j ] k1;2T j =1 [since g [!; ] is linear in the second argument] l P k f j + !()f j k1;2T j =1 l P
j =1 l P j =1
k ()f j k1;2T
k j () [j ()f j ] k1;2T
[the origin of j , see Lemma 4.2.1.(ii)] =
as required in (4.25). 40 . We claim that
l P
j =1
k j j k1;2T l P
(4T + 1)1=2 1=p k j j kp;2T j =1 [by (4.9) applied with N = 2T ] l P
(4T + 1)1=2 1=p k j k1 k j kp;4T j =1 [by (4.7) and since deg(j ) 2T ] l P
(4T + 1)1=2 1=p (2T + 1) k j k1k j kp;4T j =1 [since j 2 F2T ] (4T + 1)1=2 1=p (2T + 1)22 T k [by (4.21)] 22+3 3=2 T 1=2+k 1=p " [see (4.16)];
k g[!; f ] k1 22+3 T k
1=p ":
l 1 P j =1
k j kp;4T
(4.26)
k g[!; f ] k1 = = = =
l P j =1 l P j =1 l P j =1 l P j =1 l P j =1 l P
k g[!; f j ] k1 k f j + !()f j k1;2T k ()f j k1;2T k j () [j ()f j ] k1;2T k j j k1;2T
k j k1 max k s j k1;2T jsj2T j =1 [since deg(j ) 2T ] l 22T k 1 jP=1 jsmax k sj k1;2T j2T [by (4.21)]
l P
22 T k 1 (4T + 1)1 1=p k j kp;4T j =1 [by Holder inequality] 22+3 T k 1=p " [see (4.16)];
as required. 50 . Let us x a realization of the noises, and let solution of (PT ). By (4.24) one has 22+3 3=2 T 1=2+k
1=p " + 22+4 T ( )
b
be the corresponding optimal
P = k g [ b; f ] + h[ b; ] k1;2T
[see (4.23.a)] k g[ b; f ] k1;2T
k h[ b; ] k1;2T ;
whence
k g[ b; f ] k1;2T A( ) 22+3 3=2 T 1=2+k 22+3 3=2 T 1=2+k [see (4.23.c)]
1=p " + 22+4 T ( )+ 1=p " + 22+5 T ( )
k h[ b; ] k1;2T (4.27)
60 . Let ; ! be the same as in 30 {40 , and let
= (1 + b())f: Note that by the de nition of g one has
gt [ b; f ] = t We claim that
8t : jtj 2T:
j(())0j 24+5 2T k
1=p ":
(4.28) (4.29)
)
() = ()(1 + b())f = (1 + b())[ ()f ] P b ( ()f ) ( ())0 = ( ()f )0 + s = g0 [!; f ] +
) j(())0j
jP sj2T
jsj2T
s
s g s [!; f ]
b
[since gs [!; f ] = ( ()f )s , jsj 2T ] jg0[!; f ]j + k b k1;2T k g[!; f ] k1;2T [by Parseval equality and since jhu; v ij k u k1 k v k1 , u; v 2 C(G2T )] 22+3 T k 1=p"+ k b k1;2T k g [!; f ] k1;2T [by (4.26)] 22+3 T k 1=p" + 22+2 1=2 T 1=2 22+3 3=2 T 1=2+k 1=p " [since is feasible for (PT ) and by (4.25)];
as claimed. 70 . Now { the concluding step. Setting fb = fb[T; y ], we have
f0
fb = f (0) + b()y 0 [the construction of the estimate] b = (1 + ())f 0 + b() 0 = 0 + b() 0 [the de nition of ]
) P b b jf0 f j j0j + s s jsj2T j0j + k b k1;2T k k1;2T Thus,
[same as in the previous computation] j0j + 22+21=2 T 1=2 T ( ) [since b is feasible for (PT ) and by de nition of T ];
jf0 fbj j0j + 22+2 1=2 T
1=2 T ( ):
(4.30)
0 = ( ())0 (! ())0 ) j0j j(())0 j + j(!())0 j 24+5 2T k 1=p" + j(!())0j
[see (4.29)] 24+5 2T k 1=p"+ k ! k1;2T k k1;2T [as in the previous two computations] = 24+5 2 T k 1=p" + 22+2 1=2 T 1=2 k k1;2T [by (4.20) applied with N = 2T ] 24+5 2T k 1=p" + 22+2 1=2 T 1=2 k g[ b; f ] k1;2T [by (4.28)] 24+5 2T k 1=p" h i +22+2 1=2 T 1=2 22+3 3=2 T 1=2+k 1=p " + 22+5 T ( ) [by (4.27)]
Thus,
j0j 24+6 2T k
1=p " + 24+7 3=2 T 1=2 T ( ):
Combining this inequality with (4.30), we come to (4.17).
4.2.3 Discussion Theorem 4.2.1 has a number of important consequences already in the \parametric case" { when the signal f we observe according to (4.11) satis es a homogeneous nite dierence equation with constant coeÆcients:
()f
0;
(4.31)
being normalized polynomial of degree . In the notation of Theorem 4.2.1, this is in the case when l = 1, k and " = 0.
A) In the case of (4.31) relation (4.17) becomes
jf0 fb[T; yf ( )]j CT
1=2 T ( ):
Due to the origin of T , we have
E
so that We see that
n
n
(T ( ))2
o1=2
E jf0 f [T; yf ( )]j b
2
O(1)
o1=2
(4.32)
p
ln T ; s
O(1) lnTT :
(4.33)
(!) For every T , it is possible to recover an entry ft in a sequence f satisfying unknown homogeneous dierence equation with constant coeÆcients of a given order via O (1)T noisy observations of the entries of the q ln sequence around the instant t with \nearly parametric risk" O (1) TT .
all dierence equations of a given order, which is rather surprising. Note that if
the equation were known in advance, p the quality of recovering ft could be slightly improved { we could get rid of the ln T -factor, thus coming to the result completely similar to the case of recovering algebraic polynomials of order 1 (their restrictions on an equidistant observation grid are the solutions of a particular nite dierence equation of order , namely, (1 ) f = 0). In the case when the equation is unknown, the logarithmic factor turns out to be unavoidable: it is proved in [22] that when the signal to be recovered is known to be a harmonic oscillation ft = c sin(!t + ), the uniform, with respect to all values of c; !; , risk of an arbitrary estimate qof f0 via 2T + 1 observations (4.11) around the time instant t = 0 is at least O(1) lnTT . Note that the problem of recovering a harmonic oscillation c sin(!t + ) is a parametric problem; indeed, all we need is to recover the triple of parameters c; !; . As we see, the minimax risk associated with this parametric estimation problem is not the parametric risk O(T 1=2 ).
B) The estimate we have built solves an \interpolation" problem { it recovers f0 via observations \placed symmetrically" around the time instant t = 0 we are interested
in. In some applications we should solve the \forecast" problem { we would like to estimate f0 via a given number of observations (4.11) placed at least units of time before the instant t = 0, i.e., via the observations y 4T ; y 4T +1 ; :::; y . What can be done in this situation? Slight modi cation of the construction we have presented demonstrates the following: (!!) In addition to the premise of Theorem 4.2.1, assume that every nitedierence equation
j ()h = 0
is \quasi-stable": every solution of this equation grows with t no faster than an algebraic polynomial (equivalently: all roots of the polynomial j (z ) are 1 in absolute value). Then the result of the theorem is valid for a properly chosen \forecast" estimate, namely, for the estimate b+
f [T; y ] = where
b
4X T
s=T
s y s;
b
is an optimal solution to the optimization program
s.t.
k 4T (I +
())y k1;4T ! min
k
s 1;4T
k
2 F4T ;
= 0; 4T s < T ; B ()T 1=2
with properly chosen B (m).
As a consequence, given N subsequent noisy observations (4.11) of a solution to an unknown quasi-stable homogeneous dierence equation of order , we may predict
solutions of all quasi-stable equations of order , risk not exceeding O (1) lnNN . Note that the assumption of quasi-stability of the underlying nite-dierence equation is crucial in the forecast problem. E.g., given all observations yt , t < 0, of a solution to a known (unstable) equation
ft+1
2ft = 0;
you cannot say de nitely what is the solution at 0 (provided, of course, that > 0).
4.3 From sequences to functions The main step in passing from estimating sequences \nearly satisfying" homogeneous dierence equations to estimating functions satisfying dierential inequalities is given by the following simple
Lemma 4.3.1 Let
n, k, k and T be positive integers; g : ( 1; 1) ! C be a k 1 times continuously dierentiable function with absolute continuous g (k
1) ;
gn 2 F be the restriction of g on the grid
n
= ft=ng1 t= 1 :
gtn = g (t=n) t 2 Z;
q(z)
= z k + q1 z k coeÆcient;
1
+ ::: + qk be a polynomial of degree k with unit leading
B be a segment centered at the origin and containing at least 8T +2k +1 points of the grid
n;
p 2 [1; 1]. There exists a normalized polynomial (z ) of degree k such that
k
()g n
kp;4T Ok (1)n
k+1=p
!
d g kp;B : k q dx
(4.34)
The proof is placed in Section 4.4. Combining Lemma 4.3.1 with Theorem 4.2.1, we can extend { in a quite straightforward manner { basically all estimation techniques we have considered so far to the case of functions satisfying unknown dierential inequalities. We shall focus on \the best" { the spatial adaptive { estimate.
The recovering routine we are about to build, same as the spatial adaptive estimate from Chapter 3, is speci ed by a single parameter { its order which should be a positive real. Let observations (4.3) be given, and let x = t=n be a point from the observation grid. For every positive integer T such that the grid contains 8T + 1 observations around x { i.e., such that 0 < t 4T , t + 4T n { we have built in the previous Section an estimate fbT (x; y ) of f (x) via the segment of observations fyt 4T ; yt 4T +1; :::; yt+4T g. Let us associate with the estimate fbT (x; y) its window
BT (x) = [x (4T + 2)n 1 ; x + (4T + 2)n 1 ]: From Theorem 4.2.1 and Lemma 4.3.1 we know that
(*) Let f be the function underlying observations (4.3), x = t=n be a point from the observation grid n , and let T 1 be such that the window BT (x) is contained in [0; 1]. (i) For every collection U comprised of positive integers k; l with kl ; l polynomials j , j = 1; :::; l, normalized of degree k each; a decomposition f (u) =
l X j =1
f j (u); u 2 BT (x);
p 2 [1; 1]
the error of the estimate fbT (x; y ) can be bounded from above as h
i
jfbT (x; yf ( )) f (x)j C1() "(T; U ) + T 1=2 n( ) ; l P "(T; U ) = T k 1=p k j ()fej kp;4T ; j =1 fesj = f j
s t n
(4.35)
;
where n( ) is the maximum of the k k1-norms of discrete Fourier transforms of all segments, of odd cardinality, of the sequence fsgns=1 (so that n is the maximum of norms of n2 standard Gaussian 2D vectors with zero mean and unit covariance matrix). (ii) Let l; k be positive integers with kl , let p 2 [1; 1] and let f 2 W l;k;p(BT (x); A) for some A. Then there exists a collection U of the type described in (i) such that
"(T; U ) C2 ()(T=n)k
1=p A
C3()Dk
1=p (B
T (x))A;
(4.36)
here, as always D(B ) is the length of a segment B .
Combining (*.i) and (*.ii) and observing that T 1=2 , up to a factor depending on only, is the same as pnD(1B (x)) , we come to the conclusion as follows: T
us set
(f; B ) = inf fDk 1=p (B )A j p 2 [1; 1]; k; l 2 N; kl ; A 0; f 2 W l;k;p(B ; A)g: Then for every point x = t=n from the observation grid n and every integer T 1 such that BT (x) [0; 1] one has 2
3
jfbT (x; yf ( )) f (x)j C () 4f (x; BT (x)) + q n ( )5 : nD(BT (x)) (4.37)
Besides this,
p
(
)
w2 ln n : (4.38) 8w 1 : Prob n O(1)w ln n exp 2 Finally, from the de nitions of the classes W and the quantity it immediately follows that If f 2 W l;k;p(B ; L) with lk , then there exists a function fe : B ! R+ such that k fe kp;B L; (4.39) 8B 0 B : (f; B 0) Dk 1=p(B 0) k fe kp;B0 Note that (4.37), (4.38), (4.39) are completely similar to the basic relations (3.5), (3.17), (3.6), respectively, underlying all developments of Chapter 3. n
o
4.3.2 Spatial adaptive estimate: construction and quality The construction. Let us choose ! = ! () so large that n p o Prob n > ! ln n n 4
(4.40)
(cf. (3.8)). The adaptive estimate fbn (x; y ) of the value f (x) at a point x 2 n is as follows (cf. Section 3.2). Let us say that a positive integer T is admissible for x, if the segment BT (x) is contained in [0; 1]. Assume that x admits admissible T 's, i.e., that 6n 1 < x < 1 6n 1 : (4.41) We already have associated with every T admissible for x certain estimate fbT (x; ) of f (x) via observations (4.3). Given these observations y , let us call a positive integer T x normal for x (cf. (3.13)), if it is admissible for x and
p
jf (x; y) fbT (x; y)j 4C () q ! ln n nD(BT 0 (x)) bT 0
8T 0; 1 T 0 T;
C () being the constant from (4.37). Normal for x values of T clearly exist (e.g., T = 1); let T (x; y ) be the largest of these values; note that this indeed is a wellde ned deterministic function of x; y . Our order adaptive estimate of f (x), by construction, is fbn (x; y ) = fbT (x;y) (x; y ): (4.42)
Theorem 4.3.1 Let 2 (0; 1), let be a positive integer, and let W = W l;k;p(B ; L), where kl and pk > 1. For properly chosen P 1 depending solely on ; p; and nonincreasing in p the following statement takes place: If the volume n of observations (4.3) is large enough, namely, P 1n
k 2+1
2
h
2
qbLn P D ln n ; n
bn =
k 2+1
2
=
2
1 p
i
(B )
(4.43)
(D(B ) is the length of segment B ), then for every q 2 [1; 1] the worst case, with respect to W , discrete q -risk of the order adaptive estimate fbn (; ) can be bounded as follows (cf. (3.16)):
n
Rfq fbn; W supf 2W E jfb(; yf ( )) f ()j2q;B P L bLn 2 (p;k;q) D(p;k;q)(B ); ( k ; 2k1+1 (p; k; q ) = 2kk++1 1 ; 2k 2+1 ; 2k+1 =
(p; k; q ) =
1; q
(
0;
2k+1 ;
o1=2
(4.44)
2k1+1 ; 2k1+1
here B is the concentric to B times smaller segment and 11=q
0
X jg(x)jq A jgjq;B = @ n1 x2 n \B
:
Proof of the theorem repeats word by word the proof of Theorem 3.3.1, with (4.37), (4.38), (4.39) playing the role of (3.5), (3.17), (3.6), respectively.
Optimality issues. The upper bounds on risks given by Theorem 4.3.1 are exactly the univariate (d = 1) versions of bounds from Theorem 3.3.1. Since now we are
working with wider families of functions wider than local Sobolev balls, all results of Chapter 3 (see Section 3.4) on the non-optimality index of the adaptive estimate remain valid for our new estimate as considered on the nested family of collections of regression functions (cf. Theorem 3.4.2)
W Wp; = fWng1 n=1
(p 2 (1; 1], 2 N) de ned as follows:
(a) p0 p; 1 kl ; Wn = >W l;k;p0 (B ; L) (b) > : bLn P D3 (c) P 1 n 2 1 P is given by Theorem p 4.3.1, = p 5 4 bn = plnn n 8 > > <
2
2 +1 2
9 > > =
2+1
2
2
(B );
> > ;
(4.45)
(n) = (ln n)
2 +1
:
2)
4.3.3 \Frequency modulated signals" A function f from class W l;k;p([0; 1]; L) is a sum of l functions f j satisfying each its
own dierential inequality of order k on the entire segment [0; 1]. What happens if we \localize" this property, allowing the decomposition to vary from point to point? The precise de nition is as follows:
De nition 4.3.1 Let us x positive integers k; l and reals p 2 [1; 1], L > 0, d 2 (0; 1=6]. We say that a function f : [0; 1] ! C belongs to the class Al;k;p;d(L), if there exists a function Lf (x) 2 Lp [d; 1 d] such that
k Lf kp;[d;1 d] L and
8x 2 [d; 1 d] : f 2 W l;k;p([x d; x + d]; (2d)1=pLf (x)): Note that the classes A extend our previous classes W : 8d 2 (0; 1=6] : W l;k;p([0; 1]; L) Al;k;p;d(L) Indeed, let f 2 W l;k;p([0; 1]; L), let f = tion, and let
fj
j =1 q j (z ) = z k + q1j z k 1 + ::: + qkj be l X j =1
Let us set
l P
k qj
(4.46) (4.47)
(4.48)
be the corresponding decomposithe associated polynomials:
d f j kp L: dx
l X d f j kp;[x d;x+d] Lf (x) = (2d) 1=p k qj dx j =1
and let us verify that this choice ts (4.46), (4.47). The latter relation is evident, while the former one is given by the following computation: setting L() =
l d X q j f j ; dx
j =1
Formally, the announced statement is not a straightforward corollary of the lower bounds on the minimax risk established in Chapter 3: there we were dealing with the usual q-norms of the estimation errors, while now we are speaking about discrete versions of these norms. However, looking at the proofs of the lower bounds, one can observe that they remain valid for the discrete versions of q-risks as well. 2)
6
4
2
0
−2
−4
−6
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 4.3: A \frequency modulated" signal. and assuming p < 1, we have
k Lf kpp;[d;1
k Lf kp;[d;1
d]
d]
8
= (2d)
1
= (2d)
1
1Z d> < xZ+d
9 > =
Lp (u)du dx
> : d 8x d min[1 Z1 > Z ;u+d] < > :
0 max[0;u d] p p;[0;1]
k L() k L;
)
> ; 9 > =
dx Lp (u)du > ;
as required in (4.46). We have established the latter relation in the case of p < 1; by continuity, it is valid in the case p = 1 as well.
Our interest in classes A comes from the fact that they contained not only \amplitude modulated", but also \frequency modulated" signals. Consider, e.g., the following construction. Given a positive integer N , we partition the segment [0; 1] into N nonoverlapping segments Bt , t = 1; :::; N , of the length 2d = N1 each; let xt = 2td, t = 1; :::; N , be the right endpoints of these segments. Now let g 2 S k;p ([0; 1]; L) be an \amplitude" which is supposed to vanish, along with its derivatives of order < k, at all points xt , t = 1; :::; N . Consider the family of functions obtained from g by \frequency modulation": A function f from the family on every segment Bt is of the form g (x) sin(!t x + t ) with somehow chosen frequency !t and phase t . One can immediately verify that all functions from this family belong to A4;k;p;d(4L). It turns out that the quality of our adaptive estimate on classes A is basically the same as on narrower classes W :
Theorem 4.3.2 Let 2 (0; 1), let be a positive integer, and let A = Al;k;p;d(L), where kl and pk > 1. Assume that the volume of observations n is large enough (the critical value depends on ; p; L= only and is independent of d), and that d is
d
bn L
2 2 +1
k
p
ln n bn = p : n
;
(4.49)
Then, for every q 2 [1; 1], the j jq;[0;1] -risk of the order adaptive estimate fbn on the class A can be bounded as follows:
n
Rfq fbn; A supf 2A E jfb(; yf ( )) f ()j2q;[0;1] P( L bLn 2 (p;k;q) ; k ; 2k1+1 (p; k; q ) = 2kk++1 1 ; 2k 2+1 ; 2k+1 = =
o1=2
(4.50)
1 p; 1: q
Here [0; 1] = [0:5(1 ); 1 0:5(1 )] is the -shrinkage of the segment [0; 1] to its center and P depends on ; p; only and is nonincreasing in p. If (4.49) is equality rather than inequality, then, for all large enough values of n, the upper bound (4.50) coincides (up to a factor depending on ; p; only) with the minimax j jq;[0;1] -risk of estimating functions from A via observations (4.3).
For proof, see [10].
4.4 Appendix: Proofs of Lemmas 4.2.1, 4.3.1 4.4.1 Proof of Lemma 4.2.1
10 . To simplify notation, let us assume that every polynomial j is of degree k (the modi-
cations in the case of deg j < k are quite straightforward), and let j` , ` = 1; :::; k, be the roots of the polynomial j (taken with their multiplicities). For every j , let Lj be the set of those ` for which j` are 1 in absolute value, and let Sj be the set of the remaining indices from [1; k ].
20 . Let us x j l, and let
0
j (z ) = @
Then We claim that
Y
`2Lj
10
(1 z=j` )A @
Y
`2Sj
j (z ) = cj j (z ):
jcj j 2
k:
1
(z
j` )A :
(4.51) (4.52)
(4.53) Indeed, it is clear that the maximum of absolute values of the coeÆcients of j is not greater than the one of the polynomial (z +1)k , i.e., is 2k ; since the product cj j (z ) is a normalized polynomial (i.e., with the maximum of modulae of coeÆcients equal to 1), the factor cj must satisfy (4.53).
that
(i) If jj 1, then 1 + T (z ) = (1 z=)rT (z ) with
rT k rT k1 k rT k1
(a) (b) (c) If
jj < 1
, then
z 2T with
(4.54)
+ T (z ) = (z
)rT (z )
2 F2T ; 2T; 2: p 8N 2T : k T k1;N 2NT + 1 :
(4.56)
k T k1 1:
(4.57)
rT k rT k1 k rT k1
(a) (b) (c) (ii) One has
2 F2T ; 2T; 2;
and
(4.55)
Indeed, let us set (z ) = T (z )
=
8 T > > 1 P (z=)t ; >
t=1 TP1 > > t T t > 1 : T t=0 z ; 2 (z ):
if jj 1 otherwise
;
Note that in the case of jj 1 we have 1 + T (z )
=
T P T 1 [1 t=1
!
(z=)t ] T 1
= (1 z=) |
(1 and
rT
k rT k1
z=)rT (z )
T X t 1 X
= 2T;
(z=)
t=1 {z=0 q1 (z )
2 F2T ; k q1 k1 k q2 k1 T 2
T P 1 + T 1 (z=)t t=1 !
1+T 1
}|
!
T X
t=1 q2 (z ) {z
(z=)t
! }
[by (4.8) with p = 1] [since jj 1]
k rT k1 k q1 k1 k q2 k1 [by (4.8) with p = 1] 2 [since jj 1]: as required in (4.54). Completely similar computation demonstrates (4.55) in the case of jj < 1.
2T and 2 GN we have j(FN T )( )j = p2N1 +1 jT ( )j = p2N1 +1 j ( )j2 h i2 p = 2N + 1 p2N1 +1 j ( )j p = p2N + 1j(FN )( )j2 ) k T k1;N = p2N + 1 k FN k22 [by (4.6) and in view of 2 FT ] = p 2N + 1 k k22
that for N
=
40 . Now let us set
2N +1 T
[by construction of ]
j (z ) +
=
j (z )
=
j (z )
=
(z )
=
(z )
Y
s2Lj Y
(1 + Tjs (z )); (z 2T + Tjs (z ));
s2Sj j + (z )j (z ); l Y j (z ); j =1 z 2T M (z );
= where M is the sum, over j l, of the cardinalities of the sets Sj . Let us verify that meets all requirements of Lemma 4.2.1. 40 .1) By construction, we have deg(+j (z )j (z )) 2T k, whence 2 F2lkT ; since M lk , we have 2T M 2lkT as well. Since (z ) is a polynomial, we have 2 F2klT F2T , as required. 40 .2) By (4.57) we have k T k1 1, whence
k j k1 2 k ;
(4.58)
so that k k1 =k k1 (2k )l = 2kl 2 , as required in (4.19). 40 .3) Let N 2T . Let us x j l, and let Mj = Card(Sj ). By construction, we have j (z )z 2T Mj
=
k Y s=1
(1 + sj (z ));
(4.59)
j where (see Mini-lemma) p2N +1s 2 F2T are of k k1 -norms not exceeding 1 and of k k1;N norms not exceeding T . By (4.7) applied with p = 1 and (4.10), every nonempty product of a number of sj (z ) with distinct values of the index s is of k k1 -norm not exceeding 1 and of k k1;N -norm not exceeding . When opening parentheses in the right hand side of (4.59), we get the sum of 1 and 2k 1 products of the just outlined type; consequently,
j (z ) j (z )z 2T Mj
1 2 F2kT
is of k k1 -norm not exceeding 2k and of k k1;N -norm not exceeding 2k . Observing that (z ) =
l Y
(1 + j (z ))
j =1
and repeating the reasoning we just have used, we conclude that !(z ) = (z ) k k1;N -norm not exceeding 2kl , as required in (4.20).
1 is of
By construction (see Mini-lemma and (4.51)) we have = 1 (z ) l (z )
(z ) 1 (z )
=
lY1 j =1
k Y
s=1
rTls (z );
j (z );
whence, in view of (4.52), k Y (z ) = l (z ) cl 1 z 2MT 1 (z ) rTls (z ) : s=1 | {z } l (z )
Due to its origin, l 2 F2T . Furthermore, we have
jcl j
k 1 k1 k rTls (z) k1 k rTls (z) k1 1
2k 2k(l 2T 2
[by (4.53)] [by (4.58)] [by Mini-lemma] [by Mini-lemma];
1)
applying (4.7), we come to (4.21).
4.4.2 Proof of Lemma 4.3.1 We may assume that q has k distinct roots 1 ; :::; k 3) . Let the rst of the roots belong to the closed left half-plane, and the rest of the roots belong from the open right half-plane. Let us set s = expfs n 1 g; s = 1; :::; k ;
Proof.
b(z )
=
Y
!0
(1 zs )
k Y
@
=1 8s
1
(z
s= +1
1=s )A ;
0; x0 ; s exp f x g ; x>0 s s (x) = expfs xg; x 0 ; k s > ; > > : 0; x>0 > > <
note that the fundamental solutions expfs xg of the homogeneous equation q dxd p = 0, being restricted on the grid ft=ngt2Z , are proportional to the progressions fts gt2Z and therefore satisfy the homogeneous dierence equation b()g
Let (a b)(x) =
1
a(u)b(x
u)du
The case when q has multiple roots can be obtained from the one with simple roots by perturbing to make its roots distinct and then passing to limit as the perturbation tends to 0. 3)
q
Z1
0:
(x) = 1 ::: k :
We have q(z ) = (z
1 ):::(z
n ),
so that
d q dx
= ( dxd 1 )1 = |Æ {z ::: Æ} k times = Æ;
whence, setting
(
we get
q
h(x)
=
r
=
d g q dx 0;
h;
d d r= q
dx dx
::: ( dxd
d q dx
(g
r)
whence
b()(g n
rn )
t
x 62 B
h=h=
q
d g x2B : dx
(x) = 0 ) g(x) r(x) =
k X s=1
cs expfs xg;
8t : t n k 2 int B )k b()(gn
=0
(x); x 2 B ;
Thus, for x 2 int B we have
k )k
r n ) kp;4T = 0
(4.60)
(recall that for jtj 4T the points (t k)=n belong to B , since B is centered at the origin and contains at least 8T + 2k + 1 points of the grid n). Now let us compute b()rn . Let be the shift by n 1 in the space of functions on the axis: (f )(x) = f (x n 1 ): Then b()r n = b()r (tn 1 ); t b()r = b()( h) [since r = h] b = () h [since (f e) = (f ) e] = 1 1 1 ::: |
{z
1
+1 |
}
|
1 +1 +1
{z
+1
{z
}
}
::: k |
k 1 k
{z
k
}
h:
Now note that every one of the functions s () in absolute value does not exceed 1, and that it vanishes outside [0; n 1 ]. It follows that the function = 1 ::: k vanishes outside [0; kn 1 ] and does not exceed in absolute value the quantity k 1 k1 ::: k k 1 k1 k k k1 n (k 1) .
k b()rn kpp;4T = = =
4P T
p b ()r (tn 1 )
t= 4T 4P T
(
t= 4T
h)(tn 1 ) p
p 1 kn Z 1 ( u)h(tn u)du t= 4T 00 1 kn 4P T B Z n (k 1) h(tn 1 u) @
4P T
t= 4T 4P T
t= 4T
j
0
n (k 1)p
kn Z 0
Z
1p
jduC A
1
jh(tn
u)jp du(kn 1 )p 1
1
= kp 1 n kp+1 jh(u)jp C (u)du C (u) = Card t 2 Z : jtj 4T; ( k + t)n
) k b()rn kp;4T
k p n kp+1 k h kpp;B d gk kn k+1=p k q dx p;B :
1
u tn
1
Combining the resulting inequality with (4.60), we get
k b()gn kp;4T kn
k+1=p
kq
d g kp;B : dx
(4.61)
This inequality was obtained in the case of p < 1; by continuity reasons, it is valid for as well. Relation (4.61) is nearly what we need; the only bad thing is that the polynomial b is not normalized. Let w be the maximum of absolute values of the coeÆcients of b; setting p=1
b = w 1 ;
we get a normalized polynomial of degree k such that
k ()gn kp;4T w
1 kn k+1=p
kq
d g kp;B : dx
(4.62)
It remains to bound from above the quantity w 1 ; which ispimmediate: there exists a point z 1 ; :::; 1 in the unit circle which is at the distance at least d = (1+ k ) 1 from all points +1 k and at at least at the same distance from the boundary of the circle (otherwise the circles of the radius d centered at the points were covering the circle of the radius 1 d centered at the origin, which is clearly impossible { compare the areas!). From the formula for b it follows that k p X jbsj jb(z )j (1 + k) k ; whence
s=0
w 1
(k + 1)(1 +
p
k )k :
Combining this inequality with (4.62), we come to (4.34).
Chapter 5 Aggregation of estimates, I 5.1 Motivation The non-parametric regression estimates we have built so far heavily depend on a priori assumptions on the structure of the function to be recovered. As a matter of fact, this dependence of estimation techniques on a priori hypotheses concerning the structure of \true signals" is a characteristic feature of the non-parametric regression estimates; we can reduce sometimes the \size" of the required a priori knowledge, but we never can get rid of it completely. Now, typically there are many \concurrent" a priori hypotheses on the structure of a signal rather than a single hypothesis of this type; if we knew which one of our a priori hypotheses indeed takes place, we would know how to recover the signal. The diÆculty, however, is that we do not know in advance which one of our concurrent hypotheses actually takes place. We already met this situation in adaptive estimation of smooth functions, where the hypotheses were parameterized by the smoothness parameters of local Sobolev balls, a particular hypothesis saying that the signal belongs to a particular Sobolev ball (and similarly in the case of recovering functions satisfying dierential inequalities). As another example of this type, assume that we are recovering a smooth regression function f of d > 1 variables and that we have reasons to suppose that in fact f depends on d0 < d \properly chosen" variables: f (x) = F (P T x); (5.1) where P is a d0 d matrix. If we knew P in advance, we could reduce the problem of recovering f to the one of recovering F . Since the rates of convergence of nonparametric estimates rapidly slows down with the dimensionality of the problem (e.g., for Lipschitz continuous functions of d variables the convergence rate is O(n d ) { think how many observations we need to get a reasonable accuracy when estimating a Lipschitz continuous function of just 4 variables), such an opportunity would look very attractive. But what to do when we know that a representation (5.1) exists, but do not know the matrix P ? The \general form" of the situation we are interested in is as follows. We have a family H of a priori hypotheses on the signal f , and we know in advance that at least one of these hypotheses is true. If we knew that f ts a particular hypothesis H 2 H, we would know how to recover f { in other words, every hypothesis H is associated with a recovering routine f^H which \works ne" when f ts H . However, we do not know what is the hypothesis the observed signal ts. What to do? 1 2+
101
dierential inequalities) we may act as if we knew the \true" hypothesis, but this possibility heavily depends on the speci c nature of the corresponding family of hypotheses H; for other families H, no results of this type are known. This is the case, e.g., for the family associated with representation (2.1) with given d; d0 and varying P. In the general case we could act as follows: we could partition our observations y into two groups and use the observations of the rst group, y I , to build all estimates f H () = f^H (; y I ), H 2 H, of f ; after this is done, we could use the second group of observations, y II , in order to \aggregate" the estimates f H { to build a new estimate which reproduces f (nearly) as good as the best of the functions f H , H 2 H. Since in our approach the family of hypotheses/estimates is given in advance and is therefore beyond our control, our problem is how to implement the \aggregation" stage; how we resolve this problem, it depends on what exactly is our target. Mathematically natural targets could be to nd an \aggregated" estimate which is nearly as good as
L. The closest to f linear combination of the functions f H , H 2 H; C. The closest to f convex combination of the functions f H , H 2 H; V. The closest to f of the functions f H , H 2 H. To the moment, the three outlined versions of the Aggregation problem were investigated in the case when
The number of \basic estimates" is nite. The estimation error is measured in L2 (X; ), X being a space on which f; f H are de ned and being a probability measure on this space.
The majority of known results relate to the version V of the aggregation problem (see [11] and references therein). In our course, we prefer to start with the version C, postponing the versions L, V till Chapter 6.
5.2 The problem and the main result 5.2.1 Aggregation problem We are about to consider the following Aggregation problem C. Let
RM be a convex compact set contained in the k k1-ball, i.e., let maxfk k1 j 2 g 1; X be a Polish space equipped with Borel probability measure ; fj : X ! R, j = 1; :::; M , M 3, be given Borel functions; f : X ! R be a Borel function.
z = fzt = (xt ; yt = f (xt ) + et )gnt=1 ;
(5.2)
where xt are mutually independent random points from X , each of them being distributed according to , and et are independent of each other and of fxt g random noises such that
Efet g = 0 and Efe2t g 2 < 1; t = 1; :::; n:
(5.3)
Let f be the closest to f , in L2 (X; ), linear combination of functions f1 ; :::; fM with coeÆcients from : M P
f = j fj ; j =1 2 Argmin (); () =
Z
X
2
(f (x)
M X j =1
(5.4)
j fj (x))2 (dx):
Our goal is to nd, given f1 ; :::; fM and n observations (5.2), a combination P j fj with 2 which is nearly as close to f as f . j
It should be stressed that we do not assume that the measure is known in advance. From now on, we make the following crucial for us Boundedness assumption: Functions f; f1 ; :::; fM are bounded.
From now on, we set
L = maxfk f k1; k f1 k1 ; :::; k fM k1g < 1; the 1-norm being associated with the measure .
(5.5)
5.2.2 The recovering routine Our recovering routine is extremely simple. The function () from (5.4) is a convex quadratic form of : () = () + c ; M P () = Qij i j
Qij =
M P
i;j =1 Z
j =1
qj j ;
fi (x)fj (x)(dx);
XZ
qj = 2 f (x)fj (x)(dx); c =
(5.6)
ZX
X
f 2 (x)(dx):
Given a quadratic form () on RM with (0) = 0: () =
M X i;j =1
Qij i j
M X j =1
qj j
[Qij = Qji]
M (M +1) 2
Coef() = (fQij g1j iM ; fqj gM j =1 )
+ M -dimensional vector of coeÆcients of the form. Note that every observation (xt ; yt = f (xt ) + et ) provides us with a noisy observation t = (ffi (xt )fj (xt )g1j iM ; f2yt fj (xt )gM (5.7) j =1 ) of the vector = Coef( ); and that t is the vector of coeÆcients of the convex quadratic form of rank 1
the
0
12
M X
zt () = @yt
j =1
j fj (xt )A
yt2:
Our aggregation procedure is as follows: given observations (5.2), we 1) build the form n 1X zt (); z () = n t=1 and 2) solve the convex optimization problem z () ! min j 2 :
(Pz )
An optimal solution (z ) to this problem clearly can be chosen to be Borel in z ; 3) We de ne our \aggregated estimate" as
fb(; z ) =
M X j =1
j (z )fj ():
5.2.3 Main result Our main result bounds the dierence between the quality of the \ideal", as far as closeness to f is concerned, aggregate of fj with coeÆcients from and the expected quality of the aggregate fb we have built, i.e., the dierence between the quantities Z
M X
X
j =1
( ) = min (f (x) 2 and
8
j fj (x))2 (dx) 9 =
E f ((z))g = E : (f (x) fb(x; z))2 (dx); : X
Note that a meaningful \quality measure" for an aggregation routine should be exactly of this type { it should bound the dierence between the expected distance from f to the result of the aggregation routine in question and the distance from f to the \ideal" aggregate f , not the distance from f to the result of the aggregation routine separately. Indeed, since we make no assumptions on how well the \ideal" aggregate
(which cannot be closer to f than the ideal aggregate) is a good approximation of f ; all we should worry about is to get an aggregate which is nearly as good as the ideal one. Theorem 5.2.1 For the aggregation routine fb we have built, one has
"n E f ((z ))g
p
(L2 + L ) ln M p ( ) O(1)
n
(5.8)
with absolute constant O(1).
Discussion The quantity "n in the left hand side of (5.8) can be treated as the \aggregation price" { the loss in accuracy of approximating f by a linear combination of fj (with coeÆcients from ) coming from the fact that we do not know the \true" optimal combination (since neither f nor even are known in advance) and are enforced to recover a nearly optimal combination from observations. Note that "n is the expected loss in the squared k k2 -distance from f (k k2 is associated with the measure ). A more natural price is the loss in the k k2 -distance itself { the quantity
n =k f fb k2 k f f k2 : Since for 0 a b one has (b a)2 b2 a2 , (5.8) implies that En
E
n
n2
o1=2
p p (L + L )(ln M )1=4 : "n O(1) n1=4
(5.9)
A good news about the latter bound is that it is \nearly independent of the number M of functions we are estimating" { it is proportional to (ln M )1=4 . Thus, if our aggregation problem comes from the desire to aggregate estimates associated with a number of concurrent hypotheses on the signal f to be recovered, this number can be \very large". From the applied viewpoint, it means that our abilities to handle many concurrent hypotheses are limited not by the statistics { by growth of the aggregation price with the number of hypotheses { but by the necessity to process these hypotheses computationally. And a bad news about our aggregation routine is that the aggregation price En decreases rather slowly (as n 1=4 ) as the volume n of observations used for aggregation grows. We shall see, however, that in our setting of the aggregation problem this rate is unimprovable. Note that one can replace the \o-line" aggregation routine we have described (where we rst accumulate all observations (5.2) and only then solve a (large-scale, for large M ) convex optimization problem (Pz ) to build the desired aggregate) with a Stochastic Approximation-type on-line routine where neither the observations should be stored, nor a separate stage of solving a convex optimization problem is needed (for details, see [16]).
Proof of Main result Proof of Theorem 5.2.1 is given by combination of two simple observations; the second of them is interesting by its own right. The rst observation is given by
unbiased estimates of : Besides this,
Ef tg = : n
(5.10)
o
E k t k21 4(2L2 + L)2 ; (5.11) (From now on, for = (1 ; :::; K ) 2 RK k k1 is the norm maxk jk j of the vector
).
Proof. Mutual independence of f t gnt=1 and relation (5.10) are evident. To establish (5.11), note that
jqj
jQij fi(xt )fj (xt )j 2(f (xt ) + et )fj (xt )j k t k21 E fk t k21g
2L2 ; 4L2 + 2Ljet j ) 4(2L2 + Ljetj)2 ) 4(2L2 + L)2:
20 . Our second observation is an extremely useful \Tschebyshev inequality in the 1-norm". Recall that the usual Tschebyshev inequality gives a rough upper n P bound on the probability of the event j t j > a, where t are independent scalar t=1 random variables with zero mean and nite variance; this inequality is an immediate consequence of the observation that in the case in question (
E j
n X t=1
j
t 2
)
=
n X t=1
Efj tj2g:
Similar equality with respect to the Euclidean norm takes place if t are independent vectors with zero mean and bounded variances: (
E k
n X t=1
k
t 22
)
=
n X t=1
Efk t k22g;
()
where for = (1 ; :::; K ) 2 RK 8 > <
k kp= >:
K P
!1=p
jij maxi ji j; i=1
p
; 1p<1: p=1
Now, (*) re ects speci c algebraic properties of the Euclidean norm k k2 and fails to be valid for the standard norms k kp with p 6= 2. As far as statistical consequences are concerned, the \k kp-version" of (*) is played by the following result1) :
Lemma 5.2.2 Let t 2 RK , t = 1; :::; n, be independent random vectors with zero means and nite variance, and let K 3. Then for every p 2 [2; 1] one has (
E k
n X t=1
)
n X
n
k O(1) min[p; ln K ] E k t k2p
t 2p
t=1
o
;
(5.12)
here, as always, O(1) is an absolute constant. 1) I am using this fact for more than 20 years; all this time I was (and still am) sure that the fact is well-known, all this time I was looking for a reference and found none.
V ( ) =k k2 : RK ! R: The function V is continuously dierentiable with Lipschitz continuous gradient; it can be easily veri ed (for the proof, see [21]) that
V ( + ) V ( ) + T rV ( ) + CV ( ) with absolute constant C . We conclude that (
kP +1
E V ( t=1
)
t)
(
k P
E V (t=1
t ) + ( k+1)T
=
t)
(
k P
)
k P
rV (t=1
)
t)
(5.13)
+ C E fV ( t )g
E V (t=1 + CE fV ( t)g [since Ef k+1g = 0 and k+1 is independent of 1 ; :::; k ] The resulting recurrence implies that whenever p 2 [2; 1), one has (
E k
n X t=1
)
n X
n
o
k Cp E k t k2p :
t 2p
(5.14)
t=1
To complete the proof of (5.12), it suÆces to verify that we can replace the factor Cp in the right hand side by a factor of the type O(1) ln K . This is immediate: there is nothing to prove when p p(K ) 2 ln K . Now let us assume that p > 2 ln K . Since for p p0 1 one has k kpk kp0 K p0 p k kp 8 2 RK we have n n P P t 2 t 2 E k kp E k kp(K ) 1
t=1
t=1 n
1
n
Cp(K ) tP=1 E k t k2p(K )
o
[by (5.14) applied with p = p(K )] n P Cp(K ) E K p K p k t k2p
t=1 Cp(K )K p K
n P
2
n
E k t k2p t =1 o n n P 2Ce ln K E k t k2p t=1 2 ( )
=
2 ( )
o
[since p(K ) = 2 ln K ] 30 . We are basically done. Indeed, since is contained in the unit k k1 -ball in RM , the uniform, on , distance between a pair of quadratic forms ; 0 of , both forms being with zero constant terms, does not exceed 3 times the k k1 -distance between the coeÆcient vectors of the forms: () =
) max j () 2
M P i;j =1 M P
Qij i j
M P j =1 M P
qj j ;
0 () = Q0ij i j qj0 j i;j =1 j =1 0 ()j 3 k Coef( ) Coef( 0 ) k1 :
then 0 is a \nearly minimizer" of on :
0 (0 ) min ()
2
Argmin 0 () ) 2max j () 0()j 2 6 k Coef( ) Coef( 0 ) k1 :
(5.15)
Now, the output of our aggregation routine { the vector of aggregation weights (z ) { n P by construction is a minimizer, on , of a random quadratic form z () = n1 zt (), t=1 so that our quality measure { the \aggregation price" { can be bounded as follows: ((z ))
min () = ((z )) min () [since diers from by a constant] 6 k Coef( ) Coef( z ) k1 [by (5.15)] n P = n6 k [ t ] k1 t=1 [by construction]
) "n
E f ( (z )) min ()g n n6 E k tP=1[ t] k1 n 1=2 P 6 t 2 n E k t=1[ ] k1 1=2 n P 6 2 2 n O(1) ln M t=1 4(2L + L)
[by Lemmasp5.2.1, 5.2.2] pn) ln M ; O(1) (L +L 2
as required.
5.2.4 \Concentration" From the computational viewpoint, a drawback of our aggregation routine is that the resulting aggregate fb can involve all our M functions f1 ; :::; fM . If M is very large (and this is the case we indeed are interested in), such an aggregate is computationally diÆcult to use. We are about to prove that in fact the aggregate fb can be enforced to involve at most O(n) or even O(n1=2 ) of the functions f1 ; :::; fM , provided that is \simple", e.g., = = =
f 2 RM j k k1 1g f 2 RM j 0; k k1 1g f 2 RM j 0; k k1 = 1g
(5.16) (5.17) (5.18)
nates 1, let us set M RM ! = f 2 R j !j j 0; j = 1; :::; M g:
Let us call k-simple, if the intersection of with every one of 2M \orthants" RM ! is M a polyhedral set cut o R! by at most k linear equalities and inequalities (in addition to M \sign constraints" which de ne RM ! itself). E.g., every one of the sets (5.16) { (5.18) is 1-simple. Note that the weight vector (z ) yielded by our aggregation routine is not necessarily unique. Indeed, we can choose as (z ) any minimizer (on ) of the quadratic form z (). The quadratic part of each of the forms zt (), t = 1; :::; n, is of rank 1, so that the rank of the quadratic part of the form z () is of rank at most n. It follows that there exists a linear subspace E z RM of codimension at most n + 1 such that z () is constant along every translation of this subspace. In particular, after we have found a minimizer (z ) of z () on , we can \re ne" it as follows. Let ! be such that (z ) 2 RM ! . Consider the set z P = \ RM ! \ [E + (z )] :
Every point of this set (which contains (z )) is a minimizer of z () on , along with (z ) (since z is constant on E z + (z )). Assuming that is k-simple, we observe that P is a compact polyhedral set given by M \sign constraints" de ning RM ! and no more than k + n + 1 additional linear inequalities and equations (at most M k linear constraints which cut o \ RM ! from R! plus n + 1 linear equation de ning the aÆne plane E z + (z )). As any compact polyhedral set, P has extreme points, and by the standard results of Linear Programming every extreme point of P ts at least M of equations/inequalities de ning P as equations. We are in our right to choose, as a minimizer of z () on , any one of these extreme points, let the chosen point be denoted + (z ), and to treat (z ) as an intermediate, and + (z ) { as the actual output of our aggregation routine. It remains to note that among M of equations/inequalities de ning P which are satis ed at + (z ) as equalities, at least M (k + n + 1) must come from the sign constraints de ning the orthant RM ! , i.e., + at least M (k + n + 1) coordinates in (z ) must be zero. We have arrived at the following
Proposition 5.2.1 Assume that is k-simple. Then in our aggregation routine we can specify the rules for choosing the weight vector (z ) in such a way that the
aggregate
f (; z ) = b
M X j =1
j (z )fj ()
will include, with positive weights j (z ), no more than k + n + 1 of the functions fj .
\n1=2 -concentrated" aggregation. The construction we are about to present goes
back to Maurey [28]. We shall implement the construction under the assumption that is the k k1 -unit ball (5.16); however, our reasoning can be easily modi ed to handle the case of simplices (5.17), (5.18).
routine to get the vector of aggregation weights (z ). After it is found, we set
(z ) =
M X j =1
jj (z)j
(note that (z ) 1) and de ne a probability measure fjz gM j =0 on the set f0; 1; :::; M g as follows: z j = jj (z )j; j > 0 1 ; j = 0 For 0 j M , let us set 8 > <
0; j=0 ) = >: fj (); j > 0; j (z) 0 : fj (); j > 0; j (z ) < 0
gjz (
M P
Note that we can represent the aggregate fb(; z ) = j (z )fj () as the expectation j =1 of \random function" gjz with respect to the distribution z of the index j :
fb(; z ) =
M X j =0
jz gjz ():
Now let us draw independently of each other K indices j1 ; :::; jK according to the probability distribution z and let us set
fe(; z; j ) =
K 1X g z () K l=1 jl
[j = (j1 ; :::; jk )]
Note that the resulting function is obtained from f1 ; :::; fM by linear aggregation with the weight vector e (z; j ) 2 which is \K -concentrated" { has at most K nonzero entries. Now let us look at the \aggregation price"
"en(K ) Ez;j (e (z; j )) min ()
of our new { randomized { aggregation routine. Treating gjz () as a random element of L2 (X; ), the conditional, for z xed, distribution of j being z , we observe that (a) gjz ; :::; gjzK are conditionally, z being xed, independent and identically distributed with conditional expectation fb(; z ) (b) The conditional, z being xed, expectation of k gjzl () fb(; z ) k22; does not exceed L2 , where k k2; is the standard norm of L2 (X; ). 1
n
=
o
Ez;j( (e((z; j )) )) K P 1 z 2 Ez Ejjz k K l=1 gjl f k2; (
E E
=
E E
E E =
"
K 1 P z j jz z K l=1[gjl ( ) ( ( K 1 P [g z ( ) j jz z K l=1 jl ( n K 1 P gjzl ( ) z K2 j jz n 2 l=1 L b z K +n f ( ; z ) f ( ) L2 + fb( ; z ) f ( ) z K 2 L + ((z )) : K
=
=
(
k
k
E
k
k E k Ef g
#
h
i
f (; z )] + f (; z ) f () b
b
)
k
2 2;
))
fb(; z )] k22; + k fb(; z ) f () k22;
fb(; z ) k22; + k fb(; z ) f () k22;
o
)
)
o
k22; o k22;
[by (a)] [by (a)] [by (b)]
Combining the resulting inequality with (5.8), we come to the result as follows: Proposition 5.2.2 For the randomized, with parameter K , aggregate fe(; z; j ), the aggregation price can be bounded from above as
"en (K ) Ez;j
p
(L2 + L ) ln M L2 pn + : ((z; j )) min ( ) O (1) K e
(5.19)
q
In particular, choosing K as the smallest integer which is lnnM , we get a randomp ized aggregation routine which is p \ n-concentrated" { the resulting aggregate always is combination of at most K n of the functions f1 ; :::; fM , and the aggregation price of the routine, up to factor 2, is the same as for our basic aggregation routine, see Theorem 5.2.1.
5.3 Lower bound We have seen that when aggregating M functions on the basis of n observations (5.2), the expected aggregation price
E
(()) min () ;
p
() =
Z
X
0 @
f (x)
M X j =1
12
j fj (x)A (dx)
can be made as small as O( ln Mn 1=2 ). We are about to demonstrate that this bound is optimal in order in the minimax sense.
Theorem 5.3.1 For appropriately chosen absolute constant > 0 the following is true. Let positive L; and integer M 3 be given, and let n be a positive integer such that 2 ln M 2 M ln M : (5.20) n L2 L2 For every aggregation routine B solving the Aggregation problem C on the basis of n observations (5.2) one can point out
solute value,
a function f which is a convex combination of the functions f1 ; :::; fM ,
with the following property. Let
f (; z ) = bB
M X j =1
Bj (z )fj ()
be the aggregate yielded by the routine B as applied to the Aggregation problem with the data given by { fj , j = 1; :::; M , as the basic functions, { the uniform distribution on X = [0; 1] as the distribution of the observation points, { the N (0; 2 ) observation noises et , { f as the \true" function, and { the simplex (5.18) as . The expected aggregation price of the aggregate fbB can be bounded from below as
E
(B )
min () = E
p
L ln M (B ) p : n
n
o
(5.21)
In particular, under assumption (5.20) the aggregation price associated with the routines from Section 5.2.2 is optimal in order, in the minimax sense, provided that L = O(1) .
Proof. Let M 3, and let fj (x) = L cos(2jx); j = 1; :::; M: Given a positive integer p M=2, let us denote by Fp the set of all convex combinations of the functions f1 ; :::; fM with the coeÆcients as follows: 2p of the coeÆcients are equal to (2p) 1 each, and the p remaining coeÆcients vanish. It is easily seen that if p M , then Fp contains a subset Fp with the following properties: I. Every two distinct functions from Fp have at most p common nonzero coeÆcients in the basis f1 ; :::; fM , so that
L2 4p
2
k f g k22 L2p
(5.22)
(note that f1 ; :::; fM are mutually orthogonal in L2 [0; 1] and that k fj k22 = 21 ); II. The cardinality K of Fp satis es the relation
K M p 1
(from now on, i > 0 are appropriate absolute constants).
(5.23)
"(p) = fmax 2F p
where
E
f (Bf )
min f ()
f () =k f
M X j =1
= max f 2F p
h
E
n
oi
f (Bf ) ;
(5.24)
j fj k22
and Bf is the vector of aggregation weights yielded by the aggregation routine B, the observations being associated with f . Note that the second equality in (5.24) comes from the fact that is the simplex (5.18) and all f 2 Fp are convex combinations of f1 ; :::; fM . p We claim that if p M , then, for properly chosen 2 , the following implication holds true: L2 2 p2 ln M "(p) < ) n 2 : (5.25) 64p L2 Note that (5.25) implies the conclusion of the Theorem. Indeed, believing in (5.25), choosing s n L b p =c 2 ln M
p
and taking into account that in the case of (5.20) the resulting p is M , provided that is chosen properly, we see that the conclusion in (5.25) fails to be true, so that
"(p)
L2 64p
p
O(1) L plnn M ;
the latter inequality, in view of the origin of "(p), is exactly what we need. It remains to prove (5.25), which can be done by our standard information-based considerations. Indeed, let p satisfy the premise in (5.25), and let B be a method for solving the Aggregation problem with the data we have built. Let us associate with B a method B0 for distinguishing between K hypotheses H` , ` = 1; :::; K , on the distribution of the observation (5.2), `-th of them saying that the observations are associated with `-th signal f ` from Fp . Namely, given observations z , we call B to solve the Aggregation problem; after the corresponding aggregated estimate FB = fB (z ) is obtained, we nd the k k2 -closest to fB function f ` in Fp (if there are several functions of this type, we choose, say, the rst of them) and accept the hypotheses H` . p Since the pairwise k k2 -distances between the signals from Fp are d L= 4p by (5.22), and for every f 2 Fp it holds E fk fB f k22 g "(p) by (5.24), we see that the probability to reject hypothesis H` if it is true is, for every ` = 1; :::; K , at most q "(p)=(d=2) 1=4. On the other hand, it is immediately seen that (!) The Kullback distance between every pair of distributions associated with our K hypotheses does not exceed 2
2 nL : K 2n2 f;gmax k f g k 2 2Fp 4p 2
(5.26)
butions of observations (5.2). Since the entries zt are independent identically distributed, we have
K(Ffn : Fgn) = nK(Ff18: Fg1 ) < Z1
Z1
= n dx : h0
(t f (x)) ln
9 =
(t f (x)) dt (t g(x)) ;
1 i p (t) = 21 expf t2 =(22 )g
=
n 2 2
=
n 2 2
Z1
(f (x) g(x))2 dx
0
kf
g k22 ;
and we conclude that the Kullback distance between Ffn and Fgn does not exceed the quantity K de ned in (5.26). The inequality in (5.26) is given by (5.22).
Applying the Fano inequality (Theorem 1.2.1), we come to
nL2 4p 2
43 ln(K
1)
ln 2;
taking into account (5.23), we come to the conclusion of (5.25).
5.4 Application: Recovering functions from Barron's class Usually, the \complexity" of approximating a multivariate function (e.g., the number of \simple terms" used in approximation) grows rapidly with dimensionality. This is why the Arti cial Intelligence community was happy with the following \dimensionindependent" result:
Theorem 5.4.1 [Barron '93 [1]] Let f : Rd ! R be the Fourier transform of a complex-valued measure of variation 1:
f (x) =
Z
expfi! T xgF (d! );
Z
jF (d!)j 1;
and let be a probability distribution on Rd . Then for every n n-term sum of cosines n X fe(x) = aj cos(!jT x + j )
1 there exists an
j =1
such that
Z
jfe(x) f (x)j2(dx) n1 :
In fact, this theorem goes back to Maurey. In order to simplify notation, assume that Z
jF (d!)j = 1;
(d! ) = jF (d! )j
is a probability distribution. Let
F (d! ) (d! ) be the density of complex-valued measure F with respect to the probability measure , and let g () be random element of the space L2 (Rd ; ) of complex-valued -square summable functions on Rn given by g! (x) = p(! ) expfi! T xg; ! being distributed according to . The expected value of the random function g! () clearly is f , while the second moment of the L2 (Rd ; )-norm of this random function does not exceed 1: p(! ) =
E
Z
jg! (x)j2 (d!) 1;
since k g! () k1 1 and is a probabilistic measure. It follows that if !1 ; :::; !n is a sample of n independent random vectors !j distributed each according to , then
E
8 >
:
9 > =
2
n Z 1 X f (x) (dx)> = 2 E jg!j (x) n j =1 ;
n X
1 g (x) n j =1 !j
f (x)j
2 (dx)
n1
and, consequently, there exists a particular collection ! 1 ; :::! n such that Z
2
n 1X 1 g! j (x) (dx) ; n j =1 n
f (x)
it suÆces to take, as fe(), the real part of the function n n 1X 1X g! j (x) = p(! ) expfi! jT xg: n j =1 n j =1 j The advantage of Barron's result is that the quality of approximation in his theorem depends on the \number of simple terms" in the approximating aggregate and is independent of the dimensionality of the function to be approximated. A disadvantage of the construction is that in order to build the approximation, we need complete knowledge of F , or, which is the same, of f . We are about to demonstrate that the aggregation technique developed in the previous section allows to build a \simple" approximation of f directly from its noisy observations, with basically no a priori knowledge of the function. Namely, assume that all our a priori knowledge about f is that f is the Fourier transform of a complexvalued measure of variation not exceeding a given upper bound L=2 < 1 and vanishing outside the ball of a given radius R: 8 > <
f (x) 2 F (L; R) = >f (x) = :
Z
k!k2 R
expfi! T xgF (d! )
Z
9 > =
jF (d!)j L=2>; :
(5.27)
z = fzt = (xt ; yt = f (xt ) + et )gnt=1
(5.28)
of the values of f , where the observation points xt are independent of each other and are distributed according to certain probability measure , and the observation noises et are independent of each other and of fxt gnt=1 and have zero mean and bounded variance: n o E fetg = 0; E e2t 2: (5.29) We do not assume the measure to be known; all our a priori information on this measure is that Z k x k22 (dx) x2 (5.30) with certain known in advance x < 1. In order to recover f via observations (5.28), we act as follows: Initialization. Given ; n; d; L; R; x , we set (a) = (b) " =
p
L2 +L ; n1=4 p 2 = 2 L2 +L Lx n1=4 Lx
(5.31)
and build an "-net = f!k gKk=1 in the ball WR = f! 2 Rd j k ! k2 Rg. It is easily seen that the cardinality K of the net can be chosen to satisfy the bound K (1 + 2" 1R)d : (5.32) Estimation. We set M = 2K , = f 2 RM j k k1 1g and de ne the basic functions fj , j = 1; :::; M; as
f2k 1 (x) = L cos(!kT x); f2k (x) = L sin(!kT x); k = 1; 2; :::; K: Then we use the aggregation routine from Section 5.2.2 to get \nearly closest to f " weighted combination
fn (; z ) = b
M X
j =1
j (z )fj ()
[
M P
j =1
jj (z)j 1]
of functions fj and treat this combination as the resulting approximation of f .
Remark 5.4.1 Applying our \n-concentration" technique, we can enforce fbn to be a weighted sum of at most n + 2 cosines, similarly to the approximation given by Barron's Theorem.
The rate of convergence of the outlined approximation scheme is given by the following
Theorem 5.4.2 Let f 2 F (L; R) and let (5.29), (5.30) be satis ed. Then for all n
one has n
E k fbn(; z) f () k
2 2;
o
p
1=4 2 x O(1) (L + Lp)n d ln Mn ; Mn = 2 + pn L2LR : (5.33) + L
fe(x) = with 2 such that
M X j =1
j fj (x)
k f fe k2; :
(5.34)
Indeed, we have
f (x) =
Z
WR
expfi! T xgF (d! ) with
Z
WR
jF (d!)j L=2:
Since is an "-net in WR , we can partition WR into K non-overlapping sets k in such a way that !k 2 k and k is contained in the ball of radius " centered at !k , for all k. Setting
pk = fe =
R
F (d! ) = ak + bk i; k = 1; :::; K;
k K P
n
o
< pk expfi!kT xg k=1
=
2k 1 = L1 ak ; 2k = L1 bk ;
we get
M X j =1
jj j
and
jf (x) f (x)j e
=
p
2L
P K k=1 k Z h P K k=1
k K Z P
K 1X k=1
p expf
k=1
k
K P k=1
[2k 1 f2k 1 (x) + 2k f2k (x)] ;
jpk j
Z
WR
g f (x)
i!kT x
expfi! T xg
expfi! T xg
[since f is real-valued] i
expfi!kT xg F (d! )
expfi!kT xg jF (d! )j
K Z
" k x k2 kP=1 jF (d!)j
k
) k fe f k2;
jF (d!)j 1
[since j!
!k j " 8! 2 k ]
" k x k2 L=2 0:5"Lx
= [see (5.31.(b))]
as required. 20 . Applying Theorem 5.2.1, we get n
o
p
M X
E k f () fbn(; z) k22; O(1) (L +Lpn) ln M + min kf j fj k22; 2 j =1 p M ln (L +L ) 2 e O(1) pn + k f f k2;; 2
2
n
E k f () fbn(; z) k22; It remains to note that M
o
p
2 pn) ln M + 2: O(1) (L + L
2Mnd by (5.32) and that 2 L p+nL by (5.31.(a)). 2
Discussion. Theorem 5.4.2 establishes \nearly dimension-independent" rate of convergence of approximations of a function f 2 F (L; R) to the function: when all but the dimension parameters (i.e.,p; L; R; x ) are xed, the rate of convergence (measured n o 2 b as E k f fn k2; ) is O( dn 1 ln n), so that the volume of observations required to approximate f within a given margin is just proportional to the dimension d. To
understand that this linear growth indeed means \nearly dimension-independence" of the complexity of recovering a function, note that for the \usual" functional classes, like Sobolev and Holder balls, the number of observations (even noiseless) needed to recover a function within a given inaccuracy grows with the dimension d like expfdg ( > 0 depends on the parameters of smoothness of the class in question). It should be stressed that the rate of convergence given by (5.33) is nearly independent of the parameters R; x ; we could allow p these parameters to grow with n in a polynomial fashion, still preserving the O( dn 1 ln n)-rate of convergence. By similar reasons, we would not loose much when replacing the assumption that the Fourier transform of f vanishes outside a given compact with bounds on the \tails" of this transform, thus coming to the classes like
F (L; )
=
ff =
R
R
R
expf gF (d!) jF (d!)j L; jF (d!)j R 8R > 0g:
i! T x
k!k2 >R
As compared to the original result of Barron, the result stated by Theorem 5.4.2 has, essentially, only one drawback: the rate of convergence (5.33) is nearly O(n 1=2 ), while in Barron's theorem the rate of convergence is O(n 1 ). This \slowing down" is an unavoidable price for the fact that Theorem 5.4.2 deals with the case of approximating unknown function from Barron's-type class. In this case, the convergence rate O(n 1=2 ) is nearly optimal in the minimax sense, as stated by the following result of [16]:
Theorem 5.4.3 Let L > 0. Consider the problem of estimating a univariate function f : R ! R via observations (5.28), where xt are uniformly distributed on [0; 1] and et N (0; 2). Let Fn be the class of all real-valued trigonometric polynomials of degree n with the sum of absolute values of the coeÆcients not exceeding L. Then, for appropriately chosen absolute constant > 0 and for all large enough values of n, for every algorithm B approximating f 2 Fn via n associated with f observations
(5.28) it holds
s
(5.35) k f fbB k L lnnn ; f 2Fn here fbB is the estimate yielded by B, the function underlying observations being f . sup E
n
2 2
o
Following [16], consider a nonlinear time-invariant dynamic system:
yt = f (yt 1 ; yt 2 ; :::; yt d ) + et ;
(5.36)
e0 ; e1 ; ::: being independent noises. We do not know f , and our target is to predict, given y0 ; :::; yn, the state yn+1 . A natural way to approach our target is to recover f from observations and to form the prediction as ynp+1 = fbn (yn; :::; yn d+1 ); (5.37) fbn being the estimate of f built upon the rst n observations (5.36). Setting xt = (yt 1 ; :::; yt d )T , we can represent the observations accumulated at time instant n as z = fzt = (xt ; yt = f (xt ) + et )gnt=d :
(5.38)
The situation resembles the observation scheme (5.2), up to the fact that now the points xt where we observe f depend on each other in a complicated and unknown fashion rather than to be i.i.d. Let us ignore this \minor dierence" (we are not going to prove anything, just to look how it works) and act as if fxt g were i.i.d. Assume that the dynamic system in question is known to be semilinear (\a system with single output nonlinearity"):
f (x) = (pT x): If p were known, we could project our observation points xt onto the corresponding axis, thus reducing the situation to the one where we are observing a univariate function . As a result, we would be capable to recover the multivariate function f as if it were a univariate function. In the case when p is unknown (this is the case we are interested in) it makes sense to use the approach outlined in Section 5.1, namely, to choose a \ ne nite grid" in the space of d-dimensional directions and to associate with every direction p 2 the estimate fbp of f corresponding to the hypothesis that the \true" direction is p. We can use, say, the rst half of our n observations to build the associated realizations fp , p 2 , of our estimates, and use the remaining half of observations to aggregate the resulting basic estimates, as described in Section 5.2.2, thus coming to the aggregated estimate fbn to be used in the predictor (5.37). We are about to present the results yielded by the just outlined scheme as applied to systems of the type 8 > > > <
(Dd ) : > > > :
yt = F (pT x) + t ; xTt = (yt 1 ; :::; yt d ); F (z ) = cos(4z ) + cos(5z ); t N (0; 1); p = d 1=2 (1; :::; 1)T 2 Rd :
In our simulations, we dealt with the dynamics (Dd ) with d = 2; 3. In the case of d = 2, the grid of directions was 1 M cos( 0 + jM ) pi = sin( + jM 1 ; 0 j =1
the grid was comprised of M = 3144 randomly generated directions in R3 . In both cases, the basic estimates fp were the zero order spatial adaptive estimates from Chapter 3 (modi ed in an evident manner to get the possibility to work with nonequidistant grids of observation points). In our experiments, we used the rst 1024 observations zt to build the basic estimates, the next 1024 observations to aggregate these estimates by the aggregation routine from Section 5.2.2, the underlying set being the standard simplex X
f 2 RM j 0; j = 1g; j
and used the resulting predictor (5.37) at 2048 subsequent time instants in order to measure the empirical standard deviation
Æ=
v u u t
X 1 4096 (f (xt ) 2048 t=2049
ytp )2 :
In order to understand what is the eect of our \structure-based" prediction scheme { one which exploits the a priori knowledge that the actual dynamics is semilinear, we have compared its performance with the one of the \standard" prediction scheme based on the zero order spatial adaptive non-parametric recovering of f (treated as a \general-type" function of d variables) from the rst 2048 observations (5.38). The results of the experiments are as follows: Method = 0:1 = 0:33 Structure-based predictor, dynamics (D2 ) 0.093 0.275 Standard predictor, dynamics (D2 ) 0.483 0.623 Structure-based predictor, dynamics (D3 ) 0.107 0.288 Standard predictor, dynamics (D3 ) 0.244 1.013
Empirical standard deviation
The histograms of the prediction errors f (xt ) ytp and typical prediction patterns are as follows: Finally, this is how the function f itself was recovered in the case of dynamics (D2 ):
450
160
400
140
350
120
300 100 250 80 200 60 150 40
100
20
50 0 −2
−1.5
−1
−0.5
0
0.5
1
1.5
2
Structure-based predictor, = 0:1 180
0 −2
−1.5
−1
−0.5
0
0.5
1
1.5
2
Standard predictor, = 0:1 120
160 100 140 120
80
100 60 80 60
40
40 20 20 0 −2
−1.5
−1
−0.5
0
0.5
1
1.5
2
0 −2
−1.5
−1
−0.5
0
0.5
1
1.5
2
Structure-based predictor, = 0:33 Standard predictor, = 0:33 Figure 5.1: Distribution of prediction errors, dynamics (D2 ).
450
140
400 120 350 100 300 250
80
200
60
150 40 100 20 50 0 −2
−1.5
−1
−0.5
0
0.5
1
1.5
2
Structure-based predictor, = 0:1 200
0 −2
−1.5
−1
−0.5
0
0.5
1
1.5
2
Standard predictor, = 0:1 80
180
70
160 60 140 50
120 100
40
80
30
60 20 40 10
20 0 −2
−1.5
−1
−0.5
0
0.5
1
1.5
2
0 −2
−1.5
−1
−0.5
0
0.5
1
1.5
2
Structure-based predictor, = 0:33 Standard predictor, = 0:33 Figure 5.2: Distribution of prediction errors, dynamics (D3 ).
2
2
1.5
1.5
1
1
0.5
0.5
0
0
−0.5
−0.5
−1
−1
−1.5
−1.5
−2 3000
3005
3010
3015
3020
3025
3030
3035
−2 3000
3005
3010
3015
3020
3025
3030
3035
Structure-based predictor, = 0:33 Standard predictor, = 0:33 Figure 5.3: Prediction patterns, dynamics (D2 ). [circles: f (xt ); crosses: ytp ]
2
2
1.5
1.5
1
1
0.5
0.5
0
0
−0.5
−0.5
−1
−1
−1.5
−1.5
−2 3000
3005
3010
3015
3020
3025
3030
−2 3000
3035
3005
3010
3015
3020
3025
3030
3035
Structure-based predictor, = 0:1 Standard predictor, = 0:1 Figure 5.4: Prediction patterns, dynamics (D3 ). [circles: f (xt ); crosses: ytp ]
2 1 0 −1
2 1
−2 −2
0 −1 0
−1 1 2
−2
Dynamics (D2 )
2
3 2
1 1 0
0 −1
−1
2
2 −2
1
−2 −2
0 −1
1
−3 −2
0 −1
0
−1 1
0
−1 1
2
−2
Structure-based reconstruction
2
−2
Standard reconstruction
Figure 5.5: Reconstructions of dynamics D2 , = 0:1.
2 1 0 −1
2 1
−2 −2
0 −1 0
−1 1 2
−2
Dynamics (D2 )
2
3 2
1 1 0
0 −1
−1
2
2 −2
1
−2 −2
0 −1
1
−3 −2
0 −1
0
−1 1
0
−1 1
2
−2
Structure-based reconstruction
2
−2
Standard reconstruction
Figure 5.6: Reconstructions of dynamics D2 , = 0:33.
Chapter 6 Aggregation of estimates, II We proceed with aggregating estimates associated with a number of \concurrent hypotheses" on the observed regression function. In the previous chapter our goal was, essentially, to reproduce the best convex combination of the estimates, while now we focus on reproducing the best of the estimates or their best linear combination.
6.1 Gaussian white noise model of observations It makes sense now to switch from the \discrete" models of observations we dealt with to the moment to the \continuous" model. In this new model, a signal f : [0; 1] ! R is observed in continuous time Gaussian white noise of intensity "2 . In other words, our observation is the random function
y (x) = yf;"(x) =
Z x
0
f (s)ds + "W (x);
(6.1)
W (x) being the standard Wiener process. Model (6.1) is very popular in Nonparametric Statistics by the reasons as follow. There exists a rich \L2 regression theory", where the quality of restoring f is measured in the L2 norm and a priori assumptions on the signal are expressed in geometric form { usually, as hypotheses on the rate at which f can be approximated by elements of a given sequence of nite-dimensional subspaces E1 E2 ::: of L2 . A typical example is a periodic with derivatives of order < k, of the period 1, signal from the Sobolev ball Sk;1 2 (L): Z 1 (f (k) (x))2 dx L2 : 0
The indicated properties of f are equivalent to the fact that 1 X
j =1
(2j )2k [f22j 1 + f22j ] L2 ;
(6.2)
where ffj g1 j =0 are the Fourier coeÆcients of f in the standard trigonometric orthonormal basis of L2 [0; 1]
p
p
0 (x) 1; 2j 1(x) = 2 cos(2jx); 2j (x) = 2 sin(2jx); j = 1; 2; ::: Note that (6.2) is just a way to x the rate at which f can be approximated, in the L2 -metric, by a trigonometric polynomial of degree j . 125
venient model of observations, since it admits a very transparent and simple \translation" to the language of the L2 -geometry. As a result, with this model we get a \convenient road" to a number of interesting and instructive results. Now the role of \volume of observations" n is played by the quantity " 2 ; instead of asking \how well can we recover a signal from a large number n of noisy observations of the signal", we now ask how well we can recover a signal aected by Gaussian white noise of small intensity "2 . \Scienti c practice" demonstrates that the majority of asymptotic, " ! 0, results of the L2 regression theory with observations (6.1) can as well be established (under appropriate technical assumptions) for more (or less?) realistic discrete models of observations like the one where we observe the signal along an equidistant (or random) n-point grid, variance of the noise aecting a particular observation being 2 . The \translation" of the results obtained for the continuous model of observations to those for the discrete model is given by the correspondence 2 n 1 = "2 . Which one of these models to use, it is, essentially, the question of mathematical convenience, and in our course we have reached the point when it de nitely is easier to deal with model (6.1).
L2 regression theory: the language. It is well-known that observations (6.1) are equivalent to the possibility to observe the L2 [0; 1]-inner products of the signal f with functions 2 L2 [0; 1]. Namely, given a function 2 L2 [0; 1], one can convert a realization of observation (6.1) in a realization of the random variable Z 1
0
where
(x)dy (x) = (f; ) + " ;
(f; g ) =
Z 1
0
(6.3)
f (x)g (x)dx
is the standard inner product in L2 [0; 1]. It turns out that the vector of random noises fi gki=1 corresponding to every nite collection of i 2 L2 is Gaussian, and its covariance matrix is just the Gram matrix of 1 ; :::; k :
E f g = (;
)
8; 2 L2 :
(6.4)
It should be mentioned that for every 2 L2 the left hand side in (6.3) is wellde ned with probability one, the probability space in question being generated by the underlying Wiener process; thus, it makes no sense to speak simultaneously about values of all random noises f j 2 L2 g, but it does make sense to speak about values of any countable collection from this set, and this is the only situation we shall deal with. The outlined properties of model (6.1) allow to pass from the \functional" language to the geometric one and to represent the situation we are interested in as follows. We x a real separable Hilbert space H with inner product (; ) and the associated norm k k; the \signals" we are observing are just the elements of this space. An observation y of a signal f 2 H is comprised of noisy measurements
fy(f; ") = (f; ) + "g2H
(6.5)
Gaussian random vectors with covariance matrices given by (6.4). In (6.5), " is given \noise intensity"1) Note that a suÆcient statistics for (6.5) is given already by the sequence of observations n
y f;" = yif;" yi (f; ") = (f; i ) + "i (f; i ) + "i
o
(6.6)
associated with a xed orthonormal basis fi g1 i=1 of H ; given these observations, we can recover y(f; ") for every 2 H according to
y (f; ") =
1 X i=1
(; i)yi (f; "):
Thus, in fact our observations are just noisy observations of the coordinates of the signal in a somehow xed orthonormal basis of H , the noises in the observations forming a sequence of independent N (0; "2) random variables. Our goal is to recover signal f from the associated observations. A recovering routine fb() is a Borel mapping acting from the set RZ of sequences with real entries to H , the set of sequences being equipped with the usual Tikhonov topology of the direct product (in this topology, RZ is a Polish space). The reconstruction associated with observations (6.5) is the random vector
fb(y f;") 2 H; where y f;" are given by (6.6), fi g being a xed orthonormal basis in H (it does not matter how we choose this basis: as explained above, the observations associated with a basis can be converted to those associated with any other basis). Given noise intensity ", we measure the quality of a recovering routine fb at a signal f 2 H by the quantity
n
R"(f;b f ) = E k fb(yf;") f k2
o1=2
;
(6.7)
the expectation being taken over the observation noise. Given a subset F H , we measure the quality of the routine fb on the set F by the corresponding worst-case risk R" (f;b F ) = sup R"(f;b f ): (6.8) f 2F
The minimax risk associated with F is the function
R ("; F ) = infb R"(f; F ) = infb sup R"(f;b f ): f
f f 2F
(6.9)
Finally, an estimation method is a family ffb" g">0 of recovering routines parameterized by the noise intensity; we say that such a method is asymptotically optimal/optimal in order in the minimax sense on a set F H , if
R"(fb"; F ) C (")R("; F ) where C (") converges to 1, respectively, remains bounded as " ! +0. 1) In the standard terminology, the intensity of noise in (6.5) is "2 rather than ". In order to get a name for the quantity ", we prefer to call it, and not its square, the intensity of noise.
estimates The problem. Assume we observe signals from a separable Hilbert space H according to (6.6) and are given a collection n
o
M = Mj = ffb"j g">0; j = 1; :::; M of M estimation methods. For every signal f 2 H , let X M"(f; y) = min k f j fb"j (y ) k M 2R j
be the distance from f to the linear span of the estimates fb"j (y ). When aggregating the given estimates in a linear fashion (however, with the weights which may depend on observations), and being clever enough to nd, for every signal f underlying observations and every sequence of observation noises, the best { the closest to f { \mixture" of this type, we would recover f with inaccuracy M"(f; y ); the risk of this \ideal linear aggregation" would be
n
o1=2
2 f;" RLA M ("; f ) = E M" (f; y )
:
(6.10)
The problem we are about to address is as follows:
Aggregation problem L. Given a collection M of M estimation methods, nd an estimation method with the risk, at every f 2 H , \close" to the risk RLA M ("; f ) of the \ideal" linear aggregation of the methods from M.
A solution: the idea. The problem we are interested in admits an extremely simple (and, as we shall see in a while, quite powerful) \solution" as follows. Assume we observe a signal f 2 H twice, so that we have two realizations y 0; y 00 of observation y f;, the noises aected the realizations being independent of each other; let the intensities of noise in y 0 and y 00 be "0 ; "00 , respectively. Let us use the rst realization of observations to build the M estimates f j = fb"j0 (y 0 ), j = 1; :::; M . Consider the linear span 8 <
L = L(y 0 ) = :g =
M X j =1
9 =
j f j j 2 R M ; H
of these estimates; this is a random linear subspace of H of dimension not exceeding M . To simplify notation, assume that this dimension almost surely is equal to M (what follows can be modi ed in an evident fashion to capture the general case as well). Applying the orthogonalization process, we may build a basis in L comprised of M orthonormal vectors h1 ; :::; hM ; these vectors are deterministic functions of y 0 . Now let us use the second observation, y 00, to evaluate the orthogonal projection fL of f onto L = L(y 0). The orthogonal projection itself is given by
fL =
M X
(f; hj )hj ;
j =1
(6.11)
k f fL k2 = M2"0 (f; y0):
(6.12)
Observation y 00 provide us with noisy observations
zj = (f; hj ) + "00 j00 ; j00 being independent of each other and of y 0 N (0; 1) random noises. Using zj in (6.11) instead of the \true" Fourier coeÆcients (f; hj ), we come to the estimate M X
fe = fe(y 0; y 00 ) =
j =1
zj hj = fL + "00
M X j =1
j00 hj :
(6.13)
Let us evaluate the quality of the resulting estimate of f . We have
k f f k =k f fL k e 2
2
+2"00 (f
fL ;
M X j =1
j00 hj ) + ("00 )2 M:
Taking expectation over noises aecting y 0; y 00 and taking into account that j00 are independent of y 0 , we get n
E k f fe k2
o
=
whence, in particular,
n
E k f fe k2
2
0 RLA M (" ; f )
o1=2
+ ("00 )2 M;
(6.14)
0 00 p RLA M (" ; f ) + " M:
(6.15)
The simple result we have obtained looks as a \nearly solution" to the Aggregation problem L: in the right hand side of (6.15) we see the risk RLA M of the \ideal" linear 0 aggregation (associated,p however, with noise intensity " rather than "), plus the \aggregation price" "00 M . As we shall see, in many important cases this price is negligible small as compared to the risk of the ideal linear aggregation, so that (6.15) is, basically, what we need. There is, however, a diÆculty: our estimation method requires two independent observations of the signal, while in our setting of the Aggregation problem we are allowed to use only one observation. We are about to demonstrate that this diÆculty can be easily avoided { we always can \split" a single observation we have into two (or 1000) independent observations.
Splitting observations. Let us start with the following simple situation: we are given a realization of an N (a; 2 ) random variable; is known, a is unknown. Can we \split" our observation in a given number k of independent of each other realizations ` , ` = 1; :::; k, of N (a; `2 ) random variables? What could be the corresponding variances `2 ? The answer is immediate: we claim that the required partitioning is possible, provided that k 1 1 X = : (6.16) 2 `=1 `2
0
1
1 q12 q13 ::: q1k B 1 q22 q23 ::: q2k C C Q=B B @ ::: A ::: ::: ::: ::: C 1 qk2 qk3 ::: qkk such that for the rows q1 ; :::; qk of the matrix it holds
qjT q` = Æj` being the Kronecker symbols.
j2 Æ ; 2 j`
(6.17)
Matrix Q can be built as follows. Indeed, let e1 ; :::; ek be the standard basic orths in Rk , and let rj
=
j ej ; j
= 1; :::; k; u =
T ; :::; : 1 k
By construction we have rjT u = 1; j
= 1; :::; k;
and u is a unit vector by (6.16). Let us pass to an orthonormal basis of Rk where the rst vector of the basis is u, and let qj be the vector of coordinates of rj in this new basis. Then (qj )1 = rjT u = 1
2
8j and qjT q` = rjT r` = j2 Æj`;
(6.18)
as required.
Now assume that we are given ; fi gki=1 satisfying (6.16), and a realization of N (a; 2) random variable, and our goal is to \split" in a sample of k independent N (a; i2) random variables 1; :::; k . To this end let us build matrix Q satisfying (6.17), generate k 1 independent of each other and of \arti cial" N (0; 1) random variables !1 ; :::; !k 1 and set 0 1 0 1 1 B ! C 1 C B C B C B 2 C B (6.19) B C = Q B !2 C : B C @ ::: A @ ::: A k !k 1 From (6.17) combined with the fact that the rst column of Q is comprised of ones it immediately follows that j , j = 1; :::; k, are independent of each other N (a; j2 ) random variables. After we know how to split a single realization of an N (a; 2 ) random variable, we know how to split a single realization y f;" of observation (6.6) into a desired number k of independent realizations y f;"i , i = 1; :::; k, of the same signal with prescribed noise intensities 1 ; :::; k satisfying the \balance equation" k 1 1 X = "2 i=1 "2i
(6.20)
yjf;", using, for every index j = 1; 2; :::, \its own" arti cial random ! -variables. Thus, from the statistical viewpoint, we always may assume that instead of observation (6.6) we are given a desired number k of independent of each other similar observations, the noise intensities of the observations being linked by (6.20). From the implementation viewpoint, our \observation splitting" just means that we pass from deterministic recovering routines to randomized ones. As a byproduct of our \splitting result", we see that as far as model (6.6) of observations is concerned, the quantity " 2 indeed behaves itself as the \volume of observations": given an observation of \volume n = " 2 ", we can partition it into a prescribed number k of independent observations of prescribed volumes nk = "k 2 , provided that n = n1 + ::: + nk . And of course vice versa: given k independent observations y f;"i of the samePsignal, we can aggregate them into a single observation P y of the volume n = ni "1i : it suÆces to set i
i
2
k X 1 y= ni y f;"i : n1 + ::: + nk i=1
Finally, note that in the discrete model of observations similar \splitting" is given by \physical" splitting of observations, like partitioning all observations in subsequent segments, 5 observations per each, and putting the rst observation from every segment to the rst group, two next { to the second group, and two more { to the third one.
Intermediate summary. Let us come back to Aggregation problem L, and let us x somehow n o a set F H of signals we are interested in, along with a collection M = fb"j () "> of M estimation methods. Assume that 0
j =1;:::;M
A. The worst-case, over f 2 F , risk LA RLA M ("; F ) = sup RM ("; f )
f 2F
of \ideal linear aggregation" of the estimation methods from M is a \wellbehaved" function of " as " ! 0: whenever Æ (") " is such that Æ (")=" ! 1, " ! +0, one has LA RLA (6.21) M (Æ ("); F ) (1 + o(1))RM ("; F ); " ! 0: B. The worst-case, over f 2 F , risk of \ideal linear aggregation" of the estimation methods from M is \non-parametric": " 1 RLA (6.22) M ("; F ) ! 1; " ! 0:
Both of these assumptions are very natural. As about B, note that already the minimax risk of estimating k-parametric signals
f
2 Fk (L) = ff =
k X i=1
fi i j
k X i=1
fi2 L2 g
known to us. Under assumptions A, B we can implement the outlined aggregation scheme as follows: Setup. Choose Æ1 ("); Æ2 (") satisfying the relations 1 1 1 (a) Æ (") + Æ (") = " ; Æ (") ! 1; " ! +0; (b) " 1 LA (c) Æ2 (")RM ("; F ) ! 1; " ! 0; 2 1
2 2 1
2
(6.23)
which is possible in view of (6.22). Aggregation.
1) Given observation y = y f;" with known noise intensity ", we split it into two independent observations y 0 = y f;Æ (") ; y 00 = y f;Æ (") . 2) We build M vectors f j = fbÆj (") (y 0), j = 1; :::; M , and apply to these vectors orthogonalization procedure to get an orthonormal system h1 ; :::; hM with the same linear span. 3) We use the observation y 00 to get estimates zj = (f; hj ) + Æ2 (")j of the projections of f on the directions h1 ; :::; hM and de ne the resulting estimate of f as 1
2
1
fe"(y ) =
M X
j =1
zj hj :
Relation (6.15) immediately yields the following
Proposition 6.2.1 Under assumptions A, B one can solve the Aggregation problem L associated with the collection M of estimation methods and the set of signals F \asymptotically ideally" in the minimax sense. Namely, for the outlined estimation method ffe"g">0 one has
n
R" (fe"; F ) sup E k f fe"(yf;") k2 f 2F
o1=2
(1 + o(1))RLA M ("; F ); " ! +0:
(6.24)
Indeed, from (6.15) it follows that
p R" (fe"; F ) RLA M (Æ1 ("); F ) + Æ2 (") M:
By (6.23.b) and Assumption A, the rst term in the right hand side of this bound is LA (1 + o(1))RLA M ("; F ), while the second term, by (6.23.c), is o(1)RM ("; F ). Note that we are not restricted to deal with mimicking the best linear aggregation of once for ever xed number M of estimation methods: we can allow the collection to extend at certain, not too high, rate as " ! +0. Thus, assume that we have a \nested family" n o M = fb"j () "> 0
j =1;:::;M (")
of estimation methods. The notions of the ideal linear aggregation of the methods LA from the family and the associated \ideal aggregation risks" RLA M ("; f ) and RM ("; F )
extended to our new situation. The assumptions A, B which allowed us to get Proposition 6.2.1 now should be modi ed as follows: A remains unchanged, and B is replaced with the assumption
B.1. The worst-case, over f 2 F , risk of \ideal linear aggregation" of
the estimation methods from M satis es the relation q
" M (")
1
RLA M ("; F ) ! 1; " ! +0
(6.25)
which is an upper bound on the rate at which M (") is allowed to grow as " ! +0. Finally, the setup rule (6.23.c) should be replaced with
q
Æ2 (") M (")
1
RLA M ("; F ) ! 1; " ! +0:
(6.26)
With these modi cations of the assumptions and the construction, we still ensure (6.24), i.e., still are able to get \asymptotically ideal" linear aggregation of our, now extending as " ! +0, nested family of estimation methods.
6.3 Application: aggregating projection estimates Recall that we have xed an orthonormal basis fi g1 i=1 in our \universe" { in the Hilbert space H . To the moment this basis was playing a completely technical role of representing an observation as a countable (and thus { \tractable") sample of random variables. In fact, in the traditional L2 regression theory basis plays a much more important role { the majority of the traditional estimators are \basis-dependent". As the simplest { and typical { example, consider a linear or, better to say, a simple lter estimate associated with a given basis fi g.
6.3.1 Linear estimates
A linear estimate is speci ed by a square-summable sequence of its weights = fi g1 i=1 and is just 1 X fb (y ) = i yi i : (6.27) As applied to an observation
y f;",
fb (y f;")
=
i=1
the estimate becomes
1 X i=1
1 P
i (f; i )i + "
"1 X
i=1
#
i i i
(6.28)
The \stochastic vector series" " i i i in the right hand side of this expression i=1 clearly converges in the mean square sense to a random element of H (recall that is a square summable sequence), so that the estimate makes sense. One can easily
R2" (fb; f )
1 P
1 P
= (1 i )2 (f; i )2 + "2 2i i=1 i=1 = ds2 (; f ) + "2 e2 ();
d(; f ) =
e(; f ) =
1 P
si=1
1 P
i=1
(1 i )2 (f; i)2
(6.29)
2i :
The \deterministic" component d2 (; f ) of the squared risk depends on the signal f and is nothing but the squared norm of the bias ofPour lter (the dierence between P the input f = fi i of the lter and its output i fi i in the absence of errors; i i from now on, fi = (f; i ) stand for the coordinates of a signal in our xed basis). The stochastic component "2 e2 () of the squared risk is nothing but the energy of the noise component of the output, the input being aected by white noise of intensity ". The simplest linear estimates are the so called projection estimates fbk { the weights j are equal to 1 for j not exceeding certain k (called the degree of the estimate) and are zero for j > k. Note that for the projection estimate of degree k relation (6.29) becomes 1 X R2" (fbk ; f ) = fi2 + "2 k (6.30) i=k+1
and is very transparent: the larger is the degree of the estimate, the less is the deterministic component of the squared risk and the larger is its stochastic component { the situation similar, \up to vice versa", to the one with window estimates. \Reversed", as compared to the case of window estimates, monotonicity properties of the deterministic and the stochastic components of the squared risk as functions of the \window width" (the role of the latter now is played by the degree k of the estimate) is quite natural: narrow windows in \time domain" correspond to wide windows in the \frequency domain". From the theoretical viewpoint, the interest in linear and projection estimates comes from the fact that they are minimax optimal in order on very natural classes of signals { on \ellipsoids". An \ellipsoid" is given by a sequence of its \half-axes"
a1 a2 ::: :
ai 2 R [ f+1g; ai ! 0; i ! 1
and its \radius" { a positive real L { and is de ned as
E (fai g; L) = ff
2Hj
1 X
fi2 2 i=1 ai
L2 g:
For example, the class of k timesR dierentiable periodic with derivatives of order < k, of period 1, signals f with 01 jf ( k)(x)j2 dx L2 is an ellipsoid with respect to the standard trigonometric orthonormal basis in L2 [0; 1], and its half-axes are ai = (1 + o(1))(i)k (see (6.2)). Similarly, a Sobolev ball S1k;2 (L) (see Chapter 2) is
in L2 [0; 1], the half-axes of the ellipsoid possessing the same asymptotics as in the periodic case (6.2). It can be easily seen that when estimating signals from a given ellipsoid, one can nd optimal in order estimates already among the simplest { the projection { ones, choosing properly the degree of the projection estimate as a function of noise intensity ". Moreover, given an ellipsoid E and a noise intensity ", we can easily build the best, in the minimax sense on E , among all linear estimates. In view of (6.29), to this end it suÆces to solve the optimization program " () ! min 1 1 1 P P P " () supf (1 i )2 fi2 + "2 2i j f : (fi =ai )2 L2 g =
i=1 2 L maxi (1
i )2 a2i
i=1 1 2 P 2 +" i : i=1
i=1
It is immediately seen that the optimal solution to this optimization program has the following structure: (a t("))+ i = i ; (6.31) ai where a+ = max(a; 0) and t(") is the minimizer of the univariate function X [(ai t)+ ]2 2 2 2 " (t) = L t + " a2i i over t > 0. A remarkable result of M.S. Pinsker [26] is that under minimal regularity assumptions on the sequence of half-axes of the ellipsoid in question (which are satis ed for the case when the axes decrease as a power sequence aj j or as a geometric progression), the optimal in the minimax sense on E linear estimate is asymptotically optimal in the minimax sense among all possible estimates. As a byproduct of this fact, we can point out not only the order of the principal term of the minimax risk R("; E ) as " ! +0, but this principal term itself; this is a very rare case in the non-parametric regression when we know both the principal term of the minimax risk and an asymptotically optimal up to factor (1 + o(1)), not just in order, estimation method. A shortcoming of the initial results on minimax optimality in order/minimax optimality up to factor (1 + o(1)) of projection/linear estimates on ellipsoids is that to build an estimate of this type we should know the parameters of the ellipsoid { the sequence of its half-axes and its radius (cf. the case of estimates from Chapters 1 { 2). As far as the projection estimates are concerned, there are several popular, although not too well understood theoretically, techniques for specifying the degree of the estimate from observations; for linear estimates, for a long time no theoretically valid \adaptation schemes" were known. A breakthrough in the area is due to Efroimovich and Pinsker [27] who proposed an adaptive estimation method which is asymptotically optimal (up to (1 + o(1))!) in the minimax sense on a wide spectrum of ellipsoids. We are about to demonstrate that the results completely similar to those of Efroimovich and Pinsker can be obtained just by linear aggregation of projection estimates.
Looking at the structure (6.31) of the optimal, in the minimax sense on an ellipsoid E , linear estimate, we see that this estimate has a nonincreasing sequence of weights belonging to the segment [0; 1] 2) . In other words, all these estimates belong to the set of linear estimates with nonincreasing weights from [0; 1]: X = f 2 `2 j 1 1 2 :::; 2i < 1g: i
Now, given a signal f 2 H and a positive ", we may ask ourselves what is the best, for these f and ", linear estimate from the class . The answer is clear: the corresponding weights are given by the solution to the optimization problem (Pf;" ) :
min
8v 1
i=1
9 =
i )2 fi2 + "2 2i ] j 2 ; :
[(1
The optimal value in this problem, i.e., the best quality of reproducing f from observations (6.6) by a linear estimate with nonincreasing weights, the intensity of noise being ", is certain function (f; ") of f; ". What we are about to build is an estimation method Bm = ffb"m ()g">0 depending on a single \design parameter" m 2 N with the following property: (!) Whenever k f k 1 and 0 < " < 1, the risk of the estimate fb"m at f can be bounded as follows: R" (fb"m; f ) (1 + m("))(f; ") + Cm"Lnm(1="); (6.32) where m (") ! 0; " ! 0, is independent of f , Cm depends on m only, and Lnm (x) = ln (1 + ln (1 + ln (1 + :::{z+ ln (1 + ln (1 + x)) :::)))} | is the \m-iterated" logarithm.
m
times
Postponing for the moment the construction which leads to (6.32), let us look what are the consequences. Consider an ellipsoid E (fai g; L) and assume (in fact this assumption can be eliminated) that the ellipsoid is contained in the unit ball. According to Pinsker's result, for a given intensity of noise the best, in the minimax sense on E , linear estimate is minimax optimal up to (1 + o(1)) factor as " ! +0, and this estimate, as we have seen, for every " is given by certain weight sequence = (") 2 . Combining this fact with the de nition of (; ), we conclude that for the minimax risk R ("; E ) associated with the ellipsoid E it holds R ("; E ) (1 o(1)) sup (f; "); " ! +0: f 2E
In view of this relation, (6.32) implies that R" (fb"m; F ) (1 + o(1))R("; E ) + Cm"Lnm(" 1); " ! +0: Consequently, 2) From (6.29) it is absolutely clear that there is no sense to speak about linear estimates with part of the weights outside [0; 1]: replacing a weight i 62 [0; 1] by the closest weight from this segment, we always improve the quality of the estimate. The actually important observation is that the weights i given by (6.31) form a nonincreasing sequence.
max sense, up to (1 + o(1)) factor, on every ellipsoid E such that
1
"Lnm (" 1 )
R ("; E ) ! 1; " ! 1:
(!!) is a very strong \certi cate of adaptive optimality", since the minimax risks associated with interesting ellipsoids do not decrease with " too fast. E.g., in the case when fi g is the standard trigonometric basis in L2 [0; 1], it turns out that
When the half-axes ai of E decrease sub-linearly:
ai O(i ); for some , as it is the case for ellipsoids comprised of smooth periodic functions of xed degree of smoothness, one has
R ("; E ) ! 1; " ! +0; " ln(" 1 ) so that already the method B1 is asymptotically optimal on E ;
When ai decrease at most exponentially:
ai O(expf ig); i ! 1; for some , as it is the case, e.g., for classes of functions f (x) = (expf2ixg), (z ) being analytic in a xed ring containing the unit circumference, one has
R ("; E ) ! 1; " ! +0; "Ln (" 1 ) 2
so that the method B2 is asymptotically optimal on E ;
When ai decrease at most double-exponentially:
ai O(expf expfO(i)gg); the method B3 is asymptotically optimal on E , etc.
6.3.3 The construction We are about to build the estimation method 0 < " < 1.
Bm underlying (!).
In what follows,
10 . Let us set
(") = Lnm1=6 (10" 1): (6.33) Given " > 0, let us de ne a sequence of positive integers fkj (")g1 j =1 as follows. Let 1 (") be the rst integer such that (1 + (")) > (") . We set (
(") ; kj (") = kj; + b(1 + ("))j c; jj > (") j 1
(6.34)
The structure of the sequence fkj g is quite transparent:
(") = O 1 (") ln( 1 ("))
initial terms of the sequence are just subsequent integers 1,2,..., so that the corresponding dierences dj = kj +1(") kj (") are equal to 1. Starting with j = (") + 1, the dierences dj of two subsequent terms of our sequence become integer approximations of the geometric progression (1 + ("))j +1 . Note that the number K (n; ") of terms kj (") not exceeding a positive integer n is not too large: h
i
K (n; ") = maxfj : kj (") ng Om (1) 1 (") ln( 1 (")) + ln n :
(6.35)
20 . Let us set
f (") = Ln2 (102 " 2 ); ` = 1; :::; m; N (6.36) ` ` 1 where for ` 1 Ln` () is the `-iterated logarithm and Ln0 (x) = x, and let N` (") be f ("). the rst integer in the sequence fkj (")g which is N ` For ` = 1; :::; m, let P` (") be the set of all projection estimates fbk of degrees belonging to the sequence fkj (") 1g and not exceeding N` (") 1; note that
P1 (") P2 (") ::: Pm ("): Let K` (") be the cardinality of P` ("); according to (6.35), for all small enough values of " we have h
i
K` (") Om (1) 1 (") ln( 1 (")) + ln N` (")
Om(1) 1 (")Ln`(" 1):
(6.37)
Our plan is as follows: given " > 0, we aggregate all projection estimates from P` (") according to the scheme of Section 6.2, thus getting m \aggregated estimates" fe"`, ` = 1; :::; m, and then aggregate these m estimates, thus coming to the desired estimate fe" . The precise description of the construction is as follows: Setup. We choose in advance three positive functions Æ1 ("), Æ2 ("); Æ3 (")
in such a way that
(a) (b) (c)
3 P
1 2 (") Æ =1 Æ1 (") " Æ (")
=
1 "2
! 1; " ! +0; Om(1) ("") ; = 2; 3;
(6.38)
which of course is possible. Building estimate fe". 1) Given observation y = y f;" with known noise intensity ", we split it into three independent observations y = y f;Æ (") , = 1; 2; 3.
apply to the vectors the orthogonalization procedure to get an orthonormal system fhj gKj =1(") such that the linear span of h1 ; :::; hs is, for every s = 1; 2; :::; K1 ("), coincides with the linear span of f 1 ; :::; f s. 3) We use the observation y 2 to get estimates 1
zj = (f; hj ) + Æ2 (")j of the projections of f on the directions hj , j = 1; :::; K1 (") (the noises j , j = 1; :::; K1("), are independent of each other and of y 1 N (0; 1) random variables), and for every ` = 1; :::; m de ne g ` fe"` (y ) 2 H as
g` =
KX ` (") j =1
zj hj :
4) We apply to the m vectors g 1; :::; g m the orthogonalization process to get an orthonormal system fe` gm `=1 with the same linear span, and use the 3 observation y to get estimates w` = (f; e` )+ Æ3 (")` of the projections of f onto e1 ; :::; em , the noises ` , ` = 1; :::; m, being independent of each other and of y 1; y 2 N (0; 1) random variables, and de ne the resulting estimate fe" of f as m X fe" = fe" (y f;") = w` e` : `=1
Accuracy analysis, I. Let n
o
0 (") = = fj g1 j =1 2 `2 : l = l0 ; kj (") l l < kj +1 ("); 8j = 1; 2; ::: ;
and let Let us set
`(") = f 2 (") j l = 0; l N` (")g ; ` = 1; :::; m:
R` ("; f ) = 2inf R"(fb; f ); ` (")
(6.39)
where fb is the linear estimate with weight vector . Observe that every weight vector 2 `(") is a linear combination of the weight vectors of projection estimates fbk 2 P` ("). It follows that the risk RLA P` (") ("; f ) of \ideal linear aggregation" of the estimates from P` (") is, for every f and " > 0, at most R` ("; f ). Applying (6.14), we get
2
R2" (fe"`; f ) RLA P` (") (Æ1 ("); f )
+ Æ22 (")K`(") R2` (Æ1 ("); f ) + Æ22 (")K`("):
(6.40)
Recalling how the resulting estimate fe" is obtained from the estimates fe"` and applying the same arguments as those used to get (6.14), we conclude that h
i
R2" (fe"; f ) `=1min R2` (Æ1 ("); f ) + Æ22(")K`(") ;:::;m whence also
q
R"(fe"; f ) `=1min R`(Æ1 ("); f ) + Æ2(") K`(") ;:::;m
+ Æ32 (")m;
p
+ Æ3 (") m:
(6.41)
we have built satis es (!). Let us x f 2 H , k f k 1, and " 2 (0; 1). 10 . It is clear that the optimization program (Pf;") specifying the best, for noise intensity ", linear estimate of f with weights from is solvable; let = (f; ") be an optimal solution to this problem. The corresponding squared risk is 2 (f; ") = and
1 X
j =1
(1 j )2 fj2 + "2
1 X
j =1
2j ;
(6.42)
1 1 2 :::; j ! 0; j ! 1
by the de nition of . 20 Let n be the largest of integers i such that i (") (if no i property exists, n = 0). Note that (6.42), (6.43) imply that
(6.43)
1 with this
(f; ") "n1=2 ("):
(6.44)
On the other hand, it is clear that (f; ") 1 (since the value of the objective in (Pf;" ) is k f k 1 already at the trivial feasible solution = 0). We conclude that 1 n 2 2 < N1 ("): " (") Let ` ` (f; ") be the largest of values ` = 1; :::; m such that n < N` ("). 30 . Let us build weight vector b 2 ` (") as follows:
If j N` ("), then b j = 0; If j < N` ("), then there exists the largest i = i(j ) such that ki(") j , and we set
b j = ki (") ; i = i(j ):
Note that by construction b 2 ` ("). Let 1 1 X X 2 2 2 2 b R = (1 j ) fj + Æ1 (") b 2j j =1
j =1
(6.45)
be the squared risk of recovering f by the linear estimate with the weight vector b , the intensity of noise being Æ1 ("). Our local goal is to verify that
R2 (1 + m ("))22 (f; ");
(6.46)
m (") 0 being an independent of f and converging to 0 as " ! +0 function. The cases of f = 0 and/or = 0 are trivial; assume that f 6= 0, 6= 0, and let us bound from above the ratio 2 = R(f;") maxfd2 ; s2 g; 2
2
P
d2
=
s2 =
j P j
(1 b j )2 fj2
(1 j )2 fj2 ; P
b 2j
Æ12 (") P j "2 2j j
(6.47)
have j < (") and b j = 0. Thus, we have (1
b j )2 (1 (")) 2 (1 j )2
for all j , whence
d2 (1 (")) 2 : (6.48) It remains to bound from above s2 ; the rst ratio in the expression de ning s2 does not depend on f and tends to 1 as " ! +0 by (6.38.b), so that we may focus on bounding the ratio N P P b2 b 2j j =1 #2s = Pj 2 jP ; N = N` ("): N j 2 j j j =1
Note that if the initial N -dimensional segment of b diers from that one of (this is the only case we should consider), then the connection between these segments is as follows. We can partition the range f1; :::; N g of values of index j in subsequent groups I1 ; :::; Ip in such a way that 1. The rst = (") = O ( 1 (") ln 1 (")) of the groups are singletons: Ij = fj g, j , and b j = j for j ; 2. For < l p, the group Il = fj j kl j < kl+1 g contains dl indices, where dl = b(1 + ("))l c, and b j = kl for j 2 Il . Let
S
=
Sl = Note that for l + 2 we have
"
P
2j jP =1 2j ; j 2Il
j =1
j = S +
2kl
l= +1
#2s
Sl dl
dl 2kl
S + d+12k
whence
p P
#
j ; l = + 1; :::; p: =
(see (6.43)), and therefore N P b2
P b2
j =1
1 1
p P
dl dl 11 Sl 1 l = +2 ! p P 1 max [dl dl 1 ] S + Sl + d +1 2k+1 l +2 l! = +1 N P 2 1] max [ d d j + d +1 2k+1 ; l l 1 l +2 j =1
+1
+
max [d d 1 ] l +2 l l 1
+ d +1 2k =S : +1
(6.49)
dl dl 1
b(1+(")) c = b(1+ ("))l c l
1
(1+("))l (1+("))l 1 1
(6.50)
(1 + ("))(1 + 2(")) [since (1 + (")) 1 (")] Besides this, S 2k by (6.43), while d +1 (1 + (")) +1 , so that 1 (6.51) d +1 2 +1 =S (1 + (")) +1 1 Om (1) ln 1 (") (recall that = (") = O ( 1 (") ln 1 (")) and (") is small when " is small). Combining (6.47), (6.48), (6.49) { (6.51), we come to (6.46). 40 . With (6.46) at hand, we are nearly done. Indeed, by origin of R we have R` (Æ1 ("); f ) R (see the de nition of R` in \Accuracy analysis I"). Combining this observation, (6.46) and (6.41), we come to the inequality +1
R"(fe"; f ) (1 + m("))(f; ") + Æ2(") K` (") + Æ3(")pm; q
whence, by (6.38.b; c),
p " q R"(f"; f ) (1 + m ("))(f; ") + Om(1) (") K` (") + m : For a given f , there are two possible cases: (I): ` < m; (II): ` = m. In the case of (I) we have n N` +1 ("), whence, by (6.44), e
p
(6.52)
q
(f; ") "(") n "(") N` +1 (") Om (1)"(")Ln` (" 1 )
(note that due to their origin, N` (") = O(Ln2` 1 (" 1 )) when " ! 0). Therefore in the case of (I) the ratio of the second right hand side term in (6.52) to the rst one does not exceed Om (1) times the quantity
pK
` (") 2 (")Ln` (" 1 )
p
5=2 1 1 Om (1) ("Ln`) (" Ln1 )` (" ) 5=12 m (" 1 ) Om (1) LnLn ` (" 1 ) 1 Om (1)Lnm =6 (" 1 )
m (") 0; " +0
p
!
!
[we have used (6.37)] [see (6.33)] [since ` < m]
Thus, if f is such that (I) is the case, then relation (6.52) implies that R"(fe"; f ) (1 + m("))(f; ") (6.53) with independent of f function m (") ! 0; " ! +0. It remains to consider the case when f is such that (II) is the case. Here the second right hand side term in (6.52) is
Om (1) ("")
hq
p
i
K` (") + m
q
Om(1)" 3=2(") Lnm (" Om(1)"Lnm (" 1)
1)
[see (6.33)]: Combining the latter relation with (6.53) and (6.52), we come to (6.32).
We have considered two of our three aggregation problems { C, where we are interested to mimic the best convex combination of a given family estimates, and L, where the goal is to reproduce the best linear combination of the estimates from the family. Now let us address the third problem. Thus, assume we are given a nested family "> M = ffb"j ()g j ;:::;M of estimates of signals f 2 H , H being a separable Hilbert " space with an orthonormal basis fig, via observations (6.6). For every f 2 H and every " > 0, let us denote by =1
0
( )
n
RM ("; f ) = jmin R"(fb"j ; f ) jmin E k f fb"j (yf;") k2 M (") M (")
o1=2
the minimal, over the estimates from the family, risk of recovering f , the intensity of noise being ". We are interested to solve the following Aggregation problem V. Given a nested family M of estimation methods, nd an estimation method with the risk, at every f 2 H , \close" to the risk RM ("; f ) of the best, with respect to f , estimate from the family.
A solution to the problem can be obtained by straightforward exploiting the aggregation technique from Section 6.2, which now should be used in a \cascade" mode. Namely, without loss of generality we may assume that M ("), for every ", is an integral power of 2: M (") = 2(") and that (") is nonincreasing in " > 0. What we intend to do is to split a given observation y f;" into (Æ0 ("))+1 independent observations y j = y f;Æj (") , j = 0; 1; :::; (Æ0(")), and to use y 0 to build all 2(Æ (")) of the estimates from the family, let us call them \estimates of generation 0". We partition these estimates into pairs and use the observation y 1 to approximate the closest to f linear combinations of estimates in every of the resulting 2(") 1 pairs, thus coming to 2(Æ (")) 1 estimates of \generation 1". Applying the same construction to estimates of generation 1 with y 2 playing the role of y 1 , we get 2(Æ (")) 2 estimates of \generation 2", and so on, until a single \estimate of generation (Æ0 ("))" is built; this estimate is the result of our aggregation routine. The precise description of the routine is as follows: 0
0
0
Setup. We choose somehow a function Æ (") > " and set q
Æb(") = (Æ (")) q
"Æ (") : Æ 2 (") "2
(6.54)
Recovering routine fe" . 1) Given observation y = y f;" of a signal f
we set
2 H,
"b = Æ (") and split y into ("b)+1 independent observations y 0 ; y 1; :::; y (b") , the noise intensities being Æ (") for y 0 and Æb(") for every one of the remaining y 's.
1 ("b) 1 Æ 2 (") "2 1 + + = 2; = 2 2 2 2 b 2 Æ (") Æ (") Æ (") " Æ (") " so that the required splitting is possible.
2) We use y 0 to build 2(b") vectors f0j f0j (y 0) = fbb"j (y 0 ) 2 H { \estimates of generation 0". 3) For = 1; :::; ("b), we perform the following operations. Given 2M = 2(b") +1 \estimates of generation 1" { vectors fj 1 = fj 1 (y 0 ; :::; y 1) 2 H { partition them into M pairs P` , ` = 1; :::; M . For every pair P` = ff` 1 ; g` 1g, we build an orthonormal basis fh`; g=1;2 in the linear span of the vectors from the pair and use the observation y to build estimates ;` b z;` = (f; h;` ) + Æ (") ; = 1; 2
with independent of each other and of y 0 ; :::; y 1 N (0; 1) random noises ;`, = 1; 2. We set ;` `; f` = z1;` h`; 1 + z2 h2 : After all M pairs P` are processed, M = 2(b") estimates f` of \generation " are built, and we either pass to the next step (if < ("b)), increasing by one, or terminate (if = ("b)), the single estimate of generation ("b) being the result fe" (y ) of our aggregation routine. Exactly the same reasoning which led us to (6.14) demonstrates that for every f and for every = 1; :::; ("b) and every ` = 1; :::; M it holds n
E kf
k
f` (y 0; :::; y ) 2
o
(
2H
)
E g2Lin min k f g k + 2Æb2 (") fP g n ` o 2 + 2Æb2 ("); gmin E k f g k 2P 2
`
while
n
o
E k f f0j (y0) k2 RÆ(") (fb"j ; f ); j = 1; :::; M (Æ(")):
Combining these observations, we come to n
o
E k f fe" k2 j=1;:::;M min RÆ(") (fbÆj(") ; f ) + 2(Æ ("))Æb2("): (Æ(")) Recalling the origin of Æb("), we come to the following "> Proposition 6.4.1 Let M = ff"j g j ;:::;M be a nested family of estimation methods, " M (") being nonincreasing in " > 0. For every function Æ (") > ", the risk of the associated with Æ (), according to the above construction, aggregated estimation method ffe"g">0 satis es the relation R"(fe"; f ) RM (Æ("); f ) + O(1) q 2"Æ(") 2 ln M (") 8(f 2 H; " > 0): (6.55) Æ (") " =1
0
( )
max sense, aggregation:
Corollary 6.4.1 Let F H be a family of signals, and let M be the same nested family of estimation methods as in Proposition 6.4.1. Assume that I. The \minimax risk"
n
RM ("; F ) = sup j=1min E k f fb"j k2 ;:::;M (")
o1=2
f 2F
associated with F; M is a \well-behaved" function of " as " ! 0: whenever a function Æ (") is such that Æ (")=" ! 1 as " ! +0, one has
RM (Æ("); F ) (1 + o(1))RM("; F ); " ! +0; II. The risk RM ("; F ) satis es the relation " ln M (") = o(1)RM ("; F ); " ! +0 (II in fact is an upper bound on the rate at which the number M (") of estimates to be aggregated can grow as " ! +0). Under these assumptions, the estimation methods from the family M restricted on the class of signals F admit \asymptotically eÆcient aggregation": there exists an estimation method ffe"g">0 such that
R" ("; F ) (1 + o(1))RM("; F ); " ! +0: To get the asymptotically eÆcient aggregation mentioned in the Corollary, it suÆces to implement the above construction with Æ (")=" approaching 1 as " ! +0 so slowly that "Æ (") q ln M (") = o(1)RM ("; F ); " ! +0; Æ 2 (") "2 the possibility of such a choice of Æ () is guaranteed by assumption II.
Chapter 7 Estimating functionals, I From now on we switch from the problem of estimating a nonparametric regression function to the problem of estimating functional of such a function.
7.1 The problem We continue to work within the bounds of the L2 -theory and Gaussian white noise model of observations. Geometrical setting of the generic problem we are interested in is as follows: We are given a real separable Hilbert space H with inner product (; ) and an orthonormal basis fig1 i=1 , a set H , a real-valued functional F de ned in a neighbourhood of . A \signal" f 2 is observed in Gaussian white noise of intensity ", i.e., we are given a sequence of observations n
o
y f;" = yif;" (f; i ) + "i ;
(7.1)
fig1i=1 being a collection of independent N (0; 1) random variables (\the
noise"), and our goal is to estimate via these observations the value F (f ) of F at f . As always, we will be interested in asymptotic, " ! 0, results. Recall that the model (7.1) is the geometric form of the standard model where signals f are functions from L2 [0; 1], and observation is the \functional observation"
yf (x) =
Z x
0
f (s)ds + "W (x);
(7.2)
W (x) being the standard Wiener process; in this \functional language", interesting examples of functionals F are the Gateau functionals F (f ) =
Z1
0
G(x; f (x))dx 147
(7.3)
F (f ) =
Z1
0
Z1
::: G(x1 ; :::; xk ; f (x1 ); :::; f (xk ))dx1 :::dxk :
(7.4)
0
In this chapter we focus on the case of a smooth functional F . As we shall see, if the parameters of smoothness of F \ t" the geometry of , then F (f ), f 2 , can be estimated with \parametric convergence rate" O("), and, moreover, we can build asymptotically eÆcient, uniformly on , estimates.
7.1.1 Lower bounds and asymptotical eÆciency In order to understand what \asymptotical eÆciency" should mean, the rst step is to nd out what are limits of performance of an estimate. The answer can be easily guessed: if F (f ) = (f; ) is a continuous linear functional, so that =
1 X
i i ;
i=1
f
i
2 = ( ; i )g1 i=1 2 ` ;
then seemingly the best way to estimate F (f ) is to use the \plug-in" estimate
Fb (y ) =
1 X i=1
f;" i yi = (f; ) + "
1 X i=1
i i
(the series in the right hand side converges in the mean square sense, so that the estimate makes sense); the estimate is unbiased, and its variance clearly is "2 k k2 . Now, if F is Frechet dierentiable in a neighbourhood of a signal f 2 , then we have all reasons to expect that locally it is basically the same { to estimate F or the linearized functional F (g ) = F (f ) + (F 0 (f ); g f ), so that the variance of an optimal estimate in this neighbourhood should be close to "2 k F 0 (f ) k2 . Our intuition turns out to be true: Theorem 7.1.1 [13] Let f 2 and F be a functional de ned on . Assume that (i) is convex, and F is Gateau dierentiable \along " in a neighbourhood U of f in : for every f 2 U , there exists a vector F 0 (f ) 2 H such that lim t!+0
F (f + t(g
t
f )) F (f )
= (F 0 (f ); g
f)
8g 2 ;
and assume that every one of the functions g (t) = (F 0 (f + t(g f)); g is continuous in a neighbourhood of the origin of the ray ft 0g (ii) The \tangent cone" of at f { the set
f), g 2 ,
T = fh 2 H j 9t > 0 : f + th 2 g { is dense in a half-space H+ = fh 2 H j ( ; h) 0g associated with certain = 6 0. Then the local, at f , squared minimax risk of estimating F (f ), f 2 , via observations (7.1) is at least "2 (1 + o(1)) k F 0 (f) k2 :
lim liminf inf Æ!+0 "!+0
sup
Fb2F f 2;kf fkÆ
h
i
E " 2 Fb (yf;") F (f ) 2 k F 0(f) k2;
(7.5)
on the space R1 of real sequences1) and E is the expectation with respect to the noises fig. In other words, for every xed Æ > 0 the squared minimax risk of estimating F (f ) 2 0 in a Æ -neighbourhood (in ) of f is at least " k F (f) k2 +o(1) , " ! +0.
Proof. Let d = F 0 (F ); there is nothing to prove is d = 0, so that we may assume that d 6= 0. By (ii), either d, or d is a limit of a sequence fhi 2 T g; for the sake of de niteness, assume that d = limi!1 hi (the alternative case is completely similar). Let us x positive Æ . 10 Let 2 (0; 1=4). Since d is a limiting point of the set T , there exists a unit vector h 2 T such that (h; d) k d k (1 ). By de nition of T , there exists a segment = [0; r], with 0 < r < Æ such that ft = f + th 2 for all t 2 . Taking into account (i) and decreasing, if necessary, the value of r, we may assume that the function
(t) = F (ft )
satis es the condition (1 2) k d k 0 (t) (1 + 2) k d k; t 2 ;
(7.6)
whence the inverse function 1 (s) satis es the relation 1 js 2) k d k
j 1(s) 1(s0)j (1
s0 j; (0) s; s0 (r):
(7.7)
20 . Now let us x " > 0, let Fb be an arbitrary estimate from F , and let = sup E
2
h
F (y ft;")
t2
b
i2
F (ft )
be the squared minimax risk of estimating the value of F on the segment S = fft gt2 of signals. We claim that
2
r"(1 2) k d k r + 2"
!2
;
(7.8)
Postponing for a while the justi cation of our claim, let us derive from (7.8) the required lower bound. Indeed, since Fb is an arbitrary estimate and by construction segment S is contained in and in the Æ -neighbourhood of fb, we get inf
sup
Fb2F f 2;kf fkÆ
E "
whence liminf inf "!+0
sup
i2
h
2 Fb (y f;" )
Fb2F f 2;kf fkÆ
F (f ) h
"
2 2
i
r(1 2) k d k = r + 2"
E " 2 Fb (yf;") F (f ) 2 (1
!2
2)2 k d k2 :
The resulting inequality is valid for all Æ; > 0, and (7.5) follows. 1) As always, R1 is equipped with metric de ning the Tikhonov topology
;
exists Fb 2 F and " > 0 such that sup E t2
h
F (y ft;") b
i2
F (ft )
r"(1 2) k d k < r + 2"
!2
:
(7.9)
30 .1) Since F (ft ), 0 t r, takes its values in the segment [(0); (r)], we may assume that Fb takes its values in the latter segment; indeed, if it is not the case, we may pass from Fb to the \truncated" estimate 8 > <
(0); Fb (y ) < (0) F (y ) = > Fb (y ); (0) Fb (y ) (r) ; : (r); Fb (y ) > (r) e
when replacing Fb with Fe , we may only decrease the left hand side in (7.9), and the truncated estimate takes its values in [(0); (r)]. 30 .2) Thus, we may assume that (0) Fb () (r). Now let us set
tb(y ) = 1 (Fb (y )): Combining (7.9) and (7.7), we conclude that
8t 2 = [0; r] :
h
i
2 2
E tb(yft;") t 2 < (r r+ "2")2 :
(7.10)
30 .3) Without loss of generality we may assume that f = 0; changing, if necessary, our orthonormal basis in H , we may assume also that h is the rst basic orth 1 (recall that our observations have the same structure in every orthonormal basis). Then ft = t1 , and (7.10) says the following: (*) There exists possibility to recover parameter t 2 [0; r] from observations y1 = t + "1 ; y2 = "2 ; y3 = "3; ::: (7.11) with independent N (0; 1) random noises 1 ; 2 ; ::: in such a way that the variance of the recovering error, for every t 2 [0; r], is < (rr+2"") . 2 2
2
Since observations y2 ; y3; ::: impart no information on t, (*) simply says that (**) Given that t 2 [0; r], there exists possibility to recover the mean t of N (t; "2) random variable from a single realization of this variable with the variance of the error, uniformly in t 2 [0; r], less than (rr+2"") . 2 2
2
Formal reasoning corresponding to our \impart no information" arguments is as follows: passing from the estimate tb(y) to the estimate n
o
te(y1 ) = E2 ;3 ;::: tb(y1 ; 2 ; 3 ; :::) ;
we may only improve the variance of recovering t from observations (7.11) and get an estimate which depends on y1 only.
contained, let us reproduce the corresponding reasoning for the simplest case we are interested in. Let tb(t + " ) be an estimate of t 2 [0; r] via noisy observation t + of t, N (0; 1), and let Z n o (a) Æ (t) = E t tb(t + " ) = (t tb(s))p(t; s)ds; p p(t;s) = (" 2 ) 1 expfZ (s t)2 =(2"2 )g; (7.12) h i2 (b) 2 (t) = E tb(t + " ) t = (tb(s) t)2 p(t; s)ds be the expectation and the variance of the estimation error; we are interested to bound from below the quantity
2 sup 2 (t): 0tr
In this bounding, we may assume that the estimate tb takes its values in [0; r] (cf. the above \truncation" reasoning). When tb is bounded, the bias Æ (t) is continuously dierentiable in [0; r], and from (7.12.a) we get
Æ 0 (t) = 1 = 1
1
= 1 = 1 1
Z
(t tb(s))p0t (t; s)ds
2
3
p0 (t; s) 5 (t tb(s)) p(t; s) 4 qt ds p(t; s) ! Z 1=2 Z (p0t (t; s))2 1=2 2 b ds [Cauchy's inequality] (t t(s)) p(t; s)ds ( t; s ) p Z 1=2 [direct computation] " 1 (t tb(s))2 p(t; s)ds " 1 (t) " 1 Z
q
Integrating the resulting inequality from t = 0 to t = r and taking into account that jÆ(t)j (t) , we get 2 Æ (r) whence
2
Æ (0) r(1 " 1 );
r" 2 r + 2"
so that (**) indeed is impossible. The lower bound on local minimax risk of estimating smooth functionals stated by Theorem 7.1.1 motivates the following de nition of an asymptotically eÆcient estimation method:
De nition 7.1.1 Let H be a convex family of signals and F : ! R be a functional such that for every f 2 there exists a vector F 0 (f ) 2 H : lim t!+0
F (f + t(g
t
f )) F (f )
= (F 0 (f ); g
f)
8f; g 2 :
uniquely de ned by F; f . An estimation method fFb"() 2 Fg">0 is called asymptotically eÆcient on , if
limsup sup " "!+0 f 2
2
h
E
F (y f;") b
i2
F (f )
k F 0(f ) k2
0:
(7.13)
E.g., we have seen that a continuous linear functional F (f ) = (f; ) admits asymptotically eÆcient, on the entire H , estimation method. Such a functional is a simplest { linear { polynomial on H . We shall see in a while that a polynomial of a degree > 1 also can be estimated in an asymptotically eÆcient, on every bounded subset of H , fashion, provided that the polynomial is a Hilbert-Schmidt one. On the other hand, it turns out that already the function F (f ) =k f k2 cannot be estimated in an asymptotically eÆcient fashion on the entire unit ball. Thus, in order to be able to build asymptotically eÆcient estimates of \smooth", but not \very smooth" functionals, we should restrict the class of signals to be \not too massive", similarly to what we did when recovering the signals themselves. A very convenient way to control the \massiveness" of is to impose restrictions on the Kolmogorov diameters dk ():
De nition 7.1.2 Let H and m be a positive integer. We say that the mdimensional Kolmogorov diameter dm () of is Æ , if there exists an m-dimensional linear subspace Hm H such that
8f 2 :
dist(f; Hm ) fmin 0 2H
k f f 0 k Æ: m
2)
In what follows, we impose on restrictions like
dm () Lm ; m m0
[ > 0];
(7.14)
i.e., say at which rate the \non-parametric" set of signals can be approximated by \m-parametric" sets { by the projections of on appropriately chosen m-dimensional subspaces of H . E.g., if is an ellipsoid = ff
1 X
(f; i )2 2 i=1 ai
2Hj
L2 g
[a1 a2 :::; ai ! 0; i ! 1] ;
(7.15)
then one clearly has
dm () Lam+1 ; m = 1; 2; ::: In particular, the Kolmogorov diameters of the \periodic part" of a Sobolev ball Sk;1 2 (L) (same as the diameters of the ball itself) decrease as m k (cf. (6.2)):
dm Sk;1 2 (L)
ck Lm k ; k = m + 1; m + 2; :::
Thus, (7.14) in typical applications is an a priori restriction on the smoothness of signals we deal with. 2) The \canonical" de nition of the Kolmogorov diameters deals with aÆne rather than linear subspaces of H ; note, however, that if there is an aÆne m-dimensional subspace H 0 of H such that dist(f; H 0 ) Æ 8f 2 , there exists (m +1)-dimensional linear subspace of H with the same property; thus, \up to shift by 1 in the dimension" (absolutely unimportant in what follows), we may speak about approximation by linear, rather than aÆne, subspaces.
lations between the \degree of smoothness" of a functional F to be estimated and the \asymptotical width" of (i.e., the value of in (7.15)) in order for F to admit an asymptotically eÆcient, on , estimation method, and (b) how to build an asymptotically eÆcient estimation method, provided that it exists.
To get a kind of preliminary orientation, let us start with the simplest case of a once continuously dierentiable functional.
7.2 The case of once continuously dierentiable functional Consider the problem of estimating a functional F on a set of signals and assume that
A.1. is a bounded subset of H , and the Kolmogorov diameters of satisfy (7.14) with certain a priori known , L. For the sake of de niteness, assume that is contained in the unit ball
O = ff j k f k 1g of H .
A.2. The functional F to be estimated is de ned in the ball
O2 = ff j k f k< 1 + 2g [ > 0]; is continuously Frechet dierentiable in O2 , and its derivative F 0 () is Holder continuous in O2 with exponent > 0 and constant L: 8f; g 2 O2 : k F 0(f ) F 0(g) k L k f g k : (7.16) E.g., the Gateau functional (7.3) satis es A.2, provided that the integrand G(x; t) is continuously dierentiable in t for almost all x 2 [0; 1], is measurable in x for every t and
G(; 0) G0t (x; 0) 0 0 k Gt (; ) Gt(; 0) k1
2 L1 [0; 1]; 2 L2 [0; 1]; C max [j 0j; j 0 j ] 8; 0 2 R
Similarly, the Gateau functional (7.4) satis es A.2, provided that
G(x1 ; :::; xk ; t1 ; :::; tk ) = G(x; t) is continuously dierentiable in t for almost all x, is measurable in x for all t and
G(; 0) G0t(; 0) k G0t(; ) G0t(; 0) k1
2 L1 ([0; 1]k ); 2 L2 ([0; 1]k ); C max [j 0 j; j 0 j ] 8; 0 2 Rk :
Theorem 7.2.1 Assume that A.1, A.2 are satis ed and that the parameters and
are linked by the inequality 1 (7.17)
> : 2 Then F admits asymptotically eÆcient on estimation method. Proof. We build explicitly the corresponding estimation method. The idea of the construction is quite transparent. Given noise intensity " > 0, we choose an appropriate m = m("), nd an m-dimensional linear subspace Hm such that dist(f; Hm ) Lm , and build the associated projection estimate fbm of f . After fbm is built, we approximate F by the rst order Taylor expansion of F at fbm : F (f ) F (fbm ) + (F 0 (fbm ); f
f m );
f m being the projection of f onto Hm , and estimate the linear part of this expansion as a linear functional { just substituting, instead of f f m , the observation of this vector. A nice feature of this scheme is that the noises aecting the observation of f f m are independent of those aecting the estimate fbm , which allows for easy evaluation of the risk.
The construction is as follows.
10 . We rst choose the \order" m = m(") of the estimate fbm from a very natural desire to get an optimal in order nonparametric estimate of f 2 . To understand what is this order, we use the quite familiar to us reasoning as follows. For a given m, we build an m-dimensional subspace Hm in H such that the norm of the projection f ? of f 2 on the orthogonal complement to Hm (this norm is nothing but dist(f; Hm )) is guaranteed to be Lm , build an orthonormal basis h1 ; :::; hm in Hm and de ne fbm as m X fbm = zi hi ; i=1
where zi = (f; hi ) + "i , f g are independent N (0; 1) random variables, are the estimates of theo projections of f onto hi given by observation y f;". The squared risk n E k f fbm k2 clearly can be bounded as
i mi=1
n
o
E k f fbm k2 m"2 + k f ? k2 m"2 + L2 m 2 ;
(7.18)
and to get an optimal in order estimate, we should balance the stochastic term m"2 and the deterministic term L2 m 2 , i.e., to set
m = m(") = b"
2 2 +1
c
3):
(7.19)
After our choice of m(") is speci ed, we may assume { just to save notation { that Hm is simply the linear span of the rst basic orths 1 ; :::; m of the basis where the 3) In order to avoid messy expressions, in what follows we do not opimize the choice of parameters with respect to the constant L involved in A.1, A.2.
{ with xed Hm(") ) and are in our right to use whatever orthonormal basis we want, updating the observations (without any change in their structure) accordingly. 20 . Let f 2 be the observed signal, f m be its projection on the subspace spanned by the rst m = m(") basic orths i , and fbm be the corresponding projection estimate of f : m X fbm = f m + " i i : i=1
In order to implement the outlined approximation-based scheme, we should ensure that the \preliminary estimate" we use belongs to the domain of the functional F , which is not the case for some realizations of fbm . This is, however, a minor diÆculty: since f 2 O, we only improve the quality k f k of our estimate by projecting fbm on O { by passing from fbm to the estimate (
b fem = fm ; (1 + ) k fbm k 1 fbm ;
k fbm k 1 + : k fbm k> 1 +
The estimate fem is the one we actually use in the above approximation scheme. Important for us properties of the estimate can be summarized as follows: (a) For a given f 2 , the estimate fem depends on the collection m = fi gm i=1 of 1 the observations noises in (7.1) and does not depend on the sequence m+1 = fig1i=m+1 of the \remaining" noises; (b) We have (from now on all C 's stand for dierent positive quantities depending only on F and and independent of " and of a particular choice of f 2 ): (a) (b) (c) (d)
k fem k 1 + k fem f m k k fbm f m k k f f m ok Cmn (") C" o n E k f m fem k2 = Em k f m fem k2 C"2m(") C" 2 2 +1
(7.20)
4 2 +1
Note that (d) is given by (7.19) and the fact that k fem 30 . The estimate of F (f ) we arrive at is 0
Fb" Fb" (y f;") = F (fem ) + @F 0 (fem );
1 X i=m+1
f m kk fbm
f m k.
1
yi i A ; yi = yif;"
(7.21)
(of course, fem depends on observations yif;", i = 1; :::; m, and m = m("); to save notation, we omit explicit indication of these dependencies).
Accuracy analysis. To evaluate the accuracy of the estimate we have built, let R = F (f ) F (f m ) (F 0 (f m ); f f m ); 1 P = " i [F 0 (fem )]i [for g 2 H , gi = (g; i)] i=m+1
h
i
F (f ) Fb" = [F (f nm ) + (F 0 (f m ); f o f m )n+ R] F (fem ) + (F 0 (fem ); fo f m ) + = R + F (f m ) F (fem ) 1 + (F 0 (f m ) F 0 (fem ); f f m ) 2 (7.22) m Observe that R is deterministic, f g1 , f g2 depend only on , while the conditional 1 P expectations, m being xed, of and 2 are, respectively, 0 and "2 [F 0 (fem )]2i . i=m+1 Consequently, h
E F (f ) F" b
i2
E
=
h
+"2
R + F (f m )
E
(
1 P
i=m+1
F (f ) + (F 0 (f m ) e m )
F 0 (fem ); f
[F 0 (fem )]2i : (7.23)
We claim that the following facts hold true: A)
i2 f m)
jRj o(1)"
(7.24) From now on, o(1)'s stand for deterministic functions of " (independent of a particular choice of f 2 ) tending to 0 as " ! +0.
B)
E
C)
n
[F (f m )
F (fm e
)]2
o
m X 2 " [F 0 (f )]2i + "2 o(1) i=1
n
o
E [(F 0(f m) F 0(fem ); f f m)]2 "2o(1)
D)
"2 E
X 1
i=m+1
[F 0 (fem )]2i
"2
1 X i=m+1
[F 0 (f )]2i + "2 o(1)
(7.25) (7.26) (7.27)
Note that (7.23) combined with A) { D) clearly implies that h
E F (f ) F" b
i2
"2 k F 0(f ) k2 +"2o(1) 8f 2 ;
i.e., that the estimate we have built is asymptotically eÆcient on . Thus, all we need is to verify A) { D) Verifying A) We have
jRj C k f f m k1+ C [m (")] (1+ ) C" 2 (1+ ) 2 +1
= "o(1)
as required in (7.24).
[by A.2] [by A.1] [by (7.19)] [by (7.17)]
h
i
h
F (f m ) F (fem ) = Fh(f m ) + (F 0 (f m ); femi f m ) F (fem ) + (F 0 (f m ); f m + (F 0 (f m ); fbm fem ) ; and in order to verify (7.25) it suÆces to demonstrate that h
i
fbm )
(7.28)
i
E F (f m) + (F 0(f m); fem f m) F (fem) 2 "2o(1) h i m (b) E (F 0(f m); f m fbm) 2 "2 iP [F 0 (f )]2i + "2 o(1) =1 h i2 0 m 2 b e (c) E (F (f ); fm fm) " o(1) (a)
(7.29)
(7.29.a): We have
E E
h
F (f m ) + (F 0 (f m ); fem n o C f m fem 2(1+ )
k
n
E Ck (
k
fm
fm k b
m P
2(1+ )
E C "2(1+ ) i=1 i2 C ["2 m( ")]1+ C"
f m)
o
1+ )
[since fbm
F (fm ) e
i2
[by A.2] [by (7.20.b)]
fm = "
m P i=1
i ]
4 (1+ ) 2 +1
"2o(1)
[by (7.19)] [by (7.17)]
h
=
i2
E fm ) ( 2 ) m P 2 0 m E " i=1[F (f )]i i
= "2 = "2
(F 0 (f m ); f m
m P
b
[since fbm
fm = "
[F 0 (f m )]2i
i=1 m P i=1
[[F 0 (f )]i + Æi ]2 [Æi = [F 0 (f m )]i m
m
m P i=1
i ]
[F 0 (f )]i ]
P "2(1 + ) iP [F 0 (f )]2i + "2 (1 + 1 ) Æi2 8 > 0 =1 i=1 [since (a + b)2 (1 + )a2 + (1 + 1 )b2 ] m "2(1 + ) iP [F 0 (f )]2i =1 +C"2 (1 + 1 ) k f f m k2
m P "2 (1 + ) [F 0 (f )]2i i=1
[by A.2]
4 + C"2 (1 + 1 )" 2 +1
[by (7.20.c)]
m P
"2(1 + o(1)) i=1[F 0(f )]2i + "2o(1)
m P "2 [F 0 (f )]2i i=1
+ "2 o(1)
[set = "
2 2 +1
]
[by A.2]
(7.29.c): Observe rst that for every q 1 one has (
E k
m X i=1
i i k
2q
)
C (q)mq :
(7.30)
Consequently, for every q 1 and every 1 it holds n
o
E k fem fbm kq n o =2 1=2 E knfem fbm k2q 1o Probffbm 6= fem g 2q E k fbm f m k2q 1=2 Probfk fbm f m k> g 1=2 [since k fem f m kk fbm f m k] 1=2 m P 2 q= 2 C1(q)[" m] Probfk i=1 ii k ="g
[by (7.30)]
C (q; )["2m]q=2+ m P [since E k i i k4 C (4)m2 by (7.30) i=1 m P and therefore Probfk i i k> ="g C ()["2 m]2 ] i=1 Thus, we get
n
o
8q; 1 : E k fem fbm kq C (q; )["2m(")]q=2+ :
(7.31)
E
h
i2
(F 0 (f m ); fbm fem ) n o C E k fbm fem k2 [by A.2] 2 1+ C ()[" m(")] 8 1 [by (7.31)] C ()" 8 1 [by (7.19)] 2 " o(1) [choose appropriately] 4 (1+ ) 2 +1
Verifying C) We have n
o
E n[(F 0(f m) F 0(fem); f f m)]2 o E nk F 0(f m ) F 0(fem ) k2k f o f m k2 E C k f m fem k2 m 2 (") C" m 2 (") C"2" 4 2 +1
4 2 2 +1
= "2 o(1)
[by A.2 and (7.20.c)] [by (7.20.d) and since 1] [by (7.19)] [by (7.17)]
Verifying D) We have
"2 E
X 1 8i=m+1 <
[F 0 (fem )]2i 1 X
"2E :(1 + )
i=m+1
"2 (1 + )
1 X
i=m+1 n
"2 (1 + )
1 X
i=m+1
1 X
i=m+1
"2(1 + o(1)) = "2
1 X i=m+1
1 X
[F 0 (f )]2i + (1 + 1 )
[F 0 (f )]2i
+(1 + 1 )"2 E
"2(1 + )
i=m+1
Æi2 ; 8 > 0
[Æi = [F 0 (f )]i
[F 0 (fem )]i ; cf. veri caton of (7.29.b)]
k F 0(f ) F 0(fem) k2
o
[F 0 (f )]2i + (1 + 1 )"2 E
n
[F 0 (f )]2i + C"2 (1 + 1 )"
1 X
9 =
k f fem k2
o
4 2 +1
[by (7.20.c; d) and since 1] [F 0 (f )]2i + "2 o(1)
i=m+1
[F 0 (f )]2i + "2 o(1)
[set = "
2 2 +1
]
[by A.2]
The proof of Theorem 7.2.1 is completed.
7.2.1 Whether condition (7.2.2) is sharp? We have seen that if the \asymptotical width of " (see A.1) and the \degree of smoothness of F " are \properly linked", namely, > 21 (see (7.17)), then F admits
the condition (7.17) is \de nitive" (i.e., if it is violated, then it may happen that F admits no asymptotically eÆcient estimation on ), or it is an \artifact" coming from the particular estimation method we were dealing with. It turns out that (7.17) indeed is \de nitive":
Theorem 7.2.2 Let > 0, 2 (0; 1] be such that
<
1 : 2
(7.32)
Then there exist O satisfying A.1 and a functional F : H ! R satisfying A.2 on the entire space H such that F does not admit asymptotically eÆcient estimation on .
Proof. Let us set
= ff
2Hj
1 X i=1
i2 (f; i )2 1g;
(7.33)
so that clearly satis es A.1 (one can choose as Hm the linear span of the rst m basic orths 1 ; :::; m ). We are about to build a functional F which satis es A.2 and does not admit asymptotically eÆcient estimation on .
The idea. Assume we are given a noise intensity " > 0. Let us choose somehow k = k(") and K = K (") = 2k(") distinct elements f0 ; :::; fK 1 2 such that for appropriately chosen = (") it holds: (i) (ii)
k fi k= 8 8i; k fi fj k> 2 whenever i 6= j .
Let (f ) be a once for ever xed smooth function on H which is equal to 1 at the point f = 0 and is zero outside the unit ball, e.g., (f ) = (k f
k2);
(7.34)
where is a C1 function on the axis which is 1 at the origin and vanishes outside [ 1; 1]. Given an arbitrary collection ! = f!i 2 f 1; 1ggKi=0 1, let us associate with it the functional ! (f ) =
KX1 i=0
!i i (f ); i (f ) = 1+ ( 1 (f
fi )):
(7.35)
The structure of the functional is very transparent: every fi is associated with the term !i i (f ) in ; this term vanishes outside the centered at fi ball of radius and is equal to !i 1+ at the center fi of this ball. Due to the origin of , the supports of distinct terms have no points in common, so that ! (fi ) = !i 1+ i = 0; :::; K
1:
(7.36)
disjoint it is immediately seen that ! is C1 on H and
k 0! (f )
0! (g ) k C k f
g k 8f; g 2 H
(7.37)
with C depending on only. We are about to demonstrate that with properly chosen k("); ("), at least one of the 2K functionals ! () corresponding to all 2K collections of !i = 1 is \diÆcult to evaluate" already on the set F" = ff0 ; :::; fK 1g, provided that the intensity of noises in (7.1) is ". Namely, there exists a functional in the family such that no estimate Fb" is able to recover its values on F" with squared risk "2 Æ , Æ > 0 being chosen appropriately. After this central fact will be established, we shall combine the \diÆcult to estimate" functionals corresponding to dierent values of the noise intensity " in a single functional which is impossible to evaluate in an asymptotically eÆcient (even in an order-eÆcient) way. In order to prove that there exists a \diÆcult to estimate" functional of the type ! , assume, on contrary, that all these functionals are easy to estimate. Note that we can \encode" a signal f 2 F" = ff0 ; :::; fK 1g by the values of k = log2 K functionals from our family, namely, as follows. Let I` , ` = 1; :::; k, be the set of indices i = 0; :::; K 1 = 2k 1 such that the `-th binary digit in the binary representation of i is 1, and let ` () be the functional ! () corresponding to the following choice of ! :
I` !i = 1;1; ii 2 62 I`
In other words, the value of the functional ` () at fi \says" what is the `-th binary digit of i: if it is 1, then ` (fi ) = 1+ , and if it is 0, then ` (fi ) = 1+ . It follows that the collection of values of k functionals 1 ; 2 ; :::; k at every f 2 F" allows to identify f . Now, if all k functionals ` , ` = 1; :::; k, are \easy to estimate" via observations (7.1), we can use their \good" estimates in order to recover a signal f (known to belong to F") from observations (7.1), since the collection of values of our functionals at f 2 F" identi es f . On the other hand, we know from the Fano inequality what in fact are our abilities to recover signals from F" from observations (7.1); if these \actual abilities" are weaker than those oered by the outlined recovering routine, we may be sure that the \starting point" in developing this routine { the assumption that every one of the functionals ` , ` = 1; :::; k, is easy to estimate on F" { is false, so that one of these functionals is diÆcult to estimate, and this is exactly what we need.
The implementation of the above plan is as follows. 10 . Let us x 0 such that (a) 0 < (b) 2 < 1 + 2 0
2
(7.38)
(this is possible since 2 < 1). 20 . Let us x " > 0 and set
k = k(") = b"
2 1+2
0
c:
(7.39)
010
001 000
011
111
100 101
110
Figure 7.1: Three functionals \encoding" 8 = 23 signals.
30 . The set given by (7.33) contains the centered at the origin k-dimensional disk of the radius r = k . Since m-dimensional unit sphere contains a set of 2m points with pairwise distances at least 1=44) , we conclude that for 1 = (") = k (") (7.40) 8 there exist K = 2k signals fi 2 satisfying conditions (i) and (ii) from the previous item. Let ` , ` = 1; :::; k, be the functionals associated with F" = ff0 ; :::; fK 1 g by the construction from the previous item. Let
Æk (") = `=1 max inf i=0max ;:::;K ;:::;k Fb"
h
E Fb"(yfi;") 1
` (fi )
i2
Our central auxiliary results is as follows:
Lemma 7.2.1 For all small enough values of " one has Æk (")
1 2+2 (") C" 128
0
4 (1+ ) 2 +1
(7.41)
with positive C > 0.
Proof. Assume that (7.41) does not hold, so that Æk (") <
1 2+2 (") 128
(7.42)
Let Fb"` , ` = 1; :::; k be estimates such that h
i2
E Fb"`(yfi;")
`(fi )
2Æk ("); ` = 1; :::; k; i = 0; :::; K
1:
Let
(7.43)
m = b10 ln kc: Consider K = 2k hypotheses Hi on the distribution of a sample Y of m observations y (1) ; :::; y (m) ; hypotheses Hi states that Y is a sample of m independent observations (7.1) associated with the signal fi . Let us look at the following procedure for distinguishing between these hypotheses: 4 To see this, note that if X = fxi gNi=1 is the maximal subset of the unit sphere in Rm such that the pairwise distances between the points of the set are > 1=4, then N \spherical hats" fx 2 Rm j k x k= 1; k x xi k 14 g cover the entire sphere. On the other hand, the ratio of the \area" of such a hat and the one of the sphere is 2 arcsin(1 Z =8)
sin
m
2
(s)ds
0
=2 Z
2
(m
tm
=
sinm 2 (s)ds m
0 2 (2 arcsin(1=8)) 1) sin cos(2 arcsin(1=8))
so that N 2m for m 7.
sin(2 arcsin(1 =8)) Z
2
2
(1
t2
)
0
Z1
m; m
0
tm
7;
2
(1
t2
)
1=2
dt
1=2
dt
Fb"` (y (j )). If more than one half of these reals are positive, we set b` = 1, otherwise we set b` = 0. After b1 ,..., bk are built, we treat them as the binary digits of (uniquely de ned) integer i, 0 i 2k 1 and claim that Y is given by the hypotheses Hi .
Let us evaluate the probability to reject a particular hypotheses Hi when it is true. If for every ` = 1; :::; k in the sequence fF`j gm j =1 more than one half of the entries are ` of the same sign as (fi ), then b` will be exactly the `th binary digit b` (i) of i, and the hypotheses Hi will be accepted. Thus, if Hi is not accepted, it means that there exists ` such that among the entries of the sequence fF`j gm j =1 at least one half is of the ` sign opposite to that one of (fi ). The probability that it is the case for a particular value of ` is at most the probability that in a sequence of m independent identically distributed random variables j = Fb"` (y (j )) ` (fi ) at least one half of the elements is in absolute value 1+ ("). On the other hand, by (7.43) we have n
o
E j2 2Æk ("); whence
n
Prob jj j 1+ (")
o
(see (7.42)), so that ProbHi fb` 6= b` (i)g It follows that
X
m=2j m
q
2Æ (")
1+ k(") < 81
Cjm (1=8)j (7=8)m
ProbHi f9` k : b` 6= b` (i)g k2
m
j
2 m:
41
(we have taken into account the origin of m). Thus, for every i = 0; :::; K 1 the probability to reject the hypotheses Hi when it is true is at most 1=4. On the other hand, the pairwise Kullback distances between the distributions of y (j ) associated with hypotheses H0 ; :::; HK 1 clearly do not exceed 2 128k K = 21"2 i;j=0max k f f k i j ;:::;K 1 "2
2
(we have taken into account property (i) from the previous item). Applying the Fano inequality (1.27) and recalling that m 10 ln k(") and K = 2k(") , we get 1280k
2 (") ln k (")
"2
41 ln(2k(")
1)
ln 2;
In view of (7.39) and (7.38.a), the concluding inequality fails to be true for all small enough " > 0. 40 . We have seen that for all small enough values of " > 0 there exist functionals (") with the following properties: A) (") is continuously dierentiable on the entire H , and the derivative of the functional is Holder continuous with exponent and constant independent of ";
9(")g of the sphere ff
j k f k= 8(")g, where 1 b" 8
(") =
2
0 +1 2
c
C) There exists F" \ U" such that inf sup E
h
Fb" f 2F"
F" b
(y f;")
i2 (") (f )
;
C"
0
4 (1+ ) 2 +1
with some positive C independent of ". Note that property C) clearly is preserved under arbitrary modi cation of (") which does not vary the functional in U" . Now let us choose a decreasing sequence of positive reals "i which converges to 0 so fast that the \outer boundary" of U"i is inside the \inner boundary" of U"i (see B)), and let us set 1 X (f ) = ("i ) (f ); +1
i=1
note that is well-de ned, since at every point f at most one of the terms of the right hand side series diers from 0. Moreover, from A) combined with the fact that fU"i g are mutually disjoint it follows that satis es A.2. We claim that the functional cannot be evaluated "2 -consistently on , which is immediate: since coincides with ("i ) in U"i , from C) and the remark accompanying this statement it follows that inf sup "i Fb"i f 2
2
h
E F"i b
(y f;"i )
i2
(f )
0 1+2 2 2 22 0 +1 C"i
! 1; i ! 1
(see (7.38.b)), as claimed.
7.3 Increasing smoothness of F As we have seen, the sharp link between the \asymptotical width" of the set of signals and the \degree of smoothness" of the functional F we intend to estimate in an asymptotically eÆcient on fashion is given by the inequality > 21 . It follows that the \wider" is (the less is in A.1), the more smooth should be F . Note that the outlined tradeo is possible in a restricted range of values of only: since
1, the \width" parameter should be > 1=2. If we are interested to work with \wider" signal sets { those satisfying A.1 with 1=2 { we should impose stronger requirements on the degree of smoothness of F and switch from the estimates based on the rst-order approximation of F to those based on higher-order approximations. The general scheme of the associated estimates is quite transparent: in order to estimate the value of a k 1 times dierentiable functional F via observations (7.1) of the argument f , we choose somehow m = m("), " being the noise intensity, build an orthonormal basis where signals from can be approximated as tight as possible by their order m projection estimates fbm and write
F (f )
k X
1 ` b D F (fm )[f `=0 `!
f m ]` ;
(7.44)
D` F (f )[h
@` F (f + t1 h1 + ::: + t` h` ) 1 ; :::; h` ] = @t1 @t2 :::@t` t=0
is the value of `-th dierential of F taken at f along the set of directions h1 ; :::; h`, and D`F (f )[h]` = D` F (f )[h; :::; h] is the `-th derivative of F taken at f in a direction h. In order to estimate F (f ), we use the observations of the rst m coordinates of f in our basis to build fbm and therefore { to build the polynomials of f f m in the right hand side of (7.44). After these polynomials are built, we use the observations of the remaining coordinates of f (i.e., those of the coordinates of f f m ) in order to estimate the right hand side in (7.44). Note that the estimate we dealt with in the previous section is given by the outlined construction as applied with k = 1. As we shall see in a while, passing from rst-order local approximations of F to higher-order approximations allows to get \sharp" tradeo between the \asymptotical width" of and the degree of smoothness of F in the entire range > 0 of the values of the width parameter. However, the implementation of this scheme heavily depends on whether k 2 or k 3; in the second case, a completely new curious phenomenon occurs. We postpone the case of k 3 till the next chapter, and are about to complete the current one with considering the case of k = 2 (which is quite similar to the case of k = 1 we are already acquainted with).
7.3.1 The case of twice continuously dierentiable functional We are about to replace the assumption A.2 with
A.3. The functional F to be estimated is de ned in the ball
O2 = ff j k f k< 1 + 2g [ > 0]; is twice continuously Frechet dierentiable in O2 , and its second derivative F 00 ()
(which is a symmetric bounded linear operator on H ) is Holder continuous in O2 with exponent > 0 and constant L:
8f; g 2 O2 : k F 00(f ) F 00(g) k L k f g k ; (7.45) here for a bounded linear operator A on H k A k is the operator norm of A: k A k= supfk Ah kj k h k 1g: Note that the Gateau functional (7.3) with twice dierentiable in f integrand G(x; t) does not satisfy A.3, except the case when G(x; t) is quadratic in t for almost all x and this integrand de nes a continuous quadratic form on H (to this end G(; 0) should belong to L1 [0; 1], G0t (; 0) should belong to L2 [0; 1] and G00ff (; 0) should belong to L1 [0; 1]). Similarly, in order to satisfy A.3, the Gateau functional (7.4) should have quadratic with respect to every ti integrand
interesting example of this type is a \homogeneous Gateau polynomial"
F (f ) =
Z 1
0
:::
Z 1
0
G(x1 ; :::; xk )f (x1 ):::f (xk )dx1 :::dxk
(7.46)
with square summable kernel G(x1 ; :::; xk ). We are about to prove the following extension of Theorem 7.2.1:
Theorem 7.3.1 Let assumptions A.1, A.3 be satis ed, and let
>
1 2
1:
(7.47)
Then F admits an asymptotically eÆcient on estimation method.
Proof. Let us build the estimation method as follows. Setup. Given noise intensity " < 0:1, we set m = m(") = M = M (") =
b" c; b " ln(11 =") c; 2 2 +1
(7.48)
2
note that M > 2m, provided that " is small enough (as it is assumed in the sequel). According to A.1, we may nd m-dimensional and (M m)-dimensional subspaces Hm , HM m in H in such a way that dist(f; Hm ) Cm ; dist(f; HM
m)
CM
8f 2
(as above, C 's stand for positive quantities depending on the data in A.1, A.3 only and independent of " and of a particular choice of f 2 ). It follows that we may choose an orthonormal basis in H in such a way that Hm is the linear span of the rst m vectors of the basis, while Hm + HM m is contained in the linear span of the rst M vectors from the basis; without loss of generality, we may assume that this basis is ` our original basis fi g1 i=1 . Denoting by f the projection of f 2 H on the linear span of the rst ` vectors of the basis, we therefore get
k f f ` k C` ;
` = m and ` = M:
(7.49)
Now, by A.1 the closure of is a compact set, and since by A.3 F 0 is Lipschitz continuous on cl , the image of cl under the mapping f 7! F 0 (f ) also is a compact set. Consequently, the quantities
k f f N k; k F 0(f ) [F 0(f )]N k converge to 0 as N ! 1 uniformly in f 2 . Since by A.3 both F
Lipschitz continuous on , there exists N = N (") > M (") such that
8f 2 : k F (f ) F (f N ) k "4; k F 0(f ) [F 0(f N )]N k "4:
and F 0 are (7.50)
1) We use the observations yif;", i m("), to build the projection estimate
fbm =
m X f;"
m X
i=1
i=1
yi i = f m + "
i i
(7.51)
and then \correct" it to get the estimate (
b fm = fm ; (1 + ) k fb km1 fbm ; e
k fbm k 1 + ; k fbm k> 1 +
exactly as in the construction used to prove Theorem 5.3.1; in particular, we ensure (7.20) and (7.31). 2) In what follows, f 2 , y stands for the observation y f;" and is the corresponding sequence of noises. For a pair of nonnegative integers p; q with p q we set q P fpq = (f; i )i;
pq = ypq = f1q , y1q , 1q
We write simply Our estimate is
i=p q P i=p q P
i i ; yi i = fpq + "
i=p as f q , y q , q .
q P i=p
i i = fpq + "pq ;
M P
N ) " 00 e Fb" = F (fem ) + (F 0 (fem ); ymN+1 ) + 12 (F 00 (fem )ymM+1 ; ymN+1 + yM +1 2 i=1 Fii (fm ); [m = m("); M = M ("); see (7.48); N = N ("); see (7.50)] (7.52) where Fij00 (fem ) are the entries of the matrix of the operator F 00 (fem ) in the basis fi g. 2
The origin of the estimate is as follows. It is more convenient to think that we are estimating F (f N ) rather than F (f ) { these two quantities, in view of (7.50), dier from each other by no more than "4 , while the rate of convergence we are interested to get is O("); at the same time, when estimating F (f N ), we should not bother about convergence of in nite series. Now, we have
F (f M ) + (F 0(f M ); fMN +1); F (f m) + (F 0(f m); fmM+1 ) + 21 (F 00(f m)fmM+1 ; fmM+1 ); F 0(f m) + F 00(f m)fmM+1 ;
F (f N ) F (f M ) F 0 (f M )
combining these approximations, we come to h
F(f m) + F 0(f m); fmM+1
i
+ 21 F 00 (f m )fmM+1 ; fmM+1 + F 0 (f m) + F 00 (f m )fmM+1 ; fMN+1 = F (f m ) + F 0 (f m ); fmN+1 + 21 F 00 (f m )fmM+1 ; fmN+1 + fMN +1 : (7.53) Our concluding step is to replace in the resulting approximation the value and the derivatives of F at f m with the value and derivatives at fem and
F (f N )
of suppressing the "2 -terms in the bias resulting from this substitution. There are two sources of "2 -terms in the bias: 1) When replacing F (f m ) with F (fem ), the resulting error is, approximately, 1 (F 0(f m ); fbm fm ) + (F 00 (f m )(fbm f m ); fbm f m ) 2 (recall that fem and fbm coincide with probability close to 1); a good approximation to the expectation of this error is m m "2 X "2 X Fii00 (f m ) F 00 (fem ); 2 i=1 2 i=1 ii
2) When replacing fpq with ypq , the "2 -terms in the expectation of the resulting error are the same as in the expectation of 0
1
M N N X X 1 @ 00 m X F (f ) i i ; i i A ; i i + 2 i=m+1 i=m+1 i=M +1
i.e., their sum is M M "2 X "2 X Fii00 (f m ) F 00 (fe ) 2 i=m+1 2 i=m+1 ii m
Thus, a natural way to convert approximation (7.53) into an estimate of F (f N ) is to plug in the right hand side fem instead of f m and ypq instead of fpq , subtracting simultaneously the principal term of the bias, which is M " P Fii00 (fem ); the resulting estimate is exactly (7.52). 2 2
i=1
Accuracy analysis. Note that for small enough values of ", all f 2 and all realizations of observation noise the points fem + fmN+1 and fem belong to O2 . Indeed, the latter point, by construction, belongs to O , while k fmN+1 k Cm (") by (7.49), so that k fmN+1 < , provided that " is small. Setting
we have
F" b
1 G = F (fem ) + (F 0 (fem ); fmN+1 ) + (F 00 (fem )fmN+1 ; fmN+1 ); 2
F (f N )
= G F (f
e m {z
|
A
+ fmN+1 ) + Fb" + }
+ F (fm + fmN+1 ) e
|
|
F (f N ) {z
D
(7.54)
m "2 X Fii00 (fem ) G 2 i=1 {z
B m "2 X
2
i=1
}
Fii00 (fem )
(7.55)
}
As it was already explained, A is well-de ned for small enough values of ", and in the sequel we assume that " meets this requirement.
h
E F" F (f ) b
i2
"2 k F 0(f ) k2 +"2o(1);
from now on, all o(1) stand for deterministic functions of " independent of f converging to 0 as " ! +0. In view of (7.50), in fact we should verify that h
E F" b
i2 F (f N )
2 and
"2 k F 0(f ) k2 +"2o(1);
or, which is the same in view of (7.55), that o
n
E (A + B + C )2 "2 k F 0(f ) k2 +"2o(1)
(7.56)
We claim that A)
jAj "o(1)
(7.57)
B) (a) (b)
jE fB j mgj o(1)" N E fB 2 j mg "2 i=P [F 0 (fem )]2i + "2 o(1) m+1
here E fj m g xed;
[for g 2 H , gi = (g; i)] (7.58) m is the conditional expectation, the noises = (1 ; :::; m ) being
C)
D)
n
o
E D 2 "2 8 N < X
E:
i=m+1
m X i=1 9 =
[F 0 (f N )]2i + "2 o(1)
[F 0 (fem )]2i ;
N X
[F 0 (f N )]2i + o(1):
i=m+1
(7.59)
(7.60)
E f(A + B + D)2g (1 + o(1))E f(B + D)2g + "2o(1)
(1 + o(1)) [E f
B2
(
"
(1 + o(1)) "2
Em
(
N P
(
(
(
(i=m+1 N P
(1 + o(1)) "2 Em
"2
(1 + o(1))
Em
D2
] + "2 o(1)
[since D depends on m only] )
#
[F 0 (fem )]2i + o(1) + o(1)"Em fjDjg + E fD2 g )
N P
i=m+1
m )
i=m+1
+"2 o(1)
[by A)]
g + 2E fDE fB j gg + E f g m
[F 0 (fem )]2i + o(1)" )
)
q
Em fD2g + E fD2g )
[by B)] + "2 o(1)
m P [F 0 (fem )]2i + "2 o(1) + "2 [F 0 (f N )]2i + "2 o(1) i=1
N P
"2 i=1[F 0 (f N )]2i + "2o(1) = "2 k F 0 (f ) k2 +"2 o(1)
[by C)] [by D)] [by (7.50)]
It remains to verify A) { D) Verifying A) We have
jAj
= j G F (fem + fmN+1 )j = F (fem ) + (F 0 (fem ); fmN+1 ) + 21 (F 00 (fem )fmN+1 ; fmN+1 )
C k fmN+1 k2+ Cm
(2+ ) (")
C" "o(1)
2 (2+ ) 2 +1
F (fem + fmN+1 ) [origin of G] [by A.3] [by (7.49)] [by (7.47)]
B = Fb" + "2
2
m P i=1
Fii00 (fem ) G
N + yN ) " = F (fem ) + (F 0 (fem ); ymN+1 ) + 21 (F 00 (fem )ymM+1 ; yM +1 m+1 2 1 00 N 0 N N e e e F (fm ) (F (fm ); f2m+1 ) 2 (F (fm )fm+1 ; fm+1 ) 3 M 2 X " Fii00 (fem )5 = "(F 0 (fem ); mN+1 ) + 4(F 00 (fem )mM+1 ; mM+1 ) | {z } 2 i=m+1
2
B1
|
{z
+
{z
|
{z
}
3 " 00 e BM (F (fm )fm+1 ; mN+1 ) + 2
}
{z
|
i=1
Fii00 (fem )
}
B2 1 00 e N " N ) (F (fm )fM +1 ; fMN +1 ) + (F 00 (fem )fmM+1 ; M +1 2 2
|
M P
}
" 00 e BM4 (F (fm )m+1 ; fMN +1 ) 2
|
{z
}
B6 " 00 e B5M N ) 2 00 N e + (F (fm )m+1 ; fm+1 ) + " (F (fm )mM+1 ; M +1 } | {z 2| {z } B8 B 7
(7.61) (7.58.a): Among the terms B1 { B8 in (7.61), the only one with nonzero conditional, m xed, expectation is B3 , so that jE fB j mgj = (F 00 (fem)fMN +1; fMN +1) C k fMN +1 k2 [by A.3] 2 CM (") [by (7.49)] q 4 C " ln(1=") [by (7.48)] = "o(1) [since > 1=4 by (7.47)] (7.58.b): It suÆces to demonstrate that
E
n
j
B12 m
(which is evident) and that
n
o
=
N X 2 "
[F 0 (fem )]2i
(7.62)
i=m+1
o
E B`2 j m "2o(1); ` = 2; 3; :::; 8:
(7.63)
(7.63) for ` = 2: We have
E fB22j mg
=
"4 4
8" <
M P
8" <
i=m+1 #2
E : (F 00(fem)mM+1 ; mM+1) E:
M P
Fij00 (fem )(i j
=
"4 4
=
" 00 e 2 4 i;j =m+1[Fij (fm )] (2 C"4 M
4
i;j =m+1
M P
Æij )
Æij )E f(i j
#2
Fii00 (fem ) 9 =
9 =
j m;
j m; Æij )2 g
[since k F 00 (fem ) k C by A.3, whence "2 o(1) [by (7.48)]
[Fij00 (fem )]2 C 8i]
P
j
E fB32j mg
C k fMN +1 k4 [by A.3] CM 4 (") [by (7.49)] C ("2 ln(1="))4 [by (7.49)] 2 " o(1) [since 4 > 1 due to (7.47) combined with 1]
(7.63) for ` = 4: We have
E fB42j mg "4 k F 00(fem)fmM+1 k2 C "4 k fmM+1 k2 2
[by A.3] [by (7.49)]
2
= "2 o(1)
(7.63) for ` = 5; 6; 7: completely similar to the case of ` = 4. (7.63) for ` = 8: We have n
E fB82 j mg "4E knF 00(fem )mMo+1 k2 j m C"4E k mM+1 k2
o
N ] [since mM+1 is independent of M +1 [by A.3]
= C"4 M "2o(1)
[by (7.48)]
B) is proved. Verifying C) We have D
m 2 P = F (fem + fmN+1 ) F (f N ) "2 Fii00 (fem ) i=1 = (F 0 (f N ); fbm f m ) + (F 0 (f N ); fem fbm ) |
{z
D1
}
|
{z
}
D2
+ F (fem + fmN+1 ) F (f N ) (F 0 (f N ); fem |
1 + 2
"
F 00 (f N )(fem
f m ); fem
fm
"2
{z
|
{z
f m)
D3 m X i=1
1 00 N e F (f )(fm 2
F 00 (fem )
D4
f m ); fem
n
o
}
#
ii
}
(7.64)
To establish C), it suÆces to verify that
E D12
fm
= "2
M X i=1
[F 0 (f N )]2i
(which is evident) and that n
o
E D`2 "2o(1); ` = 2; 3; 4: (7.65) for ` = 2: We have n
o
E fD22 g C E k fem fbm k2 C ()("2m("))1+ 8 1 "2o(1)
[by A.3] [by (7.31)] [choose appropriately]
(7.65)
jF (fem + fmN+1 ) F (f N )
F 00 (f N )(fem f m ); fem C k fem f m k2+ 1 2
F (f N ); fem fm j
fm
by A.3, whence n
E fD32g
o
C E nk fem f m k2(2+ ) o C E k fbm f m k2(2+ ) C ["2 m(")]2+ C" "2 o(1) 4 (2+ ) 2 +1
[by (7.20.b] [by (7.30)] [by (7.48)] [by (7.47)]
(7.65) for ` = 4: We have 2D4 = =
F 00 (f N )(fem
f m ); fem
fm
F 00 (f N )(fbm
f m ; fbm
|
m P
"2
f m ) "2
{z
D4;1
+ F 00 (f N )(fem
f m ); fem
|
+ "2 |
m X i=1
fm
(Fii00 (f N )
Fii00 (fem ))
{z
}
D4;3
Fii00 (fem )
i=1 m X i=1
Fii00 (fem ) }
{z
F 00 (f N )(fbm
f m ); fbm
D4;2
fm
}
and in order to establish (7.65) for ` = 4 it suÆces to verify that n
o
E D42; "2o(1); = 1; 2; 3:
(7.66)
(7.66) for = 1: We have n
E D42;1
o
8" < P m
= "4 E :
i;j =1
m h 4 P
Fij00 (f N )(i j i2
Æij )
#2 9 = ;
= " Fij00 (f N ) (2 Æij )E f(ij Æij )2 g i;j =1 C"4m(") P [since k F 00 (f N ) k C by A.3, whence [Fij00 (f N )]2 C 8i] j "2o(1) [by (7.48)]
n
o
E D42;2 E F 00(f N )(fem f m); fem f m
=
2 m m b f ); fm f n o C h fbm fem 2 + fbm fem i 4 C () ["2 m(")]1+ + ["2 m(")]2+
F 00 (f N )(fbm
E k C ()" "2o(1)
k
k
4 (1+ ) 2 +1
k
8
[by A.3] 1 [by (7.31)] [by (7.48)] [choose appropriately]
(7.66) for = 3: We have
E
n
D42;3
o
h
E n o C"4 E m2nk f N fem k2 o C ["2 m]2 E k f fem k2 C"4
m k F 00 (f N )
C ["2 m]2 " C" = "2 o(1)
F 00 (fem ) k
i2
4 2 +1
[by A.3] [by (7.20.c; d)] [by (7.48)] [by (7.47)]
4 (2+ ) 2 +1
C) is proved. Verifying D) We have
E
(
N P
(i=m+1
[F 0 (fem )]2i
)
N P
N E (1 + ) i=m+1[F 0(f N )]2i + (1 + 1) i=P Æi2 m+1 [Æi = [F 0 (f N ) F 0 (fem )]i ] N P
n
8 > 0
k F 0 (f N ) F 0 (fem ) k2 i=m+1 n o N P (1 + ) [F 0 (f N )]2i + (1 + 1 )C E k f N fem k2 i=m+1 (1 + )
[F 0 (f N )]2i + (1 + 1 )E
)
[by A.3] N P (1 + ) [F 0 (f N )]2i + C (1 + 1 )" i=m+1 [by (7.20.c; d)] N P [F 0 (f N )]2i + o(1) i=m+1
[set = "
2 2 +1
o
4 2 +1
]
The proof of Theorem 7.3.1 is completed.
7.3.2 Concluding remarks Sharpness of (7.3.4). Relation (7.47) establishes \sharp" link between the asymptotical width of (i.e., the parameter ) and the degree of smoothness of a functional
to prove Theorem 7.2.2 yields the following result:
Theorem 7.3.2 Let 2 (0; 1] and > 0 be such that
<
1 2
1:
Then there exist a set H and a functional F : H ! R satisfying A.1, A.3 such that F does not admit asymptotically eÆcient (even eÆcient in order) on estimation method.
The case of quadratic functional. Let F (f ) = (Af; f ), where A is a bounded symmetric operator on H , and let satisfy A.1. Consider the estimator resulting from (7.52) by letting N ! 1 and replacing fem with fbm . Tracing the proof of Theorem 7.3.1, one can see that in the case in question the squared risk of estimating F (f ) can be bounded from above as h
E F" F (f ) b
i2
"2 k F 0(f ) k2 +C ("4M + M
4 ) + "2 o(1)
(7.67)
(C is independent of " and f 2 ); here M > m(") is a \free design parameter" of the estimate5) . Assuming that 1=4 (the case of > 1=4 is covered by Theorem 7.3.1) and setting M = b" c; we get an estimate of a quadratic functional with the squared risk satisfying the relation h i2 b E F" F (f ) "2 k F 0(f ) k2 +C" + "2o(1) 4 4 +1
16 4 +1
(C is independent of f 2 and of "). We see that if the asymptotical width of is 41 , then a quadratic functional F can be estimated at points f 2 with the squared risk not exceeding " . It turns out (see [14]) that this rate of convergence is unimprovable in the minimax sense, provided that 16 4 +1
dk () ck ; k = 1; 2; ::: and that for some > 0 it holds
F (f ) k f
k2 8f 2 H:
In (7.52), M = M (") was controlled according to (7.48); the estimate, however, makes sense for other values of the parameter as well. 5)
Chapter 8 Estimating functionals, II We proceed with constructing asymptotically eÆcient, on a given compact set , estimates of a smooth functional F of a nonparametric signal f via observations n
o
y f;" = yif;" (f; i ) + "i ;
(8.1)
of the signal (fig form an orthonormal basis in the Hilbert space H where the signals live, the \noise" fi g1 i=1 is a collection of independent N (0; 1) random variables).
8.1 Preliminaries: estimating polynomials We already know that if the Kolmogorov diameters dk () admit an upper bound
dk () ck
and F is times ( = 1; 2) continuously dierentiable in a neighbourhood of functional with Holder continuous, with exponent , -th derivative, then F can be asymptotically eÆciently estimated on , provided that the asymptotic width of and the degree of smoothness of F are linked according to
+ > 1+
1 : 2
(8.2)
The \widest" we can handle corresponds to the case of = 2, = 1, where (8.2) requires from to be > 1=4. As we remember, (8.2) is sharp; thus, when interested to deal with wider { 1=4 { sets of signals, we should impose stronger smoothness restrictions on F . On the other hand, it was mentioned in Section 7.3.2 that if
dk () = O(k ); < 1=4; then some quadratic functionals { e.g., F (f ) =k f k2 { cannot be estimated on with uniform squared risk of order "2 . Since a quadratic functional is \as smooth as a functional can be", we conclude that merely increasing the number of derivatives F is assumed to possess does not help; we should impose certain structural restrictions on these derivatives. In order to understand what these restrictions could be, note that if we are planning to build asymptotically eÆcient, on a \wide" set , estimators of F and the estimators we intend to construct are based on local approximations 177
asymptotically eÆcient, on , estimation. And since we intend to work with \wider and wider" sets of signals { i.e., with approaching 0 { the above polynomials should admit asymptotically eÆcient estimation on the entire space H (or at least on any bounded subset of H ). Indeed, if our \structural restrictions" on the derivatives of F are such that "2 -consistent estimation of, say, the Taylor polynomial of degree 5 of F already imposes a nontrivial restriction on the asymptotical width of , we have no hope to work successfully with too close to 0. Now, there is a very natural family of polynomials on H admitting asymptotically eÆcient estimation on all bounded subsets of H { the Hilbert-Schmidt polynomials, and this is the family we will work with.
8.1.1 Hilbert-Schmidt polynomials Recall that a homogeneous polynomial of degree k on H is a function
P (f ) = k [ f; :::; f ]; | {z } k times where k [f1 ; :::; fk ] is a symmetric k-linear continuous form on H . Given an orthonormal basis fi g in H , we may associate with k (and therefore { with P ()) the system of coeÆcients fP = k [ ; :::; k ]g2Ik , where Ik is the set of all k-dimensional multiindices = (1 ; :::; k ) with positive integer entries. We clearly have 1
P (f ) = Nlim !1
X
:p N; p=1;:::;k
P f :::fk [fi = (f; i )]: 1
A homogeneous polynomial P (f ) of degree k is called a Hilbert-Schmidt polynomial, if sX P2 < 1; k P k2 2Ik
k P k2 is called the Hilbert-Schmidt norm of P .
It can be proved that the HilbertSchmidt norm is independent of the (orthonormal) basis with respect to which the coeÆcients of P are taken. A generic example of a Hilbert-Schmidt polynomial is the Gateau polynomial
F (f ) =
Z 1
0
:::
Z 1
0
G(x1 ; :::; xk )f (x1 ):::f (xk )dx1 :::dxk
on L2 [0; 1] with square-summable kernel G; the Hilbert-Schmidt norm of this polynomial is just the L2 -norm of the kernel. A non-homogeneous polynomial P of degree k is a sum of homogeneous polynomials P p of degrees 0 (a constant), 1,..., k:
P (f ) =
k X p=0
P p(f ):
P is called a Hilbert-Schmidt polynomial, if its homogeneous components P 1 ; :::; P k are so.
Let $P$ be a Hilbert-Schmidt polynomial of degree $k$ on $H$. We are about to demonstrate that such a polynomial admits an asymptotically efficient, on every bounded subset of $H$, estimate. Let us fix an orthonormal basis $\{\phi_i\}$, and let

$$f^N = \sum_{i=1}^N (f,\phi_i)\phi_i$$

be the projection of $f\in H$ onto the linear span of the first $N$ basic orths. Let also

$$P_N(f) = P(f^N).$$

We start with building an estimator of $P_N$ via observations (8.1). Note that $P_N$ is a polynomial of $N$ real variables and therefore it can be naturally extended onto the complexification $\mathbf{C}^N$ of $\mathbf{R}^N$. Let $\eta^N$ be a random $N$-dimensional Gaussian vector with zero mean and unit covariance matrix. For $z\in\mathbf{C}^N$, let

$$\widehat{P}_N(z) = E\left\{P_N(z + i\varepsilon\eta^N)\right\},$$

$i$ being the imaginary unit. Setting

$$y^N = y^N(f,\varepsilon) = \sum_{i=1}^N y_i^{f,\varepsilon}\phi_i = f^N + \varepsilon\xi^N,\qquad \xi^N = \sum_{i=1}^N\xi_i\phi_i,$$

consider the estimator

$$\widetilde{P}_N = \widehat{P}_N(y^N). \qquad (8.3)$$

Theorem 8.1.1 $\widetilde{P}_N$ is an unbiased estimator of $P_N$:

$$E\left\{\widetilde{P}_N(y^N(f,\varepsilon))\right\} = P_N(f)\qquad\forall f\in H,$$

with the variance

$$E\left[\widetilde{P}_N(y^N(f,\varepsilon)) - P_N(f)\right]^2 = \sum_{p=1}^k\frac{\varepsilon^{2p}}{p!}\,\|D^p P_N(f)\|_2^2. \qquad (8.4)$$
Proof. Let $\omega^N = \xi^N + i\eta^N$ ($\eta^N$ being independent of $\xi^N$ Gaussian vector with zero mean and unit covariance matrix). The distribution of $\omega^N$ remains invariant under rotations of $\mathbf{C}^N$ (viewed as a $2N$-dimensional real Euclidean space), while $P_N$ is an analytic function on $\mathbf{C}^N$ and is therefore a harmonic function on $\mathbf{C}^N$ (again viewed as a $2N$-dimensional real space). Therefore

$$E\left\{\widetilde{P}_N(y^N(f,\varepsilon))\right\} = E_{\xi^N}\left\{E_{\eta^N}\left\{P_N(f^N + \varepsilon\xi^N + i\varepsilon\eta^N)\right\}\right\} = E_{\omega^N}\left\{P_N(f^N + \varepsilon\omega^N)\right\} = P_N(f),$$

the concluding equality being given by the Mean Value Theorem for harmonic functions. Since $\widetilde{P}_N$ is unbiased, to determine the variance of the estimator at a fixed $f$ we can confine ourselves to the case of $P_N(f) = 0$.
Now let $\zeta^N$ be a Gaussian vector with zero mean and unit covariance matrix such that $\xi^N$, $\eta^N$, $\zeta^N$ are mutually independent, the observation noise thus being independent of these two vectors, and let $\omega^N = \xi^N + i\eta^N$, $\theta^N = \xi^N + i\zeta^N$. Since $\widetilde{P}_N$ clearly is real-valued, we have

$$E\left\{\widetilde{P}_N^2\right\} = E\left\{P_N(f^N + \varepsilon\omega^N)\,P_N(f^N + \varepsilon\theta^N)\right\},$$

whence, expanding $P_N$ in a Taylor series around $f^N$,

$$E\left\{\widetilde{P}_N^2\right\} = E\left\{\sum_{p,q=1}^k\frac{1}{p!q!}\left[D^p P_N(f^N)[\varepsilon\omega^N]^p\right]\left[D^q P_N(f^N)[\varepsilon\theta^N]^q\right]\right\}, \qquad (8.5)$$
where $A[h]^p = A[\underbrace{h,\dots,h}_{p\ \mathrm{times}}]$, $A[h_1,\dots,h_p]$ being a $p$-linear form.

Let $J_N^p$ be the set of multi-indices $\alpha = (\alpha_1,\dots,\alpha_N)$ with nonnegative entries and with $\sum_{j=1}^N\alpha_j = p$. For $\alpha\in J_N^p$, $z = \sum_{j=1}^N z_j\phi_j\in\mathbf{C}^N$ and $p = 1,\dots,k$ let

$$\alpha! = \alpha_1!\cdots\alpha_N!,\qquad z^\alpha = z_1^{\alpha_1}\cdots z_N^{\alpha_N},\qquad P_{p,\alpha} = D^pP_N(f^N)[\underbrace{\phi_1,\dots,\phi_1}_{\alpha_1},\underbrace{\phi_2,\dots,\phi_2}_{\alpha_2},\dots,\underbrace{\phi_N,\dots,\phi_N}_{\alpha_N}].$$

We have

$$\left[D^pP_N(f^N)[\varepsilon\omega^N]^p\right]\left[D^qP_N(f^N)[\varepsilon\theta^N]^q\right] = \sum_{\alpha\in J_N^p,\ \beta\in J_N^q}\varepsilon^{p+q}\,\frac{p!q!}{\alpha!\beta!}\,P_{p,\alpha}P_{q,\beta}\,(\omega^N)^\alpha(\theta^N)^\beta. \qquad (8.6)$$

Observe now that

$$E\left\{(\omega^N)^\alpha(\theta^N)^\beta\right\} = \prod_{j=1}^N E\left\{\omega_j^{\alpha_j}\theta_j^{\beta_j}\right\} = \prod_{j=1}^N\left[\delta_{\alpha_j\beta_j}\,\alpha_j!\right]. \qquad (8.7)$$

Indeed, all we need to verify is the concluding equality, i.e., the fact that if $\xi,\eta,\zeta$ are independent $\mathcal{N}(0,1)$ random variables and $r,s$ are nonnegative integers, then

$$E\left\{(\xi+i\eta)^r(\xi+i\zeta)^s\right\} = \delta_{rs}\,r!. \qquad (8.8)$$

But $E_\eta\left\{(\xi+i\eta)^r\right\} = H_r(\xi)$ is the $r$-th Hermite polynomial (see [31], p. 163), and (8.8) is precisely the orthogonality property of these polynomials:

$$E\left\{H_r(\xi)H_s(\xi)\right\} = \delta_{rs}\,r!$$

(see [4], p. 133). Combining (8.5), (8.6) and (8.7), we get

$$E\left\{\widetilde{P}_N^2\right\} = \sum_{p=1}^k\varepsilon^{2p}\sum_{\alpha\in J_N^p}\frac{(P_{p,\alpha})^2}{\alpha!} = \sum_{p=1}^k\frac{\varepsilon^{2p}}{p!}\,\|D^pP_N(f)\|_2^2,$$

the concluding equality being given by the fact that every $P_{p,\alpha}$ occurs exactly $\frac{p!}{\alpha!}$ times among the coefficients of the $p$-linear form $D^pP_N(f)[\cdot,\dots,\cdot]$ with respect to the basis $\{\phi_i\}_{i=1}^N$.
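As a quick sanity check of Theorem 8.1.1 (a numerical sketch added here, not part of the original text), one can approximate $\widehat{P}_N(y) = E_\eta\{P_N(y + i\varepsilon\eta^N)\}$ by Monte Carlo for a small polynomial and observe that, averaged over the observation noise, the estimator indeed reproduces $P_N(f)$:

import numpy as np

# Sketch: the "complexification" estimator of Theorem 8.1.1 for the cubic
# Hilbert-Schmidt polynomial P_N(x) = sum_j x_j^3 on R^N.
rng = np.random.default_rng(1)
N, eps = 4, 0.3
f = rng.standard_normal(N)

def P(x):                                   # accepts real or complex arguments
    return np.sum(x**3, axis=-1)

def P_hat(y, n_mc=4_000):
    eta = rng.standard_normal((n_mc, N))
    return np.mean(P(y + 1j * eps * eta)).real   # E_eta P_N(y + i*eps*eta)

# Average the estimator over many independent observations y = f + eps*xi:
vals = [P_hat(f + eps * rng.standard_normal(N)) for _ in range(5_000)]
print(P(f), np.mean(vals))                  # agree up to Monte Carlo error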
The construction underlying the proof of Theorem 8.1.1 yields the following result. Let $G(x)$ be a function on $\mathbf{R}^N$ which can be continued to an entire function $G(z)$ on $\mathbf{C}^N$ such that

$$|G(z)| \le c\exp\left\{\frac{\theta\|z\|_2^2}{2\varepsilon^2}\right\}$$

with some $\theta\in(0,1)$, $c<\infty$. Assume that an observation $y = x + \varepsilon\xi$ of a point $x\in\mathbf{R}^N$ is given, where the noise $\xi$ is Gaussian with zero mean and identity covariance matrix. Then the estimator

$$\widehat{G}(y) = E_\eta\left\{G(y + i\varepsilon\eta)\right\},$$

$\eta$ being independent of $\xi$ Gaussian random vector with zero mean and identity covariance matrix, is an unbiased estimator of $G(x)$, $x\in\mathbf{R}^N$, with variance

$$E\left[\widehat{G} - G(x)\right]^2 = \sum_{p=1}^\infty\frac{\varepsilon^{2p}}{p!}\,\|D^pG(x)\|_2^2\qquad\left[\|D^pG(x)\|_2^2 = \sum_{\alpha\in J_N^p}\frac{p!}{\alpha!}\left(\frac{\partial^pG(x)}{\partial x_1^{\alpha_1}\cdots\partial x_N^{\alpha_N}}\right)^2\right].$$
Note that $\widehat{G}$ is the unique unbiased estimator of $G$ in the class of estimators $\Psi(y)$ satisfying the condition

$$\forall y\in\mathbf{R}^N:\quad |\Psi(y)| \le c\exp\left\{\frac{\theta\|y\|_2^2}{2\varepsilon^2}\right\}\qquad[c<\infty,\ \theta\in(0,1)].$$

Corollary 8.1.1 Let

$$P(f) = \sum_{p=0}^k P^p(f)$$

be a polynomial on $H$ with Hilbert-Schmidt homogeneous components $P^0,\dots,P^k$ of Hilbert-Schmidt norms not exceeding $L$. Then for every $\varepsilon > 0$ one can choose $N = N(P,\varepsilon) < \infty$ in such a way that for the associated estimator $\widetilde{P}_N$ of $P(f)$ via observations (8.1) one has

$$\forall f\in H:\quad E\left[\widetilde{P}_N(y^N(f,\varepsilon)) - P(f)\right]^2 \le \varepsilon^2\|P'(f)\|^2 + c(k)L^2\varepsilon^4\left(1+\|f\|^{2k}\right). \qquad (8.9)$$

In particular, the resulting estimator is asymptotically efficient on every bounded subset of $H$.
Proof. For every positive integer $N$, for every $p\le k$ and every $f\in H$ we have

$$\left|P^p(f) - P^p(f^N)\right| \le \left|\sum_{\iota_1,\dots,\iota_p:\ \max_j\iota_j>N} P^p_{\iota_1,\dots,\iota_p}f_{\iota_1}\cdots f_{\iota_p}\right| \le \sqrt{\sum_{\iota_1,\dots,\iota_p:\ \max_j\iota_j>N}\left(P^p_{\iota_1,\dots,\iota_p}\right)^2}\;\|f\|^p,$$

whence for every positive $\delta$ there exists $N_1(\delta)$ such that

$$|P(f) - P(f^N)| \le \delta\left(1+\|f\|^k\right)\qquad\forall f\ \forall N\ge N_1(\delta),$$

and, similarly, there exists $N_2(\delta)$ such that

$$\|P'(f) - DP_N(f)\| \le \delta\left(1+\|f\|^{k-1}\right)\qquad\forall f\ \forall N\ge N_2(\delta).$$

It is also clear that for some $c_1(k)$ (depending only on $k$), for all $N$ and all $f\in H$ we have

$$\|D^pP_N(f)\|_2 \le Lc_1(k)\left(1+\|f\|^k\right).$$

Letting $N = \max\left[N_1(L\varepsilon^2),\ N_2(L\varepsilon^2)\right]$, we get (8.9) as a consequence of (8.4).
Remark 8.1.2 It is easily seen that if a polynomial $P$ satisfies the premise of Corollary 8.1.1, then the estimators $\widetilde{P}_N$ (see (8.3)) converge in the mean square, as $N\to\infty$, to an unbiased estimator $\widetilde{P}$ of $P(\cdot)$, the variance of the estimator being

$$E\left[\widetilde{P}(y^{f,\varepsilon}) - P(f)\right]^2 = \sum_{p=1}^k\frac{\varepsilon^{2p}}{p!}\,\|D^pP(f)\|_2^2.$$

Examples. I. A continuous linear form $P(f) = (p,f)$ always is a Hilbert-Schmidt polynomial, and the corresponding unbiased estimator is the standard plug-in estimator

$$\widetilde{P}(y^{f,\varepsilon}) = \sum_{j=1}^\infty y_j^{f,\varepsilon}p_j\qquad[p_j = (p,\phi_j)].$$
II. Let $P(f) = (Af,f)$ be a homogeneous continuous quadratic form, and let $[a_{j\ell}]$ be the matrix of the form with respect to the basis $\{\phi_j\}$. The estimator $\widetilde{P}_N$ of $P_N(f) = (Af^N,f^N)$ is

$$\widetilde{P}_N = \sum_{j\ne\ell;\ j,\ell\le N} a_{j\ell}\,y_j^{f,\varepsilon}y_\ell^{f,\varepsilon} + \sum_{j=1}^N a_{jj}\left(\left[y_j^{f,\varepsilon}\right]^2 - \varepsilon^2\right),$$

and the variance of this estimator is

$$E\left[\widetilde{P}_N - P_N(f)\right]^2 = 4\varepsilon^2\left\|(Af^N)^N\right\|^2 + 2\varepsilon^4\sum_{j,\ell=1}^N a_{j\ell}^2.$$

For $N$ fixed, this is an asymptotically efficient estimator of $P_N(f)$. The trivial plug-in estimator $P_N(y^N(f,\varepsilon))$ also is an asymptotically efficient estimator of $P_N(f)$ ($N$ is fixed), but its risk is greater than the one of $\widetilde{P}_N$ in terms of order of $\varepsilon^4$:

$$E\left[P_N(y^N(f,\varepsilon)) - P_N(f)\right]^2 = 4\varepsilon^2\left\|(Af^N)^N\right\|^2 + 2\varepsilon^4\left(\sum_{j,\ell=1}^N a_{j\ell}^2 + \frac{1}{2}\left[\sum_{j=1}^N a_{jj}\right]^2\right);$$

when $N$ is large, this difference can be decisive.

If $A$ is a Hilbert-Schmidt operator (i.e., $\sum_{j,\ell} a_{j\ell}^2 < \infty$), then the estimators $\widetilde{P}_N$ converge in the mean square, as $N\to\infty$, to an unbiased asymptotically efficient, on every bounded subset of $H$, estimator of $P(\cdot)$.
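For illustration (a simulation sketch added here, not part of the original text), one can compare the bias-corrected estimator $\widetilde{P}_N$ of the quadratic form with the plug-in estimator $P_N(y^N)$ and check both closed-form risks displayed above:

import numpy as np

# Sketch: risks of the bias-corrected and plug-in estimators of P_N(f) = (A f, f).
rng = np.random.default_rng(2)
N, eps, n_rep = 30, 0.1, 200_000
A = rng.standard_normal((N, N)); A = (A + A.T) / 2
f = rng.standard_normal(N) / np.sqrt(N)
target = f @ A @ f

Y = f + eps * rng.standard_normal((n_rep, N))       # n_rep independent observations y^N
plug_in = np.einsum('ri,ij,rj->r', Y, A, Y)         # P_N(y^N) = (A y, y)
corrected = plug_in - eps**2 * np.trace(A)          # tilde{P}_N (subtracts the noise bias)

theory_corr = 4*eps**2*np.sum((A @ f)**2) + 2*eps**4*np.sum(A**2)
theory_plug = theory_corr + eps**4*np.trace(A)**2
print(np.mean((corrected - target)**2), theory_corr)
print(np.mean((plug_in - target)**2), theory_plug)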
III. Similarly, the estimator $\widetilde{P}_N$ of the "diagonal" cubic polynomial $P_N(f) = \sum_{j=1}^N[(f,\phi_j)]^3$ is

$$\widetilde{P}_N = \sum_{j=1}^N\left\{\left[y_j^{f,\varepsilon}\right]^3 - 3\varepsilon^2 y_j^{f,\varepsilon}\right\},$$

and its variance is

$$E\left[\widetilde{P}_N(y^N(f,\varepsilon)) - P_N(f)\right]^2 = 9\varepsilon^2\sum_{j=1}^N f_j^4 + 18\varepsilon^4\sum_{j=1}^N f_j^2 + 6\varepsilon^6 N,\qquad f_j = (f,\phi_j).$$
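The variance formula for the cubic estimator can be checked in the same way (again a simulation sketch, not part of the original text):

import numpy as np

# Sketch: empirical vs. theoretical risk of the estimator of sum_j (f, phi_j)^3.
rng = np.random.default_rng(3)
N, eps, n_rep = 20, 0.2, 400_000
f = rng.standard_normal(N) / np.sqrt(N)
target = np.sum(f**3)

Y = f + eps * rng.standard_normal((n_rep, N))
est = np.sum(Y**3 - 3 * eps**2 * Y, axis=1)         # tilde{P}_N per observation

theory = 9*eps**2*np.sum(f**4) + 18*eps**4*np.sum(f**2) + 6*eps**6*N
print(np.mean((est - target)**2), theory)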
8.1.3 Extension

We have built asymptotically efficient, on bounded subsets of $H$, estimators for Hilbert-Schmidt polynomials. To achieve our final goals, we need to build a "nearly" asymptotically efficient estimate for a "nearly" Hilbert-Schmidt polynomial. Namely, assume that

$$P(f) = \sum_{p=0}^k P^p(f)$$

is a polynomial of degree $k\ge 2$ such that

(a) $P^p$ are Hilbert-Schmidt polynomials for $p\le k-1$, with $\|P^p\|_2\le L<\infty$;
(b.1) $\|P^k\| \equiv \sup\left\{|\Pi_k[f_1,\dots,f_k]|\ \big|\ \|f_\ell\|\le 1,\ \ell=1,\dots,k\right\}\le L$;
(b.2) $\|\Pi_{k,h}\|_2\le L\|h\|$,
(8.10)

where $\Pi_k[f_1,\dots,f_k]$ is the symmetric $k$-linear form associated with $P^k$ and $\Pi_{k,h}$ is the symmetric $(k-1)$-linear form obtained from $\Pi_k[f_1,\dots,f_k]$ when the last argument is set to a constant value $h$. E.g., the quadratic form $(Af,f)$ associated with a bounded symmetric operator $A$ satisfies (b.1), (b.2) with $L = \|A\|$. Another example of a homogeneous polynomial satisfying (b.1), (b.2) is given by a "diagonal" polynomial

$$P^k(f) = \sum_{j=1}^\infty c_j f_j^k\qquad[f_j = (f,\phi_j)]$$

with bounded sequence of coefficients $\{c_j\}$, or by a continuous "band-type" polynomial

$$P^k(f) = \sum_{\iota\in\mathcal{I}_k^d} c_\iota\, f_{\iota_1}\cdots f_{\iota_k}\qquad[f_j = (f,\phi_j)],$$

where $\mathcal{I}_k^d$ is the set of multi-indices $\iota = (\iota_1,\dots,\iota_k)$ such that $\max\limits_{\ell=1,\dots,k}\iota_\ell - \min\limits_{\ell=1,\dots,k}\iota_\ell \le d < \infty$.

Under condition (8.10) we can build an estimator for the polynomial $P_N(f) = P(f^N)$ as follows. Let $M < N$ be a given natural number. Consider the polynomial
$$P_{*,M}(f) = \sum_{p=0}^{k-1}P^p(f) + \Pi_k[f,\dots,f,f^M]. \qquad (8.11)$$

We have

$$\left|P_{*,M}(f^N) - P_N(f^N)\right| \le L\,\|f_{M+1}^N\|\,\|f^N\|^{k-1},\qquad f_{M+1}^N = \sum_{j=M+1}^N(f,\phi_j)\phi_j. \qquad (8.12)$$

At the same time, the homogeneous polynomial $\widehat{P}^k(f) = \Pi_k[f,\dots,f,f^M]$ corresponds to the symmetric $k$-linear form

$$\widehat{\Pi}_k(h_1,\dots,h_k) = \frac{1}{k}\left(\Pi_k[(h_1)^M,h_2,\dots,h_k] + \dots + \Pi_k[h_1,\dots,h_{k-1},(h_k)^M]\right),$$

and the coefficients of this form are as follows. Let us partition the set $\mathcal{I}_k$ of multi-indices $\iota = (\iota_1,\dots,\iota_k)$ of the coefficients into $M+1$ groups: the first $M$ groups $G_j$ contain the multi-indices $\iota$ with $\min\limits_{\ell=1,\dots,k}\iota_\ell = j$, $j=1,\dots,M$, and the group $G_{M+1}$ contains the multi-indices with $\min\limits_{\ell=1,\dots,k}\iota_\ell > M$. The coefficients of $\widehat{\Pi}_k$ with indices from $G_{M+1}$ are zero, and the absolute values of the coefficients with indices from $G_j$, $j=1,\dots,M$, are less than or equal to the absolute values of the coefficients $\Pi_{k,\iota}$ of $\Pi_k$ with the same indices. By (8.10.b.2),

$$\sqrt{\sum_{\iota\in G_j}\Pi_{k,\iota}^2}\le L,$$

whence

$$\|\widehat{P}^k\|_2^2 \le \sum_{j=1}^M\sum_{\iota\in G_j}\Pi_{k,\iota}^2 \le ML^2. \qquad (8.13)$$
Associating with the Hilbert-Schmidt polynomial $P_{*,M}(\cdot)$ estimator (8.3), let the latter be denoted by $\widetilde{P}^{M,N}(\cdot)$; applying Theorem 8.1.1, we get the following result:

Proposition 8.1.1 Let $P$ be a polynomial of degree $k\ge 2$ satisfying (8.10). Then, for every pair of positive integers $M, N$ ($M<N$), the estimator $\widetilde{P}^{M,N}(\cdot)$ for every $f\in H$ and every $\varepsilon\in(0,1)$ satisfies the relations

(a) $\ \left|E\left\{\widetilde{P}^{M,N}(y^N(f,\varepsilon))\right\} - P(f^N)\right| \le L\,\|f_{M+1}^N\|\,\|f^N\|^{k-1}$;
(b) $\ \left(E\left[\widetilde{P}^{M,N}(y^N(f,\varepsilon)) - P(f^N)\right]^2\right)^{1/2} \le \varepsilon\,\|P_{*,M}'(f^N)\| + c_1(k)L\varepsilon^2\left(1+\|f^N\|^k\right) + c_2(k)\varepsilon^k\sqrt{M}\,L\left(1+\|f^N\|^k\right) + L\,\|f_{M+1}^N\|\,\|f^N\|^{k-1}$.
(8.14)
8.2 From polynomials to smooth functionals

We are about to extend the techniques for asymptotically efficient estimation of Hilbert-Schmidt polynomials to estimating smooth functionals with Hilbert-Schmidt derivatives. As before, we assume that the set $\Sigma$ of signals satisfies A.1, i.e., it is a subset of the unit ball $\mathcal{O}$ of $H$ with Kolmogorov diameters satisfying

$$d_k(\Sigma) \le Lk^{-\beta}. \qquad (8.15)$$

As about the functional $F$ to be estimated, we assume that it is defined in the ball

$$\mathcal{O}_\rho = \{f\in H\mid \|f\| < 1+2\rho\}\qquad[\rho > 0]$$

and is $k\ge 3$ times continuously Frechet differentiable in $\mathcal{O}_\rho$. Moreover,

A.4.1. The derivatives $F^{(j)}(f)$, $f\in\mathcal{O}_\rho$, of order $j\le k-1$ have bounded Hilbert-Schmidt norms:

$$\sup\left\{\|F^{(j)}(f)\|_2\ \big|\ f\in\mathcal{O}_\rho\right\}\le L,\qquad 1\le j\le k-1; \qquad (8.16)$$

A.4.2. The $k$-th derivative $F^{(k)}(f)$ satisfies the inequality

$$\|F^{(k),g}(f)\|_2 \le L\|g\|\qquad\forall f\in\mathcal{O}_\rho\ \forall g\in H \qquad (8.17)$$

(cf. (8.10)), where
$$F^{(k),g}(f)[h_1,\dots,h_{k-1}] \equiv D^kF(f)[h_1,\dots,h_{k-1},g];$$

A.4.3. $F^{(k)}(f)$ is Hölder continuous, with exponent $\gamma > 0$, in the usual norm:

$$\|F^{(k)}(f) - F^{(k)}(g)\| \le L\|f-g\|^\gamma\qquad\forall f,g\in\mathcal{O}_\rho. \qquad (8.18)$$

Note that A.2, A.3 are nothing but the versions of A.4 associated with $k = 1, 2$, respectively. In these cases the sharp link between the asymptotic width $\beta$ of the set $\Sigma$ and the smoothness parameters of $F$ ensuring the possibility of asymptotically efficient, on $\Sigma$, estimation of $F$ was given by

$$\beta > \frac{1}{2(k+\gamma-1)}, \qquad (8.19)$$

and it would be natural to suppose that the same link works for $k > 2$ as well. It turns out, however, that the "correct tradeoff" between the width of $\Sigma$ and the smoothness of $F$ under assumption A.4 is given by

$$\beta > \frac{1}{2(k+\gamma)},\qquad k\ge 3. \qquad (8.20)$$

E.g., (8.19) says that to ensure asymptotically efficient estimation of a twice continuously differentiable functional with Lipschitz continuous second derivative ($k=2$, $\gamma=1$) the asymptotic width $\beta$ of $\Sigma$ should be $> \frac14$, while (8.20) says that in order to ensure the same possibility for a three times continuously differentiable functional with Hölder continuous, with close to 0 exponent $\gamma$, third derivative, it suffices to have $\beta > \frac16$. At the same time, common sense says to us that a twice continuously differentiable functional with Lipschitz continuous second order derivative is basically the same as a three times continuously differentiable functional with small Hölder continuity exponent of the third derivative; if so, where does the "jump down" $\beta > \frac14 \mapsto \beta > \frac16$ in the condition ensuring the possibility of asymptotically efficient estimation come from? The answer is that when passing from A.3 to A.4, we do not merely increase the number of derivatives of the functional, but impose a structural assumption on the derivatives of order $< k$: now they should be Hilbert-Schmidt polylinear operators. This structural assumption is exactly what is responsible for the above "jump down". More specifically, imposing on the second derivative of a smooth functional the restriction to be bounded in the Hilbert-Schmidt norm results in a completely new phenomenon: measure concentration.
The phenomenon of measure concentration was discovered by P. Lévy; in its rough form, the phenomenon is that a function $G$ with fixed modulus of continuity, say, Lipschitz continuous with constant 1, on a high-dimensional unit Euclidean sphere "almost everywhere is almost constant": there exists a constant $a = a(G)$ such that the probability $\mathrm{Prob}\{x \mid |G(x) - a(G)| > \varepsilon\}$, taken with respect to the uniform distribution of $x$ on the unit $n$-dimensional sphere, goes to 0, for every fixed $\varepsilon > 0$, as the dimension $n\to\infty$. In the case we are interested in -- the one when $G$ has a Hilbert-Schmidt second-order derivative -- this phenomenon can be expressed as follows:
Proposition 8.2.1 Let $G$ be a twice continuously differentiable in the ball $V_r = \{x\in\mathbf{R}^n\mid \|x\|\le r\}$ function, and let $\|G'(0)\|\le T$ and $\|G''(x)\|_2\le T$ for all $x\in V_r$ and some $T<\infty$. For $L: V_r\to\mathbf{R}$, let $M_\rho[L]$ be the average of $L$ taken over the uniform distribution on the sphere of radius $\rho$ centered at the origin, $0\le\rho\le r$. Then

$$M_r\left[(G(x)-G(0))^2\right] \le (1+\lambda)\frac{r^2}{n}\left[\|G'(0)\|^2 + T^2r(2+r)\right] + (2+\lambda+\lambda^{-1})\frac{r^4T^2}{4n}\qquad\forall\lambda>0. \qquad (8.21)$$
Remark 8.2.1 Note that if $T$ is fixed and $n$ is large, then (8.21) demonstrates that $G$ in $V_r$ is close, in the mean square sense, to the constant $G(0)$. Thus, the Proposition indeed demonstrates a kind of "measure concentration" phenomenon.

Proof. Let $g(x) = (G(x) - G(0))^2$. For $0<\rho\le r$, let $Q_\rho(h) = \|h\|^{2-n} - \rho^{2-n}$. For $0<\delta<\rho$, by Green's formula ($\Delta$ is the Laplacian) we have

$$\int_{\delta\le\|h\|\le\rho}\left\{g\Delta Q_\rho - Q_\rho\Delta g\right\}dh = \left\{\int_{\|h\|=\rho} + \int_{\|h\|=\delta}\right\}\left\{g\frac{\partial Q_\rho}{\partial e} - Q_\rho\frac{\partial g}{\partial e}\right\}dS(h), \qquad (8.22)$$

where $dS(h)$ is the element of area of the boundary of the strip $\{\delta\le\|h\|\le\rho\}$ and $e$ is the outer unit normal to the boundary. Since $\Delta Q_\rho = 0$, the left hand side in (8.22) is equal to

$$-\int_\delta^\rho s^{n-1}\left(s^{2-n}-\rho^{2-n}\right)\omega_n M_s[\Delta g]\,ds,$$

where $\omega_n$ is the surface area of a unit sphere in $\mathbf{R}^n$. As $\delta\to+0$, the right hand side in (8.22) tends to $(2-n)\omega_n M_\rho[g]$ (note that $g(0)=0$). Thus, passing to limit in (8.22) as $\delta\to+0$, we get

$$(n-2)M_\rho[g] = \int_0^\rho\left(s - s^{n-1}\rho^{2-n}\right)M_s[\Delta g]\,ds,$$

i.e.,

$$M_\rho[g] = \frac{\rho^2}{2n}\int_0^\rho\pi_\rho(s)M_s[\Delta g]\,ds,\qquad \pi_\rho(s) = \frac{2n}{(n-2)\rho^2}\left(s - s^{n-1}\rho^{2-n}\right)\ge 0,\qquad \int_0^\rho\pi_\rho(s)\,ds = 1. \qquad (8.23)$$

Now let $\ell(x) = G(x) - G(0)$, so that $g(x) = \ell^2(x)$. We have

$$\frac12\Delta g = \ell\Delta\ell + \|\nabla\ell\|^2. \qquad (8.24)$$

Let

$$A(\rho) = \max_{0\le s\le\rho}M_s[g],\qquad B(\rho) = \max_{0\le s\le\rho}M_s\left[(\Delta\ell)^2\right],\qquad C(\rho) = \max_{0\le s\le\rho}M_s\left[\|\nabla\ell\|^2\right].$$
From (8.23) and (8.24) it follows that

$$M_\rho[g] = \frac{\rho^2}{n}\int_0^\rho\pi_\rho(s)M_s\left[\ell\Delta\ell + \|\nabla\ell\|^2\right]ds \le \frac{\rho^2}{n}\int_0^\rho\pi_\rho(s)\left[M_s^{1/2}[\ell^2]\,M_s^{1/2}[(\Delta\ell)^2] + M_s[\|\nabla\ell\|^2]\right]ds \le \frac{\rho^2}{n}\left[A^{1/2}(\rho)B^{1/2}(\rho) + C(\rho)\right]$$

(since $\pi_\rho\ge 0$ and $\int_0^\rho\pi_\rho(s)\,ds = 1$), so that

$$M_\rho[g] \le \frac{r^2}{n}\left[A^{1/2}(r)B^{1/2}(r) + C(r)\right].$$

Since the resulting inequality is valid for all $\rho\le r$, we get for every $\delta>0$

$$A(r) \le \frac{r^2}{n}\left[\frac{\delta}{2}A(r) + \frac{1}{2\delta}B(r) + C(r)\right],$$

whence, setting $1 - \frac{r^2\delta}{2n} = \frac{1}{1+\lambda}$,

$$A(r) \le (1+\lambda)\frac{r^2}{n}C(r) + (2+\lambda+\lambda^{-1})\frac{r^4}{4n^2}B(r)\qquad\forall\lambda>0. \qquad (8.25)$$

Now, by the assumptions of the Proposition, in $V_r$ we have $\|\nabla\ell\| = \|\nabla G\| \le \|G'(0)\| + Tr$, whence

$$C(r) = \max_{0\le\rho\le r}M_\rho\left[\|\nabla\ell\|^2\right] \le \left(\|G'(0)\| + Tr\right)^2 \le \|G'(0)\|^2 + T^2r(2+r),$$

and

$$B(r) = \max_{0\le\rho\le r}M_\rho\left[(\Delta\ell)^2\right] = \max_{0\le\rho\le r}M_\rho\left[(\Delta G)^2\right] \le nT^2,$$

the concluding inequality being given by

$$(\Delta G)^2 = \left(\sum_{i=1}^n\frac{\partial^2 G}{\partial x_i^2}\right)^2 \le n\sum_{i=1}^n\left(\frac{\partial^2 G}{\partial x_i^2}\right)^2 \le nT^2.$$

In view of these bounds, (8.25) implies (8.21).
We are about to prove the following

Theorem 8.2.1 Let $\Sigma$, $F$ satisfy conditions A.1, A.4 and let (8.20) take place. Then $F$ admits an asymptotically efficient on $\Sigma$ estimation method.

Remark 8.2.2 The link (8.20) between $\beta$ and $\gamma$ is sharp (in the same sense as in Theorem 7.2.2). The proof (see [24]) follows the same line of argument as in Theorem 7.2.2, but is more involved, since now we should ensure the Hilbert-Schmidt property of the derivatives.
Proof. We just build the asymptotically efficient estimation method.

Setup. Given noise intensity $\varepsilon < 0.1$, let us set

$$m = m(\varepsilon) = \left\lfloor\frac{1}{\varepsilon^2\ln(1/\varepsilon)}\right\rfloor,\qquad M = M(\varepsilon) = \left\lfloor\frac{1}{\varepsilon^{2(k-1)}\ln(1/\varepsilon)}\right\rfloor \qquad (8.26)$$

(note that $M > m$ for all small enough values of $\varepsilon$, which is assumed from now on). Same as in the proof of Theorem 7.3.1, without loss of generality we may assume that

$$\forall f\in\Sigma:\quad \|f - f^n\| \le cn^{-\beta},\qquad n = m, M, \qquad (8.27)$$

and we can find $N = N(\varepsilon) > M$ such that

$$\forall f\in\Sigma:\quad |F(f) - F(f^N)| \le \varepsilon^4,\qquad \|F'(f) - F'(f^N)\| \le \varepsilon^4; \qquad (8.28)$$

here and in what follows, as usual,

$$f_p^q = \sum_{i=p}^q(f,\phi_i)\phi_i,\quad f^q = f_1^q\ \ [f\in H];\qquad y_p^q = y_p^q(f,\varepsilon) = \sum_{i=p}^q y_i^{f,\varepsilon}\phi_i = f_p^q + \varepsilon\sum_{i=p}^q\xi_i\phi_i,\quad y^q = y_1^q;\qquad \xi_p^q = \{\xi_i\}_{i=p}^q,\quad \xi^q = \xi_1^q.$$

In view of (8.28), we may focus on estimating the functional $F_N(f) = F(f^N)$, $f\in\Sigma$.
The estimate is as follows. Let

$$\widehat{f}^m = y^m = f^m + \varepsilon\sum_{i=1}^m\xi_i\phi_i,\qquad \widetilde{f}^m = \left\{\begin{array}{ll}\widehat{f}^m, & \|\widehat{f}^m\|\le 1+\rho,\\ (1+\rho)\|\widehat{f}^m\|^{-1}\widehat{f}^m, & \|\widehat{f}^m\| > 1+\rho,\end{array}\right. \qquad (8.29)$$

and let

$$G(h) = G_{\widetilde{f}^m}(h) = \sum_{\ell=0}^{k}\frac{1}{\ell!}F^{(\ell)}(\widetilde{f}^m)[\underbrace{h_{m+1}^N,\dots,h_{m+1}^N}_{\ell}]. \qquad (8.30)$$
Note that the polynomial $G$ is random: it depends, as on parameters, on the "initial fragments" $f^m$, $\xi^m$ of the observed signal and the observation noise; we usually skip indicating this dependence in the notation. Since the polynomial $G(h)$ depends on $h_{m+1}^N$ only and clearly satisfies (8.10) with $L$ depending only on the parameters involved in A.1, A.4, we may apply to this polynomial the construction from Section 8.1.3 with the already specified $M, N$ to get an estimator

$$\widehat{F}_\varepsilon = \widehat{F}_\varepsilon^{\widetilde{f}^m}(y_{m+1}^N)$$

of $G_{\widetilde{f}^m}(f_{m+1}^N)$ via observations $y_{m+1}^N = y_{m+1}^N(f,\varepsilon)$, with conditional ($\xi^m$ being fixed) bias and risk satisfying the relations (see Proposition 8.1.1 and take into account that $G(h)$, as a function of $h$, depends on the "tail" $h_{m+1}^N$ of $h$ only):

(a) $\ \left|E\left\{\widehat{F}_\varepsilon - G(f_{m+1}^N)\ \big|\ \xi^m\right\}\right| \le C\,\|f_{M+1}^N\|\,\|f_{m+1}^N\|^{k-1}$;
(b) $\ \left(E\left\{\left[\widehat{F}_\varepsilon - G(f_{m+1}^N)\right]^2\ \big|\ \xi^m\right\}\right)^{1/2} \le \varepsilon\,\|G_*'(f_{m+1}^N)\| + C\left[\varepsilon^2 + \varepsilon^k\sqrt{M} + \|f_{M+1}^N\|\,\|f_{m+1}^N\|^{k-1}\right]$,
(8.31)

where

$$G_*(h) = \sum_{\ell=0}^{k-1}\frac{1}{\ell!}F^{(\ell)}(\widetilde{f}^m)[h_{m+1}^N,\dots,h_{m+1}^N] + \frac{1}{k!}F^{(k)}(\widetilde{f}^m)[h_{m+1}^N,\dots,h_{m+1}^N,h_{m+1}^M]; \qquad (8.32)$$

here and in what follows, all $C$ denote positive quantities depending only on the data involved in A.1, A.4, and all $o(1)$ are deterministic functions of $\varepsilon$ depending on the same "side parameters" as the $C$'s and converging to 0 as $\varepsilon\to+0$.

The above $\widehat{F}_\varepsilon$ is our estimate of $F(f)$ via observations (8.1).
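To fix ideas, here is a compressed sketch of the two-stage construction (added here; it is not the author's code, it uses the hypothetical rank-one functional $F(f) = \cos((f,p))$ with a fixed unit vector $p$, whose derivatives are automatically Hilbert-Schmidt, and for simplicity it skips the $M$-truncation of Section 8.1.3 and the exact choices (8.26) of $m$, $M$): the first $m$ observations give the crude projection estimate $\widetilde{f}^m$, and the complexification estimator of Section 8.1 is then applied to the Taylor polynomial $G$ of $F(\widetilde{f}^m + \cdot)$ in the remaining coordinates.

import math
import numpy as np

rng = np.random.default_rng(5)

def estimate_F(y, eps, p, m, rho=0.1, k=3, n_mc=100_000):
    """Two-stage sketch for the hypothetical functional F(f) = cos((f, p));
    y holds the first N observations y_i = (f, phi_i) + eps*xi_i."""
    # Stage 1: crude estimate of the "head" f^m, truncated to the ball of radius 1 + rho.
    f_head = y[:m].copy()
    nrm = np.linalg.norm(f_head)
    if nrm > 1 + rho:
        f_head *= (1 + rho) / nrm
    t0 = f_head @ p[:m]                                  # (f~^m, p)
    # Degree-k Taylor polynomial of cos(t0 + s) in s = (h, p), h ranging over the tail space.
    derivs = [np.cos(t0), -np.sin(t0), -np.cos(t0), np.sin(t0)]   # d^l/dt^l cos, period 4
    coeffs = [derivs[l % 4] / math.factorial(l) for l in range(k + 1)]

    def G_poly(s):                                       # s may be a complex array
        return sum(c * s**l for l, c in enumerate(coeffs))

    # Stage 2: unbiased estimation of the polynomial at the tail of the signal. Since the
    # polynomial depends on the tail only through s = (h, p_tail), the complexification
    # estimator E_eta G(y_tail + i*eps*eta) reduces to a one-dimensional average.
    p_tail = p[m:]
    s_obs = y[m:] @ p_tail                               # = (f - f^m, p) + eps*||p_tail||*noise
    eta = rng.standard_normal(n_mc)
    return np.mean(G_poly(s_obs + 1j * eps * np.linalg.norm(p_tail) * eta)).real

# Tiny usage example with a synthetic signal inside the unit ball:
N, m, eps = 400, 50, 0.05
p = rng.standard_normal(N); p /= np.linalg.norm(p)
f = rng.standard_normal(N); f /= 2 * np.linalg.norm(f)
y = f + eps * rng.standard_normal(N)
print(np.cos(f @ p), estimate_F(y, eps, p, m))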
Accuracy analysis. We should prove (see (8.28)) that if $f\in\Sigma$, then

$$E\left[\widehat{F}_\varepsilon - F_N(f)\right]^2 \le \varepsilon^2\|F_N'(f)\|^2 + \varepsilon^2 o(1),\qquad \varepsilon\to 0. \qquad (8.33)$$

Assume that $\varepsilon$ is so small that $\|f_{m+1}^N\|\le\rho$ for all $f\in\Sigma$ (this indeed is the case for all small enough values of $\varepsilon$ in view of (8.27)). Since $F$ is well-defined in $\mathcal{O}_\rho$ and $\|\widetilde{f}^m\|\le 1+\rho$ by construction, the functional $F(\widetilde{f}^m + f_{m+1}^N)$ is well-defined for all $f\in\Sigma$ and all realizations of noises. Representing
Fb"
FN (f ) = Fb" |
Gfem (fmN+1 ) + Gfem (fmN+1 ) F (fem + fmN+1 ) {z
}
A + F (fem + fmN+1 ) |
{z
D
|
{z
F (f N );
B
}
we claim that in order to get (8.33) it suÆces to verify that
}
A)

$$\begin{array}{ll}
(a) & E\left[\widehat{F}_\varepsilon - G_{\widetilde{f}^m}(f_{m+1}^N)\right]^2 \le \varepsilon^2\sum\limits_{i=m+1}^N\left[F'(f^N)\right]_i^2 + \varepsilon^2 o(1)\qquad[\text{for }g\in H,\ g_i = (g,\phi_i)],\\[2mm]
(b) & \left|E\left\{\widehat{F}_\varepsilon - G_{\widetilde{f}^m}(f_{m+1}^N)\ \big|\ \xi^m\right\}\right| \le \varepsilon\,o(1);
\end{array} \qquad (8.34)$$

B)

$$E\left[G_{\widetilde{f}^m}(f_{m+1}^N) - F(\widetilde{f}^m + f_{m+1}^N)\right]^2 \le \varepsilon^2 o(1); \qquad (8.35)$$

C)

$$E\left[F(\widetilde{f}^m + f_{m+1}^N) - F(f^N)\right]^2 \le \varepsilon^2\sum_{i=1}^m\left[F'(f^N)\right]_i^2 + \varepsilon^2 o(1). \qquad (8.36)$$
Indeed, assuming that (8.34) -- (8.36) take place, we have

$$\begin{array}{rll}
E\left[\widehat{F}_\varepsilon - F_N(f)\right]^2 &= E\left\{[A+B+D]^2\right\} &\\
&\le (1+\theta)E\left\{[A+D]^2\right\} + (1+\theta^{-1})E\left\{B^2\right\} &[\forall\theta>0]\\
&\le (1+\theta)E\left\{[A+D]^2\right\} + (1+\theta^{-1})\varepsilon^2 o(1) &[\text{by (8.35)}]\\
&= (1+\theta)\left[E\left\{A^2 + D^2\right\} + 2E_{\xi^m}\left\{D\,E\{A\mid\xi^m\}\right\}\right] + (1+\theta^{-1})\varepsilon^2 o(1) &[\text{since }D\text{ depends on }\xi^m\text{ only}]\\
&\le (1+\theta)\left[E\left\{A^2 + D^2\right\} + \varepsilon\,o(1)\,E\{|D|\}\right] + (1+\theta^{-1})\varepsilon^2 o(1) &[\text{by (8.34.b)}]\\
&\le (1+\theta)\left[\varepsilon^2\left(\|F_N'(f)\|^2 + o(1)\right) + \varepsilon\,o(1)\,E\{|D|\}\right] + (1+\theta^{-1})\varepsilon^2 o(1) &[\text{by (8.34.a), (8.36)}]\\
&\le (1+\theta)\varepsilon^2\left(\|F_N'(f)\|^2 + o(1)\right) + (1+\theta^{-1})\varepsilon^2 o(1) &[\text{by (8.36)}]\\
&\le \varepsilon^2\|F_N'(f)\|^2 + \varepsilon^2 o(1) &[\text{choose appropriate }\theta=\theta(\varepsilon)\to 0,\ \varepsilon\to+0],
\end{array}$$

as required in (8.33). It remains to verify A) -- C).
Verifying A). We have

$$\left|E\left\{\widehat{F}_\varepsilon - G_{\widetilde{f}^m}(f_{m+1}^N)\ \big|\ \xi^m\right\}\right| \le C\,\|f_{M+1}^N\|\,\|f_{m+1}^N\|^{k-1}\ \ [\text{by (8.31.a)}]\ \le\ CM^{-\beta}m^{-\beta(k-1)}\ \ [\text{by (8.27)}]\ \le\ C\varepsilon^{4\beta(k-1)}(\ln(1/\varepsilon))^{\beta k}\ \ [\text{by (8.26)}]\ \le\ \varepsilon\,o(1), \qquad (8.37)$$

the concluding inequality holding true since $4\beta(k-1) > \frac{2(k-1)}{k+\gamma} \ge \frac{2(k-1)}{k+1} \ge 1$ by (8.20) and due to $\gamma\le 1$, $k\ge 3$. Further,

$$E\left\{\left[\widehat{F}_\varepsilon - G_{\widetilde{f}^m}(f_{m+1}^N)\right]^2\right\} \le (1+\theta)E\left\{\varepsilon^2\|G_*'(f_{m+1}^N)\|^2\right\} + (1+\theta^{-1})C\left[\varepsilon^4 + \varepsilon^{2k}M + \|f_{M+1}^N\|^2\|f_{m+1}^N\|^{2k-2}\right]\ \ [\forall\theta>0,\ \text{see (8.31.b)}]\ \le\ (1+\theta)E\left\{\varepsilon^2\|G_*'(f_{m+1}^N)\|^2\right\} + (1+\theta^{-1})\varepsilon^2 o(1), \qquad (8.38)$$

since $\varepsilon^{2k}M \le C\varepsilon^2/\ln(1/\varepsilon)$ by (8.26) and $\|f_{M+1}^N\|^2\|f_{m+1}^N\|^{2k-2} \le \varepsilon^2 o(1)$ as in (8.37). To complete the proof of (8.34.a), it suffices to show that

$$E\left\{\|G_*'(f_{m+1}^N)\|^2\right\} \le \sum_{i=m+1}^N\left[F_N'(f)\right]_i^2 + o(1); \qquad (8.39)$$

given (8.39), we can choose in the resulting estimate of (8.38) $\theta = \theta(\varepsilon)$ so slowly converging to 0 as $\varepsilon\to+0$ that the estimate will imply (8.34.a). To verify (8.39), note that by (8.32) and in view of A.4

$$\left\|G_*'(f_{m+1}^N) - \left[F'(\widetilde{f}^m)\right]_{m+1}^N\right\| \le C\,\|f_{m+1}^N\| \le o(1)$$

(the concluding inequality is given by (8.27)), whence
$$\begin{array}{rll}
\left(E\left\{\|G_*'(f_{m+1}^N)\|^2\right\}\right)^{1/2} &\le \left(E\left\{\left\|\left[F'(\widetilde{f}^m)\right]_{m+1}^N\right\|^2\right\}\right)^{1/2} + o(1) &\\[1mm]
&\le \sqrt{\sum\limits_{i=m+1}^N\left[F'(f^N)\right]_i^2} + \left(E\left\{\|F'(\widetilde{f}^m) - F'(f^N)\|^2\right\}\right)^{1/2} + o(1) &\\[1mm]
&\le \sqrt{\sum\limits_{i=m+1}^N\left[F'(f^N)\right]_i^2} + C\left(E\left\{\|f^N - \widetilde{f}^m\|^2\right\}\right)^{1/2} + o(1) &[\text{since }F'\text{ is Lipschitz continuous on }\mathcal{O}_\rho\text{ by A.4}]\\[1mm]
&\le \sqrt{\sum\limits_{i=m+1}^N\left[F'(f^N)\right]_i^2} + C\left(E\left\{\|f^N - \widehat{f}^m\|^2\right\}\right)^{1/2} + o(1) &\\[1mm]
&\le \sqrt{\sum\limits_{i=m+1}^N\left[F'(f^N)\right]_i^2} + C\left(m\varepsilon^2 + \|f_{m+1}^N\|^2\right)^{1/2} + o(1) &\\[1mm]
&\le \sqrt{\sum\limits_{i=m+1}^N\left[F'(f^N)\right]_i^2} + o(1) &[\text{see (8.26), (8.27)}];
\end{array}$$

since $F'$ is bounded in $\mathcal{O}_\rho$, (8.39) follows. A) is proved.

Verifying B). As it was already mentioned, for all small enough values of $\varepsilon$ the segment $[\widetilde{f}^m, \widetilde{f}^m + f_{m+1}^N]$ is, for all $f\in\Sigma$ and all realizations of noises, contained in $\mathcal{O}_\rho$. Due to the origin of $G(h) = G_{\widetilde{f}^m}(h)$ and in view of A.4.3 we have
$$\left|F(\widetilde{f}^m + f_{m+1}^N) - G(f_{m+1}^N)\right| \le C\,\|f_{m+1}^N\|^{k+\gamma} \le Cm^{-\beta(k+\gamma)}\ \ [\text{by (8.27)}]\ \le\ \varepsilon\,o(1)\ \ [\text{by (8.26) and (8.20)}],$$

which implies (8.35).

Verifying C). We have

$$E\left[F(\widetilde{f}^m + f_{m+1}^N) - F(f^N)\right]^2 = E\left\{\left[F(\widehat{f}^m + f_{m+1}^N) - F(f^N)\right]^2;\ \varepsilon\|\xi^m\|\le\rho\right\} + E\left\{\left[F(\widetilde{f}^m + f_{m+1}^N) - F(f^N)\right]^2;\ \varepsilon\|\xi^m\|>\rho\right\},$$

so that to verify C) it suffices to check that

$$\begin{array}{ll}
(a) & E\left\{\left[F(\widetilde{f}^m + f_{m+1}^N) - F(f^N)\right]^2;\ \varepsilon\|\xi^m\|>\rho\right\} \le \varepsilon^2 o(1),\\[1mm]
(b) & E\left\{\left[F(\widehat{f}^m + f_{m+1}^N) - F(f^N)\right]^2;\ \varepsilon\|\xi^m\|\le\rho\right\} \le \varepsilon^2\sum\limits_{i=1}^m\left[F'(f^N)\right]_i^2 + \varepsilon^2 o(1).
\end{array} \qquad (8.40)$$
Verifying (8.40.a): Since $F$ is bounded in $\mathcal{O}_\rho$, it suffices to prove that

$$\mathrm{Prob}\left\{\varepsilon\|\xi^m\|>\rho\right\} \le \varepsilon^2 o(1),$$

which is immediately given by Bernstein's inequality; for the sake of completeness, here is the proof:

$$\begin{array}{rll}
\mathrm{Prob}\left\{\varepsilon\|\xi^m\|>\rho\right\} &= \mathrm{Prob}\left\{\sum\limits_{i=1}^m\xi_i^2 > \rho^2\varepsilon^{-2}\right\} = \mathrm{Prob}\left\{\frac14\sum\limits_{i=1}^m\xi_i^2 > \frac14\rho^2\varepsilon^{-2}\right\} &\\[1mm]
&\le \exp\left\{-\frac14\rho^2\varepsilon^{-2}\right\}\,E\left\{\exp\left\{\frac14\sum\limits_{i=1}^m\xi_i^2\right\}\right\} &[\text{by Tschebyshev's inequality}]\\[1mm]
&= \exp\left\{-\frac14\rho^2\varepsilon^{-2}\right\}\left[E\left\{\exp\left\{\frac14\xi_1^2\right\}\right\}\right]^m &\\[1mm]
&\le \exp\left\{Cm - \frac14\rho^2\varepsilon^{-2}\right\} &\\[1mm]
&\le \exp\left\{-\frac18\rho^2\varepsilon^{-2}\right\}\quad\forall\varepsilon\le\varepsilon_0 &[\text{by (8.26)}]\\[1mm]
&\le \varepsilon^2 o(1). &
\end{array}$$
Verifying (8.40.b): this is the central point, and this is the point where the "measure concentration" is exploited. Let

$$g(h) = F(f^N + h) - F(f^N),\qquad h\in H_m,$$

where $H_m$ is the linear span of $\phi_1,\dots,\phi_m$. By A.4.1, this function satisfies the premise of Proposition 8.2.1 with $r = \rho$ and with $T = C$. Denoting by

$$\gamma_m(s) = a_m\exp\{-s^2/2\}\,s^{m-1}$$

the density of the distribution of $\|\xi^m\|$, we have

$$\begin{array}{rll}
E\left\{\left[F(\widehat{f}^m + f_{m+1}^N) - F(f^N)\right]^2;\ \varepsilon\|\xi^m\|\le\rho\right\} &= E\left\{g^2(\varepsilon\xi^m);\ \|\xi^m\|\le\rho/\varepsilon\right\} &\\[1mm]
&= \int\limits_0^{\rho/\varepsilon} M_{\varepsilon s}\left[g^2\right]\gamma_m(s)\,ds &[\text{the averages }M_t[\cdot]\text{ are defined in Proposition 8.2.1}]\\[1mm]
&\le \int\limits_0^{\infty}\left[(1+\theta)\frac{\varepsilon^2s^2}{m}\|g'(0)\|^2 + (1+\theta)C\,\frac{\varepsilon^3s^3(1+\varepsilon s)}{m} + (2+\theta+\theta^{-1})C\,\frac{\varepsilon^4s^4}{m}\right]\gamma_m(s)\,ds\quad\forall\theta>0 &[\text{by (8.21)}]\\[1mm]
&\le (1+\theta)\|g'(0)\|^2\frac{\varepsilon^2}{m}E\left\{\|\xi^m\|^2\right\} + (1+\theta)C\,E\left\{\frac{\varepsilon^3\|\xi^m\|^3 + \varepsilon^4\|\xi^m\|^4}{m}\right\} + (2+\theta+\theta^{-1})C\,E\left\{\frac{\varepsilon^4\|\xi^m\|^4}{m}\right\} &\\[1mm]
&\le (1+\theta)\varepsilon^2\|g'(0)\|^2 + C(2+\theta+\theta^{-1})\varepsilon^2\left[\varepsilon m^{1/2} + \varepsilon^2 m\right] &[\text{since }E\{\|\xi^m\|^2\} = m,\ E\{\|\xi^m\|^p\}\le c_p m^{p/2}]\\[1mm]
&\le (1+\theta)\varepsilon^2\|g'(0)\|^2 + (2+\theta+\theta^{-1})\varepsilon^2 o(1) &[\text{since }\varepsilon m^{1/2} = o(1)\text{ by (8.26)}]\\[1mm]
&= (1+\theta)\varepsilon^2\sum\limits_{i=1}^m\left[F'(f^N)\right]_i^2 + (2+\theta+\theta^{-1})\varepsilon^2 o(1) &[\text{by the origin of }g]\\[1mm]
&\le \varepsilon^2\sum\limits_{i=1}^m\left[F'(f^N)\right]_i^2 + \varepsilon^2 o(1) &[\text{choose appropriate }\theta=\theta(\varepsilon)\to0,\ \varepsilon\to+0],
\end{array}$$

as required in (8.40.b).
Bibliography

[1] Barron A. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. on Information Theory, v. 39 (1993), No. 3.
[2] Besov O.V., V.P. Il'in, S.M. Nikol'ski. Integral representations of functions and embedding theorems. Moscow: Nauka Publishers, 1975 (in Russian).
[3] Birge L. Approximation dans les espaces metriques et theorie de l'estimation. Z. Wahrscheinlichkeitstheorie verw. Geb., v. 65 (1983), 181-237.
[4] Cramer H. Mathematical Methods of Statistics. Princeton Univ. Press, Princeton, 1957.
[5] Donoho D., I. Johnstone. Ideal spatial adaptation via wavelet shrinkage. Biometrika, v. 81 (1994), No. 3, 425-455.
[6] Donoho D., I. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc., v. 90 (1995), No. 432, 1200-1224.
[7] Donoho D., I. Johnstone, G. Kerkyacharian, D. Picard. Wavelet shrinkage: Asymptopia? (with discussion and reply by the authors). J. Royal Statist. Soc., Series B, v. 57 (1995), No. 2, 301-369.
[8] Eubank R. Spline Smoothing and Nonparametric Regression. Dekker, New York, 1988.
[9] Goldenshluger A., A. Nemirovski. On spatially adaptive estimation of nonparametric regression. Math. Methods of Statistics, v. 6 (1997), No. 2, 135-170.
[10] Goldenshluger A., A. Nemirovski. Adaptive de-noising of signals satisfying differential inequalities. IEEE Transactions on Information Theory, v. 43 (1997).
[11] Golubev Yu. Asymptotic minimax estimation of regression function in additive model. Problemy peredachi informatsii, v. 28 (1992), No. 2, 3-15. (English transl. in Problems Inform. Transmission, v. 28, 1992.)
[12] Hardle W. Applied Nonparametric Regression. ES Monograph Series 19, Cambridge University Press, Cambridge, U.K., 1990.
[13] Ibragimov I.A., R.Z. Khasminski. Statistical Estimation: Asymptotic Theory. Springer, 1981.
[14] ... estimation in Gaussian white noise. Theory Probab. Appl., v. 31 (1986), No. 3, 391-406.
[15] Juditsky A. Wavelet estimators: Adapting to unknown smoothness. Math. Methods of Statistics, v. 6 (1997), No. 1, 1-25.
[16] Juditsky A., A. Nemirovski. Functional aggregation for nonparametric estimation. Technical report # 993 (March 1996), IRISA, Rennes.
[17] Korostelev A., A. Tsybakov. Minimax Theory of Image Reconstruction. Lecture Notes in Statistics, v. 82, Springer, New York, 1993.
[18] Lepskii O. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability and Its Applications, v. 35 (1990), No. 3, 454-466.
[19] Lepskii O. Asymptotically minimax adaptive estimation I: Upper bounds. Optimally adaptive estimates. Theory of Probability and Its Applications, v. 36 (1991), No. 4, 682-697.
[20] Lepskii O., E. Mammen, V. Spokoiny. Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. Ann. Statist., v. 25 (1997), No. 3, 929-947.
[21] Nemirovski A., D. Yudin. Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons, 1983.
[22] Nemirovski A. On forecast under uncertainty. Problemy peredachi informatsii, v. 17 (1981), No. 4, 73-83. (English transl. in Problems Inform. Transmission, v. 17, 1981.)
[23] Nemirovski A. On nonparametric estimation of smooth regression functions. Sov. J. Comput. Syst. Sci., v. 23 (1985), No. 6, 1-11.
[24] Nemirovski A. On necessary conditions for efficient estimation of functionals of a nonparametric signal in white noise. Theory Probab. Appl., v. 35 (1990), No. 1, 94-103.
[25] Nemirovski A. On nonparametric estimation of functions satisfying differential inequalities. In: R. Khasminski, Ed., Advances in Soviet Mathematics, v. 12, American Mathematical Society, 1992, 7-43.
[26] Pinsker M. Optimal filtration of square-integrable signals in Gaussian noise. Problemy peredachi informatsii, v. 16 (1980), No. 2, 120-133. (English transl. in Problems Inform. Transmission, v. 16, 1980.)
[27] Pinsker M., S. Efroimovich. Learning algorithm for nonparametric filtering. Automation and Remote Control, v. 45 (1984), No. 11, 1434-1440.
[28] Pisier G. Remarques sur un resultat non publie de B. Maurey. In: Seminaire d'analyse fonctionelle 1980-1981, v. 1 - v. 12, Ecole Polytechnique, Palaiseau, 1981.
[29] ... lando, 1983.
[30] Rosenblatt M. Stochastic Curve Estimation. Institute of Mathematical Statistics, Hayward, California, 1991.
[31] Suetin P.K. The Classical Orthogonal Polynomials. Nauka, Moscow, 1976 (in Russian).
[32] Wahba G. Spline Models for Observational Data. SIAM, Philadelphia, 1990.