In statistics we often want to discover (and sometimes impose) structure on observed data, and dimension plays a crucial role in this task. The setting that I will consider in this talk is the following: some high-dimensional data has been collected, but it (potentially) lives in a lower-dimensional space, whose dimension is called the intrinsic dimension of the dataset; the objective is to estimate the intrinsic dimension of the high-dimensional dataset.
Why would we want to do this? Dimensionality reduction techniques (e.g., PCA, manifold learning) usually rely on knowledge of the intrinsic dimension. Knowledge of the dimension is also important for avoiding the curse of dimensionality. From a computational perspective, the dimension of a dataset affects the amount of space needed to store the data (compressibility), and the speed of algorithms is also commonly affected by the dimension of the input data. One can also envision situations where we have access to some regression data but the design points are unknown (this occurs, for example, in graphon estimation problems); the dimension of the design space has a large impact on the rate at which the regression function can be recovered.
Our approach relies on having access to a certain graph: each vertex represents an observation, and there is an edge between two vertices if the corresponding observations are close in some metric. We model this graph as a random connection model (a model from continuum percolation), and use this to propose estimators for the intrinsic dimension based on the doubling property of the Lebesgue measure. I will give some conditions under which the dimension can be estimated consistently, and some bounds on the probability of correctly recovering an integer dimension. I will also show some numerical results and compare our estimators with some competing approaches from the literature.
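To illustrate the idea (not the talk's actual estimator, which is based on the random connection model), here is a minimal sketch of a doubling-based dimension estimate: the Lebesgue measure of a d-dimensional ball satisfies vol(B(x, 2r)) = 2^d vol(B(x, r)), so comparing the number of edges in the proximity graph at radius 2r with the number at radius r yields an estimate of d. All names and the choice of radius below are illustrative assumptions.

```python
import numpy as np

def doubling_dimension_estimate(points, r):
    """Illustrative doubling-based estimator (hypothetical sketch).

    Counts pairs of observations within distance r and within
    distance 2r; by the doubling property of Lebesgue measure,
    the counts scale roughly like r^d and (2r)^d, so their
    log-ratio in base 2 estimates the intrinsic dimension d.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances (the "some metric" of the graph).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(points)
    # Edge counts of the two proximity graphs (diagonal excluded).
    n_r = (dists < r).sum() - n
    n_2r = (dists < 2 * r).sum() - n
    if n_r == 0 or n_2r <= n_r:
        return float("nan")
    # Doubling property: the ratio of counts is roughly 2^d.
    return np.log2(n_2r / n_r)

# Toy data: a 1-dimensional segment embedded in R^3,
# so the intrinsic dimension is 1 while the ambient dimension is 3.
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, size=1000)
data = np.stack([t, 2 * t, -t], axis=1)
print(doubling_dimension_estimate(data, 0.05))  # close to 1
```

The brute-force distance matrix is only for exposition; in practice a neighbour-search structure (e.g., a k-d tree) would be used, and boundary effects bias the estimate slightly downward for small samples.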
This is joint work with Michel Mandjes.