
I was recently re-reading some fantastic course notes by the peerless Terence Tao, which treat basic statistics from a fairly high mathematical standpoint. I wanted a quick refresher on a couple of things, and came across a very nice connection between two important and beautiful results which are fundamental to statistics and machine learning.

The weak law of large numbers

The weak law of large numbers is an important theorem in statistics which underpins a lot of analysis and model-building. It tells us that the sample mean of independent, identically distributed random variables converges in probability to the expected value, under very weak assumptions on the random variable itself. This allows us to “assume normality” in a large number of cases, and to use the power of statistical tests based on the normal distribution in a wide range of contexts. If you’re not familiar with random variables, just think of rolling a die: the weak law of large numbers tells us that if we roll a die enough times and look at the proportion of rolls that came up 6, this proportion will approach $\frac{1}{6}$ as we roll the die more and more times.
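If you want to convince yourself of this numerically, here is a quick sanity check with numpy (my own sketch, not part of Tao's notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Roll a fair die n times and look at the proportion of sixes.
for n in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=n)      # uniform on {1, ..., 6}
    proportion = np.mean(rolls == 6)
    print(f"n = {n:>9,}: proportion of sixes = {proportion:.4f}")
```

As $n$ grows, the printed proportions settle down around $\frac{1}{6} \approx 0.1667$.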

We can state the weak law of large numbers more formally as follows. Suppose we are given a sequence $X_1, X_2, \ldots$ of iid copies of a random variable $X$ with expected value $\mu$. If for each $n$ we denote by $S_n$ the sum $X_1 + X_2 + \cdots + X_n$, then the random variables $\frac{S_n}{n}$ converge in probability to $\mu$, the true mean.
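Spelled out for reference, “converges in probability” here means that for every fixed $\varepsilon > 0$,

$$P\left(\left|\frac{S_n}{n} - \mu\right| > \varepsilon\right) \longrightarrow 0 \quad \text{as } n \to \infty.$$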

The weak law of large numbers is of enormous consolation to casino operators. Indeed, suppose that the random variable $X_i$ models the casino's loss on the $i$-th spin of a roulette wheel. Then on any given spin there is a chance of a significant loss; but over time we are guaranteed that the loss per spin will converge to some expected value $\mu$, and we can calibrate the payouts so that $\mu$ is negative and our bottom line is positive.
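As a concrete illustration (my own example, not from the post), take a straight-up bet of one unit on a single number in European roulette: 37 pockets, and a win pays 35 to 1. The player's expected value per spin is $-\frac{1}{37} \approx -0.027$ units, and the law of large numbers guarantees the long-run average converges to it:

```python
import numpy as np

rng = np.random.default_rng(42)

# Straight-up bet of 1 unit on a single number in European roulette:
# 37 pockets (0-36), a win pays 35 to 1, otherwise the stake is lost.
n_spins = 1_000_000
wins = rng.integers(0, 37, size=n_spins) == 17      # betting on 17, say
player_profit = np.where(wins, 35.0, -1.0)

print("average player profit per spin:", player_profit.mean())
print("theoretical expected value:    ", -1 / 37)   # ≈ -0.027 units per spin
```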

The curse of dimensionality

If you think about a point in space being represented by three numbers (a simple generalisation of the familiar Cartesian plane from high-school geometry), then collections of $n$ numbers represent points in some theoretical $n$-dimensional space, which has no spatially interpretable meaning for $n > 3$. The curse of dimensionality is a catch-all term used by statisticians and mathematicians to describe a bunch of highly counterintuitive behaviours which occur in these “non-spatial” dimensions, particularly when $n$ is very large (i.e. thousands or even millions of dimensions).

The most well-known, and the most prone to cause issues, is the strange density property of high-dimensional space: in a very vague sense, paraphrasing Wikipedia, there is essentially no volume in the “middle” of a unit hypercube; it is almost all “pushed” out to the corners. This is a weird idea, and it’s even weirder that we can use the weak law of large numbers, a concept from statistics with seemingly little relation to high-dimensional geometry, to make this notion precise in an elegant way.

Let’s state the connection now. Choose some small quantity $\varepsilon > 0$. Then for $n$ sufficiently large, at least $1-\varepsilon$ of the volume of the hypercube $[-1,1]^n$ is contained in the set

$$\left\{(x_1,\ldots,x_n) \in [-1,1]^n \;:\; \sqrt{\left(\tfrac{1}{3}-\varepsilon\right)n} \;\leq\; |(x_1,\ldots,x_n)| \;\leq\; \sqrt{\left(\tfrac{1}{3}+\varepsilon\right)n}\right\}.$$

This equation might be a little hard to understand at first, particularly if you have never seen a set written like this before. It’s describing the set of all points in $n$-dimensional space whose distance from zero (denoted by the bars around $(x_1,\ldots,x_n)$) satisfies some inequality: it is larger than the quantity on the left and smaller than the quantity on the right. Now think about what happens if we choose a very, very small $\varepsilon$. Then $\sqrt{\left(\tfrac13-\varepsilon\right)n}$ is just a tiny bit smaller than $\sqrt{\tfrac{n}{3}}$ and $\sqrt{\left(\tfrac13+\varepsilon\right)n}$ just a tiny bit larger, and we’re saying that almost all of the hypercube’s volume is contained in the thin strip of points whose distance from zero is between $\sqrt{\left(\tfrac13-\varepsilon\right)n}$ and $\sqrt{\left(\tfrac13+\varepsilon\right)n}$.
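To get a feel for this, here is a quick Monte Carlo check (my own sketch, not from Tao's notes): sample points uniformly from $[-1,1]^n$ and measure what fraction of them land in the strip for a fixed $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05
n_points = 10_000

# Sample points uniformly from [-1, 1]^n and measure the fraction whose
# distance from the origin lands in the strip from the statement above.
for n in [2, 10, 100, 1000]:
    points = rng.uniform(-1.0, 1.0, size=(n_points, n))
    norms = np.linalg.norm(points, axis=1)
    lower = np.sqrt((1 / 3 - eps) * n)
    upper = np.sqrt((1 / 3 + eps) * n)
    fraction = np.mean((norms >= lower) & (norms <= upper))
    print(f"n = {n:>4}: fraction of points in the strip ≈ {fraction:.3f}")
```

For small $n$ only a modest fraction of the cube lies in the strip, but as $n$ grows the fraction climbs towards 1.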

Proof

Following Terence Tao’s notes, we let $X_1, \ldots, X_n$ be drawn uniformly and independently from the interval $[-1,1]$, which gives a vector $(X_1,\ldots,X_n)$ uniformly distributed on $[-1,1]^n$. Their squares $X_1^2, \ldots, X_n^2$ are also iid random variables, and so if we let $\lambda$ represent the uniform probability measure on $[-1,1]$ we can compute their expectations as

$$\mathbb{E}[X_i^2] = \int_{-1}^{1} x^2 \, d\lambda(x)$$

since the uniform probability measure on $[-1,1]$ is $\frac{dx}{2}$ (the interval has length 2). This then equals

$$\int_{-1}^{1} \frac{x^2}{2}\, dx = \left[\frac{x^3}{6}\right]_{-1}^{1} = \frac{1}{3}.$$
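A quick numerical check of this expectation, again just a numpy sanity check rather than part of the argument:

```python
import numpy as np

rng = np.random.default_rng(1)

# Estimate E[X^2] for X uniform on [-1, 1]; the sample mean should be
# close to the exact value 1/3.
samples = rng.uniform(-1.0, 1.0, size=1_000_000)
print((samples ** 2).mean())   # ≈ 0.3333
```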

The weak law of large numbers tells us that the quantity $\frac{X_1^2+\ldots+X_n^2}{n}$ converges in probability to $\frac{1}{n}\sum_{i=1}^n\frac13=\frac13$, which means that for a given $\varepsilon$ we can find $n$ large enough that

$$P\left(\left|\frac{X_1^2+\ldots+X_n^2}{n} - \frac{1}{3}\right| \leq \varepsilon\right) \geq 1-\varepsilon,$$

i.e. with probability at least $1-\varepsilon$ we have $\left(\tfrac13-\varepsilon\right)n \leq X_1^2+\ldots+X_n^2 \leq \left(\tfrac13+\varepsilon\right)n$, and taking square roots of both sides we see this is precisely the definition of $(X_1,\ldots,X_n)$ lying in the region described above. Since $(X_1,\ldots,X_n)$ is uniformly distributed on the hypercube, a probability of at least $1-\varepsilon$ is exactly a fraction of at least $1-\varepsilon$ of the hypercube's volume, which is what we wanted to show.