Monte Carlo Integration is a numerical method for computing an integral \int_a^b f(x)\mathrm{d}x which only requires being able to evaluate the function f at arbitrary points. It is especially useful for higher-dimensional integrals (involving multiple variables) such as \int_A \int_B f(x, y)\mathrm{d}x\mathrm{d}y. In this article I will touch on the basics of Monte Carlo Integration as well as explain a number of variance reduction techniques which we can use to get a more accurate result with fewer computations.

Numerical Integration

Say we have a function f(x) = x^3 which we wish to integrate. Recall that the integral of f can, in some sense, be thought of as the (signed) area under the graph of f:

Suppose we're interested in integrating this function between 0 and 1. We can do this analytically:

\begin{aligned} \int_0^1 f(x)\mathrm{d}x &= \int_0^1 x^3\mathrm{d}x \\ &= \left.\frac{x^{4}}{4} \right\vert_{0}^{1} \\ &= \frac{1}{4} \end{aligned}

Quadrature Rules

One simple way of approximating the integral would be to:

  • Partition the domain [0, 1] into N equally-sized subintervals:
x_1 = 0, x_2 = 1/N, x_3 = 2/N, \ldots, x_N = (N-1)/N
  • Let \Delta x = 1/N be the size of each subinterval

  • Define \hat{I} = \sum_{i=1}^{N} \Delta x \, f(x_i) \approx \int_0^1 f(x)\mathrm{d}x

This gives the following JavaScript code (integrating between zero and one):

function integrate_zero_one(f, n) {
    var sum = 0.0;
    var delta_x = 1/n;

    for (var i = 0; i < n; i++) {
        sum += delta_x * f(i * delta_x);
    }

    return sum;
}

Visually, this can be thought of as summing the area of N rectangles of equal width, with height determined by the integrand at the left of each rectangle:

There are other methods for numerical approximation that work similarly to this basic method (known as a left Riemann sum), such as the Newton-Cotes rules (including the Trapezoidal rule and Simpson's rule) or Gauss-Legendre rules. Collectively these types of methods are called 'quadrature rules'.
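As an illustration, here is a minimal sketch of the Trapezoidal rule over [0, 1] in the same style as the function above (the name integrate_trapezoid_zero_one is my own, not from the original article):

function integrate_trapezoid_zero_one(f, n) {
    var sum = 0.0;
    var delta_x = 1 / n;

    for (var i = 0; i < n; i++) {
        // Average the integrand at the left and right endpoints of each subinterval
        var left = f(i * delta_x);
        var right = f((i + 1) * delta_x);
        sum += delta_x * (left + right) / 2;
    }

    return sum;
}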

Problems with Quadrature Rules

Quadrature rules are typically excellent for integrating functions of one variable. However, they all suffer from a problem known as the 'curse of dimensionality': the number of function evaluations they need for a given accuracy grows rapidly with the number of dimensions, so their approximations converge extremely slowly to the desired integral in multiple dimensions. See Veach's thesis (section 2.2) for a more rigorous analysis of quadrature rules.

Monte Carlo Integration

For integrating functions of multiple variables, it may be preferable to use a technique called Monte Carlo Integration.

Let's say that we want to compute the value of an integral \int_a^b f(x)\mathrm{d}x.

Suppose we are given a set of N random variables X_i \sim \mathrm{Uniform}(a, b) (typically called 'samples'). We would like to find some function \hat{I}_N(X_1, X_2, \ldots, X_N) such that

\mathbb{E}[\hat{I}_N(X_1, X_2, \ldots, X_N)] = \int_a^b f(x)\mathrm{d}x\mathrm{.}

Using the jargon, this means that \hat{I}_N is an 'estimator' of the integral of f. It should be intuitive that, for such a set of N samples X_i \sim \mathrm{Uniform}(a, b),

\mathbb{E}[f(X_k)] \approx \frac{1}{N}\sum_{i=1}^{N} f(X_i) \quad \text{for any } k \in \{1, 2, \ldots, N\}.

That is to say, the mean of our computed samples f(X_i) approaches the expectation of f as N gets larger (according to the law of large numbers). Since the X_i are distributed uniformly, we know that p_X(x) = 1/(b-a), so using the law of the unconscious statistician gives

\begin{aligned} \mathbb{E}[f(X_k)] &= \int_a^b f(x)p_X(x)\mathrm{d}x \\ &= \int_a^b \frac{f(x)}{b-a} \mathrm{d}x \\ &= \frac{1}{b-a}\int_a^b f(x)\mathrm{d}x \end{aligned}

hence

\begin{aligned} \frac{1}{N}\sum_{i=1}^{N} f(X_i) &\approx \frac{1}{b-a}\int_a^b f(x)\mathrm{d}x \\ \implies \frac{b-a}{N}\sum_{i=1}^{N} f(X_i) &\approx \int_a^b f(x)\mathrm{d}x \end{aligned}

This is excellent, as we can now set \hat{I}_N(X_1, X_2, \ldots, X_N) (which I will just write as \hat{I}_N from now on) equal to the left-hand side of the last line, \frac{b-a}{N}\sum_{i=1}^{N} f(X_i). This gives

\begin{aligned} \mathbb{E}[\hat{I}_N] &= \mathbb{E}\left[\frac{b-a}{N}\sum_{i=1}^{N}f(X_i)\right] \\ &= \frac{b-a}{N}\sum_{i=1}^{N}\mathbb{E}[f(X_i)] \\ &= \frac{b-a}{N}\sum_{i=1}^{N}\int_a^b f(x)p_X(x)\mathrm{d}x \\ &= \frac{b-a}{N}\sum_{i=1}^{N}\int_a^b \frac{f(x)}{b-a}\mathrm{d}x \\ &= \frac{1}{N}\sum_{i=1}^{N}\int_a^b f(x)\mathrm{d}x \\ &= \int_a^b f(x)\mathrm{d}x \end{aligned}

which is exactly what we want!

Monte Carlo Estimator

In fact, we can sample random variables from any distribution; we just need to change our \hat{I}_N slightly. Suppose now that the X_i are sampled from some distribution with an arbitrary probability density function p_X(x). Then define the Monte Carlo estimator of f with N samples to be:

\hat{I}_N = \frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}

Again, algebraic manipulation shows that:

\begin{aligned} \mathbb{E}[\hat{I}_N] &= \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \frac{1}{N}\sum_{i=1}^{N}\int_a^b \frac{f(x)}{p_X(x)}p_X(x)\mathrm{d}x \\ &= \frac{1}{N}\sum_{i=1}^{N}\int_a^b f(x)\mathrm{d}x \\ &= \int_a^b f(x)\mathrm{d}x \end{aligned}

It will become clear later why it is useful to be able to pick the XiX_i from any distribution.
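As a sketch of what this general estimator looks like in code, suppose we have a function sample() which draws an X_i from our chosen distribution and a function pdf(x) which evaluates p_X(x) (both names are my own, for illustration):

function integrate_monte_carlo_general(f, pdf, sample, n) {
    var sum = 0.0;

    for (var i = 0; i < n; i++) {
        // Draw X_i from the chosen distribution and weight by 1 / p_X(X_i)
        var x_i = sample();
        sum += f(x_i) / pdf(x_i);
    }

    return sum / n;
}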

Returning to the previous case, setting X_i \sim \mathrm{Uniform}(a, b), we can write some code which computes the value of the Monte Carlo estimator of f with N samples:

function integrate_monte_carlo(f, a, b, n) {
    var sum = 0.0;

    for (var i = 0; i < n; i++) {
        // Obtain x_i ~ Unif(a, b) from Unif(0, 1)
        var x_i = Math.random() * (b - a) + a;
        sum += f(x_i);
    }

    return (b - a) / n * sum;
}

// e.g integrate_monte_carlo(x => (x * x * x), 0.0, 1.0, 100)

Monte Carlo Integration in Multiple Dimensions

The Monte Carlo estimator extends easily to multiple dimensions. For example, in three dimensions: let X_i = (x_i, y_i, z_i) be drawn according to a joint probability density function p_X(x, y, z). Then the Monte Carlo estimator of a function f(x, y, z) is again

\hat{I}_N = \frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}\mathrm{.}
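As a concrete sketch, here is the estimator for a triple integral of f(x, y, z) over an axis-aligned box [ax, bx] × [ay, by] × [az, bz], using uniformly distributed samples so that p_X = 1 / volume (the function name and parameters are my own, for illustration):

function integrate_monte_carlo_3d(f, ax, bx, ay, by, az, bz, n) {
    var volume = (bx - ax) * (by - ay) * (bz - az);
    var sum = 0.0;

    for (var i = 0; i < n; i++) {
        // Sample a point uniformly inside the box, so p_X(x, y, z) = 1 / volume
        var x = Math.random() * (bx - ax) + ax;
        var y = Math.random() * (by - ay) + ay;
        var z = Math.random() * (bz - az) + az;
        sum += f(x, y, z);
    }

    // f / p_X = f * volume, hence the factor of volume here
    return volume / n * sum;
}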

Keep in mind that the Monte Carlo estimator is a function of N: for low values of N, the estimate will be quite inaccurate, but the law of large numbers tells us that as N increases, the estimate gets better and better. It is possible to show analytically that the error of the Monte Carlo estimator (defined by \hat{I}_N - \mathbb{E}[\hat{I}_N]) is O(N^{-1/2}) in any number of dimensions, whereas no quadrature rule has an error better than O(N^{-1/s}) in s dimensions. Again, see Veach's thesis for a more rigorous exploration of these results.

Variance Reduction Techniques

Ideally we want our estimator \hat{I}_N to give us an accurate result with as small a value of N as possible (since this implies fewer computations). Mathematically speaking, we would like to minimize \mathrm{Var}[\hat{I}_N].

Importance Sampling

One extremely clever way of reducing the variance of a Monte Carlo estimator is to strategically sample the X_i according to some probability density p_X(x) that closely approximates the integrand f. To see why this works, consider picking p_X(x) = cf(x)¹ (where c is some constant which ensures that p_X integrates to 1). Then

\begin{aligned} \mathrm{Var}[\hat{I}_N] &= \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{c}\right] \\ &= \mathrm{Var}\left[\frac{1}{c}\right] \\ &= 0\mathrm{.} \end{aligned}

This would be the perfect estimator! For all values of N, our estimator gives us the exact value of the integral. Unfortunately, it is not possible to choose such a p_X in the first place, because computing the normalization constant c involves computing the integral of f, which is exactly the thing we're trying to calculate. Also, it may be difficult to sample the X_i from the probability density function if we cannot find an analytic formula for the cumulative distribution function (which is required for CDF inversion sampling). Therefore we usually have to settle for picking samples from probability density functions which merely approximate the integrand.

To think about why this is an improvement over picking from a uniform distribution, consider the variance of our estimator:

\begin{aligned} \mathrm{Var}[\hat{I}_N] &= \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}\left[\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \frac{1}{N}\mathrm{Var}\left[\frac{f(X_i)}{p_X(X_i)}\right] \\ \implies \mathrm{Var}[\hat{I}_N] &\propto \mathrm{Var}\left[\frac{f(X_i)}{p_X(X_i)}\right] \end{aligned}

(where the second line holds if the X_i are independent). Now, if our choice of p_X(x) has a similar shape to f(x), then the ratio \frac{f(x)}{p_X(x)} should be almost constant, hence \mathrm{Var}\left[\frac{f(X_i)}{p_X(X_i)}\right] will be low, which is exactly what we want for our estimator.
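As a concrete sketch (my own example, not from the original article), suppose we want \int_0^1 x^2\mathrm{d}x = 1/3 and we importance sample with p_X(x) = 2x on [0, 1], which roughly follows the shape of the integrand. The CDF is x^2, so CDF inversion gives X_i = \sqrt{U_i} with U_i \sim \mathrm{Uniform}(0, 1):

function integrate_importance_sampled(n) {
    var sum = 0.0;

    for (var i = 0; i < n; i++) {
        // Sample X_i with density p_X(x) = 2x via CDF inversion: X_i = sqrt(U_i)
        var x_i = Math.sqrt(Math.random());
        // f(x) = x^2, so f(X_i) / p_X(X_i) = X_i^2 / (2 * X_i)
        sum += (x_i * x_i) / (2 * x_i);
    }

    return sum / n;
}

// e.g. integrate_importance_sampled(100) should be close to 1/3,
// with lower variance than uniformly sampling x => (x * x)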

Importance sampling is absolutely crucial for reducing the amount of computational work. For example, in computer graphics we are often trying to calculate the color of a point on a surface by (very loosely speaking) integrating the energy of light rays arriving at the point over a hemisphere of directions. We know that incoming light rays arriving perpendicular to the surface have a greater effect on its color than light rays arriving parallel to the surface, so we can sample more light rays close to the normal of the surface and get a faster-converging result!

Low Discrepancy Sampling

One final technique I will talk about for reducing the variance of our estimator is placing our samples as evenly as possible across the domain.

In addition to importance sampling, it is intuitive that we would like to explore the domain of our function f as evenly as possible. For example, recall the case of the Monte Carlo estimator where we have a uniform PDF. It should be clear that we would like our samples to be evenly spaced: that is, they should not be clumped up together (as they would be retrieving values of f that are nearly the same, providing little additional information), and similarly they should not leave large regions of the domain unexplored. In the case of a non-uniform PDF, the same holds true, though the connection is a little harder to see intuitively.

We can mathematically quantify how 'evenly spaced' the points in a sequence (x_n) are using a measurement called the Star Discrepancy. Using a low discrepancy sequence (LDS) gives us a slightly lower variance (especially for small sample counts) than naive pseudorandom sampling.
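For reference, for points x_1, \ldots, x_N in [0, 1], the star discrepancy can be written as (this is the standard one-dimensional definition, not taken from the original article):

D_N^* = \sup_{0 \le t \le 1} \left| \frac{\#\{i : x_i \in [0, t)\}}{N} - t \right|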

The image below uses pseudorandom samples to generate the coordinates of each point.

As you can see, there are areas of higher point density and lower point density. This results in high variance when doing Monte Carlo integration. The image below instead uses a low discrepancy sequence called a Sobol sequence for each coordinate:

The points are much more evenly spaced throughout the image, which is what we want.

One of the simplest low discrepancy sequences is called the van der Corput sequence. The base-b van der Corput sequence is defined by:

x_n = \sum_{k=0}^{\infty} d_k(n)b^{-k-1}

where d_k(n) is the k-th digit (counting from the least significant digit, starting at k = 0) of the expansion of n in base b. With b = 2, the sequence begins 0.5, 0.25, 0.75, 0.125, \ldots
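Here is a short sketch of how one might compute the n-th element of the base-b van der Corput sequence, working digit by digit from the least significant digit of n (the function name is my own):

function van_der_corput(n, base) {
    var result = 0.0;
    var denom = 1.0;

    while (n > 0) {
        // Peel off the least significant base-b digit of n and
        // mirror it to the corresponding fractional position
        denom *= base;
        result += (n % base) / denom;
        n = Math.floor(n / base);
    }

    return result;
}

// e.g. van_der_corput(1, 2) = 0.5, van_der_corput(2, 2) = 0.25,
//      van_der_corput(3, 2) = 0.75, van_der_corput(4, 2) = 0.125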

Conclusion

Monte Carlo integration is a powerful tool for evaluating high-dimensional integrals. We have seen how its variance can be reduced significantly through importance sampling and through choosing a low discrepancy sequence, both of which result in lowering the amount of computational work we need to do to obtain a reasonable result.

In the next article, I will talk about a technique called Multiple Importance Sampling which allows us to combine samples from multiple different probability density functions that we think match the shape of the integrand, reducing variance without introducing bias.

Footnotes

1: For the rest of this post we assume that f(x) is non-negative; otherwise such a choice of PDF would not be possible. We also assume that the PDF is non-zero wherever f is non-zero, to avoid division by zero.

References