Monte Carlo Integration is a numerical method for computing an integral $\int_a^b f(x)\mathrm{d}x$ which only requires being able to evaluate the function $f$ at arbitrary points. It is especially useful for higher-dimensional integrals (involving multiple variables) such as $\int_A \int_B f(x, y)\mathrm{d}x\mathrm{d}y$ . In this article I will touch on the basics of Monte Carlo Integration as well as explain a number of variance reduction techniques which we can use to get a more accurate result with fewer computations.

## Numerical Integration Say we have a function $f(x) = x^3$ which we wish to integrate. Recall that the integral of $f$ can, in some sense, be thought of as the (signed) area under the graph of $f$ :

Suppose we're interested in integrating this function between $0$ and $1$ . We can do this analytically:

\begin{aligned} \int_0^1 f(x)\mathrm{d}x &= \int_0^1 x^3\mathrm{d}x \\ &= \left.\frac{x^{4}}{4} \right\vert_{0}^{1} \\ &= \frac{1}{4} \end{aligned}

### Quadrature Rules One simple way of approximating the integral would be to:

• Partition the domain $[0, 1]$ into $N$ equally-sized subintervals:
$x_1 = 0, x_2 = 1/N, x_3 = 2/N, \ldots, x_N = (N-1)/N$
• Let $\Delta x = 1/N$ be the size of each subinterval

• Define $\hat{I} = \sum_{i=1}^{i=N} \Delta xf(x_i) \approx \int_0^1 f(x)\mathrm{d}x$

This gives the following JavaScript code (integrating between zero and one):

function integrate_zero_one(f, n) {
var sum = 0.0;
var delta_x = 1/n;

for (var i = 0; i < n; i++) {
sum += delta_x * f(i * delta_x);
}

return sum;
}


Visually, this can be thought of as summing the area of $N$ rectangles of equal width, with height determined by the integrand at the left of each rectangle:

There are other methods for numerical approximation that work similarly to this basic method (known as a left Reimann sum), such as Newton-Cotes rules (including the Trapezoidal rule and Simpson's rule), or Gauss-Legendre rules. Collectively these types of methods are called 'quadrature rules'.

### Problems with Quadrature Rules Quadrature rules are typically excellent for integrating functions of one variable. However they all suffer from a problem known as the 'curse of dimensionality', meaning that the approximations they give converge extremely slowly to the desired integral in multiple dimensions. See Veach's thesis (section 2.2) for a more rigorous analysis of quadrature rules.

## Monte Carlo Integration For integrating functions of multiple variables, it may be preferable to use a technique called Monte Carlo Integration.

Let's say that we want to compute the value of an integral $\int_a^b f(x)\mathrm{d}x$ .

Suppose we are given a set of $N$ random variables $X_i \sim \mathrm{Uniform}(a, b)$ (typically called 'samples'). We would like to find some function $\hat{I}_N(X_1, X_2, \ldots, X_N)$ such that

$\mathbb{E}[\hat{I}_N(X_1, X_2, \ldots, X_N)] = \int_a^b f(x)\mathrm{d}x\mathrm{.}$

Using the jargon, this means that $\hat{I}_N$ is an 'estimator' of $f$ . It should be intuitive that given a set of $N$ random variables $X_i \sim \mathrm{Uniform}(a, b)$ (typically called 'samples'), then

$\mathbb{E}[f(X_k)] \approx \frac{1}{N}\sum_{i=1}^{N} f(X_i) \text{\medspace\medspace\medspace\medspace\medspace for any } k \in \{1, 2, \ldots, N\}.$

That is to say, the mean of our computed samples $f(X_i)$ approaches the expectation of $f$ as $N$ gets larger (according to the law of large numbers). Since the $X_i$ are distributed uniformly, we know that $p_X(x) = 1/(b-a)$ , so using the law of the unconscious statistician gives

\begin{aligned} \mathbb{E}[f(X_k)] &= \int_a^b f(x)p_X(x)\mathrm{d}x \\ &= \int_a^b \frac{f(x)}{b-a} \mathrm{d}x\\ &= \frac{1}{b-a}\int_a^b f(x)\mathrm{d}x \end{aligned}

hence

\begin{aligned} \frac{1}{N}\sum_{i=1}^{N} f(X_i) &\approx \frac{1}{b-a}\int_a^b f(x)\mathrm{d}x \\ \implies \frac{b-a}{N}\sum_{i=1}^{N} f(X_i) &\approx \int_a^b f(x)\mathrm{d}x \end{aligned}

This is excellent, as we can now set $\hat{I}_N(X_1, X_2, \ldots, X_N)$ (which I will just write as $\hat{I}_N$ from now on) equal to this left hand side quantity. This gives

\begin{aligned} \mathbb{E}[\hat{I}_N] &= \mathbb{E}\left[\frac{b-a}{N}\sum_{i=1}^{N}f(X_i)\right] \\ &= \frac{b-a}{N}\sum_{i=1}^{N}\mathbb{E}[f(X_i)] \\ &= \frac{b-a}{N}\sum_{i=1}^{N}\int_a^b f(x)p_X(x)\mathrm{d}x \\ &= \frac{b-a}{N}\sum_{i=1}^{N}\int_a^b \frac{f(x)}{b-a}\mathrm{d}x \\ &= \frac{1}{N}\sum_{i=1}^{N}\int_a^b f(x)\mathrm{d}x \\ &= \int_a^b f(x)\mathrm{d}x \\ \end{aligned}

which is exactly what we want!

### Monte Carlo Estimator In fact, we can sample random variables from any distribution - we just need to change our $\hat{I}_N$ slightly. Suppose now that the $X_i$ are sampled from some distribution with an arbitrary probability density function $p_X(x)$ . Then define the Monte Carlo estimator of $f$ with $N$ samples to be:

$\hat{I}_N = \frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}$

Again, algebraic manipulation shows that: \begin{aligned} \mathbb{E}[\hat{I}_N] &= \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \frac{1}{N}\sum_{i=1}^{N}\int_a^b \frac{f(x)}{p_X(x)}p_X(x)\mathrm{d}x \\ &= \frac{1}{N}\sum_{i=1}^{N}\int_a^b f(x)\mathrm{d}x \\ &= \int_a^b f(x)\mathrm{d}x \\ \end{aligned}

It will become clear later why it is useful to be able to pick the $X_i$ from any distribution.

Returning to the previous case setting $X_i \sim \mathrm{Uniform}(a, b)$ , we can write some code which computes the value of the Monte Carlo estimator of $f$ with $N$ samples:

function integrate_monte_carlo(f, a, b, n) {
var sum = 0.0;

for (var i = 0; i < n; i++) {
// Obtain x_i ~ Unif(a, b) from Unif(0, 1)
var x_i = Math.random() * (b - a) + a;
sum += f(x_i);
}

return (b - a) / n * sum;
}

// e.g integrate_monte_carlo(x => (x * x * x), 0.0, 1.0, 100)


### Monte Carlo Integration in Multiple Dimensions The Monte Carlo estimator extends easily to multiple dimensions. For example, in three dimensions: let $X_i = (x_i, y_i, z_i)$ be drawn according to a joint probability density function $p_X(x, y, z)$ . Then the Monte Carlo estimator of a function $f(x, y, z)$ is again

$\hat{I}_N = \frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}\mathrm{.}$

Keep in mind that the Monte Carlo estimator is a function of $N$ - for low values of $N$ , the estimate will be quite inaccurate, but the law of large numbers tells us that as $N$ increases, the estimate gets better and better. It is possible to show analytically that the error of the Monte Carlo estimator (defined by $\hat{I}_N - \mathbb{E}[\hat{I}_N]$ ) is $O(N^{-1/2})$ in any number of dimensions, whereas no quadrature rule has an error better than $O(N^{-1/s})$ in $s$ dimensions. Again, see Veach's thesis for a more rigorous exploration of these results.

## Variance Reduction Techniques Ideally we want our estimator $\hat{I}_N$ to give us an accurate result with as small a value of $N$ as possible (since this implies fewer computations). Mathematically speaking, we would like to minimize $\mathrm{Var}[\hat{I}_N]$ .

### Importance Sampling One extremely clever way of reducing the variance of a Monte Carlo estimator is to strategically sample the $X_i$ according to some probability density $p_X(x)$ that closely approximates the integrand $f$ . To see why this works, consider picking $p_X(x) = cf(x)$ 1 (where $c$ is some constant which ensures that $p_X$ integrates to 1). Then

\begin{aligned} \mathrm{Var}[\hat{I}_N] &= \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{c}\right] \\ &= \mathrm{Var}\left[\frac{1}{c}\right] \\ &= 0\mathrm{.} \end{aligned}

This would be the perfect estimator! For all values of $N$ , our estimator gives us the exact value of the integral. However unfortunately, it is not possible to choose such a $p_X$ in the first place, because computing the normalization constant $c$ involves computing the integral of $f$ , which is exactly the thing we're trying to calculate. Also, it may be difficult to sample the $X_i$ from the probability density function if we cannot find an analytic formula for the cumulative distribution function (which is required for CDF inversion sampling). Therefore we usually have to settle for picking samples from probability density functions which merely approximate the integrand.

To think about why this is an improvement over picking from a uniform distribution, consider the variance of our estimator:

\begin{aligned} \mathrm{Var}[\hat{I}_N] &= \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}\left[\frac{f(X_i)}{p_X(X_i)}\right] \\ &= \frac{1}{N}\mathrm{Var}\left[\frac{f(X_i)}{p_X(X_i)}\right] \\ \implies \mathrm{Var}[\hat{I}_N] &\propto \mathrm{Var}\left[\frac{f(X_i)}{p_X(X_i)}\right] \end{aligned}

(where the second line holds if the $X_i$ are independent). Now, if our choice of $p_X(x)$ has a similar shape to $f(x)$ , then the expression $\frac{f(x)}{p_X(x)}$ should be almost constant, hence $\mathrm{Var}\left[\frac{f(X_i)}{p_X(X_i)}\right]$ will be low, which is exactly what we want for our estimator.

Importance sampling is absolutely crucial for reducing the amount of computational work. For example in computer graphics we are often trying to calculate the color of a point on a surface by (very loosely speaking) integrating the energy of light rays arriving at the point in a hemisphere of directions. We know that incoming light rays arriving perpendicular to the surface have a greater effect on its color than light rays arriving parallel to the surface, so we can sample more light rays close to the normal of the surface and get a faster converging result!

### Low Discrepancy Sampling One final technique I will talk about for reducing the variance of our estimator is uniform sample placement.

In addition to importance sampling, it is intuitive that we would like to explore the domain of our function $f$ as evenly as possible. For example, recall the case of the Monte Carlo estimator where we have a uniform PDF. It should be clear that we would like our samples to be evenly spaced: that is, they should not be clumped up together (as they would be retrieving values of $f$ that are nearly the same, providing little additional information) and similarly they should not be far apart. In the case of a non-uniform PDF, the same holds true, though the connection is a little harder to see intuitively.

We can mathematically quantify how 'evenly spaced' the points in a sequence $(x_n)$ are using a measurement called the Star Discrepancy. Using a low discrepancy sequence (LDS) gives us a slightly lower variance (especially for small sample counts) than naive pseudorandom sampling.

The image below uses pseudorandom samples to generate the coordinates of each point.

As you can see, there are areas of higher point density and lower point density. This results in high variance when doing Monte Carlo integration. The image below instead uses a low discrepancy sequence called a Sobol sequence for each coordinate:

The points are much more evenly spaced throughout the image, which is what we want.

One of the simplest low discrepancy sequences is called the van der Corput sequence. The base- $b$ van der Corput sequence is defined by:

$x_n = \sum_{k=0}^{\infty} d_k(n)b^{-k-1}$

where $d_k(n)$ is the $k$ th digit of the expansion of $n$ in base $b$ . With $b = 2$ , the sequence begins $0.5, 0.25, 0.75, 0.125 \ldots$

## Conclusion Monte Carlo integration is a powerful tool for evaluating high-dimensional integrals. We have seen how its variance can be reduced significantly through importance sampling and through choosing a low discrepancy sequence, both of which result in lowering the amount of computational work we need to do to obtain a reasonable result.

In the next article, I will talk about a technique called Multiple Importance Sampling which allows us to combine samples from multiple different probability density functions that we think match the shape of the integrand, reducing variance without introducing bias.

### Footnotes 1: For the rest of this post we assume that $f(x)$ is non-negative, otherwise such a choice of PDF would not be possible. We also assume that the PDF is non-zero wherever $f$ is non-zero, to avoid division by zero.