In the previous article I described Importance Sampling, a method for reducing the variance of a Monte Carlo estimator. In this article I will discuss a technique called Multiple Importance Sampling which allows us to combine samples from multiple different probability distributions that we think match the shape of the integrand, reducing variance without introducing bias.

Let's say we're trying to estimate the integral of a product of two functions, $I = \int g(x)h(x)\mathrm{d}x$. Suppose we have two probability density functions, $p_g(x)$ and $p_h(x)$ which have a similar shape to $g$ and $h$ respectively. It is not obvious how we should sample and weight the random variables to construct an unbiased estimator of $I$ given this information.

## Trying Some Stuff

One way to do this would be to importance sample strictly according to $p_g$, completely ignoring $p_h$:

\begin{aligned} \tag{1} \hat{I}N = \frac{1}{N}\sum{i=1}^{N}\frac{g(X_i)h(X_i)}{p_{g}(X_i)} \end{aligned}

This works well if $h$ is approximately constant over the domain, as the shape of the integrand $g(x)h(x)$ will match the shape of $g(x)$ closely. However if this assumption does not hold, then the scheme will be poor.

Another idea would be to construct a probability density function $p_{gh}(x) \propto p_g(x)p_h(x)$, and use importance sampling as before, picking the $X_i$ according to $p_{gh}$:

\begin{aligned} \tag{2} \hat{I}N = \frac{1}{N}\sum{i=1}^{N}\frac{g(X_i)h(X_i)}{p_{gh}(X_i)} \end{aligned}

This is totally reasonable, but in many cases it is not computationally feasible to sample random variables from $p_{gh}$. Typically, sampling random variables from a given PDF involves inverse transform sampling, which is only possible if you can compute the inverse CDF.

Instead of trying to construct a single PDF, we can sample from each distribution, then average the samples:

\begin{aligned} \tag{3} \hat{I}N = \frac{1}{2N}\sum{i=1}^{N}\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + \frac{1}{2N}\sum_{i=1}^{N}\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})} \end{aligned}

where the $X_{g,i}$ are picked according to $p_g$, and the $X_{h,i}$ according to $p_h$. Since each term of this sum is a Monte Carlo estimator with expected value $I/2$, we sum these to obtain an unbiased estimator. Note that this means we are now evaluating the integrand $2N$ times, with $N$ samples being taken according to each of $p_g$ and $p_h$.

The problem with this method is that it can lead to unnecessarily high variance. To see this, consider the variance of the estimator:

\begin{aligned} \mathrm{Var}[\hat{I}N] &= \frac{1}{4N^2}\sum{i=1}^{N}\mathrm{Var}\left[\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + \frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})}\right] \ &= \frac{1}{4N}\left(\mathrm{Var}\left[\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})}\right] + \mathrm{Var}\left[\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})}\right]\right) \ \end{aligned}

Since the variance is additive, if either $p_g$ or $p_h$ is a bad match for the integrand, we will get high variance - even if the other PDF is a perfect match.

## Generalizing

Let's try and think about the problem a bit more generally. The last technique was definitely along the right lines.

### Weights

It seems reasonable to explore estimators of the form

\begin{aligned} \tag{4} \hat{I}N = \frac{1}{N}\sum{i=1}^{N}w_g(X_{g,i})\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + \frac{1}{N}\sum_{i=1}^{N}w_h(X_{h,i})\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})} \end{aligned}

where $w_g(x), w_h(x)$ are some 'weighting' functions. In strategy $(1)$, we tried $w_g(x) = 1, w_h(x) = 0$, and in strategy $(3)$ we tried $w_g(x) = w_h(x) = 1/2$.

The notation at this point is starting to get a bit heavy, so try to keep in mind that we're looking for some expression $\hat{I}_N$ such that $\mathbb{E}[\hat{I}_N] = \int g(x)h(x)\mathrm{d}x$. Clearly not all choices of weighting functions leave us with an unbiased estimator. To see what constraints are required on the weighting functions to keep the estimator unbiased, consider the expected value of $(4)$:

\begin{aligned} \mathbb{E}[\hat{I}N] &= \frac{1}{N}\sum{i=1}^{N}\mathbb{E}\left[w_g(X_{g,i})\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + w_h(X_{h,i})\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})}\right] \ &= \int w_g(x)\frac{g(x)h(x)}{p_g(x)}p_g(x) + w_h(x)\frac{g(x)h(x)}{p_h(x)}p_h(x)\mathrm{d}x \ &= \int \left(w_g(x) + w_h(x)\right)g(x)h(x)\mathrm{d}x \end{aligned}

Therefore we can see that we require $\forall x, w_g(x) + w_h(x) = 1$ for the estimator to remain unbiased. It is also required that $w_g(x) = 0$ whenever $p_g(x) = 0$, so as to avoid a division by zero when evaluating the first term of $\hat{I}_N$ (and likewise for $w_h(x)$ and the second term).

### Sampling Techniques

Currently we've been sampling from 2 PDFs (let's call each of these a 'sampling technique'), but we can generalise this. Let $K$ be the number of sampling techniques, with PDFs $p_1, p_2, \dots, p_K$ that each match the shape of the integrand $g(x)h(x)$ in some way. Let $X_{j, i}$ be sampled according to $p_j$. Given some set of weighting functions $w_j$, we can then rewrite $(4)$ as:

\begin{aligned} \tag{5} \hat{I}N = \sum{j=1}^{K}\frac{1}{N}\sum_{i=1}^{N}w_j(X_{j,i})\frac{g(X_{j,i})h(X_{j,i})}{p_j(X_{j,i})} \end{aligned}

Note that the order of summation is not important here. We still require the two properties from before: $\forall x, \sum_{j=1}^{K} w_j(x) = 1$, and $p_j(x) = 0 \implies w_j(x) = 0$.

### Sample Counts

So far we've been taking exactly $N$ samples from each sampling technique, resulting in $KN$ samples in total. This can be generalised to allow greater or fewer samples for each sampling technique.

Let $n_j$ the number of samples taken for sampling technique $j$ (so that the total number of samples used by the estimator is $\sum_j n_j = N$). Then:

\begin{aligned} \tag{6} \hat{I}N = \sum{j=1}^{K}\frac{1}{n_j}\sum_{i=1}^{n_j}w_j(X_{j,i})\frac{g(X_{j,i})h(X_{j,i})}{p_j(X_{j,i})} \end{aligned}

The proof that this is results in an unbiased estimator is trivial and left as an exercise to the reader.

## Weighting Heuristics

Now that I've introduced the general MIS formulation, we can discuss some different choices of weighting functions.

### Balance Heuristic

Veach determined that the following choice of weights is reasonable:

\begin{aligned} w_j(x) = \frac{n_jp_j(x)}{\sum_{k=1}^{K} n_kp_k(x)} \end{aligned}

When inserted into the MIS definition $(6)$, this gives:

\begin{aligned} \hat{I}N &= \sum{j=1}^{K}\frac{1}{n_j}\sum_{i=1}^{n_j}\left(\frac{n_jp_j(X_{j,i})}{\sum_{k=1}^{K} n_kp_k(X_{j,i})}\frac{g(X_{j,i})h(X_{j,i})}{p_j(X_{j,i})}\right) \ \tag{7} &= \sum_{j=1}^{K}\sum_{i=1}^{n_j}\frac{g(X_{j,i})h(X_{j,i})}{\sum_{k=1}^{K} n_kp_k(X_{j,i})} \end{aligned}

Veach shows that (assuming positive weights), the balance heuristic is close to optimal. Specifically, with the balance heuristic:

\begin{aligned} \mathrm{Var}[\hat{I}_N] - \mathrm{Var}[I] \le \left(\frac{1}{\mathrm{min}_j n_j} - \frac{1}{\sum_j n_j}\right)\mathbb{E}[I]^2 \end{aligned}

That is to say, even using the most optimal choice of weights, the variance differs by at most the term on the right hand side. A proof of this is given in the Chapter 9 Appendix of Veach's thesis.

We can interpret the balance heuristic $(7)$ in a more natural way by rewriting it slightly. Let $c_j = n_j/N$, i.e the fraction of samples given to technique $j$. Then:

\begin{aligned} \hat{I}N &= \sum{j=1}^{K}\sum_{i=1}^{n_j}\frac{g(X_{j,i})h(X_{j,i})}{\sum_{k=1}^{K} n_kp_k(X_{j,i})} \ &= \frac{1}{N}\sum_{i=1}^{n_j}\sum_{j=1}^{K}\frac{g(X_{j,i})h(X_{j,i})}{\sum_{k=1}^{K} c_kp_k(X_{j,i})} \end{aligned}

Note that I have swapped the order of summation on the second line. One may notice that this resembles a standard Monte Carlo estimator with probability density function $\hat{p}(x) = \sum_j c_jp_j(x)$ which Veach calls the 'combined sample density'. This represents the distribution of a random variable that is equal to each $X_{j,i}$ with probability $1/N$.

### Cutoff Heuristic

The balance heuristic can introduce unnecessary variance in the case that one of the sampling techniques is a very close match to the integrand (this is closer to what typically happens in computer graphics). The reason for this is explored in more detail in Veach's thesis on pages 270-273.

The way to address this is to 'sharpen' the weighting functions - i.e increase the value of high weights while decreasing the value of low weights. For brevity,we will drop the argument $x$ and write $q_j = n_jp_j$. Veach defines the cutoff heuristic with parameter $0 \le \alpha \le 1$ by:

$w_j(x) = \begin{cases} \frac{q_j}{\sum_{k=1}^{N}{q_k | q_k \ge \alpha q_{\text{max}}}} &\text{if } q_j \ge \alpha q_{\text{max}} \ 0 &\text{otherwise} \end{cases}$

This has the effect of 'cutting off' any samples that are generated with a PDF less than $\alpha$ (and therefore increasing the value of the other weights so as to sum to 1). In the case of $\alpha = 0$ this reduces to the balance heuristic.

### Power Heuristic

The power heuristic with parameter $\beta \ge 1$ is defined as:

\begin{aligned} w_j(x) = \frac{q_j^\beta}{\sum_{k=1}^{K} q_k^\beta} \end{aligned}

In the case of $\beta = 1$, this also reduces to the balance heuristic. For larger values of $\beta$, more sharpening occurs. Veach determined empirically that the value $\beta = 2$ gives good results in typical scenes.

## Conclusion

Although I have not directly linked multiple importance sampling to computer graphics (see this PBRT section for a brief summary), MIS is such a crucial technique for reducing variance in offline rendering that it was mentioned as a reason for Eric Veach's technical Oscar. You might be amused to watch the video of him going up on stage and receiving the award from Michael B. Jordan and Kristen Bell, who clearly have no understanding of the words they are reading off the teleprompter.