In the previous article I described Importance Sampling, a method for reducing the variance of a Monte Carlo estimator. In this article I will discuss a technique called Multiple Importance Sampling which allows us to combine samples from multiple different probability distributions that we think match the shape of the integrand, reducing variance without introducing bias.

Let's say we're trying to estimate the integral of a product of two functions, $I = \int g(x)h(x)\mathrm{d}x$. Suppose we have two probability density functions, $p_g(x)$ and $p_h(x)$ which have a similar shape to $g$ and $h$ respectively. It is not obvious how we should sample and weight the random variables to construct an unbiased estimator of $I$ given this information.

## Trying Some Stuff One way to do this would be to importance sample strictly according to $p_g$, completely ignoring $p_h$:

\begin{aligned} \tag{1} \hat{I}N = \frac{1}{N}\sum{i=1}^{N}\frac{g(X_i)h(X_i)}{p_{g}(X_i)} \end{aligned}

This works well if $h$ is approximately constant over the domain, as the shape of the integrand $g(x)h(x)$ will match the shape of $g(x)$ closely. However if this assumption does not hold, then the scheme will be poor.

Another idea would be to construct a probability density function $p_{gh}(x) \propto p_g(x)p_h(x)$, and use importance sampling as before, picking the $X_i$ according to $p_{gh}$:

\begin{aligned} \tag{2} \hat{I}N = \frac{1}{N}\sum{i=1}^{N}\frac{g(X_i)h(X_i)}{p_{gh}(X_i)} \end{aligned}

This is totally reasonable, but in many cases it is not computationally feasible to sample random variables from $p_{gh}$. Typically, sampling random variables from a given PDF involves inverse transform sampling, which is only possible if you can compute the inverse CDF.

Instead of trying to construct a single PDF, we can sample from each distribution, then average the samples:

\begin{aligned} \tag{3} \hat{I}N = \frac{1}{2N}\sum{i=1}^{N}\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + \frac{1}{2N}\sum_{i=1}^{N}\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})} \end{aligned}

where the $X_{g,i}$ are picked according to $p_g$, and the $X_{h,i}$ according to $p_h$. Since each term of this sum is a Monte Carlo estimator with expected value $I/2$, we sum these to obtain an unbiased estimator. Note that this means we are now evaluating the integrand $2N$ times, with $N$ samples being taken according to each of $p_g$ and $p_h$.

The problem with this method is that it can lead to unnecessarily high variance. To see this, consider the variance of the estimator:

\begin{aligned} \mathrm{Var}[\hat{I}N] &= \frac{1}{4N^2}\sum{i=1}^{N}\mathrm{Var}\left[\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + \frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})}\right] \ &= \frac{1}{4N}\left(\mathrm{Var}\left[\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})}\right] + \mathrm{Var}\left[\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})}\right]\right) \ \end{aligned}

Since the variance is additive, if either $p_g$ or $p_h$ is a bad match for the integrand, we will get high variance - even if the other PDF is a perfect match.

## Generalizing Let's try and think about the problem a bit more generally. The last technique was definitely along the right lines.

### Weights It seems reasonable to explore estimators of the form

\begin{aligned} \tag{4} \hat{I}N = \frac{1}{N}\sum{i=1}^{N}w_g(X_{g,i})\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + \frac{1}{N}\sum_{i=1}^{N}w_h(X_{h,i})\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})} \end{aligned}

where $w_g(x), w_h(x)$ are some 'weighting' functions. In strategy $(1)$, we tried $w_g(x) = 1, w_h(x) = 0$, and in strategy $(3)$ we tried $w_g(x) = w_h(x) = 1/2$.

The notation at this point is starting to get a bit heavy, so try to keep in mind that we're looking for some expression $\hat{I}_N$ such that $\mathbb{E}[\hat{I}_N] = \int g(x)h(x)\mathrm{d}x$. Clearly not all choices of weighting functions leave us with an unbiased estimator. To see what constraints are required on the weighting functions to keep the estimator unbiased, consider the expected value of $(4)$:

\begin{aligned} \mathbb{E}[\hat{I}N] &= \frac{1}{N}\sum{i=1}^{N}\mathbb{E}\left[w_g(X_{g,i})\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + w_h(X_{h,i})\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})}\right] \ &= \int w_g(x)\frac{g(x)h(x)}{p_g(x)}p_g(x) + w_h(x)\frac{g(x)h(x)}{p_h(x)}p_h(x)\mathrm{d}x \ &= \int \left(w_g(x) + w_h(x)\right)g(x)h(x)\mathrm{d}x \end{aligned}

Therefore we can see that we require $\forall x, w_g(x) + w_h(x) = 1$ for the estimator to remain unbiased. It is also required that $w_g(x) = 0$ whenever $p_g(x) = 0$, so as to avoid a division by zero when evaluating the first term of $\hat{I}_N$ (and likewise for $w_h(x)$ and the second term).

### Sampling Techniques Currently we've been sampling from 2 PDFs (let's call each of these a 'sampling technique'), but we can generalise this. Let $K$ be the number of sampling techniques, with PDFs $p_1, p_2, \dots, p_K$ that each match the shape of the integrand $g(x)h(x)$ in some way. Let $X_{j, i}$ be sampled according to $p_j$. Given some set of weighting functions $w_j$, we can then rewrite $(4)$ as:

\begin{aligned} \tag{5} \hat{I}N = \sum{j=1}^{K}\frac{1}{N}\sum_{i=1}^{N}w_j(X_{j,i})\frac{g(X_{j,i})h(X_{j,i})}{p_j(X_{j,i})} \end{aligned}

Note that the order of summation is not important here. We still require the two properties from before: $\forall x, \sum_{j=1}^{K} w_j(x) = 1$, and $p_j(x) = 0 \implies w_j(x) = 0$.

### Sample Counts So far we've been taking exactly $N$ samples from each sampling technique, resulting in $KN$ samples in total. This can be generalised to allow greater or fewer samples for each sampling technique.

Let $n_j$ the number of samples taken for sampling technique $j$ (so that the total number of samples used by the estimator is $\sum_j n_j = N$). Then:

\begin{aligned} \tag{6} \hat{I}N = \sum{j=1}^{K}\frac{1}{n_j}\sum_{i=1}^{n_j}w_j(X_{j,i})\frac{g(X_{j,i})h(X_{j,i})}{p_j(X_{j,i})} \end{aligned}

The proof that this is results in an unbiased estimator is trivial and left as an exercise to the reader.

## Weighting Heuristics Now that I've introduced the general MIS formulation, we can discuss some different choices of weighting functions.

### Balance Heuristic Veach determined that the following choice of weights is reasonable:

\begin{aligned} w_j(x) = \frac{n_jp_j(x)}{\sum_{k=1}^{K} n_kp_k(x)} \end{aligned}

When inserted into the MIS definition $(6)$, this gives:

\begin{aligned} \hat{I}N &= \sum{j=1}^{K}\frac{1}{n_j}\sum_{i=1}^{n_j}\left(\frac{n_jp_j(X_{j,i})}{\sum_{k=1}^{K} n_kp_k(X_{j,i})}\frac{g(X_{j,i})h(X_{j,i})}{p_j(X_{j,i})}\right) \ \tag{7} &= \sum_{j=1}^{K}\sum_{i=1}^{n_j}\frac{g(X_{j,i})h(X_{j,i})}{\sum_{k=1}^{K} n_kp_k(X_{j,i})} \end{aligned}

Veach shows that (assuming positive weights), the balance heuristic is close to optimal. Specifically, with the balance heuristic:

\begin{aligned} \mathrm{Var}[\hat{I}_N] - \mathrm{Var}[I] \le \left(\frac{1}{\mathrm{min}_j n_j} - \frac{1}{\sum_j n_j}\right)\mathbb{E}[I]^2 \end{aligned}

That is to say, even using the most optimal choice of weights, the variance differs by at most the term on the right hand side. A proof of this is given in the Chapter 9 Appendix of Veach's thesis.

We can interpret the balance heuristic $(7)$ in a more natural way by rewriting it slightly. Let $c_j = n_j/N$, i.e the fraction of samples given to technique $j$. Then:

\begin{aligned} \hat{I}N &= \sum{j=1}^{K}\sum_{i=1}^{n_j}\frac{g(X_{j,i})h(X_{j,i})}{\sum_{k=1}^{K} n_kp_k(X_{j,i})} \ &= \frac{1}{N}\sum_{i=1}^{n_j}\sum_{j=1}^{K}\frac{g(X_{j,i})h(X_{j,i})}{\sum_{k=1}^{K} c_kp_k(X_{j,i})} \end{aligned}

Note that I have swapped the order of summation on the second line. One may notice that this resembles a standard Monte Carlo estimator with probability density function $\hat{p}(x) = \sum_j c_jp_j(x)$ which Veach calls the 'combined sample density'. This represents the distribution of a random variable that is equal to each $X_{j,i}$ with probability $1/N$.

### Cutoff Heuristic The balance heuristic can introduce unnecessary variance in the case that one of the sampling techniques is a very close match to the integrand (this is closer to what typically happens in computer graphics). The reason for this is explored in more detail in Veach's thesis on pages 270-273.

The way to address this is to 'sharpen' the weighting functions - i.e increase the value of high weights while decreasing the value of low weights. For brevity,we will drop the argument $x$ and write $q_j = n_jp_j$. Veach defines the cutoff heuristic with parameter $0 \le \alpha \le 1$ by:

$w_j(x) = \begin{cases} \frac{q_j}{\sum_{k=1}^{N}{q_k | q_k \ge \alpha q_{\text{max}}}} &\text{if } q_j \ge \alpha q_{\text{max}} \ 0 &\text{otherwise} \end{cases}$

This has the effect of 'cutting off' any samples that are generated with a PDF less than $\alpha$ (and therefore increasing the value of the other weights so as to sum to 1). In the case of $\alpha = 0$ this reduces to the balance heuristic.

### Power Heuristic The power heuristic with parameter $\beta \ge 1$ is defined as:

\begin{aligned} w_j(x) = \frac{q_j^\beta}{\sum_{k=1}^{K} q_k^\beta} \end{aligned}

In the case of $\beta = 1$, this also reduces to the balance heuristic. For larger values of $\beta$, more sharpening occurs. Veach determined empirically that the value $\beta = 2$ gives good results in typical scenes.

## Conclusion Although I have not directly linked multiple importance sampling to computer graphics (see this PBRT section for a brief summary), MIS is such a crucial technique for reducing variance in offline rendering that it was mentioned as a reason for Eric Veach's technical Oscar. You might be amused to watch the video of him going up on stage and receiving the award from Michael B. Jordan and Kristen Bell, who clearly have no understanding of the words they are reading off the teleprompter.

### Further Reading There is a reasonable amount of literature surrounding MIS that is worthy of note. Cornuet et al. (2012) describe a method for adaptive multiple importance sampling that effectively modify the weights according to samples obtained at runtime. Kondapaneni et al. (2019) revisit some of Veach's derivations, showing that the variance of the balance estimator can be improved upon further than the bound given by Veach if negative weights are allowed. They also describe a technique for estimating the most optimal MIS weights, since they cannot be computed directly.

Sbert et al. (2016) define an estimator based on the 'one sample model' of MIS that Veach describes but I have omitted. This involves randomly selecting one of the proposal distributions to sample from, rather than sampling from them all at the same time and combining their contributions. Their method is provably better than the balance heuristic for the same number of samples.