In the previous article I described Importance Sampling, a method for reducing the variance of a Monte Carlo estimator. In this article I will discuss a technique called Multiple Importance Sampling which allows us to combine samples from multiple different probability distributions that we think match the shape of the integrand, reducing variance without introducing bias.

Let's say we're trying to estimate the integral of a product of two functions, $I = \int g(x)h(x)\mathrm{d}x$. Suppose we have two probability density functions, $p_g(x)$ and $p_h(x)$, which have a similar shape to $g$ and $h$ respectively. It is not obvious how we should sample and weight the random variables to construct an unbiased estimator of $I$ given this information.

Trying Some Stuff

One way to do this would be to importance sample strictly according to $p_g$, completely ignoring $p_h$:

$$\begin{aligned} \tag{1} \hat{I}_N = \frac{1}{N}\sum_{i=1}^{N}\frac{g(X_i)h(X_i)}{p_{g}(X_i)} \end{aligned}$$

This works well if $h$ is approximately constant over the domain, as the shape of the integrand $g(x)h(x)$ will then closely match the shape of $g(x)$. However, if this assumption does not hold, the estimator can suffer from high variance.

Another idea would be to construct a probability density function $p_{gh}(x) \propto p_g(x)p_h(x)$, and use importance sampling as before, picking the $X_i$ according to $p_{gh}$:

$$\begin{aligned} \tag{2} \hat{I}_N = \frac{1}{N}\sum_{i=1}^{N}\frac{g(X_i)h(X_i)}{p_{gh}(X_i)} \end{aligned}$$

This is totally reasonable, but in many cases it is not computationally feasible to sample random variables from $p_{gh}$. Typically, sampling random variables from a given PDF involves inverse transform sampling, which is only possible if you can compute the inverse CDF.
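As a quick illustration of inverse transform sampling, here is a minimal Python sketch. The target density $p(x) = 3x^2$ on $[0, 1]$ is my own toy choice (it does not appear in the text), picked only because its CDF is trivial to invert:

```python
import numpy as np

# Target PDF p(x) = 3x^2 on [0, 1]; its CDF is F(x) = x^3,
# so the inverse CDF is F^{-1}(u) = u^(1/3).
rng = np.random.default_rng(0)
u = rng.random(100_000)        # U ~ Uniform(0, 1)
x = u ** (1.0 / 3.0)           # X = F^{-1}(U) is distributed with density p

print(x.mean())                # E[X] = \int_0^1 x * 3x^2 dx = 0.75
```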

Instead of trying to construct a single PDF, we can sample from each distribution separately and average the two resulting estimators:

$$\begin{aligned} \tag{3} \hat{I}_N = \frac{1}{2N}\sum_{i=1}^{N}\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + \frac{1}{2N}\sum_{i=1}^{N}\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})} \end{aligned}$$

where the $X_{g,i}$ are picked according to $p_g$, and the $X_{h,i}$ according to $p_h$. Each of the two halved sums is a Monte Carlo estimator with expected value $I/2$, so their sum is an unbiased estimator of $I$. Note that this means we are now evaluating the integrand $2N$ times, with $N$ samples being taken according to each of $p_g$ and $p_h$.
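To make estimator $(3)$ concrete, here is a minimal Python sketch on a toy problem of my own choosing: $g(x) = 2x$ and $h(x) = 3x^2$ on $[0, 1]$, so that $p_g$ and $p_h$ can be taken exactly proportional to $g$ and $h$, and $I = \int_0^1 6x^3\,\mathrm{d}x = 1.5$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Toy integrand: g(x) = 2x, h(x) = 3x^2 on [0, 1], so I = 1.5.
g   = lambda x: 2.0 * x
h   = lambda x: 3.0 * x**2
p_g = lambda x: 2.0 * x        # CDF x^2, inverse CDF sqrt(u)
p_h = lambda x: 3.0 * x**2     # CDF x^3, inverse CDF u^(1/3)

X_g = np.sqrt(rng.random(N))          # N samples distributed according to p_g
X_h = rng.random(N) ** (1.0 / 3.0)    # N samples distributed according to p_h

# Equation (3): average the two importance-sampling estimators.
I_hat = 0.5 * np.mean(g(X_g) * h(X_g) / p_g(X_g)) \
      + 0.5 * np.mean(g(X_h) * h(X_h) / p_h(X_h))
print(I_hat)   # should be close to 1.5
```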

The problem with this method is that it can lead to unnecessarily high variance. To see this, consider the variance of the estimator:

$$\begin{aligned} \mathrm{Var}[\hat{I}_N] &= \frac{1}{4N^2}\sum_{i=1}^{N}\mathrm{Var}\left[\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + \frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})}\right] \\ &= \frac{1}{4N}\left(\mathrm{Var}\left[\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})}\right] + \mathrm{Var}\left[\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})}\right]\right) \end{aligned}$$

Since the variances are additive, if either $p_g$ or $p_h$ is a bad match for the integrand, we will get high variance, even if the other PDF is a perfect match.

Generalizing

Let's try and think about the problem a bit more generally. The last technique was definitely along the right lines.

Weights

It seems reasonable to explore estimators of the form

$$\begin{aligned} \tag{4} \hat{I}_N = \frac{1}{N}\sum_{i=1}^{N}w_g(X_{g,i})\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + \frac{1}{N}\sum_{i=1}^{N}w_h(X_{h,i})\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})} \end{aligned}$$

where $w_g(x), w_h(x)$ are some 'weighting' functions. In strategy $(1)$, we tried $w_g(x) = 1, w_h(x) = 0$, and in strategy $(3)$ we tried $w_g(x) = w_h(x) = 1/2$.

The notation at this point is starting to get a bit heavy, so try to keep in mind that we're looking for some expression $\hat{I}_N$ such that $\mathbb{E}[\hat{I}_N] = \int g(x)h(x)\mathrm{d}x$. Clearly not all choices of weighting functions leave us with an unbiased estimator. To see what constraints are required on the weighting functions to keep the estimator unbiased, consider the expected value of $(4)$:

$$\begin{aligned} \mathbb{E}[\hat{I}_N] &= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[w_g(X_{g,i})\frac{g(X_{g,i})h(X_{g,i})}{p_g(X_{g,i})} + w_h(X_{h,i})\frac{g(X_{h,i})h(X_{h,i})}{p_h(X_{h,i})}\right] \\ &= \int w_g(x)\frac{g(x)h(x)}{p_g(x)}p_g(x) + w_h(x)\frac{g(x)h(x)}{p_h(x)}p_h(x)\mathrm{d}x \\ &= \int \left(w_g(x) + w_h(x)\right)g(x)h(x)\mathrm{d}x \end{aligned}$$

Therefore we can see that we require $\forall x,\ w_g(x) + w_h(x) = 1$ for the estimator to remain unbiased. It is also required that $w_g(x) = 0$ whenever $p_g(x) = 0$, so as to avoid a division by zero when evaluating the first term of $\hat{I}_N$ (and likewise for $w_h(x)$ and the second term).

Sampling Techniques

So far we've been sampling from two PDFs (let's call each of these a 'sampling technique'), but we can generalise this. Let $K$ be the number of sampling techniques, with PDFs $p_1, p_2, \dots, p_K$ that each match the shape of the integrand $g(x)h(x)$ in some way. Let $X_{j,i}$ be sampled according to $p_j$. Given some set of weighting functions $w_j$, we can then rewrite $(4)$ as:

$$\begin{aligned} \tag{5} \hat{I}_N = \sum_{j=1}^{K}\frac{1}{N}\sum_{i=1}^{N}w_j(X_{j,i})\frac{g(X_{j,i})h(X_{j,i})}{p_j(X_{j,i})} \end{aligned}$$

Note that the order of summation is not important here. We still require the two properties from before: $\forall x,\ \sum_{j=1}^{K} w_j(x) = 1$, and $p_j(x) = 0 \implies w_j(x) = 0$.

Sample Counts

So far we've been taking exactly $N$ samples from each sampling technique, resulting in $KN$ samples in total. This can be generalised to allow more or fewer samples for each sampling technique.

Let $n_j$ be the number of samples taken for sampling technique $j$ (so that the total number of samples used by the estimator is $\sum_j n_j = N$). Then:

$$\begin{aligned} \tag{6} \hat{I}_N = \sum_{j=1}^{K}\frac{1}{n_j}\sum_{i=1}^{n_j}w_j(X_{j,i})\frac{g(X_{j,i})h(X_{j,i})}{p_j(X_{j,i})} \end{aligned}$$

The proof that this results in an unbiased estimator is trivial and left as an exercise to the reader.
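As a sketch of what equation $(6)$ looks like in code, here is one possible (and entirely illustrative) Python helper; the name `mis_estimate` and its argument layout are my own, not from any particular renderer. The usage example reuses the toy integrand from earlier with the constant 50/50 weights of strategy $(3)$:

```python
import numpy as np

def mis_estimate(f, samplers, pdfs, weights, counts, rng):
    """Evaluate estimator (6).

    f           -- the integrand g(x)h(x)
    samplers[j] -- draws n samples from p_j: samplers[j](n, rng) -> array
    pdfs[j]     -- evaluates p_j(x)
    weights[j]  -- evaluates w_j(x); the w_j must sum to 1 at every x
    counts[j]   -- n_j, the number of samples for technique j
    """
    total = 0.0
    for sample, p, w, n in zip(samplers, pdfs, weights, counts):
        x = sample(n, rng)                      # X_{j,1}, ..., X_{j,n_j} ~ p_j
        total += np.mean(w(x) * f(x) / p(x))    # (1/n_j) * sum_i w_j f / p_j
    return total

# Example: g(x) = 2x, h(x) = 3x^2 on [0, 1], true value I = 1.5.
rng = np.random.default_rng(0)
f = lambda x: (2.0 * x) * (3.0 * x**2)
samplers = [lambda n, r: np.sqrt(r.random(n)),    # inverse-CDF sampling of p_g
            lambda n, r: r.random(n) ** (1/3)]    # inverse-CDF sampling of p_h
pdfs     = [lambda x: 2.0 * x, lambda x: 3.0 * x**2]
weights  = [lambda x: np.full_like(x, 0.5), lambda x: np.full_like(x, 0.5)]
print(mis_estimate(f, samplers, pdfs, weights, [5_000, 5_000], rng))  # ~1.5
```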

Weighting Heuristics

Now that I've introduced the general MIS formulation, we can discuss some different choices of weighting functions.

Balance Heuristic

Veach determined that the following choice of weights is reasonable:

$$\begin{aligned} w_j(x) = \frac{n_jp_j(x)}{\sum_{k=1}^{K} n_kp_k(x)} \end{aligned}$$

When inserted into the MIS definition $(6)$, this gives:

$$\begin{aligned} \hat{I}_N &= \sum_{j=1}^{K}\frac{1}{n_j}\sum_{i=1}^{n_j}\left(\frac{n_jp_j(X_{j,i})}{\sum_{k=1}^{K} n_kp_k(X_{j,i})}\frac{g(X_{j,i})h(X_{j,i})}{p_j(X_{j,i})}\right) \\ \tag{7} &= \sum_{j=1}^{K}\sum_{i=1}^{n_j}\frac{g(X_{j,i})h(X_{j,i})}{\sum_{k=1}^{K} n_kp_k(X_{j,i})} \end{aligned}$$
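Here is a minimal Python sketch of $(7)$, again on the assumed toy integrand $g(x) = 2x$, $h(x) = 3x^2$ on $[0, 1]$ (so $I = 1.5$). Note that every sample is divided by the same combined denominator $\sum_k n_k p_k$, no matter which technique produced it:

```python
import numpy as np

rng = np.random.default_rng(0)
n_g, n_h = 5_000, 5_000

f   = lambda x: (2.0 * x) * (3.0 * x**2)   # the integrand g(x)h(x)
p_g = lambda x: 2.0 * x
p_h = lambda x: 3.0 * x**2

X_g = np.sqrt(rng.random(n_g))             # samples drawn from p_g
X_h = rng.random(n_h) ** (1.0 / 3.0)       # samples drawn from p_h

# Equation (7): no 1/n_j factor; the sample counts live in the denominator.
I_hat = np.sum(f(X_g) / (n_g * p_g(X_g) + n_h * p_h(X_g))) \
      + np.sum(f(X_h) / (n_g * p_g(X_h) + n_h * p_h(X_h)))
print(I_hat)   # should be close to 1.5
```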

Veach shows that (assuming non-negative weights) the balance heuristic is close to optimal. Specifically, with the balance heuristic:

$$\begin{aligned} \mathrm{Var}[\hat{I}_N] - \mathrm{Var}[\hat{I}_N^{\mathrm{opt}}] \le \left(\frac{1}{\min_j n_j} - \frac{1}{\sum_j n_j}\right)I^2 \end{aligned}$$

That is to say, writing $\hat{I}_N^{\mathrm{opt}}$ for the estimator that uses the optimal choice of weights, the variance of the balance heuristic exceeds the optimal variance by at most the term on the right-hand side. A proof of this is given in the appendix to Chapter 9 of Veach's thesis.

We can interpret the balance heuristic $(7)$ in a more natural way by rewriting it slightly. Let $c_j = n_j/N$, i.e. the fraction of samples given to technique $j$. Then:

$$\begin{aligned} \hat{I}_N &= \sum_{j=1}^{K}\sum_{i=1}^{n_j}\frac{g(X_{j,i})h(X_{j,i})}{\sum_{k=1}^{K} n_kp_k(X_{j,i})} \\ &= \frac{1}{N}\sum_{j=1}^{K}\sum_{i=1}^{n_j}\frac{g(X_{j,i})h(X_{j,i})}{\sum_{k=1}^{K} c_kp_k(X_{j,i})} \end{aligned}$$

Note that on the second line I have simply divided the denominator through by $N$, using $n_k = c_kN$. One may notice that the result resembles a standard Monte Carlo estimator over all $N$ samples with probability density function $\hat{p}(x) = \sum_j c_jp_j(x)$, which Veach calls the 'combined sample density'. This represents the distribution of a random variable that is equal to each $X_{j,i}$ with probability $1/N$.

Cutoff Heuristic

The balance heuristic can introduce unnecessary variance when one of the sampling techniques is a very close match to the integrand (which is closer to what typically happens in computer graphics). The reason for this is explored in more detail in Veach's thesis on pages 270-273.

The way to address this is to 'sharpen' the weighting functions, i.e. increase the value of high weights while decreasing the value of low weights. For brevity, we will drop the argument $x$ and write $q_j = n_jp_j$. Veach defines the cutoff heuristic with parameter $0 \le \alpha \le 1$ by:

$$w_j(x) = \begin{cases} \frac{q_j}{\sum_{\{k \,:\, q_k \ge \alpha q_{\text{max}}\}} q_k} &\text{if } q_j \ge \alpha q_{\text{max}} \\ 0 &\text{otherwise} \end{cases}$$

where $q_{\text{max}} = \max_k q_k$.

This has the effect of 'cutting off' the contribution of any technique whose $q_j$ falls below the fraction $\alpha$ of the largest $q_k$ at that point (and increasing the value of the other weights so that they still sum to 1). In the case of $\alpha = 0$ this reduces to the balance heuristic.
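A hedged sketch of what the cutoff weights might look like in code, for a single point $x$ at which all the $q_j = n_jp_j(x)$ have already been evaluated (the function name and argument layout are my own):

```python
import numpy as np

def cutoff_weights(q, alpha):
    """Cutoff-heuristic weights w_j(x), given q_j = n_j * p_j(x) at one point x."""
    q = np.asarray(q, dtype=float)
    keep = q >= alpha * q.max()     # techniques that survive the cutoff
    w = np.where(keep, q, 0.0)
    return w / w.sum()              # renormalise so the weights sum to 1

print(cutoff_weights([0.05, 0.8, 1.0], alpha=0.1))   # first technique cut to 0
print(cutoff_weights([0.05, 0.8, 1.0], alpha=0.0))   # balance heuristic weights
```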

Power Heuristic

The power heuristic with parameter $\beta \ge 1$ is defined as:

$$\begin{aligned} w_j(x) = \frac{q_j^\beta}{\sum_{k=1}^{K} q_k^\beta} \end{aligned}$$

In the case of $\beta = 1$, this also reduces to the balance heuristic. For larger values of $\beta$, more sharpening occurs. Veach determined empirically that the value $\beta = 2$ gives good results in typical scenes.
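For the common two-technique case, the power heuristic is often written as a small helper along the following lines (this sketch is mine; PBRT ships a similarly-shaped `PowerHeuristic` function with $\beta = 2$ baked in):

```python
def power_heuristic(n_f, pdf_f, n_g, pdf_g, beta=2.0):
    """Power-heuristic weight for a sample drawn from technique f,
    given both sample counts and both PDFs evaluated at that sample."""
    q_f = n_f * pdf_f
    q_g = n_g * pdf_g
    return q_f**beta / (q_f**beta + q_g**beta)

print(power_heuristic(1, 0.5, 1, 0.25))   # 0.25 / (0.25 + 0.0625) = 0.8
```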

Conclusion

Although I have not directly linked multiple importance sampling to computer graphics (see this PBRT section for a brief summary), MIS is such a crucial technique for reducing variance in offline rendering that it was mentioned as a reason for Eric Veach's technical Oscar. You might be amused to watch the video of him going up on stage and receiving the award from Michael B. Jordan and Kristen Bell, who clearly have no understanding of the words they are reading off the teleprompter.

Further Reading

There is a reasonable amount of literature surrounding MIS that is worthy of note. Cornuet et al. (2012) describe a method for adaptive multiple importance sampling that effectively modifies the weights according to samples obtained at runtime. Kondapaneni et al. (2019) revisit some of Veach's derivations, showing that the variance of the balance heuristic can be improved beyond the bound given by Veach if negative weights are allowed. They also describe a technique for estimating the optimal MIS weights, since they cannot be computed directly.

Sbert et al. (2016) define an estimator based on the 'one sample model' of MIS, which Veach describes but which I have omitted here. This involves randomly selecting one of the proposal distributions to sample from, rather than sampling from all of them and combining their contributions. Their method is provably better than the balance heuristic for the same number of samples.

References