Is MultinormalDistribution[] efficient and easy to use for high dimensions?

I have a variable $n$ representing the dimension of a Monte Carlo integration I do on a multivariate Gaussian copula, where typical values of $n$ are near 100. I am using a simple correlation matrix made from ConstantArray[ρ, {n,n}] (except on the diagonal, which is 1).

For now I simply wrote a function that calls RandomVariate[NormalDistribution[0,1], {n,m}], generates the correlation matrix, calculates its Cholesky decomposition, and then multiplies to obtain the correlated multivariate normal samples. It works fine and was easy to code and understand.
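The approach just described can be sketched as follows (the values of n, m, and ρ here are illustrative placeholders, not taken from the question):

```mathematica
n = 100; m = 10^5; ρ = 0.4;  (* illustrative values *)
(* equicorrelation matrix: ρ off the diagonal, 1 on it *)
Σ = ConstantArray[ρ, {n, n}] + (1 - ρ) IdentityMatrix[n];
U = CholeskyDecomposition[Σ];  (* upper triangular, Transpose[U].U == Σ *)
z = RandomVariate[NormalDistribution[0, 1], {m, n}];
samples = z.U;  (* each row is one correlated multivariate normal draw *)
```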

However, generating correlated multivariate $t$ samples is more involved, so it would be convenient to just drop in a call to MultivariateTDistribution[]. That leaves me with two issues:

Setting up MultinormalDistribution[] with an arbitrary variable count is hard because it seems to want variable names, which I am having trouble generating programmatically.

I am not sure that the internals of RandomVariate on MultinormalDistribution[] and MultivariateTDistribution[] are set up to efficiently obtain high-dimensional sample sets.

If the efficiency is not expected to be that great, I will stick with my current approach. Otherwise, I would appreciate advice on using MultinormalDistribution in this high-dimensional context, since it would be worth investing the time in this more elegant and Mathematica-like approach.

I was confused by the documentation examples all having variable names. I now understand better!
–
Brian B Mar 12 '12 at 20:32

It looks like Parallelize may be a smart thing to use here as well.
–
Brian B Mar 12 '12 at 21:01

@BrianB You can also try using the MKL RNG (SeedRandom[Method -> "MKL"]) for better performance if you're on Windows or Linux. Parallelize can be tricky: it'll generally only parallelize certain functional constructs (Map, Table, etc.), and the evaluations run in separate processes (not threads), so communication overhead can be significant. It is very nice and easy to use, though, when the evaluations that run in parallel each take a long time and are independent of each other.
–
Szabolcs Mar 13 '12 at 11:03

The Monte Carlo simulation is of course embarrassingly parallel. I ended up finding that Parallelize worked well if and only if I specified Method -> "CoarsestGrained". But in that case it was brilliant.
–
Brian B Mar 13 '12 at 15:43

2 Answers

Multivariate distributions do not require any named variables. My hunch is that your confusion in this regard is due to the excessive use of {{1, ρ}, {ρ, 1}} in the documentation for MultivariateTDistribution. You might've missed the fact that in most cases, the value for ρ is being substituted via a Table or a ParallelTable.

You can directly input your covariance matrix (and any additional parameters depending on the distribution) and call RandomVariate to draw from that. For your example:
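A minimal sketch (ρ, the dimension, and the sample count below are illustrative assumptions, not values from the question):

```mathematica
n = 100; ρ = 0.4;
Σ = ConstantArray[ρ, {n, n}] + (1 - ρ) IdentityMatrix[n];
samples = RandomVariate[MultinormalDistribution[ConstantArray[0, n], Σ], 10^5];
(* the multivariate t case is just as direct, with a degrees-of-freedom parameter ν: *)
tSamples = RandomVariate[MultivariateTDistribution[Σ, 5], 10^5];
```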

In principle, one call to RandomVariate will incur some overhead to factor the matrix (a Cholesky-type decomposition) and then cost a fixed amount of time per output vector. Let's see, by first creating some data to describe such a distribution:
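For instance (the matrix below is an assumed stand-in for the one actually timed; the 0.687-second figure quoted next was measured on the answerer's machine, so your timing will differ):

```mathematica
n = 100; ρ = 0.4;
Σ = ConstantArray[ρ, {n, n}] + (1 - ρ) IdentityMatrix[n];  (* equicorrelation matrix *)
μ = ConstantArray[0, n];
(* time the generation of 10^5 vectors of dimension 100 *)
AbsoluteTiming[RandomVariate[MultinormalDistribution[μ, Σ], 10^5];]
```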

The response is 0.687 seconds: that is, we can get about 150,000 100-vectors per second. Not shabby and unlikely to be bettered by other approaches. (MultinormalDistribution appears to be parallelized automatically, so the timing is the sum of (a) pre-processing overhead, including RAM allocation for the output, all performed by one core, and (b) generation of random values, which appears to be performed in parallel by all the available (licensed) cores. My timings reflect a four-core license.)

Incidentally, because there must be some matrix multiplication going on for each output vector, we ought to expect slightly superlinear scaling with dimension. E.g., generating 10,000 1000-vectors produces the same quantity of output numbers (i.e., $10^7$ of them in both cases) but requires 1.30 seconds, about twice as long.
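The comparison can be reproduced along these lines (same total output as before, ten times the dimension; ρ is again an illustrative value):

```mathematica
n = 1000; ρ = 0.4;
Σ = ConstantArray[ρ, {n, n}] + (1 - ρ) IdentityMatrix[n];
(* 10^4 vectors of dimension 1000: still 10^7 output numbers in all *)
AbsoluteTiming[
 RandomVariate[MultinormalDistribution[ConstantArray[0, n], Σ], 10^4];]
```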

I don't think the Parallel Computing Tools are used by default in any built-in Mathematica functions. However, some of them do have low-level implementations that take advantage of multiple cores using a single kernel only (e.g. LinearSolve). (This is different from the high-level parallelization of the PCT.) Do licenses matter in this case as well? I thought licensing restricted the number of kernels that can be run in parallel, but not how a single kernel can use multiple cores.
–
Szabolcs Mar 13 '12 at 11:08

I don't know the answer to Mathematica licensing questions. What I do know is that (a) I have a license for 4 cores; (b) I have 8 available cores; (c) in repeatedly running this code, I observed a period in which only one core was used followed by another in which four cores were used, but never more than four.
–
whuber Mar 13 '12 at 14:37

Do you really have 8 cores, or do you have a CPU with hyperthreading? In either case, you can try SetSystemOptions["ParallelOptions" -> "ParallelThreadNumber" -> 8]. Just make sure you set it back to 4 in case it would break Parallelize because of the licensing restrictions. (Anyway, it should reset to 4 after a kernel restart, and it should not influence the number of parallel kernels launched.)
–
Szabolcs Mar 13 '12 at 15:02

Thanks for the suggestions. You're correct; it's four physical cores with hyperthreading (Xeon 3580). Somehow I have six kernels launched at the moment. Only one was utilized in the RandomVariate call. Ultimately it was able to use only 50% of total resources even after increasing ParallelThreadNumber. Surprisingly, explicitly parallelizing this call (via ParallelTable) degrades performance: although four kernels are invoked for the initial part of the calculation, four times as much RAM is allocated and the total operation takes six times as long.
–
whuber Mar 13 '12 at 16:57

Note that your code only uses a single kernel process, even though it can utilize more than one core. When you use ParallelTable, that launches multiple kernel processes, and uses a completely different sort of parallelization, which happens at the level of Mathematica code, and is not as efficient as whatever MultinormalDistribution uses internally. Actually the parallel computing tools (which provide ParallelTable) are fully implemented in Mathematica, and just use MathLink for communication between kernel processes. (The full source code is available)
–
Szabolcs Mar 13 '12 at 17:04
