I know that the way I create the alphas is not terribly efficient, but the main bottleneck in this code is in what I call the "Maximization step" (see the comments in the code). There are two double for-loops (with list comprehensions inside), and the reason for this is that the vector computed in the first double loop is needed in the second one. To back up this claim, I timed the different parts of the code:

    Finishing the vector alphas took 0.1451436 s.
    Finishing the first bottleneck for loops took 13.7402911 s
    Finishing the second bottleneck for loops took 13.1357222 s
    A single complete loop in the while step took 26.9219609 s

There should be a more numpy-focused way to update the three vectors mu, pi and sigma_squared; I just can't seem to figure it out. Any help to make this quicker would be greatly appreciated.

The code implements an EM algorithm whose purpose is to find better estimates (or better yet, the MLE) of the parameters \$\mu, \sigma, \pi\$ of a Gaussian mixture model. If you want to understand where the calculations come from, I think the easiest way is to give a short description of the algorithm and the formulas below.
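
In short, these are the standard textbook EM updates for a univariate Gaussian mixture with \$K\$ components and observations \$x_1, \dots, x_N\$, where \$\gamma_{ik}\$ denotes the responsibility of component \$k\$ for observation \$x_i\$ (the notation may differ slightly from the write-up I am following, but the updates are the same). Expectation step:

\$\$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}\$\$

Maximization step:

\$\$N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad \mu_k = \frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}\, x_i, \qquad \sigma_k^2 = \frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}\,(x_i - \mu_k)^2, \qquad \pi_k = \frac{N_k}{N}\$\$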

2 Answers

Congratulations to you for finding a major bottleneck yourself. There are a few more to get rid of.

Generally speaking, it's quite a performance killer to convert data between Python and numpy repeatedly, and you do that a lot. You even do it to determine the number of iterations for some loops, e.g. in for j in range(np.array(alphas).shape[1]):. Since alphas is created beforehand and never expanded again, the first step would be to convert it into a numpy array once. But there is no real need to query its shape at all, since you know exactly what the shape of alphas is: it's (range_of_i, range_of_m). So you convert \$255 \times 100\$ elements to numpy just to get range_of_m as the answer. Not really efficient, is it?
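
A minimal sketch of that first change (assuming range_of_m is already available from the setup, as described above):

    import numpy as np

    alphas = np.array(alphas)   # convert once, right after alphas has been built

    # the shape is known in advance, so there is no need to ask numpy for it
    for j in range(range_of_m):
        ...                     # loop body unchanged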

In addition, Python is also often not really fast at loops. There is a great talk by Jake VanderPlas from PyCon 2015 titled "Losing your Loops - Fast Numerical Computing with NumPy" (to be found on YouTube) on exactly that topic. I highly recommend watching it!

After the general remarks, let's start from the beginning:

When you generate alphas, you draw samples from no_distributions distributions and then discard all but one value, if I'm not terribly mistaken. So why not choose the distribution beforehand and sample only from that one?
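
A sketch of how that generation step could look; the *_true variables are placeholders for whatever parameters you actually use to generate the data:

    import numpy as np

    # placeholder values -- substitute your own setup
    range_of_i, range_of_m = 255, 100
    no_distributions = 3
    mu_true = np.array([-2.0, 0.0, 3.0])
    sigma_true = np.array([1.0, 0.5, 1.5])
    pi_true = np.array([0.3, 0.5, 0.2])

    alphas = np.empty((range_of_i, range_of_m))
    for i in range(range_of_i):
        # pick the generating component for every entry of this row first ...
        chosen = np.random.choice(no_distributions, size=range_of_m, p=pi_true)
        # ... and then draw exactly one sample from that component
        alphas[i] = [np.random.normal(mu_true[k], sigma_true[k]) for k in chosen]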

This code snippet also includes the idea of having alphas as a numpy array directly. You could also keep the original approach of building a list of lists and converting that afterwards, but doing it directly saves a few variables. I also opted to replace the inner loop with a list comprehension. Maybe there is even a clever way to get rid of the outer loop (without using a list comprehension!), but I'll leave that as an exercise for you.

Next up on the list is the expectation step. Again, there is a funny mix-up of Python and numpy/scipy code here. Although I have the strong feeling that there has to be a better solution, the following is the best I could come up with at the moment:
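
(What follows is a simplified sketch of that idea rather than the exact snippet: it flattens the data to one long 1D sample and lets broadcasting do the work; densities and responsibilities are illustrative names.)

    import numpy as np
    from scipy.stats import norm

    # assumes: alphas is the (range_of_i, range_of_m) data array from above, and
    # mu, sigma_squared, pi are the current estimates, one entry per component
    x = alphas.ravel()

    # broadcasting an (n, 1) column of data against length-no_distributions
    # parameter vectors yields an (n, no_distributions) matrix of densities
    densities = norm.pdf(x[:, np.newaxis], loc=mu, scale=np.sqrt(sigma_squared))

    weighted = pi * densities                    # prior weight times likelihood
    responsibilities = weighted / weighted.sum(axis=1, keepdims=True)

With responsibilities available as one 2D array, the sums needed in the maximization step turn into plain axis-wise reductions such as responsibilities.sum(axis=0).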

Explaining the full beauty of this is beyond the scope of what I can do here. I highly recommend reading the chapter "Fancy Indexing" from the Python Data Science Handbook by Jake VanderPlas if you want to learn more. What this does, in essence, is move those pesky loops into numpy's C back end. You can check that the results really match by pasting both pieces of code into the same file and comparing them with np.allclose(...).
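
For example (responsibilities_loop and responsibilities_vectorized being placeholder names for the results of the two variants):

    # both variants should agree up to floating point tolerance
    assert np.allclose(responsibilities_loop, responsibilities_vectorized)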

Results

So what did we win by jumping through all those hoops and losing all those loops?

Your proposed solution, which moved the sum(sum(...)) out of the loop, takes around \$5\,\mathrm{s}\$ to converge here on my machine. With the code above in its most loop-reduced form, the algorithm finishes in about \$0.2\,\mathrm{s}\$ to \$0.25\,\mathrm{s}\$ with the same result (not exactly the same, because the random numbers also have a little impact here). That's a speed-up of 20-25x. I can live with that for the moment ;-). (Note: the timing results only cover the content of the while loop, not the initialization.)

I'm actually not 100% sure that the implementation is fully correct, but the new implementation is at least as correct as the original one. If you find an error in my code, it already existed in the original solution, because I tested your solution and mine for agreement ;-)

I had to study this answer for a few hours before understanding it, I learned so much! Can't thank you enough, I'm gonna try using numpy for everything from here on out.
– armara, Jul 12 at 10:27

Welcome to the world of numpy ;-) I also still learn new things while working on different projects. A word of warning though: numpy is not a magic bullet! There are cases where numpy can slow your code down. But whenever you can use it to vectorize your code to get rid of nested loops as presented above, the performance will almost always be significantly better.
– AlexV, Jul 12 at 11:57

I searched around this forum a bit more and found the advice to use the package cProfile. This helped me locate the problem: the call sum(sum(indicator_normalized)) took more than 95% of the total running time of the complete algorithm. I simply moved the sum out of the loops, and the total running time went down to 1/19 of what it was before.
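
In case someone wants to reproduce this, the profiling itself is short (main() is a placeholder for whatever entry point runs the EM loop):

    import cProfile

    # sort by cumulative time to see which call dominates the runtime
    cProfile.run("main()", sort="cumulative")

Running python -m cProfile -s cumulative your_script.py from the command line gives the same report without touching the code.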

So the solution was to compute this once before the maximization step and then use the variable inside the for loops:
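
A sketch of that change (normalizer is just the name I use for the new variable here):

    # computed once per EM iteration, before the maximization loops
    normalizer = sum(sum(indicator_normalized))

    # ... and inside the maximization for-loops, normalizer is used instead of
    # calling sum(sum(indicator_normalized)) again and again

If indicator_normalized is a numpy array, np.sum(indicator_normalized) does the same thing in a single call.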