I don't think the optimization in the post has anything to do with monoids. It doesn't work on some monoids (like the integers under addition), and it does work on many things that aren't monoids (like toggling the accumulator whenever it equals whether or not the current item is prime, an operation that isn't even of the form S x S -> S).

Monoids can in fact be parallelized trivially, but that's not what the post is about. It's using a different property:

parallelizing accumulation when the accumulator only ever takes on a small number of states
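
To make that concrete, here is a rough Python sketch of the small-state idea, using the prime-toggling accumulator mentioned above. It is only my reading of the technique, not code from the post: the names `step`, `summarize`, and `chunked_accumulate` are made up for illustration, and the chunks are processed in a plain loop where a real version would hand them to separate workers. Each chunk is summarized as a table mapping every possible starting state to the state it would end in, and the tables are then chained in order.

```python
from functools import reduce

# The accumulator is a boolean, so there are only two possible states.
STATES = (False, True)


def is_prime(n):
    """Trial-division primality test, good enough for a small demo."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def step(acc, item):
    """Toggle the accumulator when it equals is_prime(item).

    Note the signature is (bool, int) -> bool, not S x S -> S.
    """
    return (not acc) if acc == is_prime(item) else acc


def sequential_accumulate(items, start):
    """Reference implementation: a plain left fold."""
    return reduce(step, items, start)


def summarize(chunk):
    """Run the chunk once per possible starting state.

    The result is a table {start_state: end_state}. It needs no knowledge
    of earlier chunks, so every chunk can be summarized independently.
    """
    return {s: reduce(step, chunk, s) for s in STATES}


def chunked_accumulate(items, start, chunk_size):
    """Summarize chunks independently, then chain the tables in order."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    # Independent work: this loop is the part that could run in parallel.
    tables = [summarize(chunk) for chunk in chunks]
    acc = start
    for table in tables:
        acc = table[acc]  # one lookup per chunk instead of one step per item
    return acc
```

The cost is doing the work once per state in each chunk, which is why it only pays off when the number of states is small. With the integers under addition there would be a table entry for every integer, so this particular trick doesn't apply there.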