SQLIO tool: differing output with multiple threads vs. async requests

This is on a Windows Server 2008 R2 machine. I'm trying to benchmark my SAN, both on NTFS and on the raw device directly.
I see a huge difference in throughput between "-t32 -o1" and "-t1 -o32", the latter being significantly faster. No such difference in IOPS is seen when testing the raw device directly.
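
For reference, as I understand the SQLIO flags, -t is the number of threads and -o is the number of outstanding I/Os per thread (this is my reading, so treat it as an assumption). If so, both runs should keep the same number of requests in flight, which is why the gap surprises me. A quick sketch of the arithmetic in Python, where `total_outstanding` is just a hypothetical helper name and not part of any tool:

```python
def total_outstanding(threads: int, per_thread: int) -> int:
    """Total I/Os in flight, assuming each thread keeps `per_thread` requests queued."""
    return threads * per_thread


# The two configurations from the runs above: (threads, outstanding I/Os per thread).
configs = {
    "-t32 -o1": (32, 1),
    "-t1 -o32": (1, 32),
}

totals = {name: total_outstanding(t, o) for name, (t, o) in configs.items()}

for name, total in totals.items():
    print(f"{name}: {total} I/Os in flight")
```

So the nominal queue depth is identical in both cases; only the way it is generated (many synchronous threads vs. one thread issuing async requests) differs.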