Admittedly, I'm a bit of a novice when it comes to parallel computing, but from what I've seen so far, anything more than 4 cores has essentially no benefit. When I first started, I was really excited about the possibility of using Amazon's EC2, but now that seems completely useless. Is that right?

kwardle

July 5, 2011 17:03

There is a huge difference in architecture between a cloud system and a supercomputer. When it comes to parallel scalability to many processors, the most important thing I have seen in running CFD in parallel is the speed of the interconnect between nodes and then, of course, the speed of the cores themselves. If your interconnect is 1 Gb/s (i.e. gigabit Ethernet) you won't see much improvement above some tens of processors. New supercomputers typically have QDR InfiniBand interconnects, which run at 10 Gb/s per lane (40 Gb/s over the usual 4x link).

While I have a bit of experience running OpenFOAM on clusters and supercomputers on up to a few thousand processors, I am not so familiar with trying to do it on a cloud system. Apparently, Amazon does have custom HPC-type clouds with a 10 Gb/s interconnect. They claim this can match the performance of more standard HPC systems. Their 'cloud' may simply be a normal cluster in itself, and if so I am not sure what the advantage of EC2 would be other than on-demand access. Again, I know little about these systems, as my original assumption was precisely your final conclusion: they are relatively useless for large-scale CFD. Perhaps someone who knows more can chime in if I am wrong.

murrdpirate

July 5, 2011 17:23

The Amazon product that I was looking at is the HPC EC2. It supposedly offers a 10 Gigabit connection, so maybe it actually would be fast enough.

I was a bit pessimistic because parallel computing on multicore processors seemed to reach diminishing returns very quickly (nearly 0 benefit to go from 2 to 4 processors for the geometries I've tried). I couldn't imagine a supercomputer having better connection speeds between its processors than a multicore chip, but I am pretty ignorant on much of this.

kwardle

July 5, 2011 17:29

Well, but you also have to consider the problem size. You are going to see a max speedup around some number of meshpoints/processor. On QDR Infiniband systems for the type of problems I do (interFoam based) this is typically around 5K-10K polyhedral cells/processor. How large are the problems you have tried?
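As a rough illustration of that rule of thumb, the sketch below turns a cell count into a range of core counts. The function name is hypothetical, and the 5K-10K defaults are simply the figures quoted above for interFoam on QDR InfiniBand; other solvers and interconnects will likely have a different sweet spot.

```python
def useful_core_range(n_cells, min_cells_per_core=5_000, max_cells_per_core=10_000):
    """Return (fewest, most) cores that keep cells/core inside the given range.

    Hypothetical helper: the 5K-10K defaults are the rule of thumb quoted in
    this thread, not a general law.
    """
    fewest = max(1, n_cells // max_cells_per_core)
    most = max(1, n_cells // min_cells_per_core)
    return fewest, most


if __name__ == "__main__":
    for cells in (100_000, 1_000_000, 10_000_000):
        fewest, most = useful_core_range(cells)
        print(f"{cells:>10,d} cells -> roughly {fewest} to {most} cores")
```

Beyond the upper end of that range, each core has so few cells that communication is likely to dominate the run time.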

murrdpirate

July 5, 2011 18:56

The cases I've been running are at around 100,000 tetrahedral cells. Going from 1 to 2 processors yields around a 40% increase in performance, and going from 2 to 4 yields an additional 10% at most. I don't suppose polyhedral meshes have better parallel performance, do they? I suppose it's possible since each polyhedral cell has more neighbors than each tet cell, and thus adds CPU calculation without adding more communication.
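To put those numbers in the usual terms, here is a minimal sketch that converts them into speedup and parallel efficiency. It assumes the gains compound (40% faster on 2 processors, then 10% on top of that on 4); that reading of the figures is an assumption, and the variable names are only illustrative.

```python
def parallel_efficiency(speedup, n_procs):
    """Efficiency = speedup / number of processors (1.0 is ideal scaling)."""
    return speedup / n_procs


# Assumed reading of the figures above: +40% from 1 -> 2 processors,
# then a further +10% on top of that from 2 -> 4.
speedups = {1: 1.0, 2: 1.4, 4: 1.4 * 1.1}
n_cells = 100_000

for n_procs, speedup in speedups.items():
    eff = parallel_efficiency(speedup, n_procs)
    print(f"{n_procs} procs: speedup {speedup:.2f}, "
          f"efficiency {eff:.0%}, {n_cells // n_procs:,} cells/proc")
```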

nileshjrane

July 6, 2011 04:44

I would say the most crucial thing affecting parallel efficiency is the CFD algorithm itself. Hardware issues are, to me, secondary. Most current CFD algorithms are good for serial processing, but they are not ideal for parallel processing. If one can use specialized algorithms on parallel machines, then one can get near-ideal parallel scalability even on thousands of processors. CFD has yet to mature for highly parallel hardware.

As an example, consider this. If you are inverting a matrix and the domain is spread over many processors, then most of the conventional algorithms, like Gauss elimination, require the whole matrix on a single processor. It means all the components of the matrix need to be transferred back and forth between master and slave nodes all the time. And as we all know, this is the bottleneck for speed. Instead there are methods which simply eliminate this data transfer and do the matrix inversion locally on each processor, independent of the other processors (read: very little dependency). Thus they give high parallel efficiency.

The power of thousands of processors alone isn't enough. One needs to know how to use it.

niklas

July 6, 2011 04:54

You might find this interesting. I did this a few weeks ago, and as you can see there is a lot you can gain when you increase the number of CPUs.

# Afact seems to be max around 10k cells/core.
# trying to keep cells/core constant at 10 k = 240k/node
# switching to constant CFL number
# in order to try and keep the number of pressure iterations equal

akidess

Quote:

Originally Posted by nileshjrane
(Post 314889)

As an example, consider this. If you are inverting a matrix and the domain is spread over many processors, then most of the conventional algorithms, like Gauss elimination, require the whole matrix on a single processor. It means all the components of the matrix need to be transferred back and forth between master and slave nodes all the time. And as we all know, this is the bottleneck for speed. Instead there are methods which simply eliminate this data transfer and do the matrix inversion locally on each processor, independent of the other processors (read: very little dependency). Thus they give high parallel efficiency.

I can't imagine any code working like this, and certainly OpenFOAM doesn't! OpenFOAM applies special "processor" patch boundary conditions on the boundaries that appear after the domain decomposition, and only the patch neighbor values are exchanged. Also, latency might be a larger problem than bandwidth.
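The idea of exchanging only patch neighbor values can be sketched with a toy 1D problem. This is not OpenFOAM code: the functions below are hypothetical, the two "processors" are just two Python lists, and the exchange that MPI would perform in a real run is a plain assignment.

```python
def jacobi_sweep(u):
    """One Jacobi sweep for a 1D Laplace problem; end values are held fixed."""
    interior = [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def serial(u, sweeps):
    """Reference: run all sweeps on the whole field."""
    for _ in range(sweeps):
        u = jacobi_sweep(u)
    return u


def decomposed(u, sweeps):
    """Same sweeps on two subdomains that swap one interface value per sweep."""
    mid = len(u) // 2
    left = u[:mid] + [u[mid]]        # owned cells + one ghost cell on the right
    right = [u[mid - 1]] + u[mid:]   # one ghost cell on the left + owned cells
    for _ in range(sweeps):
        left = jacobi_sweep(left)
        right = jacobi_sweep(right)
        # "Processor patch" update: each side receives only its neighbour's
        # boundary value, never the whole field.
        left[-1] = right[1]
        right[0] = left[-2]
    return left[:-1] + right[1:]


u0 = [1.0] + [0.0] * 6 + [2.0]
difference = max(abs(a - b) for a, b in zip(serial(u0, 50), decomposed(u0, 50)))
print("max difference between serial and decomposed runs:", difference)
```

Each sweep moves a single number in each direction, however large the subdomains are. That is also why latency can matter more than bandwidth: the messages are small but frequent.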

arjun

July 6, 2011 10:51

Quote:

Originally Posted by nileshjrane
(Post 314889)

Instead there are methods which simply eliminate this data transfer and do the matrix inversion locally on each processor, independent of the other processors (read: very little dependency). Thus they give high parallel efficiency.

The power of thousands of processors alone isn't enough. One needs to know how to use it.

You cannot invert a matrix locally without communicating with other processors. The only case where you can do it is where the matrix is block diagonal and each block lies within a processor.
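A small numerical check of this point, as a sketch only: the matrices, the right-hand side and the helper names below are made up for illustration. Solving each diagonal block separately reproduces the full solution when the matrix is block diagonal, and stops doing so as soon as one coupling term crosses the processor boundary.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def local_solves(A, b, split):
    """Each 'processor' solves only its own diagonal block, with no communication."""
    top = solve([row[:split] for row in A[:split]], b[:split])
    bottom = solve([row[split:] for row in A[split:]], b[split:])
    return top + bottom


def max_diff(x, y):
    return max(abs(a - c) for a, c in zip(x, y))


b = [1.0, 2.0, 3.0, 4.0]
block_diag = [[4.0, 1.0, 0.0, 0.0],
              [1.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 5.0, 2.0],
              [0.0, 0.0, 2.0, 6.0]]
coupled = [row[:] for row in block_diag]
coupled[1][2] = coupled[2][1] = 1.0  # one coupling across the processor boundary

print("block diagonal:", max_diff(solve(block_diag, b), local_solves(block_diag, b, 2)))
print("coupled:       ", max_diff(solve(coupled, b), local_solves(coupled, b, 2)))
```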

lakeat

August 22, 2011 09:44

Quote:

Originally Posted by arjun
(Post 314946)

You cannot invert a matrix locally without communicating with other processors. The only case where you can do it is where the matrix is block diagonal and each block lies within a processor.

This makes sense. And what algorithm would you use for large cases? GAMG, or PCG, for an unsteady case? Thanks

lakeat

June 20, 2012 17:26

Quote:

Originally Posted by niklas
(Post 314891)

You might find this interesting. I did this a few weeks ago, and as you can see there is a lot you can gain when you increase the number of CPUs.

Hi Niklas,

I get a very, very long startup time (an hour) when I try to use a thousand CPUs. Any ideas?

niklas

June 21, 2012 01:26

On which architecture?
One thing that I've noticed is that on the Cray, if you have

CRAY_ROOTFS=DSL

you will get that behaviour.

lakeat

June 21, 2012 09:57

Quote:

Originally Posted by niklas
(Post 367574)

On which architecture?
One thing that I've noticed is that on the Cray, if you have

CRAY_ROOTFS=DSL

you will get that behaviour.

Thanks. What did you mean by "CRAY_ROOTFS=DSL"? I googled a little and found this page, and according to it, setting "CRAY_ROOTFS=DSL" is actually helpful.

niklas

OK, I see that it's not a Cray, so it's not that.

Are you using the system MPI or are you compiling OpenMPI yourself?

If you are using the ThirdParty option to compile OpenMPI yourself,
it is absolutely crucial that you add the --with-openib flag to $configOpts in the Allwmake script in the ThirdParty folder.

It is also important that when you compile it, you make sure that the hardware is the same as the cluster hardware and that the InfiniBand libs are available. Sometimes the login/submit node can differ in this respect, in which case you need to submit the compilation as a job.

And last, if you are using SYSTEMOPENMPI, you need to make sure that the library paths to the InfiniBand libs are in LD_LIBRARY_PATH, otherwise it will fall back to using something else.
You need to find where these libs are located, go into config/settings.sh, and add these under the SYSTEMOPENMPI option:
_foamAddLib /directoryToWhereInfinibandIsLocated
_foamAddLib /directoryToSomethingThatIBMightNeed

and maybe also this:
_foamAddPath /directoryToOPENMPIBIN
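One way to check the LD_LIBRARY_PATH part of this advice is sketched below. The helper name is hypothetical, and "libibverbs" is an assumption: it is the usual name of the InfiniBand verbs library on Linux, but a given cluster may rely on different libraries.

```python
import os


def dirs_containing(lib_prefix, search_path):
    """Return the directories in a path-separated string that hold a file
    whose name starts with lib_prefix."""
    hits = []
    for d in search_path.split(os.pathsep):
        if d and os.path.isdir(d):
            if any(name.startswith(lib_prefix) for name in os.listdir(d)):
                hits.append(d)
    return hits


if __name__ == "__main__":
    # "libibverbs" is an assumed library name; adjust it for your cluster.
    found = dirs_containing("libibverbs", os.environ.get("LD_LIBRARY_PATH", ""))
    print(found or "no InfiniBand libs found on LD_LIBRARY_PATH")
```

An empty result does not prove that InfiniBand is unusable, because the loader also searches system default directories that are not listed in LD_LIBRARY_PATH. Running it inside a submitted job rather than on the login node gives the answer that matters, since the two can differ.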

lakeat

June 21, 2012 11:45

Quote:

Originally Posted by niklas
(Post 367684)

OK, I see that it's not a Cray, so it's not that.

Are you using the system MPI or are you compiling OpenMPI yourself?

If you are using the ThirdParty option to compile OpenMPI yourself,
it is absolutely crucial that you add the --with-openib flag to $configOpts in the Allwmake script in the ThirdParty folder.

It is also important that when you compile it, you make sure that the hardware is the same as the cluster hardware and that the InfiniBand libs are available. Sometimes the login/submit node can differ in this respect, in which case you need to submit the compilation as a job.

And last, if you are using SYSTEMOPENMPI, you need to make sure that the library paths to the InfiniBand libs are in LD_LIBRARY_PATH, otherwise it will fall back to using something else.
You need to find where these libs are located, go into config/settings.sh, and add these under the SYSTEMOPENMPI option:
_foamAddLib /directoryToWhereInfinibandIsLocated
_foamAddLib /directoryToSomethingThatIBMightNeed

and maybe also this:
_foamAddPath /directoryToOPENMPIBIN

Thanks a lot. I will talk to the system manager to double-check the openib issue. (You know what, I am always worried about this issue; in particular I am afraid that different compute nodes might use different settings. This is a little bit tricky.)

Anyway, I will try and keep you posted. In the meantime, would you mind testing my cases to see what happens on your cluster? Could you give me your email so that I can send you the download address?

Thanks

niklas

June 21, 2012 11:55

Sure,
it's niklas dot nordin @ nequam dot se

nileshjrane

September 6, 2012 01:08

Quote:

Originally Posted by akidess
(Post 314934)

I can't imagine any code working like this, and certainly OpenFOAM doesn't! OpenFOAM applies special "processor" patch boundary conditions on the boundaries that appear after the domain decomposition, and only the patch neighbor values are exchanged. Also, latency might be a larger problem than bandwidth.

I was talking in a general sense, not about OF specifically. It was just one example to emphasize my point that algorithms are more important than the number of processors.

Quote:

Originally Posted by arjun
(Post 314946)

You cannot invert a matrix locally without communicating with other processors. The only case where you can do it is where the matrix is block diagonal and each block lies within a processor.

You cannot do it completely independently, yes; that's why I mentioned in brackets "read less dependency". The algorithm I worked with in my Master's work on hypersonic flows assumes that du/dy >> du/dx, where u is the velocity parallel to the wall, x runs along the wall and y is perpendicular to it. Here the x-y grid is a transformed one. This is a valid assumption for wall-bounded hypersonic flows. Because of it, one can linearize the longitudinal gradients and reduce the dependence of a point on the previous collinear point on the same x line to a minimum. So in practical terms one doesn't need information beyond the boundary cells of each block (associated with one processor) when the decomposition is done in the longitudinal direction only. And the algorithm still gives results as good as a complete matrix inversion would have given. Mind you, this is a fully coupled hypersonic reacting flow code and not just a block diagonal code. The elegance lies in the simple but appropriate simplification. (NOTE: I might sound vague here; sorry, I couldn't explain the method well, I guess. :confused:)

My point was, if one can judiciously modify the algorithm to make it parallel-processor friendly, one can get very good scaling without compromising on the quality of the results. Just increasing the number of processors is not a very bright idea. ;)

arjun

September 6, 2012 03:49

Quote:

Originally Posted by nileshjrane
(Post 380477)

You cannot do it completely independently, yes; that's why I mentioned in brackets "read less dependency". The algorithm I worked with in my Master's work on hypersonic flows assumes that du/dy >> du/dx, where u is the velocity parallel to the wall, x runs along the wall and y is perpendicular to it. Here the x-y grid is a transformed one. This is a valid assumption for wall-bounded hypersonic flows. Because of it, one can linearize the longitudinal gradients and reduce the dependence of a point on the previous collinear point on the same x line to a minimum. So in practical terms one doesn't need information beyond the boundary cells of each block (associated with one processor) when the decomposition is done in the longitudinal direction only. And the algorithm still gives results as good as a complete matrix inversion would have given. Mind you, this is a fully coupled hypersonic reacting flow code and not just a block diagonal code. The elegance lies in the simple but appropriate simplification. (NOTE: I might sound vague here; sorry, I couldn't explain the method well, I guess. :confused:)

My point was, if one can judiciously modify the algorithm to make it parallel-processor friendly, one can get very good scaling without compromising on the quality of the results. Just increasing the number of processors is not a very bright idea. ;)

Quote:

Originally Posted by nileshjrane
(Post 314889)

Instead there are methods which simply eliminate this data transfer and do the matrix inversion locally on each processor, independent of the other processors (read: very little dependency). Thus they give high parallel efficiency.

The power of thousands of processors alone isn't enough. One needs to know how to use it.

What you did is of no use unless you can show that it could be used by everyone.
You made some assumption and it seems to work for your special case, but it does not tell me how to use the power of thousands of processors for what I am doing.

You still cannot invert a matrix locally without communicating, and you still cannot invert a matrix by ignoring a few off-diagonals and doing less communication. If that were true we would have developed lots of methods around it. What you are assuming is that you are the only smarty pants and all the others are mindless stupids. There is a reason we do things the way we do. And the reason is that people have found out that it is really not possible to just ignore a few things here and there and make things work.

nileshjrane

September 6, 2012 05:02

Quote:

Originally Posted by arjun
(Post 380502)

What you did is of no use unless you can show that it could be used by everyone.
You made some assumption and it seems to work for your special case, but it does not tell me how to use the power of thousands of processors for what I am doing.

You still cannot invert a matrix locally without communicating, and you still cannot invert a matrix by ignoring a few off-diagonals and doing less communication. If that were true we would have developed lots of methods around it. What you are assuming is that you are the only smarty pants and all the others are mindless stupids. There is a reason we do things the way we do. And the reason is that people have found out that it is really not possible to just ignore a few things here and there and make things work.

That was rude. Don't want to pollute the thread so I am backing off from this thread.

Just my last post here. I was talking about an algorithm which was developed by NASA and is used extensively by them for their hypersonic flight designs, extraterrestrial probes, reactive flows, etc. So I am not considering myself a "smarty pants", you see, nor did I say that others here are "mindless stupids". I am merely sharing my observations and opinions. BTW, something which is applicable to the whole supersonic and hypersonic regime is not that special a case, now is it?

There are algorithms which are more "parallel friendly" than others, e.g. Krylov subspace solvers. The computational physics guys have been using them for years. They don't invert the matrix at all. I did some literature survey out of interest and derived my conclusion from that.
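As an illustration of a Krylov subspace solver that never forms an inverse, here is a minimal conjugate gradient sketch in plain Python. The small matrix is made up for the example, and the method as written assumes a symmetric positive-definite matrix.

```python
def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A without forming A^-1."""
    x = [0.0] * len(b)
    r = b[:]            # residual b - A x for the zero initial guess
    p = r[:]            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(A, b)
print("x   =", x)
print("A x =", matvec(A, x))
```

The only operations on the matrix are matrix-vector products, plus dot products on vectors. In a decomposed run the product needs neighbor values at the processor boundaries and each dot product needs a global sum, so communication is reduced to those two steps rather than removed.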

You can choose to ignore my opinions if you feel I am wrong. I did not force anyone to accept my views. I stand by my view, you stand by yours. But do it politely.