I just tried to run a case using interDyMFoam in parallel. The case consists of a non-moving outer mesh and an inner, rotating cylindrical mesh (with a surface-piercing propeller in it). I use GGI to connect the two meshes. The inner mesh is polyhedral and the outer one hexahedral.
The entire case consists of approx. 1 million cells (most of them in the inner mesh).

I have run this case in parallel on different numbers of processors on an SMP machine with 8 quad-core Opteron processors (decompositionMethod: metis):

So the speedup doesn't even reach 3. A similar case, where the whole domain is rotating and the mesh consists only of polyhedra, shows a linear speedup up to 8 processors and decreasing parallel efficiency beyond that.

I wonder if this has to do with the GGI interface? I tried to stitch it and repeat the test but unfortunately stitchMesh failed. Does anyone have an idea how to improve the parallel efficiency?

My experience also shows that 8 cores gives the best speed, while it slows down at 16 cores.
I wonder whether this is because each of my computers has 8 cores, so the communication between two computers is much less efficient.

I have encountered similar problems with the GGI performance in parallel, and I still get no speedup for more than 8 cores. This is extremely bad, since I typically have large cases running on 32 cores with lots of interfaces in them. Running them on 8 cores is damn slow for me. Hrv: are there plans to further improve the parallel performance of the GGI?

I am running turbDyMFoam with GGI on a full wind turbine, so the mesh is huge (~4 million cells). I am having problems running in parallel on 32 processors: it runs very slowly and eventually one of the processes dies.

The same job runs perfectly fine in serial, but I think this case would take a long time to finish in serial.

I would be very thankful if you could shed some light on improving the GGI parallel performance.

Which SVN version of 1.5-dev are you running? Please make sure you are running the latest SVN release in order to get the best GGI implementation available.
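If you are unsure, you can check which revision your checkout is at like this (assuming your 1.5-dev tree is an SVN working copy located at $WM_PROJECT_DIR):

Code:

cd $WM_PROJECT_DIR
svn info | grep Revision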

Your problem might be hardware related. Can you provide some speed-up numbers you achieved with your hardware while running OpenFOAM simulations on some non-GGI test cases?

Could you provide a bit more information about your hardware setup? Mostly about the interconnect between the computing nodes, and the memory available on the nodes?

Are you using something like VMware machines for your simulation? I have seen that one before...

I would like to see the following piece of information from your case:

file constant/polyMesh/boundary

file system/controlDict

file system/decomposeParDict

a log file of your failed parallel run

the exact command you are using to start your failed parallel run

Are you using MPI? If so, which flavor, and which version?

Please note that this is the kind of information you can provide up-front with your question, and that can really help other people help you out quickly.
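As an illustration, the last two items could be answered with something like this (a sketch only; the solver name, core count and log file name are assumptions based on your description):

Code:

# which MPI flavor/version (the --version flag works with Open MPI at least)
mpirun --version

# the exact command used to launch the failed run (names and counts assumed)
mpirun -np 32 turbDyMFoam -parallel > log.turbDyMFoam 2>&1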

Regards,

Martin

Quote:

Originally Posted by ddigrask

Hello All,

I am running turbDyMFoam with GGI on a full wind turbine, so the mesh is huge (~4 million cells). I am having problems running in parallel on 32 processors: it runs very slowly and eventually one of the processes dies.

The same job runs perfectly fine in serial, but I think this case would take a long time to finish in serial.

I would be very thankful if you could shed some light on improving the GGI parallel performance.

Thank you for your reply. I realize I should have given all the details beforehand. I am sorry for that; it won't happen in the future.

1. I was earlier using the latest SVN version. But later, since I was facing the parallel problems, I read more and thought I should follow the ERCOFTAC page, so I reverted to revision 1238. I will upgrade to the latest one now.

// Face zones which need to be present on all CPUs in their entirety
globalFaceZones
(
    innerSliderInlet_zone
    outerSliderInlet_zone
    innerSliderWall_zone
    outerSliderWall_zone
    outerSliderOutlet_zone
    innerSliderOutlet_zone
);

NOTE: I had a doubt. Since my rotating zone has the finest mesh, the GGI patches contain the largest number of faces. When I use globalFaceZones in decomposeParDict, does it copy all the GGI faces onto all processors? If that is the case, it would run really slow, because it would take time to interpolate between ~100K faces and to communicate the data. Please forgive me if what I am thinking is wrong.
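For reference, that entry sits in system/decomposeParDict next to the usual decomposition settings; a minimal sketch (the subdomain count here is illustrative, not copied from my case):

Code:

numberOfSubdomains 32;   // illustrative value

method metis;

globalFaceZones
(
    // ... the six slider zones listed above ...
);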

From the information present in your boundary file, I can see that your GGI interfaces are indeed composed of large sets of facets.

With the actual implementation of the GGI, this will have an impact, because the GGI faceZones are shared across all the processors and the communication will take its toll.
Also, one internal algorithm of the GGI is a bit slow when using very large numbers of facets for the GGI patches (my bad here, but I am working on it...).

But not to the point that a simulation would crash and burn like you are describing.

So another important piece of information I need is your simulation log file; not the PBS log file, but the log messages generated by turbDyMFoam during your 32-processor parallel run.

This file is probably quite large for posting on the forum, so I would like to see at least the very early log messages, from line #1 (the turbDyMFoam splash header) down to, let's say, the 10th simulation time step.

I also need to see the log for the last 10 time steps, just before your application crashed.
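If the file is unwieldy, something like this will extract the two portions (the line counts are rough guesses, so adjust them until the requested time steps are covered; the log file name is assumed):

Code:

head -n 1000 log.turbDyMFoam > log_start.txt   # splash header + first ~10 steps
tail -n 1000 log.turbDyMFoam > log_end.txt     # last steps before the crash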

As a side note: as I mentioned, I am currently working on some improvements to the GGI in order to speed up the code when using GGI patches with a large number of facets (100K and more).

My research group needs to run large GGI cases like that, so it is a priority for me to get this nailed down ASAP. We will contribute our modifications to Hrv's dev version, so you will have access to the improvements as well.

Regards,

Martin

Quote:

Originally Posted by ddigrask

Hello Mr. Beaudoin,

Thank you for your reply. I realize I should have given all the details beforehand. I am sorry for that; it won't happen in the future.

1. I was earlier using the latest SVN version. But later, since I was facing the parallel problems, I read more and thought I should follow the ERCOFTAC page, so I reverted to revision 1238. I will upgrade to the latest one now.

// Face zones which need to be present on all CPUs in their entirety
globalFaceZones
(
    innerSliderInlet_zone
    outerSliderInlet_zone
    innerSliderWall_zone
    outerSliderWall_zone
    outerSliderOutlet_zone
    innerSliderOutlet_zone
);

NOTE: I had a doubt. Since my rotating zone has the finest mesh, the GGI patches contain the largest number of faces. When I use globalFaceZones in decomposeParDict, does it copy all the GGI faces onto all processors? If that is the case, it would run really slow, because it would take time to interpolate between ~100K faces and to communicate the data. Please forgive me if what I am thinking is wrong.

Sorry for the slightly late reply. The turbDyMFoam output is attached below. The code does not crash because of the solver settings; it just stalls at some step during the calculation and finally dies with an MPI error.

After carefully looking at the output of each time step, I have observed that the most time-consuming part of the solution is the GGI interpolation step. That is where the solver takes about 2-3 minutes to post the output.

1: It would be useful to see a stack trace in your log file when your run aborts. Could you set the environment variable FOAM_ABORT=1 and make sure every parallel task gets this variable as well? That way, we could see where the parallel tasks are crashing through the stack trace in the log file.
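With Open MPI, for instance, the variable can be forwarded to every rank like this (the -x flag is Open MPI syntax; other MPI implementations use different options, e.g. mpiexec -env FOAM_ABORT 1 for MPICH):

Code:

export FOAM_ABORT=1
mpirun -x FOAM_ABORT -np 32 turbDyMFoam -parallel > log.turbDyMFoam 2>&1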

2: You said your cluster has 72 nodes, 8 processors per node, and 4 GB of RAM per node.

3: From your log file, we can see that you have 8 parallel tasks running on each node. Overall, your parallel run is using only 4 nodes on your cluster (node76, node23, node42 and node31).

4: So basically, for a ~4 million cell mesh, you are using only 4 computers, each with only 4 GB of RAM and with 8 tasks per node fighting simultaneously for access to that amount of RAM.

Am I right?

If so, because of your large mesh, your 4 nodes probably don't have enough memory available and could be swapping to virtual memory on the hard drive, which is quite slow.

And depending on your memory bus architecture, your 8 tasks will have to compete for access to the memory bus, which will slow you down as well.

Did you mean 4 GB of RAM per processor instead, which would give you 32 GB of RAM per node or computer?

Could you just double-check that your cluster information is accurate?

Martin

Quote:

Originally Posted by ddigrask

Dear Mr. Beaudoin,

Sorry for the slightly late reply. The turbDyMFoam output is attached below. The code does not crash because of the solver settings; it just stalls at some step during the calculation and finally dies with an MPI error.

After carefully looking at the output of each time step, I have observed that the most time-consuming part of the solution is the GGI interpolation step. That is where the solver takes about 2-3 minutes to post the output.

It only says that it crashed in an MPI operation. We don't know where: it could be in the GGI code, in the solver, or anywhere MPI is being used. So unfortunately, this stack trace is useless.

I don't have enough information to help you much more.

Try logging onto your compute nodes to see if you have enough memory while the parallel job runs. 20 minutes gives you plenty of time to catch this.

Try checking whether your nodes are swapping to virtual memory on disk.
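Standard Linux tools are enough for this; for instance, on one of the nodes listed in your log (node76, say):

Code:

free -m     # 'swap used' should stay near zero
vmstat 5    # non-zero si/so columns mean the node is actively swapping
top         # press Shift+M to sort the tasks by memory usage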

I hope to be able to contribute some improvements to the GGI soon. I do not know if this will help you. Let's hope for the best.

Actually, I've got an update for you. There is a new layer of optimisation code built into the GGI interpolation, aimed at sorting out the loss of performance in parallel for a large number of CPUs. In short, each GGI will recognise whether it is located on a single CPU or not and, based on this, it will adjust the communications pattern in parallel.

This has shown good improvement on a bunch of cases I have tried, but you need to be careful about the parallel decomposition you choose. There are two further optimisation steps we can do, but they are much more intrusive. I am delaying these until we start doing projects with real multi-stage compressors (lots of GGIs) and until we get the mixing plane code rocking (friends involved here).
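Conceptually, the new layer boils down to something like this (illustrative pseudocode only, not the actual 1.5-dev source; the function names are made up):

Code:

// Decide the communication pattern for one GGI interface (sketch).
if (interfaceLocalToSingleProcessor())
{
    // Both sides of the interface live on this CPU:
    // interpolate locally, no inter-processor communication needed.
    interpolateLocally();
}
else
{
    // The interface faces are spread over several CPUs:
    // exchange data only with the processors actually involved.
    exchangeWithInvolvedProcessorsOnly();
}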