This is Themos Tsikas from NAG, Oxford, UK. I am dealing with a user of HECToR, the UK academic Cray machine, who has a problem with the ScaLAPACK routine PDGBTRF (banded LU factorization with pivoting). Specifically, the code sometimes crashes with memory corruption in malloc. Using valgrind, I have tracked it down to the IPIV array being written beyond its documented bounds:

I've noticed a few posts on the forum reporting bad behaviour from this routine so I am wondering if I've come across a documentation bug or an algorithmic bug. I would appreciate your thoughts on this.

Hi Themos, I am not too familiar with the code. 1) Did you trace where the execution crashes in PDGBTRF? Does it happen at line 1045? Does setting IPIV to size ( DESCA(NB)+BWL+BWU ) fix the problem? 2) There are some restrictions given in lines 134 to 160 of the code. Cheers, Julien.

I had been attempting to debug a "simple" program calling ScaLAPACK routines in C for several days. I set up everything to partition the matrix stored in general banded format on a 1xP process grid. I even got past the tricky bit about each sub-matrix of B (the RHS) having a leading dimension of NB_A (DESCA(4), with Fortran indexing), even if NB_A*P>M_B - which is the case when the order of the matrix (N_A) is not evenly divided by the number of processes. Even then, when I invoked PDGBSV, my program would either segfault when PDGBSV called PDGBTRF, or get the correct answer and then segfault toward the end of my code when it called BLACS_GRIDEXIT.
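To illustrate the leading-dimension point, here is a minimal sketch of the arithmetic, assuming a 1-D block distribution with one block per process and a block size chosen as NB = ceil(N/P). Both the choice of NB and the helper names `block_size` and `local_rows` are assumptions for illustration, not part of ScaLAPACK.

```c
#include <assert.h>

/* Hypothetical helper: smallest block size NB such that NB * P >= N,
 * i.e. NB = ceil(N / P). This choice of NB is an assumption. */
int block_size(int n, int p) {
    assert(n > 0 && p > 0);
    return (n + p - 1) / p;
}

/* Hypothetical helper: number of rows of B that the process with
 * 0-based index `rank` actually owns. The last process may own fewer
 * than NB rows (or none), yet its local piece of B is still declared
 * with leading dimension NB. */
int local_rows(int n, int p, int rank) {
    assert(rank >= 0 && rank < p);
    int nb = block_size(n, p);
    int start = rank * nb;            /* first global row owned by this process */
    if (start >= n) {
        return 0;                     /* nothing left for this process */
    }
    int remaining = n - start;
    return remaining < nb ? remaining : nb;
}
```

For N = 10 and P = 4 this gives NB = 3, so NB*P = 12 > 10; the first three processes own 3 rows each and the last owns 1, but every local piece of B is stored with leading dimension 3.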

However, when I changed the length of IPIV from NB_A (again, DESCA(4), with Fortran indexing) to NB_A + BWU + BWL, my test program now appears to work properly, without segfaults, and the output matches the solutions obtained from similar LAPACK routines I wrote using DGBSV and DGBTRS. I am not enough of an expert to know if this is a true bug, or is just covering up another mistake I made, but I will update this post if I find that bugs have persisted.
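The sizing change described above can be sketched in C as follows. This only illustrates the workaround reported in this thread and is not a confirmed fix for PDGBTRF: the helper names `ipiv_len_original`, `ipiv_len_padded`, and `alloc_ipiv` are hypothetical, and the sketch assumes that a Fortran INTEGER corresponds to a C `int` on the target platform.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: IPIV length used before the change,
 * NB_A (DESCA(4) with Fortran indexing). */
size_t ipiv_len_original(int nb_a) {
    assert(nb_a > 0);
    return (size_t)nb_a;
}

/* Hypothetical helper: padded IPIV length reported to avoid the
 * segfaults, NB_A + BWU + BWL. */
size_t ipiv_len_padded(int nb_a, int bwl, int bwu) {
    assert(nb_a > 0 && bwl >= 0 && bwu >= 0);
    return (size_t)nb_a + (size_t)bwl + (size_t)bwu;
}

/* Allocate a zero-initialised IPIV array of the padded length.
 * Assumption: Fortran INTEGER maps to C int.
 * Returns NULL on allocation failure; the caller must free() it. */
int *alloc_ipiv(int nb_a, int bwl, int bwu) {
    return calloc(ipiv_len_padded(nb_a, bwl, bwu), sizeof(int));
}
```

The array returned by `alloc_ipiv` would then be passed as the IPIV argument in place of one of length NB_A. Keeping the two length computations separate makes it easy to switch back and check whether the original length still triggers the fault.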

THANK YOU for not only finding this bug, but also reporting it here so others like me could find it. It has made my weekend.

I was having the same problem as Steve with the segfaults using PDGBSV. I was also finding that the code was overwriting matrices that it shouldn't have had access to. Changing the length of IPIV from NB_A to NB_A + BWU + BWL also seems to have solved my problem. Steve, did you ever find any other bugs after doing that?