ARSC Symposium on Multicore and New Processing Technologies

A symposium exploring the capabilities and use of new high performance computing (HPC) technologies is taking place Aug. 13-14 on the campus of the University of Alaska Fairbanks in conference room 010 of the West Ridge Research Building. Space is limited for this free symposium and registration is required. Register here:

The two-day symposium is hosted by the Arctic Region Supercomputing Center, a national leader in the ongoing evaluation of current and next-generation processing technologies to improve the speed, memory performance, and efficiency of supercomputers. ARSC is a charter member of the National Science Foundation's Center for High Performance Reconfigurable Computing (CHPRC). For more details, see:

Taking Slices of Data From NetCDF Files: Part I

[ by Anton Kulchitsky and Lee Higbie]

If you want to work on a slice or column of data from a netCDF file, the netCDF interface allows you to extract the data without reading the whole array. Confusion can arise, however, when you want to read data into an array of lower rank than the original variable. The goal of this article is to show how to extract precisely the information you want from a netCDF file.

We will use a modification of the netCDF tutorial's error function in our examples, and, this week, use C:
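The error handler itself did not survive in this copy of the article; a minimal variant of the netCDF tutorial's handler (names follow the tutorial, not necessarily the authors' exact code) looks like this:

```c
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

/* Variant of the netCDF tutorial's error handler: print the netCDF
   error string for a nonzero status code and quit. */
#define ERR(e) { printf("Error: %s\n", nc_strerror(e)); exit(2); }
```

Each nc_* call below returns a status code that can be passed to this macro.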

Assume you already know all the dimensions T, Z, Y, X of your 4D array, A (you can get them using the nc_inq_dimid and nc_inq_dimlen functions). Consider the task of extracting a 1D vertical column A(t,z,y,x) for fixed index values, x=x0, y=y0, t=t0, into a variable float a[] with enough memory allocated for the column. The following code performs the task (see a detailed explanation below the code):
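The code listing was lost from this copy of the article. The following sketch is reconstructed from the explanation that follows (the names nc_file, parr, and a are taken from the text; error checking is omitted for brevity):

```c
int parr;                       /* will hold the variable id of A  */
size_t start[4], count[4];      /* corner and extent of the slice  */

/* Look up the id of variable "A" in the open dataset nc_file. */
nc_inq_varid(nc_file, "A", &parr);

start[0] = t0;  count[0] = 1;   /* fix t = t0              */
start[1] = 0;   count[1] = Z;   /* let z run from 0 to Z-1 */
start[2] = y0;  count[2] = 1;   /* fix y = y0              */
start[3] = x0;  count[3] = 1;   /* fix x = x0              */

/* Read the column into the preallocated float a[]. */
nc_get_vara_float(nc_file, parr, start, count, a);
```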

First, the code retrieves the variable id using the function nc_inq_varid with a file id of nc_file and variable name A into the variable parr. The variable itself can be retrieved with the nc_get_vara_float function. It uses the arrays start and count of dimension 4 (remember, we are working with a 4D array) to describe the required slice.

In this example, we fix the x, y, and t indices with the values x0, y0, and t0, and allow z to range from 0 to Z-1. (If you need only part of the z-column, you can modify that range as well.) Thus, the array start indicates the corner of the array from which you read the variable, and the array count indicates how many elements are to be read in each dimension.

It is very easy to modify the code if you need to read just one value from the 4D array, using index (t0,z0,y0,x0). The only coding change would be for the arrays start and count:
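The modified arrays were likewise lost from this copy. Based on the surrounding text, and assuming the destination was declared as a small 2D array b (which would explain the note below about b[0][0]), they would look like this:

```c
size_t start[4] = { t0, z0, y0, x0 };  /* the corner is the element itself   */
size_t count[4] = { 1, 1, 1, 1 };      /* read one element in each dimension */

float b[1][1];                         /* hypothetical destination */
nc_get_vara_float(nc_file, parr, start, count, &b[0][0]);
```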

Note that nc_get_vara_float accepts a pointer to b[0][0], not b itself.

You can use the same ideas to extract data from arrays of different dimensions.
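For instance, a 2D horizontal slab A(t0, z0, :, :) can be read with the same call. This is a sketch; the destination float b[Y][X] is hypothetical:

```c
size_t start[4] = { t0, z0, 0, 0 };  /* corner: fixed t and z, full y-x plane */
size_t count[4] = { 1, 1, Y, X };    /* one t, one z, all of y and x          */

float b[Y][X];
nc_get_vara_float(nc_file, parr, start, count, &b[0][0]);
```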

--

To be continued. Next time, in Fortran...

Highlights from SCICOMP-13, the IBM Scientific Users Group

[ by: Tom Logan ]

This year's annual IBM scientific users group conference, SCICOMP-13, was hosted by IPP (Max Planck Institute for Plasma Physics) in Garching, Germany. Garching is a quaint little city just 15 subway minutes north of Munich. It is the location of the Max Planck Institute campus that incorporates many disciplines, although it was originally focused primarily on plasma physics research.

The 4 1/2 day conference included a 1-day IBM tutorial, twenty-two user-submitted talks, and eight additional IBM talks. As always, the tutorial was most appropriate for new IBM users, but it included tidbits for the experienced user as well. The additional IBM talks covered the future road maps for architectures, LoadLeveler, the ACTC toolkits, GPFS, BlueGene, Cell processors, and, of course, the compilers.

[A few power5 vs power5+ talks reinforced this concept. Since power5+ is dual core and power5 is not, it was interesting to note that many memory intensive applications actually performed better on the power5 architecture.]

"Flops rates are not a useful measure of science achieved."

[To back this up, we saw that half of a set of 19 well-established scientific codes achieved <10% of the given system's theoretical peak performance on power5+, and that only ~50% of the average application's operations are floating point.]

"Mixed-mode programming rarely performs better than MPI."

[I'll be honest, this is actually quoting myself. I attended yet another talk demonstrating that pure MPI performed the best, which reaffirmed this conclusion. However, in a side conversation with another attendee, I was assured that at least one production code at NCAR works best in mixed-mode.]

Overall, the conference contained a large amount of wonderful information for the IBM scientific user. Group discussions during the breaks and at lunch added greatly to the usefulness of the meeting. And, of course, the late night meetings at the German beer gardens were the icing on the cake!

A complete list of the user talks, including "MPI-IO and MPI-2 One-Sided Communication" by yours truly, can be found at:

I've been assured that the IBM talks will be posted within the next week or so.

Removal of SSH default usage protocol 1

[ by: Oralee Nudson ]

Secure Shell (SSH) is a protocol used to exchange data over the network through a secure channel. There are two versions of the SSH protocol in use, versions 1 and 2. Protocol 1, originating from Finland in 1995, offered a more secure alternative to popular network protocols of the time such as rlogin, telnet, and rsh (Wikipedia). Supporting user authentication through passwords and encryption, SSH protocol 1 quickly became a standard in network communication. Just one year later, SSH protocol 2 was made available, offering security features not supported by protocol 1. Using the Diffie-Hellman key exchange and allowing a user to run more than one shell session over a single SSH connection are just two of the many significant improvements protocol 2 offered over protocol 1.

An excellent side-by-side comparison of SSH protocols 1 and 2 is presented on the following snailbook webpage:

In an ongoing effort to promote the use of more secure connections to and from ARSC systems, on August 1, 2007 SSH protocol 1 will be removed as a default usage protocol on all systems. Protocol 2 will become the default protocol. The discontinuation of protocol 1 as a default usage protocol does not mean that its use will be completely removed. Instead, should a user need to use SSH protocol 1 to connect to and from ARSC systems, they should add the following line to their ~/.ssh/config file:

Protocol 2,1
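If protocol 1 is needed only for a single remote system, a per-host stanza (the host name below is hypothetical) limits the fallback to that machine while keeping protocol 2 everywhere else:

```
# ~/.ssh/config
Host legacy.example.com
    Protocol 2,1

Host *
    Protocol 2
```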

We expect this change in default usage protocols to be transparent to most users. However, it may be worthwhile to double-check the SSH protocol settings of the client you use to connect to ARSC systems. For example, if you use PuTTY, confirm that the "Preferred SSH protocol version" radio button selected is "2" or "2 only" within the PuTTY Configuration SSH settings.

Users who would like more information are welcome to read "news ssh" on any ARSC system. As always, please feel free to direct any questions to the ARSC help desk.

Quick-Tip Q & A

A:[[ Can someone please explain this? When I use over 30,000,000 random
[[ variables in my code, the sum and average get out of whack.
[[ Below is a sample code which shows the problem.
[[
[[ mg56% cat t2.f90
[[ program t2
[[
[[   integer :: p, N
[[   real, allocatable, dimension (:) :: a
[[
[[   do N=10000000,50000000,5000000
[[     allocate (a(1:N))
[[
[[     call random_seed ()
[[     call random_number (a)
[[
[[     print*, "Count:", N, "Sum:", sum (a), "Avg:", sum(a)/real(N), "Min:" &
[[            , minval(a), "Max:", maxval(a)
[[
[[     deallocate (a)
[[   enddo
[[
[[ end program
[[
[[ mg56% pathf90 t2.f90
[[ mg56% ./a.out
[[ Count: 10000000 Sum: 5000621.5 Avg: 0.500062168 Min: 2.980232239E-7 Max: 0.999999702
[[ Count: 15000000 Sum: 7499592.5 Avg: 0.49997282 Min: 0.E+0 Max: 0.999999881
[[ Count: 20000000 Sum: 10002506. Avg: 0.500125289 Min: 0.E+0 Max: 0.99999994
[[ Count: 25000000 Sum: 12498267. Avg: 0.49993068 Min: 5.960464478E-8 Max: 0.99999994
[[ Count: 30000000 Sum: 15000845. Avg: 0.500028193 Min: 0.E+0 Max: 0.99999994
[[ Count: 35000000 Sum: 16777216. Avg: 0.479349017 Min: 0.E+0 Max: 0.99999994
[[ Count: 40000000 Sum: 16777216. Avg: 0.419430405 Min: 0.E+0 Max: 0.99999994
[[ Count: 45000000 Sum: 16777216. Avg: 0.372827023 Min: 5.960464478E-8 Max: 0.99999994
[[ Count: 50000000 Sum: 16777216. Avg: 0.335544318 Min: 0.E+0 Max: 0.99999994
[[ mg56%
[[
[[ It happens on both the IBM with xlf90 and the Sun Opteron cluster
[[ with pathf90. (Sorry, but 30000000 random numbers just isn't
[[ enough!)
#
# Orion Sky Lawlor
#
16777216 is the maximum value a single-precision float can reach when summing numbers no larger than 1.0; further additions of such small values don't change the sum due to roundoff. Here's a C/C++ example of the same thing:
  float gnats = 1.0;
  volatile float windshield = 1 << 24;  /* 16777216 = 2^24 */
  float orig = windshield;              /* the starting value, for comparison */
  for (int i = 0; i < 1000; i++)
    windshield += gnats;

Because of roundoff, the gnats do not affect the windshield in any way.
The "Roundoff" section at the bottom of this page has more:
http://www.cs.uaf.edu/2006/fall/cs301/lecture/11_10_float.html
To fix this, you'll have to use a more roundoff-friendly "sum" implementation. For example, you could write an explicit loop that adds up the array values into a real*8 (double-precision) temporary:

  real*8 :: doublesum
  ...
  doublesum = 0.0
  do i = 1, N
    doublesum = doublesum + a(i)
  enddo
  ...

This should work with arrays of length up to 2^53 (about 10^16). It is quite silly that the builtin compiler "sum" intrinsic doesn't already do this!
#
# Ed Kornkven
#
program t3
!
! Demonstrate cause of erroneous sum() results.  The "real" type
! is real*4 by default, which has only 24 mantissa bits.  Once a
! sum of 2^24 is reached, adding a small number (e.g., 0<x<1) has no
! representable effect.  Solution is to do sum() in real*8
! (function sum2) or sum the array in chunks (function sum3).
! Note that when N > 2*2^24, sum3 still fails because the second
! chunk suffers the same problem as the original array.
! E. Kornkven, ARSC
!
  integer :: p, N
  real*4, allocatable, dimension (:) :: a
  real*8 sum2
  real*4 sum3

  do N=10000000,50000000,5000000
    allocate (a(1:N))
    !call random_seed ()
    !call random_number (a)
    a = 1
    !print*, "Count:", N, "Sum:", sum (a), "Avg:", sum(a)/real(N), "Min:" &
    !        , minval(a), "Max:", maxval(a), "Sum2:", sum2(a,N), "Sum3:", sum3(a,N)
    print 10, N, sum (a), sum(a), sum2(a,N), sum3(a,N)
10  format ("Count:", i10, " Sum:", f10.1, " (0x", z8, ")", &
            " Sum2:", f10.1, " Sum3:", f10.1)
    deallocate (a)
  enddo
end program

real*8 function sum2(a, N)
  integer :: N
  real*4, dimension(N) :: a
  real*8 :: s
  s = 0.0
  do i = 1, N
    s = s + a(i)
  enddo
  sum2 = s
  return
end

real*4 function sum3(a, N)
  integer :: N
  real*4, dimension(N) :: a
  real*8 :: s
  integer :: half
  half = N/2
  sum3 = sum(a(1:half)) + sum(a(half+1:N))
  return
end
#
# Sean Ziegeler
#
I don't think it's a problem with the random number generator. The single-precision sums are saturating (the small additions are lost to roundoff), which also causes the incorrect average. You need to sum with double precision to prevent that (unless you are going to have lots more random numbers to sum). A quick fix to this code is to replace both instances of "sum(a)" with "sum(dble(a))".
I tried it out with xlf90 on an IBM and it works.
Some compilers may be smart enough to convert a single element of the
array at a time for summing; others may copy the entire array into
a double precision array. If you want to prevent that for memory
efficiency, you'll have to use a loop.
#
# Olivier Golinelli
#
Note that 16777216 = 2**24. This behaviour is a famous gotcha. Internally, the function "Sum(a)" is translated to

  suma = 0
  Do i = 1, n
    suma = suma + A(i)
  Enddo

As 0 < A(i) < 1, "suma" grows during the loop execution. Consequently, "suma" reaches the value 2**24 when "i" is of order 2**25. After that, "suma" is locked at this value, because the addition "suma + A(i)" is done with real values coded in single precision. As A(i) < 1, the ratio suma/A(i) > 2**24 and the addition always gives 2**24.
To circumvent this problem, two solutions:
1) use real(8) everywhere by default in the program. Then if you have a
problem with memory, try to reduce with real(4), or
2) write your own function, "sum," using a tree reduction pattern,
and not a linear pattern, for example:
Program test
  ! author : olivier.golinelli@cea.fr
  Implicit None
  Integer, Parameter :: na = 10**8, wp = 4
  Real(wp) :: A(na), sum1, sum2, sum3
  Integer i, n, k

  Call Random_number(A)
  Print *, Sum(A), 'Function Sum'

  sum1 = 0
  Do i = 1, na
    sum1 = sum1 + A(i)
  End Do
  Print *, sum1, 'Loop with compiler optimization'

  n = size(A)
  sum2 = 0
  Do i = 1, na
    sum2 = sum2 + A(i)
    If (i == -1) Print *, 'blabla'
  End Do
  Print *, sum2, 'Simple loop without optimization'

  n = size(A)
  Do
    k = (n+1) / 2
    Do i = 1, n/2
      A(i) = A(i) + A(i+k)
    Enddo
    n = k
    if (n == 1) Exit
  enddo
  sum3 = A(1)
  Print *, sum3, 'Tree reduction'
End Program
Interestingly, the behavior of the 'Sum' intrinsic and of the loop depends on the optimization done by the compiler. I use an old Intel Pentium4 with the Intel compiler "ifort version 8.0". With that, the 'Sum' intrinsic gives the correct result. The simple loop,

  Do i = 1, na
    sum1 = sum1 + A(i)
  End Do

also gives the correct result. Clearly, the compiler substitutes an optimized function for both. But if I disable optimization (by putting the print statement inside the loop) I obtain the behavior described in your newsletter, i.e., the result reaches a maximum at 2**24 for single precision real numbers, real(4).
Here's the output for the above program:
4.9997072E+07 Function Sum
4.9997072E+07 Loop with compiler optimization
1.6777216E+07 Simple loop without optimization
4.9997072E+07 Tree reduction
These gotchas are more numerous than we might expect.
#
# Edward Anderson
#
If you only have 7 or 8 digits of accuracy, you can't add tens of millions of numbers in the same range without losing some of them. Once the sum gets big enough, adding a relatively small number to it has no effect. The short answer is that you need to use double precision to accumulate the sum (the compiler option "-r8" to promote reals to 8 bytes, or 64 bits, is fairly standard), unless you subtract the mean to keep the sum around zero.
Interestingly, the Intel compiler on an SGI Altix produces the expected
result without (explicitly) resorting to double precision.
#
# Brad Chamberlain
#
I suspect the problem is that you're still using Fortran. :-)
Q: Regarding commenting in C/C++ code: I've seen code that uses /*
to start comments in some places and /** to start them in other
places. Why use /** ?

The University of Alaska Fairbanks is an affirmative action/equal
opportunity employer and educational institution and is a part of the University
of Alaska system.
Arctic Region Supercomputing Center (ARSC) |PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8602 | fax: 907-450-8601 | Supporting high performance computational research in science and engineering with emphasis on high latitudes and the arctic.
For questions or comments regarding this website, contact info@arsc.edu