Update of /cvsroot/math-atlas/AtlasBase/TexDoc
In directory usw-pr-cvs1:/tmp/cvs-serv8716
Modified Files:
atlas_devel.tex
Log Message:
explanation of tester usage
Index: atlas_devel.tex
===================================================================
RCS file: /cvsroot/math-atlas/AtlasBase/TexDoc/atlas_devel.tex,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** atlas_devel.tex 2001/11/06 04:22:21 1.8
--- atlas_devel.tex 2001/12/02 05:11:06 1.9
***************
*** 629,631 ****
--- 629,963 ----
your {\tt ATLAS/} directory.
\end{enumerate}
+
+ \section{Sanity testing for an ATLAS install}
+ From ATLAS3.3.8 forward, ATLAS has had a ``sanity test'', which does some
+ quick testing to ensure that there are no obvious problems
+ with the installed ATLAS libraries. It runs all of the standard BLAS interface
+ testers with the default input files, and it then runs a few fixed cases
+ of ATLAS's LAPACK tester routines (e.g., {\tt ATLAS/bin/invtst.c}, etc.). The
+ advantage of these LAPACK testers is that they depend on many of
+ the BLAS as well as the LAPACK routines, so you get a lot of testing in
+ a small amount of time. The sanity checks do not require any non-ATLAS
+ libraries, so the only dependency that a user who has installed
+ ATLAS may not be able to satisfy is the need for a Fortran77
+ compiler, which is required by the BLAS interface testers.
+
+ \subsection{Invoking the sanity tests}
+ These tests are invoked from the {\tt ATLAS/} subdirectory by:
+ \begin{verbatim}
+ make sanity_test arch=<arch>
+ \end{verbatim}
+
+ If you are using threads, you will want to run the same tests for threading
+ via:
+ \begin{verbatim}
+ make ptsanity_test arch=<arch>
+ \end{verbatim}
+
+ \subsection{Understanding the sanity test output}
+ Once you fire off this tester, you'll see a lot of compilation going on.
+ All compilation is done up front, and then the testers are run at the end.
+ All tester output is dumped to some files (we'll see specifics in a bit),
+ which are then automatically grepped for errors at the end of the run. It
+ is the results of this grep that the user will see. For example, here's
+ the output from a run on my Athlon running Linux:
+ \begin{verbatim}
+ dudley.home.net. make sanity_test arch=Linux_ATHLON
+ ...
+ ... bunch of compilation ...
+ ...
+ DONE BUILDING TESTERS, RUNNING:
+ SCOPING FOR FAILURES IN BIN TESTS:
+ fgrep -e fault -e FAULT -e error -e ERROR -e fail -e FAIL \
+ bin/Linux_ATHLON/sanity.out
+ 8 cases: 8 passed, 0 skipped, 0 failed
+ 4 cases: 4 passed, 0 skipped, 0 failed
+ 8 cases: 8 passed, 0 skipped, 0 failed
+ 4 cases: 4 passed, 0 skipped, 0 failed
+ 8 cases: 8 passed, 0 skipped, 0 failed
+ 4 cases: 4 passed, 0 skipped, 0 failed
+ 8 cases: 8 passed, 0 skipped, 0 failed
+ 4 cases: 4 passed, 0 skipped, 0 failed
+ DONE
+ SCOPING FOR FAILURES IN CBLAS TESTS:
+ fgrep -e fault -e FAULT -e error -e ERROR -e fail -e FAIL \
+ interfaces/blas/C/testing/Linux_ATHLON/sanity.out | \
+ fgrep -v PASSED
+ make[1]: [sanity_test] Error 1 (ignored)
+ DONE
+ SCOPING FOR FAILURES IN F77BLAS TESTS:
+ fgrep -e fault -e FAULT -e error -e ERROR -e fail -e FAIL \
+ interfaces/blas/F77/testing/Linux_ATHLON/sanity.out | \
+ fgrep -v PASSED
+ make[1]: [sanity_test] Error 1 (ignored)
+ DONE
+ \end{verbatim}
+
+ So, in the LAPACK testers we see no failures (all tests show
+ {\tt 0 failed}), and we have no output from the BLAS testers, which is
+ what we want. Notice the lines like:
+ \begin{verbatim}
+ make[1]: [sanity_test] Error 1 (ignored)
+ \end{verbatim}
+
+ This is due to fgrep's behavior, and does not indicate an error: fgrep
+ returns an exit status of 1 when it finds no pattern matches, and 0 when
+ it finds a match. Since we are grepping for errors, getting an ``error
+ condition'' of 1 is exactly what we hope for.
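+
+ You can see this exit-status convention for yourself with a quick shell
+ experiment (the file name here is just an example):
+ \begin{verbatim}
+ echo "all cases PASSED" > tst.out
+ fgrep -e FAIL -e ERROR tst.out
+ echo $?
+ \end{verbatim}
+ The final {\tt echo} prints {\tt 1}, since no pattern matched; a file
+ actually containing {\tt FAIL} would give {\tt 0} instead.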
+
+ \subsection{Finding the context of the error}
+ If the sanity test output shows errors, the next step is to track down
+ where they are coming from. You can see in the output which files are
+ being searched for errors. They are:
+ \begin{verbatim}
+ bin/<arch>/sanity.out
+ interfaces/blas/C/testing/<arch>/sanity.out
+ interfaces/blas/F77/testing/<arch>/sanity.out
+ \end{verbatim}
+
+ The threaded sanity test uses the same filenames with {\tt pt} prefixed.
+
+ The first thing to notice is which of these tests are showing errors.
+ The testers in {\tt bin} are higher level than those in the interfaces
+ directories, so if you get errors in both, track down and fix the interface
+ errors first, as they may be causing the LAPACK errors. If both the C and
+ F77 BLAS interfaces are showing errors, I always scope and fix the Fortran77
+ stuff first, since Fortran is simpler (no RowMajor case to handle). Only if
+ an error shows up solely in the C testing do I scope that output instead.
+
+ The grepped error message probably gives you no idea what actually went wrong
+ (it may show something as simple as:
+ \begin{verbatim}
+ FAIL
+ \end{verbatim}
+ for instance), so you must go look at the {\tt sanity.out} in question,
+ say {\tt interfaces/blas/F77/testing/<arch>/sanity.out}. Search for
+ whatever alerted you to the problem (e.g., {\tt FAIL}), and the
+ surrounding context will show you which tester failed.
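+
+ For example, to see each reported failure together with a few lines of
+ surrounding context, you might use grep's context option (supported by
+ GNU grep):
+ \begin{verbatim}
+ grep -C 5 FAIL interfaces/blas/F77/testing/<arch>/sanity.out
+ \end{verbatim}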
+
+ \subsection{Tracking down an error in the BLAS interface testers}
+ The BLAS testers are split by BLAS Level (1, 2 or 3) and precision/type
+ (s,d,c,z). The basic names of the tester executables are
+ \begin{verbatim}
+ x<pre>blat<lvl>
+ x<pre>cblat<lvl>
+ \end{verbatim}
+ for Fortran77 and C, respectively. The Level 1 testers
+ ({\tt x[s,d,c,z]blat1}) test certain fixed cases, and thus take no input file.
+ So if the error is in them, you simply run the executable with no args in
+ order to reproduce the failure.
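+
+ For example, if the double precision Level 1 tester were the one failing,
+ reproducing the failure from the tester's {\tt <arch>} directory is just:
+ \begin{verbatim}
+ ./xdblat1
+ \end{verbatim}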
+
+ The Level 2 and 3 testers allow a user to specify what tests should be run,
+ via an input file. The standard input files that ATLAS runs with are:
+ \begin{verbatim}
+ <pre>blat<lvl>.dat
+ c_<pre>blat<lvl>.dat
+ \end{verbatim}
+ respectively. The format of these input files is largely self-explanatory,
+ and more explanation can be found at:
+ \begin{verbatim}
+ http://www.netlib.org/blas/faq.html
+ \end{verbatim}
+ To run the tester with these files, you redirect them into the tester. For
+ instance, to run the double precision Level 2 tester with the default input
+ file, you'd issue:
+ \begin{verbatim}
+ ./xdblat2 < ../dblat2.dat
+ \end{verbatim}
+
+ You should be aware that only the first error report in a run is accurate:
+ one error can cause a cascade of spurious error reports, all of which may go
+ away by fixing the first reported problem. So, it is important to find and
+ fix the errors in sequence.
+
+ I usually copy the input file in question to a new file that
+ I can hack on (for instance, if the error was in the double precision Level 2,
+ I might issue {\tt cp ../dblat2.dat bad.dat}). I then repeatedly run the
+ tester and simplify the input file until I have found the smallest, simplest
+ input that displays the error.
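+
+ Each iteration of this narrowing process is then simply a rerun with the
+ hacked-on input file:
+ \begin{verbatim}
+ ./xdblat2 < bad.dat
+ \end{verbatim}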
+
+ The next step is to rule out tester error. The way I usually
+ do this is to demonstrate that the error goes away by linking to the Fortran77
+ reference BLAS rather than ATLAS (you can only do this for errors in the
+ F77 interface, obviously). I usually just do it by hand, i.e., for the
+ same example again, I'd do:
+ \begin{verbatim}
+ f77 -o xtst dblat2.o /home/rwhaley/lib/libfblas.a
+ \end{verbatim}
+ If the ATLAS-linked code has the error, and this one does not, it is a
+ strong indication that the error is in ATLAS. If the F77 BLAS are shown
+ to be in error, it is usually a compiler error, and can be fixed by turning
+ down (or off) the optimization used to compile the tester.
+
+ Now you should have confirmed the tester is working properly, and that
+ the error is in a specific routine (let us say DNRM2 as an example).
+ As a quick proof that DNRM2 is indeed the problem, you can link explicitly
+ to the F77 version of DNRM2, and to ATLAS for everything else (see
+ Section~\ref{sec-GoodBlas} for hints on how to do this). If this
+ still shows the error, you can be confident that ATLAS's DNRM2 is indeed
+ causing the problem, and you should either track it down or report it
+ (depending on your level of expertise).
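+
+ Continuing the DNRM2 example, such a hand link might look something like the
+ following (the object file and library paths here are examples; substitute
+ your own locations):
+ \begin{verbatim}
+ f77 -o xtst dblat2.o /home/rwhaley/FBLAS/dnrm2.o \
+     -L/home/rwhaley/ATLAS/lib/<arch> -lf77blas -latlas
+ \end{verbatim}
+ Because {\tt dnrm2.o} appears explicitly on the link line, the linker takes
+ the reference routine and resolves everything else from the ATLAS libraries.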
+
+ \subsection{Tracking down an error in the {\tt bin/} testers}
+ \label{sec-LapackDebug}
+
+ The sanity tests only run the LAPACK testers in this directory. The LAPACK
+ routines depend on the BLAS, so ignore errors in the LAPACK testers until all
+ the BLAS pass with no errors. If you have errors in LAPACK but the BLAS pass
+ all tests, then you have to hunt for the error in the LAPACK routines.
+
+ First, rule out a problem in the BLAS that is simply not showing
+ up in the BLAS testing. Get yourself a reference BLAS library, as explained
+ in Section~\ref{sec-GoodBlas}, and set your {\tt Make.ARCH}'s {\tt BLASlib}
+ macro to point to the created reference BLAS library. Next, you need to
+ compile a tester that uses ATLAS's LAPACK routines, but the reference
+ BLAS. This can be done by compiling the same executable name with {\tt \_sys}
+ suffixed. For instance, if you were running the LU tester, {\tt xdlutst},
+ you would say {\tt make xdlutst\_sys}, and then run this executable with
+ the same input.
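+
+ For the LU example, the sequence would look something like this (run from
+ your {\tt bin/<arch>} directory, invoking the executable with the same
+ input as the failing run):
+ \begin{verbatim}
+ make xdlutst_sys
+ ./xdlutst_sys
+ \end{verbatim}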
+
+ If the error goes away, then the error is really in the ATLAS BLAS somewhere.
+ I then usually look at the LAPACK routine and tester in question to find out
+ what its BLAS dependencies are, and manually link in the reference BLAS
+ object files until I find the exact BLAS routine causing the problem. Usually,
+ once you know which routine causes the problem, you can reproduce the error
+ with the BLAS tester (e.g., you need an IDAMAX call with N=12, incX=82).
+
+ If the error still persists using ATLAS's LAPACK and the Fortran77 BLAS,
+ the next trick is to treat LAPACK just like the BLAS: download and compile the
+ F77 LAPACK from netlib ({\tt http://www.netlib.org/lapack/lapack.tgz}). You
+ then set your {\tt Make.ARCH}'s {\tt FLAPACKlib} to point to your Fortran77
+ LAPACK library, and suffix the base executable name with {\tt F\_sys} (e.g.,
+ for LU again, you would do {\tt make xdlutstF\_sys}); this gets
+ you a tester linked against the Fortran77 BLAS {\em and} LAPACK.
+ If this also proves to be in error, there is an error in the tester, or in
+ the compiler. Try turning down compiler optimization to rule compiler
+ errors in or out.
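+
+ Building the reference LAPACK library follows the same pattern as the
+ reference BLAS build in Section~\ref{sec-GoodBlas}; a sketch (the
+ {\tt make.inc} example file varies by system, so check LAPACK's own
+ installation notes for your platform):
+ \begin{verbatim}
+ gunzip -c lapack.tgz | tar xvf -
+ cd LAPACK
+ cp INSTALL/make.inc.LINUX make.inc
+ make lapacklib
+ \end{verbatim}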
+
+ \section{Antoine's testing scripts}
+ Before a stable release, we always do as much testing as possible. The
+ 900-pound gorilla of testers is Antoine's set of tester scripts.
+ This tester runs for several days, and does a great number of
+ both fixed and random tests; if it completes with no errors, you have
+ a pretty good
+ idea that the code is fairly solid. This tester takes a lot of time and
+ energy to get going, so not everyone will want to apply it. Even the casual
+ user ought to run the sanity testing as a matter of course, and that should
+ always be run and passed first. Also, much of the methodology for
+ understanding output, tracking down problems, etc., is the same for this
+ tester and the sanity test, so read those sections first for tips I will
+ not bother to repeat here.
+
+ \subsection{Setting up and installing the tester}
+ First, you need to get the tester tarfile. You can get it from the file
+ releases on sourceforge, or, if you are using CVS, you can check out the
+ {\tt AtlasTest} module. You then untar this guy in the directory where you
+ want it ({\tt bunzip2 -c atlas\_test.tar.bz2 | tar xvmf -}).
+
+ Now, you need to edit {\tt AtlasTest/Makefile}, and set {\tt TSTdir}
+ to the directory you just created.
+
+ Some of these tests need a reference BLAS library to compare against, so
+ you need to fill in your ATLAS install's {\tt ATLAS/Make.ARCH} with a
+ trusted, complete {\tt BLASlib}. See the following section for details on
+ this.
+
+ You are now ready to start the testing, as described in the following sections.
+
+ \subsection{Getting a good {\tt BLASlib}}
+ \label{sec-GoodBlas}
+
+ Some of these tests need a reference BLAS library to compare against, so
+ you need to fill in your ATLAS install's {\tt ATLAS/Make.ARCH} with a
+ trusted, complete {\tt BLASlib}. You want an optimized library if one
+ is available, because the Level 3 tests can take days longer if
+ you use the reference library. However, only a few vendor libraries
+ supply all of the BLAS that ATLAS provides (to be fair, ATLAS provides
+ BLAS above those mandated by the standard; it provides all the routines
+ present in the Fortran77 reference library). So, the easiest way to
+ get a complete library is to also install the reference Fortran77 library
+ from netlib.
+
+ Get the BLAS reference tarfile from {\tt http://www.netlib.org/blas/blas.tgz},
+ then do something similar to the following:
+ \begin{verbatim}
+ mkdir FBLAS
+ cd FBLAS
+ gunzip -c ../blas.tgz | tar xvf -
+ f77 -O -c *.f
+ ar r libfblas.a *.o
+ \end{verbatim}
+
+ You may need to substitute your own Fortran77 compiler and flags, and if your
+ system uses ranlib, run that on {\tt libfblas.a} as well.
+
+ Now, you can set {\tt BLASlib} so that the optimized library is linked in
+ first, and the reference BLAS are used for any routines not provided in
+ the optimized library. For instance, here's my {\tt BLASlib} for using
+ MKL:
+ \begin{verbatim}
+ BLASlib = /home/rwhaley/lib/libmkl32_def.a /home/rwhaley/lib/libfblas.a
+ \end{verbatim}
+
+ For many routines, the tester cannot tell the difference between an error
+ in the BLAS given by {\tt BLASlib} and an error in ATLAS. Subsequent
+ sections will explain how to figure this out, but understand that
+ a lot of optimized BLAS will fail this tester, in which case you need
+ to link against the F77 BLAS instead of the optimized version of that
+ routine. Let us say you find out that there are errors in the
+ optimized {\tt DTRSM}. In this case, you can simply link in the F77 reference
+ DTRSM object file first to override the one in the optimized lib. So, your
+ {\tt BLASlib} line would then look something like:
+ \begin{verbatim}
+ BLASlib = /home/rwhaley/FBLAS/dtrsm.o \
+ /home/rwhaley/lib/libmkl32_def.a /home/rwhaley/lib/libfblas.a
+ \end{verbatim}
+
+ Obviously, if you have more than a few routines like this, just testing
+ against the f77 reference BLAS and taking the extra runtime is the way to
+ go.
+
+
+ \subsection{Running the tester}
+ The first thing to be aware of in running the tester is that the log files
+ it creates can take up a {\em lot} of space. You can kill the log files
+ as soon as the tester finishes, but you need enough space for it to complete.
+ The command to run the tester is fairly simple:
+ \begin{verbatim}
+ make TOPdir=<ATLAS install directory> ARCH=<arch>
+ \end{verbatim}
+ So, an example of this would be:
+ \begin{verbatim}
+ make TOPdir=/home/rwhaley/atlas3.3/ATLAS3.3.11.3 ARCH=Linux_PPRO_2
+ \end{verbatim}
+
+ As previously mentioned, however, this tester can run as long as several
+ days. So, if you are connected to the machine with an unreliable or
+ short-term connection, you will need to ensure it can continue to run even
+ if you are disconnected. Under most unixes, you can do this by using the
+ {\tt nohup} command. For example:
+ \begin{verbatim}
+ nohup make TOPdir=/home/rwhaley/atlas3.3/ATLAS3.3.11.3 ARCH=Linux_PPRO_2 |& \
+ tee PPRO.out &
+ \end{verbatim}
+ is what I use with the tcsh shell. Bourne shell users will need a
+ different redirect command.
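+
+ For Bourne-compatible shells (e.g., sh, bash, ksh), tcsh's {\tt |\&} is
+ spelled {\tt 2>\&1 |}, so the equivalent invocation would be:
+ \begin{verbatim}
+ nohup make TOPdir=/home/rwhaley/atlas3.3/ATLAS3.3.11.3 ARCH=Linux_PPRO_2 \
+   2>&1 | tee PPRO.out &
+ \end{verbatim}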
+
+ \subsection{Tracking down errors in the {\tt bin/} testers}
+ There are two types of {\tt bin/} testers: LAPACK and BLAS. The BLAS testers
+ have executable names of the form
+ \begin{verbatim}
+ x<pre>l<level>blastst
+ \end{verbatim}
+ The BLAS testers test ATLAS against a known-good implementation, so the first
+ thing to do is make sure the error is in ATLAS, and not the known-good
+ implementation. To do this, compile the reference BLAS from netlib, as
+ discussed in Section~\ref{sec-GoodBlas}, and then relink and rerun the
+ test in question. If the error goes away, you have found an error in your
+ known-good library, not ATLAS. If it stays, you have found an error in
+ ATLAS, and you should track it down or report it.
+
+ See Section~\ref{sec-LapackDebug} for information on tracking problems in the
+ LAPACK testers.
\end{document}