Knowledge Base

Below are historic knowledge base articles.

KB18
Summary

This article explains the difference between the ACML int64 and 64bit versions.

Details and Recommendations

Some of the 64-bit ACML libraries provide an additional int64 version. For instance, there are two 64-bit gfortran downloads: acml-4-2-0-gfortran-64bit.tgz and acml-4-2-0-gfortran-64bit-int64.tgz. So what’s the difference, and when should the int64 version be used?

Both of the named tgz files are 64-bit versions of ACML. 64-bit refers to the compiled code base: it is compiled for a 64-bit OS and follows the 64-bit application binary interface (ABI) rules. But acml-4-2-0-gfortran-64bit expects 32-bit integer values when integers are part of a function’s argument list.

The other library, acml-4-2-0-gfortran-64bit-int64, is the 64-bit, int64 library. It is still 64-bit code, but it uses 64-bit integer arguments for all functions.

int64 refers to the default type of integer arguments expected by the Fortran library functions (all ACML functions are Fortran callable). The default integer size is INTEGER*4: 4 bytes, or 32 bits, or int32 (which is never really used as a descriptor). int64 means using full 64-bit integers as the default integer size for Fortran calls. int64 has no effect at all on the default type of floating-point data and calling arguments.

int64 also affects the C-callable interface, which has to use the proper size of integer variables. Consider the DSCAL routine as an example, which has the Fortran interface:

SUBROUTINE DSCAL (N, A, X, INCX)

Here, A is the double-precision scale factor and X is the double-precision array. N and INCX are integers. By default, these integers are 32 bits. But if the array holds more than 2^31 - 1 elements (the largest value a signed 32-bit integer can represent), then N must be stored in more than 32 bits. In this case the program requires 64-bit integer parameters, and the int64 version of ACML would be used to supply the DSCAL routine.

Most applications will use the standard build (not int64). Some codes require the int64 build. For gfortran, the user will often know this is the case because -fdefault-integer-8 is on the compile line for their Fortran code.

Note that there is no performance penalty for using the int64 version.

Related Resources

ACML User Guide/FFT Documentation
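To make the DSCAL integer-size point in this article concrete, here is a minimal C sketch. It is not ACML code: fits_in_int32 and dscal_ref_int64 are hypothetical names chosen for illustration, and the loop only mirrors the argument order of DSCAL (N, A, X, INCX) with 64-bit integer arguments.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if an element count fits in a signed 32-bit integer
   (the default integer size of the standard build), 0 otherwise. */
int fits_in_int32(int64_t n)
{
    return n >= INT32_MIN && n <= INT32_MAX;
}

/* Reference scaling loop with a 64-bit count and stride.
   Illustrative only; a real application would call the library routine. */
void dscal_ref_int64(int64_t n, double a, double *x, int64_t incx)
{
    assert(incx > 0);               /* the sketch handles positive strides only */
    for (int64_t i = 0; i < n; ++i)
        x[i * incx] *= a;
}
```

A count of 2147483647 passes the check, while 2147483648 does not; that boundary is where an application would need 64-bit integer arguments.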

Audience:

CPU

Category:

Libraries

Rating:

Informational

Sub-Category:

AMD Core Math Library (ACML)

Last Updated:

03/03/2009

KB35

Audience:

CPU

Category:

Libraries

Rating:

Important

Sub-Category:

AMD Core Math Library (ACML)

Last Updated:

04/02/2009

Summary

What is the ECCN (Export Control Classification Number) for ACML?

Details and Recommendations

We are occasionally asked what the ECCN (Export Control Classification Number) for ACML is.

The AMD Global Trade group has given us the EAR99 designation.

This means that the AMD Core Math Library falls into a category of products that is not found on the Commerce Control List.

Summary

ACML-GPU 1.0 applications running on Linux systems that do not install GCC 4.1.* or the gfortran package by default will not be able to find a required library, libgfortran.so.1.

Details and Recommendations

ACML-GPU 1.0 is linked against GCC 4.1.2 and requires the gfortran library. On systems that do not install GCC 4.1.* or the gfortran package by default, ACML-GPU 1.0 applications will not be able to run until additional packages are installed. An example of such a system is OpenSUSE 11.0.

The solution is to find and install the appropriate package which contains libgfortran.so.1. Often, this can coexist with the existing installation of libgfortran.so, if there is already a more recent version.

On OpenSUSE 11.0, the RPM package, libgfortran41, can be found on the images for OpenSUSE 10.3.

Summary

The VALGRIND application verification and debugging tool reports bugs in ACML that are false alarms. We believe this is due to one or more bugs in VALGRIND.

Details and Recommendations

VALGRIND is a program analysis framework available online from valgrind.org. AMD is not associated with valgrind.org. We are not responsible for their product. Multiple users of ACML have reported that the VALGRIND tool reports bugs in ACML, such as:

Conditional jump or move depends on uninitialised value(s)

Uninitialised value was created by a stack allocation

Usually, the subroutine acmlcpuid() appears on the call stack.

These reports are false alarms, caused by bugs in VALGRIND. AMD has investigated these reports and determined that the code is correct and these alleged bugs do not exist.

The VALGRIND documentation itself advises against checking highly optimized code:

“Using -O0 is also a good idea, if you can tolerate the slowdown. With -O1 line numbers in error messages can be inaccurate, although generally speaking Memchecking code compiled at -O1 works fairly well and is recommended. Use of -O2 and above is not recommended as Memcheck occasionally reports uninitialised-value errors which don’t really exist.” (emphasis added)

Of course, the ACML libraries that we release are highly optimized for performance.

Also, we observe that these false alarms usually point to code executed after a CPUID instruction. The x86-64 CPUID instruction is unusual in that it writes destination registers EAX, EBX, ECX, and EDX, but those destination registers are not explicitly encoded in the binary instruction code bytes. Unlike the destination for a typical instruction like “add eax,ebx”, the registers modified by CPUID are implicit in the opcode.

VALGRIND’s binary disassembler seems to be confused on this. The false alarms indicate that the VALGRIND tool does not “understand” that the CPUID instruction writes information into those 4 registers; instead, the tool reports that they contain uninitialized values immediately after the execution of the CPUID instruction.

When using a depth/stencil format of GL_DEPTH24_STENCIL8 and displaying “Render Targets” on the HUD, the stencil buffer does not appear even though the app is using one.

The display of stencil buffers of the GL_DEPTH24_STENCIL8 format on the HUD has been disabled for the GPU PerfStudio 2.3 release due to a system hang that was being introduced. Retrieving the stencil buffer picture in the client should continue to work correctly.

KB47

Audience:

CPU

Category:

Linux/Solaris Application Help

Rating:

Important

Sub-Category:

Optimization and Performance

Last Updated:

06/30/2009

Summary

How to use CPU affinity with Platform MPI on AMD-based systems.

Details and Recommendations

Q: How do I enable CPU affinity with Platform MPI on an AMD-based system?

A: Many HPC applications are broadly “memory-bandwidth bound”; in that case, you may want to use more memory controllers. This is referred to as the “bandwidth” affinity mode. It implies distributing tasks across NUMA nodes, even though that places some of them further away from the controlling node. Remember that each AMD processor is a NUMA node, with its own integrated memory controller and associated bank of RAM. Other applications may be more sensitive to the additional latency that can arise from spreading out, and in that case the “latency” mode may be better.

These are the mpimon options that allow you to control affinity.

-affinity_mode automatic:bandwidth – use as many sockets as possible
-affinity_mode automatic:latency – use as few sockets as possible

The manual affinity mode is for manual placing of tasks, and it should be tried last after the other modes.

-affinity_mode manual:0x1:0x2:0x4:0x8 – ranks N..N+3 map to the masks corresponding to these bitwise-enumerated execution units in the node; on a standard core enumeration, this maps to the four cores of a quad-core processor.
-affinity_mode manual:0x1:0x4 – ranks N..N+1 map to the masks corresponding to these bitwise-enumerated execution units in the node; on a standard core enumeration, this maps to the first core and the third core of a quad-core processor.
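The bitwise enumeration used by the manual masks can be illustrated with a short C sketch. This is not part of Platform MPI: mask_to_units is a hypothetical helper that lists which execution units (bit positions, counted from 0) a mask selects, assuming the standard core enumeration mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stores the zero-based indices of the set bits of mask in out
   (at most max entries) and returns the total number of set bits. */
size_t mask_to_units(uint64_t mask, int *out, size_t max)
{
    size_t count = 0;
    assert(out != NULL || max == 0);
    for (int bit = 0; bit < 64; ++bit) {
        if (mask & (UINT64_C(1) << bit)) {
            if (count < max)
                out[count] = bit;
            ++count;
        }
    }
    return count;
}
```

With this helper, 0x1 selects unit 0 (the first core) and 0x4 selects unit 2 (the third core), which matches the manual:0x1:0x4 example. A mask such as 0x5 would select both of those units at once.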

There are several levels of granularity: socket, core, and execunit. The concept of “execunit” applies to platforms with a chip-level multi-threading model and does not apply to AMD processors.

-affinity_mode automatic:latency:socket – run on all execution units in a socket (on as few sockets as possible)
-affinity_mode automatic:latency:core – run on all execution units in a core (on as few sockets as possible)
-affinity_mode automatic:latency:execunit – run on a single execution unit in a node (on as few sockets as possible)
-affinity_mode automatic:bandwidth:socket – run on all execution units in a socket (on as many sockets as possible)
-affinity_mode automatic:bandwidth:core – run each process/rank exclusively on all execution units in a core (on as many sockets as possible)
-affinity_mode automatic:bandwidth:execunit – run on a single execution unit in a node (on as many sockets as possible)

Summary

OpenCL and the OpenCL logo are trademarks of Apple Inc. and used by permission by Khronos.

Details and Recommendations

OpenCL™ defines both an online compilation design flow as well as an offline compilation design flow. Online compilation involves passing the OpenCL™ C kernels to the OpenCL™ runtime at runtime. For ISVs and other developers concerned about making the source code of their OpenCL™ C kernels available to the end user, offline compilation is the preferred deployment method.

Set up the OpenCL™ platform with the CL_CONTEXT_OFFLINE_DEVICES_AMD property set to 1. This gives you the ability to build OpenCL™ C kernels for OpenCL™ devices which may not necessarily be installed or accessible in your build system.
Example:

Write out the binary kernels for each of the OpenCL™ devices you need to support. The exact mechanisms used are application-dependent. Each binary kernel returned from clGetProgramInfo() is for a specific device. To make it easier to remove the stored intermediate kernel representations later, it is recommended that you write each binary kernel into its own individual file.
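As a minimal sketch of the "one file per binary kernel" recommendation, the following C helper writes a single binary kernel that has already been retrieved to its own file. The name write_binary_kernel is hypothetical, and the code uses only the C standard library; how the bytes are obtained from clGetProgramInfo() is left to the application.

```c
#include <assert.h>
#include <stdio.h>

/* Writes size bytes from data to the file at path.
   Returns 0 on success and -1 on failure. */
int write_binary_kernel(const char *path, const unsigned char *data, size_t size)
{
    assert(path != NULL && data != NULL);
    FILE *f = fopen(path, "wb");    /* binary mode matters on Windows */
    if (f == NULL)
        return -1;
    size_t written = fwrite(data, 1, size, f);
    int close_failed = (fclose(f) != 0);
    return (written == size && !close_failed) ? 0 : -1;
}
```

An application would call this once per device, choosing a distinct file name for each device's binary.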

Remove the stored intermediate kernel representations from all binary kernel files before distributing your application, in order to protect your intellectual property. See the section below titled “Using ObjCopy To Remove Unwanted Sections From The Binary Kernel” for more information on how to accomplish this.

Package and distribute the processed binary kernel files as appropriate for your application. The exact mechanisms used are application-dependent.

For a sample program that demonstrates this technique, see the clbinarygen.c file in clbinary.zip.

Using ObjCopy To Remove Unwanted Sections From The Binary Kernel
Binary kernels are stored in the standard Executable and Linkable Format (ELF). The OpenCL™ runtime will generate 32-bit ELF for 32-bit OpenCL™ host applications and 64-bit ELF for 64-bit OpenCL™ host applications.
Binary kernel ELF files have the following five main sections:

.source

.llvmir

.amdil

.rodata

.text

The .text section contains the ISA code for the OpenCL™ device and the .rodata section contains information necessary for the OpenCL™ runtime to properly execute the ISA code. The remaining sections contain intermediate representations of the OpenCL™ C kernel as it is transformed by the OpenCL™ compiler. Only the .text and .rodata sections are necessary when distributing your OpenCL™ application. The other sections (.source, .llvmir, .amdil) are not used during normal application execution and it is recommended that you remove them before distributing your application.
The following steps assume that readelf and objcopy are present on your system. These tools are normally installed as part of binutils on most Linux® development systems. Under Microsoft® Windows®, it is recommended that you install Cygwin and install the binutils package (under the Devel category) in order to gain access to these tools.
The following steps should be followed in order to remove the .source, .llvmir and .amdil sections from your binary kernels:

<Machine> is the “machine” code obtained in the previous step. This code should be the full hexadecimal value reported by readelf, including the preceding 0x value.
It is expected that objcopy will report that it does not support the particular alternative machine code you provide. It will treat the number as an absolute e_machine value instead. This is the desired effect and should not be interpreted as an error.
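For readers curious where the “machine” code lives in the file, the following C sketch reads the e_machine field directly from the ELF header (bytes 18 and 19, in the byte order given by the header's data-encoding byte). It is only an illustration: read_elf_machine is a hypothetical helper, and readelf remains the tool this article relies on.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the e_machine value (0..65535) of the ELF file at path,
   or -1 if the file cannot be read or is not an ELF file. */
long read_elf_machine(const char *path)
{
    assert(path != NULL);
    unsigned char h[20];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(h, 1, sizeof h, f);
    fclose(f);
    if (got != sizeof h || h[0] != 0x7f || h[1] != 'E' ||
        h[2] != 'L' || h[3] != 'F')
        return -1;
    if (h[5] == 1)                  /* ELFDATA2LSB: little-endian fields */
        return (long)h[18] | ((long)h[19] << 8);
    if (h[5] == 2)                  /* ELFDATA2MSB: big-endian fields */
        return ((long)h[18] << 8) | (long)h[19];
    return -1;
}
```

Printing the result with a format such as "0x%lx" gives the value in hexadecimal with the leading 0x.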

The stripped_kernel.bin binary kernel file is ready for distribution. Repeat these steps on all generated binary kernel files.

Loading A Generated Binary Kernel
The steps involved in loading a generated binary OpenCL™ kernel are as follows:

Set up the OpenCL™ platform and create a context with the OpenCL™ devices you plan to use.
Example:

For each of the OpenCL™ devices you plan to use, load the appropriate generated binary kernels into individual character buffers. For this example, we assume that you are targeting a single OpenCL™ device and that your generated binary kernel is stored in binary.
Example:
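A minimal sketch of reading one generated binary kernel file into a character buffer is shown here. It uses only the C standard library; load_binary_kernel is a hypothetical helper name, and the returned buffer plays the role of binary in this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads the whole file at path into a newly allocated buffer.
   On success, returns the buffer (the caller frees it) and stores its
   length in *size_out. Returns NULL on failure or for an empty file. */
unsigned char *load_binary_kernel(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    unsigned char *binary = NULL;
    long len = -1;
    if (fseek(f, 0, SEEK_END) == 0 && (len = ftell(f)) > 0 &&
        fseek(f, 0, SEEK_SET) == 0) {
        binary = malloc((size_t)len);
        if (binary != NULL && fread(binary, 1, (size_t)len, f) != (size_t)len) {
            free(binary);           /* short read: treat as failure */
            binary = NULL;
        }
    }
    fclose(f);
    if (binary != NULL)
        *size_out = (size_t)len;
    return binary;
}
```

The buffer and its length are what the application would then hand to the OpenCL™ runtime when creating the program from binaries (in standard OpenCL™ this is typically done with clCreateProgramWithBinary()).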

Load the generated binary kernels into an OpenCL™ program. If you are loading multiple binary kernels at the same time, the binary kernels in the array must be in the same order as the matching devices in the device array.
Example:

Call clBuildProgram() to set up the necessary internal program state.
Example:

err = clBuildProgram( program, 1, &device_id, NULL, NULL, NULL );

At this point, you can treat the kernels in the OpenCL™ program as you would kernels that were compiled online.

For a sample program that demonstrates this technique, see the clbinaryuse.c file in clbinary.zip.

Notes

It should be noted that while online compilation allows the OpenCL™ runtime to retarget the OpenCL™ C kernel source dynamically to the currently available devices being used in the context, offline compilation, by its very nature, is essentially taking a snapshot of the generated binary kernel for a particular GPU. Developers are highly encouraged to not only generate separate binary kernels for each supported GPU, but to also test their binary kernels to ensure they function properly across the range of devices they wish to support.

The bitness of the generated binary kernel must match the bitness of the OpenCL™ runtime. Attempting to generate or load a binary kernel of a bitness that does not match the OpenCL™ runtime may result in undefined behavior. It is the application’s responsibility to perform the necessary bitness checks.
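One way an application might perform such a bitness check is to inspect the ELF class byte of a binary kernel that is already in memory. The sketch below is an assumption-laden illustration, not an AMD-provided routine: the helper names are hypothetical, and it assumes the runtime's bitness equals that of the host application, as the ELF discussion earlier in this article suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 32 or 64 for an ELF image, or 0 if the buffer is not ELF. */
int elf_image_bits(const unsigned char *image, size_t size)
{
    assert(image != NULL || size == 0);
    if (size < 5 || image[0] != 0x7f || image[1] != 'E' ||
        image[2] != 'L' || image[3] != 'F')
        return 0;
    if (image[4] == 1)              /* ELFCLASS32 */
        return 32;
    if (image[4] == 2)              /* ELFCLASS64 */
        return 64;
    return 0;
}

/* Returns 1 if the image's bitness matches the host application's
   pointer size, 0 otherwise. */
int kernel_matches_host_bitness(const unsigned char *image, size_t size)
{
    return elf_image_bits(image, size) == (int)(sizeof(void *) * 8);
}
```

An application could refuse to load any binary kernel for which kernel_matches_host_bitness returns 0.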

Currently, even if the binary kernel does not have intermediate kernel representations removed, the OpenCL™ runtime will not attempt to perform a full or partial recompilation. Each generated binary kernel is supported only on the OpenCL™ device it was originally generated for. Attempting to load a binary kernel onto an OpenCL™ device for which it was not originally generated may result in undefined behavior. It is the application’s responsibility to ensure that binary kernels are only loaded onto compatible OpenCL™ devices.

Currently, a binary kernel may only contain the ISA for a single OpenCL™ device. If support for several OpenCL™ devices is necessary, you must generate, store and load multiple binary kernels.

Related Resources

Summary

This application note describes the steps to take to successfully run your AMD APP application while remotely logged in to your system.

Details and Recommendations

AMD APP applications developed with the AMD APP SDK rely on AMD CAL to manage the AMD GPU for general-purpose computations. AMD CAL uses existing API hooks into the display driver to access your GPU. This works transparently if you are running the AMD APP applications locally on the graphics console. However, additional care must be taken when trying to run the AMD APP applications while remotely logged into your system.

The steps and suggestions provided in this application note should work on a wide variety of systems. Use them as a guide, and make adjustments for your particular setup.

Related Resources

Summary

How do I find a list of AMD products that work with the ATI Stream SDK?

or,

Does the ATI Stream SDK support my AMD GPU?

Details and Recommendations

We keep a list of supported devices, operating systems and compilers on the system requirements page located on the ATI Stream SDK page.

The link to the system requirements page is available below in “Related Resources”.

If you are doing double-precision floating point operations, be sure to choose a product that has double-precision floating point support. The products that do not have double-precision floating point support are appropriately annotated on the system requirements page.

Related Resources

Summary

This hotfix is no longer required with ATI Stream SDK v2.0! Please discontinue use of ATI Stream SDK v2.0-beta4 and download the production SDK instead.

In order to properly enable the OpenCL™ GPGPU benchmarks that are part of SiSoftware Sandra 2010 to run with the ATI Stream SDK v2.0-beta4, a hotfix patch must be applied to the default beta4 release.

This article describes the steps to take in order to properly patch an ATI Stream SDK v2.0-beta4 installation in order to enable the OpenCL™ GPGPU benchmarks in SiSoftware Sandra 2010 to run correctly.

Note: This hotfix is provided as is and is not supported by AMD. It has not completed full AMD testing, and is only recommended for users experiencing the particular issue described in this article.

Details and Recommendations

This hotfix resolves the following issues:

Enable the OpenCL™ GPGPU benchmarks that are part of SiSoftware Sandra 2010 to run properly with the ATI Stream SDK v2.0-beta4.

KB84

Audience:

GPU

Category:

SDKs

Rating:

Informational

Sub-Category:

AMD APP SDK

Last Updated:

05/25/2011

Summary

This article provides information about preview support for OpenCL™ / OpenGL® interoperability made available in the ATI Stream SDK v2.1.

Preview features are PROVIDED “AS IS” and WITHOUT WARRANTY OF ANY KIND. AMD is under no obligation to provide users with any updates, support, or maintenance of preview features or associated documentation. AMD is free to change interfaces and/or functionality at any time and without notice. Developers and users use preview features at their own risk.

Details and Recommendations

Interoperability between OpenCL™ and OpenGL® allows developers to pass computed data from an OpenCL™ kernel directly to the OpenGL® programming interface for rendering on the display. By passing the data directly to the OpenGL® programming interface, the developer can avoid unnecessary transferring of data across the PCIe® link, potentially resulting in improved overall application performance.

OpenGL® interoperability is based on the cl_khr_gl_sharing extension, described in the OpenCL™ 1.0 Specification (revision 48), section 9.11.

This preview feature is being offered to developers by AMD and conformance logs have not been submitted to the Khronos® OpenCL™ working group for this feature. This feature may behave differently from what is documented in the specification.

For more information about how to use this preview feature:

Read section 9.11 in the OpenCL™ 1.0 Specification (revision 48).

Since conformance logs have not been submitted for this feature, cl_khr_gl_sharing will not be present in the CL_PLATFORM_EXTENSIONS or CL_DEVICE_EXTENSIONS string. To access this feature, follow the specification as described in section 9.11 of the OpenCL™ 1.0 Specification (revision 48).

To use shared resources, the OpenGL® application must first create an OpenGL® context and then an OpenCL™ context. All resources created after the OpenCL™ context has been created can be shared between OpenGL® and OpenCL™. If resources are allocated before the OpenCL™ context is created, they cannot be shared between OpenGL® and OpenCL™.

Summary

An additional file is required to properly access the preview support for OpenCL™ / Microsoft® DirectX® 9 & 10 interoperability made available in the ATI Stream SDK v2.01.

Preview features are PROVIDED “AS IS” and WITHOUT WARRANTY OF ANY KIND. AMD is under no obligation to provide users with any updates, support, or maintenance of preview features or associated documentation. AMD is free to change interfaces and/or functionality at any time and without notice. Developers and users use preview features at their own risk.

Related Resources

Summary

This article lists common OpenCL™ error messages that developers may encounter when using the OpenCL™ ICD (Installable Client Driver) feature in the ATI Stream SDK v2.2.

If your OpenCL™ code was written with a previous beta release of the ATI Stream SDK v2.0, read the following article for instructions on updating it so it continues to work with the ATI Stream SDK v2.2:

Details and Recommendations

Error Message: clGetPlatformIDs() failed

Possible Causes:

Under Linux®, this error message appears if you try to execute ATI Stream SDK v2.2 applications without properly registering the OpenCL™ ICD model during installation. See ATI Stream SDK Installation Notes (v2.1), section 2.2, step 5, for instructions on how to register the OpenCL™ ICD model under Linux®.

Under Linux®, this error message can also appear if you try to execute ATI Stream SDK v2.2 applications without properly setting the ATISTREAMSDKROOT environment variable during installation. See ATI Stream SDK Installation Notes (v2.2), section 2.2, step 2, for instructions on how to properly set the ATISTREAMSDKROOT environment variable under Linux®.

This error message appears if you try to run samples and applications originally built with a previous beta version of the ATI Stream SDK v2.0. Run the samples provided with the ATI Stream SDK v2.2, and follow the instructions in KB71 to update your existing OpenCL™ applications so they work with the OpenCL™ ICD model.

Summary

This article provides information about experimental OpenCL™ C++ bindings made available in the AMD APP SDK v2.3.

Preview features are PROVIDED “AS IS” and WITHOUT WARRANTY OF ANY KIND. AMD is under no obligation to provide users with any updates, support, or maintenance of preview features or associated documentation. AMD is free to change interfaces and/or functionality at any time and without notice. Developers and users use preview features at their own risk.

Details and Recommendations

Due to popular demand, the Khronos® OpenCL™ Working Group has introduced a set of experimental C++ bindings to OpenCL™ 1.0. These experimental bindings allow developers to access the OpenCL™ 1.0 runtime API from C++ host code instead of C host code.

The experimental OpenCL™ C++ bindings are used in a variety of code samples and examples from AMD:

Summary

Uninstall prior versions of the ATI Stream SDK on your system before installing newer versions of the SDK. This article refers to installation of SDK version 2.3 or earlier. Starting with version 2.4, please refer to the Installation Notes for instructions on uninstalling prior versions.

Details and Recommendations

If you are installing SDK version 2.4 or newer, please refer to the Installation Notes for instructions on uninstalling prior versions.

If you are installing SDK version 2.3 or earlier, follow these steps to uninstall prior versions of the ATI Stream SDK on your system:

Reboot.

Use Programs and Features (Windows Vista or 7) or Add or Remove Programs (Windows XP) to uninstall the prior SDK files.

Manually remove any ATI Stream directories under My Documents and under Program Files or Program Files (x86). To do this, you may need to be logged in as Administrator for Windows, or root for Linux.

Delete the entire directory where you previously unzipped the previous OpenCL package for installation. The previous OpenCL installations do not have an automatic uninstall feature and must be deleted manually.

Ensure that your system PATH variable no longer references any directories in your previous OpenCL installation.

If you are using a command shell under Windows, exit from any open command shells and restart those command shells to ensure your new PATH settings are used. Under Linux, log out of your current session, then log back in to ensure your new PATH settings are used.

Search for and remove all atiocl* files on the system.

If you have, or previously had, the ATI Stream Profiler or Stream Kernel Analyzer (SKA) installed, remove any of the following temporary directories that exist:

XP 32-bit: C:\ATI\SUPPORT\streamsdk_2-3_xp32

XP 64-bit: C:\ATI\SUPPORT\streamsdk_2-3_xp64

Vista and Win7 32-bit: C:\ATI\SUPPORT\streamsdk_2-3_win732

Vista and Win7 64-bit: C:\ATI\SUPPORT\streamsdk_2-3_win764

KB75

Audience:

CPU; GPU

Category:

SDKs

Rating:

Important

Sub-Category:

AMD APP SDK

Last Updated:

01/20/2011

Summary

This article provides tips and suggestions for running the OpenCL™ GPGPU benchmarks that are part of SiSoftware Sandra 2010 with the AMD APP SDK v2.3.

Details and Recommendations

Installation

Go to the AMD APP SDK v2.3 page and download the SDK package that matches your operating system and bitness. Download the latest display driver that matches your operating system and bitness.

Make sure all previous AMD display drivers and AMD APP SDK packages are uninstalled. Reboot the system if required by the uninstall process.

Install the new display driver hotfix package on your system. Reboot the system if required by the installation process.

Install the AMD APP SDK package on your system.

Obtain the latest copy of SiSoftware Sandra 2010 from SiSoftware. Visit the SiSoftware Sandra 2010 website for more details. At the time this article was written, the latest version is: Sandra 2010 (Final 16.11) – 3 Dec 2009.

Run the installation file to install SiSoftware Sandra 2010 on your system. It is recommended that you install a full installation and allow the installer to perform any recommended component updates to ensure that you have the latest software components installed on your system.

Running the Benchmark

Start SiSoftware Sandra 2010 from the Start menu or from the desktop icon.

If required, enter your Sandra 2010 key when prompted by the program.

If a “Tip of the Day” dialog box appears, dismiss it by clicking on the green check mark.

Near the top of the application window there is a set of tabs, such as “Home”, “Tools”, “Benchmarks”, etc. If the “Benchmarks” tab is not already selected, click on “Benchmarks” to go to that tab.

On the “Benchmarks” tab, find the section marked “Graphics Processor” and double-click on the “GPGPU Processing” icon. If a dialog box appears with the title “Customise Rank Engines”, you can dismiss it by clicking on the red X if you do not wish to configure this. The benchmark will run correctly without completing this configuration.

The benchmark will immediately start running. This may take several minutes. Allow it to finish running before proceeding. You will know the benchmark has completed when the indicator in the upper right-hand corner of the window has disappeared.

By default, the benchmark will attempt to load balance across both the CPU and the GPU in your system. You can tell if the benchmark is set to load balance across both the CPU and the GPU by looking at the setting in the “Video Adapter” field near the top of the window. If it lists both the GPU and the CPU in your system, it is attempting to do load balancing.

Currently the load balancing code in Sandra 2010 is not optimal and does not fully utilize the processors in the system. AMD and SiSoftware are working to provide an optimized version in a future release of Sandra.

For now, it is recommended, for the purposes of benchmarking OpenCL™ GPGPU, that you run Sandra 2010 only on the GPU. You can do this by choosing the entry in the drop down box for “Video Adapter” so that it only lists your GPU.

Once you make this selection, the benchmark will immediately rerun. Wait for the benchmark to finish running before proceeding. A Mandelbrot image may be displayed briefly on your screen. This is expected and part of the benchmarking process.

Sandra 2010 has the ability to run ATI CAL/IL kernel benchmarks as well. It is recommended that you run the OpenCL benchmarks. By default, as long as you have properly applied the hotfix mentioned above, the benchmark will choose to run the OpenCL benchmark. You can tell which interface is being used by looking at the “Type” field near the top of the window. If that field indicates it is using the “OpenCL Processor”, then you are running the OpenCL benchmarks. If that field indicates it is using the “STREAM Processor”, then you are running the CAL/IL benchmarks.

Another way to tell if you are running the OpenCL benchmarks is to go to the “Combined Results” tab in the window and look at the text box at the bottom of the window. Under “Benchmark Results”, there is a field named “Interface”. To the right of that field, it should say “OpenCL”.

If you were not running the OpenCL benchmarks, select “OpenCL Processor” in the “Type” field and allow the benchmark to rerun before proceeding.

Reviewing the Results

After allowing the benchmark to finish running the OpenCL benchmarks on the GPU, click on the “Combined Results” tab.

Look at the text box at the bottom of the window. You will find the single-precision floating point results under “Benchmark Results” to the right of the field marked “Native Float Shaders”. For reference, on an ATI Radeon™ HD 5870, the results should be around 1.8GPixel/s. Minor variations may be encountered due to system configurations. However, if your results differ significantly, please ensure that you have followed the instructions above. If your results still differ significantly after reviewing the instructions above please file a help desk request so that we can help you resolve the issue.

The “Emulated Double Shader” benchmarking results use single-precision floating point operations to emulate double-precision floating point calculations. “Emulated Double Shader” performance is not representative of the achievable double-precision floating point performance of AMD GPUs and the AMD APP SDK. We will be adding native double-precision floating point support in a future release and will work with SiSoftware to enable native double-precision floating point benchmarks at that time.

Notes

It is known that the “GPGPU Memory Bandwidth” benchmark reports lower than expected results. AMD and SiSoftware are working to improve the code, driver and SDK for a future release.

Summary

This article provides information about the preview feature for accessing additional physical memory on the GPU from OpenCL™ applications in the AMD APP SDK v2.2.

Preview features are PROVIDED “AS IS” and WITHOUT WARRANTY OF ANY KIND. AMD is under no obligation to provide users with any updates, support, or maintenance of preview features or associated documentation. AMD is free to change interfaces and/or functionality at any time and without notice. Developers and users use preview features at their own risk.

Details and Recommendations

The AMD APP SDK v2.3 currently defaults to exposing 50% of the physical GPU memory to OpenCL™ applications. Certain developers may need access to additional physical memory in order to optimize their applications when running on the GPU.

For developers who wish to experiment with increasing the amount of physical memory that is accessible to their OpenCL™ applications, the default 50% setting can be changed by setting the environment variable GPU_MAX_HEAP_SIZE to the percentage of total GPU memory that should be exposed.

For example, to set the exposed GPU physical memory size to 75%, set the GPU_MAX_HEAP_SIZE environment variable to 75.

GPU_MAX_HEAP_SIZE must be set to an integer value between 0 and 100, inclusive.

It should be noted that changing the default setting for exposed GPU physical memory to the OpenCL™ application may result in unexpected behavior. This preview feature is provided solely to allow developers to experiment with accessing a larger portion of the GPU physical memory than is normally exposed.
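As a minimal sketch of the setting described above, the following Python helper validates the percentage and builds an environment for a child process with GPU_MAX_HEAP_SIZE set. The function names and the idea of launching the application through a wrapper are assumptions made for illustration; only the variable name and the 0 to 100 integer range come from this article.

```python
import os
import subprocess


def heap_size_env(percent, base_env=None):
    """Return a copy of the environment with GPU_MAX_HEAP_SIZE set."""
    # The article requires an integer between 0 and 100, inclusive.
    if isinstance(percent, bool) or not isinstance(percent, int):
        raise TypeError("percent must be an integer")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100, inclusive")
    env = dict(os.environ if base_env is None else base_env)
    env["GPU_MAX_HEAP_SIZE"] = str(percent)
    return env


def run_with_heap_size(command, percent):
    """Launch `command` (a hypothetical OpenCL application) with the variable set."""
    return subprocess.run(command, env=heap_size_env(percent), check=False)
```

Setting the variable only for the child process keeps the experiment local to one application instead of changing the behavior of every OpenCL™ program started from the same shell.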

Related Resources

Summary

AMD APP SDK v2.3 supports OpenCL™ applications with Minimalist GNU for Windows (MinGW) [GCC 4.4]. Some modifications to include files and sample source files are necessary in order to successfully compile and link those applications with MinGW.

Details and Recommendations

In order to successfully compile and link AMD APP SDK v2.3 OpenCL™ applications with Minimalist GNU for Windows (MinGW) [GCC 4.4], the following modifications need to be made.

cl_platform.h

Include the header file ‘stdint.h’.

Samples

Add opengl32 and glu32 to link libraries for OpenGL samples.

Include the header file ‘malloc.h’, if it is not already included.

Modify your source code to use __mingw_aligned_malloc instead of _aligned_malloc.

Makefiles

You must supply your own makefiles to compile the samples or your own code.

Related Resources

Summary

When OpenCL™ applications are compiled with MinGW-w64, the application exits when an OpenCL™ runtime routine is called.

Details and Recommendations

This is a known issue with the MinGW-w64 compiler.

Whereas the MinGW (32-bit) compiler does not require a libopencl.a file, MinGW-w64 does require such a file. When a libopencl.a file is not available and specified during the link stage, no link errors are reported. However, execution of the resulting binary executable terminates whenever an OpenCL™ runtime routine is called.

The recommended workaround, until this is resolved in MinGW-w64, is to generate a libopencl.a file from the OpenCL.dll provided with the ATI Stream SDK v2 installation.

You will need to make sure the following tools are installed on your system:

gendef (a MinGW-w64 utility)

dlltool (a MinGW-w64 utility)

Run the following commands to generate libopencl.a from OpenCL.dll:

gendef OpenCL.dll (this generates an OpenCL.def file)

dlltool -l libopencl.a -d OpenCL.def -k -A

Copy the resulting libopencl.a file into $ATISTREAMSDKROOT\lib.

When compiling an OpenCL™ application with MinGW-w64, make sure to specify the libopencl.a file as part of the input into the link stage of the compilation.

The same procedure must be done for libglut64.a (from glut64.dll) and libaticalcl64.a (from aticalrt64.dll) if those libraries are linked to directly by the application using MinGW-w64.

Related Resources

Summary

In the ATI Stream SDK v1.4-beta, programming techniques shown in the ATI CAL sample programs for accessing input stream data buffers may not work as expected on newer AMD GPUs.

Developers who used these techniques when writing their own ATI CAL-based programs need to be aware of potential issues and, when appropriate, make corresponding source code changes, in particular when processing double-precision floating point data. Due to the subtle nature of the differences between previous generations of AMD GPUs and current generation AMD GPUs, the potential issues may not necessarily be detected through normal software testing.

Tags: gpu, ati stream, ati stream sdk, cal

Details and Recommendations

Original Coding Technique

In the ATI Stream SDK v1.4-beta CAL sample programs:

All stream data buffers are allocated with one of the type codes indicating single-precision floating point data: CAL_FORMAT_FLOAT_1, CAL_FORMAT_FLOAT_2 or CAL_FORMAT_FLOAT_4 — regardless of what type of data is actually in the buffer.

The corresponding texture samplers are all left in their default state (point sampling). For example, the sampler parameter extensions documented in Appendix B.3.7 of the “ATI Stream Computing User Guide” are not used to modify the sampler.

Data is read into the kernel’s registers with a sample_resource IL instruction. For example: "sample_resource(0)_sampler(0) r3, vWinCoord0.xyxx\n"

The intent of this sample is to read the binary data from the buffers into the kernel registers without any modification, bypassing most of the numeric features of the texture sampler.

For most of the sample programs in the SDK, the data happens to be single-precision floating point data, but where the sample programs use double-precision floating point data, the same coding techniques are used. Buffers of elements with one double-precision floating point number (8 bytes per element) are allocated as CAL_FORMAT_FLOAT_2 and buffers of elements with two double-precision floating point numbers (16 bytes per element) are allocated as CAL_FORMAT_FLOAT_4.

For AMD GPUs based on the ATI Radeon™ HD 4000 Series GPUs or earlier, the texture sampler hardware will treat a sample instruction from a CAL_FORMAT_FLOAT_* texture buffer with point sampling as a raw data access and pass the data bits through without modification.

For more recent AMD GPUs based on the ATI Radeon™ HD 5000 Series GPUs and future processors, the texture sampling hardware in this situation will assume that the data buffer contains single-precision IEEE floating point data as the program has declared, and will perform validation of that data. Denormal values may be converted to zero. Indeterminate NaN values will be converted to a canonical non-signaling NaN bit pattern. This is equivalent to adding 0.0f to each component value.

This is not an issue if the application program using the technique from the ATI Stream SDK v1.4-beta CAL samples provides single-precision IEEE floating point data.

But if the application program provides double-precision IEEE floating point data, or provides fixed-point binary data, or some mixture of data types, then this modification of the input bit patterns may produce unexpected mathematical differences that may be difficult to detect.
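To make the effect concrete, here is a minimal Python sketch that models the validation described above on the two 32-bit words of a double. It is a software model only: the function names are hypothetical, and the exact bit patterns the hardware produces (in particular the canonical NaN value and whether the sign of a flushed denormal is kept) are assumptions.

```python
import struct

# Assumed quiet-NaN pattern; the hardware's exact choice is not stated in this article.
CANONICAL_NAN = 0x7FC00000


def sanitize_float_word(word):
    """Model the validation of one 32-bit word that was declared as a float."""
    exponent = (word >> 23) & 0xFF
    mantissa = word & 0x007FFFFF
    if exponent == 0 and mantissa != 0:
        return word & 0x80000000   # denormal: flush to (signed) zero
    if exponent == 0xFF and mantissa != 0:
        return CANONICAL_NAN       # NaN: replace with one canonical pattern
    return word                    # zeros, normals and infinities pass through


def double_through_float_sampler(value):
    """Pass the raw bits of a double through the modeled float validation."""
    low, high = struct.unpack("<II", struct.pack("<d", value))
    low, high = sanitize_float_word(low), sanitize_float_word(high)
    return struct.unpack("<d", struct.pack("<II", low, high))[0]
```

In this model 1.0 passes through unchanged, but 1.0 + 2**-52 comes back as exactly 1.0: the low word of that double is 0x00000001, which looks like a single-precision denormal and is flushed to zero. The error is one unit in the last place, which is why such differences are easy to miss in ordinary testing.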

Recommended Coding Technique

If your ATI CAL application program is passing IEEE double-precision floating point input data and/or fixed-point input data to the GPU, the following changes are urgently recommended. If your ATI CAL application is only passing IEEE single-precision floating point data to the GPU, these changes are still recommended, for consistency and to prevent future issues:

All data buffers used as input stream buffers should be allocated with a corresponding INT32 format:

Replace CAL_FORMAT_FLOAT_1 with CAL_FORMAT_SIGNED_INT32_1 or CAL_FORMAT_UNSIGNED_INT32_1.

Replace CAL_FORMAT_FLOAT_2 with CAL_FORMAT_SIGNED_INT32_2 or CAL_FORMAT_UNSIGNED_INT32_2.

Replace CAL_FORMAT_FLOAT_4 with CAL_FORMAT_SIGNED_INT32_4 or CAL_FORMAT_UNSIGNED_INT32_4.

Input buffer declarations in the IL kernels should be changed from (float) to (unknown). For example: "dcl_resource_id(0)_type(2d,unnorm)_fmtx(unknown)_fmty(unknown)_fmtz(unknown)_fmtw(unknown)\n"

1D buffers used as constant buffers should still be allocated with the CAL_FORMAT_FLOAT_4 format.

Since all possible bit patterns of an INT32 or UNSIGNED_INT32 are uniquely valid, declaring these data buffers as containing 32-bit integer values forces the texture sampler to pass the input bits through unmodified (assuming the sampler state has not been modified with the sampler parameter extensions to perform filtering). This implements the original intention of a raw data transfer.

Changing the kernel resource declaration is not strictly necessary and will have no effect with current AMD GPUs and current versions of ATI CAL, but is still recommended to avoid future issues. Technically, declaring the data buffer as a 32-bit integer and the sampler output as a floating point number could be interpreted as requesting that the texture sampler perform a data conversion. We recommend removing this potential ambiguity.

Input data that a kernel receives from constant buffers and the global buffer do not pass through a texture sampler, so they are not affected. Double-precision floating point or fixed-point constants can be used with these buffers without any issues.

Declaring data buffers which contain IEEE double-precision floating point values as CAL_FORMAT_DOUBLE_1 or CAL_FORMAT_DOUBLE_2 would also prevent any issues with current AMD GPUs, but the use of the INT32 formats is preferred, since it produces a raw transfer of unmodified bits and the expected results in all instances. For example, if an application needs to mix single-precision floating point and double-precision floating point data in the same buffer, then only the INT32 format would produce the correct result.

The recommended coding technique is to always use CAL_FORMAT_SIGNED_INT32_* or CAL_FORMAT_UNSIGNED_INT32_*, except for constant buffers, the global buffer, or cases where you specifically want the texture sampler to perform data conversion and/or interpolation filtering.

Related Resources

Summary

This article provides information about the AMD vendor extension for double-precision floating point basic arithmetic in OpenCL™ C kernels.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Tags: gpu, ATI Stream SDK, OpenCL

Details and Recommendations

This preview feature is being offered to developers by AMD and does not constitute support for any official optional double-precision floating point extension as defined in the OpenCL™ 1.0 Specification.

Double-precision floating point addition, subtraction and multiplication operators are available for use in OpenCL™ C kernels.

Before using double data types and/or double-precision floating point operators in OpenCL™ C kernels, you need to include the #pragma OPENCL EXTENSION cl_amd_fp64 : enable directive.

Double-precision floating point built-in function support will be added in a future release.

Use of double-precision floating point arithmetic on GPUs other than ATI Radeon™ HD 5900 Series GPUs, ATI Radeon™ HD 5800 Series GPUs, ATI Radeon™ HD 4800 Series GPUs, ATI Mobility Radeon™ HD 4800 Series GPUs, ATI FirePro™ V8800 Series GPUs, ATI FirePro™ V8700 Series GPUs, ATI FirePro™ V7800 Series GPUs and AMD FireStream™ 9200 Series GPUs may result in unexpected behavior.

The ATI Stream SDK v2.1 supports additional double-precision floating point functionality in OpenCL™ C kernels as part of the cl_amd_fp64 extension when kernels are running on x86 CPUs.

The following set of double-precision floating point functionality is supported in OpenCL™ C kernels for x86 CPUs:

Related Resources

Summary

The production version of the ATI Stream SDK v2.0 with OpenCL™ 1.0 support introduced the OpenCL™ ICD (Installable Client Driver) as part of the software stack. Code written with previous beta releases of the ATI Stream SDK v2.0 requires changes to comply with the OpenCL™ ICD requirements.

Please read the remainder of this article for information on what you must update in your OpenCL™ code to ensure that your application continues to work properly with the ATI Stream SDK v2.2.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Tags: gpu, opencl, ati stream sdk, ati stream, GPGPU, GPU Programming

Details and Recommendations

What is the OpenCL™ ICD?

The OpenCL™ ICD (Installable Client Driver) is a means of allowing multiple OpenCL™ implementations to co-exist and applications to select between them at runtime.

How does it affect developers?

Your application is now responsible for selecting which of the OpenCL platforms present on a system it wishes to use, instead of just requesting the system default.
This means using the clGetPlatformIDs() and clGetPlatformInfo() functions to examine the list of available OpenCL™ implementations and selecting the one which best suits your requirements. We suggest that developers offer their users a choice on first run of the program or whenever the list of available platforms changes.

How does it affect end-users?

A properly implemented ICD and OpenCL™ library should not affect the end-user experience at all, except as mentioned above.

What has changed?

In previous beta releases functions such as clGetDeviceIDs() and clCreateContext() accepted a NULL value for the platform parameter. This release no longer allows this – the platform must be a valid one obtained by using the platform API.

How do I use it?

The sample code in the ATI Stream SDK v2.2 contains examples of how to query the platform API and call the functions that require a valid platform parameter.
The following shows code that worked with previous beta releases along with an example of updates that would make the code snippet work properly with the ATI Stream SDK v2.1:

Summary

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Tags: gpu, ATI Stream SDK, OpenCL, Direct 3D, Direct X

Details and Recommendations

Interoperability between OpenCL™ and Microsoft® DirectX® 9 allows developers to pass computed data from an OpenCL™ kernel directly to the DirectX® 9 programming interface for rendering on the display. By passing the data directly to the DirectX® 9 programming interface, the developer can avoid unnecessary transferring of data across the PCIe® link, potentially resulting in improved overall application performance.

DirectX® 9 interoperability is currently not an official OpenCL™ 1.0 extension. This extension is being offered to developers by AMD and may change if and when the Khronos® OpenCL™ working group releases an official DirectX® 9 interoperability extension.

AMD has used this extension in various public demonstrations of OpenCL™. This extension is being made available to enable similar demonstrations to be created by the AMD OpenCL™ developer community.

Before using this extension, you need to include the #pragma OPENCL EXTENSION cl_amd_d3d9_sharing : enable directive.

For more information about this extension, please see the header files included as part of the ATI Stream SDK v2.1:

include/CL/cl_d3d9.h

Use of this extension on GPUs other than ATI Radeon™ HD 5000 Series GPUs, ATI Mobility Radeon™ HD 5000 Series GPUs, ATI FirePro™ V8700 Series GPUs, ATI FirePro™ V7800 Series GPUs, ATI FirePro™ V5800 Series GPUs, ATI FirePro™ V4800 Series GPUs and ATI FirePro™ V3800 Series GPUs may result in unexpected behavior.

Summary

Your ATI CAL applications do not run properly when you install ATI Catalyst 9.3 on a Windows XP system.

Tags: ATI Stream, ATI Stream SDK, CAL, Catalyst

Details and Recommendations

This is a known issue and has been resolved with this Catalyst 9.3 HOTFIX.

Related Resources

KB28

Audience:

GPU

Category:

SDKs

Rating:

Informational

Sub-Category:

AMD APP SDK

Last Updated:

02/11/2010

Summary

Even though Ubuntu is not officially supported by ATI Stream SDK, enough people want to use the two together that we decided to put together an unofficial guide, pulling in various tips from the ATI Stream developer forum and from our own internal experiments.

NOTE: This is not a statement of official support for Ubuntu 8.04.1 LTS with ATI Stream SDK. These are notes taken when running our own experiments. Your mileage may vary…

Tags: ATI Stream, Ubuntu

Details and Recommendations

Using Ubuntu 8.04.1 LTS Desktop w/ ATI Stream SDK v1.2.1-beta

Download ubuntu-8.04.1-alternate-amd64.iso from Ubuntu website and burn it to DVD.

Go to the directory with the installer file and type “tar xvfz amdstream-1.2.1_beta-lnx64.tar.gzip”

Make sure “alien” is installed. It is the tool used to convert between .rpm files and .deb files

Type “which alien”. If it does not find the executable, you need to use apt-get to install “alien” before proceeding

Type “apt-get install alien”. If that does not work, it is either because your Internet connection is down or you have not updated your package source database. Check that your Internet connection is working and type “apt-get update”

The next few steps will repackage the .rpm files into a .deb file suitable for use on Debian-based systems (such as Ubuntu):

dd if=amdstream-cal-1.2.1_beta.x86_64.run of=amdcal.tar.gz bs=1 skip=16384 [this will take a few moments; it strips away the run script at the beginning]

mkdir amdcal; cd amdcal; tar xvfz ../amdcal.tar.gz

alien amdstream-cal-1.2.1_beta-1.x86_64.rpm

dpkg -i amdstream-cal_1.2.1_beta-2_amd64.deb

cd ..

dd if=amdstream-brook-1.2.1_beta.x86_64.run of=amdbrook.tar.gz bs=1 skip=16384 [this will take a few moments; it strips away the run script at the beginning]

Go to “/usr/local/amdcal/bin/lnx64” and type “./FindNumDevices”. This should return “Device Count = 1” if you have a single GPU in your system supported by CAL.
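The dd commands in the steps above copy one byte at a time (bs=1), which is why they take a few moments. As a sketch of the same operation, the following Python function seeks past the leading run script and copies the rest in large chunks. The function name is hypothetical, and the 16384-byte offset is taken directly from the dd commands; it is an assumption that the offset is the same for any other installer.

```python
import shutil

# Size of the leading run script, taken from the dd commands (skip=16384 with bs=1).
HEADER_BYTES = 16384


def strip_run_script(run_path, archive_path, header_bytes=HEADER_BYTES):
    """Copy everything after the first `header_bytes` bytes of run_path to archive_path."""
    with open(run_path, "rb") as src, open(archive_path, "wb") as dst:
        src.seek(header_bytes)        # skip the shell script prefix
        shutil.copyfileobj(src, dst)  # copy the embedded tar.gz payload in chunks
```

For example, strip_run_script("amdstream-cal-1.2.1_beta.x86_64.run", "amdcal.tar.gz") would produce the same archive as the first dd command.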

NOTE: You will see a long backtrace whenever you run applications with CAL 1.2.1-beta on Ubuntu 8.04 starting with “Locking assertion failure. Backtrace:…”. This is harmless and simply annoying. Unfortunately, there is no easy way to turn this off at the moment unless you recompile libxcb. This will be fixed in a future version of CAL.

If you are adventurous and want to compile out the locking checks temporarily from libxcb, you can do the following:

Create a working directory and cd to that directory.

Type: apt-get source libxcb

Type: cd libxcb-1.1

Type: ./configure CFLAGS=-DNDEBUG (it may complain about missing packages, in which case, use Synaptic Package Manager from the GUI to grab the appropriate packages and run the command again)

Now, your programs should use your newly compiled version of libxcb without the locking checks.

Related Resources

KB21

Audience:

GPU

Category:

SDKs

Rating:

Important

Sub-Category:

AMD APP SDK

Last Updated:

02/11/2010

Summary

How do I get a previous version of the ATI Stream SDK?

or,

My application no longer works with a recent version of the ATI Stream SDK! I need to downgrade back to a previous version of the SDK to find out when it stopped working.

Tags: ATI Stream, ATI Stream SDK

Details and Recommendations

We recommend that you report all compatibility issues with a new version of the ATI Stream SDK to streamdeveloper@amd.com. Our intention is to maintain backward compatibility whenever possible. Occasionally, we may miss one or two cases, and we would love to hear about those so that we can improve our testing suite.

In the event that you absolutely need a previous version of the ATI Stream SDK, you can download it from the site below in “Related Resources”. In some cases, you may also need to downgrade your ATI Catalyst driver version.

Related Resources

Summary

When running ATI Stream or other graphics applications on an ATI Radeon™ HD 4870 X2 GPU, the display may briefly flicker.

Tags: opencl, gpu, ati stream sdk, cal, radeon, ATI Stream

Details and Recommendations

This is a known condition with the ATI Radeon™ HD 4870 X2 GPU related to engine and memory clock frequency adjustments at runtime due to an increase or decrease in GPU activity. Occasionally, the retraining time for the GDDR5 memory takes slightly longer than the refresh cycle of a single display frame. When that happens, the user may notice a brief flicker on their display.

This is expected and can be safely ignored. Data and computational integrity remain unaffected.

Related Resources

KB44

Audience:

GPU

Category:

SDKs

Rating:

Critical

Sub-Category:

AMD APP SDK

Last Updated:

05/22/2009

Summary

Programs using dcl_input_interp(linear) no longer use position as the input to IL_PS programs after switching to the most recent version of Catalyst.

Tags: ATI Stream SDK, CAL, Catalyst

Details and Recommendations

Starting with Catalyst 9.6, any IL program which uses dcl_input_interp(linear) will fail where previously it would have worked.

The correct syntax is dcl_input_position_interp(linear). The correct syntax will work with all previous versions of Catalyst as well as all future versions of Catalyst.

In previous versions of Catalyst, the compiler incorrectly enabled position to be the input to IL_PS programs when dcl_input_interp(linear) was used. A fix was made in the compiler to assume generic usage when usage is missing from the dcl_input syntax. All other usages (such as position) need to be explicitly specified.

Original (incorrect) usage example:

dcl_input_interp(linear) v0.xy__

Correct usage version:

dcl_input_position_interp(linear) v0.xy__
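For projects with many IL kernels, the correction above can be applied mechanically. The following Python sketch rewrites declarations that omit the usage so that they name position explicitly. The function name is hypothetical, and the sketch assumes that every dcl_input declaration without a usage in your sources was relying on the old position behavior; review the result if some inputs are meant to be generic.

```python
import re

# Matches a dcl_input declaration that goes straight to _interp(...),
# that is, one with no usage (such as position) specified.
_MISSING_USAGE = re.compile(r"\bdcl_input_interp\(")


def add_position_usage(il_source):
    """Return IL source with dcl_input_interp(...) rewritten to name position."""
    return _MISSING_USAGE.sub("dcl_input_position_interp(", il_source)
```

Declarations that already use dcl_input_position_interp are left untouched, so the rewrite can safely be run more than once on the same file.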

Related Resources

Summary

When you load the ATI Stream SDK v1.4-beta samples browser file (/usr/local/atibrook/SDKBrowser/SDK_Browser.html) into a browser under Linux and click on an EXE link, the browser reports a “File not found” error.

Tags: ATI Stream SDK

Details and Recommendations

There was an error introduced into the ATI Stream SDK v1.4-beta samples browser file for the Linux version of the SDK. This incorrectly caused a .exe extension to be appended to all of the executable names in the samples browser file.

Corrected files are available below. Copy the patch package to an empty working directory and type:

Related Resources

Summary

When installing ATI Stream SDK v1.3-beta or later, the ATI CAL libraries are no longer installed as part of the SDK installation.

Tags: ATI Stream SDK, CAL, Catalyst

Details and Recommendations

Starting with ATI Stream SDK v1.3-beta, the ATI CAL libraries are no longer shipped with the SDK installer. They are now shipped with the ATI Catalyst driver suite to enable ATI Stream applications based on ATI CAL to be more easily deployed and to allow a greater number of users access to the power of ATI Stream Technology.

You will need to install ATI Catalyst 8.12 or later in order to install the ATI CAL libraries.

The ATI CAL libraries are now located in the following places:

Windows XP 32:

%WINDIR%\system32

Windows XP 64:

64-bit libraries: %WINDIR%\system32

32-bit libraries: %WINDIR%\SysWOW64

Windows Vista 32:

%WINDIR%\system32

Windows Vista 64:

64-bit libraries: %WINDIR%\system32

32-bit libraries: %WINDIR%\SysWOW64

Linux (RHEL) 32:

/usr/lib

Linux (RHEL) 64:

64-bit libraries: /usr/lib64

32-bit libraries: /usr/lib

Linux (SLES) 32:

/usr/X11R6/lib

Linux (SLES) 64:

64-bit libraries: /usr/X11R6/lib64

32-bit libraries: /usr/X11R6/lib

NOTE: If you are using ATI Stream SDK v1.3-beta, you may only use ATI Catalyst 8.12 or 9.1 (for AMD FireStream customers, use AMD FireStream Unified Driver 8.561). This restriction is due to a renaming of the ATI CAL libraries which occurred in ATI Catalyst 9.2. Please upgrade to ATI Stream SDK v1.4-beta when using ATI Catalyst 9.2 or later.

KB26

Audience:

GPU

Category:

SDKs

Rating:

Important

Sub-Category:

AMD APP SDK

Last Updated:

03/05/2009

Summary

I am getting “LNK2001: unresolved external symbol” for any CAL symbol in my own example project.

Details and Recommendations

Drop the project directory into your Desktop, and select the Win32 or x64 build target, depending on your system. Assuming you have CAL and Brook installed, this example should build in any combination of Win32/x64 and Release/Debug.

If your own project does not build, compare the following properties between your project and the example project:

Project->Properties->Configuration Properties->C/C++->Command Line
Project->Properties->Configuration Properties->Linker->Command Line

Please refer to the Visual Studio documentation on how to select the various property settings that eventually result in these two command lines.

A2:

Verify that the following environment variables and properties are set, and that they are in agreement regarding the build target (Win32/x32 or x64):

BROOKROOT (should be set by installer)

CALROOT (should be set by installer)

Build target in Visual Studio (Win32 or x64), should be in sync with the CAL/Brook+ installation (32-bit or 64-bit), and the PATH (see below).

The Brook and CAL examples shipped with the installation images should build and run without modification (with the exception of switching to the x64 build target in Visual Studio on an x64 system).

KB132

Audience:

GPU

Category:

Tools

Rating:

Important

Sub-Category:

AMD APP Profiler

Last Updated:

07/26/2011

Summary

The AMD APP Profiler may report incorrect GPU counter values when running on a system with multiple GPUs. This is only a problem if there is more than one type of GPU in the system, and the application is running on a secondary GPU. Profiling results will be correct as long as an application is running on a GPU whose type matches the primary GPU.

Tags: profiler, tools, gpu performance

Details and Recommendations

Run the application on the primary GPU in order to get accurate counter results.

Related Resources

KB104

Audience:

CPU

Category:

Tools

Rating:

Important

Sub-Category:

x86 Open64 Compiler Suite

Last Updated:

05/25/2011

Summary

Steps to reproduce the problem:

Set the NLSPATH variable to, say, /bin/:

export NLSPATH=/bin/

Compile the Fortran sample below to reproduce the message reported by the customer:

PROGRAM TO TEST OMP_GET_MAX_THREAD; PRINT *, ‘HELLO WORLD ‘ END

Details and Recommendations

To resolve the “openf95 INTERNAL: Unable to open message system”, please verify the following:

The message catalog file cf95.cat is available. (Generally it is located under ../lib/gcc-lib/x86_64-open64-linux/4.2.3/cf95.cat)

The Fortran front-end object mfef95 is available. (Generally it is located under ../lib/gcc-lib/x86_64-open64-linux/4.2.3/mfef95)

If both of the above are in place, check where the NLSPATH variable currently points and reset it with the command export NLSPATH= which should resolve the issue.

Open64 documentation reference:

NLSPATH Specify the location of compile-time and run-time error messages. (You can use %N to denote the base name of the file.) This environment variable is useful if the main function of your program is coded in C, and other parts of the program are coded in Fortran. In this case, NLSPATH tells the Fortran run time library where to find the file containing the run time error messages.

Related Resources

KB119

Audience:

CPU

Category:

Tools

Rating:

Informational

Sub-Category:

AMD SimNow™ Simulator

Last Updated:

05/25/2011

Summary

Requirements for SimNow™:

Any AMD64 system running a 64-bit version of Windows or Linux. SimNow™ will not install or run on 32-bit versions.

A minimum of 4GB of memory is required to run the BSDs that ship with SimNow™, but more may be needed for larger, custom BSD configurations.

Details and Recommendations

Related Resources

Summary

The Open64 compiler license details.

Why is registration required for Open64 compiler download?

Tags: x86 Open64 compiler license.

Details and Recommendations

Open64 License agreement:

The x86 Open64 compiler, downloadable from developer.amd.com, contains components which are owned and licensed by AMD and third parties. The compiler is licensed under both proprietary and open source licenses—depending on the component. Further licensing information is located in the source files. Further, AMD is bound by export compliance regulations and users from specific countries are restricted from downloading the packages under the licensing terms. The user must agree with the terms of the licenses before downloading the compiler.

Why is registration required to download?

Technical support is available to x86 Open64 customers. Registration is required. AMD is committed to providing timely technical support to customers using the x86 Open64 compiler to develop or port applications to AMD platforms within an enterprise Linux® environment. To know more, please go to this link: http://developer.amd.com/cpu/open64/pages/default.aspx

Registration also helps us capture user input and feedback effectively, so that we can improve the product.

Summary

Details and Recommendations

Visualizing the depth buffer is currently not possible on pre-DX10.1 GPUs. This issue has not been scheduled to be fixed.

Related Resources

KB130

Audience:

GPU

Category:

Tools

Rating:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/31/2011

Summary

DX10 apps may crash when using Frame Capture with CrossFire.

Tags:

Details and Recommendations

Disable Frame Capture in Settings\General or disable CrossFire.

Related Resources

KB129

Audience:

GPU

Category:

Tools

Rating:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/31/2011

Summary

PerfStudio 2.5 may crash when profiling. We see this in some applications when more than a few counters are being used.

Tags:

Details and Recommendations

If this happens, try using fewer counters and retry. This issue is currently being worked on.

Related Resources

KB128

Audience:

GPU

Category:

Tools

Rating:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/31/2011

Summary

Symptoms:

Installed graphics hardware is part of the AMD Radeon HD 6900 series.

Catalyst 10.12 is installed.

GPU PerfStudio’s Frame Profiler reports that it cannot access the performance counters because the hardware is not supported.

GPU PerfStudio’s Frame Debugger does not have GPUTime values in the slider and instead displays individual tick marks or blocks which are all of the same height.

Cause:

Catalyst 10.12 introduces support for the AMD Radeon HD 6900 series graphics cards. However, the OpenGL performance counter support is incomplete, and GPU PerfStudio’s Frame Profiler is not able to use the counters.

Tags:

Details and Recommendations

Solution: Resolved in Catalyst 11.2

Related Resources

KB121

Audience:

GPU

Category:

Tools

Rating:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/31/2011

Summary

Using GPU PerfStudio 2.2, some counters may report results that are two times the actual value when profiled on ATI Radeon HD 5800 series graphics cards, while the results are correct on other cards.

Tags:

Details and Recommendations

Using GPU PerfStudio 2.2, the following counters may report results that are two times the actual value when profiled on ATI Radeon HD 5800 series graphics cards, while the results are correct on other cards.

Also, the following two counters currently report incorrect data. For an approximation of the correct value, multiply them by 8 on ATI Radeon HD 5800 cards.

CBMemRead
CBMemWritten
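As a small sketch of the approximation described above, the following Python function scales the two named counters by 8 and leaves every other value unchanged. The function name and the dictionary layout of the profiler results are assumptions for illustration; the correction applies only to GPU PerfStudio 2.2 results captured on ATI Radeon HD 5800 cards.

```python
# Scale factors from this article, valid for GPU PerfStudio 2.2 on ATI Radeon HD 5800 cards.
SCALE_ON_HD5800 = {"CBMemRead": 8, "CBMemWritten": 8}


def approximate_counters(raw):
    """Return a copy of `raw` with the affected counters multiplied by their scale factor."""
    return {name: value * SCALE_ON_HD5800.get(name, 1) for name, value in raw.items()}
```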

Solution: These counters are correct in GPU PerfStudio 2.5

Related Resources

KB114

Audience:

GPU

Category:

Tools

Rating:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/31/2011

Summary

There is full support for multi-threaded DirectX11 in GPU PerfStudio 2.5. The user must use Frame Capture when pausing an application that is using deferred contexts. In addition, the user can select the “Flatten CommandLists” option in the Settings\General tab to convert the Frame Capture back onto a single thread.

Tags:

Details and Recommendations

Multithreaded support is currently considered complete.

Related Resources

KB57

Audience:

GPU

Category:

Tools

Rating:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/31/2011

Summary

This article describes what to do if your application goes into an infinite loop during pausing.

Details and Recommendations

Pausing your Application

Introduction

PerfStudio supports two ways to pause your application. The first is Frame Capture: PerfStudio captures all of the graphics API calls that your app makes in a frame and plays them back after the capture is taken.
The second is a time-based pause, which works with applications that measure time using QueryPerformanceCounter, timeGetTime or GetTickCount. With the time-based pause the 3D application appears to be paused, but technically it is not: the application renders the same frame over and over. This works because 3D applications use the system timers to update the animation, so changing the rate of these timers changes the speed of the animation. In particular, pausing the timers freezes the animation and allows things like stepping backwards in the Frame Debugger or performing complex measurements in the Frame Profiler. Forcing the application to render the same frame may cause problems in some cases.
If your application uses RDTSC you should use PerfStudio’s Frame Capture mode.
PerfStudio’s default setting is the time-based pause. You have to manually activate Frame Capture in the General tab of the Settings window.

Using Frame Capture

Frame Capture can be enabled by selecting the “Frame Capture on Pause” option in the General tab of the Settings window. When the application is next paused the graphics API calls will be captured and will start to play in the server. You will be able to step through the draw calls and use all of PerfStudio’s other features. If you are working with a DX11 multithreaded application using command lists you will need to select Flatten CommandLists in order to see the draw calls in the frame debugger. Frame Capture is only required when analyzing games where each frame is rendered in a different way (for example with multithreaded rendering). For games that do not do this, Frame Capture can be disabled.

Using the Time Spoofing Controls

Regardless of whether you are using Frame Capture or Time-based Pause, you can control how PerfStudio spoofs time returned by the various time functions. Note that, without any time spoofing, if a game has been paused for a minute, when unpaused, the game will think the last frame took one minute to render. If the game is a racing game the physics engine will think the car went straight for a minute and the scene after unpausing will be totally different than when the game was paused. To prevent this from happening, PerfStudio can spoof the values returned by the time functions:

Freeze: the game will think the last frame took 0 seconds. Note that some games may crash since they try to divide by this 0. Also, if a game uses a frame limiter it may get confused when using this setting.

Slow motion: the game will think that last frame took a few milliseconds.

None: no spoofing will be performed; the actual values returned by the time functions will be returned to the game.

How Time Spoofing Works

Most modern games are time-based, meaning that the position of objects within the game can be given by the following formula:

s = v * t

where ‘s’ is the position of the object, ‘v’ is the object’s velocity and ‘t’ is the absolute time. Usually, applications use QueryPerformanceCounter to get ‘t’. By spoofing the return value of this function, PerfStudio can get the application to appear to freeze, run slower, or even run faster. By returning the same value over and over for ‘t’, PerfStudio can appear to freeze the application, though, in reality, the application is simply rendering the objects in the same position for each frame.
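The effect of spoofing ‘t’ can be sketched in Python. This is not PerfStudio’s implementation: the SpoofedClock class, its mode names and the slow-motion factor are hypothetical, and time.perf_counter stands in for QueryPerformanceCounter.

```python
import time


class SpoofedClock:
    """Hypothetical wrapper that scales the time an application sees."""

    # Assumed scale factors; the slow-motion value is illustrative only.
    SCALES = {"run": 1.0, "slow": 0.001, "freeze": 0.0}

    def __init__(self, source=time.perf_counter):
        self._source = source
        self._real_mark = source()         # real time at the last mode switch
        self._fake_mark = self._real_mark  # spoofed time at the last mode switch
        self._scale = self.SCALES["run"]

    def set_mode(self, mode):
        real = self._source()
        # Fold the time elapsed so far into the spoofed clock, then rescale.
        self._fake_mark += (real - self._real_mark) * self._scale
        self._real_mark = real
        self._scale = self.SCALES[mode]

    def now(self):
        elapsed = self._source() - self._real_mark
        return self._fake_mark + elapsed * self._scale


def position(velocity, t):
    """s = v * t, with t taken from whatever clock the application trusts."""
    return velocity * t


if __name__ == "__main__":
    clock = SpoofedClock()
    clock.set_mode("freeze")
    first = position(2.0, clock.now())
    time.sleep(0.01)
    second = position(2.0, clock.now())
    # While frozen, both frames compute the same position.
    print(first, second)
```

Because the spoofed clock carries its own running total, switching back to "run" resumes from the frozen value instead of jumping ahead by the real time spent paused.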

Troubleshooting and Recommendations

My application runs very slowly and becomes unresponsive when I pause it.
Make sure you are not enforcing a frame rate limiter. Pausing makes your application believe it is running infinitely fast, which causes the frame rate limiter to introduce huge delays, and these delays cause the slowdown. Disable the frame rate limiter when you run with PerfStudio.

My application crashes when I pause it.
When PerfStudio pauses your application, it forces the timing functions to always return the same value. Make sure you are not dividing by the difference between these timing values or you will get a division by zero exception. A quick workaround is to set the ‘Time spoofing on pause’ option to ‘Slow motion’ in the Client Settings dialog. This causes the timers to run very slowly, so the elapsed time between calls is nonzero. The best fix is to handle the division by zero case in the animation engine. Another alternative is to try using PerfStudio’s Frame Capture mode.
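One way to handle the zero-elapsed-time case in an animation engine is sketched below in Python. The function and its parameters are hypothetical; the idea is simply to skip the division when the timer has not advanced and reuse the previous result.

```python
def estimate_velocity(prev_pos, curr_pos, prev_time, curr_time, last_velocity=0.0):
    """Velocity from two samples; reuse the last value if no time has elapsed."""
    dt = curr_time - prev_time
    if dt <= 0.0:
        # Timer frozen (or stepped backwards): skip the division entirely.
        return last_velocity
    return (curr_pos - prev_pos) / dt


if __name__ == "__main__":
    # Normal frame: two seconds elapsed between samples.
    print(estimate_velocity(0.0, 4.0, 1.0, 3.0))
    # Paused frame: both timer reads return the same value.
    print(estimate_velocity(0.0, 4.0, 5.0, 5.0, last_velocity=1.5))
```

Returning the previous value is one design choice among several; an engine could equally skip the update for that frame.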

My application does not render anything when I pause it.
A few applications assume that a frame rendered very quickly means they are minimized, and hence they stop rendering. Check that this is not the case. Setting the ‘Time spoofing on pause’ option to ‘Slow motion’ in the Client Settings dialog may help.

My application is still animated when I try to pause it.
Your application is probably frame-based. GPU PerfStudio cannot pause frame-based applications; you will have to make the application time-based or have a pause mode in your engine. In the latter case, you can try setting the ‘Time spoofing on pause’ option to ‘None’. Another alternative is to try using PerfStudio’s Frame Capture mode.

Most of the objects in my application are paused but I still can see some of them moving.
The objects that are moving are probably being animated in a frame-based manner. This will cause the speed of their animation to depend on the CPU or GPU speed of the computer. The animation may run fine on your machine, but it may be different on faster or slower machines. You will have to make these animations time-based if you would like to debug them with GPU PerfStudio. Another alternative is to try using PerfStudio’s Frame Capture mode.
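The difference between the two animation styles can be shown with a small Python sketch. The function, velocity and per-frame step are hypothetical values chosen for illustration; the list of frame times plays the role of the values returned by the timer.

```python
def run_frames(frame_times, velocity=10.0, step_per_frame=0.5):
    """Return final (frame_based_x, time_based_x) after the given frames."""
    frame_x = 0.0
    time_x = 0.0
    previous = frame_times[0]
    for now in frame_times[1:]:
        frame_x += step_per_frame              # moves once per rendered frame
        time_x += velocity * (now - previous)  # moves with elapsed time
        previous = now
    return frame_x, time_x


if __name__ == "__main__":
    # Timer frozen at 1.0: three more frames are rendered with no elapsed time.
    print(run_frames([1.0, 1.0, 1.0, 1.0]))
    # Timer running normally, half a second per frame.
    print(run_frames([0.0, 0.5, 1.0]))
```

With the timer frozen, the time-based object stays put while the frame-based object keeps advancing, which matches the symptom described in this entry.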

My application has its own pause mode to work around the frame rate limiter. Or
PerfStudio can’t connect to my application whether the Time spoofing option is set to ‘Freeze’ or ‘Slow motion’.
If your application has its own pause mode, where it will render the exact same draw calls for each frame, you can set the ‘Time spoofing on pause’ option to ‘None’ in order to avoid overriding the time values. Applications which have a frame rate limiter may crash, hang or render too slowly with Time spoofing set to ‘Freeze’ or ‘Slow motion’. The ‘None’ option prevents PerfStudio from interfering with the time values, so you should not have these issues.

My application is using RDTSC to query time. What can I do?
Try using PerfStudio’s Frame Capture mode. RDTSC does not work well on multi-core systems (http://msdn.microsoft.com/en-us/library/bb173458(VS.85).aspx) and is not as precise as QueryPerformanceCounter. It is recommended that you switch your application to using QueryPerformanceCounter if you would like to use PerfStudio. Alternatively, if your application supports a pause mode, you can set the ‘Time spoofing on pause’ option to ‘None’.

Details and Recommendations

When closing a GPU PerfStudio 2.0 session a series of windows may appear with a list of reference leak warnings. This indicates that there are reference leaks in your application. The warnings may be useful in identifying where the reference leaks are located.

Related Resources

KB77

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

AMD APP Profiler

Last Updated:

11/04/2010

Summary

This article states that the ATI Stream Profiler can’t profile 64-bit apps.

Tags: profiler, opencl, Tools, ATI Stream

Details and Recommendations

This issue is fixed in ATI Stream Profiler v2.0 and beyond.

Related Resources

KB124

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

09/27/2010

Summary

Opening the depth or stencil buffer in the picture viewer in PerfStudio shows the buffer to be very blocky, but the shapes within the buffer can be seen. Similarly, displaying the RenderTargets on the HUD (which includes depth and stencil buffers) shows the buffer to be similarly blocky, and they may flicker slightly.

Tags:

Details and Recommendations

This is a known driver issue and is specific to GL_DEPTH24_STENCIL8 formats. Other formats do not experience the same corruption, so temporarily changing the format used by your application may help you resolve the issue that you are investigating. We do not have any estimates on when the issue will be resolved by the driver.

Related Resources

KB81

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

AMD APP Profiler

Last Updated:

05/27/2010

Summary

This article states that the ATI Stream Profiler generates incorrect results if the memory state is different when dispatching a kernel multiple times versus a single time.

Tags: profiler, Tools, OpenCL, ati stream

Details and Recommendations

This issue has been fixed in ATI Stream Profiler v1.2.

Related Resources

Summary

One of SimNow™’s goals is the ability to run unmodified, production software. Provided that the necessary device model components are available, users can run unmodified System BIOS images, Video BIOS ROMs, expansion ROMs, Operating Systems, drivers, applications, etc. with the SimNow™ environment. This powerful, flexible simulation environment is one of the major differentiators between SimNow™ and other modeling software.

Tags: SimNow

Details and Recommendations

Related Resources

Summary

The Monitor module is a dynamically loaded library that the user can configure to gain knowledge of events and states of the system during a simulation. This information can then be fed into models such as performance timing, network utilization, disk utilization, power utilization, or other data models useful to the user.

Details and Recommendations

Related Resources

Summary

User-written analyzers are a key feature of AMD’s SimNow™ simulator software. Analyzers allow the user to insert arbitrary pieces of C code into various points of execution of the processor model.

Tags: SimNow

Details and Recommendations

There are many reasons to write analyzers; for example, you could write an analyzer to:

Create workload traces

Gather information to help you debug software

Modify architectural semantics

Collect information to feed to architectural studies

Implement tiny devices

We provide a sample analyzer that implements CPU logging in our public release. You may base your analyzer on this sample. The sample code is included in the ./devel/analyzers/cpulogger directory of your SimNow™ installation. For more information on usage of the analyzer interface please read the ‘Analyzer Developer’s Guide’ located in the devel/analyzers/ directory.

Related Resources

Summary

Beginning with SimNow™ v4.5.2 there is little that limits the public release user from accomplishing what can be done with the NDA release. While most users will find that the public version should meet their requirements, the major differences between the public and NDA versions are highlighted below:

The NDA version ships with more sample BSDs, containing more complex MP and IO configurations. However, the public version has no limit on how the user can customize their BSD configurations.

The NDA version contains models and platforms supporting AMD’s next-generation technologies and latest R&D efforts.

The NDA version ships with more chipset and IO device support.

The NDA version contains the SimNow™ device SDK so that users can create their own devices to plug into the SimNow™ environment.

Tags: SimNow

Details and Recommendations

Related Resources

KB113

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/24/2010

Summary

Some applications (especially benchmarks) have a separate monitor or start-up process that keeps tabs on the main application. In some cases a full profile (which can take a long time) can cause the monitor process to stop working. When this happens the application and monitor process may stop responding and the client will eventually time out. The number of draw calls in the main application affects how long the profile will take.

Tags:

Details and Recommendations

Use a custom profile that contains a subset of counters. Run multiple profiles, each using fewer counters.
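Splitting a counter list into smaller custom profiles can be planned with a few lines of Python. This is only a planning aid, not a PerfStudio API: the function name and the batch size are hypothetical, and the counter names are ones mentioned elsewhere in this knowledge base.

```python
def split_into_passes(counters, max_per_pass):
    """Split a list of counter names into consecutive batches."""
    if max_per_pass < 1:
        raise ValueError("max_per_pass must be at least 1")
    return [
        counters[i:i + max_per_pass]
        for i in range(0, len(counters), max_per_pass)
    ]


if __name__ == "__main__":
    wanted = ["ALUBusy", "ALUInstCount", "ALUEfficiency", "ALUTexRatio", "CBMemRead"]
    for number, batch in enumerate(split_into_passes(wanted, 2), start=1):
        print("profile", number, batch)
```

Each batch would then be entered as its own custom profile, so no single profile runs long enough to upset the monitor process.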

Related Resources

KB112

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/24/2010

Summary

The 10.2 driver contains a bug that causes the Shader Debugger to get stuck in some loops when debugging pixel shader code. When this happens the debug session will not run to the end and will gradually use up all available memory.

Tags:

Details and Recommendations

This will be fixed in a later driver release (scheduled for 10.4).

Related Resources

KB111

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/24/2010

Summary

When trying to profile an OpenGL application on ATI Radeon 5800 Series hardware, the application will crash. Timing passes with the Frame Profiler do work correctly though.

Tags:

Details and Recommendations

This issue has been identified as being specific to Catalyst 10.1; newer Catalyst releases resolve the issue, as does Catalyst 9.12. Other cards in the ATI Radeon HD 5000 Series do not suffer from the same issue, and can be used to perform a profile.

Related Resources

KB110

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/24/2010

Summary

After updating to Catalyst 10.2, some OpenGL entrypoints are not recognized by GPU PerfStudio. Among the missed functions are glDrawArrays and glDrawElements, which is easy to notice because the draw calls are not included in the Frame Debugger Draw Call Selector. The Frame Profiler and API Trace do not include the entrypoints either.

Tags:

Details and Recommendations

A future Catalyst release will correct the issue. In the short term, Catalyst 9.12 or Catalyst 10.1 should be installed in order to correctly use GPU PerfStudio with OpenGL applications.

Related Resources

KB109

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/24/2010

Summary

ALUBusy, ALUInstCount, ALUEfficiency, or ALUTexRatio may report a 0 result in DirectX10 and DirectX11 applications when profiled on ATI Radeon HD 4000 series hardware. This has been identified as a common problem on ATI Radeon HD 4670 graphics cards. OpenGL applications can be profiled correctly.

Tags:

Details and Recommendations

Currently no workaround has been identified, but we are investigating the issue. If your application can run in OpenGL mode, you may want to profile using OpenGL instead. Alternatively, we recommend upgrading to an ATI Radeon HD 5000 series graphics card.

KB108

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

03/24/2010

Summary

Applications which use FMOD as a sound library may not work correctly with GPU PerfStudio due to conflicting Windows Socket implementations. The following error message will appear in the server log: “InitCommunication – WSAStartup network subsystem not ready for communication”

Tags:

Details and Recommendations

At this time, the only known way to work around the issue is to disable FMOD in the application.

Related Resources

KB107

Audience:

GPU

Category:

Tools

Rating:

Informational

Sub-Category:

AMD APP Profiler

Last Updated:

03/05/2010

Summary

Details and Recommendations

The ATI Stream Profiler requires at least Visual Studio 2008 Standard. It supports Visual Studio 2008 Standard, Professional and Team System Edition. Visual Studio 2008 Express Edition is not supported due to the minimum requirement imposed by Microsoft to support Visual Studio plugins.

Related Resources

KB76

Audience:

GPU

Category:

Tools

Rating:

Critical

Sub-Category:

AMD APP Profiler

Last Updated:

02/11/2010

Summary

This article describes what to do if your application crashes when run with the profiler using a GPU device.

Tags: profiler, OpenCL

Details and Recommendations

Install the latest ATI Catalyst driver. The profiler requires ATI Catalyst 9.12 (December 2009) or later. Ensure that these three files (aticalrt.dll, aticalcl.dll, aticaldd.dll) in your C:\Windows\SysWOW64 (for 64-bit system) or C:\Windows\System32 (for 32-bit system) have timestamps of December 1st 2009 10:33PM or later. If you have older files, uninstall the Catalyst driver, remove these three files manually (if they are not removed automatically by the uninstallation), and reinstall the latest ATI Catalyst driver.
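The timestamp check can be automated with a short Python sketch. The helper name is hypothetical, the cutoff is interpreted as local time (the article does not state a time zone), and a missing file is reported alongside outdated ones as an assumption of this sketch.

```python
import os
from datetime import datetime

# File names and cutoff taken from this article.
REQUIRED_FILES = ("aticalrt.dll", "aticalcl.dll", "aticaldd.dll")
MIN_TIMESTAMP = datetime(2009, 12, 1, 22, 33)  # December 1st 2009 10:33PM


def outdated_files(directory, names=REQUIRED_FILES, minimum=MIN_TIMESTAMP):
    """Return the names that are missing or last modified before `minimum`."""
    stale = []
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            stale.append(name)  # assumption: treat a missing file as a problem
            continue
        modified = datetime.fromtimestamp(os.path.getmtime(path))
        if modified < minimum:
            stale.append(name)
    return stale
```

On a 64-bit system one would call outdated_files(r"C:\Windows\SysWOW64"); an empty list means all three files meet the cutoff.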

Related Resources

KB82

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

AMD APP Profiler

Last Updated:

02/11/2010

Summary

This article states that Visual Studio might crash when you immediately sort on the first column (Method) in the OpenCL Session Panel after profiling is completed.

Tags: profiler, ati stream, OpenCL, Tools

Details and Recommendations

This is a known issue. The workaround is: Do not sort on the first column (Method) immediately after profiling is completed. You can sort on the other columns first. This issue has been fixed in ATI Stream Profiler v1.1.

Related Resources

KB80

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

AMD APP Profiler

Last Updated:

02/11/2010

Summary

This article states that the ATI Stream Profiler can’t generate the ISA and IL codes when the Working Directory field (under Debugging) is set to a path other than the project’s path in the project settings.

Tags: profiler, ati stream, Tools, OpenCL

Details and Recommendations

This issue will be addressed in an upcoming release. Current workaround: Do not set your Working Directory field.
This issue is fixed in ATI Stream Profiler v1.1.

Related Resources

KB79

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

AMD APP Profiler

Last Updated:

02/11/2010

Summary

This article states that if your application performs many clCreateBuffer operations, you might get an assertion error with an out-of-video-memory message when using the ATI Stream Profiler.

Tags: profiler, ati stream, OpenCL, Tools

Details and Recommendations

This issue will be addressed in an upcoming release. This issue is fixed in ATI Stream Profiler v1.1.

Related Resources

KB106

Audience:

GPU

Category:

Tools

Rating:

Critical

Sub-Category:

AMD APP Profiler

Last Updated:

02/10/2010

Summary

After installing a new version of ATI Stream Profiler, Microsoft Visual Studio 2008 shows a package loading error upon starting.

Tags: opencl, profiler, tools, gpu performance, stream profiler

Details and Recommendations

After installing a new version of ATI Stream Profiler, Microsoft Visual Studio 2008 shows the following error message:

The AdvancedMicroDevices.CLProfiler.CLProfilerPackage, CLProfiler, Version=1.1, Culture=neutral, PublicKeyToken=null ({8F386C9B-57E6-492F-8738-D7179F53B15C}) did not load because of previous errors. For assistance, contact the package vendor. To attempt to load this package again, type ‘devenv /resetskippkgs’ at the command prompt.

Solution:

Uninstall and reinstall ATI Stream Profiler.

If the problem persists, type the following command in a command prompt window (with Run As Administrator option):

Related Resources

Summary

AMD CodeAnalyst for Linux® provides a set of command-line tools based on the OProfile utilities.

Tags: Command-Line Utility, opcontrol, opreport, opannotate, OProfile

Details and Recommendations

Users can collect and display profile results using OProfile command-line tools. These are installed in the “<CA install directory>/bin/” directory. They are modified to add support for the latest processor features that have not yet been included in the upstream OProfile distribution. Examples of these tools are:

opcontrol: configure, start, and stop profiling

opreport: view profiling results at the system/module level

opannotate: view profiling results at the source/assembly level

These tools should be compatible with standard OProfile-0.9.3. To identify CA-modified OProfile tools, please use the “--version” option.

KB101

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

01/14/2010

Summary

A high resolution application can prevent PerfStudio from allocating enough memory to view the Shader Debugger’s Mask and Register images. This is more likely to occur on 32bit applications.

Tags:

Details and Recommendations

Reduce the resolution of your application. This will reduce the amount of memory required to generate the Mask and Register images.

Related Resources

Summary

There are a number of things you can do to detect that your software is running on SimNow rather than on real hardware, but you probably need to think of a solution that works for any generic full system simulator.
Normally a full system simulator doesn’t model timing as accurately as real hardware, so you can do some microbenchmarking to detect unrealistic timing. For example, the software could detect that the IPC was constant by looking at the TSC and instruction count, which would not happen on real hardware.
Another thing is that SimNow doesn’t model performance counters, so they do not increment when they would on real hardware.
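The constant-IPC check could be sketched in Python as follows. This only analyzes readings that have already been collected; how the TSC and instruction count are sampled is outside the sketch, and the function names, the tolerance and the sample data are all hypothetical.

```python
from statistics import mean, pstdev


def ipc_samples(readings):
    """Per-interval IPC from cumulative (tsc, instructions_retired) pairs."""
    samples = []
    for (t0, i0), (t1, i1) in zip(readings, readings[1:]):
        cycles = t1 - t0
        if cycles > 0:
            samples.append((i1 - i0) / cycles)
    return samples


def looks_simulated(readings, rel_tolerance=1e-3):
    """Heuristic: True when IPC barely varies, or the counter never advances."""
    samples = ipc_samples(readings)
    if len(samples) < 2:
        return False  # not enough data to judge
    average = mean(samples)
    if average == 0:
        return True   # instruction counter did not increment at all
    return pstdev(samples) / average < rel_tolerance


if __name__ == "__main__":
    # Hypothetical readings: exactly 2 instructions per cycle in every interval.
    flat = [(0, 0), (100, 200), (200, 400), (300, 600)]
    # Hypothetical readings with varying IPC, as real hardware would show.
    varied = [(0, 0), (100, 200), (200, 250), (300, 600)]
    print(looks_simulated(flat), looks_simulated(varied))
```

The tolerance is an arbitrary starting point; a real check would need to be tuned against measurements taken on actual hardware.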

Tags:

Details and Recommendations

Related Resources

KB90

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

01/04/2010

Summary

On a 64bit OS some 32bit applications are mistakenly installed in the Program Files directory instead of the Program Files (x86) directory. Shortcuts to 32bit apps in the Program Files directory will have (x86) appended to the path and executable files will not be found by PerfStudio.

Tags: gpu perfstudio

Details and Recommendations

There are 2 workarounds at the moment for this issue.

Launch the shortcut using the right-click “Open With” method – then attach from the client.

Don’t use a shortcut – launch the app directly from the client

Related Resources

KB89

Audience:

GPU

Category:

Tools

Rating:

Informational

Sub-Category:

GPU PerfStudio

Last Updated:

01/04/2010

Summary

If the client times out the play button can sometimes appear to be active even though the client is no longer connected.

Tags: GPU PerfStudio

Details and Recommendations

Close the client and reconnect. If the timeout continues consider increasing the message timeout time or restarting the server and application.

Related Resources

KB78

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

AMD APP Profiler

Last Updated:

12/22/2009

Summary

This article states that ATI Stream Profiler does not support user defined macros in the Output File field (under Linker) and Working Directory field (under Debugging) in the project settings.

Tags: profiler, ati stream, Tools, OpenCL

Details and Recommendations

Do not use user defined macros in the Output File field and the Working Directory field.

Summary

Details and Recommendations

This will be addressed in an upcoming release.

Related Resources

KB70

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

11/03/2009

Summary

GPU PerfStudio uses the data returned from ID3D10ShaderReflection and ID3D11ShaderReflection to determine which textures are in active use for a given shader stage and draw call. Using D3DStripShader to strip reflection data from your shaders restricts GPU PerfStudio’s ability to do this. It is recommended that you do not strip reflection data while using GPU PerfStudio, to get the best possible debugging and profiling experience.

Tags: GPU Programming, GPU PerfStudio

Details and Recommendations

Do not use D3DStripShader if you intend to use the shader debugger component of GPUPerfStudio2.

KB69

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

08/07/2009

Summary

This article describes why the help system has no content and how to fix the issue.

Tags: performance tuning, gpu

Details and Recommendations

If you expand the GPU PerfStudio 2.0 zip file using the built in Windows zip extractor the help files are identified as a security threat and the content is blocked.

There are two solutions:

Use the WinZip executable to expand the zip file.

If you have already expanded the files using the built in Windows zip extractor then select the files in the HTML directory, right click to bring up the properties. In the General tab there will be an “Unblock” button in the bottom right hand corner. Click this to unblock the content.

Details and Recommendations

If this happens it is possible to increase the timeout time in Windows->Settings->Frame Debugger.
The shader debugger may time out while running over shaders with long loops; after this the client appears to stop responding. The shader debugger is still actively working within your application, thus the application cannot respond to the client. Wait until the shader debugger is done running, then start the shader debugger again to get back into debugging mode. You will have to start and stop the shader debugger to put your application in the correct state and to use other features of GPU PerfStudio.

Related Resources

KB62

Audience:

GPU

Category:

Tools

Category:

Important

Sub-Category:

GPU PerfStudio

Last Updated:

07/29/2009

Summary

This article describes a limitation in DX10 when copying the depth buffer and how this affects the Shader Debugger image.

Details and Recommendations

A limitation on DX10.0 graphics cards (the inability to copy the depth buffer) restricts the ShaderDebugger’s ability to handle depth-occlusion on these cards. Draw calls made while debugging shaders will be self-occluded but not occluded against other draw calls within the frame.

Details and Recommendations

On Win7/Vista, you must turn User Account Control (UAC) off (Start->Control Panel->User Accounts).
The reason for this is that in order to Profile ATI hardware we need to set a registry key to enable the counters.

Summary

If you are not sure about the exact CPU your application users will have, or if you need your application to be optimized for both AMD and Intel platforms, then you will need to use the ‘common target‘ flag.

Details and Recommendations

Use ‘-march=anyx86’ to produce optimized code for the most common x86-32/x86-64 processors. Use this option if you do not know the exact CPU that users of your application will have.
As new processors are deployed in the marketplace, the behavior of this option will change. Therefore, if you upgrade to a newer version of x86 Open64, the code generated by this option will change to reflect the processors that were most common when that version of Open64 was released.

Related Resources

Summary

FDO (feedback-directed optimization) improves the performance of programs by applying optimizations which use the information gathered during a profile run. This allows the compiler to be more precise when applying optimizations.
FDO and PGO (profile-guided optimization) are different names for the same approach of gathering profile information before the final compilation.

Related Resources

Summary

Hugepages can enable higher performance for applications, but using them requires specially compiled binaries, and memory allocated as hugepages on a system cannot be shared with non-hugepage applications.

Tags: Hugepages

Details and Recommendations

Pros & Cons: In the x86 architecture, the default VM page size is 4KB. Hugepages are VM pages which are bigger than the default size of 4KB, usually 2MB. Hugepages can significantly reduce the L1 and L2 TLB (Translation Lookaside Buffer) misses for applications. This is because each TLB entry now covers a much larger range of memory, 2MB as compared to the 4KB of ‘smallpages’, so the same number of entries maps far more of the working set. This can lead to significant performance improvements.
On the other hand, hugepages lock memory away so that it can’t be used by non-hugepage applications. If most of the system memory is allocated as hugepages, a ‘smallpage’ application might start thrashing pages to the hard drive, which slows it down significantly. So you need to carefully balance the use of hugepages with the rest of the non-hugepage applications on the system.
Guidelines: Hugepages will benefit applications which have a huge working set size (hundreds of MB or many GB and above). Such a working set requires a lot of virtual to physical address translations and so incurs a lot of L1 and L2 TLB misses. Hugepages bring the number of translations down significantly, which benefits application performance by removing the wait to refill the TLB with translation data from DRAM on each TLB miss.
It is always best, though, to benchmark your application with hugepages to see if it benefits. Remember that in the worst case of memory accesses which span more than 2MB you will still incur a TLB miss penalty.
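The arithmetic behind this reasoning can be worked through in Python. The 512-entry TLB and the 1GB working set are hypothetical figures chosen for illustration; real TLB sizes vary by processor and are not given in this article.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries, page_size):
    """Bytes of memory that a TLB with `entries` entries can map at once."""
    return entries * page_size


def pages_needed(working_set, page_size):
    """Number of pages (rounded up) required to cover a working set."""
    return -(-working_set // page_size)


if __name__ == "__main__":
    entries = 512          # hypothetical TLB size
    working_set = 1 * GIB  # hypothetical application working set
    for label, size in (("4KB pages", 4 * KIB), ("2MB hugepages", 2 * MIB)):
        print(label,
              "reach:", tlb_reach(entries, size) // MIB, "MB,",
              "pages needed:", pages_needed(working_set, size))
```

With these assumed numbers, the small-page TLB reaches only a tiny fraction of the working set, while the same number of hugepage entries covers all of it.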

You can ask the compiler to enable hugepages by using the following flag: -HP:bdt=2m:heap=2m. The options specify that the bdt (bss, data and text) sections and the heap would use hugepages. You can also limit the number of hugepages the application can use by specifying a limit in the above flag as follows: -HP:bdt=2m:heap=2m,limit=400. This will limit the application to use only 400 hugepages. Without this limit specified it would use as many as it needs. A supporting library libhugetlbfs.so is also needed and is provided with the open64 compiler distribution.

How do you allocate hugepages?

Runtime allocation. Execute echo 7168 > /proc/sys/vm/nr_hugepages, where 7168 is the number of hugepages needed and should be decided by you depending on your needs.

Boot time allocation. Add vm.nr_hugepages=7168 to the file /etc/sysctl.conf. This is a permanent method in that it will be there even after a reboot.

To check the actual number of hugepages allocated execute cat /proc/meminfo and look for HugePages_Total field.
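Reading the hugepage fields programmatically might look like the following Python sketch. The parser is a hypothetical helper, and the sample text is made up to resemble /proc/meminfo output; on a real system you would pass the contents of /proc/meminfo instead.

```python
def hugepage_fields(meminfo_text):
    """Extract the HugePages_* counts and Hugepagesize (kB) from meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and (key.startswith("HugePages_") or key == "Hugepagesize"):
            fields[key] = int(rest.split()[0])
    return fields


# Hypothetical sample resembling /proc/meminfo on a system with 2MB hugepages.
SAMPLE = """MemTotal:       32768000 kB
HugePages_Total:    7168
HugePages_Free:     7000
Hugepagesize:       2048 kB
"""

if __name__ == "__main__":
    info = hugepage_fields(SAMPLE)
    reserved_kib = info["HugePages_Total"] * info["Hugepagesize"]
    print(info)
    print("reserved:", reserved_kib / (1024 * 1024), "GB")
```

Multiplying the page count by the page size shows how much memory the allocation takes away from non-hugepage applications.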

The above requires root privileges. To make hugepage allocation user settable you can add vm.disable_cap_mlock = 1 to /etc/sysctl.conf or execute echo "1" > /proc/sys/vm/disable_cap_mlock. This will allow unprivileged users to lock hugepages. This can also be achieved by changing the user limits in /etc/security/limits.conf to enable users to lock a large amount of memory per process. For example, adding these lines:

hard memlock 2097152
soft memlock 2097152

will allow users to lock up to 2GB of physical memory per process. This can be confirmed by executing the command ulimit -l to see what value is the current limit.
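The memlock value is expressed in KB, so the figure above can be checked with a little Python arithmetic. The helper names are hypothetical, and the 2MB hugepage size is the usual value mentioned earlier in this article.

```python
KIB_PER_GIB = 1024 * 1024
HUGEPAGE_KIB = 2048  # 2MB hugepages


def memlock_kib_for(gib):
    """memlock limit (in KB) that allows locking `gib` gigabytes."""
    return gib * KIB_PER_GIB


def hugepages_within(memlock_kib, hugepage_kib=HUGEPAGE_KIB):
    """Whole hugepages that fit under a memlock limit given in KB."""
    return memlock_kib // hugepage_kib


if __name__ == "__main__":
    limit = memlock_kib_for(2)
    print("memlock value for 2GB:", limit)
    print("2MB hugepages that fit:", hugepages_within(limit))
```

The same conversion helps when choosing a limit to match a planned number of hugepages per process.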

In addition to allocating the hugepages, a mount point must also be created since hugepages are allocated via an in-memory filesystem. One way to ensure the mount is always available is to add the following lines to the /etc/rc.d/boot.local script:

Summary

-O3 is an exclusive compilation level and cannot be fully realized using other options. -O3 turns on all optimizations specified by -O2, but takes a more aggressive approach and additionally turns on LNO (Loop Nest Optimization). LNO options take effect only at -O3 and above. Hence -O3 is exclusive when compared to lower optimization levels like -O1 and -O2.

Tags: Optimization, Compilers

Details and Recommendations

LNO (Loop Nest Optimization) is a superior optimization but is enabled only at -O3 and above. Given this, no combination of individual flags can provide all the benefits of -O3.

Apart from LNO, the rest of the benefits from -O3 can be achieved by adding the following options to the default compilation without specifying optimization level (the default optimization level is -O2)

Follow the instructions from the Help section – ‘Setting up Linux’ – of the simulator.
To get enough mmap-able virtual address space, make sure you have a /etc/sysctl.conf file and that it contains the following line: vm.max_map_count = 1048576
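The sysctl.conf requirement can be verified with a short script. This is a sketch under stated assumptions: the function names are hypothetical, only simple "key = value" lines are handled, and treating any value of at least 1048576 as sufficient is an assumption (the text above asks for exactly that line).

```python
def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blank lines and # or ; comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def has_enough_map_count(text, required=1048576):
    """True if vm.max_map_count is set to at least the required value."""
    value = parse_sysctl_conf(text).get("vm.max_map_count")
    return value is not None and value.isdigit() and int(value) >= required


if __name__ == "__main__":
    try:
        with open("/etc/sysctl.conf") as conf:
            print("vm.max_map_count ok:", has_enough_map_count(conf.read()))
    except OSError:
        print("/etc/sysctl.conf is missing or unreadable")
```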

Run ./simnow executable from the simnow home directory (simnow_dir).

Open the cheetah_1p.bsd using File->’Open BSD’. This BSD essentially contains the following:

AMD-8132 PCI-X controller

AMD-8111 I/O hub

Winbond W83627HF SuperIO

Copy the coreboot.rom image to simnow_dir/Images.

Open the SimNow Device Window (View->Show Devices).

Double click on Memory Device (ROM device).

Click on the Memory Configuration Tab and check the Base Address and Size for the image (Base Address = fff00000 and Size = 32 for 1 MB image).

Click OK to save your changes.

Go to the main window and hit ‘Run Simulation’ (Play button) to start the simulation.
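The Base Address in the Memory Configuration step above is consistent with a ROM image that ends at the top of the 32-bit address space. The sketch below shows that arithmetic for the 1 MB case; the function name is hypothetical, and the top-of-4GB placement for other image sizes is an assumption rather than something stated in this article.

```python
FOUR_GIB = 1 << 32  # size of the 32-bit physical address space


def rom_base_address(image_size_bytes):
    """Base address of a ROM image that ends exactly at the 4 GB boundary."""
    return FOUR_GIB - image_size_bytes


if __name__ == "__main__":
    one_mb = 1 << 20
    print(format(rom_base_address(one_mb), "x"))  # fff00000
```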

Communication via Serial Port Configuration

To set up the console output go to the terminal you are running the simulator from and hit ENTER. You should see a ‘simnow>’ prompt.

Execute the following commands on the ‘simnow>’ prompt:

serial.SetCommPort pipe
serial.GetCommPort

This returns the path (/home/<username>/.simnow/com1) and the mode (PIPE).

Now, to get the output from coreboot, open a new terminal and type cat /home/<username>/.simnow/com1/simnow_out

To send input to the serial port, echo to /home/<username>/.simnow/com1/simnow_in
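The same serial traffic can be handled from a script instead of cat and echo. This is only a sketch: the function names are hypothetical, and it assumes the simnow_out and simnow_in files under the com1 directory behave like ordinary pipes that can be opened for reading and writing.

```python
from pathlib import Path


def com1_paths(home):
    """Return the (simnow_out, simnow_in) paths below <home>/.simnow/com1."""
    com1 = Path(home) / ".simnow" / "com1"
    return com1 / "simnow_out", com1 / "simnow_in"


def send_line(in_path, text):
    """Write one line of input to the simulated serial port."""
    with open(in_path, "w") as pipe:
        pipe.write(text + "\n")


def follow(out_path, sink):
    """Pass each line of serial output to sink until the writer closes the pipe."""
    with open(out_path, "r", errors="replace") as pipe:
        for line in pipe:
            sink(line)


# Example use (blocks until the simulator writes to the pipe):
#   out_path, in_path = com1_paths(Path.home())
#   follow(out_path, lambda line: print(line, end=""))
```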

Details and Recommendations

Provide support for new AMD processors when support has not yet been added in the upstream kernel, or to back-port new processor support on existing Linux® distributions.

Provide new features (e.g. Instruction-Based Sampling or event-multiplexing) that have not yet been accepted upstream.

We do not provide CAKM for all versions of the Linux® kernel, only for those that are original to each respective Linux® distribution. If you do not wish to use the added features, the stock OProfile drivers should allow you to run normal profiling.

The “mod_install.sh” script should determine the suitable version of CAKM to build and install on a system. If it fails, or if no CAKM is available for the distribution, you can try to build and install manually using a CAKM version that has the same kernel base. For instance, SLES11, Fedora10, openSUSE11.1, and Ubuntu-8.10 each come with a different kernel version, but all of them are based on the Linux® 2.6.27 kernel. Therefore, you should be able to build and install the CAKM in the “src/cakm/kernel2.6.27/” directory. Please see the README file in that directory for instructions on how to manually build and install CAKM.
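The manual fallback above depends on matching the running kernel to a CAKM directory with the same kernel base. The sketch below derives a candidate directory name from a kernel release string. The function name is hypothetical, and the "kernel<base>" naming pattern is extrapolated from the single 2.6.27 example in this article, so the resulting directory may not exist for other kernels.

```python
import platform
import re


def cakm_source_dir(kernel_release):
    """Map a release string such as '2.6.27.19-5-default' to a CAKM directory."""
    match = re.match(r"(\d+\.\d+\.\d+)", kernel_release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {kernel_release!r}")
    # Assumption: directories are named after the three-part kernel base version.
    return f"src/cakm/kernel{match.group(1)}/"


if __name__ == "__main__":
    try:
        # platform.release() reports the same string as `uname -r` on Linux.
        print(cakm_source_dir(platform.release()))
    except ValueError as err:
        print(err)
```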

Summary

Some features are not yet supported by the upstream OProfile daemon or driver from standard Linux® distributions. The CodeAnalyst team is working to push these features upstream.

Tags: IBS, Event Multiplexing, CAKM, OProfile

Details and Recommendations

CodeAnalyst provides a modified version of OProfile-0.9.3 and a set of kernel drivers for supported Linux® distributions.

The CodeAnalyst version of OProfile is installed by default in the CodeAnalyst installation directory. Users can check by running the OProfile utilities (e.g. opcontrol, oprofiled, opreport, and opannotate) with the “--version” option.

Perform these steps to check for the CodeAnalyst version of the OProfile kernel driver:

The “/dev/oprofile/ibs-fetch” and “/dev/oprofile/ibs-op” directories should be present for IBS support.

The “/dev/oprofile/time_slice” file should be present for event multiplexing support.
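The two driver checks above can be combined into one script. This is a minimal sketch with a hypothetical function name; it only tests for the paths named in this article and assumes the OProfile filesystem is mounted at /dev/oprofile.

```python
from pathlib import Path


def codeanalyst_driver_features(oprofile_dev="/dev/oprofile"):
    """Report which CodeAnalyst driver features appear to be present."""
    root = Path(oprofile_dev)
    return {
        # IBS support needs both the ibs-fetch and ibs-op directories.
        "ibs": (root / "ibs-fetch").is_dir() and (root / "ibs-op").is_dir(),
        # Event multiplexing support is indicated by the time_slice file.
        "event_multiplexing": (root / "time_slice").exists(),
    }


if __name__ == "__main__":
    for feature, present in codeanalyst_driver_features().items():
        print(f"{feature}: {'present' if present else 'missing'}")
```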

Related Resources

Summary

Discusses differences between using OProfile-0.9.4 and CodeAnalyst for Java profiling.

Tags: Java, JVMTI, Jitted

Details and Recommendations

CodeAnalyst enables developers to profile Java applications. Profile results are limited to native code generated by the JVM (just-in-time compiled code). AMD CodeAnalyst provides a Java profiling agent and a modified version of OProfile (version 0.9.3) in order to perform Java profiling.

Starting from version 0.9.4, OProfile provides its own Java profiling capability using a different internal mechanism, and therefore, a different Java profiling agent. These agents are not compatible. One major difference is that CodeAnalyst aggregates samples for JIT compiled modules during runtime instead of post-runtime as in OProfile’s implementation.

Starting from CodeAnalyst-2.8.29, CodeAnalyst can be configured to build with OProfile-0.9.4 (with the exclusion of IBS and event-multiplexing support), allowing users to use OProfile’s implementation of the Java profiling agent.

Please see the INSTALLATION document for more information on how to configure CodeAnalyst with external OProfile.

Related Resources

Summary

CodeAnalyst displays a runtime error message when the dynamic loader cannot locate the shared libraries that are installed with CodeAnalyst. The message is usually displayed when starting CodeAnalyst.

Tags: Runtime Errors, Shared Libraries

Details and Recommendations

If you get a runtime error message that identifies:

libCA.so

libCAbba.so

libopdata.so

lib_tbp_output.so

then these libraries are not in the search path for dynamically loading shared objects. Please add the installed CodeAnalyst lib directory, such as

/opt/CodeAnalyst/lib/

or

/opt/CodeAnalyst/lib64

to “/etc/ld.so.conf”. This may require running “ldconfig” after updating the path. Alternatively, specify the path using the LD_LIBRARY_PATH environment variable when starting CodeAnalyst.
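Before changing the loader configuration, it can help to confirm which of the listed libraries are actually visible. The sketch below scans a list of directories for them. The function name is hypothetical, versioned names such as libCA.so.1 are accepted as an assumption, and the script does not consult the ld.so cache, so it is only an approximation of what the dynamic loader does.

```python
import os
from pathlib import Path

REQUIRED_LIBS = ("libCA.so", "libCAbba.so", "libopdata.so", "lib_tbp_output.so")


def missing_libraries(search_dirs, required=REQUIRED_LIBS):
    """Return the required libraries not found in any of the given directories."""
    found = set()
    for entry in search_dirs:
        if not entry:
            continue  # skip empty LD_LIBRARY_PATH entries
        directory = Path(entry)
        if not directory.is_dir():
            continue
        for name in required:
            # Accept both libX.so and versioned names such as libX.so.1.
            if any(directory.glob(name + "*")):
                found.add(name)
    return [name for name in required if name not in found]


if __name__ == "__main__":
    dirs = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    dirs += ["/opt/CodeAnalyst/lib", "/opt/CodeAnalyst/lib64"]
    print("missing:", missing_libraries(dirs))
```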

Related Resources

Summary

A performance loss has been seen in applications where the total app thread count is less than the core count on Windows Vista. An unfavorable interaction between Vista’s threading policy and the ability of AMD’s Family 0x10 processors to throttle clock speeds on a core-by-core basis results in degraded performance. For instance, single-threaded apps on Vista tend to rotate around all available cores before an individual core fully powers up from a throttled state.

Tags: Vista, Power Options, Performance

Details and Recommendations

Linux and Windows XP builds have not shown this behavior, and fully-threaded apps where the number of threads is equal to or greater than the number of available cores will keep all the cores busy at their unthrottled state.

Several solutions exist for users who may be seeing this behavior. Users may isolate the affinity of their threads to individual cores to avoid the wandering of their threads. This is the most power efficient option as unused cores will stay throttled; this can be achieved either programmatically or through tools such as Process Explorer. An easier solution is to go into the ‘Power Options’ pane of the Vista control panel, and change the machine from ‘Balanced’ to ‘High Performance’ mode; this stops Vista from throttling the cores. The user should note that if the machine is set to ‘High Performance’ mode, the power consumption of the machine most likely will go up.
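Setting affinity programmatically, as mentioned above, can be sketched as follows. This is an illustration only: the function names are hypothetical, the sketch pins the whole process (and therefore all of its threads) rather than individual threads, and it uses the Win32 SetProcessAffinityMask call through ctypes. On Linux it falls back to os.sched_setaffinity so that the same script can run there.

```python
import os
import sys


def affinity_mask(cores):
    """Build a bitmask with one bit set per allowed core index."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def pin_current_process(cores):
    """Restrict the current process to the given cores."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
        kernel32.SetProcessAffinityMask.restype = wintypes.BOOL
        handle = kernel32.GetCurrentProcess()
        if not kernel32.SetProcessAffinityMask(handle, affinity_mask(cores)):
            raise ctypes.WinError(ctypes.get_last_error())
    elif hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cores))  # 0 means the calling process
    else:
        raise NotImplementedError("no affinity API available on this platform")


# Example: keep a single-threaded app on core 0 so it does not wander.
#   pin_current_process([0])
```

Pinning to one core keeps the remaining cores free to stay throttled, which matches the power-efficiency point made above.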
