Profiling TensorFlow workloads with Intel VTune Amplifier

Machine learning applications are compute intensive by nature, which makes performance optimization important for them. One of the most popular libraries, TensorFlow*, already has an embedded timeline feature that helps you understand which parts of the computational graph cause bottlenecks, but it lacks advanced features such as architectural analysis. This short tutorial shows how to combine the data provided by Tensorflow.timeline with the options available in one of the most powerful performance profilers for Intel Architecture, Intel® VTune™ Amplifier.


Tensorflow.timeline generates its data in Trace Event Format, which VTune Amplifier cannot consume directly but which can be converted to a .csv format that it supports. We do this conversion at the end of the collection with the help of a special custom collector script listed below:
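As an illustration of the conversion step only (this is not the collect.sh script), a minimal Python sketch might look like the following. It assumes the trace contains complete events (phase "X") with `ts` and `dur` in microseconds, as the Trace Event Format defines them. The function names `trace_events_to_rows` and `rows_to_csv`, the sample events, and the output column names are hypothetical. VTune Amplifier expects its own header and timestamp format for imported .csv data, so consult its documentation for those details.

```python
import csv
import io
import json


def trace_events_to_rows(trace):
    """Turn Trace Event Format data into (name, start_us, end_us, pid, tid) rows.

    Only complete events (phase "X") are kept, because they carry both a
    start timestamp ("ts") and a duration ("dur"), both in microseconds.
    """
    # The format allows either {"traceEvents": [...]} or a bare list of events.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    rows = []
    for event in events:
        if event.get("ph") != "X":
            continue  # skip metadata ("M") and other non-interval events
        start = event["ts"]
        end = start + event.get("dur", 0)
        rows.append(
            (event["name"], start, end, event.get("pid", ""), event.get("tid", ""))
        )
    return rows


def rows_to_csv(rows, header=("name", "start", "end", "pid", "tid")):
    """Serialize rows to CSV text. The header names are placeholders (assumption)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample trace, for illustration only.
sample = json.loads("""
{"traceEvents": [
  {"ph": "M", "name": "process_name", "pid": 0, "args": {"name": "Allocators"}},
  {"ph": "X", "name": "MatMul", "ts": 1000, "dur": 250, "pid": 1, "tid": 0},
  {"ph": "X", "name": "Relu", "ts": 1250, "dur": 40, "pid": 1, "tid": 0}
]}
""")

csv_text = rows_to_csv(trace_events_to_rows(sample))
print(csv_text)
```

A real converter would additionally translate the microsecond timestamps into whatever time representation the profiler accepts and write one output file per input trace.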

When configuring a VTune Amplifier project, go to the Analysis Target window and, in the Custom collector field, specify the path to the collect.sh script followed by the path to the directory with the .json files generated by Tensorflow.timeline, as follows:

$ <path_to_collect.sh>/collect.sh <path_to_dir_with_json_files>

For example:

The script accepts one parameter: the path to the directory with the .json files generated by Tensorflow.timeline. At the end of the collection, the script automatically picks up the .json files from that directory, converts them to the .csv format, and puts the converted files into the result directory next to the other traces collected by VTune Amplifier. When the collection is done, VTune Amplifier automatically loads all the data and shows everything on the same timeline, correlated:

and aggregated:

The example above uses the Source Function / Function / Call Stack grouping instead of the default Function / Call Stack grouping because TensorFlow was built with Intel® Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) support, which performs JIT compilation. As a result, Intel MKL-DNN in some cases generates multiple instances of the same function. With the default Function / Call Stack grouping, VTune Amplifier would show these instances as different functions. That could lead to an incorrect interpretation of the result: no single instance looks hot by itself, yet the accumulated time of all of them is the hotspot.
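The effect of the grouping choice can be illustrated with a small Python sketch. The function names and timings below are made up for illustration and do not come from a real profile. It sums CPU time once per function instance and once per source function.

```python
from collections import defaultdict

# Hypothetical profile rows: (function instance, source function, CPU time in ms).
# The four "jit_kernel@..." entries stand for JIT-generated copies of one function.
samples = [
    ("jit_kernel@0x7f10", "jit_kernel", 800),
    ("jit_kernel@0x7f20", "jit_kernel", 900),
    ("jit_kernel@0x7f30", "jit_kernel", 700),
    ("jit_kernel@0x7f40", "jit_kernel", 600),
    ("other_function", "other_function", 2000),
    ("memcpy", "memcpy", 1000),
]


def total_time_by(rows, key_index):
    """Sum CPU time per grouping key and sort with the hottest entry first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


by_instance = total_time_by(samples, 0)  # analogous to grouping by Function
by_source = total_time_by(samples, 1)    # analogous to grouping by Source Function

print("Hottest by instance:", by_instance[0])
print("Hottest by source function:", by_source[0])
```

Grouped by instance, `other_function` appears to be the hotspot with 2000 ms. Grouped by source function, the four JIT copies add up to 3000 ms and `jit_kernel` moves to the top.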

The described technique lets you apply the full power of the analyses available in VTune Amplifier to TensorFlow-based applications. For instance, finding the operations responsible for the hotspots is just a matter of applying a proper Source Function / Function Domain grouping. This grouping can be configured manually as a custom grouping: