I recently got access to the Intel Nervana Dev Cloud as part of the Student Ambassador program. I tried running my TensorFlow code for a Neural Captioning System, but I am not getting any training speedup compared to my local computer. I doubt whether TensorFlow is using the underlying MKL library at all. Code that took 12 minutes when I previously ran it on Google Cloud with an 8-core virtual CPU (Intel Ivy Bridge) took 5 hours on the cluster. I am probably making a mistake somewhere but can't figure out what it is. Please help.

I checked the version: it's 1.2.1, not 1.4. Do I need to update it explicitly?
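A minimal sketch for checking the installed version from Python, assuming it is run with the same interpreter the batch job uses. `version_tuple` and `is_older_than` are hypothetical helper names, and "1.4" is just the version mentioned above:

```python
def version_tuple(version):
    """Turn a version string like '1.2.1' into a tuple of ints, e.g. (1, 2, 1)."""
    parts = []
    for piece in version.split("."):
        # Keep only the leading digits, so '0rc1' becomes 0
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_older_than(installed, required):
    """True if the installed version sorts before the required one."""
    return version_tuple(installed) < version_tuple(required)


if __name__ == "__main__":
    try:
        import tensorflow as tf
        installed = tf.__version__
    except ImportError:
        installed = None

    if installed is None:
        print("TensorFlow is not importable in this environment")
    else:
        print("TensorFlow", installed, "older than 1.4:", is_older_than(installed, "1.4"))
```

Comparing integer tuples avoids the string-comparison trap where "1.10" would sort before "1.4".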

I also get the following warnings when I run my code:

[u6717@c009 ImageCaptioningModel]$ qsub run.sh

26499.c009

[u6717@c009 ImageCaptioningModel]$ qpeek -e -f 26499

2017-11-19 10:04:40.012650: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

2017-11-19 10:04:40.012691: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

2017-11-19 10:04:40.012696: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

2017-11-19 10:04:40.012700: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.

2017-11-19 10:04:40.012703: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX512F instructions, but these are available on your machine and could speed up CPU computations.

2017-11-19 10:04:40.012706: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

I believe TensorFlow is not able to use the AVX512F instruction set to optimize performance on the Intel Xeon Scalable processors.
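To confirm on a compute node which of the warned-about instruction sets the CPU actually reports, a sketch like the following could be run inside a job. It assumes a Linux node exposing `/proc/cpuinfo`; the helper names are hypothetical. Note that it only shows what the CPU supports, not whether the installed TensorFlow build (or MKL) uses those instructions:

```python
import os

# Instruction sets named in the warnings, mapped to their /proc/cpuinfo flag names
WARNED_SETS = {
    "SSE4.1": "sse4_1",
    "SSE4.2": "sse4_2",
    "AVX": "avx",
    "AVX2": "avx2",
    "AVX512F": "avx512f",
    "FMA": "fma",
}


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def supported_sets(flags):
    """Map each warned instruction set to whether its flag is present."""
    return {name: flag in flags for name, flag in WARNED_SETS.items()}


if __name__ == "__main__":
    path = "/proc/cpuinfo"
    if os.path.exists(path):
        with open(path) as f:
            flags = cpu_flags(f.read())
        for name, ok in supported_sets(flags).items():
            print(f"{name:8s} {'yes' if ok else 'no'}")
    else:
        print("/proc/cpuinfo not found (this check is Linux-only)")
```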

One more thing I found while trying things out: the path /glob/development-tools/gcc/bin/lib64 does not exist; the actual path is /glob/development-tools/gcc/lib64. I'm not sure whether this is relevant, but I wanted to let you know.

Some stats: 1.19 iterations/second on my local machine (2nd-gen i3, 4 GB RAM) versus 1.39 iterations/second on the cluster. It takes 1 minute 49 seconds on my local machine and 1 minute 32 seconds on the cluster.
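A quick calculation of the speedup implied by these numbers, using only the figures quoted above:

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


local_its, cluster_its = 1.19, 1.39   # iterations per second
local_time = to_seconds(1, 49)        # 109 s on the local machine
cluster_time = to_seconds(1, 32)      # 92 s on the cluster

throughput_speedup = cluster_its / local_its   # ratio of iteration rates
wallclock_speedup = local_time / cluster_time  # ratio of run times

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"wall-clock speedup: {wallclock_speedup:.2f}x")
```

Both ratios come out at roughly 1.17x to 1.18x, so the cluster is only marginally faster than the local machine.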