This code is still a computational bottleneck. Maybe specialized code that checks for sparsity would help; I don't know. Perhaps I even have to use sparse matrices. Not sure yet. It's only getting more complicated, which might introduce nasty bugs. I already ran into one a few days ago.

One epoch already costs me about twelve minutes, and at least 200 epochs are needed.
And then you find out you need more filters/layers, making it even slower, and it all starts over again.
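As a rough illustration of the sparsity idea mentioned above, here is a minimal sketch: measure the fraction of zeros and only take a sparse code path when it pays off. The threshold and function names are hypothetical, not from any real engine; in practice you would tune the cutoff and use a proper sparse format.

```python
import numpy as np

# Hypothetical cutoff: below this zero fraction, dense BLAS usually wins.
SPARSITY_THRESHOLD = 0.9

def matvec_adaptive(W, x):
    """Compute W @ x, skipping zero entries when W is very sparse."""
    zero_frac = np.count_nonzero(W == 0) / W.size
    if zero_frac < SPARSITY_THRESHOLD:
        return W @ x  # dense path: let optimized BLAS handle it
    # Sparse path: per row, multiply only the nonzero columns.
    y = np.zeros(W.shape[0], dtype=W.dtype)
    for i in range(W.shape[0]):
        nz = np.nonzero(W[i])[0]
        y[i] = W[i, nz] @ x[nz]
    return y
```

The branch itself is cheap; the risk the post alludes to is that every extra code path like this is another place for bugs to hide.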

I don't disagree on the need to understand the inner workings, but you will have a hard time
beating vendor-supplied optimized libraries such as Intel MKL, cuDNN, TensorRT, etc.
Lczero tried the hand-written approach first and eventually switched to cuDNN, MKL BLAS, etc.
I am sure GCP put a lot of effort into coding Winograd etc., but these AI libraries are used by a lot of industry,
so NVIDIA/Intel has a lot to gain from offering highly optimized libraries.

FWIW, Leela Zero and lc0 in OpenCL mode still use my code (though Henrik Forsten co-wrote large parts of the current implementation and he should get credit). When we benchmarked it against cuDNN in Leela Zero it was faster. It seems we need much more aggressive batching for cuDNN to outperform it (for chess things are very different). This may have changed with RTX cards and tensor cores, which is why I was asking about this in the other thread. People are working on more aggressive batching for Leela Zero as well, but that should remind you that these days you cannot separate the DCNN implementation from the search specifics and tuning.
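The batching point above can be sketched in a few lines: the search queues leaf positions and only fires a (batched) network evaluation once enough have accumulated, which is what lets the GPU amortize its launch overhead. All names here are illustrative, not from the Leela Zero or lc0 codebases, and the "network" is a stand-in function.

```python
import numpy as np

def evaluate_net(batch):
    # Stand-in for a GPU forward pass over a batch of positions
    # (assumption: each position is a flat feature vector).
    return batch.sum(axis=1)

class BatchedEvaluator:
    """Collect positions from the search and evaluate them in one call."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []

    def submit(self, position):
        """Queue a position; returns results once a full batch is ready."""
        self.pending.append(position)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return None  # caller must keep searching other branches meanwhile

    def flush(self):
        """Evaluate whatever is queued (e.g. at the end of a search step)."""
        if not self.pending:
            return []
        batch = np.stack(self.pending)
        self.pending = []
        return list(evaluate_net(batch))
```

The catch, as the post says, is that while a position sits in the queue the search must find other work to do, so the batching strategy and the search algorithm end up designed together.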

The main reason to write my own implementation was, in any case, to avoid depending on the whims of the vendors' licensing, and to avoid maintaining split versions for the two card vendors.

Note that lc0's cuDNN backend was written by an NVIDIA driver engineer, who also dealt with getting the redistribution permission. I'm sure the implementation is state of the art.