JtR had that optimization between May 2013 and July 2014, but it was accidentally removed by JimF when he made other improvements. It was also never implemented for GPU. That is fixed now.

I believe you are still missing an early-reject opportunity that gives an even bigger boost. JtR always computes a single chunk of PBKDF2 regardless of key size. So compared to the naive implementation, we can actually boost AES256 by 300% and AES192 by 200%, and we get a boost for AES128 too, of 100%.
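
To illustrate where those numbers come from, here is a rough Python sketch, not JtR's actual code. It assumes PBKDF2-HMAC-SHA1 (20-byte chunks) and a derived output laid out as two keys of the AES key length followed by a 2-byte check value. That layout is my assumption for the example, and the names `pbkdf2_block`, `check_value_single_block` and `naive_block_count` are made up. Since each PBKDF2 chunk is computed independently of the others, only the chunk that holds the check value is needed to reject a wrong candidate.

```python
import hashlib
import hmac
import struct

HASH_LEN = 20  # SHA-1 digest size in bytes (assumed PRF: HMAC-SHA1)


def pbkdf2_block(password: bytes, salt: bytes, iterations: int, index: int) -> bytes:
    """Compute only chunk T_index (1-based) of PBKDF2-HMAC-SHA1."""
    u = hmac.new(password, salt + struct.pack(">I", index), hashlib.sha1).digest()
    t = int.from_bytes(u, "big")
    for _ in range(iterations - 1):
        u = hmac.new(password, u, hashlib.sha1).digest()
        t ^= int.from_bytes(u, "big")
    return t.to_bytes(HASH_LEN, "big")


def check_value_single_block(password: bytes, salt: bytes,
                             iterations: int, key_bits: int) -> bytes:
    """Derive the 2-byte check value from a single chunk.

    Assumed layout (hypothetical): [key | second key | 2-byte check value].
    """
    offset = 2 * (key_bits // 8)      # where the check value starts
    index = offset // HASH_LEN + 1    # the one chunk that contains it
    start = offset % HASH_LEN         # position inside that chunk
    return pbkdf2_block(password, salt, iterations, index)[start:start + 2]


def naive_block_count(key_bits: int) -> int:
    """Chunks a naive implementation computes to derive the full output."""
    total = 2 * (key_bits // 8) + 2
    return -(-total // HASH_LEN)      # ceiling division


if __name__ == "__main__":
    for bits in (128, 192, 256):
        n = naive_block_count(bits)
        print(f"AES{bits}: naive {n} chunks, single chunk boost {(n - 1) * 100}%")
```

Under that layout the naive version computes 2, 3 and 4 chunks for AES128, AES192 and AES256, so doing one chunk instead gives the 100%, 200% and 300% boosts. The full key material only has to be derived for the few candidates that pass the check.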