Class AtomicAllocator

Just-in-Time Allocator for CUDA
This class is the foundation for pre-allocated memory management for CUDA.
Essentially, it is a sophisticated garbage collector for both zero-copy memory and memory on multiple devices.
There are multiple possible data movement directions, but the general path is:
host memory (allocated on the JVM side) ->
zero-copy pinned memory (which is allocated for everything) ->
device memory (where data is moved from zero-copy memory, if it is accessed actively enough)
Data also moves backward when memory is no longer used (e.g. the originating INDArray was collected by the JVM GC), or when it is not accessed often enough to justify keeping it in device memory.
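The promotion/demotion between tiers described above can be sketched as a small state machine. This is an illustrative sketch only, not the actual nd4j implementation: the class names, the access-count threshold, and the GC callback are all assumptions made for the example.

```java
// Hypothetical sketch of tiered memory movement: host -> zero-copy -> device,
// with demotion back to host when the owning INDArray is garbage-collected.
// Names and the promotion threshold are illustrative, not nd4j's real logic.
enum Tier { HOST, ZERO_COPY, DEVICE }

class Chunk {
    Tier tier = Tier.HOST;
    long accessCount = 0;

    // Called whenever the chunk is touched by a device-side operation.
    void onDeviceAccess() {
        accessCount++;
        if (tier == Tier.HOST) {
            tier = Tier.ZERO_COPY;          // first use: stage in zero-copy pinned memory
        } else if (tier == Tier.ZERO_COPY && accessCount >= 3) {
            tier = Tier.DEVICE;             // chunk is "hot": promote to device memory
        }
    }

    // Called when the originating INDArray was collected by the JVM GC.
    void onGc() {
        tier = Tier.HOST;                   // demote; backing memory can be reclaimed
        accessCount = 0;
    }
}

public class TierDemo {
    public static void main(String[] args) {
        Chunk c = new Chunk();
        c.onDeviceAccess();
        System.out.println(c.tier); // ZERO_COPY
        c.onDeviceAccess();
        c.onDeviceAccess();
        System.out.println(c.tier); // DEVICE
        c.onGc();
        System.out.println(c.tier); // HOST
    }
}
```

In the real allocator the promotion decision is based on richer usage statistics than a single counter, but the direction of movement is the same.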
The mechanism is as lock-free as possible. This is achieved using three-state memory signalling: Tick/Tack/Toe.
Tick: a memory chunk (or part of it) is being accessed on the device
Tack: a device access session for a memory chunk (or part of it) has finished
Toe: a memory chunk is locked for some reason. Possible reasons:
Memory synchronization is ongoing, host->gpu or gpu->host
Memory relocation is ongoing, zero->gpu, gpu->zero, or gpu->host
Memory removal is ongoing.
So memory that is used only for internal calculations, without interference from manual changes (putRow() etc.), is always available without locks.
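The Tick/Tack/Toe signalling can be sketched with atomic counters: compute threads "tick" when a device access session starts and "tack" when it ends, and maintenance work may only take the "toe" lock when the two counters match (no session in flight). This is a minimal illustrative sketch, not nd4j's actual API; the class and method names are assumptions.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of Tick/Tack/Toe signalling for one memory chunk.
// Names are illustrative, not the actual nd4j AtomicAllocator internals.
class ChunkState {
    private final AtomicLong ticks = new AtomicLong(0);        // device sessions started
    private final AtomicLong tacks = new AtomicLong(0);        // device sessions finished
    private final AtomicBoolean toe = new AtomicBoolean(false); // chunk locked

    // Compute thread: try to start a device access session (Tick).
    boolean tick() {
        if (toe.get()) return false;    // chunk is locked: caller must wait/retry
        ticks.incrementAndGet();
        if (toe.get()) {                // re-check: lock may have been taken meanwhile
            tacks.incrementAndGet();    // roll back our session
            return false;
        }
        return true;
    }

    // Compute thread: device access session finished (Tack).
    void tack() {
        tacks.incrementAndGet();
    }

    // Allocator: try to lock the chunk (Toe) for sync/relocation/removal.
    boolean tryToe() {
        if (!toe.compareAndSet(false, true)) return false; // already locked
        if (ticks.get() != tacks.get()) {   // active device sessions remain
            toe.set(false);
            return false;
        }
        return true;
    }

    void releaseToe() {
        toe.set(false);
    }
}

public class ChunkStateDemo {
    public static void main(String[] args) {
        ChunkState s = new ChunkState();
        System.out.println(s.tick());   // true: no lock, session opened
        System.out.println(s.tryToe()); // false: one session still open
        s.tack();
        System.out.println(s.tryToe()); // true: ticks == tacks, lock taken
        System.out.println(s.tick());   // false: chunk is locked
        s.releaseToe();
    }
}
```

The key property is that the common path (tick/tack) uses only atomic increments, so chunks under pure computational use never block on a mutex; only synchronization, relocation, and removal pay for the toe lock.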