In essence, without octrees, I could presumably do everything with two 3D textures.
Here are the steps I am taking:

1) Voxelize the objects and accumulate them into a 256x256x256 texture (storing their color and opacity).
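As a rough CPU-side illustration of step 1 (the real thing would be a GPU voxelization pass writing to a 3D texture; `splat` and the dict-based grid are made-up stand-ins, and the grid is shrunk from 256^3):

```python
# Hypothetical sketch: splatting surface samples into a color+opacity voxel grid.
N = 16  # grid resolution per axis (256 in the real pipeline)

# Each voxel accumulates premultiplied RGB plus an opacity/coverage value.
grid = {}  # (x, y, z) -> [r, g, b, alpha]

def splat(world_pos, color, alpha, world_min=(0.0, 0.0, 0.0), world_size=1.0):
    """Map a surface sample to its voxel and accumulate premultiplied color."""
    idx = tuple(
        min(N - 1, max(0, int((p - lo) / world_size * N)))
        for p, lo in zip(world_pos, world_min)
    )
    voxel = grid.setdefault(idx, [0.0, 0.0, 0.0, 0.0])
    for c in range(3):
        voxel[c] += color[c] * alpha   # premultiplied accumulation
    voxel[3] += alpha

# Example: two red samples landing in the same voxel.
splat((0.51, 0.5, 0.5), (1.0, 0.0, 0.0), 0.5)
splat((0.52, 0.5, 0.5), (1.0, 0.0, 0.0), 0.5)
```

On the GPU this accumulation is typically done with image atomics or a moving average, since many fragments can land in one voxel.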

2) Using shadow mapping, inject lighting into the voxels seen by the light. (As I understand it, this requires another 256x256x256 texture that stores the direction to the light and the incoming intensity as RGBA8.)
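A toy sketch of step 2, assuming a directional light with an orthographic shadow map over a unit cube (the names, the dict-based shadow map, and the depth convention here are all illustrative, not from any real API):

```python
# Hypothetical sketch: light injection from a shadow map into the light grid.
N = 16
light_grid = {}  # (x, y, z) -> (dir_to_light, intensity); RGBA8 in the GPU version

light_dir = (0.0, -1.0, 0.0)              # directional light pointing down
to_light = tuple(-c for c in light_dir)   # what each lit voxel stores

def inject(shadow_map, intensity=1.0):
    """For each shadow-map texel, reconstruct the first lit surface point
    and store direction-to-light + intensity in its voxel."""
    for (u, v), depth in shadow_map.items():
        # Orthographic light looking down -Y over the unit cube:
        world = (u / N, 1.0 - depth, v / N)
        idx = tuple(min(N - 1, int(c * N)) for c in world)
        light_grid[idx] = (to_light, intensity)

# Toy 4-texel shadow map: depth of the first hit surface along -Y.
inject({(4, 4): 0.5, (5, 4): 0.5, (4, 5): 0.25, (8, 8): 0.0})
```

One caveat worth checking: RGBA8 gives only 8 bits for intensity, so HDR light values would need a scale factor or a higher-precision format.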

3) Mipmap this light texture (up to 4 mip levels, maybe).
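Step 3 is just a 2x2x2 box-filter reduction per level; a minimal sketch (grid again shrunk for illustration, sparse dict standing in for the 3D texture):

```python
# Hypothetical sketch: building the mip chain of the light/radiance texture.
def downsample(level):
    """Average each 2x2x2 block of voxels into one parent voxel;
    missing (empty) voxels count as zero."""
    out = {}
    for (x, y, z), v in level.items():
        key = (x // 2, y // 2, z // 2)
        out[key] = out.get(key, 0.0) + v
    return {k: s / 8.0 for k, s in out.items()}

mip0 = {(0, 0, 0): 1.0, (1, 1, 1): 1.0}
mips = [mip0]
for _ in range(3):          # e.g. 4 levels total, as in the question
    mips.append(downsample(mips[-1]))
```

On the GPU this is what `glGenerateMipmap` (or a manual compute-shader reduction) does for a 3D texture; a manual pass gives more control over how empty voxels are weighted.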

4) For each geometry fragment visible to the viewer, at the end of the deferred pipeline, determine its corresponding voxel.
Then perform cone tracing from that original voxel through the nearby mip levels of the "light texture" and the corresponding "color texture" voxels, gathering the "colored" light reflected toward the original voxel.
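The core of step 4, as I understand it, is marching along the cone axis and picking the mip level whose footprint matches the cone radius at each step, compositing front to back. A hedged sketch (the `sample` helper is a made-up stand-in for a trilinear/quadrilinear 3D texture fetch, and each mip here is a dict mapping a voxel index to a (radiance, alpha) pair):

```python
import math

# Hypothetical sketch: marching one cone through the mip chain.
def sample(mips, level, pos):
    """Stand-in for a 3D texture fetch at an explicit LOD."""
    level = min(level, len(mips) - 1)
    idx = tuple(int(c) >> level for c in pos)
    return mips[level].get(idx, (0.0, 0.0))

def trace_cone(mips, origin, direction, aperture, max_dist=16.0):
    """Front-to-back accumulation: mip level grows with the cone radius."""
    radiance, occlusion = 0.0, 0.0
    t = 1.0                                  # start offset to avoid self-sampling
    while t < max_dist and occlusion < 1.0:
        radius = max(1.0, aperture * t)      # cone footprint at distance t
        level = int(math.log2(radius))       # pick the mip matching the footprint
        pos = tuple(o + d * t for o, d in zip(origin, direction))
        r, a = sample(mips, level, pos)
        radiance += (1.0 - occlusion) * r * a
        occlusion += (1.0 - occlusion) * a
        t += radius                          # step size follows the footprint
    return radiance
```

In a shader the `sample` stand-in would be a `textureLod` fetch, and the occlusion term is what gives you soft voxel AO for free.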

5) Apply this accumulated light result to the original fragment.
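For the diffuse part of step 5, the usual approach is to sum several cones spread over the fragment's hemisphere, cosine-weighted. A tiny sketch (the specific cone directions and weights below are illustrative, not a standard basis):

```python
# Hypothetical sketch: indirect diffuse = weighted sum of hemisphere cones.
def indirect_diffuse(trace, cones):
    """`trace` maps a cone direction to accumulated radiance;
    `cones` is a list of (direction, weight) pairs summing to 1."""
    return sum(w * trace(d) for d, w in cones)

# Example: one cone along the normal (+Y) plus four tilted cones.
cones = [((0.0, 1.0, 0.0), 0.25)] + [((s, 0.7, c), 0.1875)
                                     for s, c in ((1, 0), (-1, 0), (0, 1), (0, -1))]
result = indirect_diffuse(lambda d: 1.0, cones)   # uniform unit radiance
```

With weights summing to 1, a uniformly lit hemisphere yields exactly the input radiance, which is a useful sanity check for energy conservation.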

Did I miss anything? This looks like single-bounce GI; should I use Final Gather for the second bounce?