N-Body with GLSL and VBO's (how to approach)?

I would need to utilize a calculations similar to N-Body simulation for large group of vertexes. I have read everything I can find for VBO's , PBO's etc. but my limited understanding lacks the process. The papers are confusing when talking about glReadPixels(), buffers and memory since they don't specify clearly is the buffer/memory mentioned in the graphics card or in the CPU-side.
I understand that with 1M particles it's impossible to transfer data via bus anymore so all the magic has to happen inside GPU ?

What would be the correct way to setup such rendering- pipeline?
Could someone write a short list of events in CPU's opengl-side and GPU's shader-program side?

I have setup VBO already and have some very basic understanding of GLSL vertex & fragment-shaders. With some more study I think I could implement the rendering but need some pointers first.

What do I need in OpenGL-side? (textures to hold velocity, force-vector data etc. which are uploaded to graphics card) VBO? FBO? PBO?

Are the force/velocity/etc calculations done in vertex- or fragment-shaders? How are the results looped back to modify actual vertex-positions? Does that happen also inside GPU without bus & cpu?

I have especially hard time to grasp when things happens inside GPU and when the is bus involved to transfer data back for CPU. I tried briefly to look apples PBOexample (heightfield) but couldn't understand it fully. Is the bus involved in that example? If so then what is transferred back and forth?

OpenGL does not guarantee that a buffer will reside in RAM or VRAM, or that a function will be executed by the CPU or GPU.

In general, it is always possible for you to create something too complex for the graphics card to execute. So it is purposely ambiguous; if you allocate a buffer and render particles into it, you can hint where you want the buffer allocated, but the implementation may still need to run on the CPU, depending on the capabilities of the particular graphics card you have, and the complexity of your state.

Thanks. I am aware of that and many similar papers and have read them (even multiple times )
I understand the theory behind but lack the implementation level details. i.e. what particular method should I choose with current hardware. It appears that OpenGL 2.0 with glsl 1.2 would be required level. (according to capability chart provided by some helpful dude )

Is this correct way in general level:
1) create FBO and textures
2) attach textures (color-buffers) to FBO
3) bind FBO to framebuffer
4) draw for example single quad size of FBO to trigger fragment-shader for each texel.

this is what I don't know how to do...5) fragment shader reads position (x,y,z float32) and velocities (x,y,z float32) from texture? forces from uniforms and then calculate new positions and velocities and writes them to another texture(s) ? There are some limitations involved and I can't write to same texture I am reading?

6) read back new updated values with glReadPixels() and just bind the resulting buffer to VBO without any memcpy
7) render VBO to visible framebuffer
8) repeat

I have looked some examples but since everything is new to me it's little difficult to pick the idea from 500 lines of codes in those samples.

OK, so let me explain how PBORenderToVertexArray is working. Then discuss how you'd need to modify it for particles.

It is similar to the steps you listed, except that in that demo the state is all maintained in simple uniforms, like "ripple_xlate". It doesn't need to track particle position/velocity from the previous frame. So:

1) create FBO and textures (one float texture which will have RGBA colors rendered into it, to be used as XYZW positions. RECTANGLE target is used only for compatibility with the GeForce 5200.)
2) attach textures to FBO (depth buffer is only needed for the spinning rings.)
3) bind FBO to framebuffer.
4) draw single quad, fragment shader executes once per pixel in the quad (here, it calculates two twirled ripples, based on the animation uniforms.)
5) each resulting RGBA pixel of the ripple is written to the texture bound to the current FBO (using orthographic projection). At this point, I have a W x H x 4float texture containing my 2D array of positions. This array very likely is in VRAM (on Mac OS X, it will be resident in VRAM unless some other process has just evicted it, i.e. you're thrashing.)
6) bind PBO to previously allocated buffer of size W x H x 4floats. This buffer was hinted to live in VRAM, but the decision is up to OpenGL. Call ReadPixels to copy from the FBO into the buffer. If the buffer was in VRAM, this is a fast VRAM->VRAM copy (on Mac OS X, it will be in VRAM if the driver supports it. One example where it is not supported is the GMA 950, which does not have hardware TCL. There, the VBO data must always reside in RAM for the CPU to do vertex transform, so that will be a VRAM->RAM readback.)
7) bind buffer as VBO, render as points, etc to visible framebuffer.
8) update uniforms and repeat.

For particle simulations, you need to track some per-particle state. Let's say just position and velocity. You can not read from the texture you are currently rendering to (due to hardware-dependent synchronization unpredictability) so, you need to create two textures per state attribute that you want to track. Lets say PosA, PosB and VelA, VelB. These are each W x H x 4float. Ping-pong the data between A and B every other frame like:

setup) A contains initial state, B undefined
frame 1, 3, 5...) attach B to FBO to capture result. Bind A to samplers, read from them in shader.
frame 2, 4, 6...) attach A to FBO to capture result. Bind B to samplers, read from them in shader.

Here, if you have 2 attributes, you can run two passes, reading from PosA and VelA, and writing to PosB in the first pass and VelB in the second pass. Or, if the renderer supports ARB_draw_buffers, you can attach both Pos and Vel to the FBO (as COLOR_ATTACHMENT_0 and COLOR_ATTACHMENT_1) and render to them in a single pass by writing to gl_FragData[0] and gl_FragData[1] instead of gl_FragColor.

You don't necessarily need OpenGL 2.0 or GLSL 1.20. In fact, if you care about running on less capable hardware, you can replace FBO with pbuffers or CopyTexImage, and float textures with RGBA8 (with obvious precision/range limitations.) You can replace ARB_draw_buffers with multiple passes.

On the other hand, if you only care about running on new hardware, like the GeForce 8x00, then you can use vertex texturing to eliminate the FBO->VBO copy:

6) bind FBO texture to sampler defined in vertex shader. (As of Leopard, all renderers support vertex texturing, but older hardware will fall back to software emulation.)
7) bind a dummy static VBO big enough to trigger the vertex shader execution W x H times. Read the real position from the texture. Transform etc and render to visible framebuffer.

I think this is mentioned in the 2007 GDC paper, but another approach possible on modern hardware is to move all of the simulation out of the fragment shader, to the vertex/geometry shader. Capture the resulting position/velocity into VBOs directly, with EXT_transform_feedback (on Mac OS X, this is implemented everywhere, but falls back to software emulation on older hardware.)

As to which implementation is best for current hardware-- I suggest you try them all