I could obviously map the resulting buffer and do the accumulation myself on the CPU side, but it seems like there should be a way to do this on the GPU. I'd like to keep things roughly GL 3.0 compatible, though I'm not sure if that restricts things too much.

Limitations here include the maximum number of attributes you can feed the vertex shader. You also need to decide how to capture both the real transformed positions and the min/max, which on GL3 hardware with a single transform feedback stream means log2(N) passes in addition to the regular transform.

If you go way back to even older hardware (GL2 with float textures, or, say, ES2 on an iPhone), another approach is to use the fragment stage to compare multiple vertices:

b) modify vertex shader to always output 0,0 position. Output transformed position as a generic attribute. Set up simple pass-through fragment shader.
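c) set up a 1x1 floating point renderbuffer, with either MIN or MAX blending. Initially clear to a value that can never win the comparison: a very large positive value for MIN, a very large negative value for MAX. (Clearing to zero would give a wrong minimum whenever all inputs are positive.)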
d) draw N vertices as 1-px POINTS. Each point is blended onto the same framebuffer pixel, taking the min (or max).
e) ReadPixels (or similar) to get the final result.
f) repeat a second pass with the other blend equation (on GL4 hardware, you could use ARB_draw_buffers_blend to set up both blend equations in a single MRT pass.)
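
To make the blending steps concrete, here is a CPU-side emulation of what the 1x1 render target ends up holding. It is only a sketch: `Pixel`, `BlendEq` and `blend_points` are hypothetical names, not GL API. In a real program the inner loop is done by the blend unit after `glBlendEquation(GL_MIN)` or `glBlendEquation(GL_MAX)`. The emulation starts the pixel at +infinity for MIN and -infinity for MAX, on the assumption that the initial value should never win the comparison.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for one RGBA float pixel of the 1x1 render target.
using Pixel = std::array<float, 4>;

enum class BlendEq { Min, Max };  // emulates GL_MIN / GL_MAX

// Emulates drawing every vertex as a 1-px point onto the same pixel.
// `points` holds the transformed positions the vertex shader would emit.
Pixel blend_points(const std::vector<Pixel>& points, BlendEq eq) {
    assert(!points.empty());
    const float inf = std::numeric_limits<float>::infinity();
    // "Clear" step: start from a value that can never win the comparison.
    Pixel dst;
    dst.fill(eq == BlendEq::Min ? inf : -inf);
    for (const Pixel& src : points) {
        for (std::size_t c = 0; c < 4; ++c) {
            // Per-channel blend, as the fixed-function blend unit would do.
            dst[c] = (eq == BlendEq::Min) ? std::min(dst[c], src[c])
                                          : std::max(dst[c], src[c]);
        }
    }
    return dst;  // what ReadPixels would return after the pass
}
```

Calling `blend_points` once with `BlendEq::Min` and once with `BlendEq::Max` corresponds to the two passes in steps c) through f).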

Limitations here include the maximum number of components you can manipulate in the fragment stage (i.e. RGBA = 4 floats. More if your hardware supports ARB_draw_buffers.)

As mentioned, this is trivial with GL4.2+ hardware, using any of the atomic, image load/store, or compute shader solutions (compute shaders are core from GL 4.3).
OpenCL is another option, and it is available on GL3 hardware.

Could you elaborate a little on this one? Specifically the fragment shader variant.

Poor-man's compute shader:

1. Create two textures of the same size and format, large enough to hold all of your data.
2. Upload your data into one texture, and bind it to a texture unit (with no filtering).
3. Attach the other texture as the colour buffer of a framebuffer object, bind it, and set the viewport accordingly.
4. Render a quad that covers the whole viewport.

Each invocation of the fragment shader will calculate one "pixel" to be stored in the output texture. It can use data from the input texture in the calculation.

Because fragment shader invocations are performed in parallel, you need to use a divide-and-conquer approach, where each step halves the amount of data. After each step, you'd swap the textures, so the output from the last step becomes the input to the next step. You need log2(N) steps to process N elements. You would halve the size of the viewport at each step (as you're only calculating half as many values).

On each step, you'd calculate

Code :

out[i] = min(in[2*i], in[2*i+1]);

So if you start with e.g. 1024 pixels, you'd group these into 512 pairs, and the output would be the minimum of the 2 values for each pair, giving 512 values. On the next step, you'd group these into 256 pairs, and so on. On the tenth step, you'd have 2 inputs and one output, which would be the minimum of the 1024 initial values.
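
The pass structure can be checked on the CPU before writing any shaders. The following is a rough emulation, not GL code: `reduce_min` is a hypothetical helper, the two vectors stand in for the two textures, and each loop iteration stands in for one render pass. It assumes the element count is a power of two, as in the 1024-pixel example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Emulates the ping-pong reduction. `src` and `dst` play the roles of the
// input and output textures; they are swapped after every pass.
// If `passes_out` is non-null it receives the number of passes performed.
float reduce_min(std::vector<float> src, int* passes_out = nullptr) {
    std::size_t n = src.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two sizes only
    std::vector<float> dst(n);
    int passes = 0;
    while (n > 1) {
        n /= 2;  // "halve the viewport": this pass writes n outputs
        for (std::size_t i = 0; i < n; ++i) {
            // One fragment shader invocation per output element.
            dst[i] = std::min(src[2 * i], src[2 * i + 1]);
        }
        std::swap(src, dst);  // output of this pass feeds the next one
        ++passes;
    }
    if (passes_out) *passes_out = passes;
    return src[0];
}
```

For sizes that are not a power of two, one option is to pad the input with a value that cannot affect the result (for a minimum, a very large number) before reducing.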

One minor complication is that the textures have to be 2D, and there's a limit to the width. So if you had 65536 values, you might use 256x256 textures. So the calculation above would look something like: