My game needs to stream everything to the GPU, so all three buffers (ARRAY_BUFFER, ELEMENT_ARRAY_BUFFER, UNIFORM_BUFFER) are updated before drawing.
To get usable performance, I map each buffer with GL_MAP_UNSYNCHRONIZED_BIT and update only small parts at a time, in ring-buffer fashion. When a buffer is full, I orphan it with glBufferData(target, size, NULL, GL_STREAM_DRAW).
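The streaming scheme described above can be sketched roughly as follows. This is only an illustration of the bookkeeping, not the game's actual code: the `stream_ring` type and its functions are hypothetical names, the GL calls appear only as comments so the snippet compiles without a GL context, and the `alignment` field is an assumption added because offsets passed to glBindBufferRange for uniform buffers must be multiples of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU-side bookkeeping for one streamed buffer object. */
typedef struct {
    size_t capacity;   /* total size of the buffer storage in bytes */
    size_t head;       /* first byte not yet handed out in the current storage */
    size_t alignment;  /* required offset alignment (power of two, at least 1) */
    unsigned orphans;  /* how many times the storage has been orphaned */
} stream_ring;

/* Round value up to the next multiple of a power-of-two alignment. */
static size_t align_up(size_t value, size_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

static void stream_ring_init(stream_ring *r, size_t capacity, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    r->capacity = capacity;
    r->head = 0;
    r->alignment = alignment;
    r->orphans = 0;
}

/* Reserve size bytes and return the offset at which to write them. */
static size_t stream_ring_alloc(stream_ring *r, size_t size)
{
    assert(size <= r->capacity);

    size_t offset = align_up(r->head, r->alignment);
    if (offset + size > r->capacity) {
        /* Buffer is full: orphan it so the driver can hand out fresh storage.
         * In a real renderer, roughly:
         *   glBufferData(target, capacity, NULL, GL_STREAM_DRAW);
         */
        r->orphans++;
        offset = 0;
    }

    /* The caller would now map only the reserved range, roughly:
     *   glMapBufferRange(target, offset, size,
     *                    GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
     */
    r->head = offset + size;
    return offset;
}
```

With a 256-byte buffer and 64-byte alignment, three 100-byte reservations would land at offsets 0 and 128, and then at 0 again after one orphaning.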
But my GLSL shaders seem to be much slower when using uniform buffer objects.
I have two code paths for uniforms: the first uses uniform buffers, the second updates all uniforms with glUniform. All of the following steps were done once with uniform buffers and once with glUniform.
All of the files mentioned below are available at: http://markus.members.selfnet.de/i965-ubo/
For profiling, I made apitrace dumps of both code paths.
The qapitrace profiler shows that I am GPU-bound almost all the time.
I've also dumped the INTEL_DEBUG=wm,shader_time output of both "glretrace -b" runs.
For completeness, there is also the intel_gpu_top output.
I think the qapitrace output shows that all shaders are slower, so it shouldn't be an issue with just one of them. Maybe the optimization fails for UBO uniforms?
My test environment:
Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
HD4000 GPU
3.0 Mesa 9.2-devel (git-8cabe26)

The patch series I just sent out (also available as the "ubo" branch of git://people.freedesktop.org/~anholt/mesa) fixes some rendering failures with your trace on my ivb while improving performance by 20%. Unfortunately, your non-UBO trace spewed endless errors about uniform updates (have you checked for GL errors from your app? Did the replay play back cleanly for you?), so I couldn't compare the two side by side.