Hey,
I work on Frostbite Rendering and DICE games :)
We have published information about older versions of this tech. We have improved it in numerous areas over time, but much of what is described there still holds true:
http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Andersson-TerrainRendering(Siggraph07).pdf
http://www.slideshare.net/DICEStudio/terrain-in-battlefield-3-a-modern-complete-and-scalable-system
For Dragon Age Inquisition, we also added advanced displacement mapping to the terrain (BF3 had basic terrain tessellation), and this tech made its way into Star Wars Battlefront, too.
Shading (i.e. PBR) has been one of the biggest improvements as of late to our fidelity (aside from great artists and photogrammetry - http://www.frostbite.com/2016/03/photogrammetry-and-star-wars-battlefront/)
We continually publish quite a lot of information here:
http://www.frostbite.com/topics/publications/
Cheers!
Graham

Hi, in my GDC 2016 talk, I discuss using the approach MJP mentioned to pass the draw index through to the indirect args using a root constant:
http://www.frostbite.com/2016/03/optimizing-the-graphics-pipeline-with-compute/
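A minimal sketch of what such an indirect argument layout can look like, assuming a D3D12-style command signature in which a single 32-bit root constant precedes the indexed draw arguments. The struct and function names here are hypothetical; DrawIndexedArgs only mirrors the fields of D3D12_DRAW_INDEXED_ARGUMENTS so the snippet compiles without the D3D12 headers:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the field order of D3D12_DRAW_INDEXED_ARGUMENTS (hypothetical stand-in).
struct DrawIndexedArgs {
    std::uint32_t indexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startIndexLocation;
    std::int32_t  baseVertexLocation;
    std::uint32_t startInstanceLocation;
};

// One record in the indirect argument buffer: the root constant comes first,
// followed by the draw arguments. This ordering is an assumption of the sketch.
struct IndirectCommand {
    std::uint32_t   drawIndex;  // written to a root constant before the draw
    DrawIndexedArgs draw;
};

// The command signature's byte stride would be sizeof(IndirectCommand).
static_assert(sizeof(IndirectCommand) == 24, "expected tightly packed 32-bit fields");

// Tags each draw with its position so the shader can look up per-draw data.
std::vector<IndirectCommand> buildIndirectCommands(const std::vector<DrawIndexedArgs>& draws) {
    std::vector<IndirectCommand> commands;
    commands.reserve(draws.size());
    for (std::size_t i = 0; i < draws.size(); ++i) {
        commands.push_back({static_cast<std::uint32_t>(i), draws[i]});
    }
    return commands;
}
```

In this sketch the shader would read drawIndex from the root constant and use it to index per-draw data, even after a compute pass has compacted or reordered the commands.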
Cheers,
Graham

Hi!
I'm the author/presenter of this research. Hodgman is correct here. Triangle culling is definitely worth it, as evidenced by my initial slides showing peak primitive rate per triangle vs. available ALU. I mention cluster culling just to show we have it, but previous research (like the Siggraph 2015 GPU-Driven Rendering Pipelines work) already covers cluster culling, so I wanted to take it further and detail per-triangle culling instead. Combining per-cluster with per-triangle culling is significantly better than doing per-cluster culling alone. You can have lots of surviving clusters within your frustum that contain tiny triangles which will not be removed. The same goes for depth and frustum culling: a surviving cluster can still contain triangles that fail those tests. Additionally, I go over some algorithms like blend shapes, cloth, or even voxelization which can't be done per-cluster, so this technique efficiently iterates per triangle, enabling better use of these algorithms.
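To make the per-triangle tests concrete, here is a minimal CPU-side sketch of two of them: a backface test and a test for tiny triangles that cover no pixel center. It is not the shader code from the talk. It assumes vertices are already projected to screen space with positive w, that counter-clockwise winding (positive signed area) is front-facing, and that pixel centers sit at half-integer coordinates. All names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };

// Twice the signed area of a screen-space triangle; the sign encodes winding.
float signedArea2(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Assumption: positive area means front-facing. Zero-area triangles are culled too.
bool isBackfaceOrDegenerate(Vec2 a, Vec2 b, Vec2 c) {
    return signedArea2(a, b, c) <= 0.0f;
}

// True if the closed interval [lo, hi] contains no pixel center k + 0.5.
bool spanMissesPixelCenters(float lo, float hi) {
    return std::ceil(lo - 0.5f) > std::floor(hi - 0.5f);
}

// A triangle whose bounding box misses every pixel center on either axis
// cannot produce any samples, so it can be dropped before rasterization.
bool isTooSmallToRasterize(Vec2 a, Vec2 b, Vec2 c) {
    float minX = std::min({a.x, b.x, c.x}), maxX = std::max({a.x, b.x, c.x});
    float minY = std::min({a.y, b.y, c.y}), maxY = std::max({a.y, b.y, c.y});
    return spanMissesPixelCenters(minX, maxX) || spanMissesPixelCenters(minY, maxY);
}

// Narrow-phase decision for one triangle (frustum and depth tests omitted).
bool shouldCullTriangle(Vec2 a, Vec2 b, Vec2 c) {
    return isBackfaceOrDegenerate(a, b, c) || isTooSmallToRasterize(a, b, c);
}
```

Neither test can be answered for a whole cluster at once, since both depend on the individual triangle's winding and screen-space extent, which is why a per-triangle pass can remove triangles that cluster culling leaves behind.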
NV has different bottlenecks when talking about primitive rate, so async compute isn't the showstopper here (I can't get into the specifics due to NDA, but these techniques can be implemented differently on NV for a big win - i.e. fast passthrough geometry shaders on Maxwell+).
It comes down to the ratio between the primitive rate of setup and that of the rasterizer. If they run at the same rate, culling in compute will be faster. If setup runs at 2x the rate of the rasterizer, you need more than 50% of your triangles to be backfacing for it to be effective, and the gains will be smaller.
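The break-even point can be illustrated with a toy throughput model. This is a simplification and not a measurement: it assumes fixed-function setup has to touch every submitted triangle, the rasterizer only sees triangles that survive culling, the slower stage determines the total time, and the cost of the compute culling pass itself is ignored. The names and rates are hypothetical.

```cpp
#include <algorithm>

// Toy model: rates are in triangles per clock (hypothetical values).
struct PipelineRates {
    double setupRate;   // primitive setup, which also culls in fixed function
    double rasterRate;  // rasterizer, which only sees surviving triangles
};

// Fixed-function path: setup processes all triangles, raster the survivors.
double clocksWithoutComputeCull(double triangles, double culledFraction, PipelineRates r) {
    double survivors = triangles * (1.0 - culledFraction);
    return std::max(triangles / r.setupRate, survivors / r.rasterRate);
}

// Compute-culled path: only survivors ever reach setup and raster.
// Assumption: the compute pass itself is free, which overstates the gain.
double clocksWithComputeCull(double triangles, double culledFraction, PipelineRates r) {
    double survivors = triangles * (1.0 - culledFraction);
    return std::max(survivors / r.setupRate, survivors / r.rasterRate);
}
```

Under these assumptions, with setup at twice the rasterizer rate and 1000 triangles, culling 25% gives 750 clocks on both paths (the rasterizer is the limit either way), while culling 75% gives 500 clocks without compute culling and 250 with it. With equal rates, any culled triangle is a saving, since setup is always the limit on the fixed-function path.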
In a future blog post, I may show more details of the cluster culling that we're doing - though no promises yet :)
In summary, per-triangle culling is currently absolutely beneficial on AMD. Sure, just like with most algorithms, you can do a coarse/broad phase cull pass, and then do a fine/narrow phase cull pass. This talk/research is about how to perform the absolute fastest narrow phase cull on GCN.
-Graham