Not sure I 100% get how you're approaching it now, but I can tell you how I do something similar.

In my case, the demos are shot with three or four cameras. One camera is usually a wide-ish shot of the talent presenting, the others are detail shots from different angles. The demos run 10-15 minutes each before editing.

I build the narration and overall demo from the wide shot on the primary storyline. I then go back and add secondary storylines or connected clips where appropriate, grabbing b-roll footage from the browser to show the detail. The b-roll action has to match audio from the primary storyline but it's not critical it be frame accurate so I just eyeball the sync if I have to.

Occasionally we have two angles of the presenter talking, so I'll make a multicam clip of them to build the primary storyline, but I'll still end up just grabbing the b-roll from the clip browser rather than messing with multicam.

Anyway, that's what works for me. Everything is shot with my editing approach in mind, and it takes me maybe an hour to get a rough cut down.

The easiest approach I've found is multicam, then copy & paste the last clip (sometimes it's 20 minutes long) onto the secondary storyline, & then get rid of its audio.
Cut the montage on the secondary, & then cut back to the primary, deleting the remaining 18 minute clip on the secondary. Repeat that 10 seconds later.

It just gets frustrating if you want to come back to a different angle on the primary & it isn't already selected (changing angles automatically applies to the secondary storyline, not the primary).
Plus, the clips layered above must be in a storyline to become magnetic.

I don't know if this is totally relevant to your situation, but I make a full multicam, then I make a second, cleaned-up multicam before I start editing. Lay out your narration with that, then use the first one to grab whatever b-roll you need to overlay.