Well, I think it's the camera. Here's a test you can do: take a video of your hands slowly clapping, then bring the video into an editing program where you can see the audio waveform together with the video, frame-by-frame.

Below is an example of what this might look like in Final Cut Pro X. We're looking at frame 4:09 (the slightly lightened bar at the top right of the playhead indicates the extent of the frame). Below it, in green, is the audio waveform. My hands are still far apart, yet in the audio the clap has already happened, about in the middle of the previous frame.

Audio ahead of video, Olympus OM-D E-M5

And here we are, two frames later (4:11). My hands have finally clapped in the video, too. So the audio is about 3 frames ahead of the video. I've tried the experiment at both 720 and 1080, and both show about the same problem. (My decade-old tape-based video camera has no such issues.)
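If you'd rather think in milliseconds than frames, the arithmetic is simple. Here's a rough sketch in Python. It assumes the 4:09-style positions mean seconds:frames and that the clip runs at 30 fps; neither is something I checked, so plug in your own frame rate. The helper names are just made up for this example.

```python
def timecode_to_frames(seconds: float, frame: float, fps: float) -> float:
    """Convert a seconds:frames position to a (possibly fractional) frame count."""
    return seconds * fps + frame


def offset_ms(audio_pos: float, video_pos: float, fps: float) -> float:
    """Milliseconds by which the audio leads the video (positive = audio early)."""
    return (video_pos - audio_pos) * 1000.0 / fps


FPS = 30.0  # assumption: not stated in the post; use your clip's real rate

# Audio clap: about the middle of frame 4:08 (the frame before 4:09).
audio_clap = timecode_to_frames(4, 8.5, FPS)
# Video clap: somewhere within frame 4:11; the middle is taken as a guess.
video_clap = timecode_to_frames(4, 11.5, FPS)

lead_frames = video_clap - audio_clap
lead_ms = offset_ms(audio_clap, video_clap, FPS)
print(f"audio leads by {lead_frames:.1f} frames = {lead_ms:.0f} ms at {FPS:g} fps")
```

Since the clap could have landed anywhere within frame 4:11, the real figure is uncertain by roughly half a frame either way.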

Obviously, the fix is easy (detach the audio and nudge it ~3 frames to the right), but (a) it should not be necessary, and (b) if you want better than 1-frame accuracy (which a modern device really ought to manage), you have to either guesstimate or record a bunch of claps and find the shift that works best on average (i.e., you're manually doing sub-frame sampling).
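For the "bunch of claps" approach, here's a minimal sketch of the averaging in Python. It assumes that for each clap you've noted the time of the audio peak (in seconds) and the index of the first video frame in which the hands touch. The function name and the sample numbers are invented for illustration; they are not measurements from my camera.

```python
from statistics import mean, stdev


def estimate_audio_lead(claps, fps):
    """Estimate how far the audio leads the video, in seconds.

    claps: list of (audio_time_s, video_frame) pairs, one per clap, where
    audio_time_s is the waveform peak time and video_frame is the index of
    the first frame in which the hands touch.
    Returns (mean_lead_s, spread_s).
    """
    leads = [frame / fps - audio_t for audio_t, frame in claps]
    spread = stdev(leads) if len(leads) > 1 else 0.0
    return mean(leads), spread


# Hypothetical measurements, made up for illustration only.
example_claps = [(1.02, 34), (2.51, 78), (4.00, 123), (5.49, 168)]
example_fps = 30.0  # assumption: use your clip's real frame rate

lead_s, spread_s = estimate_audio_lead(example_claps, example_fps)
print(f"shift audio right by {lead_s * 1000:.0f} ms "
      f"({lead_s * example_fps:.2f} frames), spread {spread_s * 1000:.0f} ms")
```

The first frame that shows contact can lag the real contact by up to one frame, so this estimate may come out slightly too large; treat it as a starting point and fine-tune by ear.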