How should I transcribe hesitations into my Speech to Text training corpus?

I'm training Watson Speech to Text on a set of sample audio recordings, and uploading a custom language model to accompany them.

Reading the docs, it looks like a corpus file plus audio files is the closest I can get to explicitly pairing "this FLAC file = this transcript", but I'd still like to make my transcriptions as accurate as possible to ensure good recognition performance.

How should I transcribe hesitations like 'umm'/'err'/'ahh' for best results? Should I skip them entirely, or use some kind of marker like %HESITATION?
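For concreteness, here's a made-up utterance and the two candidate transcriptions I'm choosing between (the %HESITATION token is something I saw in Watson's recognition output, so I'm guessing it might also be valid in a corpus file, but I haven't confirmed that):

```
Spoken audio:  "umm, I'd like to, err, check my balance"

Option A (hesitations dropped):
I'd like to check my balance

Option B (hesitations marked):
%HESITATION I'd like to %HESITATION check my balance
```

Is one of these clearly better for the language model, or does it not matter much either way?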