Introduction

Microsatellite expansion, such as trinucleotide repeat expansion (TRE), is known to cause a number of genetic diseases. Sanger sequencing and next-generation short-read sequencing cannot interrogate TRE reliably. We developed a novel algorithm, RepeatHMM, to estimate repeat counts from long-read sequencing data. Evaluation on simulated data, real amplicon sequencing data for two repeat expansion disorders, and whole-genome sequencing data generated by PacBio and Oxford Nanopore technologies showed superior performance over competing approaches. We conclude that long-read sequencing coupled with RepeatHMM can estimate repeat counts on microsatellites and can interrogate the “unsequenceable” genomic trinucleotide repeat disorders.

RepeatHMM consists of several steps, as shown in the figure below. We use trinucleotide repeats as an example to illustrate the procedure, but RepeatHMM can be used for microsatellites of any size.
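To make the term "repeat count" concrete, here is a minimal Python sketch that counts the longest run of consecutive, exact copies of a motif in a sequence. It is not RepeatHMM's algorithm; the function name `longest_repeat_run` and the example sequences are hypothetical and serve only as illustration. Exact matching like this is expected to undercount on error-prone long reads, because a single miscalled base splits a run, which is the kind of problem a model-based estimator is meant to handle.

```python
def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of consecutive exact copies of `motif`.

    Illustrative only: a naive exact-match count, not the RepeatHMM method.
    """
    if not motif:
        raise ValueError("motif must be non-empty")

    sequence = sequence.upper()
    motif = motif.upper()
    k = len(motif)
    best = 0

    # Try every start position, so runs at any offset are considered.
    for start in range(len(sequence) - k + 1):
        run = 0
        pos = start
        # Extend the run one motif copy at a time.
        while sequence.startswith(motif, pos):
            run += 1
            pos += k
        best = max(best, run)

    return best


if __name__ == "__main__":
    # Hypothetical read fragment with three uninterrupted CAG copies.
    print(longest_repeat_run("TTCAGCAGCAGTT", "CAG"))
    # One substituted base (CAG -> CTG) splits the run into 1 and 2 copies.
    print(longest_repeat_run("TTCAGCTGCAGCAGTT", "CAG"))
```

The motif length is not fixed at three, so the same function works for repeat units of other lengths, for example a dinucleotide motif such as "CA".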

Performance of RepeatHMM, RepeatCCS, BAMself, and TRhist in estimating the repeat counts in ATXN3 for 20 patients with SCA3 and five controls. The gold-standard counts (x-axis) were determined by capillary electrophoresis for the 20 patients and by Sanger sequencing for the five controls. a Scatterplot of estimated repeat counts against true counts. b, c The difference between estimated and true repeat counts for RepeatHMM, RepeatCCS, BAMself, and TRhist. RepeatCCS refers to the use of RepeatHMM on error-corrected reads generated by the circular consensus sequencing protocol.