Conventionally, gesture recognition based on non-intrusive muscle-computer interfaces has required a strongly supervised learning algorithm and a large amount of labeled surface electromyography (sEMG) training signals. In this work, we show that the temporal relationship of sEMG signals and data glove readings provides an implicit supervisory signal for learning the gesture recognition model. To demonstrate this, we present a semi-supervised learning framework with a novel Siamese architecture for sEMG-based gesture recognition. Specifically, we employ auxiliary tasks to learn visual representations: predicting the temporal order of two consecutive sEMG frames and, optionally, predicting the statistics of 3D hand pose from an sEMG frame. Experiments on the NinaPro, CapgMyo and csl-hdemg datasets validate the efficacy of the proposed approach, especially when labeled samples are scarce.
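
As an illustration of the temporal-order auxiliary task, the following minimal sketch shows one way unlabeled recordings could be turned into training pairs without gesture labels. It is an assumption-laden example, not the framework's actual implementation: the function name `make_order_pairs`, the `Frame` type, the 50% swap probability and the toy recording are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# Assumed layout: one sEMG frame is a flat sequence with one value per channel.
Frame = Sequence[float]


def make_order_pairs(
    frames: Sequence[Frame], seed: int = 0
) -> List[Tuple[Frame, Frame, int]]:
    """Build (frame_a, frame_b, label) triples from consecutive frames.

    label is 1 when frame_a precedes frame_b in the recording and 0 when
    the two frames have been swapped. No gesture labels are needed.
    """
    rng = random.Random(seed)
    pairs = []
    for earlier, later in zip(frames, frames[1:]):
        if rng.random() < 0.5:  # hypothetical choice: swap half of the pairs
            pairs.append((earlier, later, 1))
        else:
            pairs.append((later, earlier, 0))
    return pairs


# Toy recording: 5 frames, 2 channels, values increase with time.
recording = [[t, 2 * t] for t in range(5)]
pairs = make_order_pairs(recording)
```

Each triple could then be fed to the two branches of a Siamese network, with the binary label serving as the implicit supervisory signal.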