During face-to-face conversation, people use visual feedback such as head nods to communicate relevant information and to synchronize rhythm between participants. In this paper we describe how contextual information from other participants can be used to predict visual feedback and improve recognition of head gestures in human-human interactions. For example, in a dyadic interaction, speaker contextual cues such as gaze shifts or changes in prosody will influence listener backchannel feedback (e.g., head nods). To automatically learn how to integrate this contextual information into the listener gesture recognition framework, this paper addresses two main challenges: optimal feature representation using an encoding dictionary and automatic selection of optimal feature-encoding pairs. Multimodal integration between context and visual observations is performed using a discriminative sequential model (Latent-Dynamic Conditional Random Fields) trained on previous interactions. In our experiments involving 38 storytelling dyads, our context-based recognizer significantly improved head gesture recognition performance over a vision-only recognizer.