Real-time captioning provides deaf and hard-of-hearing people with immediate access to spoken language and enables participation in dialogue with others. Low latency is critical because it allows speech to be paired with relevant visual cues. Currently, the only reliable source of real-time captions is expensive stenographers, who must be recruited in advance and are trained to use specialized keyboards. Automatic speech recognition (ASR) is less expensive and available on-demand, but its low accuracy, high noise sensitivity, and need for advance training render it unusable in real-world situations. In this paper, we introduce a new approach in which groups of non-expert captionists (people who can hear and type) collectively caption speech in real-time on-demand. We present Legion:Scribe, an end-to-end system that allows deaf people to request captions at any time. We introduce an algorithm for merging partial captions into a single output stream in real-time, and a captioning interface designed to encourage coverage of the entire audio stream. Evaluation with 20 local participants and 18 crowd workers shows that non-experts can provide an effective solution for captioning, accurately covering an average of 93.2% of an audio stream with only 10 workers and an average per-word latency of 2.9 seconds. More generally, our model, in which multiple workers contribute partial inputs that are automatically merged in real-time, may be extended to allow dynamic groups to surpass constituent individuals (even experts) on a variety of human performance tasks.
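To illustrate the idea of merging partial captions, the sketch below combines timestamped words from several workers into one stream by suppressing near-duplicate words that arrive within a short time window. This is only a minimal illustration under assumed inputs (per-word timestamps, a fixed duplicate window); the actual Legion:Scribe merging algorithm is more sophisticated.

```python
def merge_partial_captions(worker_streams, window=0.5):
    """Merge timestamped (time_sec, word) pairs from several workers
    into a single caption stream.

    Sketch only: a word is dropped if the same word was already
    emitted within `window` seconds, on the assumption that it is
    another worker's caption of the same utterance.
    """
    # Flatten all workers' partial captions and order them in time.
    all_words = sorted(
        (t, w.lower()) for stream in worker_streams for (t, w) in stream
    )
    merged = []
    for t, w in all_words:
        # Check recent output for a duplicate of this word.
        if any(w == pw and t - pt <= window for pt, pw in merged[-5:]):
            continue
        merged.append((t, w))
    return [w for _, w in merged]

# Three workers each caught overlapping fragments of "the quick brown fox".
streams = [
    [(0.0, "the"), (0.4, "quick")],
    [(0.45, "quick"), (0.8, "brown")],
    [(0.85, "brown"), (1.2, "fox")],
]
print(merge_partial_captions(streams))  # ['the', 'quick', 'brown', 'fox']
```

A production merger would also need to align multi-word overlaps and resolve disagreements between workers, not just drop exact duplicates.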

Automated systems are not yet able to engage in a robust dialogue with users due to the complexity and ambiguity of natural language. However, humans can easily converse with one another and maintain a shared history of past interactions. In this paper, we introduce Chorus, a system that enables real-time, two-way natural language conversation between an end user and a crowd acting as a single agent. Chorus is capable of maintaining a consistent, on-topic conversation with end users across multiple sessions, despite constituent individuals perpetually joining and leaving the crowd. This is enabled by a curated shared dialogue history. Even though crowd members are constantly providing input, we present users with a stream of dialogue that appears to come from a single conversational partner. Experiments demonstrate that dialogue with Chorus displays elements of conversational memory and interaction consistency. Workers were able to answer 84.6% of user queries correctly, demonstrating that crowd-powered communication interfaces can serve as a robust means of interacting with software systems.
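The crowd-as-single-agent pattern can be sketched as follows: workers propose candidate replies and vote among them, only the top-voted reply reaches the user, and every accepted message is appended to a shared history that newly arriving workers read to stay on topic. The class name, voting scheme, and example dialogue below are illustrative assumptions, not the actual Chorus implementation.

```python
from collections import Counter

class ChorusStyleAgent:
    """Minimal sketch of a crowd acting as one conversational partner."""

    def __init__(self):
        # Curated shared dialogue memory, readable by any worker who joins.
        self.history = []

    def user_says(self, message):
        self.history.append(("user", message))

    def crowd_replies(self, proposals, votes):
        """Pick the top-voted proposal and record it in the history.

        `votes` is a list of indices into `proposals`, one per voting
        worker; the user sees only the single winning reply.
        """
        winner = proposals[Counter(votes).most_common(1)[0][0]]
        self.history.append(("agent", winner))
        return winner

agent = ChorusStyleAgent()
agent.user_says("Find me a vegetarian restaurant downtown.")
reply = agent.crowd_replies(
    proposals=["Try Green Table on 5th Ave.", "What city are you in?"],
    votes=[0, 0, 1],
)
print(reply)  # "Try Green Table on 5th Ave."
```

Because the history, not any individual worker, carries the conversational state, the agent can remain consistent across sessions even as the crowd turns over.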

Web automation often involves users describing complex tasks to a system, with directives generally limited to low-level constituent actions like "click the search button." This level of description is unnatural and makes it difficult to generalize tasks across websites. In this paper, we propose a system for automatically recognizing higher-level interaction patterns, such as "searching for cat videos" or "replying to a post", from users' completion of tasks. We present PatFinder, a system that identifies these patterns using the input of crowd workers. We validate the system by generating data for 10 tasks, having 62 crowd workers label them, and automatically extracting 14 interaction patterns. Our results show that the number of patterns grows sublinearly with the number of tasks, suggesting that a small finite set of patterns may suffice to describe the vast majority of tasks on the web.
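The sublinear-growth observation can be made concrete with a toy version of the extraction step: collapse workers' free-text labels for action traces into a set of distinct patterns and watch the set grow more slowly than the task count. The canonicalization rule and all labels below are made-up assumptions; PatFinder's actual extraction is presumably far more robust.

```python
def extract_patterns(task_labels):
    """Collapse crowd workers' free-text labels into distinct
    higher-level interaction patterns (minimal sketch: lowercase,
    drop articles, deduplicate)."""
    stop = {"a", "an", "the"}

    def canon(label):
        return " ".join(w for w in label.lower().split() if w not in stop)

    patterns = []
    for labels in task_labels:  # one list of worker labels per task
        for label in labels:
            c = canon(label)
            if c not in patterns:
                patterns.append(c)
    return patterns

labels_per_task = [
    ["Search for an item", "search for item"],
    ["Reply to a post", "reply to post"],
    ["Search for an item"],  # a later task reuses an existing pattern
]
patterns = extract_patterns(labels_per_task)
print(len(patterns))  # 2 -- fewer distinct patterns than tasks
```

As later tasks increasingly map onto patterns already seen, the pattern set saturates, which is the behavior the abstract reports at larger scale.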