Abstract

Searching collaboratively for places of interest is a common activity that frequently occurs on individual mobile phones, or on large tourist-information displays in public places such as visitor centers or train stations. We
created a public display system for collaborative travel planning, as well as a mobile app that can augment the display. We tested them against third-party mobile apps in a simulated travel-search task to understand how the unique features of mobile phones and large displays might be leveraged together to improve the collaborative travel-planning experience.

Abstract

We leverage a popularity measure in social media as a distant label for extractive summarization of online conversations. In social media, users can vote for, share, or bookmark a post they prefer, and the number of these actions is regarded as a measure of popularity. However, popularity is not solely determined by the content of a post, e.g., a text or an image in the post, but is highly contaminated by its context, e.g., timing and authority. We propose a disjunctive model, which computes the contributions of content and context separately. For evaluation, we build a dataset in which the informativeness of each comment is annotated. We evaluate the results with ranking metrics and show that our model outperforms the baseline model, which directly uses popularity as a measure of informativeness.
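One way to read the disjunctive model is as an additive decomposition of predicted popularity into a content term and a context term, so that ranking by informativeness can use the content term alone. The sketch below illustrates this reading with hypothetical linear feature weights; it is not the paper's implementation.

```python
# Minimal sketch of a disjunctive popularity model (illustrative only):
# popularity is modeled as the sum of a content term and a context term,
# so the content contribution can be isolated and used to rank comments.
# Feature names and the linear form are assumptions.

def content_score(features, w_content):
    """Contribution of the post's content (e.g., text features)."""
    return sum(w_content.get(k, 0.0) * v for k, v in features.items())

def context_score(features, w_context):
    """Contribution of the post's context (e.g., timing, authority)."""
    return sum(w_context.get(k, 0.0) * v for k, v in features.items())

def predicted_popularity(content, context, w_content, w_context):
    # Disjunctive form: the two terms are computed separately and summed.
    return content_score(content, w_content) + context_score(context, w_context)

def rank_by_content(posts, w_content):
    """Rank by the content term only, discarding context contamination."""
    return sorted(posts, key=lambda p: content_score(p["content"], w_content),
                  reverse=True)
```

Training would fit both weight vectors jointly against observed popularity; only the content term is then used at ranking time.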

Abstract

Prewriting is the process of generating and organizing ideas before drafting a document. Although often overlooked by novice writers and writing tool developers, prewriting is a critical process that improves the quality of a final document. To better understand current prewriting practices, we first conducted interviews with writing learners and experts. Based on the learners’ needs and experts’ recommendations, we then designed and developed InkPlanner, a novel pen-and-touch visualization tool that allows writers to utilize visual diagramming for ideation during prewriting. InkPlanner further allows writers to sort their ideas into a logical and sequential narrative by using a novel widget, the NarrativeLine. Using a NarrativeLine, InkPlanner can automatically generate a document outline to guide later drafting exercises. InkPlanner is powered by machine-generated semantic and structural suggestions that are curated from various texts. To qualitatively review the tool and understand how writers use InkPlanner for prewriting, two writing experts were interviewed and a user study was conducted with university students. The results demonstrated that InkPlanner encouraged writers to generate more diverse ideas and also enabled them to think more strategically about how to organize their ideas for later drafting.

Abstract

With the tremendous progress in sensing and IoT infrastructure, it is foreseeable that IoT systems will soon be available for commercial markets, such as in people's homes. In this paper, we present a deployment study using sensors attached to household objects to capture the resourcefulness of three individuals. The concept of resourcefulness
highlights the ability of humans to repurpose objects spontaneously for a different use case than was initially intended. It is a crucial element of human health and wellbeing, and of great interest for various aspects of HCI and design research. Traditionally, resourcefulness is captured through ethnographic practice, which can provide only sparse and often short-duration observations of human experience, often relying on participants being aware of and remembering behaviours or thoughts they need to report on. Our hypothesis is that resourcefulness can also be captured by continuously monitoring objects being used in everyday life. We developed a system that records object movement continuously and deployed it in the homes of three elderly people for over two weeks. We explored the use of probabilistic topic models to analyze the collected data and identify common patterns.

Abstract

Despite reflection being identified as a key component of behavior change, most existing tools do not explicitly design for it, carrying an implicit assumption that providing access to self-tracking data is enough to trigger reflection. In this work, we design a system for reflection around physical activity. Through a set of workshops, we generated a corpus of 275 reflective questions, which we then combined into a set of 25 reflective mini-dialogues delivered through MMS. Thirty-three active users of fitness trackers used our system in a two-week field deployment. Results suggest that the mini-dialogues were successful in triggering reflection and that this reflection led to increases in motivation, empowerment, and adoption of new behaviors. Encouragingly, 16 participants elected to use the system for two additional weeks without compensation. We present implications for the design of technology-supported dialogue systems for reflection.

Abstract

Continuous monitoring with unobtrusive wearable social sensors is becoming a popular method to assess individual affect states and team effectiveness in human research. A large number of applications have demonstrated the effectiveness of applying wearable sensing in corporate settings, for example in short periodic social events or on a university campus. However, little is known about how we can automatically detect individual affect and group cohesion for long-duration missions. Predicting negative affect states and low cohesiveness is vital for team missions: knowing team members’ negative states allows timely interventions to enhance their effectiveness. This work investigates whether sensing social interactions and individual behaviors with wearable sensors can provide insights into assessing individual affect states and group cohesion. We analyzed wearable sensor data from a team of six crew members who were deployed on a four-month simulation of a space exploration mission at a remote location. Our work proposes to recognize team members’ affect states and group cohesion as a binary classification problem using novel behavior features that represent dyadic interactions and individual activities. Our method aggregates features from individual members into group levels to predict team cohesion. Our results show that the behavior features extracted from the wearable social sensors provide useful information for assessing personal affect and team cohesion. Group task cohesion can be predicted with a high performance of over 0.8 AUC. Our work demonstrates that we can extract social interactions from sensor data to predict group cohesion in longitudinal missions. We found that quantifying behavior patterns, including dyadic interactions and face-to-face communications, is important in assessing team processes.
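The member-to-group feature aggregation described above can be sketched as computing summary statistics over per-member behavior features; the particular statistics (mean, min, max) and feature names here are illustrative assumptions, not the paper's exact feature set.

```python
# Sketch of aggregating per-member behavior features into group-level
# features for cohesion prediction. Aggregation statistics are assumed.
from statistics import mean

def aggregate_group_features(member_features):
    """member_features: list of dicts, one per crew member,
    e.g. {"f2f_minutes": 42.0, "dyadic_interactions": 7.0}.
    Returns one group-level feature dict for the classifier."""
    keys = member_features[0].keys()
    group = {}
    for k in keys:
        values = [m[k] for m in member_features]
        group[f"{k}_mean"] = mean(values)
        group[f"{k}_min"] = min(values)
        group[f"{k}_max"] = max(values)
    return group
```

The resulting group-level vector would feed a binary classifier predicting high versus low task cohesion.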

Abstract

Accurate localization is a fundamental requirement for a variety of applications, ranging from industrial robot operations to location-powered applications on mobile devices. A key technical challenge in achieving this goal is providing a clean and reliable estimation of location from a variety of low-cost, uncalibrated sensors. Many current techniques rely on Particle Filter (PF) based algorithms, which have proven successful at fusing various sensor inputs to create meaningful location predictions. In this paper, we build upon this large corpus of work. Like prior work, our technique fuses Received Signal Strength Indicator (RSSI) measurements from Bluetooth Low Energy (BLE) beacons with map information. A key contribution of our work is a new sensor model for BLE beacons that does not require a mapping from RSSI to distance. We further contribute a novel method of utilizing map information during the initialization of the system and during the resampling phase when new particles are generated. Using our proposed sensor model and map prior information, the 75th percentile of the cumulative error distribution improves by 1.20 m compared with traditional localization techniques.
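A particle-filter measurement update that scores raw RSSI directly, without converting RSSI to distance, can be sketched as follows. The per-cell Gaussian RSSI profile is a stand-in assumption for whatever learned or empirical sensor model the paper actually uses.

```python
# Toy particle-filter weight update with a sensor model that scores raw
# RSSI against an expected-RSSI map (no RSSI-to-distance conversion).
# The Gaussian profile and sigma are illustrative assumptions.
import math

def gaussian_likelihood(observed_rssi, expected_rssi, sigma=6.0):
    d = observed_rssi - expected_rssi
    return math.exp(-0.5 * (d / sigma) ** 2)

def update_weights(particles, observed, rssi_map):
    """particles: list of (x, y, weight); observed: {beacon_id: rssi};
    rssi_map[(x, y)][beacon_id] -> expected RSSI at that grid cell."""
    new = []
    for x, y, w in particles:
        like = 1.0
        for beacon, rssi in observed.items():
            like *= gaussian_likelihood(rssi, rssi_map[(x, y)][beacon])
        new.append((x, y, w * like))
    total = sum(w for _, _, w in new) or 1.0
    return [(x, y, w / total) for x, y, w in new]  # normalize weights
```

Resampling and the map-prior initialization the abstract mentions would sit around this update in a full filter loop.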

Abstract

In this paper, we develop a system for the low-cost indoor localization and tracking problem using radio signal strength indicator, Inertial Measurement Unit (IMU), and magnetometer sensors. We develop a novel and simplified probabilistic IMU motion model as the proposal distribution of the sequential Monte Carlo technique to track the robot trajectory. Our algorithm can globally localize and track a robot with an a priori unknown location, given an informative prior map of the Bluetooth Low Energy (BLE) beacons. We also formulate the problem as an optimization problem that serves as the back-end of the algorithm above (the front-end). Thus, by simultaneously solving for the robot trajectory and the map of BLE beacons, we recover a continuous and smooth trajectory of the robot, corrected locations of the BLE beacons, and the time-varying IMU bias. Hardware evaluations show that the proposed closed-loop system improves localization performance; furthermore, the system becomes robust to errors in the map of beacons by feeding the optimized map back to the front-end.
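The role of a probabilistic IMU motion model as a proposal distribution can be sketched as follows: each particle's pose is advanced by the IMU-integrated displacement plus Gaussian noise. The noise scales and input format are illustrative assumptions, not the paper's calibrated model.

```python
# Sketch of a simplified probabilistic IMU motion model used as the
# proposal distribution of a particle filter. Noise scales are assumed.
import math, random

def propagate(particle, imu_delta, sigma_xy=0.05, sigma_theta=0.02):
    """particle: (x, y, theta); imu_delta: (dx, dy, dtheta) in the robot
    frame, as integrated from accelerometer/gyro readings."""
    x, y, theta = particle
    dx, dy, dtheta = imu_delta
    # Rotate the body-frame displacement into the world frame.
    wx = dx * math.cos(theta) - dy * math.sin(theta)
    wy = dx * math.sin(theta) + dy * math.cos(theta)
    return (x + wx + random.gauss(0, sigma_xy),
            y + wy + random.gauss(0, sigma_xy),
            theta + dtheta + random.gauss(0, sigma_theta))
```

In the closed-loop system, the back-end's optimized beacon map would then correct the weights computed over particles sampled from this proposal.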

Abstract

In this chapter we discuss the use of external sources of data in designing conversational dialogues. We focus on applications in behavior change around physical activity involving dialogues that help users better understand their self-tracking data and motivate healthy behaviors. We start by introducing the areas of behavior change and personal informatics and discussing the importance of self-tracking data in these areas. We then introduce the role of reflective dialogue-based counseling systems in this domain, discuss the specific value that self-tracking data can bring, and how it can be used in creating the dialogues. The core of the chapter focuses on six practical examples of the design of dialogues involving self-tracking data that we either tested in our research or propose as future directions based on our experiences. We end the chapter by discussing how the design principles for involving external data in conversations can be applied to broader domains. Our goal for this chapter is to share our experiences, outline design principles, highlight several design opportunities in external data-driven computer-based conversations, and encourage the reader to explore creative ways of involving external sources of data in shaping dialogue-based interactions.

Abstract

We introduce a system to automatically manage photocopies made from copyrighted printed materials. The system monitors photocopiers to detect the copying of pages from copyrighted publications. Such activity is tallied for billing purposes. Access rights to the materials can be checked to prevent printing. Digital images of the copied pages are checked against a database of copyrighted pages. To preserve the privacy of the copying of non-copyright materials, only digital fingerprints are submitted to the image matching service. A problem with such systems is creation of the database of copyright pages. To facilitate this, our system maintains statistics of clusters of similar unknown page images along with copy sequence. Once such a cluster has grown to a sufficient size, a human inspector can determine whether those page sequences are copyrighted. The system has been tested with 100,000s of pages from conference proceedings and with millions of randomly generated pages. Retrieval accuracy has been around 99% even with copies of copies or double-page copies.

Abstract

Historically, people have interacted with companies and institutions through telephone-based dialogue systems and paper-based forms. Now, these interactions are rapidly moving to web- and phone-based chat systems. While converting traditional telephone dialogues to chat is relatively straightforward, converting forms to conversational interfaces can be challenging. In this work, we introduce methods and interfaces to enable the conversion of PDF and web-based documents that solicit user input into chat-based dialogues. Document data is first extracted to associate fields and their textual descriptions using meta-data and lightweight visual analysis. The field labels, their spatial layout, and associated text are further analyzed to group related fields into natural conversational units. These correspond to questions presented to users in chat interfaces to solicit information needed to complete the original documents and the downstream processes they support. This user-supplied data can be inserted into the source documents and/or downstream databases. User studies of our tool show that it streamlines form-to-chat conversion and produces conversational dialogues of at least the same quality as a purely manual approach.
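One simple heuristic consistent with the spatial-layout grouping described above is to merge vertically adjacent fields into one conversational unit. The sketch below assumes a flat list of extracted fields with top coordinates; the gap threshold and input format are illustrative, not the system's actual analysis.

```python
# Illustrative sketch: group extracted form fields into conversational
# units by vertical proximity. The gap threshold is an assumption.
def group_fields(fields, max_gap=20):
    """fields: list of dicts with 'label' and 'y' (top coordinate).
    Returns a list of groups, each a list of field labels to be asked
    together as one chat question."""
    groups, current = [], []
    prev_y = None
    for f in sorted(fields, key=lambda f: f["y"]):
        if prev_y is not None and f["y"] - prev_y > max_gap:
            groups.append(current)
            current = []
        current.append(f["label"])
        prev_y = f["y"]
    if current:
        groups.append(current)
    return groups
```

A real pipeline would additionally weigh field labels and nearby descriptive text, not layout alone.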

Abstract

SlideDiff is a system that automatically creates an animated rendering of textual and media differences between two versions of a slide. While previous work focuses either on textual or image data, SlideDiff integrates text and media changes, as well as their interactions, e.g. adding an image forces nearby text boxes to shrink. Provided with two versions of a slide (not the full history of edits), SlideDiff detects the textual and image differences, and then animates the changes by mimicking what a user would have done, such as moving the cursor, typing text, resizing image boxes, and adding images. This editing metaphor is well known to most users, helping them better understand what has changed, and fosters a sense of connection between remote workers, making them feel as if they had edited together. After detection of text and image differences, the animations are rendered in HTML and CSS, including mouse cursor motion, text and image box selection and resizing, and text deletion and insertion with a visible text cursor. We discuss strategies for animating changes, in particular the importance of starting with large changes and finishing with smaller edits, and provide evidence of the utility of SlideDiff in a workplace setting.

Abstract

Exploring coordinated relationships (e.g., shared relationships between two sets of entities) is an important analytics task in a variety of real-world applications, such as discovering similarly behaved genes in bioinformatics, detecting malware collusion in cyber security, and identifying product bundles in marketing analysis. Coordinated relationships can be formalized as biclusters. To support visual exploration of biclusters, bipartite-graph-based visualizations have been proposed, with edge bundling used to show biclusters. However, edge bundling suffers from edge crossings due to possible overlaps of biclusters, and its impact on users exploring biclusters in bipartite graphs is not well understood. To address these issues, we propose a novel bicluster-based seriation technique that can reduce edge crossings in bipartite graph drawings, and we conducted a user experiment to study the effect of edge bundling and the proposed technique on visualizing biclusters in bipartite graphs. We found that both had an impact on reducing entity visits for users exploring biclusters, and that edge bundles helped them find more justified answers. Moreover, we identified four key trade-offs that inform the design of future bicluster visualizations. The study results suggest that edge bundling is critical for exploring biclusters in bipartite graphs, helping to reduce low-level perceptual problems and support high-level inferences.

Abstract

Devices with embedded sensors are permeating the computing landscape, allowing the collection and analysis of rich data about individuals, smart spaces, and their interactions. This class of devices enables a useful array of home automation and connected workplace functionality to individuals within instrumented spaces. Unfortunately, the increasing pervasiveness of sensors can lead to perceptions of privacy loss by their occupants. Given that many instrumented spaces exist as platforms outside of a user’s control—e.g., IoT sensors in the home that rely on cloud infrastructure or connected workplaces managed by one’s employer—enforcing access controls via a trusted reference monitor may do little to assuage individuals’ privacy concerns. This calls for novel enforcement mechanisms for controlling access to sensed data.
In this paper, we investigate the interplay between sensor fidelity and individual comfort, with the goal of understanding the design space for effective, yet palatable, sensors for the workplace. In the context of a common space contextualization task, we survey and interview individuals about their comfort with three common sensing modalities: video, audio, and passive infrared. This allows us to explore the extent to which discomfort with sensor platforms is a function of detected states or sensed data. Our findings uncover interesting interplays between content, context, fidelity, history, and privacy. This, in turn, leads to design recommendations regarding how to increase comfort with sensing technologies by revisiting the mechanisms by which user preferences and policies are enforced in situations where the infrastructure itself is not trusted.

Abstract

Massive Open Online Course (MOOC) platforms have scaled online education to unprecedented enrollments, but remain limited by their rigid, predetermined curricula. Increasingly, professionals consume this content to augment or update specific skills rather than complete degree or certification programs. To better address the needs of this emergent user population, we describe a visual recommender system called MOOCex. The system recommends lecture videos {\em across} multiple courses and content platforms to provide a choice of perspectives on topics. The recommendation engine considers both video content and sequential inter-topic relationships mined from course syllabi. Furthermore, it allows for interactive visual exploration of the semantic space of recommendations within a learner's current context.

Abstract

An enormous amount of conversation occurs online every day, including on chat platforms where multiple conversations may take place concurrently.
Interleaved conversations lead to difficulties in not only following discussions but also retrieving relevant information from simultaneous messages.
Conversation disentanglement aims to separate overlapping messages into detached conversations.
In this paper, we propose to leverage representation learning for conversation disentanglement. A Siamese Hierarchical Convolutional Neural Network (SHCNN), which integrates local and more global representations of a message, is first presented to estimate the conversation-level similarity between closely posted messages. With the estimated similarity scores, our algorithm for Conversation Identification by SImilarity Ranking (CISIR) then derives conversations based on high-confidence message pairs and pairwise redundancy.
Experiments were conducted with four publicly available datasets of conversations from Reddit and IRC channels. The experimental results show that our approach significantly outperforms comparative baselines in both pairwise similarity estimation and conversation disentanglement.
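The step of deriving conversations from high-confidence message pairs can be sketched with a union-find merge: messages whose estimated pairwise similarity exceeds a threshold are placed in the same conversation. The threshold and scores below are placeholders for the SHCNN estimates, and this omits CISIR's ranking and redundancy handling.

```python
# Minimal sketch of grouping messages into conversations from
# high-confidence similarity pairs via union-find. Threshold assumed.
def disentangle(pairs, threshold=0.9):
    """pairs: dict mapping (msg_a, msg_b) -> similarity score.
    Returns a list of conversations (sets of message ids)."""
    parent = {}

    def find(m):
        parent.setdefault(m, m)
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    def union(a, b):
        parent[find(a)] = find(b)

    for (a, b), score in pairs.items():
        find(a); find(b)           # register both messages
        if score >= threshold:     # merge only confident pairs
            union(a, b)
    groups = {}
    for m in parent:
        groups.setdefault(find(m), set()).add(m)
    return list(groups.values())
```

Low-confidence pairs leave their messages in separate conversations unless linked transitively through confident pairs.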

Abstract

Conversational agents stand to play an important role in supporting behavior change and well-being in many domains. With users able to interact with conversational agents through both text and voice, understanding how designing for these channels supports behavior change is important. To begin answering this question, we designed a conversational agent for the workplace that supports workers’ activity journaling and self-learning through reflection. Our agent, named Robota, combines chat-based communication as a Slack Bot and voice interaction through a personal device using a custom Amazon Alexa
Skill. Through a 3-week controlled deployment, we examine how voice-based and chat-based interaction affect workers’ reflection and support self-learning. We demonstrate that, while many technical limitations currently exist, adding dedicated mobile voice interaction separate from the already busy chat modality may further enable users to step back and reflect on their work. We conclude with a discussion of the implications of our findings for the design of workplace self-tracking systems specifically and behavior-change systems in general.

Abstract

Convolutional Neural Networks (CNNs) have successfully been utilized for localization using a single monocular image [1]. Most work to date has either focused on reducing the dimensionality of data for better learning of parameters during training or on developing different variations of CNN models to improve pose estimation. Many of the best performing works consider only the content of a single image, ignoring the context from historical images. In this paper, we propose a combined CNN-LSTM model capable of incorporating contextual information from historical images to better estimate the current pose. Experimental results on a dataset collected in an indoor office space improved the overall system results to 0.8 m and 2.5° at the third quartile of the cumulative distribution, compared with 1.5 m and 3.0° achieved by PoseNet [1]. Furthermore, we demonstrate how the temporal information exploited by the CNN-LSTM model assists in localizing the robot in situations where the image content does not have sufficient features.
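Feeding historical context into an LSTM amounts to assembling fixed-length windows of per-image CNN feature vectors, with the pose predicted at each window's last frame. The sketch below shows only that data-assembly step; the window length is an assumption, and the CNN and LSTM themselves are omitted.

```python
# Sketch of building sliding-window sequences of per-image feature
# vectors as LSTM input for pose estimation. Window length is assumed.
def make_sequences(features, window=5):
    """features: chronologically ordered list of per-image CNN feature
    vectors. Returns (sequence, target_index) pairs: each sequence of
    `window` past frames predicts the pose at its last frame."""
    return [(features[i - window + 1:i + 1], i)
            for i in range(window - 1, len(features))]
```

Each returned sequence would be passed through the LSTM, whose final hidden state regresses the 6-DoF pose for the target frame.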

Abstract

In this paper, we propose a novel solution for optimizing the deployment of radio-frequency (RF) beacons for indoor localization. Our system optimizes both the number of beacons and their placement in a given environment. We introduce a novel cost function, called CovBSM, that simultaneously optimizes 3-coverage while maximizing beacon spreading. Using this cost function, we build a framework that jointly optimizes the number of beacons and their placement. The proposed solution accounts for the indoor infrastructure and its influence on RF signal propagation by embedding a realistic simulator into the optimization process.
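A cost of the shape the abstract describes can be sketched as a weighted combination of a 3-coverage term and a spreading term. This is a hypothetical reading with assumed weights and a simple disc propagation model, not the paper's CovBSM implementation, which uses a realistic RF simulator.

```python
# Hypothetical sketch of a coverage-plus-spreading placement cost,
# loosely following the CovBSM idea. Weights and the disc model are
# assumptions; lower cost = better placement.
import math

def coverage_term(points, beacons, radius, k=3):
    """Fraction of sample points covered by at least k beacons."""
    def covered(p):
        hits = sum(1 for b in beacons if math.dist(p, b) <= radius)
        return hits >= k
    return sum(covered(p) for p in points) / len(points)

def spreading_term(beacons):
    """Minimum pairwise beacon distance (larger = better spread)."""
    return min(math.dist(a, b)
               for i, a in enumerate(beacons)
               for b in beacons[i + 1:])

def placement_cost(points, beacons, radius, alpha=1.0, beta=0.1):
    return -(alpha * coverage_term(points, beacons, radius, k=3)
             + beta * spreading_term(beacons))
```

An outer search (e.g., simulated annealing over beacon counts and positions) would then minimize this cost.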

Abstract

Massive Open Online Course (MOOC) platforms have scaled online education to unprecedented enrollments, but remain limited by their rigid, predetermined curricula. This paper presents MOOCex, a technique that can offer a more flexible learning experience for MOOCs. MOOCex can recommend lecture videos across different courses with multiple perspectives, and considers both the video content and also sequential inter-topic relationships mined from course syllabi. MOOCex is also equipped with interactive visualization allowing learners to explore the semantic space of recommendations within their current learning context. The results of comparisons to traditional methods, including content-based recommendation and ranked list representation, indicate the effectiveness of MOOCex. Further, feedback from MOOC learners and instructors suggests that MOOCex enhances both MOOC-based learning and teaching.

Abstract

Understanding team communication and collaboration patterns is critical for improving work efficiency in organizations. This paper presents an interactive visualization system, T-Cal, that supports the analysis of conversation data from modern team messaging platforms (e.g., Slack). T-Cal employs a user-familiar visual interface, a calendar, to enable seamless multi-scale browsing of data from different perspectives. T-Cal also incorporates a number of analytical techniques for disentangling interleaving conversations, extracting keywords, and estimating sentiment. The design of T-Cal is based on an iterative user-centered design process including field studies, requirements gathering, initial prototype demonstrations, and evaluation with domain users. The resulting two case studies indicate the effectiveness and usefulness of T-Cal in real-world applications, including student group chats during a MOOC and daily conversations within an industry research lab.

Abstract

This paper describes the development of a multi-sensory clubbing experience which was deployed during a two-day event within the context of the Amsterdam Dance Event in October 2016 in Amsterdam. We present how the entire experience was developed end-to-end and deployed at the event through the collaboration of several project partners from industries such as art and design, music, food, technology and research. Central to the system are smart textiles, namely wristbands equipped with Bluetooth LE sensors which were used to sense people attending the dance event. We describe the components of the system, the development process, collaboration between the involved entities and the event itself. To conclude the paper, we highlight insights gained from conducting a real-world research deployment across many collaborators and stakeholders.

Abstract

Effective communication of activities and progress in the workplace is crucial for the success of many modern organizations. In this paper, we extend current research on workplace communication and uncover opportunities for technology to support effective work activity reporting. We report on three studies: With a survey of 68 knowledge workers followed by 14 in-depth interviews, we investigated the perceived benefits of different types of progress reports and an array of challenges at three stages: Collection, Composition, and Delivery. We show an important interplay between written and face-to-face reporting, and highlight the importance of tailoring a report to its audience. We then present results from an analysis of 722 reports composed by 361 U.S.-based knowledge workers, looking at the influence of the audience on a report’s language. We conclude by discussing opportunities for future technologies to assist both employees and managers in collecting, interpreting, and reporting progress in the workplace.

Abstract

Activity recognition is a core component of many intelligent and context-aware systems. In this paper, we present a solution for discreetly and unobtrusively recognizing common work activities above a work surface without using cameras. We demonstrate our approach, which utilizes an RF-radar sensor mounted under the work surface, in two work domains; recognizing work activities at a convenience-store counter (useful for post-hoc analytics) and recognizing common office deskwork activities (useful for real-time applications). We classify seven clerk activities with 94.9% accuracy using data collected in a lab environment, and recognize six common deskwork activities collected in real offices with 95.3% accuracy. We show that using multiple projections of RF signal leads to improved recognition accuracy. Finally, we show how smartwatches worn by users can be used to attribute an activity, recognized with the RF sensor, to a particular user in multi-user scenarios. We believe our solution can mitigate some of users’ privacy concerns associated with cameras and is useful for a wide range of intelligent systems.

Abstract

This paper examines content-based recommendation in domains exhibiting sequential topical structure. An example is educational video, including Massive Open Online Courses (MOOCs) in which knowledge builds within and across courses. Conventional content-based or collaborative filtering recommendation methods do not exploit courses' sequential nature. We describe a system for video recommendation that combines topic-based video representation with sequential pattern mining of inter-topic relationships. Unsupervised topic modeling provides a scalable and domain-independent representation. We mine inter-topic relationships from manually constructed syllabi that instructors provide to guide students through their courses. This approach also allows the inclusion of multi-video sequences among the recommendation results.
Integrating the resulting sequential information with content-level similarity provides relevant as well as diversified recommendations. Quantitative evaluation indicates that the proposed system, \textit{SeqSense}, recommends fewer redundant videos than baseline methods, and instead emphasizes results consistent with mined topic transitions.
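One plausible reading of how sequential information integrates with content-level similarity is a blended score per candidate video. The blending weight and both score sources below are illustrative stand-ins, not SeqSense's actual formulation.

```python
# Sketch of blending content similarity with mined topic-transition
# strength when ranking candidate videos. Alpha is an assumption.
def recommend(current_video, candidates, similarity, transition, alpha=0.5):
    """similarity/transition: dicts keyed by (current, candidate) pairs,
    holding content-similarity and syllabus-mined transition scores.
    Returns candidates sorted by the blended score, best first."""
    def score(c):
        key = (current_video, c)
        return (alpha * similarity.get(key, 0.0)
                + (1 - alpha) * transition.get(key, 0.0))
    return sorted(candidates, key=score, reverse=True)
```

With the transition term included, a slightly less similar video that follows a mined topic progression can outrank a redundant near-duplicate, matching the diversification the abstract reports.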

Abstract

Traditional summarization initiatives have focused on specific types of documents such as articles, reviews, videos, image feeds, or tweets, a practice that risks pigeonholing the summarization task amid modern, content-rich multimedia collections. Consequently, much of the research to date has revolved around toy problems in narrow domains and single-source media types. We argue that summarization and story generation systems need to refocus the problem space in order to meet the information needs of the age of user-generated content in different formats and languages. Here we create a framework for flexible multimedia storytelling. Narratives, stories, and summaries carry a set of challenges in big data and dynamic multi-source media that give rise to new research in spatial-temporal representation, viewpoint generation, and explanation.

Abstract

Tutorials are one of the most fundamental means of conveying knowledge. Ideally, when the task involves physical or digital objects, tutorials not only describe each step with text or via audio narration but also show it using photos or animation. In most cases, online tutorial authors capture media from handheld mobile devices to compose these documents, but increasingly they use wearable devices as well. In this work, we explore the full life-cycle of online tutorial creation and viewing using head-mounted capture and displays. We developed a media-capture tool for Google Glass that requires minimal attention to the capture device and instead allows the author to focus on creating the tutorial's content rather than its capture. The capture tool is coupled with web-based authoring tools for creating annotatable videos and multimedia documents. In a study comparing standalone (camera on tripod) versus wearable capture (Google Glass) as well as two types of multimedia representation for authoring tutorials (video-based or document-based), we show that tutorial authors have a preference for wearable capture devices, especially when recording activities involving larger objects in non-desktop environments. Authors preferred document-based multimedia tutorials because they are more straightforward to compose and the step-based structure translates more directly to explaining a procedure. In addition, we explored using head-mounted displays (Google Glass) for accessing tutorials in comparison to lightweight computing devices such as tablets. Our study included tutorials recorded with the same capture methods as in our authoring study. We found that although authors preferred head-mounted capture, tutorial consumers preferred video recorded by a camera on tripod that provides a more stable image of the workspace.
Head-mounted displays are good for glanceable information; however, video demands more attention, and our participants made more errors using Glass than when using a tablet, which was easier to ignore. Our findings point to several design implications for online tutorial authoring and access methods.

Abstract

Advances in small, low-power electronics have created new opportunities for the Internet of Things (IoT), leading to an explosion of physical objects being connected to the Internet. However, there is still no indoor localization solution that answers the needs of various location-based IoT applications with the desired simplicity, robustness, accuracy, and responsiveness. We introduce Foglight, a visible-light-enabled indoor localization system for IoT devices that relies on the unique spatial encoding produced when the mechanical mirrors inside a projector are flipped according to gray-coded binary images. Foglight employs simple off-the-shelf light sensors that can be easily coupled with existing IoT devices, such as thermometers, gas meters, or light switches, making their locations discoverable. Our sensor unit is computationally efficient: it performs high-accuracy localization with minimal signal-processing overhead, allowing any low-power IoT device on which it rests to locate itself. Results from our evaluation reveal that Foglight can locate a target device with an average accuracy of 1.7 millimeters and an average refresh rate of 84 Hz with minimal latency: 31.46 milliseconds over WiFi and 23.2 milliseconds over serial communication. Two example applications are developed to demonstrate possible scenarios as proof of concept. We also discuss limitations, how they could be overcome, and propose next steps.
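The core of gray-coded spatial encoding is that a sensor recording one bit per projected frame can recover its column (or row) index by standard gray-to-binary decoding. The sketch below shows that decoding step; the bit-order convention is an assumption.

```python
# Gray-code decoding sketch: recover a position index from the bit
# sequence a light sensor observes across gray-coded projector frames.
def gray_to_index(bits):
    """bits: list of 0/1 sensed across the frames, most significant
    bit first. Returns the decoded position index."""
    value = bits[0]
    out = value
    for b in bits[1:]:
        value ^= b              # gray decode: XOR with running prefix
        out = (out << 1) | value
    return out
```

Gray coding ensures adjacent positions differ in exactly one bit, so a sensor sitting on a pattern boundary misreads its index by at most one step.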

Abstract

We present a system for capturing ink strokes written with ordinary pen and paper using a fast camera with a frame rate comparable to a stylus digitizer. From the video frames, ink strokes are extracted and used as input to an online handwriting recognition engine. A key component in our system is a pen up/down detection model for detecting the contact of the pen-tip with the paper in the video frames. The proposed model consists of feature representation with convolutional neural networks and classification with a recurrent neural network. We also use a high speed tracker with kernelized correlation filters to track the pen-tip. For training and evaluation, we collected labeled video data of users writing English and Japanese phrases from public datasets, and we report on character accuracy scores for different frame rates in the two languages.

Abstract

Video telehealth is growing, allowing more clinicians to see patients from afar. As a result, clinicians, typically trained for in-person visits, must learn to communicate both health information and non-verbal affective signals to patients through a digital medium. We introduce ReflectLive, a system that senses clinicians' non-verbal communication behaviors and provides real-time feedback so they can improve them. A user evaluation with 10 clinicians showed that the real-time feedback helped clinicians maintain better eye contact with patients and was not overly distracting. Clinicians reported being more aware of their non-verbal communication behaviors and reacted positively to summaries of their conversational metrics, motivating them to improve. Using ReflectLive as a probe, we also discuss the benefits of and concerns around automatically quantifying the “soft skills” and complexities of clinician-patient communication, the controllability of behaviors, and design considerations for presenting real-time and summative feedback to clinicians.

Abstract

Humans are complex, and their behaviors follow complex multimodal patterns; yet to solve many social computing problems, one often looks at complexity through large-scale but single-point data sources or methodologies. While single-data, single-method techniques, fueled by large-scale data, have enjoyed some success, they are not without fault. Often, with one type of data and method, all other aspects of human behavior are overlooked, discarded, or, worse, misrepresented. We identify this as two distinct problems: first, social computing problems that cannot be solved using a single data source and need intelligence from multiple modalities; and second, social behavior that cannot be fully understood using only one form of methodology. Throughout this talk, we discuss these problems and their implications, illustrate examples, and propose new directions for approaching social computing research in today's age.

Abstract

Discovering and analyzing biclusters, i.e., two sets of related entities with close relationships, is a critical task in many real-world applications, such as exploring entity co-occurrences in intelligence analysis and studying gene expression in bio-informatics. While the output of biclustering techniques can offer some initial low-level insights, visual approaches are required on top of them due to the complexity of the algorithmic output. This paper proposes a visualization technique, called BiDots, that allows analysts to interactively explore biclusters over multiple domains. BiDots overcomes several limitations of existing bicluster visualizations by encoding biclusters in a more compact and cluster-driven manner. A set of convenient interactions is incorporated to support flexible analysis of biclustering results. More importantly, BiDots addresses the case of weighted biclusters, which has been underexploited in the literature. The design of BiDots is grounded in a set of analytical tasks derived from previous work. We demonstrate its usefulness and effectiveness for exploring computed biclusters with an investigative document analysis task, in which suspicious people and activities are identified from a text corpus.

Abstract

During asynchronous collaborative analysis, handing off partial findings is challenging because the externalizations produced by analysts may not adequately communicate their investigative process. To address this challenge, we developed techniques to automatically capture and help encode tacit aspects of the investigative process based on an analyst's interactions, and to streamline the explicit authoring of handoff annotations. We designed our techniques to mediate awareness of analysis coverage, support explicit communication of progress and uncertainty through annotation, and enable implicit communication through playback of investigation histories. To evaluate our techniques, we developed an interactive visual analysis system, KTGraph, that supports an asynchronous investigative document analysis task. We conducted a two-phase user study to characterize a set of handoff strategies and to compare investigative performance with and without our techniques. The results suggest that our techniques promote more effective handoff strategies, increase awareness of the prior investigative process and insights, and improve final investigative outcomes.

Abstract

Does the structure of family trees differ by ancestral traits over generations, and if so, how? This is a fundamental question regarding the structural heterogeneity of family trees for multi-generational transmission research. However, previous work mostly focuses on parent-child scenarios due to the lack of proper tools to handle the complexity of extending the research to multi-generational processes. Through an iterative design study with social scientists and historians, we develop TreeEvo, which assists users in generating and testing empirical hypotheses for multi-generational research. TreeEvo summarizes and organizes family trees by structural features in a dynamic manner based on a traditional Sankey diagram. A pixel-based technique is further proposed to compactly encode trees with complex structures in each Sankey node. Detailed information about the trees is accessible through a space-efficient visualization with semantic zooming. Moreover, TreeEvo embeds a Multinomial Logit Model (MLM) to examine statistical associations between tree structure and ancestral traits. We demonstrate the effectiveness and usefulness of TreeEvo through an in-depth case study with domain experts using a real-world dataset (containing 54,128 family trees of 126,196 individuals).

Abstract

For tourists, interactions with digital public displays often depend on specific technologies that users may not be familiar with (QR codes, NFC, Bluetooth); may not have access to because of networking issues (SMS), a missing app (QR codes), or missing device technology (NFC); may not want to use because of time constraints (WiFi, Bluetooth); or may not want to use because they are worried about sharing their data with a third-party service (text, WiFi). In this demonstration, we introduce ItineraryScanner, a system that allows users to seamlessly share content with a public travel kiosk system.

Abstract

Video summarization and video captioning are considered two separate tasks in existing studies. For longer videos, automatically identifying the important parts of video content and annotating them with captions will enable a richer and more concise condensation of the video. We propose a general neural network architecture that jointly considers two supervisory signals (i.e., an image-based video summary and text-based video captions) in the training phase and generates both a video summary and corresponding captions for a given video in the test phase. Our main idea is that the summary signals can help a video captioning model learn to focus on important frames. On the other hand, caption signals can help a video summarization model learn better semantic representations. Jointly modeling both the video summarization and the video captioning tasks offers a novel end-to-end solution that generates a captioned video summary, enabling users to index and navigate through the highlights in a video. Moreover, our experiments show the joint model can achieve better performance than state-of-the-art approaches in both individual tasks.

Abstract

In this paper, we describe DocHandles, a novel system that allows users to link to specific document parts in their chat applications. As users type a message, they can invoke the tool by referring to a specific part of a document, e.g., “@fig1 needs revision”. By combining text parsing and document layout analysis, DocHandles can find and present all figures numbered “1” in previously shared documents, allowing users to explicitly link to the relevant “document handle”. Documents become first-class citizens inside the conversation stream, letting users seamlessly integrate documents into their text-centric messaging application.
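The message-parsing half of such a system can be sketched as a simple pattern matcher; the handle types recognized below beyond the "@fig1" example are our assumption, not the paper's syntax:

```python
import re

# Recognized handle kinds are illustrative assumptions (the paper only shows "@fig1").
HANDLE = re.compile(r"@(fig|table|sec)(\d+)", re.IGNORECASE)

def extract_handles(message: str):
    """Return (kind, number) pairs for every document handle in a chat message."""
    return [(m.group(1).lower(), int(m.group(2))) for m in HANDLE.finditer(message)]
```

Each extracted pair would then be resolved against the layout-analysis index of previously shared documents to present candidate figures.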

Abstract

It is increasingly possible to use cameras and sensors to detect and analyze human appearance in order to personalize user experiences. Such systems are already deployed in some public places to personalize advertisements and recommend items. However, since these technologies are not yet widespread, we do not have a good sense of the perceived benefits and drawbacks of public display systems that use face detection as an input for personalized recommendations. We conducted a user study with a system that inferred a user's gender and age using a facial detection and analysis algorithm and presented recommendations in two scenarios (finding stores to visit in a mall and finding a pair of sunglasses to buy). This work provides an initial step toward understanding user reactions to a new and emerging form of implicit recommendation based on physical appearance.

Abstract

The availability of mobile access has shifted social media use. Consequently, what users share on social media and where they visit is naturally an excellent resource for learning their visiting behavior. Knowing visit behaviors would help market surveys and customer relationship management, e.g., sending customers coupons for the businesses they visit frequently. Most prior studies leverage metadata, e.g., check-in locations, to profile visiting behavior but neglect important information from user-contributed content, e.g., images. This work addresses a novel use of image content for predicting a user's visit behavior, i.e., the frequent and regular business venue categories that the content owner visits. To collect training data, we propose a strategy that uses the geo-metadata associated with images to derive labels for an image owner's visit behavior. Moreover, we model a user's sequential images with an end-to-end learning framework to reduce the optimization loss, which improves prediction accuracy over the baseline, as demonstrated in our experiments. The prediction is based entirely on image content, which is more readily available in social media than geo-metadata, and thus allows profiling of a wider set of users.

Abstract

Video conferencing is widely used to help deliver educational presentations, such as lectures or informational webinars, to a distributed audience. While individuals in a dyadic conversation may be able to use webcam streams to assess the engagement level of their interlocutor with some ease, as the size of the audience in a video conference setting increases, it becomes increasingly difficult to interpret how engaged the overall group may be. In this work, we use a mixed-methods approach to understand how presenters and attendees of online presentations use available cues to perceive and interpret audience behavior (such as how engaged the group is). Our results suggest that while webcams are seen as useful by presenters to increase audience visibility and encourage attention, audience members do not uniformly benefit from seeing others' webcams; other interface cues such as chat may be more useful and informative engagement indicators for both parties. We conclude with design recommendations for future systems to improve what is sensed and presented.

Abstract

In this paper, we propose a real-time classification scheme to cope with the noisy Radio Signal Strength Indicator (RSSI) measurements used in indoor positioning systems. RSSI values are often converted to distances for position estimation. However, due to multipath and shadowing effects, finding a unique sensor model using either parametric or nonparametric methods is highly challenging. We learn decision regions using Gaussian Process classification to accept measurements that are consistent with the operating sensor model. The proposed approach runs online, does not rely on a particular sensor model or parameters, and is robust to sensor failures. Experimental results obtained using hardware show that available positioning algorithms can benefit from incorporating the classifier into their measurement model as a meta-sensor modeling technique.
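The RSSI-to-distance conversion mentioned above is commonly done with a log-distance path-loss model; the sketch below uses illustrative parameter values, not values from this work:

```python
def rssi_to_distance(rssi_dbm: float, rssi_at_1m: float = -40.0,
                     path_loss_exp: float = 2.0) -> float:
    """Log-distance path-loss model: RSSI(d) = RSSI(1 m) - 10 * n * log10(d).

    rssi_at_1m and path_loss_exp are illustrative; in practice they vary per
    environment, which is exactly what makes any single fixed model unreliable.
    """
    return 10 ** ((rssi_at_1m - rssi_dbm) / (10.0 * path_loss_exp))
```

Multipath and shadowing corrupt this mapping unpredictably, which is what motivates learning decision regions with a Gaussian Process classifier to reject inconsistent measurements rather than trusting a single parametric model.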

Abstract

Observe a person pointing at and describing something. Where is that person looking? Chances are good that this person is also looking at what she is talking about and pointing at. Gaze is naturally coordinated with our speech and hand movements. By utilizing this tendency, we can create natural interactions with computing devices and environments. In this chapter, we first briefly discuss some basic properties of the gaze signal we can get from eye trackers, followed by a review of multimodal systems utilizing the gaze signal as one input modality. In multimodal gaze interaction, data from eye trackers is used as an active input mode where, for instance, gaze serves as an alternative, or complementary, pointing modality alongside other input modalities. Using gaze as an active or explicit input method is challenging for several reasons. One is that the eyes are primarily used for perceiving our environment, so knowing whether a person is selecting an item with gaze or just looking around is an issue. Researchers have tried to solve this by combining gaze with various input methods, such as manual pointing, speech, and touch. However, gaze information can also be used in interactive systems for purposes other than explicit pointing, since a user's gaze is a good indication of the user's attention. In passive gaze interaction, gaze is not used as the primary input method but as a supporting one. In these kinds of systems, gaze is mainly used for inferring and reasoning about the user's cognitive state or activities in a way that can support the interaction. Such multimodal systems often combine gaze with a multitude of input modalities.
In this chapter we focus on interactive systems, exploring the design space of gaze-informed multimodal interaction, spanning from gaze as an active input mode to a passive one, and from stationary usage scenarios (e.g., at a desk) to mobile ones. There are a number of studies aimed at describing, detecting, or modeling specific behaviors or cognitive states. We touch on some of these works since they can guide us in how to build gaze-informed multimodal interaction.

Abstract

Work breaks can play an important role in the mental and physical well-being of workers and contribute positively to productivity. In this paper we explore the use of activity-, physiological-, and indoor-location sensing to promote mobility during work-breaks. While the popularity of devices and applications to promote physical activity is growing, prior research highlights important constraints when designing for the workplace. With these constraints in mind, we developed BreakSense, a mobile application that uses a Bluetooth beacon infrastructure, a smartphone and a smartwatch to encourage mobility during breaks with a game-like design. We discuss constraints imposed by design for work and the workplace, and highlight challenges associated with the use of noisy sensors and methods to overcome them. We then describe a short deployment of BreakSense within our lab that examined bound vs. unbound augmented breaks and how they affect users’ sense of completion and readiness to work.

Abstract

Users often share their interest in products on social media. We propose to identify purchase stages from Twitter data following the AIDA model (Awareness, Interest, Desire, Action). In particular, we define the task of classifying the purchase stage of each tweet in a user's tweet sequence. We introduce RCRNN, a Ranking Convolutional Recurrent Neural Network, which computes tweet representations using convolution over word embeddings and models a tweet sequence with gated recurrent units. We also consider various methods to cope with the imbalanced label distribution in our data and show that a ranking layer outperforms class weights.
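The class-weight baseline referred to above is typically computed from inverse label frequencies; a hedged sketch of one common recipe (not necessarily the paper's exact scheme):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency, normalized so that a
    perfectly balanced dataset yields weight 1.0 for every class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# Rare stages such as Action receive larger weights than the majority class.
w = inverse_frequency_weights(["Awareness"] * 6 + ["Interest"] * 2 + ["Action"] * 2)
```

These weights scale each example's loss during training; a ranking layer instead sidesteps the imbalance by learning to order the stage scores directly.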

Abstract

We present Lift, a visible-light-enabled finger tracking and object localization technique that allows users to perform freestyle multi-touch gestures on any object's surface in an everyday environment. By projecting encoded visible patterns onto an object's surface (e.g., paper, display, or table) and localizing the user's fingers with light sensors, Lift offers users a richer interactive space than the device's existing interfaces. Additionally, everyday objects can be augmented by attaching sensor units onto their surfaces to accept multi-touch gesture input. We also present two applications as a proof of concept. Finally, results from our experiments indicate that Lift can localize ten fingers simultaneously with accuracies of 0.9 mm and 1.8 mm on the two axes, respectively, and an average refresh rate of 84 Hz with 16.7 ms delay over WiFi and 12 ms delay over serial, making gesture recognition on non-instrumented objects possible.

Abstract

This is a summary of our participation in the TRECVID 2016 video hyperlinking task (LNK). We submitted four runs in total. A baseline system combined established vector-space text indexing with cosine similarity. Our other runs explored the use of distributed word representations in combination with fine-grained inter-segment text similarity measures.
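A vector-space baseline of this kind reduces to comparing term-frequency vectors by cosine similarity; a minimal sketch (raw TF only, without the weighting details of the actual runs):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())  # Counter returns 0 for absent terms
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = Counter("climate change report".split())
segment = Counter("new climate report released".split())
```

Ranking candidate target segments by this score against the anchor's transcript text is the essence of the baseline run.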

Abstract

Videos recorded with current mobile devices are increasingly geotagged at fine granularity and used in various location based applications and services. However, raw sensor data collected is often noisy, resulting in subsequent inaccurate geospatial analysis. In this study, we focus on the challenging correction of compass readings and present an automatic approach to reduce these metadata errors. Given the small geo-distance between consecutive video frames, image-based localization does not work due to the high ambiguity in the depth reconstruction of the scene. As an alternative, we collect geographic context from OpenStreetMap and estimate the absolute viewing direction by comparing the image scene to world projections obtained with different external camera parameters. To design a comprehensive model, we further incorporate smooth approximation and feature-based rotation estimation when formulating the error terms. Experimental results show that our proposed pyramid-based method outperforms its competitors and reduces orientation errors by an average of 58.8%. Hence, for downstream applications, improved results can be obtained with these more accurate geo-metadata. To illustrate, we present the performance gain in landmark retrieval and tag suggestion by utilizing the accuracy-enhanced geo-metadata.

Abstract

Accurate map matching has been a fundamental but challenging problem that has drawn great research attention in recent years. It aims to reduce the uncertainty in a trajectory by matching the GPS points to the road network on a digital map. Most existing work has focused on estimating the likelihood of a candidate path based on the GPS observations, while neglecting to model the probability of a route choice from the perspective of drivers. Here we propose a novel feature-based map matching algorithm that estimates the cost of a candidate path based on both GPS observations and human factors. Taking human factors into consideration is especially important when dealing with low-sampling-rate data, where most movement details are lost. Additionally, we simultaneously analyze a subsequence of coherent GPS points by utilizing a new segment-based probabilistic map matching strategy, which is less susceptible to the noisiness of the positioning data. We have evaluated the proposed approach on a public large-scale GPS dataset, which consists of 100 trajectories distributed all over the world. The experimental results show that our method is robust to sparse data with large sampling intervals (e.g., 60 s to 300 s) and challenging track features (e.g., u-turns and loops). Compared with two state-of-the-art map matching algorithms, our method substantially reduces the route mismatch error by 6.4% to 32.3% and obtains the best map matching results in all the different combinations of sampling rates and challenging features.
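At the geometric core of any map matcher is snapping a GPS point onto a candidate road edge. The sketch below shows only that primitive; the paper's segment-based probabilistic matching with route-choice features builds on top of such geometry rather than replacing it:

```python
import math

def project_to_segment(p, a, b):
    """Snap GPS point p onto road segment a-b; return (snapped_point, distance)."""
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    # Clamp the projection parameter so the snapped point stays on the segment.
    t = 0.0 if seg_len2 == 0 else max(
        0.0, min(1.0, ((p[0] - ax) * dx + (p[1] - ay) * dy) / seg_len2))
    sx, sy = ax + t * dx, ay + t * dy
    return (sx, sy), math.hypot(p[0] - sx, p[1] - sy)

def snap(p, segments):
    """Greedy single-point matching: pick the segment minimizing snap distance."""
    best = min(range(len(segments)),
               key=lambda i: project_to_segment(p, *segments[i])[1])
    point, dist = project_to_segment(p, *segments[best])
    return point, dist, best
```

Greedy per-point snapping is exactly what breaks down on sparse, noisy data; scoring coherent subsequences of points against candidate paths is what makes the segment-based strategy more robust.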

Abstract

Improvements in sensors and wireless networks enable accurate, automated, instant determination and dissemination of a user's or object's position. Beyond the current ubiquitous networking infrastructure, the new enabler of location-based services (LBSs) is the enrichment of these systems with semantic information, such as time, location, individual capability, preference, and more. Such semantically enriched system modeling aims at developing applications with enhanced functionality and advanced reasoning capabilities. These systems can deliver more personalized services to users by applying domain knowledge with advanced reasoning mechanisms, and provide solutions to problems that were otherwise infeasible. This approach also takes users' preferences and place properties into consideration, which can be utilized to achieve a comprehensive range of personalized services, such as advertising, recommendations, or polling. This paper provides an overview of indoor localization technologies, popular models for extracting semantics from location data, approaches for associating semantic information with location data, and applications that may be enabled by location semantics. To make the presentation easy to understand, we use a museum scenario to explain the pros and cons of different technologies and models. More specifically, we first explore users' needs in a museum scenario. Based on these needs, we then discuss the advantages and disadvantages of using different localization technologies to meet them. From these discussions, we highlight gaps between real application requirements and existing technologies and point out promising localization research directions. By identifying gaps between various models and real application requirements, we draw a road map for future location semantics research.

Abstract

We propose robust pointing detection with a virtual shadow representation for interacting with a public display. Using a depth camera, the user's shadow is generated by a model with an angled virtual sun light, and the nearest point of the shadow is detected as the pointer. The shadow's position rises as the user walks closer, which conveys the notion of the correct distance for controlling the pointer and offers accessibility to the higher areas of the display.
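The rising-shadow behavior follows from projecting each body point along the light direction onto the display plane. The coordinate frame and default light direction below are our illustrative assumptions, not the paper's calibration:

```python
def shadow_on_display(p, light_dir=(0.0, -1.0, -1.0)):
    """Project body point p = (x, y, z) onto the display plane z = 0 along an
    angled virtual sun light; z is the user's distance from the display.

    The default light direction (from above and behind the user) is an
    assumption for illustration; dz must be nonzero.
    """
    dx, dy, dz = light_dir
    t = -p[2] / dz  # distance along the light ray until the plane z = 0
    return (p[0] + t * dx, p[1] + t * dy)

# A hand held at 1.5 m height casts a higher shadow as the user approaches:
far = shadow_on_display((0.0, 1.5, 2.0))   # user 2 m from the display
near = shadow_on_display((0.0, 1.5, 1.0))  # user 1 m from the display
```

With a light angled downward toward the display, the shadow drops by one unit of height per unit of distance, so stepping closer raises the pointer, matching the distance cue described above.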

Abstract

The proliferation of workplace multimedia collaboration applications has meant on one hand more opportunities for group work but on the other more data locked away in proprietary interfaces. We are developing new tools to capture and access multimedia content from any source. In this demo, we focus primarily on new methods that allow users to rapidly reconstitute, enhance, and share document-based information.

Abstract

Adapting to personal needs and supporting correct posture are important in physiotherapy training. In this demo, we show a dual screen application (handheld and TV) that allows patients to view hypervideo training programs. Designed to guide their daily exercises, these programs can be adapted to daily needs. The dual screen concept offers the positional flexibility missing in single screen solutions.

Abstract

Dual screen concepts for hypervideo-based physiotherapy training are important in healthcare settings, but existing applications often cannot be adapted to personal needs and do not support correct posture. In this paper, we describe the design and implementation of a dual screen application (handheld and TV) that allows patients to view hypervideos designed to help them correctly perform their exercises. This approach lets patients adapt their training to their daily needs and their overall training progress. We evaluated this prototypical implementation in a user test with post-operative care prostate cancer patients. From our results, we derived design recommendations for dual screen physical training hypervideo applications.

Abstract

Several systems exist today for creating hypervideos. However, creating the video scenes that are assembled into a hypervideo is a tedious and time-consuming job. At the same time, huge video databases like YouTube already provide rich sources of video material. Yet downloading and re-purposing videos from such platforms is not legally permitted, which calls for a solution that links whole videos or parts of videos and plays them from the platform in an embedded player. This work presents the SIVA Web Producer, a Chrome extension for creating hypervideos consisting of scenes from YouTube videos. After creating a project, the extension allows authors to import YouTube videos, or parts thereof, as video clips. These can then be linked into a scene graph. A preview is provided, and finalized videos can be published on the SIVA Web Portal.

Abstract

In this paper we describe DocuGram, a novel tool to capture and share documents from any application. As users scroll through the pages of a document inside its native application (Word, Google Docs, a web browser), the system captures and analyzes the video frames in real time and reconstitutes the original document pages into an easy-to-view HTML-based representation. In addition to regenerating the document pages, a DocuGram also includes the interactions users had over them, e.g., mouse motions and voice comments. A DocuGram acts as a modern copy machine, allowing users to copy and share any document from any application.

Abstract

Most teleconferencing tools treat users in distributed meetings monolithically: all participants are meant to be connected to one another in more or less the same manner. In reality, though, people connect to meetings in all manner of different contexts: sometimes sitting in front of a laptop or tablet giving their full attention, but at other times mobile and involved in other tasks, or as a liminal participant in a larger group meeting. In this paper we present the design and evaluation of two applications, Penny and MeetingMate, designed to help users in non-standard contexts participate in meetings.

Abstract

The abundance of data posted to Twitter enables companies to extract useful information, such as Twitter users who are dissatisfied with a product. We endeavor to determine which Twitter users are potential customers for companies and would be receptive to product recommendations, based on the language they use in tweets after mentioning a product of interest. Using Twitter's API, we collected tweets from users who tweeted about mobile devices or cameras. An expert annotator determined whether each tweet was relevant to customer purchase behavior and whether a user, based on their tweets, eventually bought the product. For the relevance task, among four models, a feed-forward neural network yielded the best cross-validation accuracy of over 80% per product. For customer purchase prediction of a product, we observed improved performance with the use of sequential input of tweets to recurrent models, with an LSTM model being best; we also observed that the use of relevance predictions in our model was more effective with less powerful RNNs and on more difficult tasks.

Abstract

Two related challenges with current teleoperated robotic systems are a lack of peripheral vision and awareness, and the difficulty or tedium of navigating through remote spaces. We address these challenges by providing an interface with a focus-plus-context (F+C) view of the robot's location, in which the user can navigate simply by looking where they want to go and clicking or drawing a path on the view to indicate the desired trajectory or destination. The F+C view provides an undistorted, perspectively correct central region surrounded by a wide-field-of-view peripheral portion, and avoids the need for separate views. The navigation method is direct and intuitive in comparison to keyboard- or joystick-based navigation, which requires the user to be in a control loop as the robot moves. Both the F+C views and the direct click navigation were evaluated in a preliminary user study.

Abstract

Mobile Telepresence Robots (MTRs) are an emerging technology that extends the functionality of telepresence systems by adding mobility. Today's MTRs, however, rely on stationary imaging systems such as a single narrow-view camera for vision, which can lead to reduced operator performance due to view-related deficiencies in situational awareness. We therefore developed an improved imaging and viewing platform that allows immersive telepresence using a Head Mounted Device (HMD) with head-tracked mono and stereoscopic video. Using a remote collaboration task to ground our research, we examine the effectiveness of head-tracked HMD systems in comparison to a baseline monitor-based system. We performed a user study where participants were divided into three groups: a fixed-camera monitor-based baseline condition (without HMD), an HMD with a head-tracked 2D camera, and an HMD with a head-tracked stereo camera. Results showed that the use of an HMD reduces task error rates and improves perceived collaborative success and quality of view compared to the baseline condition. No major difference was found, however, between the stereo and 2D camera conditions for participants wearing an HMD.

Abstract

Social media offers potential opportunities for businesses to extract business intelligence. This paper presents Tweetviz, an interactive tool that helps businesses extract actionable information from a large set of noisy Twitter messages. Tweetviz visualizes the tweet sentiment of business locations, identifies other business venues that Twitter users visit, and estimates simple demographics of the Twitter users frequenting a business. A user study evaluating the system indicates that Tweetviz can provide an overview of a business's issues and sentiment as well as information that aids users in creating customer profiles.

Abstract

Web videos are becoming more and more popular. Current web technologies make it simpler than ever both to stream videos and to create complex constructs of interlinked videos with additional information (video, audio, images, and text), so-called hypervideos. When viewers interact with hypervideos by clicking on links, new content has to be loaded. This may lead to excessive waiting times that interrupt the presentation, especially when videos are loaded into the hypervideo player. In this work, we propose hypervideo pre-fetching strategies, which can be implemented in players to minimize waiting times. We examine the possibilities offered by the HTML5 video tag as well as the Media Source Extensions (MSE). Both HTML5 and MSE allow element pre-fetching (of video and additional information) up to a certain granularity. Depending on the strategy and technology used, beginning-scene waiting times and the overall download volume may increase. The strategies presented in this paper reduce the number of delays during playback significantly, from an average of 8.1 breaks to less than one break, and cut the overall waiting time of a video by one third, to less than 18 seconds, improving the hypervideo watching experience.

Abstract

Mobile Audio Commander (MAC) is a mobile phone-based multimedia sensing system that facilitates the introduction of extra sensors to existing mobile robots for advanced capabilities. In this paper, we use MAC to introduce an accurate indoor positioning sensor to a robot to facilitate its indoor navigation. More specifically, we use a projector to send out a position ID through a light signal, use a light sensor and the audio channel on a mobile phone to decode the position ID, and send navigation commands to a target robot through audio output. With this setup, our system can simplify a user's robot navigation. Users can define a robot navigation path on a phone, and our system will compare the navigation path with its accurate location sensor inputs and generate an analog line-following signal, a collision avoidance signal, and an analog angular signal to adjust the robot's straight movements and turns. This paper describes two examples of using MAC and a positioning system to enable complicated robot navigation with proper user interface design, external circuit design, and real sensor installations on existing robots.

Abstract

Captions are a central component in image posts that communicate the background story behind photos. Captions can enhance engagement with audiences and are therefore critical to campaigns or advertisements. Previous studies in image captioning either rely solely on image content or summarize multiple web documents related to the image's location; both neglect users' activities. We propose business-aware latent topics as a new contextual cue for image captioning that represents user activities. The idea is to learn the typical activities of people who posted images from business venues with similar categories (e.g., fast food restaurants) to provide appropriate context for similar topics (e.g., burger) in new posts. User activities are modeled via a latent topic representation. In turn, the image captioning model can generate sentences that better reflect user activities at business venues. In our experiments, the business-aware latent topics are more effective than the existing baselines for adapting captions to images captured at various businesses. Moreover, they complement other contextual cues (image, time) in a multi-modal framework.

Abstract

We previously created the HyperMeeting system to support a chain of geographically and temporally distributed meetings in the form of a hypervideo. This paper focuses on playback plans that guide users through the recorded meeting content by automatically following available hyperlinks. Our system generates playback plans based on users' interests or prior meeting attendance and presents a dialog that lets users select the most appropriate plan. Prior experience with playback plans revealed users' confusion with automatic link following within a sequence of meetings. To address this issue, we designed three timeline visualizations of playback plans. A user study comparing the timeline designs indicated that different visualizations are preferred for different tasks, making switching among them important. The study also provided insights that will guide research of personalized hypervideo, both inside and outside a meeting context.

Abstract

It is difficult to adjust the content of traditional slide presentations to the knowledge level, interest and role of individuals. This might force presenters to include content that is irrelevant for part of the audience, which negatively affects the knowledge transfer of the presentation. In this work, we present a prototype that is able to eliminate non-pertinent information from slides by presenting annotations for individual attendees on optical head-mounted displays. We first derive guidelines for creating optimal annotations by evaluating several types of annotations alongside different types of slides. Then we evaluate the knowledge acquisition of presentation attendees using the prototype versus traditional presentations. Our results show that annotations with a limited amount of information, such as text up to 5 words, can significantly increase the amount of knowledge gained from attending a group presentation. Additionally, presentations where part of the information is moved to annotations are judged more positively on attributes such as clarity and enjoyment.

Abstract

WSICC has established itself as a truly interactive workshop at EuroITV'13, TVX'14, and TVX'15 with three successful editions. The fourth edition of the WSICC workshop aims to bring together researchers and practitioners working on novel approaches for interactive multimedia content consumption. New technologies, devices, media formats, and consumption paradigms are emerging that allow for new types of interactivity. Examples include multi-panoramic video and object-based audio, increasingly available in live scenarios with content feeds from a multitude of sources.
All these recent advances have an impact on different aspects related to interactive content consumption, which the workshop categorizes into Enabling Technologies, Content, User Experience, and User Interaction. The resources from past editions of the workshop are available on the http://wsicc.net website.

Abstract

Hypervideo usage scenarios like physiotherapy trainings or instructions for manual tasks make it hard for users to operate an input device like a mouse or touch screen on a hand-held device while they are performing an exercise or using both hands for a manual task. In this work, we try to overcome this issue by providing an alternative input method for hypervideo navigation using speech commands. In a user test, we evaluated two different speech recognition libraries, annyang (in combination with the Web Speech API) and PocketSphinx.js (in combination with the Web Audio API), for their usability in controlling hypervideo players. Test users spoke 18 words, either in German or English, which were recorded and then processed by both libraries. We found that annyang shows better recognition results. However, depending on other factors of influence, like the occurrence of background noise (reliability), the availability of an internet connection, or the browser used, PocketSphinx.js may be a better fit.

Abstract

Hypervideo-based physiotherapy trainings offer an opportunity to support patients in continuing their training after being released from a rehabilitation clinic. Many exercises require the patient to sit on the floor or a gymnastic ball, lie on a gymnastics mat, or adopt other postures. Using a laptop or tablet with a stand to show the exercises is more helpful than, for example, just having some drawings on a leaflet. However, it may lead to incorrect execution of the exercises while maintaining eye contact with the screen, or require the user to get up and select the next exercise if the device is positioned for a better view. A dual-screen application, where contents are shown on a TV screen and the flow of the video can be controlled from a mobile second device, allows patients to keep their correct posture and at the same time view and select contents. In this paper we present initial studies of user interface designs for such apps. Initial paper prototypes are discussed and refined in two focus groups. The results are then presented to a broader range of users in a survey. Three prototypes for the mobile app and one prototype for the TV are identified for future user tests.

Abstract

The creation of hypervideos usually requires a lot of planning and is time-consuming with respect to media content creation. Once structure and media are put together to author a hypervideo, however, it may require only minor changes to make the hypervideo available in other languages or for another user group (like beginners versus experts). To make the translation of media and all navigation elements of a hypervideo efficient and manageable, the authoring tool needs a GUI that provides a good overview of the elements that can be translated and of missing translations. In this work, we propose screen concepts that help authors provide different versions (for example, language and/or experience level) of a hypervideo. We analyzed different variants of GUI elements and evaluated them in a survey. From the results, we draw guidelines that can help with the creation of similar systems in the future.

Abstract

The confluence of technologies such as telepresence, immersive imaging, model-based virtual mirror worlds, mobile live streaming, etc. gives rise to a capability for people anywhere to view and connect with present or past events nearly anywhere on earth. This capability properly belongs to a public commons, available as a birthright of all humans, and can be seen as part of an evolutionary transition supporting a global collective mind. We describe examples and elements of this capability, and suggest how they can be better integrated through a tool we call TeleViewer and a framework called WorldViews, which supports easy sharing of views as well as connecting of providers and consumers of views all around the world.

Abstract

Most current mobile and wearable devices are equipped with inertial measurement units (IMU) that allow the detection of motion gestures, which can be used for interactive applications. A difficult problem to solve, however, is how to separate ambient motion from an actual motion gesture input. In this work, we explore the use of motion gesture data labeled with gesture execution phases for training supervised learning classifiers for gesture segmentation. We believe that using gesture execution phase data can significantly improve the accuracy of gesture segmentation algorithms. We define gesture execution phases as the start, middle, and end of each gesture. Since labeling motion gesture data with gesture execution phase information is work intensive, we used crowd workers to perform the labeling. Using this labeled data set, we trained SVM-based classifiers to segment motion gestures from ambient movement of the device. We describe initial results that indicate that gesture execution phase can be accurately recognized by SVM classifiers. Our main results show that training gesture segmentation classifiers with phase-labeled data substantially increases the accuracy of gesture segmentation: we achieved a gesture segmentation accuracy of 0.89 for simulated online segmentation using a sliding window approach.
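The sliding-window segmentation step can be sketched as follows. This is a minimal illustration only: a simple energy threshold stands in for the trained SVM classifier described above, and the IMU magnitude stream is invented for the example.

```python
# Sketch of sliding-window motion-gesture segmentation.
# A fixed energy threshold is a hypothetical stand-in for the
# paper's trained SVM; the sample stream is synthetic.

def segment_gestures(samples, window=8, threshold=1.0):
    """Slide a window over IMU magnitudes; flag windows whose mean
    energy exceeds the threshold as belonging to a gesture."""
    flags = []
    for start in range(0, len(samples) - window + 1):
        win = samples[start:start + window]
        energy = sum(v * v for v in win) / window
        flags.append(energy > threshold)
    return flags

# Ambient motion (low energy) followed by a gesture burst (high energy),
# then ambient motion again.
stream = [0.1] * 16 + [2.0] * 16 + [0.1] * 16
flags = segment_gestures(stream)
```

With phase-labeled training data, the boolean decision inside the loop would instead come from the classifier's prediction for the window's feature vector.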

Abstract

As workstyles change to include more dynamic contexts and denser spaces, connected objects and spaces in the workplace can play a bigger role in helping people get their work done, while also helping them navigate the continually blending boundaries between work and home lives. In this talk, we argue that the workplace is particularly well-suited for realizing the "connected life" by including both company-initiated sensing in the workplace and personal tracking devices introduced by individual workers. We describe some examples of ubiquitous sensing in the workplace, and future opportunities as well as open technical and ethical issues for designing for the connected [work]life.

Abstract

Search log analysis has become a common practice to gain insights into user search behaviour; it helps build an understanding of user needs and preferences, as well as of how well a system supports such needs. Currently, log analysis typically focuses on low-level user actions, i.e. logged events such as issued queries and clicked results, and often only a selection of such events is logged and analysed. However, the types of logged events may differ widely from interface to interface, making comparison between systems difficult. Further, analysing a selection of events may lead to conclusions out of context: e.g. the statistics of observed query reformulations may be influenced by the existence of a relevance feedback component. Alternatively, in lab studies user activities can be analysed at a higher level, such as search tactics and strategies, abstracted away from detailed interface implementation. However, the required manual coding that maps logged events to higher-level interpretations prevents this type of analysis from going large scale. In this paper, we propose a new method for analysing search logs by (semi-)automatically identifying user search tactics from logged events, allowing large-scale analysis that is comparable across search systems. We validate the efficiency and effectiveness of the proposed tactic identification method using logs of two reference search systems of different natures: a product search system and a video search system. With the identified tactics, we perform a series of novel log analyses in terms of the entropy rate of user search tactic sequences, demonstrating how this type of analysis allows comparisons of user search behaviours across systems of different nature and design. This analysis provides insights not achievable with traditional log analysis.

Abstract

We propose a method for extractive summarization of audiovisual recordings focusing on topic-level segments. We first build a content similarity graph between all segments of all documents in the collection, using word vectors from the transcripts, and then select the most central segments for the summaries. We evaluate the method quantitatively on the AMI Meeting Corpus using gold standard reference summaries and the Rouge metric, and qualitatively on lecture recordings using a novel two-tiered approach with human judges. The results show that our method compares favorably with others in terms of Rouge, and outperforms the baselines for human scores, thus also validating our evaluation protocol.
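The selection step can be illustrated with a small sketch: build a pairwise similarity graph over segments and pick the most central one. Bag-of-words cosine similarity stands in for the transcript word vectors used in the paper, degree centrality stands in for the centrality measure, and the segment texts are hypothetical.

```python
# Sketch of centrality-based extractive summarization: the segments
# most similar to all other segments are selected for the summary.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in a if w in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def central_segments(segments, k=1):
    """Rank segments by degree centrality in the similarity graph."""
    bows = [Counter(s.lower().split()) for s in segments]
    scores = [sum(cosine(bows[i], bows[j])
                  for j in range(len(bows)) if j != i)
              for i in range(len(bows))]
    ranked = sorted(range(len(segments)), key=lambda i: scores[i],
                    reverse=True)
    return [segments[i] for i in ranked[:k]]

segments = [
    "the budget covers design and testing",
    "we discussed the budget and the design milestones",
    "lunch options near the office",
]
summary = central_segments(segments, k=1)
```

The middle segment overlaps with both others, so it scores highest and is chosen as the one-segment summary.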

Abstract

Many people post about their daily life on social media. These posts may include information about the purchase activity of people, and insights useful to companies can be derived from them: e.g. profile information of a user who mentioned something about their product. As a further advanced analysis, we consider extracting users who are likely to buy a product from the set of users who mentioned that the product is attractive.
In this paper, we report our methodology for building a corpus for Twitter user purchase behavior prediction. First, we collected Twitter users who posted a want phrase plus a product name (e.g. "want a Xperia") as candidate want users, and also collected candidate bought users in the same way. Then, we asked an annotator to judge whether a candidate user actually bought the product. We also annotated whether tweets randomly sampled from want/bought user timelines are relevant to purchase. In this annotation, 58% of want user tweets and 35% of bought user tweets were annotated as relevant. Our data indicate that information embedded in timeline tweets can be used to predict purchase behavior for tweeted products.

Abstract

The negative effect of lapses during a behavior-change program has been shown to increase the risk of repeated lapses and, ultimately, program abandonment. In this paper, we examine the potential of system-driven lapse management -- supporting users through lapses as part of a behavior-change tool. We first review lessons from domains such as dieting and addiction research and discuss the design space of lapse management. We then explore the value of one approach to lapse management -- the use of "cheat points" as a way to encourage sustained participation. In an online study, we first examine interpretations of progress that was reached through using cheat points. We then present findings from a deployment of lapse management in a two-week field study with 30 participants. Our results demonstrate the potential of this approach to motivate and change users' behavior. We discuss important open questions for the design of future technology-mediated behavior change programs.

Abstract

Taking breaks from work is an essential and universal practice. In this paper, we extend current research on productivity in the workplace to consider the break habits of knowledge workers and explore opportunities of break logging for personal informatics. We report on three studies. Through a survey of 147 U.S.-based knowledge workers, we investigate what activities respondents consider to be breaks from work, and offer an understanding of the benefit workers desire when they take breaks. We then present results from a two-week in-situ diary study with 28 participants in the U.S. who logged 800 breaks, offering insights into the effect of work breaks on productivity. We finally explore the space of information visualization of work breaks and productivity in a third study. We conclude with a discussion of implications for break recommendation systems, availability and interruptibility research, and the quantified workplace.

Abstract

We describe a novel thermal haptic output device, ThermoTouch, that provides a grid of thermal pixels. Unlike previous devices which mainly use Peltier elements for thermal output, ThermoTouch uses liquid cooling and electro-resistive heating to output thermal feedback at arbitrary grid locations. We describe the design of the prototype, highlight advantages and disadvantages of the technique and briefly discuss future improvements and research applications.

Abstract

Silicon Valley is home to many of the world's largest technology corporations, as well as thousands of small startups. Despite the development of other high-tech economic centers throughout the US and around the world, Silicon Valley continues to be a leading hub for high-tech innovation and development, in part because most of its companies and universities are within 20 miles of each other. Given the high concentration of multimedia researchers in Silicon Valley, and the high demand for information exchange, I was able to work with a team of researchers from various companies and organizations to start the Bay Area Multimedia Forum (BAMMF) series back in November 2013.

Abstract

With modern technologies, it is possible to create annotated interactive non-linear videos (a form of hypervideo) for the Web. These videos have a non-linear structure of linked scenes to which additional information (other media like images, text, audio, or additional videos) can be added. A variety of user interactions - like in- and between-scene navigation or zooming into additional information - are possible in players for this type of video. As with linear video, the quality of experience (QoE) of annotated hypervideos is tied to the temporal consistency of the video stream at the client end - its flow. Despite its interactive complexity, users expect this type of video to flow as seamlessly as simple linear video. However, the added hypermedia elements bog playback engines down.
Download and cache management systems address the flow issue, but their effectiveness depends on numerous questions regarding user requirements, computational strategy, and evaluation metrics. In this work, we a) define QoE metrics, b) examine structural and behavioral patterns of interactive annotated non-linear video, c) propose download and cache management algorithms and strategies, d) describe the implementation of an evaluative simulation framework, and e) present the algorithm test results.

Abstract

We present a method for profiling businesses at specific locations that is based on mining information from social media. The method matches geo-tagged tweets from Twitter against venues from Foursquare to identify the specific business mentioned in a tweet. By linking geo-coordinates to places, the tweets associated with a business, such as a store, can then be used to profile that business. From these venue-located tweets, we create sentiment profiles for each of the stores in a chain. We present the results as heat maps showing how sentiment differs across stores in the same chain and how some chains have more positive sentiment than other chains. We also estimate social group size from photos and create profiles of social group size for businesses. Sample heat maps of these results illustrate how the average social group size can vary across businesses.

Abstract

We describe methods for analyzing and visualizing document metadata to provide insights about collaborations over time. We investigate the use of Latent Dirichlet Allocation (LDA) based topic modeling to compute areas of interest on which people collaborate. The topics are represented in a node-link force directed graph by persistent fixed nodes laid out with multidimensional scaling (MDS), and the people by transient movable nodes. The topics are also analyzed to detect bursts to highlight "hot" topics during a time interval. As the user manipulates a time interval slider, the people nodes and links are dynamically updated. We evaluate the results of LDA topic modeling for the visualization by comparing topic keywords against the submitted keywords from the InfoVis 2004 Contest, and we found that the additional terms provided by LDA-based keyword sets result in improved similarity between a topic keyword set and the documents in a corpus. We extended the InfoVis dataset from 8 to 20 years and collected publication metadata from our lab over a period of 21 years, and created interactive visualizations for exploring these larger datasets.

Abstract

The use of videoconferencing in the workplace has been steadily growing. While multitasking during video conferencing is often necessary, it is also viewed as impolite and sometimes unacceptable. One potential contributor to negative attitudes towards such multitasking is the disrupted sense of eye contact that occurs when an individual shifts their gaze away to another screen, for example, in a dual-monitor setup, common in office settings. We present a system to improve a sense of eye contact over videoconferencing in dual-monitor setups. Our system uses computer vision and desktop activity detection to dynamically choose the camera with the best view of a user's face. We describe two alternative implementations of our system (RGB-only, and a combination of RGB and RGB-D cameras). We then describe results from an online experiment that shows the potential of our approach to significantly improve perceptions of a person's politeness and engagement in the meeting.

Abstract

This paper presents a detailed examination of factors that affect perceptions of, and attitudes towards multitasking in dyadic video conferencing. We first report findings from interviews with 15 professional users of videoconferencing. We then report results from a controlled online experiment with 397 participants based in the United States. Our results show that the technology used for multitasking has a significant effect on others' assumptions of what secondary activity the multitasker is likely engaged in, and that this assumed activity in turn affects evaluations of politeness and appropriateness. We also describe how different layouts of the video conferencing UI may lead to better or worse ratings of engagement and in turn ratings of polite or impolite behavior. We then propose a model that captures our results and use the model to discuss implications for behavior and for the design of video communication tools.

Abstract

We present MixMeetWear, a smartwatch application that allows users to maintain awareness of the audio and visual content of a meeting while completing other tasks. Users of the system can listen to the audio of a meeting and also view, zoom, and pan webcam and shared content keyframes of other meeting participants' live streams in real time. Users can also provide input to the meeting via speech-to-text or predefined responses. A study showed that the system is useful for peripheral awareness of some meetings.

Abstract

Remote meetings are messy. There are an ever-increasing number of support tools available, and, as past work has shown, people will tend to select a subset of those tools to satisfy their own institutional, social, and personal needs. While video tools make it relatively easy to have conversations at a distance, they are less adapted to sharing and archiving multimedia content. In this paper we take a deeper look at how sharing multimedia content occurs before, during, and after distributed meetings. Our findings shed light on the decisions and rationales people use to select from the vast set of tools available to them to prepare for, conduct, and reconcile the results of a remote meeting.

Abstract

In recent years, there has been an explosion of services that leverage location to provide users with novel and engaging experiences. However, many applications fail to realize their full potential because of limitations in current location technologies. Current frameworks work well outdoors but fare poorly indoors. In this paper we present LoCo, a new framework that can provide highly accurate room-level indoor location. LoCo does not require users to carry specialized location hardware: it uses radios that are present in most contemporary devices and, combined with a boosting classification technique, provides a significant runtime performance improvement. We provide experiments that show the combined radio technique can achieve accuracy that improves on current state-of-the-art Wi-Fi-only techniques. LoCo is designed to be easily deployed within an environment and readily leveraged by application developers. We believe LoCo's high accuracy and accessibility can drive a new wave of location-driven applications and services.

Abstract

Image localization is important for marketing and recommendation of local businesses; however, the level of granularity is still a critical issue. Given a consumer photo and its rough GPS information, we are interested in extracting the fine-grained location information (i.e. business venues) of the image. To this end, we propose a novel framework for business venue recognition. The framework mainly contains three parts. First, business-aware visual concept discovery: we mine a set of concepts that are useful for business venue recognition based on three guidelines: business awareness, visual detectability, and discriminative power. Second, business-aware concept detection by convolutional neural networks (BA-CNN): we propose a new network architecture that can extract semantic concept features from an input image. Third, multimodal business venue recognition: we extend visually detected concepts to multimodal feature representations that allow a test image to be associated with business reviews and images from social media for business venue recognition. The experimental results show that the visual concepts detected by BA-CNN can achieve up to 22.5% relative improvement for business venue recognition compared to state-of-the-art convolutional neural network features. Experiments also show that by leveraging multimodal information from social media we can further boost the performance, especially when the database images belonging to each business venue are scarce.

Abstract

Hypervideos pose navigation challenges due to their underlying graph structure. Especially when used on tablets or by older people, a lack of clarity may lead to confusion and rejection of this type of medium. To avoid confusion, the hypervideo can be extended with a well-known table of contents, which, because of the underlying graph structure, needs to be created separately by the authors. In this work, we present an extended presentation of a table of contents for hypervideos on mobile devices. The design was tested in a real-world medical training scenario with people older than 45, the main target group of these applications. This user group is a particular challenge since its members sometimes have limited experience with mobile devices and physical impairments that grow with age. Our user interface was designed in three steps. The findings of an expert group and a survey were used to create two different prototypical versions of the display, which were then tested against each other in a user test. This test revealed that a divided view is desired: the table of contents in an easy-to-touch version should be on the left side, and previews of scenes should be on the right side of the view. These findings were implemented in the existing SIVA HTML5 open source player and tested with a second group of users. This test led to only minor changes in the GUI.

Abstract

Indoor localization is challenging in terms of both accuracy and possible usage scenarios. In this paper, we introduce the design and implementation of a toy car localization and navigation system, which demonstrates that a projected-light-based localization technique allows multiple devices to know and exchange fine-grained location information in an indoor environment. The projected light consists of a sequence of gray code images that assigns each pixel in the projection area a unique gray code so as to distinguish its coordinates. The light sensors installed on the toy car and the potential “passenger” receive the light stream from the projector, based on which their locations are computed. The toy car then utilizes the A* algorithm to plan a route based on its own location and orientation, the target’s location, and the map of available “roads”. The fast speed of localization enables the toy car to adjust its orientation while “driving” and keep itself on the “roads”. The toy car system demonstrates that the localization technique can power other applications that require fine-grained location information for multiple objects simultaneously.
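The per-pixel coding can be illustrated with the standard binary-reflected gray code, in which adjacent values differ in exactly one bit. The helpers below are illustrative only, not the system's actual implementation.

```python
# Sketch of binary-reflected gray code position encoding/decoding
# (illustrative helpers, not the system's actual implementation).

def to_gray(n):
    """Gray code of n: adjacent integers differ in exactly one bit."""
    return n ^ (n >> 1)

def from_gray(g):
    """Invert the gray code by successively folding the bits down."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Each projector pixel observes one gray code per axis over the image
# sequence; a light sensor decodes the observed bit sequence back into
# its (x, y) coordinate.
x, y = 320, 240
code = (to_gray(x), to_gray(y))
decoded = (from_gray(code[0]), from_gray(code[1]))
```

The one-bit-per-step property is what makes the scheme robust for a moving sensor: a small timing error corrupts at most one bit plane.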

Abstract

In this paper, we analyze the association between a social media user's photo content and their interests. Visual content of photos is analyzed using state-of-the-art deep learning based automatic concept recognition. An aggregate visual concept signature is thereby computed for each user. User tags manually applied to their photos are also used to construct a tf-idf-based signature per user. We also obtain the social groups that users join to represent their social interests. In an effort to compare the visual-based versus tag-based user profiles with social interests, we compare the corresponding similarity matrices with a reference similarity matrix based on users' group memberships. A random baseline is also included that groups users by random sampling while preserving the actual group sizes. A difference metric is proposed, and it is shown that the combination of visual and text features better approximates the group-based similarity matrix than either modality individually. We also validate the visual analysis against the reference inter-user similarity using the Spearman rank correlation coefficient. Finally, we cluster users by their visual signatures and rank the clusters using a cluster uniqueness criterion.
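The matrix comparison can be sketched as follows. All users, signatures, and group memberships here are invented, and mean absolute difference stands in for the paper's difference metric.

```python
# Sketch of comparing modality-based user similarities against a
# group-membership reference (all data is hypothetical).
import math

def cosine(a, b):
    """Cosine similarity between two sparse signatures (dicts)."""
    num = sum(a[k] * b[k] for k in a if k in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def jaccard(a, b):
    """Overlap of two users' group-membership sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def sim_matrix(profiles, sim):
    """Upper-triangle pairwise similarity over all users."""
    users = sorted(profiles)
    return {(u, v): sim(profiles[u], profiles[v])
            for u in users for v in users if u < v}

def mean_abs_diff(m1, m2):
    """Stand-in difference metric between two similarity matrices."""
    return sum(abs(m1[p] - m2[p]) for p in m1) / len(m1)

visual = {"u1": {"dog": 3, "park": 1}, "u2": {"dog": 2, "park": 2},
          "u3": {"skyline": 4}}
tags = {"u1": {"puppy": 1}, "u2": {"puppy": 2, "city": 1},
        "u3": {"city": 3}}
groups = {"u1": {"dogs"}, "u2": {"dogs", "urban"}, "u3": {"urban"}}

ref = sim_matrix(groups, jaccard)          # reference matrix
vis = sim_matrix(visual, cosine)           # visual-signature matrix
tag = sim_matrix(tags, cosine)             # tag-signature matrix
combined = {p: (vis[p] + tag[p]) / 2 for p in vis}
```

Each modality matrix (and their combination) can then be scored against `ref` with the difference metric, which is the comparison the abstract describes.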

Abstract

Knowing the geo-located venue of a tweet can facilitate better understanding of a user's geographic context, allowing apps to more precisely present information, recommend services, and target advertisements. However, due to privacy concerns, few users choose to enable geotagging of their tweets, resulting in a small percentage of tweets being geotagged; furthermore, even if the geo-coordinates are available, the closest venue to the geo-location may be incorrect.
In this paper, we present a method for providing a ranked list of geo-located venues for a non-geotagged tweet, which simultaneously indicates the venue name and the geo-location at a very fine-grained granularity. In our proposed method for Venue Inference for Tweets (VIT), we construct a heterogeneous social network in order to analyze the embedded social relations, and leverage available but limited geographic data to estimate the geo-located venue of tweets. A single classifier is trained to predict the probability of a tweet and a geo-located venue being linked, rather than training a separate model for each venue. We examine the performance of four types of social relation features and three types of geographic features embedded in a social network when predicting whether a tweet and a venue are linked, achieving a best accuracy of over 88%. We use the classifier probability estimates to rank the predicted geo-located venues of a non-geotagged tweet from over 19k possibilities, and observed an average top-5 accuracy of 29%.
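The ranking and top-k evaluation steps can be sketched as follows; the venue names and classifier probabilities below are hypothetical stand-ins for the link classifier's output.

```python
# Sketch of ranking venues by link probability and scoring top-k
# accuracy (venues and probabilities are invented for illustration).

def rank_venues(scores):
    """scores: venue -> link probability from the classifier.
    Returns venues ordered from most to least probable."""
    return sorted(scores, key=scores.get, reverse=True)

def top_k_accuracy(predictions, truths, k=5):
    """Fraction of tweets whose true venue appears in the top k."""
    hits = sum(1 for ranked, true in zip(predictions, truths)
               if true in ranked[:k])
    return hits / len(truths)

# Two tweets, each with classifier probabilities and a true venue.
tweets = [
    ({"cafe_a": 0.91, "gym_b": 0.40, "mall_c": 0.12}, "cafe_a"),
    ({"cafe_a": 0.30, "gym_b": 0.55, "mall_c": 0.75}, "gym_b"),
]
ranked = [rank_venues(scores) for scores, _ in tweets]
acc = top_k_accuracy(ranked, [true for _, true in tweets], k=1)
```

Here the first tweet's true venue is ranked first and the second's is not, giving a top-1 accuracy of 0.5; with thousands of candidate venues, the same computation yields the top-5 figure reported above.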

Abstract

Establishing common ground is one of the key problems for any form of communication. The problem is particularly pronounced in remote meetings, in which participants can easily lose track of the details of dialogue for any number of reasons. In this demo we present a web-based tool, MixMeet, that allows teleconferencing participants to search the contents of live meetings so they can rapidly retrieve previously shared content to get on the same page, correct a misunderstanding, or discuss a new idea.

Abstract

New technology comes about in a number of different ways. It may come from advances in scientific research, through new combinations of existing technology, or simply from imagining what might be possible in the future. This video describes the evolution of Tabletop Telepresence, a system for remote collaboration through desktop videoconferencing combined with a digital desk. Tabletop Telepresence provides a means to share paper documents between remote desktops, interact with documents and request services (such as translation), and communicate with a remote person through a teleconference. It was made possible by combining advances in camera/projector technology that enable a fully functional digital desk, embodied telepresence in video conferencing, and concept art that imagines future workstyles.

Abstract

While synchronous meetings are an important part of collaboration, it is not always possible for all stakeholders to meet at the same time. We created the concept of hypermeetings to support meetings with asynchronous attendance. Hypermeetings consist of a chain of video-recorded meetings with hyperlinks for navigating through the video content. HyperMeeting supports the synchronized viewing of prior meetings during a videoconference. Natural viewing behavior such as pausing generates hyperlinks between the previously recorded meetings and the current video recording. During playback, automatic link-following guided by playback plans presents the relevant content to users. Playback plans take into account the user's meeting attendance and viewing history and match them with features such as speaker segmentation. A user study showed that participants found hyperlinks useful but did not always understand where they would take them. The study results provide a good basis for future system improvements.

Abstract

A localization system is a coordinate system for describing, organizing, and controlling the world. Without a coordinate system, we cannot specify the world in mathematical forms; we cannot regulate processes that may involve spatial collisions; we cannot even automate a robot to perform physical actions. This paper provides an overview of indoor localization technologies, popular models for extracting semantics from location data, approaches for associating semantic information and location data, and applications that may be enabled with location semantics. To make the presentation easy to understand, we will use a museum scenario to explain the pros and cons of different technologies and models. More specifically, we will first explore users' needs in a museum scenario. Based on these needs, we will then discuss advantages and disadvantages of using different localization technologies to meet these needs. From these discussions, we can highlight gaps between real application requirements and existing technologies, and point out promising localization research directions. Similarly, we will also discuss context information required by different applications and explore models and ontologies for connecting users, objects, and environment factors with semantics. By identifying gaps between various models and real application requirements, we can draw a road map for future location semantics research.

Abstract

To facilitate distributed communication in mobile settings, we developed a system for creating and sharing gaze annotations using head-mounted displays, such as Google Glass. Gaze annotations make it possible to point out objects of interest within an image and add a verbal description to them. To create an annotation, the user simply looks at an object of interest in the image and speaks the information connected to the object. The gaze location is recorded and inserted as a gaze marker, and the voice is transcribed using speech recognition. After an annotation has been created, it can be shared with another person. We performed a user study showing that users found gaze annotations added precision and expressiveness compared to annotating the whole image.

Abstract

We present a novel system for detecting and capturing paper documents on a tabletop using a 4K video camera mounted overhead on pan-tilt servos. Our automated system first finds paper documents on a cluttered tabletop based on a text probability map, and then takes a sequence of high-resolution frames of the located document to reconstruct a high quality and fronto-parallel document page image. The quality of the resulting images enables OCR processing on the whole page. We performed a preliminary evaluation on a small set of 10 document pages and our proposed system achieved 98% accuracy with the open source Tesseract OCR engine.

Abstract

Web-based tools for remote collaboration are quickly becoming an established element of the modern workplace. During live meetings, people share web sites, edit presentation slides, and share code editors. It is common for participants to refer to previously spoken or shared content in the course of synchronous distributed collaboration. A simple approach is to index the shared video frames, or key-frames, with Optical Character Recognition (OCR) and let users retrieve them with text queries. Here we show that a complementary approach is to look at the actions users take inside the live document streams. Based on observations of real meetings, we focus on two important signals: text editing and mouse cursor motion. We describe the detection of text and cursor motion, their implementation in our WebRTC-based system, and how users are better able to search live documents during a meeting based on these detected and indexed actions.