Abstract

Tutorials are one of the most fundamental means of conveying knowledge. Ideally when the task involves physical or digital objects, tutorials not only describe each step with text or via audio narration but show it as well using photos or animation. In most cases, online tutorial authors capture media from handheld mobile devices to compose these documents, but increasingly they use wearable devices as well. In this work, we explore the full life-cycle of online tutorial creation and viewing using head-mounted capture and displays. We developed a media-capture tool for Google Glass that requires minimal attention to the capture device and instead allows the author to focus on creating the tutorial's content rather than its capture. The capture tool is coupled with web-based authoring tools for creating annotatable videos and multimedia documents. In a study comparing standalone (camera on tripod) versus wearable capture (Google Glass) as well as two types of multimedia representation for authoring tutorials (video-based or document-based), we show that tutorial authors have a preference for wearable capture devices, especially when recording activities involving larger objects in non-desktop environments. Authors preferred document-based multimedia tutorials because they are more straightforward to compose and the step-based structure translates more directly to explaining a procedure. In addition, we explored using head-mounted displays (Google Glass) for accessing tutorials in comparison to lightweight computing devices such as tablets. Our study included tutorials recorded with the same capture methods as in our access study. We found that although authors preferred head-mounted capture, tutorial consumers preferred video recorded by a camera on tripod that provides a more stable image of the workspace. 
Head-mounted displays are good for glanceable information; however, video demands more attention, and our participants made more errors using Glass than when using a tablet, which was easier to ignore. Our findings point to several design implications for online tutorial authoring and access methods.

Abstract

Advances in small, low-power electronics have created new opportunities for the Internet of Things (IoT), leading to an explosion of physical objects being connected to the Internet. However, there is still no indoor localization solution that meets the needs of various location-based IoT applications with the desired simplicity, robustness, accuracy, and responsiveness. We introduce Foglight, a visible-light-enabled indoor localization system for IoT devices that relies on the unique spatial encoding produced when mechanical mirrors inside a projector are flipped based on gray-coded binary images. Foglight employs simple off-the-shelf light sensors that can be easily coupled with existing IoT devices, such as thermometers, gas meters, or light switches, making their location discoverable. Our sensor unit is computationally efficient; it can perform high-accuracy localization with minimal signal processing overhead, allowing any low-power IoT device on which it rests to locate itself. Additionally, results from our evaluation reveal that Foglight can locate a target device with an average accuracy of 1.7 millimeters and an average refresh rate of 84 Hz with minimal latency: 31.46 milliseconds on WiFi and 23.2 milliseconds on serial communication. Two example applications are developed as proof of concept to demonstrate possible scenarios. We also discuss limitations and how they could be overcome, and propose next steps.
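
As an illustration of the gray-coded spatial encoding mentioned above, the following is a minimal sketch of how a sequence of thresholded light-sensor readings could be decoded into a projector column index. The function names, the most-significant-frame-first ordering, and the 10-frame pattern length are assumptions made for this example; the abstract does not specify Foglight's actual decoding pipeline.

```python
def binary_to_gray(n: int) -> int:
    """Reflected binary Gray code of n."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Invert binary_to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def encode_column(column: int, n_frames: int) -> list[int]:
    """Bits a sensor at `column` would observe, most significant frame first.

    Hypothetical helper: one projected frame per Gray-code bit.
    """
    gray = binary_to_gray(column)
    return [(gray >> i) & 1 for i in range(n_frames - 1, -1, -1)]


def decode_column(bits: list[int]) -> int:
    """Recover the column index from thresholded readings (1 = lit, 0 = dark)."""
    gray = 0
    for bit in bits:
        gray = (gray << 1) | bit
    return gray_to_binary(gray)


if __name__ == "__main__":
    readings = encode_column(300, n_frames=10)  # simulated sensor readings
    print(decode_column(readings))
```

Gray codes are a natural fit for this kind of scheme because adjacent columns differ in exactly one bit, so a sensor sitting on a stripe boundary is off by at most one position. Decoding needs only shifts and XORs, which matches the abstract's emphasis on low signal processing overhead.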

Abstract

We present a system for capturing ink strokes written with ordinary pen and paper using a fast camera with a frame rate comparable to a stylus digitizer. From the video frames, ink strokes are extracted and used as input to an online handwriting recognition engine. A key component in our system is a pen up/down detection model for detecting the contact of the pen-tip with the paper in the video frames. The proposed model consists of feature representation with convolutional neural networks and classification with a recurrent neural network. We also use a high speed tracker with kernelized correlation filters to track the pen-tip. For training and evaluation, we collected labeled video data of users writing English and Japanese phrases from public datasets, and we report on character accuracy scores for different frame rates in the two languages.

Abstract

Video telehealth is growing to allow more clinicians to see patients from afar. As a result, clinicians, typically trained for in-person visits, must learn to communicate both health information and non-verbal affective signals to patients through a digital medium. We introduce a system called ReflectLive that senses and provides real-time feedback about non-verbal communication behaviors to clinicians so they can improve their communication behaviors. A user evaluation with 10 clinicians showed that the real-time feedback helped clinicians maintain better eye contact with patients and was not overly distracting. Clinicians reported being more aware of their non-verbal communication behaviors and reacted positively to summaries of their conversational metrics, motivating them to want to improve. Using ReflectLive as a probe, we also discuss the benefits and concerns around automatically quantifying the “soft skills” and complexities of clinician-patient communication, the controllability of behaviors, and the design considerations for how to present real-time and summative feedback to clinicians.

Abstract

Humans are complex and their behaviors follow complex multimodal patterns; however, to solve many social computing problems, one often looks at complexity in large-scale yet single-point data sources or methodologies. While single-data/single-method techniques, fueled by large-scale data, have enjoyed some success, they are not without fault. Often with one type of data and method, all the other aspects of human behavior are overlooked, discarded, or, worse, misrepresented. We frame this as two problems. First, some social computing problems cannot be solved using a single data source and need intelligence from multiple modalities; second, some social behavior cannot be fully understood using only one form of methodology. Throughout this talk, we discuss these problems and their implications, illustrate examples, and propose new directions for approaching social computing research today.

Abstract

Discovering and analyzing biclusters, i.e., two sets of related entities with close relationships, is a critical task in many real-world applications, such as exploring entity co-occurrences in intelligence analysis and studying gene expression in bioinformatics. While the output of biclustering techniques can offer some initial low-level insights, visual approaches are required on top of that due to the complexity of the algorithmic output. This paper proposes a visualization technique, called BiDots, that allows analysts to interactively explore biclusters over multiple domains. BiDots overcomes several limitations of existing bicluster visualizations by encoding biclusters in a more compact and cluster-driven manner. A set of handy interactions is incorporated to support flexible analysis of biclustering results. More importantly, BiDots addresses the case of weighted biclusters, which has been underexploited in the literature. The design of BiDots is grounded in a set of analytical tasks derived from previous work. We demonstrate its usefulness and effectiveness for exploring computed biclusters with an investigative document analysis task, in which suspicious people and activities are identified from a text corpus.

Abstract

During asynchronous collaborative analysis, handoff of partial findings is challenging because externalizations produced by analysts may not adequately communicate their investigative process. To address this challenge, we developed techniques to automatically capture and help encode tacit aspects of the investigative process based on an analyst’s interactions, and to streamline explicit authoring of handoff annotations. We designed our techniques to mediate awareness of analysis coverage, support explicit communication of progress and uncertainty through annotation, and support implicit communication through playback of investigation histories. To evaluate our techniques, we developed an interactive visual analysis system, KTGraph, that supports an asynchronous investigative document analysis task. We conducted a two-phase user study to characterize a set of handoff strategies and to compare investigative performance with and without our techniques. The results suggest that our techniques promote the use of more effective handoff strategies, increase awareness of the prior investigative process and insights, and improve final investigative outcomes.

Abstract

Does the structure of family trees differ by ancestral traits over generations, and if so, how? This is a fundamental question regarding the structural heterogeneity of family trees in multi-generational transmission research. However, previous work mostly focuses on parent-child scenarios due to the lack of proper tools to handle the complexity of extending the research to multi-generational processes. Through an iterative design study with social scientists and historians, we developed TreeEvo, which helps users generate and test empirical hypotheses for multi-generational research. TreeEvo summarizes and organizes family trees by structural features in a dynamic manner based on a traditional Sankey diagram. A pixel-based technique is further proposed to compactly encode trees with complex structures in each Sankey node. Detailed information about trees is accessible through a space-efficient visualization with semantic zooming. Moreover, TreeEvo embeds a Multinomial Logit Model (MLM) to examine statistical associations between tree structure and ancestral traits. We demonstrate the effectiveness and usefulness of TreeEvo through an in-depth case study with domain experts using a real-world dataset (containing 54,128 family trees of 126,196 individuals).

Abstract

For tourists, interactions with digital public displays often depend on specific technologies that users may not be familiar with (QR codes, NFC, Bluetooth); may not have access to because of networking issues (SMS), a missing app (QR codes), or missing device technology (NFC); may not want to use because of time constraints (WiFi, Bluetooth); or may not want to use because they are worried about sharing their data with a third-party service (text, WiFi). In this demonstration, we introduce ItineraryScanner, a system that allows users to seamlessly share content with a public travel kiosk system.

Abstract

Video summarization and video captioning are considered two separate tasks in existing studies. For longer videos, automatically identifying the important parts of video content and annotating them with captions will enable a richer and more concise condensation of the video. We propose a general neural network architecture that jointly considers two supervisory signals (i.e., an image-based video summary and text-based video captions) in the training phase and generates both a video summary and corresponding captions for a given video in the test phase. Our main idea is that the summary signals can help a video captioning model learn to focus on important frames. On the other hand, caption signals can help a video summarization model learn better semantic representations. Jointly modeling both the video summarization and the video captioning tasks offers a novel end-to-end solution that generates a captioned video summary, enabling users to index and navigate through the highlights in a video. Moreover, our experiments show the joint model can achieve better performance than state-of-the-art approaches in both individual tasks.

Abstract

In this paper, we describe DocHandles, a novel system that allows users to link to specific document parts in their chat applications. As users type a message, they can invoke the tool by referring to a specific part of a document, e.g., “@fig1 needs revision”. By combining text parsing and document layout analysis, DocHandles can find and present all the figures “1” inside previously shared documents, allowing users to explicitly link to the relevant “document handle”. Documents become first-class citizens inside the conversation stream where users can seamlessly integrate documents in their text-centric messaging application.
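
To make the text-parsing step concrete, here is a minimal sketch of how handles such as “@fig1” might be pulled out of a chat message. The handle grammar (an `@`, a run of letters, then a number) and the names `HANDLE_RE` and `find_handles` are assumptions for illustration; the abstract only gives the single example “@fig1” and does not describe the actual parser or the layout-analysis stage that resolves a handle to a document region.

```python
import re

# Assumed handle syntax: "@" + kind (letters) + index (digits), e.g. "@fig1".
# The lookbehind avoids matching inside e-mail addresses such as bob@fig1.com.
HANDLE_RE = re.compile(r"(?<!\w)@([A-Za-z]+)(\d+)\b")


def find_handles(message: str) -> list[tuple[str, int]]:
    """Return (kind, index) pairs for every handle found in a chat message."""
    return [(kind.lower(), int(index)) for kind, index in HANDLE_RE.findall(message)]


if __name__ == "__main__":
    print(find_handles("@fig1 needs revision"))
```

Each extracted pair would then be matched against the figures, tables, or sections that layout analysis found in previously shared documents, so the user can pick the intended one.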

Abstract

It is increasingly possible to use cameras and sensors to detect and analyze human appearance for the purposes of personalizing user experiences. Such systems are already deployed in some public places to personalize advertisements and recommend items. However, since these technologies are not yet widespread, we do not have a good sense of the perceived benefits and drawbacks of public display systems that use face detection as an input for personalized recommendations. We conducted a user study with a system that inferred a user’s gender and age from a facial detection and analysis algorithm and used this to present recommendations in two scenarios (finding stores to visit in a mall and finding a pair of sunglasses to buy). This work provides an initial step towards understanding user reactions to a new and emerging form of implicit recommendation based on physical appearance.

Abstract

The availability of mobile access has shifted social media use. As a result, what users share on social media and where they go is an excellent resource for learning their visiting behavior. Knowing visit behaviors would help market surveys and customer relationship management, e.g., sending customers coupons for the businesses that they visit frequently. Most prior studies leverage metadata, e.g., check-in locations, to profile visiting behavior but neglect important information from user-contributed content, e.g., images. This work addresses a novel use of image content for predicting user visit behavior, i.e., the business venue categories that the content owner visits frequently and regularly. To collect training data, we propose a strategy that uses geo-metadata associated with images to derive the labels of an image owner’s visit behavior. Moreover, we model a user’s sequential images using an end-to-end learning framework to reduce the optimization loss. As our experiments demonstrate, this improves prediction accuracy over the baseline. The prediction is based entirely on image content, which is more widely available in social media than geo-metadata, and thus allows a wider set of users to be profiled.

Abstract

Video conferencing is widely used to help deliver educational presentations, such as lectures or informational webinars, to a distributed audience. While individuals in a dyadic conversation may be able to use webcam streams to assess the engagement level of their interlocutor with some ease, as the size of the audience in a video conference setting increases, it becomes increasingly difficult to interpret how engaged the overall group may be. In this work, we use a mixed-methods approach to understand how presenters and attendees of online presentations use available cues to perceive and interpret audience behavior (such as how engaged the group is). Our results suggest that while webcams are seen as useful by presenters to increase audience visibility and encourage attention, audience members do not uniformly benefit from seeing others’ webcams; other interface cues such as chat may be more useful and informative engagement indicators for both parties. We conclude with design recommendations for future systems to improve what is sensed and presented.

Abstract

In this paper, we propose a real-time classification scheme to cope with noisy Radio Signal Strength Indicator (RSSI) measurements used in indoor positioning systems. RSSI values are often converted to distances for position estimation. However, due to multipath and shadowing effects, finding a unique sensor model using either parametric or nonparametric methods is highly challenging. We learn decision regions using Gaussian process classification to accept measurements that are consistent with the operating sensor model. The proposed approach can run online, does not rely on a particular sensor model or parameters, and is robust to sensor failures. Experimental results obtained on hardware show that available positioning algorithms can benefit from incorporating the classifier into their measurement model as a meta-sensor modeling technique.
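
For context on the RSSI-to-distance conversion this abstract refers to, the sketch below uses the widely used log-distance path-loss model. This is one common parametric choice, not the method of the paper (which explicitly avoids depending on a particular sensor model). The default reference power of -40 dBm at 1 m and the path-loss exponent of 2.0 are illustrative assumptions; in practice they are calibrated per environment.

```python
def rssi_to_distance(rssi_dbm: float,
                     rssi_at_ref_dbm: float = -40.0,   # assumed RSSI at the reference distance
                     path_loss_exponent: float = 2.0,  # assumed; about 2 in free space
                     ref_distance_m: float = 1.0) -> float:
    """Estimate distance in meters with the log-distance path-loss model.

    Model: rssi = rssi_at_ref - 10 * n * log10(d / d_ref), solved for d.
    """
    exponent = (rssi_at_ref_dbm - rssi_dbm) / (10.0 * path_loss_exponent)
    return ref_distance_m * 10.0 ** exponent


if __name__ == "__main__":
    for rssi in (-40.0, -60.0, -75.0):
        print(rssi, round(rssi_to_distance(rssi), 2))
```

Because distance grows exponentially with the drop in RSSI, a few dB of multipath or shadowing noise can shift the estimate by meters, which is why screening out inconsistent measurements before position estimation is attractive.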

Abstract

Observe a person pointing out and describing something. Where is that person looking? Chances are good that this person also looks at what she is talking about and pointing at. Gaze is naturally coordinated with our speech and hand movements. By utilizing this tendency, we can create a natural interaction with computing devices and environments. In this chapter, we will first briefly discuss some basic properties of the gaze signal we can get from eye trackers, followed by a review of a multimodal system utilizing the gaze signal as one input modality. In Multimodal Gaze Interaction, data from eye trackers is used as an active input mode where, for instance, gaze is used as an alternative, or complementary, pointing modality along with other input modalities. Using gaze as an active or explicit input method is challenging for several reasons. One is that eyes are primarily used for perceiving our environment, so knowing when a person selects an item with gaze versus just looking around is an issue. Researchers have tried to solve this by combining gaze with various input methods, such as manual pointing, speech, and touch. However, gaze information can also be used in interactive systems for purposes other than explicit pointing, since a user's gaze is a good indication of the user's attention. In passive gaze interaction, gaze is not used as the primary input method, but as a supporting input method. In these kinds of systems, gaze is mainly used for inferring and reasoning about the user's cognitive state or activities in a way that can support the interaction. These kinds of multimodal systems often combine gaze with a multitude of input modalities.
In this chapter we focus on interactive systems, exploring the design space for gaze-informed multimodal interaction, which spans from gaze as an active input mode to gaze as a passive one, and from stationary usage scenarios (e.g., at a desk) to mobile ones. There are a number of studies aimed at describing, detecting, or modeling specific behaviors or cognitive states. We will touch on some of these works since they can guide us in how to build gaze-informed multimodal interaction.

Abstract

Work breaks can play an important role in the mental and physical well-being of workers and contribute positively to productivity. In this paper we explore the use of activity-, physiological-, and indoor-location sensing to promote mobility during work-breaks. While the popularity of devices and applications to promote physical activity is growing, prior research highlights important constraints when designing for the workplace. With these constraints in mind, we developed BreakSense, a mobile application that uses a Bluetooth beacon infrastructure, a smartphone and a smartwatch to encourage mobility during breaks with a game-like design. We discuss constraints imposed by design for work and the workplace, and highlight challenges associated with the use of noisy sensors and methods to overcome them. We then describe a short deployment of BreakSense within our lab that examined bound vs. unbound augmented breaks and how they affect users’ sense of completion and readiness to work.

Abstract

Users often use social media to share their interest in products. We propose to identify purchase stages from Twitter data following the AIDA model (Awareness, Interest, Desire, Action). In particular, we define a task of classifying the purchase stage of each tweet in a user's tweet sequence. We introduce RCRNN, a Ranking Convolutional Recurrent Neural Network which computes tweet representations using convolution over word embeddings and models a tweet sequence with gated recurrent units. Also, we consider various methods to cope with the imbalanced label distribution in our data and show that a ranking layer outperforms class weights.
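
As a point of reference for the class-weight baseline mentioned above, the following sketch computes inverse-frequency class weights, where each class weight is N / (K * count) for N samples and K classes. This particular weighting formula and the toy label distribution are assumptions for illustration; the abstract does not say how the class weights were computed.

```python
from collections import Counter


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Inverse-frequency weights: rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


if __name__ == "__main__":
    # Toy, made-up distribution over the four AIDA stages.
    stages = ["awareness"] * 6 + ["interest"] * 2 + ["desire"] + ["action"]
    print(balanced_class_weights(stages))
```

Such weights are typically multiplied into the per-example loss so that errors on rare stages cost more during training.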

Abstract

We present Lift, a visible-light-enabled finger tracking and object localization technique that allows users to perform freestyle multi-touch gestures on any object’s surface in an everyday environment. By projecting encoded visible patterns onto an object’s surface (e.g., paper, display, or table), and localizing the user’s fingers with light sensors, Lift offers users a richer interactive space than the device’s existing interfaces. Additionally, everyday objects can be augmented by attaching sensor units onto their surface to accept multi-touch gesture input. We also present two applications as a proof of concept. Finally, results from our experiments indicate that Lift can localize ten fingers simultaneously with accuracy of 0.9 mm and 1.8 mm on two axes respectively and an average refresh rate of 84 Hz with 16.7 ms delay on WiFi and 12 ms delay on serial, making gesture recognition on non-instrumented objects possible.

Abstract

This is a summary of our participation in the TRECVID 2016 video hyperlinking task (LNK). We submitted four runs in total. A baseline system combined established vector-space text indexing with cosine similarity. Our other runs explored the use of distributed word representations in combination with fine-grained inter-segment text similarity measures.
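
The vector-space baseline can be illustrated with a small sketch that ranks candidate target segments against an anchor segment by cosine similarity of their term vectors. Raw term counts, the simple tokenizer, and the names `term_vector` and `rank_targets` are assumptions made for this example; the abstract does not state the term weighting or indexing software actually used.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts (assumed weighting: raw term frequency)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(anchor: str, targets: list[str]) -> list[tuple[float, str]]:
    """Sort candidate target transcripts by similarity to the anchor transcript."""
    anchor_vec = term_vector(anchor)
    scored = [(cosine_similarity(anchor_vec, term_vector(t)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    for score, text in rank_targets("solar panel", ["cooking pasta", "solar panel install"]):
        print(round(score, 3), text)
```

A purely lexical measure like this scores segments with no shared words as zero, which is the gap that distributed word representations are meant to close.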

Abstract

Videos recorded with current mobile devices are increasingly geotagged at fine granularity and used in various location based applications and services. However, raw sensor data collected is often noisy, resulting in subsequent inaccurate geospatial analysis. In this study, we focus on the challenging correction of compass readings and present an automatic approach to reduce these metadata errors. Given the small geo-distance between consecutive video frames, image-based localization does not work due to the high ambiguity in the depth reconstruction of the scene. As an alternative, we collect geographic context from OpenStreetMap and estimate the absolute viewing direction by comparing the image scene to world projections obtained with different external camera parameters. To design a comprehensive model, we further incorporate smooth approximation and feature-based rotation estimation when formulating the error terms. Experimental results show that our proposed pyramid-based method outperforms its competitors and reduces orientation errors by an average of 58.8%. Hence, for downstream applications, improved results can be obtained with these more accurate geo-metadata. To illustrate, we present the performance gain in landmark retrieval and tag suggestion by utilizing the accuracy-enhanced geo-metadata.

Abstract

Accurate map matching has been a fundamental but challenging problem that has drawn great research attention in recent years. It aims to reduce the uncertainty in a trajectory by matching the GPS points to the road network on a digital map. Most existing work has focused on estimating the likelihood of a candidate path based on the GPS observations, while neglecting to model the probability of a route choice from the perspective of drivers. Here we propose a novel feature-based map matching algorithm that estimates the cost of a candidate path based on both GPS observations and human factors. To take human factors into consideration is very important especially when dealing with low sampling rate data where most of the movement details are lost. Additionally, we simultaneously analyze a subsequence of coherent GPS points by utilizing a new segment-based probabilistic map matching strategy, which is less susceptible to the noisiness of the positioning data. We have evaluated the proposed approach on a public large-scale GPS dataset, which consists of 100 trajectories distributed all over the world. The experimental results show that our method is robust to sparse data with large sampling intervals (e.g., 60 s to 300 s) and challenging track features (e.g., u-turns and loops). Compared with two state-of-the-art map matching algorithms, our method substantially reduces the route mismatch error by 6.4% to 32.3% and obtains the best map matching results in all the different combinations of sampling rates and challenging features.

Abstract

Improvements in sensor and wireless network technologies enable accurate, automated, instant determination and dissemination of a user's or object's position. Apart from the now-ubiquitous networking infrastructure, the new enabler of location-based services (LBSs) is the enrichment of the different systems with semantic information, such as time, location, individual capability, preference, and more. Such semantically enriched system modeling aims at developing applications with enhanced functionality and advanced reasoning capabilities. These systems are able to deliver more personalized services to users by combining domain knowledge with advanced reasoning mechanisms, and to provide solutions to problems that were otherwise infeasible. This approach also takes users' preferences and place properties into consideration, which can be utilized to achieve a comprehensive range of personalized services, such as advertising, recommendations, or polling. This paper provides an overview of indoor localization technologies, popular models for extracting semantics from location data, approaches for associating semantic information and location data, and applications that may be enabled with location semantics. To make the presentation easy to understand, we will use a museum scenario to explain the pros and cons of different technologies and models. More specifically, we will first explore users' needs in a museum scenario. Based on these needs, we will then discuss the advantages and disadvantages of using different localization technologies to meet these needs. From these discussions, we can highlight gaps between real application requirements and existing technologies, and point out promising localization research directions. By identifying gaps between various models and real application requirements, we can draw a road map for future location semantics research.

Abstract

We propose a robust pointing detection technique with a virtual shadow representation for interacting with a public display. Using a depth camera, the system generates the user's shadow with a model of angled virtual sunlight and detects the nearest point as a pointer. The shadow's position rises as the user walks closer, which conveys the correct distance for controlling the pointer and makes the higher area of the display accessible.

Abstract

The proliferation of workplace multimedia collaboration applications has meant on one hand more opportunities for group work but on the other more data locked away in proprietary interfaces. We are developing new tools to capture and access multimedia content from any source. In this demo, we focus primarily on new methods that allow users to rapidly reconstitute, enhance, and share document-based information.

Abstract

Adapting to personal needs and supporting correct posture are important in physiotherapy training. In this demo, we show a dual screen application (handheld and TV) that allows patients to view hypervideo training programs. Designed to guide their daily exercises, these programs can be adapted to daily needs. The dual screen concept offers the positional flexibility missing in single screen solutions.

Abstract

Dual screen concepts for hypervideo-based physiotherapy training are important in healthcare settings, but existing applications often cannot be adapted to personal needs and do not support correct posture. In this paper, we describe the design and implementation of a dual screen application (handheld and TV) that allows patients to view hypervideos designed to help them correctly perform their exercises. This approach lets patients adapt their training to their daily needs and their overall training progress. We evaluated this prototypical implementation in a user test with post-operative care prostate cancer patients. From our results, we derived design recommendations for dual screen physical training hypervideo applications.

Abstract

Several systems now exist for creating hypervideos. However, creating the video scenes that are assembled into a hypervideo is a tedious and time-consuming job. At the same time, huge video databases like YouTube already provide rich sources of video material. Yet downloading and re-purposing videos from there is not legally allowed, which calls for a solution that links whole videos or parts of videos and plays them from the platform in an embedded player. This work presents the SIVA Web Producer, a Chrome extension for the creation of hypervideos consisting of scenes from YouTube videos. After a project is created, the Chrome extension allows users to import YouTube videos or parts thereof as video clips. These can then be linked into a scene graph. A preview is provided, and finalized videos can be published on the SIVA Web Portal.

Abstract

In this paper we describe DocuGram, a novel tool to capture and share documents from any application. As users scroll through pages of their document inside the native application (Word, Google Docs, web browser), the system captures and analyses in real-time the video frames and reconstitutes the original document pages into an easy to view HTML-based representation. In addition to regenerating the document pages, a DocuGram also includes the interactions users had over them, e.g. mouse motions and voice comments. A DocuGram acts as a modern copy machine, allowing users to copy and share any document from any application.

Abstract

Most teleconferencing tools treat users in distributed meetings monolithically: all participants are meant to be connected to one another in more or less the same manner. In reality, though, people connect to meetings in all manner of different contexts, sometimes sitting in front of a laptop or tablet giving their full attention, but at other times mobile and involved in other tasks or as a liminal participant in a larger group meeting. In this paper we present the design and evaluation of two applications, Penny and MeetingMate, designed to help users in non-standard contexts participate in meetings.

Abstract

The abundance of data posted to Twitter enables companies to extract useful information, such as Twitter users who are dissatisfied with a product. We endeavor to determine which Twitter users are potential customers for companies and would be receptive to product recommendations through the language they use in tweets after mentioning a product of interest. With Twitter's API, we collected tweets from users who tweeted about mobile devices or cameras. An expert annotator determined whether each tweet was relevant to customer purchase behavior and whether a user, based on their tweets, eventually bought the product. For the relevance task, among four models, a feed-forward neural network yielded the best cross-validation accuracy of over 80% per product. For customer purchase prediction of a product, we observed improved performance with the use of sequential input of tweets to recurrent models, with an LSTM model being best; we also observed the use of relevance predictions in our model to be more effective with less powerful RNNs and on more difficult tasks.

Abstract

Two related challenges with current teleoperated robotic systems are lack of peripheral vision and awareness, and difficulty or tedium of navigating through remote spaces. We address these challenges by providing an interface with a focus plus context (F+C) view of the robot location, and where the user can navigate simply by looking where they want to go, and clicking or drawing a path on the view to indicate the desired trajectory or destination. The F+C view provides an undistorted, perspectively correct central region surrounded by a wide field of view peripheral portion, and avoids the need for separate views. The navigation method is direct and intuitive in comparison to keyboard or joystick based navigation, which require the user to be in a control loop as the robot moves. Both the F+C views and the direct click navigation were evaluated in a preliminary user study.

Abstract

Mobile Telepresence Robots (MTR) are an emerging technology that extends the functionality of telepresence systems by adding mobility. MTRs nowadays, however, rely on stationary imaging systems such as a single narrow-view camera for vision, which can lead to reduced operator performance due to view-related deficiencies in situational awareness. We therefore developed an improved imaging and viewing platform that allows immersive telepresence using a Head Mounted Device (HMD) with head-tracked mono and stereoscopic video. Using a remote collaboration task to ground our research, we examine the effectiveness of head-tracked HMD systems in comparison to a baseline monitor-based system.
We performed a user study where participants were divided into three groups: fixed camera monitor-based baseline condition (without HMD), HMD with head-tracked 2D camera and HMD with head-tracked stereo camera. Results showed the use of HMD reduces task error rates and improves perceived collaborative success and quality of view, compared to the baseline condition. No major difference was found, however, between stereo and 2D camera conditions for participants wearing an HMD.

Abstract

Social media offers potential opportunities for businesses to extract business intelligence. This paper presents Tweetviz, an interactive tool to help businesses extract actionable information from a large set of noisy Twitter messages. Tweetviz visualizes tweet sentiment of business locations, identifies other business venues that Twitter users visit, and estimates some simple demographics of the Twitter users frequenting a business. A user study to evaluate the system's ability indicates that Tweetviz can provide an overview of a business's issues and sentiment as well as information aiding users in creating customer profiles.

Abstract

Web videos are becoming more and more popular. Current web technologies make it simpler than ever to both stream videos and create complex constructs of interlinked videos with additional information (video, audio, images, and text); so-called hypervideos. When viewers interact with hypervideos by clicking on links, new content has to be loaded. This may lead to excessive waiting times, interrupting the presentation -- especially when videos are loaded into the hypervideo player. In this work, we propose hypervideo pre-fetching strategies, which can be implemented in players to minimize waiting times. We examine the possibilities offered by the HTML5 <video> tag as well as the Media Source Extensions (MSE). Both HTML5 and MSE allow element pre-fetching (video and additional information) up to a certain granularity. Depending on the strategy and technology used, beginning scene waiting times and the overall download volume may increase. The strategies presented in this paper allow the number of delays during playback and the overall waiting time of a video to be reduced significantly from an average of 8.1 breaks to less than one break. The overall waiting times can be reduced by one third, to less than 18 seconds, improving the hypervideo watching experience.

Abstract

Mobile Audio Commander (MAC) is a mobile phone-based multimedia sensing system that facilitates the introduction of extra sensors to existing mobile robots for advanced capabilities. In this paper, we use MAC to introduce an accurate indoor positioning sensor to a robot to facilitate its indoor navigation. More specifically, we use a projector to send out a position ID through a light signal, use a light sensor and the audio channel on a mobile phone to decode the position ID, and send navigation commands to a target robot through audio output. With this setup, our system can simplify the user's robot navigation. Users can define a robot navigation path on a phone, and our system will compare the navigation path with its accurate location sensor inputs and generate an analog line-following signal, a collision avoidance signal, and an analog angular signal to adjust the robot's straight movements and turns. This paper describes two examples of using MAC and a positioning system to enable complicated robot navigation with proper user interface design, external circuit design and real sensor installations on existing robots.

Abstract

Captions are a central component in image posts that communicate the background story behind photos. Captions can enhance the engagement with audiences and are therefore critical to campaigns or advertisement. Previous studies in image captioning either rely solely on image content or summarize multiple web documents related to the image's location; both neglect users' activities. We propose business-aware latent topics as a new contextual cue for image captioning that represent user activities. The idea is to learn the typical activities of people who posted images from business venues with similar categories (e.g., fast food restaurants) to provide appropriate context for similar topics (e.g., burger) in new posts. User activities are modeled via a latent topic representation. In turn, the image captioning model can generate sentences that better reflect user activities at business venues. In our experiments, the business-aware latent topics are more effective than the existing baselines for adapting captions to images captured in various businesses. Moreover, they complement other contextual cues (image, time) in a multi-modal framework.

Abstract

We previously created the HyperMeeting system to support a chain of geographically and temporally distributed meetings in the form of a hypervideo. This paper focuses on playback plans that guide users through the recorded meeting content by automatically following available hyperlinks. Our system generates playback plans based on users' interests or prior meeting attendance and presents a dialog that lets users select the most appropriate plan. Prior experience with playback plans revealed users' confusion with automatic link following within a sequence of meetings. To address this issue, we designed three timeline visualizations of playback plans. A user study comparing the timeline designs indicated that different visualizations are preferred for different tasks, making switching among them important. The study also provided insights that will guide research of personalized hypervideo, both inside and outside a meeting context.

Abstract

It is difficult to adjust the content of traditional slide presentations to the knowledge level, interest and role of individuals. This might force presenters to include content that is irrelevant for part of the audience, which negatively affects the knowledge transfer of the presentation. In this work, we present a prototype that is able to eliminate non-pertinent information from slides by presenting annotations for individual attendees on optical head-mounted displays. We first create guidelines for creating optimal annotations by evaluating several types of annotations alongside different types of slides. Then we evaluate the knowledge acquisition of presentation attendees using the prototype versus traditional presentations. Our results show that annotations with a limited amount of information, such as text up to 5 words, can significantly increase the amount of knowledge gained from attending a group presentation. Additionally, presentations where part of the information is moved to annotations are judged more positively on attributes such as clarity and enjoyment.

Abstract

WSICC has established itself as a truly interactive workshop at EuroITV'13, TVX'14, and TVX'15 with three successful editions. The fourth edition of the WSICC workshop aims to bring together researchers and practitioners working on novel approaches for interactive multimedia content consumption. New technologies, devices, media formats, and consumption paradigms are emerging that allow for new types of interactivity. Examples include multi-panoramic video and object-based audio, increasingly available in live scenarios with content feeds from a multitude of sources.
All these recent advances have an impact on different aspects related to interactive content consumption, which the workshop categorizes into Enabling Technologies, Content, User Experience, and User Interaction. The resources from past editions of the workshop are available on the http://wsicc.net website.

Abstract

Hypervideo usage scenarios like physiotherapy trainings or instructions for manual tasks make it hard for users to use an input device like a mouse or touch screen on a hand-held device while they are performing an exercise or use both hands to perform a manual task. In this work, we try to overcome this issue by providing an alternative input method for hypervideo navigation using speech commands. In a user test, we evaluated two different speech recognition libraries, annyang (in combination with the Web Speech API) and PocketSphinx.js (in combination with the Web Audio API), for their usability to control hypervideo players. Test users spoke 18 words, either in German or English, which were recorded and then processed by both libraries. We found that annyang shows better recognition results. However, depending on other factors of influence, like the occurrence of background noise (reliability), the availability of an internet connection, or the browser used, PocketSphinx.js may be a better fit.

Abstract

Hypervideo-based physiotherapy trainings bear an opportunity to support patients in continuing their training after being released from a rehabilitation clinic. Many exercises require the patient to sit on the floor or a gymnastic ball, lie on a gymnastics mat, or do the exercises in other postures. Using a laptop or tablet with a stand to show the exercises is more helpful than, for example, just having some drawings on a leaflet. However, it may lead to incorrect execution of the exercises while maintaining eye contact with the screen or require the user to get up and select the next exercise if the device is positioned for a better view. A dual screen application, where contents are shown on a TV screen and the flow of the video can be controlled from a mobile second device, allows patients to keep their correct posture and at the same time view and select contents. In this paper we propose first studies for user interface designs for such apps. Initial paper prototypes are discussed and refined in two focus groups. The results are then presented to a broader range of users in a survey. Three prototypes for the mobile app and one prototype for the TV are identified for future user tests.

Abstract

The creation of hypervideos usually requires a lot of planning and is time consuming with respect to media content creation. However, when structure and media are put together to author a hypervideo, it may only require minor changes to make the hypervideo available in other languages or for another user group (like beginners versus experts). However, to make the translation of media and all navigation elements of a hypervideo efficient and manageable, the authoring tool needs a GUI that provides a good overview of elements that can be translated and of missing translations. In this work, we propose screen concepts that help authors to provide different versions (for example language and/or experience level) of a hypervideo. We analyzed different variants of GUI elements and evaluated them in a survey. We draw guidelines from the results that can help with the creation of similar systems in the future.

Abstract

The confluence of technologies such as telepresence, immersive imaging, model based virtual mirror worlds, mobile live streaming, etc. gives rise to a capability for people anywhere to view and connect with present or past events nearly anywhere on earth. This capability properly belongs to a public commons, available as a birthright of all humans, and can be seen as part of an evolutionary transition supporting a global collective mind. We describe examples and elements of this capability, and suggest how they can be better integrated through a tool we call TeleViewer and a framework called WorldViews, which supports easy sharing of views as well as connecting of providers and consumers of views all around the world.

Abstract

Most current mobile and wearable devices are equipped with inertial measurement units (IMU) that allow the detection of motion gestures, which can be used for interactive applications. A difficult problem to solve, however, is how to separate ambient motion from an actual motion gesture input. In this work, we explore the use of motion gesture data labeled with gesture execution phases for training supervised learning classifiers for gesture segmentation. We believe that using gesture execution phase data can significantly improve the accuracy of gesture segmentation algorithms. We define gesture execution phases as the start, middle and end of each gesture. Since labeling motion gesture data with gesture execution phase information is work intensive, we used crowd workers to perform the labeling. Using this labeled data set, we trained SVM-based classifiers to segment motion gestures from ambient movement of the device. We describe initial results that indicate that gesture execution phase can be accurately recognized by SVM classifiers. Our main results show that training gesture segmentation classifiers with phase-labeled data substantially increases the accuracy of gesture segmentation: we achieved a gesture segmentation accuracy of 0.89 for simulated online segmentation using a sliding window approach.
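
To make the sliding window idea concrete, the following is a minimal sketch of online-style segmentation over a one-dimensional motion signal. It is an illustration under stated assumptions, not the classifier described above: the function names, the window and step sizes, and the simple energy threshold that stands in for a trained SVM are all hypothetical.

```python
def segment_gestures(samples, is_gesture_window, window=4, step=1):
    """Slide a fixed-size window over the samples and return merged
    (start, end) index ranges (end exclusive) whose windows were
    classified as gesture. Window and step sizes are illustrative."""
    segments = []
    for start in range(0, len(samples) - window + 1, step):
        end = start + window
        if is_gesture_window(samples[start:end]):
            if segments and start <= segments[-1][1]:
                # Overlapping or adjacent to the previous hit: extend it.
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
    return segments


def energy_classifier(win, threshold=1.0):
    """Hypothetical stand-in for a trained classifier: a window counts
    as gesture when its mean absolute magnitude exceeds a threshold."""
    return sum(abs(x) for x in win) / len(win) > threshold
```

In practice the stand-in classifier would be replaced by a model trained on phase-labeled windows; the merging step is one simple way to turn per-window decisions into gesture segments.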

Abstract

Search log analysis has become a common practice to gain insights into user search behaviour; it helps gain an understanding of user needs and preferences, as well as how well a system supports such needs. Currently log analysis is typically focused on the low-level user actions, i.e. logged events such as issued queries and clicked results; and often only a selection of such events are logged and analysed. However, the types of logged events may differ widely from interface to interface, making comparison between systems difficult. Further, analysing a selection of events may lead to conclusions out of context; e.g. the statistics of observed query reformulations may be influenced by the existence of a relevance feedback component. Alternatively, in lab studies user activities can be analysed at a higher level, such as search tactics and strategies, abstracted away from detailed interface implementation. However, the required manual codings that map logged events to higher level interpretations prevent this type of analysis from going large scale. In this paper, we propose a new method for analysing search logs by (semi-)automatically identifying user search tactics from logged events, allowing large scale analysis that is comparable across search systems. We validate the efficiency and effectiveness of the proposed tactic identification method using logs of two reference search systems of different natures: a product search system and a video search system. With the identified tactics, we perform a series of novel log analyses in terms of entropy rate of user search tactic sequences, demonstrating how this type of analysis allows comparisons of user search behaviours across systems of different nature and design. This analysis provides insights not achievable with traditional log analysis.
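
As an illustration of what an entropy rate over tactic sequences measures, the sketch below estimates it under a first-order Markov assumption, i.e. as the conditional entropy of the next tactic given the current one. The estimator, the function name, and the single-letter tactic labels in the usage note are assumptions made for this example and are not necessarily the formulation used in the paper.

```python
import math
from collections import Counter


def markov_entropy_rate(sequences):
    """Estimate the entropy rate (bits per transition) of symbol
    sequences, assuming a first-order Markov chain.

    Computes H = -sum_{a,b} p(a, b) * log2 p(b | a) from empirical
    transition counts pooled over all sequences."""
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))
    total = sum(pair_counts.values())
    if total == 0:
        return 0.0
    # Number of transitions leaving each symbol.
    prev_counts = Counter()
    for (prev, _nxt), n in pair_counts.items():
        prev_counts[prev] += n
    rate = 0.0
    for (prev, _nxt), n in pair_counts.items():
        rate -= (n / total) * math.log2(n / prev_counts[prev])
    return rate
```

For example, a strictly alternating sequence such as `["q", "c", "q", "c"]` is fully predictable and yields a rate of zero, while sequences in which each tactic is followed by either tactic equally often yield one bit per transition. Lower values indicate more predictable search behaviour.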

Abstract

We propose a method for extractive summarization of audiovisual recordings focusing on topic-level segments. We first build a content similarity graph between all segments of all documents in the collection, using word vectors from the transcripts, and then select the most central segments for the summaries. We evaluate the method quantitatively on the AMI Meeting Corpus using gold standard reference summaries and the Rouge metric, and qualitatively on lecture recordings using a novel two-tiered approach with human judges. The results show that our method compares favorably with others in terms of Rouge, and outperforms the baselines for human scores, thus also validating our evaluation protocol.
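
The graph-based selection step can be sketched as follows. This is a simplified illustration rather than the method evaluated above: it assumes bag-of-words count vectors in place of the paper's word vectors, cosine similarity as the edge weight, and weighted degree as the centrality measure. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_central_segments(segments, k=1):
    """Return the indices (in original order) of the k segments with
    the highest summed similarity to all other segments."""
    # Bag-of-words vectors are an assumption of this sketch.
    vectors = [Counter(text.lower().split()) for text in segments]
    n = len(vectors)
    centrality = [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    ranked = sorted(range(n), key=lambda i: centrality[i], reverse=True)
    return sorted(ranked[:k])
```

A segment that shares vocabulary with many other segments receives a high score and is selected for the summary, while an off-topic segment that shares nothing with the rest scores zero.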

Abstract

Many people post about their daily life on social media. These posts may include information about the purchase activity of people, and insights useful to companies can be derived from them: e.g. profile information of a user who mentioned something about their product. As a further advanced analysis, we consider extracting users who are likely to buy a product from the set of users who mentioned that the product is attractive.
In this paper, we report our methodology for building a corpus for Twitter user purchase behavior prediction. First, we collected Twitter users who posted a want phrase + product name: e.g. "want a Xperia" as candidate want users, and also candidate bought users in the same way. Then, we asked an annotator to judge whether a candidate user actually bought a product. We also annotated whether tweets randomly sampled from want/bought user timelines are relevant or not to purchase. In this annotation, 58% of want user tweets and 35% of bought user tweets were annotated as relevant. Our data indicate that information embedded in timeline tweets can be used to predict purchase behavior of tweeted products.

Abstract

The negative effect of lapses during a behavior-change program has been shown to increase the risk of repeated lapses and, ultimately, program abandonment. In this paper, we examine the potential of system-driven lapse management -- supporting users through lapses as part of a behavior-change tool. We first review lessons from domains such as dieting and addiction research and discuss the design space of lapse management. We then explore the value of one approach to lapse management -- the use of "cheat points" as a way to encourage sustained participation. In an online study, we first examine interpretations of progress that was reached through using cheat points. We then present findings from a deployment of lapse management in a two-week field study with 30 participants. Our results demonstrate the potential of this approach to motivate and change users' behavior. We discuss important open questions for the design of future technology-mediated behavior change programs.