Blog Category: human-computer interaction

This year FXPAL is happy to present four papers (three long papers and one case study) at CHI 2018 in Montreal. Our featured work this year investigates the themes of Human-Centered Workstyle, Information Visualization, and Internet of Things.

When clinicians communicate with patients via video conferencing, they must not only exchange information but also convey a sense of sympathy, sensitivity, and attentiveness. However, video-mediated communication is often less effective than in-person communication because essential non-verbal behaviors, such as eye contact, vocal tone, and body posture, are challenging to convey and perceive. Moreover, non-verbal behaviors that may be acceptable in in-person business meetings, such as looking away at notes, may be perceived as rude or inattentive in a video meeting (patients already feel disengaged when clinicians frequently look at medical records instead of at them during in-person visits).

Prior work shows that in video visits, clinicians tend to speak more, dominating the conversation and showing less empathy toward patients, which can lead to poorer patient satisfaction and incomplete information gathering. Further, few clinicians are trained to communicate over video, and many are unaware of how they present themselves to patients on screen.

In our paper, I Should Listen More: Real-time Sensing and Feedback of Non-Verbal Communication in Video Telehealth, we describe the design and evaluation of ReflectLive, a system that senses and provides real-time feedback about clinicians’ communication behaviors during video consultations with patients. Our user tests showed that real-time sensing and feedback have the potential to train clinicians to maintain better eye contact with patients and to be more aware of their non-verbal behaviors.

The ReflectLive video meeting system, with the visualization dashboard on the right showing real-time metrics about non-verbal behaviors. Heather (in the thumbnail) is looking to the left. A red bar flashes on the left of her window as she looks to the side to remind her that her gaze is not centered on the other speaker. A counter shows the number of seconds and direction she is looking away.
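To give a flavor of the feedback loop in the dashboard, here is a minimal sketch (our own simplification, not the ReflectLive code) that turns per-frame horizontal gaze offsets into look-away events with a direction and a duration, much like the on-screen counter; the threshold, frame rate, and minimum duration are assumed values:

```python
# Sketch: flag sustained look-aways from per-frame horizontal gaze
# offsets (degrees; negative = left). Illustrative thresholds only.

def look_away_events(gaze_offsets_deg, fps=30, threshold_deg=15.0, min_seconds=1.0):
    """Return (start_frame, duration_seconds, direction) for each sustained look-away."""
    events = []
    start = None
    for i, offset in enumerate(gaze_offsets_deg + [0.0]):  # sentinel closes a trailing run
        if abs(offset) > threshold_deg:
            if start is None:
                start = i            # a look-away run begins
        elif start is not None:
            duration = (i - start) / fps
            if duration >= min_seconds:  # ignore brief glances
                direction = "left" if gaze_offsets_deg[start] < 0 else "right"
                events.append((start, duration, direction))
            start = None
    return events
```

A real system would drive this from a per-frame gaze estimator and flash the red bar while a run is open, rather than only reporting completed events.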

The FXPAL robotics research group has recently explored technologies for improving the usability of mobile telepresence robots. We evaluated a prototype head-tracked stereoscopic (HTS) teleoperation interface for a remote collaboration task. The results of this study indicate that using an HTS system reduces task errors and improves the perceived collaboration success and viewing experience.

We also developed a new focus-plus-context viewing technique for mobile robot teleoperation. This allows us to use wide-angle camera images that provide rich contextual visual awareness of the robot’s surroundings while preserving a distortion-free region in the middle of the camera view.
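The core idea can be illustrated with a one-dimensional radial mapping (a toy formulation of ours, not the published technique): radii inside the focus region pass through unchanged, while the wide-angle periphery is linearly compressed to fit the display.

```python
# Toy focus-plus-context radial mapping. All radii are normalized;
# focus_r, source_max, and display_max are illustrative assumptions.

def remap_radius(r, focus_r=0.4, source_max=2.0, display_max=1.0):
    """Map a source-image radius to a display radius.

    Inside focus_r the mapping is the identity (distortion-free);
    outside, the remaining source range is linearly compressed into
    the remaining display range.
    """
    if r <= focus_r:
        return r
    scale = (display_max - focus_r) / (source_max - focus_r)
    return focus_r + (r - focus_r) * scale
```

A full implementation would apply such a mapping per pixel (and typically with a smooth, rather than piecewise-linear, transition) when warping the wide-angle frame.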

To this, we added a semi-automatic robot control method that allows operators to navigate the telepresence robot by pointing and clicking directly on the camera image feed. This through-the-screen interaction paradigm has the advantage of decoupling operators from the robot control loop, freeing them for other tasks besides driving the robot.
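A sketch of how a clicked pixel could become a drive target, assuming a simple pinhole camera looking straight ahead above a flat floor (our own illustration, not FXPAL's implementation):

```python
import math

# Project a clicked pixel onto the ground plane to get a drive target
# in robot coordinates. Pinhole model and flat floor are assumptions.

def click_to_ground(u, v, image_w, image_h, fov_deg=90.0, cam_height=1.2):
    """Return (forward, lateral) ground target in meters for pixel (u, v).

    Clicks at or above the horizon have no ground intersection
    and return None.
    """
    f = (image_w / 2) / math.tan(math.radians(fov_deg) / 2)  # focal length in pixels
    x = u - image_w / 2          # pixel offset right of center
    y = v - image_h / 2          # pixel offset below center (image y grows downward)
    if y <= 0:
        return None              # at or above the horizon
    forward = cam_height * f / y # similar triangles: camera height over ray slope
    lateral = forward * x / f
    return (forward, lateral)
```

The robot's local planner would then drive toward the returned point; a wide-angle lens would additionally require undistorting the click coordinates first.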

Several of us just returned from ACM UIST 2014 where we presented some new work as part of the cemint project. One vision of the cemint project is to build applications for multimedia content manipulation and reuse that are as powerful as their analogues for text content. We are working towards this goal by exploiting two key tools. First, we want to use real-time content analysis to expose useful structure within multimedia content. Given some decomposition of the content, which can be spatial, temporal, or even semantic, we then allow users to interact with these sub-units or segments via direct manipulation. Last year, we began exploring these ideas in our work on content-based video copy and paste.

As another embodiment of these ideas, we demonstrated video text retouch at UIST last week. Our browser-based system performs real-time text detection on streamed video frames to locate both words and lines. When a user clicks on a frame, a live cursor appears next to the nearest word. At this point, users can alter text directly using the keyboard. When they do so, a video overlay is created to capture and display their edits.

Because we perform per-frame text detection, as the position of edited text shifts vertically or horizontally in the course of the original (unedited source) video, we can track the corresponding line’s location and update the overlaid content appropriately.
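One simple way such overlay tracking could work (our illustration, not the cemint code) is to snap the overlay to the detected text line nearest its last known position each frame, rejecting implausibly large jumps:

```python
# Follow an edit overlay across frames: each frame's text detector
# yields line origins, and the overlay snaps to the nearest one.
# max_jump is an assumed sanity bound in pixels.

def track_overlay(last_pos, detected_lines, max_jump=40):
    """Return the updated overlay position, or last_pos if no line is close enough.

    last_pos and detected_lines entries are (x, y) line origins in pixels.
    """
    if not detected_lines:
        return last_pos
    def dist2(p):
        return (p[0] - last_pos[0]) ** 2 + (p[1] - last_pos[1]) ** 2
    nearest = min(detected_lines, key=dist2)
    return nearest if dist2(nearest) <= max_jump ** 2 else last_pos
```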

By leveraging our familiarity with manipulating text, this work exemplifies the larger goal of bringing interaction metaphors rooted in content creation to the consumption and reuse of live multimedia streams. We believe that integrating real-time content analysis and interaction design can help us create improved tools for multimedia content usage.

In a recent paper published at SUI 2014, “Exploring Gestural Interaction in Smart Spaces using Head-Mounted Devices with Ego-Centric Sensing,” co-authored with Barry Kollee and Tony Dunnigan, we studied a prototype Head-Mounted Device (HMD) that allows interaction with external displays through spatial gesture input.

In the paper, one of our goals was to expand the scope of interaction possibilities on HMDs, which are currently severely limited if we consider Google Glass as a baseline. Glass has only a small touch pad, placed at an awkward position on the device’s rim at the user’s temple. The other input modalities Glass offers are eye-blink input and voice recognition. While eye blink can be effective as a binary input mechanism, in many situations it is rather limited and can be considered socially awkward. Voice input suffers from recognition errors for non-native speakers of the input language and has considerable lag, as current Android-based devices, such as Google Glass, perform speech-to-text in the cloud. These problems were also observed in the main study of our paper.

We thus proposed three gestural selection techniques in order to extend the input capabilities of HMDs: (1) a head nod gesture, (2) a hand movement gesture and (3) a hand grasping gesture.
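To give a flavor of the sensing involved, here is a toy head-nod detector over head-pitch samples (our own illustration, not the study's implementation; it assumes a downward nod produces negative pitch values):

```python
# Detect a single nod in a sequence of head-pitch samples (degrees).
# Amplitude threshold and return tolerance are assumed values.

def detect_nod(pitch_deg, min_amplitude=10.0):
    """True if the pitch dips down by at least min_amplitude and returns near baseline."""
    baseline = pitch_deg[0]
    lowest = min(pitch_deg)
    returned = abs(pitch_deg[-1] - baseline) < min_amplitude / 2
    return (baseline - lowest) >= min_amplitude and returned
```

A production detector would run over a sliding window of IMU samples and also bound the nod's duration.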

The following mock-up video shows the three proposed gestures used in a scenario depicting a material selection session in a (hypothetical) smart space used by architects:

We discounted the head nod gesture after a preliminary study showed a low user preference for such an input method. In a main study, we found that the two gestural techniques achieved performance similar to a baseline technique using the touch pad on Google Glass. However, we hypothesize that the spatial gestural techniques using direct manipulation may outperform the touch pad for larger numbers of selectable targets (in our study we had 12 targets in total), as secondary GUI navigation activities (i.e., scrolling a list view) are not required when using gestures.

In the paper, we also present some possibilities for ad-hoc control of large displays and automated indoor systems:

Ambient light control using spatial gestures tracked via an HMD.

Considering the larger picture, our paper touches on the broader question of ego-centric vs. exo-centric tracking: past work in smart spaces has mainly relied on external (exo-centric) tracking techniques, e.g., using depth sensors such as the Kinect for user tracking and interaction. As wearable devices grow increasingly powerful and depth sensor technology shrinks, it may become more practical for users to bring their own sensors to a smart space. This has advantages in scalability: more users can be tracked in larger spaces without additional investments in fixed tracking systems. Also, a larger number of spaces can be made interactive, as users carry their sensing equipment from place to place.

At ICME 2014 in Chengdu, China, we presented a technical demo called “Gesture Viewport,” which is a projector-camera system that enables finger gesture interactions with media content on any surface. In the demo, we used a portable Pico projector to project a viewport widget (along with its content) onto a desktop and a Logitech webcam to monitor the viewport widget. We proposed a novel and computationally efficient finger localization method based on the detection of occlusion patterns inside a virtual “sensor” grid rendered in a layer on top of the viewport widget. We developed several robust interaction techniques to prevent unintentional gestures from occurring, to provide visual feedback to a user, and to minimize the interference of the “sensor” grid with the media content. We showed the effectiveness of the system through three scenarios: viewing photos, navigating Google Maps, and controlling Google Street View. Click on the following link to watch a short video clip that illustrates these scenarios.
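A simplified stand-in for the occlusion-pattern idea (the real method is described in the paper; this is our own toy version): render a virtual grid of sensor cells over the viewport, let the camera report which cells are occluded, and take the fingertip to be the occluded cell farthest from the grid border, since a finger enters from an edge and its tip is the innermost occluded cell.

```python
# Toy occlusion-grid fingertip localization.

def locate_fingertip(occluded):
    """occluded: 2D list of booleans (rows x cols). Returns (row, col) or None."""
    rows, cols = len(occluded), len(occluded[0])
    best, best_depth = None, -1
    for r in range(rows):
        for c in range(cols):
            if occluded[r][c]:
                # Distance (in cells) from the nearest grid border.
                depth = min(r, c, rows - 1 - r, cols - 1 - c)
                if depth > best_depth:
                    best, best_depth = (r, c), depth
    return best
```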

Many attendees who saw the demo were impressed, finding the idea behind it, the proposed occlusion-pattern-based finger localization method, very clever. That is probably a big reason why we won the Best Demo Award at ICME 2014. For more details of the demo, please refer to this paper.

People often use more than one query when searching for information. We revisit search results to re-find information, and we build an understanding of our search need through iterative explorations of query formulation. Unfortunately, these tasks are not well supported by search interfaces and web browsers. The only indication of our search process we get is a differently colored link to pages we have already visited. In our previous research, we found that a simple query preview widget helped people formulate more successful queries and explore the search results more efficiently. However, the query preview widget would not work with regular search engines since it required back-end support. To bring support for exploratory search to common search engines, such as Google, Bing, or Yahoo, we designed and built a Chrome browser plug-in, SearchPanel.

SearchPanel collects and visualizes information about the retrieved web pages in a small panel next to the search results. At a glance, a searcher can see which web pages have previously been retrieved, visited, and bookmarked. Each search result is represented as a bar in SearchPanel. If a web page has a favicon, it is included in the bar (2) to help scanning and navigation of the search results. The color of the bar (3) indicates retrieval status (teal = new, light blue = previously retrieved but not viewed, and dark blue = previously retrieved and viewed). The length of the bar (5) indicates how many times a web page has been visited; a shorter bar indicates more visits. If a web page in the results list has previously been bookmarked, a yellow star is shown next to the bar (6). Users can easily re-run the same query with a different search engine by selecting one of the search engine buttons (1). When the user navigates to a web page linked in the search results, a white circle (4) is shown next to the bar representing that search result. This circle persists even if the user continues to follow links away from that page.
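The visual encoding above can be summarized in a few lines (colors and the visit-count-to-length rule follow the description; the exact pixel values are our own assumptions):

```python
# Map one search result's history to SearchPanel-style bar properties.
# max_len, shrink_per_visit, and min_len are illustrative values.

def bar_style(previously_retrieved, viewed, visits, bookmarked,
              max_len=100, shrink_per_visit=15, min_len=25):
    """Return (color, length_px, starred) for one search-result bar."""
    if not previously_retrieved:
        color = "teal"          # new result
    elif not viewed:
        color = "lightblue"     # previously retrieved but not viewed
    else:
        color = "darkblue"      # previously retrieved and viewed
    # Shorter bar = more visits, clamped so the bar never vanishes.
    length = max(min_len, max_len - shrink_per_visit * visits)
    return color, length, bookmarked
```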

When moving away from the search page, SearchPanel stays put and provides a shortcut for accessing the search results. The search result being explored is indicated in SearchPanel by a circle. Moving the mouse over a bar in SearchPanel when not on the search page displays the search result snippet.

We evaluated SearchPanel in a real-world deployment and found that it appears to have been used primarily for complex information needs, in search sessions with long durations and high numbers of queries. For search sessions with a single query, we found very little use of SearchPanel. Based on our evaluation, we conclude that SearchPanel appears to be used in the way it was designed: when it is not needed, it stays out of the way unused, but when one simple query does not answer the search need, SearchPanel supports the information-seeking process. More details about SearchPanel can be found in our SIGIR 2014 paper.

Previous work has shown that passwords or PINs as an authentication mechanism have usability issues that ultimately lead to a compromise in security. For instance, as the number of services to authenticate to grows, users use variations of basic passwords, which are easier to remember, thus making their accounts susceptible to attack if one is compromised.

AirAuth addresses these issues by replacing password entry with a gesture. Motor memory makes it simple for most users to remember their gesture. Furthermore, since we track multiple points on the user’s hands, we obtain tracking information that is unique to the physical appearance of the legitimate user, so there is an implicit biometric built into AirAuth. Smudge attacks are averted by the touchless gesture entry, and a user study we conducted shows that AirAuth is also quite resistant to camera-based shoulder-surfing attacks.
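A common way to compare an entered gesture against an enrolled template is dynamic time warping (DTW), which tolerates variations in gesture speed; the sketch below uses a single 3-D trajectory and an assumed acceptance threshold, whereas AirAuth's actual matcher over multiple tracked points may differ.

```python
# Sketch: gesture authentication by DTW distance to an enrolled template.

def dtw(a, b):
    """DTW distance between two sequences of (x, y, z) points."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the aligned sample points.
            cost = sum((p - q) ** 2 for p, q in zip(a[i - 1], b[j - 1])) ** 0.5
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def authenticate(template, attempt, threshold=0.5):
    """Accept if the length-normalized DTW distance is under the threshold."""
    return dtw(template, attempt) / max(len(template), len(attempt)) <= threshold
```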

Our demo at CHI showed the enrollment and authentication phases of our system. We gave attendees the opportunity to enroll in our system and check AirAuth’s capabilities to recognize their gestures. We got great responses from the attendees and obtained enrollment gestures from a number of them. We plan to use these enrollment gestures to evaluate AirAuth’s accuracy in field conditions.

Touch input is now the preferred input method on mobile devices such as smartphones and tablets. Touch is also gaining traction in the desktop segment and is common for interaction with large table- or wall-based displays. At present, the majority of touch displays can detect only the location of a user’s touch. Some capacitive touch screens can also report the contact area of a touch, but usually no further information about individual touch inputs is available to developers of mobile applications.

It would, however, be beneficial to capture further properties of the user’s touch, for instance the finger’s rotation around the vertical axis (i.e., the axis orthogonal to the plane of the touch screen) as well as its tilt (see images above). Obtaining rotation and tilt information for a touch would allow for expressive localized input gestures as well as new types of on-screen widgets that make use of the additional local input degrees of freedom.

Having finger pose information together with touches adds local degrees of freedom of input at each touch location. This, for instance, allows the user interface designer to remap established multi-touch gestures, such as pinch-to-zoom, to other user interface functions, or to free up screen space by allowing input (e.g., adjusting a slider value, scrolling a list, panning a map view, enlarging a picture) to be performed at a single touch location where a (multi-)touch gesture requiring a significant amount of screen space would usually be needed. New graphical user interface widgets that make use of finger pose information, such as rolling context menus, hidden flaps, or occlusion-aware widgets, have also been suggested.

Our PointPose prototype performs finger pose estimation at the location of touch using a short-range depth sensor viewing the touch screen of a mobile device. We use the point cloud generated by the depth sensor for finger pose estimation. PointPose estimates the finger pose of a user touch by fitting a cylindrical model to the subset of the point cloud that corresponds to the user’s finger. We use the spatial location of the user’s touch to seed the search for that subset of the point cloud.
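A simplified stand-in for the cylinder fit (illustration only, not the PointPose algorithm): estimate the finger's axis as the principal direction of the point subset near the touch, then read yaw and tilt off that axis.

```python
import math

# Estimate a finger axis from a 3-D point subset via power iteration
# on the covariance matrix, then derive yaw and tilt. The screen is
# assumed to be the x-y plane, with z pointing away from it.

def principal_axis(points, iters=50):
    """Dominant eigenvector of the covariance of 3-D points."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    centered = [(p[0] - cx, p[1] - cy, p[2] - cz) for p in points]
    cov = [[sum(p[i] * p[j] for p in centered) / n for j in range(3)] for i in range(3)]
    v = (1.0, 1.0, 1.0)
    for _ in range(iters):
        w = tuple(sum(cov[i][j] * v[j] for j in range(3)) for i in range(3))
        norm = math.sqrt(sum(c * c for c in w)) or 1.0
        v = tuple(c / norm for c in w)
    return v

def finger_pose(points):
    """Return (yaw_deg, tilt_deg) of the estimated finger axis."""
    ax, ay, az = principal_axis(points)
    yaw = math.degrees(math.atan2(ay, ax))
    tilt = math.degrees(math.atan2(abs(az), math.hypot(ax, ay)))
    return yaw, tilt
```

A cylinder fit as in the paper is more robust to noise than this bare principal-direction estimate, but the extracted angles play the same role.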

One advantage of our approach is that it does not require complex external tracking hardware (as in related work), and external computation is unnecessary as the finger pose extraction algorithm is efficient enough to run directly on the mobile device. This makes PointPose ideal for prototyping and developing novel mobile user interfaces that use finger pose estimation.

It is reasonably well-known that people who examine search results often don’t go past the first few hits, perhaps stopping at the “fold” or at the end of the first page. It’s a habit we’ve acquired thanks to high-quality results for precision-oriented information needs. Google has trained us well.

But this habit may not always be useful when confronted with uncommon, recall-oriented, information needs. That is, when doing research. Looking only at the top few documents places too much trust in the ranking algorithm. In our SIGIR 2013 paper, we investigated what happens when a light-weight preview mechanism gives searchers a glimpse at the distribution of documents — new, re-retrieved but not seen, and seen — in the query they are about to execute.
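The preview boils down to classifying a candidate query's result URLs against the session history before the query is run; a minimal sketch (labels from the description above, classification logic our own illustration):

```python
# Classify a query's result URLs against session history to build
# the lightweight preview distribution.

def preview_distribution(result_urls, retrieved_before, seen_before):
    """Count results that are new, re-retrieved but not seen, or seen.

    retrieved_before and seen_before are sets of URLs from the session.
    """
    counts = {"new": 0, "re-retrieved": 0, "seen": 0}
    for url in result_urls:
        if url in seen_before:
            counts["seen"] += 1
        elif url in retrieved_before:
            counts["re-retrieved"] += 1
        else:
            counts["new"] += 1
    return counts
```

Showing these counts beside the query box gives the searcher a glimpse of how much genuinely new material a query would surface before committing to it.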