Archive

Blog Category: multimedia

Teleconferencing is now a nearly ubiquitous aspect of modern work. We routinely use apps such as Google Hangouts or Skype to present work or discuss documents with remote colleagues. Unfortunately, sharing source documents is not always as seamless. For example, a meeting participant might share content via screencast that she has access to, but that the remote participant does not. Remote participants may also not have the right software to open the source document, or the content shared might be only a small section of a large document that is difficult to share.

Later this week in Vienna, we will present our work at DocEng on DocuGram, a tool we developed at FXPAL to help address these issues. DocuGram can capture and analyze shared screen content to automatically reconstitute documents. Furthermore, it can capture and integrate annotations and voice notes made as the content is shared.

The first video below describes DocuGram, and the second shows how we have integrated it into our teleconferencing tool, MixMeet. Check it out, and be sure to catch our talk on Friday, September 16th at 10:00AM.

Knowledge work is changing fast. Recent trends in increased teleconferencing bandwidth, the ubiquitous integration of “pads and tabs” into workaday life, and new expectations of workplace flexibility have precipitated an explosion of applications designed to help people collaborate from different places, times, and situations.

Over the last several months the MixMeet team observed and interviewed members of many different work teams in small-to-medium sized businesses that rely on remote collaboration technologies. In work we will present at ACM CSCW 2016, we found that despite the widespread adoption of frameworks designed to integrate information from a medley of devices and apps (such as Slack), employees utilize a surprisingly diverse but unintegrated set of tools to collaborate and get work done. People will hold meetings in one app while relying on another to share documents, or share some content live during a meeting while using other tools to put together multimedia documents to share later. In our CSCW paper, we highlight many reasons for this increasing diversification of work practice. But one issue that stands out is that videoconferencing tools tend not to support archiving and retrieving disparate information. Furthermore, tools that do offer archiving do not provide mechanisms for highlighting and finding the most important information.

In work we will present later this fall at ACM MM 2015 and ACM DocEng 2015, we describe new MixMeet features that address some of these concerns so that users can browse and search the contents of live meetings to rapidly retrieve previously shared content. These new features take advantage of MixMeet’s live processing pipeline to determine actions users take inside live document streams. In particular, the system monitors text and cursor motion in order to detect text edits, selections, and mouse gestures. MixMeet applies these extra signals to user searches to improve the quality of retrieved results and allow users to quickly filter a large archive of recorded meeting data to find relevant information.

In our ACM MM paper (and toward the end of the above video) we also describe how MixMeet supports table-top videoconferencing devices, such as Kubi. In current work, we are developing multiple tools to extend our support to other devices and meeting situations. Publications describing these new efforts are in the pipeline: stay tuned.

Google Glass’ semi-demise has become a topic of considerable interest lately. Alexander Sommer at WT-Vox takes the view that it was a courageous “public beta” and “a PR nightmare” but also well received in specialized situations where the application suits the device, as in Scott’s post below. IMO, a pretty good summary.

At the AAAI 2015 conference, we presented the work “Visually Interpreting Names as Demographic Attributes by Exploiting Click-Through Data,” a collaboration with a research team at National Taiwan University. This study aims to automatically associate a name with its likely demographic attributes, e.g., gender and ethnicity. More specifically, the associations are driven by web-scale search logs that are collected by a search engine when internet users retrieve images.

Demographic attributes are vital for semantically characterizing a person or a community, which makes them valuable for marketing, personalization, face retrieval, social computing, and other human-centric research. Since users tend to keep their online profiles private, a name is often the most accessible piece of personal information in these contexts. The problem we address is this: given a name, predict its likely demographic attributes. For example, a person named “Amy Liu” is likely an Asian female. A name makes the first impression of a person because naming conventions are strongly influenced by culture, e.g., first name and gender, last name and location of origin. Typically, associations between names and these two attributes are made by referring to demographics maintained by governments or by manually labeling attributes based on the given personal information (e.g., a photo). The former is limited to regional census data. The latter raises major concerns about time and cost when applied to large-scale data.

Unlike prior approaches, we propose to exploit click-throughs between text queries and retrieved face images in web search logs, where the names are extracted from queries and the attributes are detected from face images automatically. In this paper, a click-through occurs when a user clicks one of the URLs returned by a text query to view the web image it points to. The mechanism has two useful properties: (1) the association between a query and an image is based on viewers’ clicks, that is, human intelligence from web-scale users; (2) users are likely to have considerable knowledge of the associations because they may be partially aware of what they are looking for, and search engines are getting much better at satisfying user intent. Both characteristics of click-throughs reduce concerns about incorrect associations. Moreover, Internet users’ knowledge enables discovering name-attribute associations that generalize to more countries.
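To make the idea concrete, here is a minimal sketch of how click-throughs could be aggregated into name-attribute associations. It is not the model from the paper: it assumes a simple click-weighted frequency estimate, and the function name `associate`, the log format, and the toy entries are all hypothetical.

```python
from collections import Counter, defaultdict


def associate(click_log):
    """Estimate P(attribute | first name) from click-throughs.

    click_log is an iterable of (query_name, detected_attribute, clicks)
    tuples (a hypothetical format), where detected_attribute is the label
    an automatic face-attribute detector assigned to the clicked image.
    """
    counts = defaultdict(Counter)
    for query_name, attribute, clicks in click_log:
        # Assumption: the first token of the query is the first name.
        first_name = query_name.split()[0].lower()
        counts[first_name][attribute] += clicks
    # Normalize click counts into a distribution per name.
    return {
        name: {attr: n / sum(c.values()) for attr, n in c.items()}
        for name, c in counts.items()
    }


# Toy, made-up log entries for illustration only.
toy_log = [
    ("Amy Liu", "female", 8),
    ("Amy Adams", "female", 10),
    ("Amy Smith", "male", 2),
]
print(associate(toy_log)["amy"])
```

Weighting by clicks rather than by the number of returned images reflects the point above: a click is a small human judgment that the image matches the query.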

In the experiments, the proposed name-attribute associations achieve accuracy competitive with manual labeling. They also benefit the profiling of social media users and keyword-based face image retrieval, especially in adapting to unseen names. This is the first work to interpret a name as demographic attributes in a visual-data-driven manner using web search logs. In the future, we plan to extend the visual interpretation of an abstract name to more targets for which naming conventions are highly influenced by visual appearance.

The authors found that Google Glass was helpful for the more difficult task, enabling better and more frequent communication, while for the simpler task the results were mixed. This more-or-less agrees with our findings: HMDs are helpful for capturing and communicating complicated tasks but less so for table-top tasks.

Another key difference between this work and ours is that the authors relied on Google Hangouts to stream videos. However, as the authors write, “the HMD interface of Google Hangouts used in our study did not offer [live preview feedback],” a key feature for any media capture application.

At FXPAL, we build systems when we are limited by off-the-shelf technology. So when we discovered a related capture feedback issue in early pilots we were able to quickly fix it in our tool. Of course in our case the technology was much simpler because we did not need to implement video streaming. However, since this paper was published we have developed mechanisms to stream video from Glass, or any Android device, using open WebRTC protocols. More than that, our framework can analyze incoming frames and then stream out arbitrary image data, potentially allowing us to implement many of the design implications the authors describe in the paper’s discussion section.

Creating multimedia tutorials requires two distinct steps: capture and editing. While editing, authors have the opportunity to devote their full attention to the task at hand. Capture is different. In the best case, capture should be completely unobtrusive so that the author can focus exclusively on the task being captured. But this can be difficult to achieve with handheld devices, especially if the task requires that the tutorial author move around an object and use both hands simultaneously (e.g., showing how to replace a bike derailleur).

For this reason, we extended our ShowHow multimedia tutorial system to support head-mounted capture. Our first approach was simple: a modified pair of glasses with a Looxcie camera and laser guide attached. While this approach interfered with the user’s vision less than other solutions, such as a full augmented reality system, it nonetheless suffered from an array of problems: it was bulky, it was difficult to control, and without display feedback showing the captured area it was hard to frame videos and photos.

Our first head-mounted capture prototype

Luckily, Google Glass launched around this time. With an onboard camera, a touch panel, and display, it seemed an excellent choice for head-mounted capture.

Our video application to the Glass Explorers program

To test this, we built an app for Google Glass that requires minimal attention to the capture device and instead allows the author to focus on creating the tutorial content. In our paper, we describe a study comparing standalone capture (camera on tripod) versus head-mounted (Google Glass) capture. Details are in the paper, but in short we found that tutorial authors prefer wearable capture devices, especially when recording activities involving larger objects in non-tabletop environments.

Finally, based on the success of Glass for capture we built and tested an access app as well. A detailed description of the tool, as well as another study we ran testing its efficacy for viewing multimedia tutorials, is the subject of an upcoming paper. Stay tuned.

Several of us just returned from ACM UIST 2014 where we presented some new work as part of the cemint project. One vision of the cemint project is to build applications for multimedia content manipulation and reuse that are as powerful as their analogues for text content. We are working towards this goal by exploiting two key tools. First, we want to use real-time content analysis to expose useful structure within multimedia content. Given some decomposition of the content, which can be spatial, temporal, or even semantic, we then allow users to interact with these sub-units or segments via direct manipulation. Last year, we began exploring these ideas in our work on content-based video copy and paste.

As another embodiment of these ideas, we demonstrated video text retouch at UIST last week. Our browser-based system performs real-time text detection on streamed video frames to locate both words and lines. When a user clicks on a frame, a live cursor appears next to the nearest word. At this point, users can alter text directly using the keyboard. When they do so, a video overlay is created to capture and display their edits.

Because we perform per-frame text detection, as the position of edited text shifts vertically or horizontally in the course of the original (unedited source) video, we can track the corresponding line’s location and update the overlaid content appropriately.
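A rough sketch of this kind of tracking follows. It assumes each frame's text detector returns line bounding boxes as `(x, y, width, height)` tuples and that the edited line is matched to the nearest detected box in the next frame. The matching rule, the `max_shift` threshold, and the function names are illustrative assumptions, not the system's actual method.

```python
def center(box):
    """Center point of an (x, y, width, height) box."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def track_line(previous_box, detected_boxes, max_shift=40):
    """Return the detected box whose center is closest to previous_box.

    Returns None when every detected box has moved more than max_shift
    pixels, in which case the overlay could be hidden until the line
    is found again.
    """
    px, py = center(previous_box)
    best, best_dist = None, max_shift
    for box in detected_boxes:
        cx, cy = center(box)
        dist = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
        if dist <= best_dist:
            best, best_dist = box, dist
    return best


# The edited line was at y=200; in the next frame the page scrolled by 12 px.
edited = (100, 200, 300, 20)
next_frame = [(100, 150, 300, 20), (100, 212, 300, 20), (100, 260, 300, 20)]
print(track_line(edited, next_frame))
```

The overlay would then be redrawn at the returned box on every frame, so the edit follows the line as the source video scrolls.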

By leveraging our familiarity with manipulating text, this work exemplifies the larger goal of bringing interaction metaphors rooted in content creation to both the consumption and reuse of live multimedia streams. We believe that integrating real-time content analysis and interaction design can help us create improved tools for multimedia content usage.

In an ACM interactions piece published this month we introduce our latest work in multimedia document research. Cemint (for Component Extraction from Media for Interaction, Navigation, and Transformation) is a set of tools to support seamless intermedia synthesis and interaction. In our interactions piece we argue that authoring and reuse tools for dynamic, visual media should match the power and ease of use of their static textual media analogues. Our goal with this work is to allow people to use familiar metaphors, such as copy-and-paste, to construct and interact with multimedia documents.

Cemint applications will span a range of communication methods. Our early work focused on support for asynchronous media extraction and navigation, but we are currently building a tool using these techniques that can support live, web-based meetings. We will present this new tool at DocEng 2014 — stay tuned!

Visual search has developed a basic processing pipeline in the last decade or so on top of the “bag of visual words” representation based on local image descriptors. You know it’s established when it’s in Wikipedia. There’s been a steady stream of work on image matching using the representation in combination with approximate nearest neighbor search and various downstream geometric verification strategies.

In practice, the most computationally daunting stage can be the construction of the visual codebook, which is usually accomplished via k-means or tree-structured vector quantization. The problem is to cluster (possibly billions of) local descriptors, and this offline clustering may need to be repeated when there are any significant changes to the image database. Each descriptor cluster is represented by one element in a visual vocabulary (codebook). In turn, each image is represented by a bag (vector) of these visual words (quantized descriptors).
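For readers new to this pipeline, the codebook and bag-of-words steps can be sketched in a few lines. This is a toy version using plain Lloyd's k-means on small tuples, with function names chosen for illustration. Real systems cluster high-dimensional descriptors at a far larger scale, which is exactly where the cost described above comes from.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; the k returned centers form the codebook."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster goes empty
                centers[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers


def bag_of_words(descriptors, codebook):
    """Histogram of visual-word counts for one image's descriptors."""
    histogram = [0] * len(codebook)
    for d in descriptors:
        histogram[nearest(d, codebook)] += 1
    return histogram


# Toy 2-D "descriptors" forming two obvious clusters.
all_descriptors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                   (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
codebook = kmeans(all_descriptors, k=2)
print(bag_of_words([(0.0, 0.0), (5.0, 5.0), (5.0, 5.1)], codebook))
```

Note that the codebook depends on every descriptor in the database, so significant changes to the database can force the clustering to be rerun.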

Building on previous work on high accuracy scalable visual search, a recent FXPAL paper at ACM ICMR 2014 proposes Vector Quantization Free (VQF) search using projective hashing in combination with binary valued local image descriptors. Recent years have seen the development of binary descriptors such as ORB or BRIEF that improve efficiency with negligible loss of accuracy in various matching scenarios. Rather than clustering the collected descriptors harvested globally from the image database, the codebook is implicitly defined via projective hashing. Subsets of the elements of ORB descriptors are hashed by projection (i.e. all but a small number of bits are discarded) to form an index table, as below.

By creating multiple different tables, image matching is implemented by a voting scheme based on the number of collisions (i.e. partial matches) between the descriptors in a test image and those in a database image.
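The indexing and voting steps can be sketched roughly as follows. This is not the paper's implementation: it assumes descriptors are stored as 256-bit Python integers (ORB descriptors are 256 bits long), that each table keeps a random subset of bit positions, and that the table count and bits per table are arbitrary illustrative values.

```python
import random
from collections import defaultdict

DESCRIPTOR_BITS = 256  # length of an ORB descriptor
BITS_PER_TABLE = 16    # assumed number of bits kept per table
NUM_TABLES = 8         # assumed number of tables


def make_projections(num_tables=NUM_TABLES, bits=BITS_PER_TABLE, seed=0):
    """Pick a random subset of bit positions for each table."""
    rng = random.Random(seed)
    return [rng.sample(range(DESCRIPTOR_BITS), bits)
            for _ in range(num_tables)]


def project(descriptor, positions):
    """Keep only the selected bits of an integer-encoded descriptor."""
    key = 0
    for pos in positions:
        key = (key << 1) | ((descriptor >> pos) & 1)
    return key


def build_index(database, projections):
    """database maps image id -> list of descriptors (Python ints)."""
    tables = [defaultdict(set) for _ in projections]
    for image_id, descriptors in database.items():
        for d in descriptors:
            for table, positions in zip(tables, projections):
                table[project(d, positions)].add(image_id)
    return tables


def query(descriptors, tables, projections):
    """Rank database images by the number of hash collisions (votes)."""
    votes = defaultdict(int)
    for d in descriptors:
        for table, positions in zip(tables, projections):
            for image_id in table.get(project(d, positions), ()):
                votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])


# Random stand-ins for real ORB descriptors.
rng = random.Random(1)
database = {name: [rng.getrandbits(DESCRIPTOR_BITS) for _ in range(50)]
            for name in ("a", "b")}
projections = make_projections()
tables = build_index(database, projections)
print(query(database["a"], tables, projections)[0])
```

No clustering step appears anywhere: adding an image only means inserting its descriptors into the tables, which is where the efficiency and scalability gains come from. A descriptor that differs from a stored one in a few bits still collides in every table whose projected bits are unaffected, which gives the partial matches mentioned above.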

The paper presents experimental results on image databases that validate the expected significant increase in efficiency and scalability using the VQF approach. The results also show improved performance over some competitive baselines in near duplicate image search. There remain some interesting questions for future work to understand tradeoffs around the size of the hash tables (governed by the number of bits projected) and the number of tables required to deliver a desired level of performance.

This week at the ACM Conference on Document Engineering, Laurent and Scott are presenting new work on direct manipulation of video. The ShowHow project is our latest activity involving expository or “how to” video creation and use. While watching videos of this genre, it is helpful to create annotations that identify useful frames or shots using ShowHow’s annotation capability directly, or by creating a separate multimedia notes document. The primary purpose of such annotation is for later reference, or incorporation into other videos or documents. While browser history might be able to get you back to a specific video you watched previously, it won’t readily get you to a specific portion of a much longer source video, or provide you with the broader context in which you found that portion of the video noteworthy. ShowHow enables users to create rich annotations around expository video that optionally include images, audio, or text to preserve this contextual information.

For creating these annotations, copy and paste functionality from the source video is desirable. This could be selecting a (sub)frame as an image or even selecting text shown in the video. Also, we demonstrate capturing dynamic activity across frames in a simple animated GIF for easy copy and paste from video to the clipboard. There are interaction design challenges here, and especially as more content is viewed on mobile/touch devices, direct manipulation provides a natural means for fine control of selection.

Under the hood, content analysis is required to identify events in the video to help drive the user interaction. In this case, the analysis is implemented in JavaScript and runs in the browser in which the video is being played, so efficient implementations of standard image analysis tools such as region segmentation, edge detection, and region tracking are required. There’s a natural tradeoff between robustness and efficiency here that constrains the content processing techniques.

The interaction enabled by the system is probably best described in the video below: