Abstract

We present a method for picture detection in document page images, which may come from scans, camera captures, or renderings of electronic file formats. Our method uses OCR to separate out the text and applies the Normalized Cuts algorithm to cluster the non-text pixels into picture regions. A refinement step uses the captions found in the OCR text to deduce how many pictures are in a picture region, thereby correcting for under- and over-segmentation. A performance evaluation scheme is applied which takes into account both detection quality and fragmentation quality. We benchmark our method against the ABBYY application on page images from conference papers.
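The clustering step can be sketched with a minimal two-way normalized cut over non-text pixel coordinates: build a Gaussian affinity matrix, form the symmetric normalized Laplacian, and threshold the sign of its second-smallest eigenvector. This is an illustrative toy (hypothetical data, no OCR or caption refinement), not the paper's full pipeline.

```python
import numpy as np

def two_way_ncut(points, sigma=1.0):
    """Split 2-D points into two clusters via the classic normalized-cut
    relaxation: threshold the second-smallest eigenvector of the
    symmetric normalized Laplacian."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian affinity
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(points)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)
    fiedler = vecs[:, 1]                        # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# two blobs standing in for non-text pixel coordinates of two pictures
pts = np.array([[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]], float)
labels = two_way_ncut(pts)
print(labels)
```

In practice the full algorithm recursively partitions and operates on downsampled pixel blocks rather than raw pixels, but the eigenvector thresholding above is the core of the normalized-cut criterion.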

Abstract

Browsing and searching for documents in large, online enterprise document repositories are common activities. While internet search produces satisfying results for most user queries, enterprise search has not been as successful because of differences in document types and user requirements. To support users in finding the information they need in their online enterprise repository, we created DocuBrowse, a faceted document browsing and search system.
Search results are presented within the user-created document hierarchy, showing only directories and documents matching selected facets and containing text query terms. In addition to file properties such as date and file size, automatically detected document types, or genres, serve as one of the search facets. Highlighting draws the user’s attention to the most promising directories and documents while thumbnail images and automatically identified keyphrases help select appropriate documents. DocuBrowse utilizes document similarities, browsing histories, and recommender system techniques to suggest additional promising documents for the current facet and content filters.
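The faceted filtering described above can be illustrated with a small sketch: keep only documents matching every selected facet and containing the query terms, then surface the directories that still hold matches. All names, fields, and data here are hypothetical, not DocuBrowse's actual data model.

```python
# Hypothetical minimal facet filter; fields are illustrative.
docs = [
    {"path": "reports/q1.pdf",   "genre": "paper",  "year": 2009, "text": "quarterly revenue summary"},
    {"path": "reports/deck.ppt", "genre": "slides", "year": 2009, "text": "revenue highlights"},
    {"path": "photos/team.jpg",  "genre": "photo",  "year": 2008, "text": ""},
]

def facet_search(docs, facets, query=""):
    """Keep documents matching every selected facet and containing the query."""
    hits = [d for d in docs
            if all(d.get(k) == v for k, v in facets.items())
            and query.lower() in d["text"].lower()]
    # directories that still contain matches would be highlighted in the hierarchy
    dirs = sorted({d["path"].rsplit("/", 1)[0] for d in hits})
    return hits, dirs

hits, dirs = facet_search(docs, {"genre": "paper"}, "revenue")
print([d["path"] for d in hits], dirs)
```

The real system additionally ranks directories and documents by promise and folds in recommender signals; the sketch only shows the conjunctive facet-plus-query filter.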

Abstract

Browsing and searching for documents in large, online enterprise document repositories are increasingly common tasks. While users are familiar with, and usually satisfied by, Internet search results, enterprise search has not been as successful because of differences in data types and user requirements. To support users in finding the information they need from electronic and scanned documents in their online enterprise repository, we created an automatic detector for genres such as papers, slides, tables, and photos. Several of those genres correspond roughly to file name extensions but are identified automatically using features of the document. This genre identifier plays an important role in our faceted document browsing and search system. The system presents documents in a hierarchy as typically found in enterprise document collections. Documents and directories are filtered to show only documents matching selected facets and containing optional query terms and to highlight promising directories. Thumbnail images and automatically identified keyphrases help select desired documents.
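A genre detector of this kind maps document-level features to a label. The toy rules below convey the idea with entirely illustrative feature names and thresholds; the actual detector's features and decision procedure are those described in the paper, not this sketch.

```python
def guess_genre(features):
    """Toy rule-based genre guesser; feature names and thresholds are
    illustrative, not the detector's actual model."""
    if features["is_raster_image"] and features["text_fraction"] < 0.05:
        return "photo"
    if features["avg_font_size"] > 20 and features["pages"] < 40:
        return "slides"
    if features["table_fraction"] > 0.5:
        return "table"
    return "paper"

print(guess_genre({"is_raster_image": False, "avg_font_size": 28,
                   "pages": 12, "text_fraction": 0.3, "table_fraction": 0.1}))
```

Note that such rules deliberately ignore file name extensions, so a slide deck exported to PDF is still recognized as slides from its layout features.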

Abstract

Audio monitoring has many applications but also raises privacy concerns. In an attempt to help alleviate these concerns, we have developed a method for reducing the intelligibility of speech while preserving intonation and the ability to recognize most environmental sounds. The method is based on identifying vocalic regions and replacing the vocal tract transfer function of these regions with the transfer function from prerecorded vowels, where the identity of the replacement vowel is independent of the identity of the spoken syllable. The audio signal is then re-synthesized using the original pitch and energy, but with the modified vocal tract transfer function. We performed an intelligibility study which showed that environmental sounds remained recognizable while speech intelligibility was dramatically reduced, to a 7% word recognition rate.
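The transfer-function swap can be sketched as an LPC analysis/synthesis loop: inverse-filter the spoken frame with its own all-pole model to obtain an excitation residual (which carries pitch and energy), then resynthesize through the replacement vowel's all-pole model. This is an illustrative toy on synthetic frames, not the paper's full pipeline (no vocalic-region detection, single frame, arbitrary model order).

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """All-pole coefficients via the autocorrelation method,
    solving the Toeplitz normal (Yule-Walker) equations directly."""
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))        # filter form [1, -a1, ..., -ap]

# toy "spoken" frame and a prerecorded replacement-vowel frame (synthetic)
rng = np.random.default_rng(0)
t = np.arange(400) / 8000.0
spoken = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t) \
         + 0.05 * rng.standard_normal(400)
vowel  = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t) \
         + 0.05 * rng.standard_normal(400)

a_spoken, a_vowel = lpc(spoken, 8), lpc(vowel, 8)
residual = lfilter(a_spoken, [1.0], spoken)   # inverse filter: keeps pitch/energy
swapped  = lfilter([1.0], a_vowel, residual)  # resynthesize with vowel's envelope
print(swapped.shape)
```

Because the autocorrelation method yields a minimum-phase model, the synthesis filter is stable, so the swapped frame stays bounded even though its spectral envelope now belongs to the replacement vowel.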

Abstract

In this paper we describe methods for video summarization in the context of the TRECVID 2008 BBC Rushes Summarization task. Color, motion, and audio features are used to segment, filter, and cluster the video. We experiment with varying the segment similarity measure to improve the joint clustering of segments with and without camera motion. Compared to our previous effort for TRECVID 2007, we have reduced the complexity of the summarization process as well as the visual complexity of the summaries themselves. We find our objective (inclusion) performance to be competitive with that of systems exhibiting similar subjective performance.
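One way to vary the segment similarity measure along the lines described is to discount similarity across the motion/static boundary so that moving and static segments are not merged too eagerly. The sketch below uses histogram intersection with an illustrative penalty factor; the features, data, and constant are hypothetical, not the submitted system's tuned measure.

```python
import numpy as np

def seg_similarity(a, b, motion_penalty=0.5):
    """Histogram intersection, discounted when one segment has camera
    motion and the other does not (an illustrative variant of tuning
    the similarity measure for joint clustering)."""
    sim = np.minimum(a["hist"], b["hist"]).sum()
    if a["motion"] != b["motion"]:
        sim *= motion_penalty
    return sim

segs = [
    {"hist": np.array([0.7, 0.2, 0.1]), "motion": False},
    {"hist": np.array([0.6, 0.3, 0.1]), "motion": False},
    {"hist": np.array([0.6, 0.3, 0.1]), "motion": True},
]
print(round(seg_similarity(segs[0], segs[1]), 2),
      round(seg_similarity(segs[0], segs[2]), 2))
```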

Abstract

Current approaches to pose estimation and tracking can be classified into two categories: generative and discriminative. While generative approaches can accurately determine human pose from image observations, they are computationally intractable due to search in the high-dimensional human pose space. On the other hand, discriminative approaches do not generalize well, but are computationally efficient. We present a hybrid model that combines the strengths of the two in an integrated learning and inference framework. We extend the Gaussian process latent variable model (GPLVM) to include an embedding from observation space (the space of image features) to the latent space. GPLVM is a generative model, but the inclusion of this mapping provides a discriminative component, making the model observation driven. Observation Driven GPLVM (OD-GPLVM) not only provides a faster inference approach, but also more accurate estimates (compared to GPLVM) in cases where dynamics are not sufficient for the initialization of search in the latent space.
We also extend OD-GPLVM to learn and estimate poses from parameterized actions/gestures. Parameterized gestures are actions which exhibit large systematic variation in joint angle space across instances due to differences in contextual variables. For example, the joint angles in a forehand tennis shot are a function of the height of the ball (Figure 2). We learn these systematic variations as a function of the contextual variables. We then present an approach that uses information from the scene and objects to provide context for human pose estimation for such parameterized actions.
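The idea of a parameterized gesture can be made concrete with a toy regression: given a contextual variable (here, ball height) and an observed joint angle per instance, fit the systematic variation and recover the underlying dependence. The data and the linear form are purely illustrative; the paper learns these variations within the OD-GPLVM framework, not by least squares.

```python
import numpy as np

# Hypothetical instances of a forehand shot: contextual variable
# (ball height, meters) and one joint angle (elbow, degrees).
heights = np.array([0.5, 0.8, 1.1, 1.4])
elbow   = np.array([40.0, 52.0, 64.0, 76.0])   # systematic variation with height

# fit angle = slope * height + intercept
A = np.vstack([heights, np.ones_like(heights)]).T
slope, intercept = np.linalg.lstsq(A, elbow, rcond=None)[0]
print(round(slope, 1), round(intercept, 1))
```

Once such a dependence is learned, an observed contextual variable (e.g. the detected ball position) constrains the expected pose, which is the role scene/object context plays in the full model.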

Abstract

In 2007 FXPAL submitted results for two tasks: rushes summarization and interactive search. The rushes summarization task has been described at the ACM Multimedia workshop. Interested readers are referred to that publication for details. We describe our interactive search experiments in this notebook paper.

Abstract

DOTS (Dynamic Object Tracking System) is an indoor, real-time, multi-camera surveillance system, deployed in a real office setting. DOTS combines video analysis and user interface components to enable security personnel to effectively monitor views of interest and to perform tasks such as tracking a person. The video analysis component performs feature-level foreground segmentation with reliable results even under complex conditions. It incorporates an efficient greedy-search approach for tracking multiple people through occlusion and combines results from individual cameras into multi-camera trajectories. The user interface draws the users' attention to important events that are indexed for easy reference. Different views within the user interface provide spatial information for easier navigation. DOTS, with over twenty video cameras installed in hallways and other public spaces in our office building, has been in constant use for a year. Our experiences led to many changes that improved performance in all system components.
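A greedy-search tracker of the kind mentioned can be illustrated by matching detections to track predictions cheapest-pair-first under a distance gate. This is a simplified, hypothetical stand-in (Manhattan distance, arbitrary gate) for the system's occlusion-aware search, not its actual implementation.

```python
def greedy_assign(tracks, detections, gate=50.0):
    """Greedy nearest-first matching of detections to track predictions.
    Each (cost, track, detection) pair is considered in ascending cost
    order; a pair is accepted if both sides are still unmatched."""
    pairs = sorted(
        (abs(t[0] - d[0]) + abs(t[1] - d[1]), ti, di)
        for ti, t in enumerate(tracks)
        for di, d in enumerate(detections)
    )
    used_t, used_d, matches = set(), set(), []
    for cost, ti, di in pairs:
        if cost <= gate and ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return sorted(matches)

# two predicted track positions, two detections (in pixels)
print(greedy_assign([(0, 0), (100, 100)], [(98, 103), (2, 1)]))
```

Greedy matching is O(n·m·log(n·m)) and, unlike the optimal Hungarian assignment, can be maintained incrementally, which is one reason such approaches suit real-time multi-camera settings.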

Abstract

This paper describes a system for selecting excerpts from unedited video and presenting the excerpts in a short summary video for efficiently understanding the video contents. Color and motion features are used to divide the video into segments where the color distribution and camera motion are similar. Segments with and without camera motion are clustered separately to identify redundant video. Audio features are used to identify clapboard appearances for exclusion. Representative segments from each cluster are selected for presentation. To increase the original material contained within the summary and reduce the time required to view the summary, selected segments are played back at a higher rate based on the amount of detected camera motion in the segment. Pitch-preserving audio processing is used to better capture the sense of the original audio. Metadata about each segment is overlaid on the summary to help the viewer understand the context of the summary segments in the original video.
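The motion-dependent playback rate can be sketched as a simple mapping: static segments play back fastest, while segments with strong camera motion stay closer to normal speed so the motion remains watchable. The constants below are illustrative assumptions, not the system's tuned values.

```python
def playback_rate(motion, base=1.5, max_rate=3.0):
    """Map detected camera-motion magnitude (clamped to 0..1) to a
    speed-up factor: static segments play fastest, strong motion
    plays closer to normal speed. Constants are illustrative."""
    m = min(max(motion, 0.0), 1.0)
    return max(base, max_rate - (max_rate - base) * m)

print(playback_rate(0.0), playback_rate(1.0), playback_rate(0.5))
```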

Abstract

This paper describes FXPAL's interactive video search application, "MediaMagic". FXPAL has participated in the TRECVID interactive search task since 2004. In our search application we employ a rich set of redundant visual cues to help the searcher quickly sift through the video collection. A central element of the interface and underlying search engine is a segmentation of the video into stories, which allows the user to quickly navigate and evaluate the relevance of moderately-sized, semantically-related chunks.
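Story segmentation of the kind the interface relies on can be sketched by placing a boundary wherever the similarity between consecutive shots drops below a threshold, so each story is a run of mutually similar shots. The similarity values and threshold here are hypothetical, not MediaMagic's actual segmentation.

```python
def group_into_stories(shot_sims, threshold=0.6):
    """Return story-start indices: a boundary is placed before shot i
    when the similarity between shot i-1 and shot i is weak.
    shot_sims[i-1] is the similarity between shots i-1 and i."""
    boundaries = [0]
    for i, s in enumerate(shot_sims, start=1):
        if s < threshold:          # weak link => new story starts at shot i
            boundaries.append(i)
    return boundaries

# consecutive-shot similarities for six shots
print(group_into_stories([0.9, 0.8, 0.2, 0.7, 0.1]))
```

Grouping shots into such semantically related chunks is what lets a searcher judge relevance story-by-story instead of shot-by-shot.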

Abstract

In this paper we describe the analysis component of an indoor, real-time, multi-camera surveillance system. The analysis includes: (1) a novel feature-level foreground segmentation method which achieves efficient and reliable segmentation results even under complex conditions, (2) an efficient greedy-search-based approach for tracking multiple people through occlusion, and (3) a method for multi-camera handoff that associates individual trajectories in adjacent cameras. The analysis is used in an 18-camera surveillance system that has been running continuously in an indoor business environment for the past several months. Our experiments demonstrate that the processing method for people detection and tracking across multiple cameras is fast and robust.
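The handoff step can be illustrated with a minimal association test: a trajectory that ends in one camera is linked to one that starts in an adjacent camera when the time gap and the distance between exit and entry points are both small. Field names, units, and thresholds here are hypothetical, not the system's calibrated model.

```python
def handoff(traj_a, traj_b, max_gap=2.0, max_dist=1.5):
    """Link a trajectory ending in camera A to one starting in camera B
    when the time gap (seconds) and ground-plane distance (meters,
    assuming a common coordinate frame) are both small."""
    gap = traj_b["start_t"] - traj_a["end_t"]
    dx = traj_b["start_xy"][0] - traj_a["end_xy"][0]
    dy = traj_b["start_xy"][1] - traj_a["end_xy"][1]
    return 0.0 <= gap <= max_gap and (dx * dx + dy * dy) ** 0.5 <= max_dist

a = {"end_t": 10.0, "end_xy": (5.0, 2.0)}     # person leaves camera A's view
b = {"start_t": 11.0, "start_xy": (5.5, 2.2)} # person appears in camera B
print(handoff(a, b))
```

In a deployed system such pairwise tests are combined with appearance cues and restricted to camera pairs whose fields of view are actually adjacent.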

Abstract

In this chapter, we discuss mixed-media access, an information access paradigm for multimedia data in which the media type of a query may differ from that of the data. This allows a single query to be used to retrieve information from data consisting of multiple types of media. In addition, multiple queries formulated in different media types can be used to more accurately specify the data to be retrieved. The types of media considered in this chapter are speech, images of text, and full-length text. Some examples of metadata for mixed-media access are locations of keywords in speech and images, identification of speakers, locations of emphasized regions in speech, and locations of topic boundaries in text. Algorithms for automatically generating this metadata are described, including word spotting, speaker segmentation, emphatic speech detection, and subtopic boundary location. We illustrate the use of mixed-media access with an example of information access from multimedia data surrounding a formal presentation.
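Combining queries in different media types amounts, at its simplest, to intersecting the document sets each query matches. The toy below shows that conjunctive combination with entirely hypothetical document names and hit sets; the chapter's actual retrieval combines ranked, location-bearing metadata rather than flat sets.

```python
def mixed_media_query(media_hits):
    """Documents matching every query, regardless of each query's media type."""
    result = None
    for hits in media_hits.values():
        result = set(hits) if result is None else result & set(hits)
    return sorted(result or [])

hits = {
    "text query 'budget'":               {"doc1", "doc2", "doc4"},
    "word-spotted keyword 'budget'":     {"doc2", "doc3", "doc4"},
}
print(mixed_media_query(hits))
```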