Francine Chen, Ph.D.

Francine’s research interests are in information access and machine learning, with a focus on developing methods for extracting, organizing, and finding information in different types of media, including audio, imaged text, images, text, and video. She was a Member of Research Staff and Manager of the Quantitative Content Analysis area at PARC (formerly Xerox PARC).

Abstract

Discovering and analyzing biclusters, i.e., two sets of closely related entities, is a critical task in many real-world applications, such as exploring entity co-occurrences in intelligence analysis and studying gene expression in bioinformatics. While the output of biclustering techniques can offer some initial low-level insights, visual approaches are required on top of that due to the complexity of the algorithmic output. This paper proposes a visualization technique, called BiDots, that allows analysts to interactively explore biclusters over multiple domains. BiDots overcomes several limitations of existing bicluster visualizations by encoding biclusters in a more compact and cluster-driven manner. A set of handy interactions is incorporated to support flexible analysis of biclustering results. More importantly, BiDots addresses the case of weighted biclusters, which has been underexplored in the literature. The design of BiDots is grounded in a set of analytical tasks derived from previous work. We demonstrate its usefulness and effectiveness for exploring computed biclusters with an investigative document analysis task, in which suspicious people and activities are identified from a text corpus.

Abstract

Video summarization and video captioning are considered two separate tasks in existing studies. For longer videos, automatically identifying the important parts of video content and annotating them with captions will enable a richer and more concise condensation of the video. We propose a general neural network architecture that jointly considers two supervisory signals (i.e., an image-based video summary and text-based video captions) in the training phase and generates both a video summary and corresponding captions for a given video in the test phase. Our main idea is that the summary signals can help a video captioning model learn to focus on important frames, while caption signals can help a video summarization model learn better semantic representations. Jointly modeling the video summarization and video captioning tasks offers a novel end-to-end solution that generates a captioned video summary, enabling users to index and navigate the highlights in a video. Moreover, our experiments show the joint model achieves better performance than state-of-the-art approaches on both individual tasks.

Abstract

The availability of mobile access has shifted social media use. With that phenomenon, what users share on social media and where they visit is naturally an excellent resource for learning their visiting behavior. Knowing visiting behavior would help market surveys and customer relationship management, e.g., sending customers coupons for the businesses they visit frequently. Most prior studies leverage metadata, e.g., check-in locations, to profile visiting behavior but neglect important information in user-contributed content, e.g., images. This work addresses a novel use of image content for predicting a user's visiting behavior, i.e., the frequent and regular business venue categories that the content owner would visit. To collect training data, we propose a strategy that uses the geo-metadata associated with images to derive labels for an image owner's visiting behavior. Moreover, we model a user's sequential images with an end-to-end learning framework to reduce the optimization loss, which improves prediction accuracy over the baseline, as demonstrated in our experiments. The prediction is based entirely on image content, which is more available in social media than geo-metadata, and thus allows profiling a wider set of users.

Abstract

Users often use social media to share their interest in products. We propose to identify purchase stages from Twitter data following the AIDA model (Awareness, Interest, Desire, Action). In particular, we define a task of classifying the purchase stage of each tweet in a user's tweet sequence. We introduce RCRNN, a Ranking Convolutional Recurrent Neural Network which computes tweet representations using convolution over word embeddings and models a tweet sequence with gated recurrent units. Also, we consider various methods to cope with the imbalanced label distribution in our data and show that a ranking layer outperforms class weights.
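As a rough illustration of the class-weight baseline that the ranking layer is compared against, inverse-frequency class weighting for the imbalanced AIDA labels can be sketched as follows (the function name and exact weighting formula are illustrative assumptions, not the paper's implementation):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency so rare purchase
    stages contribute as much to the training loss as common ones.
    weight_c = total / (n_classes * count_c); a balanced class gets 1.0."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

# Toy label distribution over the four AIDA stages (illustrative only).
weights = inverse_frequency_weights(
    ["Awareness"] * 6 + ["Interest"] * 2 + ["Desire"] * 1 + ["Action"] * 1
)
```

The rarest stages receive the largest weights, which is the effect the class-weight baseline relies on.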

Abstract

The abundance of data posted to Twitter enables companies to extract useful information, such as Twitter users who are dissatisfied with a product. We endeavor to determine which Twitter users are potential customers for companies and would be receptive to product recommendations through the language they use in tweets after mentioning a product of interest. With Twitter's API, we collected tweets from users who tweeted about mobile devices or cameras. An expert annotator determined whether each tweet was relevant to customer purchase behavior and whether a user, based on their tweets, eventually bought the product. For the relevance task, among four models, a feed-forward neural network yielded the best cross-validation accuracy of over 80% per product. For customer purchase prediction of a product, we observed improved performance with the use of sequential input of tweets to recurrent models, with an LSTM model being best; we also observed the use of relevance predictions in our model to be more effective with less powerful RNNs and on more difficult tasks.

Abstract

Social media offers potential opportunities for businesses to extract business intelligence. This paper presents Tweetviz, an interactive tool to help businesses extract actionable information from a large set of noisy Twitter messages. Tweetviz visualizes tweet sentiment of business locations, identifies other business venues that Twitter users visit, and estimates some simple demographics of the Twitter users frequenting a business. A user study evaluating the system indicates that Tweetviz can provide an overview of a business's issues and sentiment as well as information that aids users in creating customer profiles.

Abstract

Captions are a central component of image posts that communicate the background story behind photos. Captions can enhance engagement with audiences and are therefore critical to campaigns and advertising. Previous studies in image captioning either rely solely on image content or summarize multiple web documents related to an image's location; both neglect users' activities. We propose business-aware latent topics as a new contextual cue for image captioning that represents user activities. The idea is to learn the typical activities of people who posted images from business venues with similar categories (e.g., fast food restaurants) to provide appropriate context for similar topics (e.g., burger) in new posts. User activities are modeled via a latent topic representation, and in turn the image captioning model can generate sentences that better reflect user activities at business venues. In our experiments, the business-aware latent topics are more effective than existing baselines for adapting captions to images captured at various businesses. Moreover, they complement other contextual cues (image, time) in a multimodal framework.

Abstract

Many people post about their daily life on social media. These posts may include information about people's purchase activity, from which insights useful to companies can be derived, e.g., profile information of a user who mentioned something about their product. As a more advanced analysis, we consider extracting users who are likely to buy a product from the set of users who mentioned that the product is attractive.
In this paper, we report our methodology for building a corpus for Twitter user purchase behavior prediction. First, we collected Twitter users who posted a want phrase + product name: e.g. "want a Xperia" as candidate want users, and also candidate bought users in the same way. Then, we asked an annotator to judge whether a candidate user actually bought a product. We also annotated whether tweets randomly sampled from want/bought user timelines are relevant or not to purchase. In this annotation, 58% of want user tweets and 35% of bought user tweets were annotated as relevant. Our data indicate that information embedded in timeline tweets can be used to predict purchase behavior of tweeted products.
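The candidate-collection step above, matching a want phrase plus a product name such as "want a Xperia", can be sketched with a simple pattern match (the product list, phrase variants, and function name here are illustrative assumptions, not the exact patterns used to build the corpus):

```python
import re

# Hypothetical product vocabulary; the corpus used real product names.
PRODUCTS = ["Xperia", "iPhone"]

# Match "want" (or a casual variant) followed by an optional article
# and a known product name, e.g. "want a Xperia", "want an iPhone".
WANT_RE = re.compile(
    r"\b(?:want|wanna get)\s+(?:a|an|the)?\s*(%s)\b" % "|".join(PRODUCTS),
    re.IGNORECASE,
)

def is_candidate_want_tweet(text):
    """True if the tweet contains a want phrase + product name."""
    return WANT_RE.search(text) is not None

is_candidate_want_tweet("I really want a Xperia for my birthday")  # matches
```

A parallel pattern over past-tense phrases (e.g. "bought a ...") would collect the candidate bought users in the same way.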

Abstract

We present a method for profiling businesses at specific locations that is based on mining information from social media. The method matches geo-tagged tweets from Twitter against venues from Foursquare to identify the specific business mentioned in a tweet. By linking geo-coordinates to places, the tweets associated with a business, such as a store, can then be used to profile that business. From these venue-located tweets, we create sentiment profiles for each of the stores in a chain. We present the results as heat maps showing how sentiment differs across stores in the same chain and how some chains have more positive sentiment than other chains. We also estimate social group size from photos and create profiles of social group size for businesses. Sample heat maps of these results illustrate how the average social group size can vary across businesses.
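The per-store aggregation behind these sentiment profiles amounts to averaging estimated tweet sentiment by venue. A minimal sketch, where the (venue, score) pairing format and function name are assumptions for illustration:

```python
from collections import defaultdict

def store_sentiment_profiles(scored_tweets):
    """Average estimated sentiment per venue. `scored_tweets` is a list
    of (venue_id, sentiment_score) pairs, i.e., geo-tagged tweets already
    matched to Foursquare venues and scored by a sentiment estimator."""
    totals = defaultdict(lambda: [0.0, 0])  # venue -> [score sum, count]
    for venue, score in scored_tweets:
        totals[venue][0] += score
        totals[venue][1] += 1
    return {v: s / n for v, (s, n) in totals.items()}

profiles = store_sentiment_profiles(
    [("store_a", 0.8), ("store_a", 0.4), ("store_b", -0.2)]
)
```

Each store's average then becomes one cell of the chain's heat map.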

Abstract

We describe methods for analyzing and visualizing document metadata to provide insights about collaborations over time. We investigate the use of Latent Dirichlet Allocation (LDA) based topic modeling to compute areas of interest on which people collaborate. The topics are represented in a node-link force-directed graph by persistent fixed nodes laid out with multidimensional scaling (MDS), and the people by transient movable nodes. The topics are also analyzed to detect bursts and highlight "hot" topics during a time interval. As the user manipulates a time-interval slider, the people nodes and links are dynamically updated. We evaluate the results of LDA topic modeling for the visualization by comparing topic keywords against the submitted keywords from the InfoVis 2004 Contest, finding that the additional terms provided by LDA-based keyword sets improve the similarity between a topic keyword set and the documents in a corpus. We extended the InfoVis dataset from 8 to 20 years, collected publication metadata from our lab over a period of 21 years, and created interactive visualizations for exploring these larger datasets.
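Burst detection over a topic's per-interval document counts can be approximated very simply. The sketch below flags intervals whose count exceeds the mean by a few standard deviations; this thresholding is a stand-in for illustration, not the burst detection algorithm used in the paper:

```python
def burst_intervals(counts, z=2.0):
    """Return indices of "hot" intervals: counts more than z standard
    deviations above the topic's mean count across all intervals."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    std = var ** 0.5
    return [i for i, c in enumerate(counts) if c > mean + z * std]

burst_intervals([1, 2, 1, 0, 9, 1, 2])  # → [4]
```

Flagged intervals would then be highlighted on the topic nodes as the user moves the time-interval slider.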

Abstract

Image localization is important for marketing and recommendation of local businesses; however, the level of granularity is still a critical issue. Given a consumer photo and its rough GPS information, we are interested in extracting the fine-grained location information (i.e., the business venue) of the image. To this end, we propose a novel framework for business venue recognition. The framework contains three main parts. First, business-aware visual concept discovery: we mine a set of concepts useful for business venue recognition based on three guidelines: business awareness, visual detectability, and discriminative power. Second, business-aware concept detection by convolutional neural networks (BA-CNN): we propose a new network architecture that extracts semantic concept features from an input image. Third, multimodal business venue recognition: we extend visually detected concepts to multimodal feature representations that allow a test image to be associated with business reviews and images from social media for business venue recognition. The experimental results show that the visual concepts detected by BA-CNN achieve up to 22.5% relative improvement in business venue recognition compared to state-of-the-art convolutional neural network features. Experiments also show that leveraging multimodal information from social media can further boost performance, especially when the database images belonging to each business venue are scarce.

Abstract

In this paper, we analyze the association between a social media user's photo content and their interests. The visual content of photos is analyzed using state-of-the-art deep-learning-based automatic concept recognition, and an aggregate visual concept signature is computed for each user. User tags manually applied to their photos are also used to construct a tf-idf based signature per user. We also obtain the social groups that users join to represent their social interests. To compare the visual-based and tag-based user profiles with social interests, we compare the corresponding similarity matrices with a reference similarity matrix based on users' group memberships. A random baseline is also included that groups users by random sampling while preserving the actual group sizes. A difference metric is proposed, and it is shown that the combination of visual and text features better approximates the group-based similarity matrix than either modality individually. We also validate the visual analysis against the reference inter-user similarity using the Spearman rank correlation coefficient. Finally, we cluster users by their visual signatures and rank the clusters using a cluster uniqueness criterion.
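The tag-based signature above is a standard construction; a minimal sketch of per-user tf-idf signatures and the cosine similarities that fill a user-similarity matrix (function names and the exact idf variant are assumptions):

```python
import math
from collections import Counter

def tfidf_signatures(user_tags):
    """Build an L2-normalized tf-idf vector per user from their photo
    tags, so a dot product between two signatures is cosine similarity."""
    n_users = len(user_tags)
    # Document frequency: in how many users' tag sets each tag appears.
    df = Counter(t for tags in user_tags.values() for t in set(tags))
    sigs = {}
    for user, tags in user_tags.items():
        tf = Counter(tags)
        vec = {t: c * math.log(n_users / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        sigs[user] = {t: w / norm for t, w in vec.items()}
    return sigs

def cosine(a, b):
    """Cosine similarity of two sparse, pre-normalized signatures."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())
```

Filling an n-by-n matrix with `cosine` over all user pairs yields the tag-based similarity matrix compared against the group-membership reference.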

Abstract

Knowing the geo-located venue of a tweet can facilitate better understanding of a user's geographic context, allowing apps to more precisely present information, recommend services, and target advertisements. However, due to privacy concerns, few users choose to enable geotagging of their tweets, resulting in only a small percentage of tweets being geotagged; furthermore, even if geo-coordinates are available, the venue closest to the geo-location may be incorrect.
In this paper, we present a method for providing a ranked list of geo-located venues for a non-geotagged tweet, which simultaneously indicates the venue name and the geo-location at a very fine-grained granularity. In our proposed method for Venue Inference for Tweets (VIT), we construct a heterogeneous social network in order to analyze the embedded social relations, and leverage available but limited geographic data to estimate the geo-located venue of tweets. A single classifier is trained to predict the probability of a tweet and a geo-located venue being linked, rather than training a separate model for each venue. We examine the performance of four types of social relation features and three types of geographic features embedded in a social network when predicting whether a tweet and a venue are linked, with a best accuracy of over 88%. We use the classifier probability estimates to rank the predicted geo-located venues of a non-geotagged tweet from over 19k possibilities, and observe an average top-5 accuracy of 29%.
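The top-5 evaluation used above is a standard ranking metric; a minimal sketch, with function names chosen for illustration:

```python
def top_k_hit(ranked_venues, true_venue, k=5):
    """True if the true venue appears among the top-k ranked venues."""
    return true_venue in ranked_venues[:k]

def average_top_k(predictions, k=5):
    """Average top-k accuracy over (ranked_venue_list, true_venue) pairs,
    where each list is ordered by classifier probability estimates."""
    hits = sum(top_k_hit(ranked, true, k) for ranked, true in predictions)
    return hits / len(predictions)

acc = average_top_k(
    [(["cafe", "gym", "mall"], "gym"),
     (["cafe", "gym", "mall"], "park")],
    k=2,
)
```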

Abstract

We present a method for profiling businesses at specific locations that is based on mining information from social media. The method matches geo-tagged tweets from Twitter against venues from Foursquare to identify the specific business mentioned in a tweet. By linking geo-coordinates to places, the tweets associated with a business, such as a store, can then be used to profile that business. We used a sentiment estimator developed for tweets to create sentiment profiles of the stores in a chain, computing the average sentiment of tweets associated with each store. We present the results as heatmaps which show how sentiment differs across stores in the same chain and how some chains have more positive sentiment than other chains. We also created profiles of social group size for businesses and show sample heatmaps illustrating how the size of a social group can vary.

Abstract

We examine the use of clustering to identify selfies in a social media user's photos for use in estimating demographic information such as age, gender, and race. Faces are first detected within a user's photos followed by clustering using visual similarity. We define a cluster scoring scheme that uses a combination of within-cluster visual similarity and average face size in a cluster to rank potential selfie-clusters. Finally, we evaluate this ranking approach over a collection of Twitter users and discuss methods that can be used for improving performance in the future.
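The cluster scoring scheme described above combines within-cluster visual similarity with average face size. A minimal sketch, assuming a simple linear mix with weight `alpha` and face sizes pre-normalized to [0, 1] (both the mixing form and the names are assumptions, not the paper's exact formula):

```python
def score_cluster(pairwise_sims, face_sizes, alpha=0.5):
    """Score a face cluster for selfie-likelihood: selfie clusters tend
    to be visually homogeneous (same face) with large faces (arm's-length
    shots). Higher score = more selfie-like."""
    sim = sum(pairwise_sims) / len(pairwise_sims)   # mean pairwise similarity
    size = sum(face_sizes) / len(face_sizes)        # mean normalized face size
    return alpha * sim + (1 - alpha) * size

# A tight cluster of large faces outranks a loose cluster of small ones.
selfie_like = score_cluster([0.9, 0.95, 0.9], [0.7, 0.8, 0.75])
background = score_cluster([0.3, 0.4, 0.35], [0.1, 0.15, 0.1])
```

Ranking all of a user's face clusters by this score surfaces the likely selfie cluster for demographic estimation.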

Abstract

A topic-independent sentiment model is commonly used to estimate sentiment in microblogs, but for movie and product reviews, domain adaptation has been shown to improve sentiment estimation performance. We investigate the utility of topic-dependent polarity estimation models for microblogs, examining both a model trained on Twitter tweets containing a target keyword and a model trained on an enlarged set of tweets containing terms related to a topic. Comparing the performance of the topic-dependent models to a topic-independent model trained on a general sample of tweets, we find that topic-dependent models perform better for some topics. We then propose a method for predicting which topics are likely to have better sentiment estimation performance when a topic-dependent sentiment model is used.

Abstract

Motivated by scalable partial-duplicate visual search, there has been growing interest in compact and efficient binary feature descriptors (e.g., ORB, FREAK, BRISK). Typically, binary descriptors are clustered into codewords and quantized with Hamming distance, following the conventional bag-of-words strategy. However, such codewords formulated in Hamming space have not shown obvious indexing and search performance improvements compared to their Euclidean counterparts. In this paper, without explicit codeword construction, we explore utilizing binary descriptors as direct codebook indices (addresses). We propose a novel approach that builds multiple index tables which check collisions of identical hash values in parallel. The evaluation is performed on two public image datasets: DupImage and Holidays. The experimental results demonstrate the index efficiency and retrieval accuracy of our approach.
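The multiple-index-table idea can be sketched as follows: disjoint substrings of each binary descriptor serve as direct table addresses, and descriptors sharing any substring collide into the same bucket, so near-duplicates differing in a few bits are still retrieved. This is a minimal single-process sketch (bucket layout and names are assumptions; the paper checks the tables in parallel):

```python
def build_index_tables(descriptors, n_tables=4):
    """Index (image_id, bit_string) pairs into n_tables hash tables,
    each addressed by one disjoint substring of the descriptor."""
    tables = [dict() for _ in range(n_tables)]
    for img_id, bits in descriptors:
        chunk = len(bits) // n_tables
        for t in range(n_tables):
            key = bits[t * chunk:(t + 1) * chunk]
            tables[t].setdefault(key, set()).add(img_id)
    return tables

def query(tables, bits):
    """Union of image ids colliding with the query in any table; a
    query survives up to n_tables - 1 corrupted substrings."""
    chunk = len(bits) // len(tables)
    hits = set()
    for t, table in enumerate(tables):
        hits |= table.get(bits[t * chunk:(t + 1) * chunk], set())
    return hits
```

With real descriptors the substrings would be packed integers rather than strings, but the collision logic is the same.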

Abstract

People frequently capture photos with their smartphones, and some are starting to capture images of documents. However, the quality of captured document images is often lower than expected, even when applications that perform post-processing to improve the image are used. To improve the quality of captured images before post-processing, we developed a Smart Document Capture (SmartDCap) application that provides real-time feedback to users about the likely quality of a captured image. The quality measures capture the sharpness and framing of a page or regions on a page, such as a set of one or more columns, a part of a column, a figure, or a table. Using our approach, while users adjust the camera position, the application automatically determines when to take a picture of a document to produce a good quality result. We performed a subjective evaluation comparing SmartDCap and the Android Ice Cream Sandwich (ICS) camera application; we also used raters to evaluate the quality of the captured images. Our results indicate that users find SmartDCap to be as easy to use as the standard ICS camera application. Additionally, images captured using SmartDCap are sharper and better framed on average than images using the ICS camera application.

Abstract

Images of document pages have different characteristics than images of natural scenes, and so the sharpness measures developed for natural scene images do not necessarily extend to document images primarily composed of text. We present an efficient and simple method for effectively estimating the sharpness/blurriness of document images that also performs well on natural scenes. Our method can be used to predict the sharpness in scenarios where images are blurred due to camera motion (or hand shake), defocus, or inherent properties of the imaging system. The proposed method outperforms the perceptually-based, no-reference sharpness work of [1] and [4], which was shown to perform better than 14 other no-reference sharpness measures on the LIVE dataset.
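To illustrate what a no-reference sharpness score measures, here is a generic gradient-energy proxy on a grayscale image. This is NOT the paper's method, only a simple illustration of the idea that sharp edges produce large neighboring-pixel differences while blur smears them out:

```python
def sharpness_proxy(gray):
    """Mean squared difference between horizontally and vertically
    adjacent pixels of a grayscale image (list of equal-length rows).
    Sharper images score higher; blur reduces local differences."""
    h, w = len(gray), len(gray[0])
    energy = 0
    for y in range(h):                      # horizontal gradients
        for x in range(w - 1):
            energy += (gray[y][x + 1] - gray[y][x]) ** 2
    for y in range(h - 1):                  # vertical gradients
        for x in range(w):
            energy += (gray[y + 1][x] - gray[y][x]) ** 2
    return energy / (h * w)

sharp = [[0, 255, 0], [0, 255, 0]]         # crisp text-like stroke
blurry = [[0, 128, 64], [0, 128, 64]]      # the same stroke smeared
```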

Abstract

When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve genre identification performance. In the open-set identification of four office document genres, our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to genre identification of office documents. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.

Abstract

For document visualization, folding techniques provide a focus-plus-context approach with fairly high legibility on flat sections. To enable richer interaction, we explore the design space of multi-touch document folding. We discuss several design considerations for simple modeless gesturing and compatibility with standard Drag and Pinch gestures, and categorize gesture models along the characteristics of Symmetric/Asymmetric and Sequential/Parallel, which yields three gesture models. We built a prototype document workspace application that integrates folding and standard gestures, and a prototype for experimenting with the gesture models. A user study was conducted to compare the three models and to analyze the factors of fold direction, target symmetry, and target tolerance in user performance of folding a document to a specific shape. Our results indicate that all three factors were significant for task times, and parallelism was greater for symmetric targets.

Abstract

While there are many commercial systems designed to help people browse and compare products, these interfaces are typically product centric. To help users more efficiently identify products that match their needs, we instead focus on building a task centric interface and system. With this approach, users initially answer questions about the types of situations in which they expect to use the product. The interface reveals the types of products that match their needs and exposes high-level product features related to the kinds of tasks in which they have expressed an interest. As users explore the interface, they can reveal how those high-level features are linked to actual product data, including customer reviews and product specifications. We developed semi-automatic methods to extract the high-level features used by the system from online product data. These methods identify and group product features, mine and summarize opinions about those features, and identify product uses. User studies verified our focus on high-level features for browsing and low-level features and specifications for comparison.

Abstract

FACT is an interactive paper system for fine-grained interaction with documents across the boundary between paper and computers. It consists of a small camera-projector unit, a laptop, and ordinary paper documents. With the camera-projector unit pointing to a paper document, the system allows a user to issue pen gestures on the paper document for selecting fine-grained content and applying various digital functions. For example, the user can choose individual words, symbols, figures, and arbitrary regions for keyword search, copy and paste, web search, and remote sharing. FACT thus enables a computer-like user experience on paper. This paper interaction can be integrated with laptop interaction for cross-media manipulations on multiple documents and views. We present the infrastructure, supporting techniques and interaction design, and demonstrate the feasibility via a quantitative experiment. We also propose applications such as document manipulation, map navigation and remote collaboration.

Abstract

We present a method for picture detection in document page images, which can come from scanned or camera images, or rendered from electronic file formats. Our method uses OCR to separate out the text and applies the Normalized Cuts algorithm to cluster the non-text pixels into picture regions. A refinement step uses the captions found in the OCR text to deduce how many pictures are in a picture region, thereby correcting for under- and over-segmentation. A performance evaluation scheme is applied which takes into account the detection quality and fragmentation quality. We benchmark our method against the ABBYY application on page images from conference papers.