Text Analytics Techniques (2014 Year in Review)

Text analytics refers to linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for intelligence, exploratory data analysis, research, or investigation. The research cited here focuses on mining large volumes of text to identify insider threats, intrusions, and malware. The works cited were published in 2014.

Dey, L.; Mahajan, D.; Gupta, H., "Obtaining Technology Insights from Large and Heterogeneous Document Collections," Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on, vol. 1, pp. 102, 109, 11-14 Aug. 2014. doi: 10.1109/WI-IAT.2014.22 Keeping up with rapid advances in research in various fields of engineering and technology is a challenging task. Decision makers, including academics, program managers, venture capital investors, industry leaders, and funding agencies, not only need to be abreast of the latest developments but also need to be able to assess the effect of growth in certain areas on their core business. Though analyst agencies like Gartner and McKinsey provide such reports for some areas, thought leaders of all organisations still need to amass data from heterogeneous collections such as research publications, analyst reports, patent applications, and competitor information to help them finalize their own strategies. Text mining and data analytics researchers have been looking at integrating statistics, text analytics, and information visualization to aid the process of retrieval and analytics. In this paper, we present our work on automated topical analysis and insight generation from large heterogeneous text collections of publications and patents. While most of the earlier work in this area provides search-based platforms, ours is an integrated platform for search and analysis. We present several methods and techniques that help in the analysis and better comprehension of search results, as well as methods for generating insights about emerging and popular trends in research along with contextual differences between academic research and patenting profiles. We also present novel techniques to show topic evolution, helping users understand how a particular area has evolved over time.

Heimerl, F.; Lohmann, S.; Lange, S.; Ertl, T., "Word Cloud Explorer: Text Analytics Based on Word Clouds," System Sciences (HICSS), 2014 47th Hawaii International Conference on, pp. 1833, 1842, 6-9 Jan. 2014. doi: 10.1109/HICSS.2014.231 Word clouds have emerged as a straightforward and visually appealing visualization method for text. They are used in various contexts as a means to provide an overview by distilling text down to those words that appear with highest frequency. Typically, this is done in a static way as pure text summarization. We think, however, that there is a larger potential to this simple yet powerful visualization paradigm in text analytics. In this work, we explore the usefulness of word clouds for general text analysis tasks. We developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method. It equips them with advanced natural language processing, sophisticated interaction techniques, and context information. We show how this approach can be effectively used to solve text analysis tasks and evaluate it in a qualitative user study.
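
Since the entry above reduces word clouds to their core idea (show the highest-frequency words at sizes proportional to how often they occur), a minimal Python sketch of that frequency-to-size step may help make it concrete. The tokenizer, stop-word list, and font-size range below are illustrative assumptions, not the Word Cloud Explorer's actual NLP pipeline.

    # A rough sketch of the frequency analysis behind a word cloud: tokenize,
    # drop stop words, count occurrences, and map counts onto a font-size range.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "that"}

    def word_cloud_weights(text, top_n=50, min_size=10, max_size=60):
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
        top = counts.most_common(top_n)
        if not top:
            return []
        hi, lo = top[0][1], top[-1][1]
        span = max(hi - lo, 1)
        # Scale each word's frequency linearly into [min_size, max_size].
        return [(w, min_size + (c - lo) * (max_size - min_size) // span) for w, c in top]

    print(word_cloud_weights("text analytics distills raw text into frequent text terms"))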

Mukkamala, R.R.; Hussain, A.; Vatrapu, R., "Towards a Set Theoretical Approach to Big Data Analytics," Big Data (BigData Congress), 2014 IEEE International Congress on, pp. 629, 636, June 27 2014-July 2 2014. doi: 10.1109/BigData.Congress.2014.96 Formal methods, models and tools for social big data analytics are largely limited to graph theoretical approaches such as social network analysis (SNA) informed by relational sociology. There are no other unified modeling approaches to social big data that integrate the conceptual, formal and software realms. In this paper, we first present and discuss a theory and conceptual model of social data. Second, we outline a formal model based on set theory and discuss the semantics of the formal model with a real-world social data example from Facebook. Third, we briefly present and discuss the Social Data Analytics Tool (SODATO) that realizes the conceptual model in software and provisions social data analysis based on the conceptual and formal models. Fourth and last, based on the formal model and sentiment analysis of text, we present a method for profiling of artifacts and actors and apply this technique to the data analysis of big social data collected from Facebook page of the fast fashion company, H&M.
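
As a rough illustration of the set-theoretic framing described above, the following Python sketch treats social data as a set of (actor, action, artifact, text) tuples and derives simple actor and artifact profiles through set operations and a plug-in sentiment scorer. It is a simplified reading of the idea, not the paper's formal model or the SODATO tool; the data and the toy polarity function are invented.

    # Toy set-theoretic view of social data: interactions are a set of
    # (actor, action, artifact, text) tuples; profiles come from set operations.
    interactions = {
        ("alice", "comment", "post_1", "love the new collection"),
        ("bob",   "like",    "post_1", ""),
        ("alice", "comment", "post_2", "shipping was slow"),
    }

    def actors_of(artifact):
        # Set of actors who acted on the given artifact.
        return {actor for actor, _, art, _ in interactions if art == artifact}

    def artifact_sentiment(artifact, polarity):
        # Average polarity of the texts attached to an artifact; `polarity` is any
        # text -> score function (a sentiment classifier in practice).
        texts = [txt for _, _, art, txt in interactions if art == artifact and txt]
        return sum(map(polarity, texts)) / len(texts) if texts else 0.0

    print(actors_of("post_1"))
    print(artifact_sentiment("post_2", lambda t: -1.0 if "slow" in t else 1.0))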

Craig, P.; Roa Seiler, N.; Olvera Cervantes, A.D., "Animated Geo-temporal Clusters for Exploratory Search in Event Data Document Collections," Information Visualisation (IV), 2014 18th International Conference on, pp.157,163, 16-18 July 2014. doi: 10.1109/IV.2014.69 This paper presents a novel visual analytics technique developed to support exploratory search tasks for event data document collections. The technique supports discovery and exploration by clustering results and overlaying cluster summaries onto coordinated timeline and map views. Users can also explore and interact with search results by selecting clusters to filter and re-cluster the data with animation used to smooth the transition between views. The technique demonstrates a number of advantages over alternative methods for displaying and exploring geo-referenced search results and spatio-temporal data. Firstly, cluster summaries can be presented in a manner that makes them easy to read and scan. Listing representative events from each cluster also helps the process of discovery by preserving the diversity of results. Also, clicking on visual representations of geo-temporal clusters provides a quick and intuitive way to navigate across space and time simultaneously. This removes the need to overload users with the display of too many event labels at any one time. The technique was evaluated with a group of nineteen users and compared with an equivalent text based exploratory search engine.
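
The grouping step behind such geo-temporal summaries can be approximated by clustering events on scaled (latitude, longitude, time) coordinates. The sketch below uses scikit-learn's k-means purely as a stand-in; the paper's own clustering, summarization, and animation pipeline is not reproduced, and the toy coordinates are made up.

    # Group toy events by scaled (latitude, longitude, time) with k-means and
    # print one representative event per cluster as the summary.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # (latitude, longitude, days since some epoch) for a handful of toy events
    events = np.array([
        [48.8,   2.3,  10], [48.9,   2.4,  12],   # same city, same week
        [40.7, -74.0,  11], [40.8, -73.9,  13],   # another city, same week
        [48.8,   2.3, 200],                       # first city, much later
    ])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(events))

    for c in sorted(set(labels)):
        members = np.where(labels == c)[0]
        print(f"cluster {c}: events {members.tolist()}, representative {members[0]}")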

Eun Hee Ko; Klabjan, D., "Semantic Properties of Customer Sentiment in Tweets," Advanced Information Networking and Applications Workshops (WAINA), 2014 28th International Conference on, pp.657,663, 13-16 May 2014. doi: 10.1109/WAINA.2014.151 An increasing number of people are using online social networking services (SNSs), and a significant amount of information related to experiences in consumption is shared in this new media form. Text mining is an emerging technique for mining useful information from the web. We aim at discovering, in particular in tweets, semantic patterns in consumers' discussions on social media. Specifically, the purposes of this study are twofold: 1) finding similarity and dissimilarity between two sets of textual documents that include consumers' sentiment polarities, two forms of positive vs. negative opinions, and 2) deriving actual content from the textual data that has a semantic trend. The considered tweets include consumers' opinions on US retail companies (e.g., Amazon, Walmart). Cosine similarity and K-means clustering methods are used to achieve the former goal, and Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, is used for the latter purpose. This is the first study which discovers semantic properties of textual data in a consumption context beyond sentiment analysis. In addition to the major findings, we apply LDA to the same data and derive latent topics that represent consumers' positive and negative opinions on social media.
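
The abstract names three standard techniques (cosine similarity, k-means clustering, and LDA topic modeling); the sketch below strings them together on a handful of invented "tweets" using scikit-learn as a stand-in. The preprocessing, parameters, and data are illustrative assumptions rather than the study's actual setup.

    # Toy run of the three techniques named above:
    # TF-IDF + cosine similarity, k-means clustering, and LDA topic modeling.
    from sklearn.cluster import KMeans
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    positive = ["fast delivery and great prices", "love shopping here, great service"]
    negative = ["late delivery and poor service", "prices went up and checkout was slow"]
    tweets = positive + negative

    # 1) Similarity between the positive and the negative document sets.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(tweets)
    print(cosine_similarity(tfidf[:2], tfidf[2:]))

    # 2) Unsupervised grouping of the tweets.
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf.toarray()))

    # 3) Latent topics behind the discussions.
    cv = CountVectorizer(stop_words="english")
    counts = cv.fit_transform(tweets)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    vocab = cv.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        print(k, [vocab[i] for i in topic.argsort()[-3:]])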

Conglei Shi; Yingcai Wu; Shixia Liu; Hong Zhou; Huamin Qu, "LoyalTracker: Visualizing Loyalty Dynamics in Search Engines," Visualization and Computer Graphics, IEEE Transactions on, vol.20, no.12, pp. 1733, 1742, Dec. 31 2014. doi: 10.1109/TVCG.2014.2346912 The huge amount of user log data collected by search engine providers creates new opportunities to understand user loyalty and defection behavior at an unprecedented scale. However, this also poses a great challenge to analyze the behavior and glean insights into the complex, large data. In this paper, we introduce LoyalTracker, a visual analytics system to track user loyalty and switching behavior towards multiple search engines from the vast amount of user log data. We propose a new interactive visualization technique (flow view) based on a flow metaphor, which conveys a proper visual summary of the dynamics of user loyalty of thousands of users over time. Two other visualization techniques, a density map and a word cloud, are integrated to enable analysts to gain further insights into the patterns identified by the flow view. Case studies and interviews with domain experts are conducted to demonstrate the usefulness of our technique in understanding user loyalty and switching behavior in search engines.

Babour, A.; Khan, J.I., "Tweet Sentiment Analytics with Context Sensitive Tone-Word Lexicon," Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on, vol. 1, pp.392,399, 11-14 Aug. 2014. doi: 10.1109/WI-IAT.2014.61 In this paper we propose a Twitter sentiment analytics technique that mines for opinion polarity about a given topic. Most current semantic sentiment analytics depend on polarity lexicons. However, many key tone words are frequently bipolar. In this paper we demonstrate a technique which can accommodate the bipolarity of tone words through a context-sensitive tone lexicon learning mechanism, where the context is modeled by the semantic neighborhood of the main target. Performance analysis shows that the ability to contextualize the tone word polarity significantly improves the accuracy.
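
To make the bipolarity problem concrete, the toy Python function below assigns a bipolar tone word its polarity from cue words in a small context window instead of from a fixed lexicon entry. The lexicons, cue words, and window size are invented for illustration; the paper's lexicon-learning mechanism is considerably richer than this sketch.

    # A bipolar tone word gets its polarity from cue words in its context window
    # rather than from a fixed lexicon entry.
    BASE_LEXICON = {"great": 1, "terrible": -1, "slow": -1, "fast": 1}
    BIPOLAR = {"cheap"}  # "cheap price" reads positive, "cheap build" reads negative
    CONTEXT_CUES = {"price": 1, "deal": 1, "quality": -1, "build": -1}

    def tone(tokens, i, window=2):
        word = tokens[i]
        if word in BASE_LEXICON:
            return BASE_LEXICON[word]
        if word in BIPOLAR:
            neighborhood = tokens[max(0, i - window): i + window + 1]
            score = sum(CONTEXT_CUES.get(w, 0) for w in neighborhood)
            return 1 if score > 0 else -1 if score < 0 else 0
        return 0

    tokens = "the cheap build broke after a week".split()
    print(tone(tokens, tokens.index("cheap")))   # -1 in this context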

Vantigodi, S.; Babu, R.V., "Entropy Constrained Exemplar-Based Image Inpainting," Signal Processing and Communications (SPCOM), 2014 International Conference on, pp.1,5, 22-25 July 2014. doi: 10.1109/SPCOM.2014.6984013 Image inpainting is the process of filling in an unwanted region of an image marked by the user. It is used for restoring old paintings and photographs, removing red eyes from pictures, etc. In this paper, we propose an efficient inpainting algorithm which takes care of false edge propagation. We use the classical exemplar-based technique to find the priority term for each patch. To ensure that the nearest-neighbor patch, found by minimizing the L2 distance between patches, has similar edge content, we impose an additional constraint that the entropies of the patches be similar. The entropy of a patch acts as a good measure of its edge content. Additionally, we fill the image by considering overlapping patches to ensure smoothness in the output. We use the structural similarity index as the measure of similarity between the ground truth and the inpainted image. The results of the proposed approach on a number of examples on real and synthetic images show the effectiveness of our algorithm in removing objects and thin scratches or text written on an image. It is also shown that the proposed approach is robust to the shape of the manually selected target. Our results compare favorably to those obtained by existing techniques.
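
The core matching step described above (minimize the L2 distance between patches while requiring similar entropy, used as a proxy for edge content) might be sketched as follows in NumPy. The histogram binning, entropy tolerance, and patch sizes are assumptions, and the full priority-driven inpainting loop is omitted.

    # Among candidate patches, prefer the one minimizing L2 distance to the target
    # while requiring its entropy to be close to the target's entropy.
    import numpy as np

    def patch_entropy(patch, bins=32):
        hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
        p = hist / hist.sum()
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def best_match(target, candidates, entropy_tol=0.5):
        h_t = patch_entropy(target)
        best, best_d = None, np.inf
        for idx, cand in enumerate(candidates):
            if abs(patch_entropy(cand) - h_t) > entropy_tol:   # entropy constraint
                continue
            d = float(np.sum((cand - target) ** 2))            # L2 distance
            if d < best_d:
                best, best_d = idx, d
        return best

    rng = np.random.default_rng(0)
    target = rng.random((9, 9))
    candidates = [rng.random((9, 9)) for _ in range(5)]
    print(best_match(target, candidates))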

Baughman, A.K.; Chuang, W.; Dixon, K.R.; Benz, Z.; Basilico, J., "DeepQA Jeopardy! Gamification: A Machine-Learning Perspective," Computational Intelligence and AI in Games, IEEE Transactions on, vol.6, no. 1, pp. 55, 66, March 2014. doi: 10.1109/TCIAIG.2013.2285651 DeepQA is a large-scale natural language processing (NLP) question-and-answer system that responds across a breadth of structured and unstructured data, from hundreds of analytics that are combined with over 50 models, trained through machine learning. After the 2011 historic milestone of defeating the two best human players in the Jeopardy! game show, the technology behind IBM Watson, DeepQA, is undergoing gamification into real-world business problems. Gamifying a business domain for Watson is a composite of functional, content, and training adaptation for nongame play. During domain gamification for medical, financial, government, or any other business, each system change affects the machine-learning process. As opposed to the original Watson Jeopardy!, whose class distribution of positive-to-negative labels is 1:100, in adaptation the computed training instances, question-and-answer pairs transformed into true-false labels, result in a very low positive-to-negative ratio of 1:100 000. Such initial extreme class imbalance during domain gamification poses a big challenge for the Watson machine-learning pipelines. The combination of ingested corpus sets, question-and-answer pairs, configuration settings, and NLP algorithms contribute toward the challenging data state. We propose several data engineering techniques, such as answer key vetting and expansion, source ingestion, oversampling classes, and question set modifications to increase the computed true labels. In addition, algorithm engineering, such as an implementation of the Newton-Raphson logistic regression with a regularization term, relaxes the constraints of class imbalance during training adaptation. We conclude by empirically demonstrating that data and algorithm engineering are complementary and indispensable to overcome the challenges in this first Watson gamification for real-world business problems.
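
For the algorithm-engineering step mentioned above, a compact NumPy sketch of L2-regularized logistic regression fitted by Newton-Raphson iterations is given below. It illustrates the general method only; it is not IBM's Watson pipeline, and the regularization strength, iteration count, and synthetic data are arbitrary choices.

    # L2-regularized logistic regression fit by Newton-Raphson (IRLS-style) updates.
    import numpy as np

    def fit_logreg_newton(X, y, lam=1.0, iters=20):
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-X @ w))              # predicted probabilities
            grad = X.T @ (p - y) + lam * w                # gradient with L2 penalty
            W = p * (1.0 - p)
            H = X.T @ (X * W[:, None]) + lam * np.eye(d)  # Hessian with L2 penalty
            w -= np.linalg.solve(H, grad)                 # Newton-Raphson step
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200) > 0).astype(float)
    print(np.round(fit_logreg_newton(X, y), 2))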

Zadeh, B.Q.; Handschuh, S., "Random Manhattan Indexing," Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on, pp.203,208, 1-5 Sept. 2014. doi: 10.1109/DEXA.2014.51 Vector space models (VSMs) are mathematically well-defined frameworks that have been widely used in text processing. In these models, high-dimensional, often sparse vectors represent text units. In an application, the similarity of vectors -- and hence the text units that they represent -- is computed by a distance formula. The high dimensionality of vectors, however, is a barrier to the performance of methods that employ VSMs. Consequently, a dimensionality reduction technique is employed to alleviate this problem. This paper introduces a new method, called Random Manhattan Indexing (RMI), for the construction of L1 normed VSMs at reduced dimensionality. RMI combines the construction of a VSM and dimension reduction into an incremental, and thus scalable, procedure. In order to attain its goal, RMI employs the sparse Cauchy random projections.
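
The ingredient RMI builds on, random projections with Cauchy-distributed entries whose projected coordinates allow an L1 distance estimate, can be demonstrated in a few lines of NumPy. The sketch below uses a small dense projection matrix and a median-based estimator for clarity; RMI itself constructs a sparse projection incrementally, and its estimator and parameters differ in detail.

    # Cauchy (1-stable) random projection: L1 distances in the original space can be
    # estimated from the projected vectors via a median of absolute differences.
    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 10_000, 200                     # original and reduced dimensionality
    R = rng.standard_cauchy(size=(d, k))   # dense stand-in for RMI's sparse projection

    x = rng.random(d)
    y = rng.random(d)
    px, py = x @ R, y @ R                  # reduced-dimensional index vectors

    l1_true = np.abs(x - y).sum()
    l1_est = np.median(np.abs(px - py))    # median estimator for 1-stable projections
    print(round(l1_true, 1), round(l1_est, 1))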

Koch, S.; John, M.; Worner, M.; Muller, A.; Ertl, T., "VarifocalReader — In-Depth Visual Analysis of Large Text Documents," Visualization and Computer Graphics, IEEE Transactions on, vol. 20, no.12, pp. 1723, 1732, Dec. 31 2014. doi: 10.1109/TVCG.2014.2346677 Interactive visualization provides valuable support for exploring, analyzing, and understanding textual documents. Certain tasks, however, require that insights derived from visual abstractions are verified by a human expert perusing the source text. So far, this problem is typically solved by offering overview-detail techniques, which present different views with different levels of abstractions. This often leads to problems with visual continuity. Focus-context techniques, on the other hand, succeed in accentuating interesting subsections of large text documents but are normally not suited for integrating visual abstractions. With VarifocalReader we present a technique that helps to solve some of these approaches' problems by combining characteristics from both. In particular, our method simplifies working with large and potentially complex text documents by simultaneously offering abstract representations of varying detail, based on the inherent structure of the document, and access to the text itself. In addition, VarifocalReader supports intra-document exploration through advanced navigation concepts and facilitates visual analysis tasks. The approach enables users to apply machine learning techniques and search mechanisms as well as to assess and adapt these techniques. This helps to extract entities, concepts and other artifacts from texts. In combination with the automatic generation of intermediate text levels through topic segmentation for thematic orientation, users can test hypotheses or develop interesting new research questions. To illustrate the advantages of our approach, we provide usage examples from literature studies.

Lomotey, R.K.; Deters, R., "Terms Mining in Document-Based NoSQL: Response to Unstructured Data," Big Data (BigData Congress), 2014 IEEE International Congress on, pp. 661, 668, June 27 2014-July 2 2014. doi: 10.1109/BigData.Congress.2014.99 Unstructured data mining has become topical recently due to the availability of high-dimensional and voluminous digital content (known as "Big Data") across the enterprise spectrum. The Relational Database Management Systems (RDBMS) have been employed over the past decades for content storage and management, but, the ever-growing heterogeneity in today's data calls for a new storage approach. Thus, the NoSQL database has emerged as the preferred storage facility nowadays since the facility supports unstructured data storage. This creates the need to explore efficient data mining techniques from such NoSQL systems since the available tools and frameworks which are designed for RDBMS are often not directly applicable. In this paper, we focused on topics and terms mining, based on clustering, in document-based NoSQL. This is achieved by adapting the architectural design of an analytics-as-a-service framework and the proposal of the Viterbi algorithm to enhance the accuracy of the terms classification in the system. The results from the pilot testing of our work show higher accuracy in comparison to some previously proposed techniques such as the parallel search.
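
Since the abstract attributes the accuracy gain to Viterbi-based term classification, a generic Viterbi decoder over a tiny hidden Markov model is sketched below as a stand-in, here framed as tagging tokens as topic terms versus background words. The states, probabilities, and labels are made up for illustration and do not come from the paper.

    # Generic Viterbi decoding over a small hand-made HMM.
    def viterbi(obs, states, start_p, trans_p, emit_p):
        # V[t][s] = (best probability of reaching state s at time t, predecessor state)
        V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-9), None) for s in states}]
        for t in range(1, len(obs)):
            V.append({})
            for s in states:
                prob, prev = max(
                    (V[t - 1][ps][0] * trans_p[ps][s] * emit_p[s].get(obs[t], 1e-9), ps)
                    for ps in states)
                V[t][s] = (prob, prev)
        # Backtrack the most probable state sequence.
        last = max(V[-1], key=lambda s: V[-1][s][0])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(V[t][path[-1]][1])
        return list(reversed(path))

    states = ("TERM", "OTHER")
    start_p = {"TERM": 0.3, "OTHER": 0.7}
    trans_p = {"TERM": {"TERM": 0.5, "OTHER": 0.5}, "OTHER": {"TERM": 0.2, "OTHER": 0.8}}
    emit_p = {"TERM": {"nosql": 0.4, "mining": 0.3}, "OTHER": {"the": 0.3, "of": 0.3}}
    print(viterbi(["the", "nosql", "mining"], states, start_p, trans_p, emit_p))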

Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests via Email to news@scienceofsecurity.net for removal of the links or modifications to specific citations. Please include the ID# of the specific citation in your correspondence.