Dissertation Proposal

Understanding the Logical and Semantic
Structure of Large Documents

11:00-1:00 Monday, 12 December 2016, ITE325b, UMBC

Up-to-the-minute language understanding approaches are mostly focused on small documents such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents such as legal documents, reports, business opportunities, proposals and technical manuals is still a challenging task. The reason behind this challenge is that the documents may be multi-themed, complex and cover diverse topics.

We aim to automatically identify and classify a document’s sections and subsections, infer their structure and annotate them with semantic labels to understand the semantic structure of a document. This document’s structure understanding will significantly benefit and inform a variety of applications such as information extraction and retrieval, document categorization and clustering, document summarization, fact and relation extraction, text analysis and question answering.

Dissertation Proposal

Dealing with Dubious Facts
in Knowledge Graphs

Ankur Padia

1:00-3:00pm Wednesday, 30 November 2016, ITE 325b, UMBC

Knowledge graphs are structured representations of facts where nodes are real-world entities or events and edges are the associations among the pair of entities. Knowledge graphs can be constructed using automatic or manual techniques. Manual techniques construct high quality knowledge graphs but are expensive, time consuming and not scalable. Hence, automatic information extraction techniques are used to create scalable knowledge graphs but the extracted information can be of poor quality due to the presence of dubious facts.

An extracted fact is dubious if it is incorrect, inexact or correct but lacks evidence. A fact might be dubious because of the errors made by NLP extraction techniques, improper design consideration of the internal components of the system, choice of learning techniques (semi-supervised or unsupervised), relatively poor quality of heuristics or the syntactic complexity of underlying text. A preliminary analysis of several knowledge extraction systems (CMU’s NELL and JHU’s KELVIN) and observations from the literature suggest that dubious facts can be identified, diagnosed and managed. In this dissertation, I will explore approaches to identify and repair such dubious facts from a knowledge graph using several complementary approaches, including linguistic analysis, common sense reasoning, and entity linking.

Large Document Understanding

Up-to-the-minute language understanding approaches are mostly focused on small documents such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents such as legal documents, reports, business opportunities, proposals and technical manuals is still a challenging task. The reason behind this challenge is that the documents may be multi-themed, complex and cover diverse topics.

We aim to automatically identify and classify a document’s sections and subsections, infer their structure and annotate them with semantic labels to understand the semantic structure of a document. This document’s structure understanding will significantly benefit and inform a variety of applications such as information extraction and retrieval, document categorization and clustering, document summarization, fact and relation extraction, text analysis and question answering.

Dissertation Proposal

Cognitive Analytics Framework to Secure Internet of Things

1:00-3:30pm, Monday, 28 November 2016, ITE 325b

Recent years have seen the rapid growth and widespread adoption of Internet of Things in a wide range of domains including smart homes, healthcare, automotive, smart farming and smart grids. The IoT ecosystem consists of devices like sensors, actuators and control systems connected over heterogeneous networks. The connected devices can be from different vendors with different capabilities in terms of power requirements, processing capabilities, etc. As such, many security features aren’t implemented on devices with lesser processing capabilities. The level of security practices followed during their development can also be different. Lack of over the air update for firmware also pose a very big security threat considering their long-term deployment requirements. Device malfunctioning is yet another threat which should be considered. Hence, it is imperative to have an external entity which monitors the ecosystem and detect attacks and anomalies.

In this thesis, we propose a security framework for IoTs using cognitive techniques. While anomaly detection has been employed in various domains, some challenges like online approach, resource constraints, heterogeneity, distributed data collection etc. are unique to IoTs and their predecessors like wireless sensor networks. Our framework will have an underlying knowledge base which has the domain-specific information, a hybrid context generation module which generates complex contexts and a fast reasoning engine which does logical reasoning to detect anomalous activities. When raw sensor data arrives, the hybrid context generation module queries the knowledge base and generates different simple local contexts using various statistical and machine learning models. The inferencing engine will then infer global complex contexts and detects anomalous activities using knowledge from streaming facts and and domain specific rules encoded in the Ontology we will create. We will evaluate our techniques by realizing and validating them in the vehicular domain.

Topic Modeling for Analyzing Document Collection

Mitsunori Ogihara
Computer Science, University of Miami

11:00am Monday, 16 May 2016, ITE 325b, UMBC

Topic modeling (in particular, Latent Dirichlet Analysis) is a technique for analyzing a large collection of documents. In topic modeling we view each document as a frequency vector over a vocabulary and each topic as a static distribution over the vocabulary. Given a desired number, K, of document classes, a topic modeling algorithm attempts to estimate concurrently K static distributions and for each document how much each K class contributes. Mathematically, this is the problem of approximating the matrix generated by stacking the frequency vectors into the product of two non-negative matrices, where both the column dimension of the first matrix and the row dimension of the second matrix are equal to K. Topic modeling is gaining popularity recently, for analyzing large collections of documents.

In this talk I will present some examples of applying topic modeling: (1) a small sentiment analysis of a small collection of short patient surveys, (2) exploratory content analysis of a large collection of letters, (3) document classification based upon topics and other linguistic features, and (4) exploratory analysis of a large collection of literally works. I will speak not only the exact topic modeling steps but also all the preprocessing steps for preparing the documents for topic modeling.

Mitsunori Ogihara is a Professor of Computer Science at the University of Miami, Coral Gables, Florida. There he directs the Data Mining Group in the Center for Computational Science, a university-wide organization for providing resources and consultation for large-scale computation. He has published three books and approximately 190 papers in conferences and journals. He is on the editorial board for Theory of Computing Systems and International Journal of Foundations of Computer Science. Ogihara received a Ph.D. in Information Sciences from Tokyo Institute of Technology in 1993 and was a tenure-track/tenured faculty member in the Department of Computer Science at the University of Rochester from 1994 to 2007.

Vehicles can be considered as a specialized form of Cyber Physical Systems with sensors, ECU’s and actuators working together to produce a coherent behavior. With the advent of external connectivity, a larger attack surface has opened up which not only affects the passengers inside vehicles, but also people around them. One of the main causes of this increased attack surface is because of the advanced systems built on top of old and less secure common bus frameworks which lacks basic authentication mechanisms. To make such systems more secure, we approach this issue as a data analytic problem that can detect anomalous states. To accomplish that we collected data flowing between different components from real vehicles and using a Hidden Markov Model, we detect malicious behaviors and issue alerts, while a vehicle is in operation. Our evaluations using single parameter and two parameters together provide enough evidence that such techniques could be successfully used to detect anomalies in vehicles. Moreover our method could be used in new vehicles as well as older ones.

Down the rabbit hole: An Android system call study

Prajit Kumar Das

10:30 am, Monday, March 28, 2016 ITE 346

App permissions and application sandboxing are the fundamental security mechanisms that protects user data on mobile platforms. We have worked on permission analytics before and come to a conclusion that just studying an app’s requested access rights (permissions) isn’t enough to understand potential data breaches. Techniques like privilege escalation have been previously used to gain further access to user and her data on mobile platforms like Android. Static code analysis and dynamic code execution may be studied to gather further insight into an app’s behavior. However, there is a need to study such a behavior at the lowest level of code execution and that is system calls. The system call is the fundamental interface between an application and the Linux kernel. In our current project, we are studying system calls made by apps for gathering a better understanding of their behavior.

Image description using deep neural networks

Sunil Gandhi
10:30 am, Monday, February 29, 2016 ITE 346

With the explosion of image data on the internet, there has been a need for automatic generation of image descriptions. In this project we use deep neural networks for extracting vectors from images and we use them to generate text that describes the image. The model that we built makes use of the pre-trained VGGNET- a model for image classification and a recurrent neural network (RNN) for language modelling. The combination of the two neural networks provides a multimodal embedding between image vectors and word vectors. We trained the model on 8000 images from the Flickr8k dataset and we present our results on test images downloaded from the Internet. We provide a web-service for image description generation that takes the image URL as input and provides image description and image categories as output. Through our service, a user can correct the description automatically generated by the system so that we can improve our model using corrected description.

Ramin Ayanzadeh
10:30am, Monday, 22 February 2016, ITE 346

A Memetic algorithm, as a hybrid strategy, is an intelligent optimization method in problem solving. These algorithms are similar in nature to genetic algorithms as they follow evolutionary strategies, but they also incorporate a refinement phase during which they learn about the problem and search space. The efficiency of these algorithms depends on the nature and architecture of the imitation operator used. In this presentation, after a brief introduction, pros and cons of employing memetic algorithms would be discussed. Afterwards, developmental memetic algorithms will be proposed as an approach for subsiding the costs of using standard memetic algorithms. Developmental memetic algorithm is an adaptive memetic algorithm that has been developed in which the influence factor of environment on the learning abilities of each individual is set adaptively. This translates into a level of autonomous behavior, after a while that individuals gain some experience. Simulation results on benchmark function proved that this adaptive approach can increase the quality of the results and decrease the computation time simultaneously. The adaptive memetic algorithm also shows better stability when compared with the classic memetic algorithm.

Vehicles are becoming more and more connected, this opens up a larger attack surface which not only affects the passengers inside vehicles, but also people around them. These vulnerabilities exist because modern systems are built on the comparatively less secure and old CAN bus framework which lacks even basic authentication. Since a new protocol can only help future vehicles and not older vehicles, our approach tries to solve the issue as a data analytics problem and use machine learning techniques to secure cars. We develop a hidden markov model to detect anomalous states from real data collected from vehicles. Using this model, while a vehicle is in operation, we are able to detect and issue alerts. Our model could be integrated as a plug-n-play device in all new and old cars.

Aditi Gupta

10:30am, Monday 30 November 2015, ITE 346

Online social media is a powerful platform for dissemination of information during real world events. Beyond the challenges of volume, variety and velocity of content generated on online social media, veracity poses a much greater challenge for effective utilization of this content by citizens, organizations, and authorities. Veracity of information refers to the trustworthiness /credibility / accuracy / completeness of the content. This work addressed the challenge of veracity or trustworthiness of content posted on social media. We focus our work on Twitter, which is one of the most popular microblogging web service today. We provided an in-depth analysis of misinformation spread on Twitter during real world events. We showed effectiveness of automated techniques to detect misinformation on Twitter using a combination of content, meta-data, network, user profile and temporal features. We developed and deployed a novel framework, TweetCred for providing indication of trustworthiness / credibility of tweets posted during events. TweetCred, which was available as a browser plug-in, was installed and used by real Twitter users.

Dr. Aditi Gupta is a research associate in the Computer Science and Electrical Engineering Department at UMBC. She received her Ph.D. from the Indraprastha Institute of Information Technology, Delhi (IIIT-Delhi) in 2105 for her dissertation on designing and evaluating techniques to mitigate misinformation spread on microblogging web services.

Log files comprise a record of different events happening in various applications, operating systems and even in network devices. Originally they were used to record information for diagnostic and debugging purposes. Nowadays, logs are also used to track events which can be used in auditing and forensics in case of malicious activities or systems attacks. Various softwares like intrusion detection systems, web servers, anti-virus and anti-malware systems, firewalls and network devices generate logs with useful information, that can be used to protect against such system attacks. Analyzing log files can help in pro- actively avoiding attacks against the systems. While there are existing tools that do a good job when the format of log files is known, the challenge lies in cases where log files are from unknown devices and of unknown formats. We propose a framework that takes any log file and automatically gives out a semantic interpretation as a set of RDF Linked Data triples. The framework splits a log file into columns using regular expression-based or dictionary-based classifiers. Leveraging and modifying our existing work on inferring the semantics of tables, we identify every column from a log file and map it to concepts either from a general purpose KB like DBpedia or domain specific ontologies such as IDS. We also identify relationships between various columns in such log files. Converting large and verbose log files into such semantic representations will help in better search, integration and rich reasoning over the data.