Projects

Assembly code mining for malware analysis

Assembly code analysis is one of the critical processes for detecting
and proving software plagiarism and software patent infringements when
the source code is unavailable. It is also a common practice to discover
exploits and vulnerabilities in existing software. However, it is a
manually intensive and time-consuming process even for experienced
reverse engineers. An effective and efficient assembly code clone search
engine can greatly reduce the effort of this process, since it can
identify the cloned parts that have been previously analyzed. The
assembly code clone search problem belongs to the field of software
engineering. However, it strongly depends on practical nearest neighbor
search techniques in data mining and databases. By closely collaborating
with reverse engineers and Defence Research and Development Canada
(DRDC), we study the concerns and challenges that make existing
assembly code clone search approaches impractical from the
perspective of data mining. We propose a new variant of the
locality-sensitive hashing (LSH) scheme and combine it with graph
matching to address these challenges. We
implement an integrated assembly clone search engine called Kam1n0. It
is the first clone search engine that can efficiently identify the given
query assembly function’s subgraph clones from a large assembly code
repository. Kam1n0 is built upon the Apache Spark computation framework
and Cassandra-like key-value distributed storage. A deployed demo system
is publicly available. Extensive experimental results suggest that
Kam1n0 is accurate, efficient, and scalable for handling large volumes of
assembly code. This software won the second prize in the
Hex-Rays Plug-In Contest 2015.
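
The general idea of using LSH to find candidate clones can be illustrated with a minimal sketch. It is not the Kam1n0 scheme: it swaps in textbook MinHash with banding over sets of instruction tokens, omits the graph matching step entirely, and all names (`CloneIndex`, `minhash_signature`, the sample function names and tokens) are hypothetical.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: for each seeded hash, the minimum value over the tokens.

    Assumes `tokens` is a non-empty collection of strings.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{tok}".encode()).digest()[:8], "big")
            for tok in tokens
        ))
    return signature


def band_keys(signature, bands=8):
    """Split a signature into bands; each band becomes one hash-table key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows]))
            for b in range(bands)]


class CloneIndex:
    """Toy index: LSH proposes candidates, exact Jaccard similarity ranks them."""

    def __init__(self, num_hashes=32, bands=8):
        self.num_hashes = num_hashes
        self.bands = bands
        self.table = defaultdict(set)   # band key -> function names
        self.tokens = {}                # function name -> token set

    def add(self, name, tokens):
        tokens = set(tokens)
        self.tokens[name] = tokens
        signature = minhash_signature(tokens, self.num_hashes)
        for key in band_keys(signature, self.bands):
            self.table[key].add(name)

    def query(self, tokens):
        tokens = set(tokens)
        signature = minhash_signature(tokens, self.num_hashes)
        candidates = set()
        for key in band_keys(signature, self.bands):
            candidates |= self.table.get(key, set())
        scored = [
            (len(tokens & self.tokens[c]) / len(tokens | self.tokens[c]), c)
            for c in candidates
        ]
        return sorted(scored, reverse=True)


index = CloneIndex()
index.add("memcpy_like", ["mov", "cmp", "jne", "inc", "ret"])
index.add("checksum", ["xor", "shl", "add", "loop", "ret"])
results = index.query(["mov", "cmp", "jne", "inc", "ret"])
```

Only functions that share at least one band with the query are compared exactly, which is what keeps the search from scanning the whole repository. A function with an identical token set always collides in every band, so it is always returned.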

Privacy-preserving data publishing for data mining

Data mining is the process of extracting useful, interesting, and
previously unknown information from large datasets. The success of data
mining relies on the availability of high quality data and effective
information sharing. Since data mining is often a key component of many
systems of business information, national security, and monitoring and
surveillance, the public has acquired the negative impression that data
mining is a technique that intrudes on personal privacy. This lack of
trust in data mining has become an obstacle to the advancement of the
technology. To overcome this obstacle, our research on
privacy-preserving data publishing (PPDP) is concerned mainly with the
feasibility of anonymizing and publishing person-specific data for data
mining without compromising the privacy of individuals. The research is
also concerned with designing anonymization algorithms for large data
sets in various data publishing scenarios, including single party,
multiparty, and sequential data publishing.
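
As a rough illustration of what anonymizing person-specific data can mean, the sketch below checks k-anonymity, one widely used privacy model, before and after generalizing a quasi-identifier. It is an assumption that this model is representative here; the record fields, values, and function names are hypothetical and are not taken from our algorithms.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a range, e.g. 23 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in counts.values())


# Hypothetical person-specific records; 'diagnosis' is the sensitive attribute.
raw = [
    {"age": 23, "region": "East", "diagnosis": "flu"},
    {"age": 27, "region": "East", "diagnosis": "asthma"},
    {"age": 25, "region": "East", "diagnosis": "flu"},
    {"age": 34, "region": "East", "diagnosis": "diabetes"},
    {"age": 38, "region": "East", "diagnosis": "flu"},
]

# Generalize the age column while leaving the other attributes untouched.
published = [{**r, "age": generalize_age(r["age"])} for r in raw]

raw_ok = is_k_anonymous(raw, ["age", "region"], k=2)
published_ok = is_k_anonymous(published, ["age", "region"], k=2)
```

In the raw table every age is unique, so each person can be singled out by age and region. After generalization the records fall into groups of three and two, so the published table satisfies 2-anonymity at the cost of less precise ages.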

Sensory and location-aware devices are used extensively in many
network systems, such as mass transportation, car navigation, and
healthcare management. The collected transaction, trajectory, and social
network data capture detailed information of tagged objects, offering
tremendous opportunities for mining useful knowledge. However,
publishing the raw data would reveal specific sensitive information of
tagged objects or individuals. In this research thread, we have studied
the privacy threats in transaction, trajectory, and social network data publishing and
presented a family of scalable anonymization methods to tackle the
challenging properties of high dimensionality, sparseness, and
sequentiality.

Text mining for cybercrime investigation

As data collection techniques have improved over the last decade, the
volume of collected cybercrime data has grown at a tremendous rate. Yet,
extracting useful knowledge from such a large volume of textual data,
such as e-mails, web pages, blogs, chat room dialogues, and instant
messages, remains a challenging task for law enforcement. In this
research thread, our team has developed a collection of cyber forensics
software tools for writeprint analysis and criminal network mining. The
research has been reported by media worldwide.

Data mining for improving building energy performance

Identification of major determinants of building energy consumption, together with a thorough understanding of their impacts on energy consumption patterns, could help achieve the goals of improving building energy performance and reducing greenhouse gas emissions. One of the most important determinants is the behavior of the building occupants. The advancement of building automation and energy management systems enables building managers to collect a large volume of occupant behavior and movement data. This data can provide abundant practical information about interactions between building energy consumption and influencing factors. However, the data is rarely analyzed and useful knowledge is seldom extracted due to a lack of effective data analysis techniques and tools.

Our team, together with Prof. Fariborz Haghighat at Concordia University, has developed the first comprehensive data mining framework and a family of customized data mining methods for identifying the associations and correlations between building operational data and occupant behavior data, thereby discovering practical knowledge about energy conservation. To demonstrate its applicability, the proposed method was applied to the operational data of the air-conditioning system in a building located in Montreal. It effectively identified energy waste in the air-conditioning system as well as faulty equipment in the HVAC system. The proposed data mining framework and methods could help building engineers and designers better understand building operation and provide further opportunities for energy conservation.
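
To give a flavor of what an association between operational data and occupant behavior data can look like, the sketch below computes the support and confidence of one candidate rule over a handful of made-up observations. It uses the standard definitions from association rule mining, not the customized methods described above; the item names and observations are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical hourly observations: occupancy state plus air-conditioning state.
observations = [
    {"occupied", "ac_on"},
    {"unoccupied", "ac_on"},
    {"unoccupied", "ac_on"},
    {"occupied", "ac_on"},
    {"unoccupied", "ac_off"},
]

# Candidate rule: unoccupied -> ac_on (cooling an empty zone suggests energy waste).
rule_support = support(observations, {"unoccupied", "ac_on"})
rule_confidence = confidence(observations, {"unoccupied"}, {"ac_on"})
```

Here the rule holds in two of the five observations (support 0.4) and in two of the three unoccupied hours (confidence about 0.67). A rule like this, if it held with high confidence in real data, would point at periods where the system runs without occupants.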