▼ Tremendous maintenance and operational expenditures are incurred due to unexpected failures and inefficient maintenance and operational practices. Consequently, many capital-intensive assets used in the energy, manufacturing, and service sectors are equipped with numerous sensors that generate large amounts of high-dimensional data related to the physical performance and operational characteristics of the asset. One of the key Big Data challenges stems from the need to analyze high-dimensional data, in real time, to detect faults and predict the future state-of-health of critical assets (i.e., predictive analytics). This doctoral dissertation focuses on addressing several key challenges in predictive analytics for asset management and optimization. The first research challenge revolves around the development of prognostic methodologies (for predicting asset health and remaining operational life) that can scale with the size and complexity of high-dimensional data. In contrast, most existing research focuses on single time-series applications or multivariate applications where only small time-series vectors are considered. Furthermore, the few research efforts that involve more complex data structures, such as profile and/or image data, are limited to fault detection and do not extend to prognostics; the two are fundamentally different problems. The second research component focuses on the computational efficiency of analytic models. Specifically, we pursue fundamental research aimed at speeding up the matrix computations of conventional statistical methodologies to enable their application in real-time prognostics. Many existing models have been validated using small data sets, and thus computational challenges have often been overlooked. The third challenge, one that has traditionally been neglected due to a lack of real-world data, deals with data quality and its impact on the accuracy and fidelity of the resulting analytics. Harsh industrial environments have a significant impact on the quality of sensor data, with problems ranging from missing and fragmented observations to corrupt values and outliers that often result in significant false alarms, a problem that plagues many industries. In this thesis, we focus on developing models that are relatively robust in applications that exhibit poor data quality.
Advisors/Committee Members: Gebraeel, Nagi (advisor), Paynabar, Kamran (advisor), Shi, Jianjun (committee member), Zhao, Tuo (committee member), Chow, Edmond (committee member).

▼ While innovation has always been critical for the competitiveness of businesses, fierce competition resulting from the global economy and constant waves of disruption have made innovation even more crucial for the survival of large organisations. Today, extremely large volumes of data from a variety of sources are continuously created at immense speed. Containing deep information and insights into customer habits and needs, data has the potential to become a key enabler of competition and innovation. In the financial industry in particular, with no physical products, data is the most valuable asset and needs to be utilised to create competitive advantage and innovation.
However, studies in the literature, as well as primary qualitative research that I conducted in collaboration with Ernst and Young, reveal that financial institutions are falling short of exploiting the full potential of data and analytics for innovation and competition. This is due to the failure to discover high-value problems that may be solved using data analytics and that have the potential to result in significant value for customers and the business.
The aim of this research is to develop and evaluate a holistic model that increases the probability of success of data analytics endeavours in large organisations, resulting in high-value and innovative products and services. Using the research methodologies of interpretive case studies and grounded theory for data analysis, I derived the influencing factors for the success of developing innovation using data analytics and created a framework that maps these success factors in a cohesive and clear way. The generated framework, referred to as Creative Data Analytics (CDA), provides a methodology consisting of both creative and analytical techniques that enable organisations to develop an end-to-end roadmap to creative data analytics innovations. The CDA Framework integrates customer needs and predictive data analytics, and directs the investigation of data towards an innovative solution with a higher probability of solving an important real customer need or business problem. The validity of the CDA framework was evaluated by conducting action research on three data analytics projects at the Commonwealth Bank of Australia. These projects were conducted according to the CDA framework principles, and the degree of innovation of the solutions derived from these projects was evaluated qualitatively by interviewing managers and innovation experts involved in the projects.

▼ In this report, I describe a query-driven system that helps in deciding which restaurant to invest in, or which area of a specific place is suitable for opening a new restaurant. The analysis is performed on existing businesses in every state, based on factors such as the average star rating, the total number of reviews associated with a specific restaurant, and the price range of the restaurant. The results give an idea of which restaurants are successful in a city, which helps in deciding where to invest and what to keep in mind while starting a new business. The main scope of the project is Analytics and Data Visualization.
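As an illustration of the kind of query the system answers, the following is a minimal sketch assuming a Yelp-style business table with hypothetical column names (state, city, stars, review_count, price_range); the actual schema and scoring logic used in the report may differ.

```python
import pandas as pd

# Toy stand-in for a Yelp-style business table; column names are assumptions.
restaurants = pd.DataFrame({
    "state": ["NV", "NV", "AZ", "AZ", "AZ"],
    "city": ["Las Vegas", "Las Vegas", "Phoenix", "Phoenix", "Tempe"],
    "name": ["A", "B", "C", "D", "E"],
    "stars": [4.5, 3.0, 4.0, 4.5, 3.5],
    "review_count": [1200, 80, 450, 900, 60],
    "price_range": [2, 1, 2, 3, 1],
})

# Aggregate existing restaurants per city and rank by a simple score that
# rewards both a high average rating and high review volume.
summary = (
    restaurants.groupby(["state", "city"])
    .agg(avg_stars=("stars", "mean"),
         total_reviews=("review_count", "sum"),
         n_restaurants=("name", "size"))
    .reset_index()
)
summary["score"] = summary["avg_stars"] * summary["total_reviews"].rank(pct=True)
print(summary.sort_values("score", ascending=False))
```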
Advisors/Committee Members: William H. Hsu.

▼ This dissertation develops sophisticated data analytic methods to analyze structural loads on, and power generation of, wind turbines. Wind turbines, which convert the kinetic energy in wind into electrical power, are operated within stochastic environments. To account for the influence of environmental factors, we employ a conditional approach by modeling the expectation or distribution of response of interest, be it the structural load or power output, conditional on a set of environmental factors. Because of the different nature associated with the two types of responses, our methods also come in different forms, conducted through two studies.
The first study presents a Bayesian parametric model for the purpose of estimating the extreme load on a wind turbine. The extreme load is the highest stress level that the turbine structure would experience during its service lifetime. A wind turbine should be designed to resist such a high load to avoid catastrophic structural failures. To assess the extreme load, turbine structural responses are evaluated by conducting field measurement campaigns or performing aeroelastic simulation studies. In general, data obtained in either case are not sufficient to represent various loading responses under all possible weather conditions. An appropriate extrapolation is necessary to characterize the structural loads in a turbine's service life. This study devises a Bayesian spline method for this extrapolation purpose and applies the method to three sets of load response data to estimate the corresponding extreme loads at the roots of the turbine blades.
In the second study, we propose an additive multivariate kernel method as a new power curve model, which is able to incorporate a variety of environmental factors in addition to merely the wind speed. In the wind industry, a power curve refers to the functional relationship between the power output generated by a wind turbine and the wind speed at the time of power generation. Power curves are used in practice for a number of important tasks including predicting wind power production and assessing a turbine's energy production efficiency. Nevertheless, actual wind power data indicate that the power output is affected by more than just wind speed. Several other environmental factors, such as wind direction, air density, humidity, turbulence intensity, and wind shears, have potential impact. Yet, in industry practice, as well as in the literature, current power curve models primarily consider wind speed and, with comparatively less frequency, wind speed and direction. Our model provides, conditional on a given environmental condition, both the point estimation and density estimation of the power output. It is able to capture the nonlinear relationships between environmental factors and wind power output, as well as the high-order interaction effects among some of the environmental factors. To illustrate the application of the new power curve model, we conduct case studies that demonstrate how the new method can help with…
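As a minimal illustration of the conditional approach in the second study, the sketch below computes a Nadaraya-Watson multivariate kernel estimate of expected power output given wind speed and air density; the variable choices, bandwidths, and simulated data are illustrative assumptions, not the thesis's actual additive specification.

```python
import numpy as np

def kernel_power_curve(X, y, x_query, bandwidths):
    """Nadaraya-Watson estimate of E[power | environmental conditions].

    X : (n, d) array of environmental measurements (e.g., wind speed, air density)
    y : (n,) array of observed power output
    x_query : (d,) query condition
    bandwidths : (d,) kernel bandwidths, one per factor (illustrative values)
    """
    # Product Gaussian kernel over the d environmental factors
    z = (X - x_query) / bandwidths
    weights = np.exp(-0.5 * np.sum(z ** 2, axis=1))
    return np.sum(weights * y) / np.sum(weights)

# Toy usage with simulated data (wind speed in m/s, air density in kg/m^3)
rng = np.random.default_rng(0)
speed = rng.uniform(3, 20, 500)
density = rng.normal(1.225, 0.03, 500)
power = np.clip((speed / 12) ** 3, 0, 1) * 2000 + rng.normal(0, 50, 500)  # kW
X = np.column_stack([speed, density])
print(kernel_power_curve(X, power, np.array([10.0, 1.22]), np.array([1.0, 0.02])))
```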
Advisors/Committee Members: Ding, Yu (advisor), Ntaimo, Lewis (committee member), Genton, Marc G. (committee member), Singh, Chanan (committee member).

▼ The problem of revealing accurate statistics about a population while maintaining privacy of individuals is extensively studied in several related disciplines. Statisticians, information security experts, and computational theory researchers, to name a few, have produced extensive bodies of work regarding privacy preservation.
Still, the need to improve our ability to control the dissemination of potentially private information is driven home by an incessant rhythm of data breaches, data leaks, and privacy exposures. History has shown that both public and private sector organizations are not immune to loss of control over data due to lax handling, incidental leakage, or adversarial breaches. Prudent organizations should consider the sensitive nature of network security data and network operations performance data recorded as logged events. These logged events often contain data elements that are directly correlated with sensitive information about people and their activities – often at the same level of detail as sensor data.
Privacy preserving data publication has the potential to support reproducibility and exploration of new analytic techniques for network security. Providing sanitized data sets de-couples privacy protection efforts from analytic research. De-coupling privacy protections from analytical capabilities enables specialists to tease out the information and knowledge hidden in high dimensional data, while, at the same time, providing some degree of assurance that people's private information is not exposed unnecessarily.
In this research we propose methods that support a risk based approach to privacy preserving data publication for network security data. Our main research objective is the design and implementation of technical methods to support the appropriate release of network security data so it can be utilized to develop new analytic methods in an ethical manner. Our intent is to produce a database which holds network security data representative of a contextualized network and people's interaction with the network mid-points and end-points without the problems of identifiability.
Advisors/Committee Members: Tront, Joseph G. (committeechair), Butt, Ali (committee member), Ransbottom, Jeffrey Scot (committee member), Raymond, David Richard (committee member), Midkiff, Scott F. (committee member), Marathe, Madhav Vishnu (committee member).

▼ The thesis considers the general area of robust portfolio construction. In particular, the thesis considers two techniques in this area that aim to improve portfolio construction and, consequently, portfolio performance. The first technique focusses on estimation error in the sample covariance matrix (one of the portfolio optimisation inputs). In particular, shrinkage techniques applied to the sample covariance matrix are considered and their merits assessed. The second technique considered in the thesis focusses on the portfolio construction/optimisation process itself. Here the thesis adopts the 'resampled efficiency' proposal of Michaud (1989), which utilises Monte Carlo simulation from the sampled distribution to generate a range of resampled efficient frontiers. Thereafter the thesis assesses the merits of combining these two techniques in the portfolio construction process. Portfolios are constructed using a quadratic programming algorithm requiring two inputs: (i) expected returns; and (ii) cross-sectional behaviour and individual risk (the covariance matrix). The output is a set of 'optimal' investment weights, one for each share whose returns were fed into the algorithm. This thesis looks at identifying and removing avoidable risk through a statistical robustification of the algorithms, attempting to improve upon the 'optimal' weights they provide. The assessment of performance is done by comparing the out-of-period results with standard optimisation results, which are highly sensitive and prone to sampling error and extreme weightings. The methodology applies various shrinkage techniques to the historical covariance matrix and then takes a resampling portfolio optimisation approach using the shrunken matrix. We use Monte Carlo simulation techniques to replicate sets of statistically equivalent portfolios, find optimal weightings for each, and then, through aggregation of these, reduce the sensitivity to anomalies in the historical time series. We also consider the trade-off between sampling error and specification error of models.
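The following is a minimal sketch of combining the two techniques, using Ledoit-Wolf shrinkage of the sample covariance and a crude resampling loop in the spirit of Michaud's resampled efficiency; the closed-form minimum-variance weights stand in for the thesis's full quadratic programming step, and all constants are illustrative.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def min_variance_weights(cov):
    """Closed-form minimum-variance weights (stand-in for the full QP optimiser)."""
    inv = np.linalg.inv(cov)
    ones = np.ones(cov.shape[0])
    w = inv @ ones
    return w / w.sum()

def resampled_weights(returns, n_resamples=200, seed=0):
    """Michaud-style resampling: simulate statistically equivalent histories,
    optimise each, and average the resulting weight vectors."""
    rng = np.random.default_rng(seed)
    mu = returns.mean(axis=0)
    cov = LedoitWolf().fit(returns).covariance_   # shrunken covariance matrix
    T, n = returns.shape
    weights = []
    for _ in range(n_resamples):
        sim = rng.multivariate_normal(mu, cov, size=T)   # resampled history
        sim_cov = LedoitWolf().fit(sim).covariance_
        weights.append(min_variance_weights(sim_cov))
    return np.mean(weights, axis=0)

# Toy usage: 5 shares, 250 days of simulated returns
rng = np.random.default_rng(1)
rets = rng.normal(0.0005, 0.01, size=(250, 5))
print(resampled_weights(rets))
```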
Advisors/Committee Members: Bradfield, David (advisor).

▼ Sports data analytics has become a popular research area in recent years, with the advent of different ways to capture information about a game or a player. Different statistical metrics have been created to quantify the performance of a player or team. A popular application of sports data analytics is to generate a rating system for all the teams or players involved in a tournament. The resulting rating system can be used to predict the outcome of future games, assess player performances, or come up with tournament brackets.
A popular rating system is the Elo rating system. It started as a rating system for chess tournaments and is known for its simple yet elegant way of assigning a rating to a particular individual. Over the last decade, several variations of the original Elo rating system have come into existence, collectively known as Elo-based rating systems. These have been applied in a variety of sports such as baseball, basketball, and football. In this thesis, an Elo-based approach is employed to model an individual basketball player's strength based on the plus-minus score of the player. The plus-minus score is a powerful metric because it quantifies contributions of a player, such as good defense, setting up screens, or sledging the opposing team, which are not reflected by metrics that are primarily based on points. The individual player ratings are then combined to obtain a team rating, and team ratings are compared pairwise to obtain the probability of a win by each of the teams during a matchup. This method not only predicts wins and losses, but also offers more information than the Elo rating system because ratings are assigned to each individual player instead of just teams. This information includes, for example, the effect of mid-season transfers or the impact of injuries on team strength; these items are overlooked by the standard Elo algorithm.
The performance of the proposed Elo-based rating system is compared to that of the standard Elo rating system for basketball using synthetic data. The rating systems are also compared by running them over real-life data from past NBA seasons.
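A minimal sketch of an Elo-style update driven by plus-minus rather than a bare win/loss outcome is given below; the logistic mapping of plus-minus to an observed "score", the K factor, and the minutes-weighted team aggregation are illustrative assumptions rather than the thesis's exact formulation.

```python
import math

K = 20.0  # update step size (illustrative)

def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A outperforms B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_player(rating, opponent_strength, plus_minus, scale=10.0):
    """Elo-style update where the observed 'score' is a logistic
    transform of the player's plus-minus for the game."""
    observed = 1.0 / (1.0 + math.exp(-plus_minus / scale))
    expected = expected_score(rating, opponent_strength)
    return rating + K * (observed - expected)

def team_rating(player_ratings, minutes):
    """Combine individual player ratings into a team rating, weighted by minutes played."""
    total = sum(minutes)
    return sum(r * m for r, m in zip(player_ratings, minutes)) / total

# Win probability for team A vs team B is then the usual pairwise Elo expectation
team_a = team_rating([1550, 1500, 1480, 1520, 1490], [36, 34, 30, 28, 25])
team_b = team_rating([1510, 1505, 1495, 1470, 1460], [35, 33, 31, 27, 26])
print(expected_score(team_a, team_b))
```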
Advisors/Committee Members: Chamberland, Jean-Francois (advisor), Huff, Gregory H (committee member), Ioerger, Thomas R (committee member).

▼ There are many public and private databases of oil field properties, the analysis of which could lead to insights in several areas. The recent trend of Big Data has given rise to novel analytic methods to effectively handle multidimensional data and to visualize them to discover new patterns. The main objective of this research is to apply some of the methods used in data analytics to datasets containing reservoir data.
Using a commercial reservoir properties database, we created and tested three data analytic models to predict ultimate oil and gas recovery efficiencies, using the following methods borrowed from data analytics: linear regression, linear regression with feature selection, and Bayesian networks. We also adopted similarity ranking with principal component analysis to create a reservoir analog recommender system, which recognizes and ranks reservoir analogs from the database.
Among the models designed to estimate recovery factors, the linear regression model created with variables chosen by sequential feature selection performed best, showing strong positive correlations between actual and predicted values of reservoir recovery efficiencies. Compared to this model, the Bayesian network model and the simple linear regression model performed poorly.
For the reservoir analog recommender system, an arbitrary reservoir was selected and different distance metrics were used to rank analog reservoirs. Because no single distance metric (and hence no single reservoir analog list) is superior to the others, the reservoirs in the recommended list are compared along with the characteristics of the distance metrics.
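A minimal sketch of similarity ranking with principal component analysis, assuming a purely numeric table of reservoir properties; the number of retained components and the Euclidean distance are illustrative choices, and the thesis compares several distance metrics.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def analog_ranking(properties, target_index, n_components=3):
    """Rank reservoirs by distance to a target reservoir in PCA space.

    properties : (n_reservoirs, n_features) numeric array
                 (e.g., porosity, permeability, depth, ...)
    """
    scaled = StandardScaler().fit_transform(properties)
    scores = PCA(n_components=n_components).fit_transform(scaled)
    dists = np.linalg.norm(scores - scores[target_index], axis=1)
    return np.argsort(dists)  # first entry is the target reservoir itself

# Toy usage: 100 reservoirs described by 6 numeric properties
rng = np.random.default_rng(2)
props = rng.normal(size=(100, 6))
print(analog_ranking(props, target_index=0)[:6])
```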
Advisors/Committee Members: Lake, Larry W. (advisor), Mohanty, Kishore K (committee member).

▼ The causality between workforce forecasts and the establishment of new university degree programs is obvious. According to the 2011 reports of McKinsey & Company and IDC, the shortage of Big Data or Data Analytics talent would reach 190,000 for analytical positions and 1.5 million for data-literate managers, which is about 50 to 60 percent higher than the regular supply in 2018.
Universities have noticed this trend and have started to catch up. Within the last few years, many universities have launched new master's programs in Data Analytics, though their names vary considerably. Due to the cutting-edge nature of data analytics, there is no common course structure. This research conducts a survey of the curricula of about 80 Data Analytics master's programs found on their universities' web sites. We classify the courses in these curricula into five categories, namely Data Science, Capstone, Business, Information Technology, and Application, and perform an analysis of the various curricula. Our initial study indicates that most curricula adopt a mixed course structure, involving courses from several categories. Through the detailed analysis, we provide suggested curricula with different focuses.
Advisors/Committee Members: Wan-Shiou Yang (chair), Shih-Chieh Hsu (chair), San-Yia Hwang (committee member).

▼ Organizations are increasingly collecting sensitive information about individuals. Extracting value from this data requires providing analysts with flexible access, typically in the form of databases that support SQL queries. Unfortunately, allowing access to data has been a major cause of privacy breaches. Traditional approaches for data security cannot protect privacy of individuals while providing flexible access for analytics. This presents a difficult trade-off. Overly restrictive policies result in underutilization and data siloing, while insufficient restrictions can lead to privacy breaches and data leaks. Differential privacy is widely recognized by experts as the most rigorous theoretical solution to this problem. Differential privacy provides a formal guarantee of privacy for individuals while allowing general statistical analysis of the data. Despite extensive academic research, differential privacy has not been widely adopted in practice. Additional work is needed to address performance and usability issues that arise when applying differential privacy to real-world environments. In this dissertation we develop empirical and theoretical advances towards practical differential privacy. We conduct a study using 8.1 million real-world queries to determine the requirements for practical differential privacy, and identify limitations of previous approaches in light of these requirements. We then propose a novel method for differential privacy that addresses key limitations of previous approaches. We present Chorus, an open-source system that automatically enforces differential privacy for statistical SQL queries. Chorus is the first system for differential privacy that is compatible with real databases, supports queries expressed in standard SQL, and integrates easily into existing data environments. Our evaluation demonstrates that Chorus supports 93.9% of real-world statistical queries, integrates with production databases without modifications to the database, and scales to hundreds of millions of records. Chorus is currently deployed at a large technology company for internal analytics and GDPR compliance. In this capacity, Chorus processes more than 10,000 queries per day.
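Chorus itself rewrites and analyzes SQL; as a minimal illustration of the guarantee it enforces, the sketch below applies the standard Laplace mechanism to a counting query (the query, epsilon, and counts are illustrative and not taken from the Chorus codebase).

```python
import numpy as np

def laplace_count(true_count, epsilon):
    """Differentially private count: a counting query has sensitivity 1,
    so adding Laplace(1/epsilon) noise satisfies epsilon-differential privacy."""
    noise = np.random.default_rng().laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# A statistical SQL query such as
#   SELECT COUNT(*) FROM trips WHERE city = 'Berlin'
# would have its exact result perturbed before release:
print(laplace_count(true_count=12450, epsilon=0.1))
```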

▼ Organisations with high-performing data and analytics capabilities are more successful than organisations with lower analytics maturity. It is therefore necessary for organisations to assess their analytics capabilities and needs in order to identify and evaluate areas of improvement that need to be addressed. This was the purpose of this case study, conducted on a region in a global B2B organisation that has a centrally established analytics function at corporate level and wants the use of analytics to be integrated into more of the region's processes, with analytical capabilities and resources used as efficiently as possible. To fulfil the thesis purpose, empirical data was collected through qualitative interviews with employees at corporate level, more quantitative interviews with regional employees, and a questionnaire issued to regional employees. This was complemented with a thorough literature study which provided the analytics maturity models used for identifying the region's current capabilities on a holistic level, as well as analytics setups, Lean Six Sigma, and Knowledge Management. Results show a relatively low analytics maturity due to, for example, insufficient support from management, unclear responsibility for analytics, data not being used correctly or requested enough, and various issues with competence, tools, and sources. This study contributes to analytics research by identifying that analytics maturity models available free of charge are only suitable for inspiration, not full use, in a large company. Furthermore, the study shows that complexities arise when a central analytics function with low analytics maturity exists while other parts of the company face analytics problems but no indication is given of who should act or how to proceed. This study therefore contributes a proposition for companies wanting to increase their analytics maturity: this can be facilitated by establishing networks for analytics. Combining literature and empirics shows that networks enable investigation of the analytics situation while at the same time enabling increased sharing, collaboration, innovation, coordination, and dissemination. By making Lean Six Sigma a central part of the network, analytics will be used more and better, while at the same time increasing the success rate of change and improvement projects.

▼ The rapid growth of e-commerce contributes not only to an increase in the number of online shoppers but also to new changes in customer behaviour. Surveys have revealed that online shoppers' brand loyalty and store loyalty are declining. In addition, the transparency of feedback affects customers' purchase intention. In the context of these changes, online sellers are faced with challenges in regard to their customer relationship management (CRM). They are interested in identifying high-value customers from a mass of online shoppers, and in knowing the factors that might have an impact on those high-value customers. This thesis aims to address these questions.
Our research is conducted on an eBay dataset that includes transaction and associated feedback information from the second quarter of 2013. Focusing on the sellers and buyers in that dataset, we propose an approach for measuring the value of each seller-buyer pair so as to help sellers capture high-value customers. For a seller, the value of each of its customers is obtained, and we create a customer value distribution for the seller so that the seller knows the consumption abilities of the majority of its customers. Next, we categorize sellers based on their customer value distributions into four groups, representing the majority of customers as being of high, medium, low, and balanced value, respectively. After this classification, we compare the performance of each group in terms of sales, the percentage of successful transactions, and the seller level labelled by the eBay system. Furthermore, we apply logistic regression and clustering to the sellers' feedback data in order to investigate whether a seller's reputation has an impact on the seller's customer value distribution. From the experimental results, we conclude that the effect of negative ratings on a seller's customer value distribution is more significant than that of positive ratings. Also, higher ratings for "Item as Described" and "Shipping and Handling Charges" are more likely to help the seller attract more high-value buyers.
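A minimal sketch of the pipeline described above, under the simplifying assumption that a seller-buyer pair's value is the buyer's total spend with that seller and that sellers are grouped by a crude share-based rule; column names, thresholds, and the grouping logic are assumptions, not the thesis's actual measures.

```python
import pandas as pd

# Hypothetical transaction table; column names are assumptions.
tx = pd.DataFrame({
    "seller_id": ["s1", "s1", "s1", "s2", "s2"],
    "buyer_id":  ["b1", "b1", "b2", "b3", "b4"],
    "price":     [120.0, 80.0, 15.0, 300.0, 25.0],
})

# Value of each seller-buyer pair: total spend (simplifying assumption).
pair_value = tx.groupby(["seller_id", "buyer_id"])["price"].sum().rename("value")

def seller_group(values, low=50, high=200):
    """Coarse grouping of a seller by where the bulk of its customers fall."""
    share_high = (values > high).mean()
    share_low = (values < low).mean()
    if share_high > 0.5:
        return "mostly high-value"
    if share_low > 0.5:
        return "mostly low-value"
    return "balanced/medium"

# Customer value distribution per seller, then the seller grouping.
print(pair_value.groupby(level="seller_id").apply(seller_group))
```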
Advisors/Committee Members: Wu, Kui (supervisor).

▼

Dissertation for the degree of Master in Informatics Engineering (Engenharia Informática)

The field of geovisual analytics focuses on visualization techniques that analyze spatial data by enhancing human cognition. However, spatial data also has a temporal component that is practically disregarded when using conventional geovisual analytic tools. Some proposals have been made for techniques to analyze spatiotemporal data, but most were made for specific use cases and are hard to generalize to other situations. There was a need to create a method to describe and compare the existing techniques.
A catalog that provides a clear description of a set of techniques dealing with spatiotemporal data is proposed. This allows the identification of the most useful techniques depending on the required criteria. The description of a technique in the catalog relies on two proposed frameworks. The first framework is used for describing spatiotemporal datasets in terms of data scenarios, a class of datasets; twenty-three data scenarios are described using this framework. The second framework is used for describing analytical tasks on spatiotemporal data; nine different tasks are described using this framework.
This document also proposes two new geovisual analytic techniques that can be applied to spatiotemporal data: the attenuation & accumulation map technique and the overlapping spatiotemporal windows technique. A prototype that implements both techniques was developed as a proof of concept.

▼ Improvements in wearable sensor devices make it possible to constantly monitor physiological parameters such as electrocardiograph (ECG) signals for long periods. Remote patient monitoring with wearable sensors has an important role to play in health care, particularly given the prevalence of chronic conditions such as cardiovascular disease (CVD), one of the prominent causes of morbidity and mortality worldwide. Approximately 4.2 million Australians suffer from long-term CVD, with approximately one death every 12 minutes. The assessment of ECG features, especially heart rate variability (HRV), represents a non-invasive technique which provides an indication of autonomic nervous system (ANS) function. Conditions such as sudden cardiac death, hypertension, heart failure, myocardial infarction, ischaemia, and coronary heart disease can be detected from HRV analysis. In addition, the analysis of ECG features can also be used to diagnose many types of life-threatening arrhythmias, including ventricular fibrillation and ventricular tachycardia. Non-cardiac conditions, such as diabetes, obesity, metabolic syndrome, insulin resistance, irritable bowel syndrome, dyspepsia, anorexia nervosa, anxiety, and major depressive disorder, have also been shown to be associated with HRV.
The analysis of ECG features from real-time ECG signals generated by wearable sensors presents distinctive challenges. The sensors that receive and process the signals have limited power, storage, and processing capacity. Consequently, algorithms that process ECG signals need to be lightweight, use minimal storage resources, and accurately detect abnormalities so that alarms can be raised. The existing literature details only a few algorithms which operate within the constraints of wearable sensor networks. This research presents four novel techniques that enable ECG signals to be processed within these resource constraints to detect some key abnormalities in heart function.
- The first technique is a novel real-time ECG data reduction algorithm, which detects and transmits only those key points that are critical for the generation of ECG features for diagnoses.
- The second technique accurately predicts the five-minute HRV measure using only three minutes of data, with an algorithm that executes in real time using minimal computational resources.
- The third technique introduces a real-time ECG feature recognition system that can be applied to diagnose life-threatening conditions such as premature ventricular contractions (PVCs).
- The fourth technique advances a classification algorithm to enhance the performance of automated ECG classification in determining arrhythmic heart beats based on noisy ECG signals.
The four novel techniques are evaluated in comparison with benchmark algorithms for each task on the standard MIT-BIH Arrhythmia Database and with data generated from patients in a major hospital using Shimmer3 wearable ECG sensors. The four techniques are integrated to demonstrate that remote patient monitoring of ECG…
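As background for the HRV-related techniques, a minimal sketch of two standard time-domain HRV measures (SDNN and RMSSD) computed from RR intervals is shown below; the thesis's real-time, resource-constrained algorithms are considerably more involved.

```python
import numpy as np

def hrv_time_domain(rr_intervals_ms):
    """Compute SDNN and RMSSD, two common time-domain HRV measures,
    from a sequence of RR (beat-to-beat) intervals in milliseconds."""
    rr = np.asarray(rr_intervals_ms, dtype=float)
    sdnn = np.std(rr, ddof=1)              # overall variability
    diffs = np.diff(rr)
    rmssd = np.sqrt(np.mean(diffs ** 2))   # short-term variability
    return sdnn, rmssd

# Toy usage: a minute of simulated RR intervals around 800 ms (~75 bpm)
rng = np.random.default_rng(3)
rr = 800 + rng.normal(0, 40, size=75)
print(hrv_time_domain(rr))
```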

▼ Email correspondence has become the predominant method of communication for businesses. If not for the inherent privacy concerns, this electronically searchable data could be used to better understand how employees interact. After the Enron dataset was made available, researchers were able to provide great insight into employee behaviors based on the available data, despite the many challenges with that dataset. The work in this thesis demonstrates a suite of methods applied to an appropriately anonymized academic email dataset created from volunteers' email metadata. This new dataset, from an internal email server, is first used to validate feature extraction and machine learning algorithms in order to generate insight into the interactions within the center. Based solely on email metadata, a random forest approach models behavior patterns and predicts employee job titles with 96% accuracy. This result represents classifier performance not only on participants in the study but also on other members of the center who were connected to participants through email. Furthermore, the data revealed relationships not present in the center's formal operating structure. The culmination of this work is an organic organizational chart, which contains a fuller understanding of the center's internal structure than can be found in the official organizational chart.
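A minimal sketch of the classification setup described above, with hypothetical per-person metadata features and placeholder labels; the thesis's actual feature extraction from email metadata is richer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical per-person features derived from email metadata, e.g.
# messages sent, messages received, distinct contacts, after-hours ratio.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 4, size=200)   # placeholder job-title classes

# Random forest classifier evaluated with 5-fold cross-validation.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```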
Advisors/Committee Members: McGwier, Robert W. (committeechair), Beex, Aloysius A. (committee member), Huang, Bert (committee member), Buehrer, Richard M. (committee member).

▼ The aim of the thesis is to identify a roadmap for well-established companies towards business model innovation that exploits the value of data analytics. The business model innovation currently taking place at Caterpillar and Ericsson in order to explore data analytics value is presented to answer the question: “How do established companies explore data analytics to innovate their business models?” Initially, the problem discussion, formulation, and purpose are given. Then, the relevant theory is presented, covering the importance of data analytics, IT infrastructure challenges due to the increased volume of data created, data analytics methods currently in use, smart connected products, and the Internet of Things. The meaning of business model innovation is given, followed by a well-structured business model process which includes the business model canvas for representation purposes. The business areas affected by data analytics value and the barriers to business model innovation are given as well. After that, the theory addressing business model innovation to explore data analytics value is presented, and the main industries currently on this journey are identified, along with the required initial steps and the business models that can come out of this process. The challenges and risks of not following this route are also shown. The method section follows, explaining the case study design, data collection method, and way of analysis. The results cover all the information gathered from numerous sources, including online available information, papers, interviews, videos, end-of-year reviews and, most importantly, answers from current Caterpillar and Ericsson mid-level management employees to a questionnaire created and distributed by the authors. The business model canvas tool is used to aid the reader in understanding Caterpillar's and Ericsson's business model innovation. Each company's business model is given before and after data analytics adoption. Finally, the analysis of the results and the link with the theory is given in order to answer the thesis question.

▼ Technological advances have led to a proliferation of data characterized by a complex structure; namely, high-dimensional attribute information complemented by relationships between the objects or even the attributes. Classical data mining techniques usually explore the attribute space, while network analytic techniques focus on the relationships, usually expressed in the form of a graph. However, visualization techniques offer the possibility to gain useful insight through appropriate graphical displays coupled with data mining and network analytic techniques.
In this thesis, we study various topics of the visual analytic process. Specifically, in chapter 2, we propose a visual analytic algebra geared towards attributed graphs. The algebra defines a universal language for graph data manipulations during the visual analytic process and allows documentation and reproducibility. In chapter 3, we extend the algebra framework to address the uncertain querying problem. The algebra's operators are illustrated on a number of synthetic and real data sets, implemented in an existing visualization system (Cytoscape) and validated through a small user study.
In chapter 4, we introduce a dimension reduction technique that through a regularization framework incorporates network information either on the objects or the attributes. The technique is illustrated on a number of real world applications.
Finally, in the last part of the thesis, we present a multi-task generalized linear model that improves the learning of a single task (problem) by utilizing information from connected/similar tasks through a shared representation. We present an algorithm for estimating the parameters of the problem efficiently and illustrate it on a movie ratings data set.
Advisors/Committee Members: Michailidis, George (committee member), Jagadish, Hosagrahar V. (committee member), Shedden, Kerby A. (committee member), Zhu, Ji (committee member).

▼ Isomer networks provide a mechanism to understand and interpret relationships between organic molecules, with applications in medicinal chemistry and drug design. The extraction of isomer networks is a time- and data-intensive computation. The contributions of this dissertation are a variety of techniques to compute isomer networks more efficiently with respect to time and memory. Specifically, we describe our efforts to improve the network extraction process by 1) using the symmetry present in most molecules to reduce run time and memory, and streamlining the algorithm used for the detection of duplicate canonical names, a key step in determining the bond count distances between pairs of isomers; together, these techniques result in reductions in memory of up to 60% and improvements in runtime of up to a factor of 100. 2) Developing an optimal grouping algorithm to subdivide an all-against-all computation with large memory requirements. The algorithm provides a solution to subdivide the "big data" problem that arises in the construction of isomer networks into several independent "small data" problems. Our results show that the grouping algorithm can help divide large data sets into independent smaller ones that can be processed in parallel. 3) Generating the isomer network for 1,050,125 isomers of Nicotine (with a preliminary analysis of the same) using the cloud computing capabilities of Amazon Web Services and Microsoft Azure. These techniques can also be employed to successfully compute isomer networks for other chemical compounds.
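As an illustration of the grouping idea in item 2, the sketch below splits an all-against-all comparison into independent block pairs that can be processed in parallel; the thesis's optimal grouping algorithm is more sophisticated than this fixed-size blocking.

```python
from itertools import combinations_with_replacement

def blocked_pairs(n_items, block_size):
    """Split an all-pairs comparison over n_items into independent block pairs.
    Each (block_a, block_b) job only needs two blocks in memory, can run in
    parallel with the others, and together they cover every item pair once."""
    blocks = [range(s, min(s + block_size, n_items))
              for s in range(0, n_items, block_size)]
    for a, b in combinations_with_replacement(range(len(blocks)), 2):
        yield blocks[a], blocks[b]

# Toy usage: 10 items in blocks of 4 -> 6 independent jobs
for block_a, block_b in blocked_pairs(10, 4):
    pairs = [(i, j) for i in block_a for j in block_b if i < j]
    print(len(pairs), "pairs in job", (list(block_a), list(block_b)))
```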
Advisors/Committee Members: Mehta, Dinesh P. (advisor), Han, Qi (committee member), Wu, Bo (committee member), Ciobanu, Cristian V. (committee member).

▼ This thesis presents the development of methodologies for the hybridisation of an energy system with solar Photovoltaic (PV) generation, an Anaerobic Digestion (AD) biogas power plant, and electrical energy storage (EES). This includes the system sizing and operating regime. The key aspect is to evaluate the Levelized Cost of Electricity (LCOE) for the system and the generation assets. The challenge arises in analysing the economic projections for PV and EES hybrid systems, because EES does not produce electricity and is not a conventional generation source. Commonly, the cost of a generating asset or the power system is evaluated using LCOE; a Levelized Cost of Delivery (LCOD) has been proposed to calculate the LCOE for the EES. A deterministic approach for sizing an off-grid PV and EES with AD biogas power plant to meet a proportionally scaled-down demand of the national load has been implemented. The aim is to achieve a minimal LCOE for the system while minimising the energy imbalance between generation and demand due to both AD and PV generator constraints. To reduce the amount of irradiance data and to provide a standardised methodology for PV system design and analysis, a feature extraction technique has been used to discover Clearness Index (CI) patterns and to construct centroids for the daily CI profiles. Fuzzy C-Means with Dynamic Time Warping has been used for the classification of daily CI profiles. An operating regime has been proposed for the hybrid system, and the EES degradation cost has been considered via a capacity fade model. The work presented in this thesis has been supported with case studies using real-life solar irradiance and national load data. The proposed methodologies contribute to a more efficient and cost-effective method for PV-AD-EES hybrid system sizing and operation.
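A minimal sketch of the standard LCOE calculation that underlies the evaluation: discounted lifetime costs divided by discounted lifetime energy. The input numbers are illustrative, and the thesis's LCOE and LCOD treatment (including storage degradation) is more detailed.

```python
def lcoe(capex, annual_opex, annual_energy_kwh, lifetime_years, discount_rate):
    """Levelized Cost of Electricity:
    sum of discounted costs / sum of discounted energy over the asset lifetime."""
    costs = capex
    energy = 0.0
    for t in range(1, lifetime_years + 1):
        d = (1 + discount_rate) ** t
        costs += annual_opex / d
        energy += annual_energy_kwh / d
    return costs / energy   # currency units per kWh

# Toy usage: a small PV array (illustrative numbers)
print(lcoe(capex=100_000, annual_opex=1_500,
           annual_energy_kwh=120_000, lifetime_years=25, discount_rate=0.06))
```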

▼ Recently, there is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large datasets. In the meantime, in real-world applications, it is highly desirable to reduce the tedious, inefficient ETL (extract, transform, load) gap between tabular data processing systems and graph processing systems. Unfortunately, those challenges have not been easily met due to the intense memory pressure imposed by the process-centric, message-passing designs that many graph processing systems follow, as well as the separation of tabular data processing runtimes and graph processing runtimes. In this thesis, we explore the application of programming techniques and algorithms from the database systems world to the problem of scalable graph analysis. We first propose a bloat-aware design paradigm for the development of efficient and scalable Big Data applications in object-oriented, GC-enabled languages and demonstrate that programming under this paradigm does not incur significant programming burden but obtains remarkable performance gains (e.g., 2.5X). Based on this design paradigm, we then build Pregelix, an open-source distributed graph processing system based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open-source systems (e.g., we have seen up to 15X speedup compared to Apache Giraph and up to 35X speedup compared to distributed GraphLab). Finally, we integrate Pregelix with the open-source Big Data management system AsterixDB to offer users a mix of a vertex-oriented programming model and a declarative query language for richer forms of Big Graph analytics with reduced ETL pains.
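A minimal, single-machine sketch of the vertex-oriented (Pregel-style) programming model that Pregelix executes is shown below, using PageRank as the canonical example; the real system runs such supersteps as out-of-core iterative dataflow jobs rather than Python loops.

```python
def pagerank_vertex_program(adjacency, num_supersteps=20, damping=0.85):
    """Pregel-style PageRank: in each superstep every vertex sums the
    messages it received, updates its value, and sends value/out-degree
    to its neighbours."""
    n = len(adjacency)
    rank = {v: 1.0 / n for v in adjacency}
    for _ in range(num_supersteps):
        messages = {v: 0.0 for v in adjacency}
        for v, neighbours in adjacency.items():            # "compute" phase
            if neighbours:
                share = rank[v] / len(neighbours)
                for u in neighbours:
                    messages[u] += share                     # message passing
        rank = {v: (1 - damping) / n + damping * messages[v] for v in adjacency}
    return rank

# Toy usage on a 4-vertex graph
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
print(pagerank_vertex_program(graph))
```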

▼ Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of datasets to gain insights. At the same time, data variety increases continuously across multiple axes. First, data comes in multiple formats, such as the binary tabular data of a DBMS, raw textual files, and domain-specific formats. Second, different datasets follow different data models, such as the relational and the hierarchical one. Data location also varies: Some datasets reside in a central "data lake", whereas others lie in remote data sources. In addition, users execute widely different analysis tasks over all these data types. Finally, the process of gathering and integrating diverse datasets introduces several inconsistencies and redundancies in the data, such as duplicate entries for the same real-world concept. In summary, heterogeneity significantly affects the way data analysis is performed. In this thesis, we aim for data virtualization: Abstracting data out of its original form and manipulating it regardless of the way it is stored or structured, without a performance penalty. To achieve data virtualization, we design and implement systems that i) mask heterogeneity through the use of heterogeneity-aware, high-level building blocks and ii) offer fast responses through on-demand adaptation techniques. Regarding the high-level building blocks, we use a query language and algebra to handle multiple collection types, such as relations and hierarchies, express transformations between these collection types, as well as express complex data cleaning tasks over them. In addition, we design a location-aware compiler and optimizer that masks away the complexity of accessing multiple remote data sources. Regarding on-demand adaptation, we present a design to produce a new system per query. The design uses customization mechanisms that trigger runtime code generation to mimic the system most appropriate to answer a query fast: Query operators are thus created based on the query workload and the underlying data models; the data access layer is created based on the underlying data formats. In addition, we exploit emerging hardware by customizing the system implementation based on the available heterogeneous processors (CPUs and GPGPUs). We thus pair each workload with its ideal processor type. The end result is a just-in-time database system that is specific to the query, data, workload, and hardware instance. This thesis redesigns the data management stack to natively cater for data heterogeneity and exploit hardware heterogeneity. Instead of centralizing all relevant datasets, converting them to a single representation, and loading them in a monolithic, static, suboptimal system, our design embraces heterogeneity. Overall, our design decouples the type of performed analysis from the original data layout; users can perform their analysis across data stores, data models, and data formats, but at the same time experience the performance offered…
Advisors/Committee Members: Ailamaki, Anastasia.

▼ There is an increasing demand to visualize large datasets as human observable reports in order to quickly draw insights and gain timely awareness from the data. An interactive user interface is an indispensable tool that allows users to analyze the data from different perspectives and to inspect the result from the global overview to the finest granularity. To enable this type of interactive user experience, the front-end can generate new requests on the fly, and the results must be computed and delivered within seconds. Big Data platforms can take tens or hundreds of seconds to complete an OLAP-style query, so there is a need for a solution that can meet the stringent latency requirement of interactive visualization frontends. In this thesis, we address the interactivity challenges from a middleware perspective to provide a generic solution that can utilize existing database systems as a "black box" to support various interactive visualization applications efficiently. We present Cloudberry, an open-source general-purpose middleware system to support interactive analytics and visualization on big data with various attributes. It can automatically create, maintain, and delete materialized views by analyzing each request and its results. We build an application called "TwitterMap" using Cloudberry to demonstrate its suitability to support interactive analytics and visualization on more than one billion tweets (about 2TB). We then present a query slicing technique in Cloudberry, called Drum, that can "slice" a query into small pieces (called "mini-queries") so that the middleware can send these mini-queries to the DBMS one by one and compute results progressively. Our experiments on a large, real dataset show that the Drum technique can reduce the delay of delivering intermediate results to the user without much reduction of the overall speed. Finally, we present a method of using LSM filters to accelerate secondary-to-primary index search under the LSM storage setting. We have implemented it in Apache AsterixDB, and our experiments show that the new approach can reduce the query time by 20% to 70% for different queries.
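A minimal sketch of the slicing idea behind Drum: one long time-range aggregation is split into mini-queries over sub-ranges whose partial results are merged and delivered progressively. In the real system the slice sizes are chosen adaptively; here they are fixed, and the mini-query executor is a stand-in for the DBMS.

```python
from datetime import datetime, timedelta
from collections import Counter

def sliced_counts(run_mini_query, start, end, slice_days=7):
    """Issue one mini-query per time slice and merge results as they arrive,
    so the front-end can be updated progressively instead of waiting
    for the full-range query to finish."""
    merged = Counter()
    t = start
    while t < end:
        t_next = min(t + timedelta(days=slice_days), end)
        partial = run_mini_query(t, t_next)   # e.g., counts per state for [t, t_next)
        merged.update(partial)
        yield t_next, dict(merged)            # progressive result after each slice
        t = t_next

# Toy usage with a fake mini-query executor standing in for the DBMS
def fake_mini_query(t0, t1):
    return {"CA": (t1 - t0).days * 10, "TX": (t1 - t0).days * 7}

for upto, result in sliced_counts(fake_mini_query,
                                  datetime(2017, 1, 1), datetime(2017, 2, 1)):
    print(upto.date(), result)
```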

▼ The increasing size of modern applications produces huge amounts of data, which in turn leads to a new challenge for data mining and big data analytics. Researchers often use the five V's (Volume, Velocity, Variety, Veracity, and Value) to describe the features of big data. Interest in discovering patterns from large collections of data has risen in both academic and industrial areas. Examples of rich sources of big data are online social networks like Facebook or Twitter. Embedded in these online social activities are useful information and knowledge. Recently, although some algorithms have been proposed to mine data at large scale, they have mostly focused on the volume aspect. Unfortunately, not many approaches have focused on data variety, which is also a critical criterion for the mining process. The composition of a dataset could be sparse or dense, or not uniformly distributed. For example, a list of common friends in an online social network can be dense if two people share a lot of common friends; it could be sparse otherwise. For my MSc thesis, I design and implement a big data analytic algorithm that tackles both the volume and variety aspects of big data.
Advisors/Committee Members: Leung, Carson K. (Computer Science) (supervisor), Wang, Yang (Computer Science).

▼ Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. These frameworks significantly reduce the complexity of developing big data programs and applications. However, in reality, many real-world scenarios require pipelining and integration of multiple big data jobs. As the big data pipelines and applications become more and more complicated, it is almost impossible to manually optimize the performance for each component, not to mention the whole pipeline/application. At the same time, there are also increasing requirements to facilitate interaction, composition and integration for big data analytics applications in continuously evolving, integrating and delivering scenarios. In addition, with the emergence and development of cloud computing, mobile computing and the Internet of Things, data are increasingly collected and stored in highly distributed infrastructures (e.g. across data centres, clusters, racks and nodes). To deal with the challenges above and fill the gap in existing big data processing frameworks, we present the Hierarchically Distributed Data Matrix (HDM) along with the system implementation to support the writing and execution of composable and integrable big data applications. HDM is a light-weight, functional and strongly-typed meta-data abstraction which contains complete information (such as data format, locations, dependencies and functions between input and output) to support parallel execution of data-driven applications. Exploiting the functional nature of HDM enables deployed applications of HDM to be natively integrable and reusable by other programs and applications. In addition, by analysing the execution graph and functional semantics of HDMs, multiple automated optimizations are provided to improve the execution performance of HDM data flows. Moreover, by extending the kernel of HDM, we propose a multi-cluster solution which enables HDM to support large scale data analytics among multi-cluster scenarios. Drawing on the comprehensive information maintained by HDM graphs, the runtime execution engine of HDM is also able to provide provenance and history management for submitted applications. We conduct comprehensive experiments to evaluate our solution compared with the current state-of-the-art big data processing framework – Apache Spark.
Advisors/Committee Members: Zhu, Liming, Computer Science & Engineering, Faculty of Engineering, UNSW, Sakr, Sherif, Computer Science & Engineering, Faculty of Engineering, UNSW, Paik, Helen, Computer Science & Engineering, Faculty of Engineering, UNSW, Liu, Anna, Amazon, Australia.

With our world becoming more interconnected and our activities more digital, data is more abundant, diverse, and available in real time. Organizations are taking advantage of these massive amounts of data for more precise adjustments of business systems, decision support, and the development of products and services. In this master's thesis, we introduce the characteristics of Big Data and the opportunities and challenges that companies face when analyzing Big Data. Our focus is on technologies for Big Data analytics and Big Data visualization; as an example, we present the SAS Visual Analytics solution. Organizations use Big Data technologies to get answers to important questions through real-time data analysis, without having to wait days, weeks, or even months for results. The most important…

▼ Recent advances in technology have enabled people to add location information to social networks, called Location-Based Social Networks (LBSNs), where people share their communication and whereabouts not only in their daily lives but also during abnormal situations, such as crisis events. However, since the volume of the data exceeds the boundaries of human analytical capabilities, it is almost impossible to perform a straightforward qualitative analysis of the data. The emerging field of visual analytics has been introduced to tackle such challenges by integrating approaches from statistical data analysis and human-computer interaction into highly interactive visual environments. Based on the idea of visual analytics, this research contributes techniques for knowledge discovery in social media data to provide comprehensive situational awareness. We extract valuable hidden information from the huge volume of unstructured social media data and model the extracted information for visualizing meaningful information along with user-centered interactive interfaces. We develop visual analytics techniques and systems for spatial decision support by coupling the modeling of spatiotemporal social media data with scalable and interactive visual environments. These systems allow analysts to detect and examine abnormal events within social media data by integrating automated analytical techniques and visual methods. We provide comprehensive analysis of public behavior response in disaster events by exploring and examining the spatial and temporal distribution of LBSNs. We also propose trajectory-based visual analytics of LBSNs for anomalous human movement analysis during crises, incorporating a novel classification technique. Finally, we introduce a visual analytics approach for forecasting the overall flow of human crowds.
Advisors/Committee Members: David S. Ebert, David S. Ebert, Niklas Elmqvist, Yun Jang, Ji Soo Yi.

▼ High-volume DDoS attacks continue to cause serious financial losses and damage to company reputations, despite years of research in preventing and mitigating them. Many proposed techniques for handling these attacks assume that the attack has already been detected and its traffic properly characterized; yet, existing methods of detecting and characterizing such attacks have not been widely adopted, for various reasons. We describe a scalable real-time DDoS monitoring system that leverages modern big data technologies to effectively analyze high-volume DDoS attacks. Evaluated on multiple large-scale traffic datasets that capture recent real-world DDoS attacks and synthetic traffic based on sophisticated attack characteristics, our approach detects and characterizes these attacks quickly and accurately. Furthermore, we show that our monitoring system 1) clearly justifies its decisions through explainable analysis of input traffic volume metrics, thus increasing monitoring transparency and facilitating the diagnosis and debugging of monitoring performance for network security teams, and 2) leverages identified attack characteristics to separate benign from malicious traffic and to send helpful defense recommendations, the identified attack characteristics, and malicious traffic traces to downstream DDoS traffic filtering systems.
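As a minimal illustration of the kind of explainable volume-metric check such a monitoring system builds on, the sketch below compares per-window traffic counters against a baseline and reports which metrics triggered; the metric names and thresholds are assumptions, and the actual detection and characterization pipeline is far more elaborate.

```python
def check_window(metrics, baseline, threshold=3.0):
    """Flag a traffic window as suspicious if any volume metric exceeds
    its baseline by more than `threshold` times, and report which ones.

    metrics / baseline: dicts such as {"pkts_per_s": ..., "bytes_per_s": ...,
    "unique_src_ips": ...} (metric names are illustrative)."""
    triggered = {name: (value, baseline[name])
                 for name, value in metrics.items()
                 if baseline.get(name, 0) > 0 and value > threshold * baseline[name]}
    return bool(triggered), triggered

# Toy usage: the packet rate and source-IP count explain the alert
base = {"pkts_per_s": 10_000, "bytes_per_s": 8_000_000, "unique_src_ips": 500}
window = {"pkts_per_s": 95_000, "bytes_per_s": 9_000_000, "unique_src_ips": 42_000}
print(check_window(window, base))
```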

As Internet usage keeps increasing, the number of web sites, and hence the number of web pages, also keeps increasing, so there is a need to align the user experience with the overall purposes of a website. Toward this requirement, the proposed recommendation systems suggest to the user pages that might be of interest, based on past navigation profiles of overall site usage. Most existing recommendation systems are based on association rules or on keywords (when content is considered). However, when usage data are scarce or sparse, or when sequential order must be considered, such traditional approaches may become unsuitable. Conversely, the Web Analytics arena, following a different paradigm, has experienced considerable growth through mature tools that allow the collection and analysis of internet data in order to understand and optimize website efficiency and efficacy. This work proposes the development of a recommendation system based on the Google Analytics tool. The prototype consists of two main components: 1) a service responsible for the construction of recommendations and the associated logic; 2) a library embeddable in any website that furnishes the website with a configurable recommendation widget. Preliminary evaluations showed that the implementation follows the logic of the proposed model.
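A minimal sketch of a sequence-aware recommender built from navigation paths of the kind that can be exported from Google Analytics: a first-order Markov model over page transitions. The session format and the model are illustrative assumptions, not the prototype's actual logic or the Google Analytics API.

```python
from collections import defaultdict, Counter

def build_transition_model(sessions):
    """Count page-to-page transitions over all navigation sessions."""
    transitions = defaultdict(Counter)
    for path in sessions:
        for current_page, next_page in zip(path, path[1:]):
            transitions[current_page][next_page] += 1
    return transitions

def recommend(transitions, current_page, k=3):
    """Recommend the k pages most often visited right after current_page."""
    return [page for page, _ in transitions[current_page].most_common(k)]

# Toy usage with assumed session paths (each a list of page URLs)
sessions = [
    ["/home", "/products", "/products/shoes", "/cart"],
    ["/home", "/blog", "/products", "/products/shoes"],
    ["/home", "/products", "/products/bags"],
]
model = build_transition_model(sessions)
print(recommend(model, "/products"))
```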