Xing Wu

Introduction

I'm a thesis-based master's student in the Department of Electrical & Computer Engineering at Concordia University under the supervision of Dr. Yan Liu. My primary interest lies in distributed system frameworks for big data processing, such as distributed stream processing systems (e.g., Apache Storm, Apache S4) and batch-oriented data processing systems (e.g., Apache Hadoop), especially those suited to recommendation and other artificial intelligence scenarios.

Prior to studying at Concordia University, I received my master's degree in Computer Software and Theory from Wuhan University and worked at Tencent (QQ.COM, a top-10 website in the world as ranked by Alexa) as a software engineer for three years.

Education

Concordia University, Montreal, Canada, Sept. 2013 - Now

Master of Applied Science in ECE

Wuhan University, Wuhan, China, Sept. 2008 - Jun. 2010

Master of Computer Software and Theory

Wuhan University, Wuhan, China, Sept. 2004 - Jun. 2008

Bachelor of Computer Science and Technology

Work Experience

Research Intern, Ericsson Inc, Ottawa, Canada, Aug. 2014 - Now

Mobile Big Data Analysis

Build a Hadoop-based data analysis platform to support ad-hoc queries over tens of terabytes of data. Develop ETL tools that load data from LTE base stations into HDFS in real time.

Design and develop a real-time monitoring and performance prediction system for LTE stations.

Research Projects

Load Adaptive Optimization for Incremental Data Processing Platforms

Stream processing software frameworks enable real-time processing of continuous, unbounded streams of data at high speed. Leveraging the elasticity of cloud computing infrastructure, stream processing frameworks can become a Platform as a Service (PaaS) for many domain applications, providing simplified development and run-time management. One issue in making such a PaaS scalable is allocating data processing operators to cluster nodes and balancing the workload dynamically. Since data volume and rate can be unpredictable, a static mapping between operators and cluster resources often results in an unbalanced operator load distribution. This project proposes an optimization method that combines the correlation of resource utilization across nodes with the capacity of the cluster. The associated software components form a layer between a stream processing framework and the cloud clusters and nodes. This layer allows an operator to be transferred to a different cluster node at runtime while remaining transparent to developers.
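The dynamic operator reallocation described above can be sketched as follows. This is a minimal illustration of the idea, not the project's implementation; all names (rebalance, node_util, op_load, the threshold value) are hypothetical:

```python
def rebalance(placement, node_util, op_load, threshold=0.2):
    """Illustrative load-adaptive operator reassignment (hypothetical sketch).

    placement: operator name -> node name
    node_util: node name -> CPU utilization fraction (0.0 - 1.0)
    op_load:   operator name -> estimated load contribution
    Returns an updated placement mapping.
    """
    busiest = max(node_util, key=node_util.get)
    idlest = min(node_util, key=node_util.get)
    # Only migrate when the imbalance between nodes exceeds the threshold.
    if node_util[busiest] - node_util[idlest] < threshold:
        return placement
    # Pick the heaviest operator currently placed on the busiest node.
    candidates = [op for op, node in placement.items() if node == busiest]
    if not candidates:
        return placement
    mover = max(candidates, key=op_load.get)
    # Move that one operator to the least-loaded node.
    return {**placement, mover: idlest}
```

For example, with operators "parse" and "count" both on an overloaded node n1, a call to `rebalance` would move the heavier of the two to an idle node n2, leaving the rest of the placement untouched.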

Processing large-scale data is an increasingly common and important problem in many domains. The de facto standard programming model, MapReduce, and its associated run-time systems were originally developed at Google. Subsequently, an open-source platform named Hadoop that supports the same programming model has gained tremendous popularity. However, MapReduce was not designed to efficiently process small, independent updates: MapReduce must be run again over both the newly updated data and the old data. Given enough computing resources, MapReduce's scalability makes this approach feasible. However, reprocessing the entire data set discards the work done in earlier runs and makes latency proportional to the size of the entire data set, rather than the size of an update.
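The contrast above can be made concrete with a toy word-count example (a hypothetical sketch, not the project's code): a full recount reprocesses every record, while an incremental update folds only the new records into previously computed results, so its cost scales with the update rather than the whole data set.

```python
from collections import Counter

def full_recount(all_records):
    """MapReduce-style full pass: reprocess old and new data together."""
    return Counter(word for line in all_records for word in line.split())

def incremental_update(prior_counts, new_records):
    """Incremental pass: fold only the new records into prior results."""
    delta = Counter(word for line in new_records for word in line.split())
    prior_counts.update(delta)  # touches only words seen in the update
    return prior_counts
```

Both paths produce identical counts, but the incremental path avoids rereading the historical data on every update.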

S4 is a distributed computing platform for processing continuous unbounded streams of data. The motivation of S4 is to provide a highly scalable software solution (akin to Hadoop for batch data processing) to operate at high data rates and process massive amounts of data.
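S4 structures computation as keyed processing elements (PEs): one PE instance exists per distinct key value, and each incoming event is routed to the instance for its key. The sketch below renders that model in Python purely for illustration (S4 itself is a Java platform, and the class and method names here are hypothetical):

```python
class WordCountPE:
    """A toy processing element: one instance per distinct word."""
    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        # Accumulate state for this key across the unbounded stream.
        self.count += event.get("count", 1)

class KeyedDispatcher:
    """Routes each event to the PE instance owning its key,
    creating the instance on first sight of the key."""
    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.instances = {}  # key -> PE instance

    def dispatch(self, key, event):
        pe = self.instances.setdefault(key, self.pe_class(key))
        pe.process(event)
```

In the real platform this keyed routing is what allows the stream to be partitioned across many nodes, since events for different keys can be processed independently.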

This research aims to present an empirical performance and cost evaluation of both Hadoop and S4 on processing continuous, incrementally updated data streams.