Data Science Series

Data play an intimate role in all aspects of life, personal and professional, from individual online purchases to business and science. The deluge of data generated from many sources, so-called “Big Data”, presents enormous challenges for analysis, let alone for deriving the insights the data contain. The knowledge required to analyze such data is drawn from multiple disciplines, such as statistics, machine learning, information science, and computer science, which can take years to master.

The following series of Data Science courses provides hands-on learning and some theoretical background toward this end. By the end of the series, students should be well equipped to apply cutting-edge Data Science techniques, such as Artificial Neural Networks, Deep Learning, Reinforcement Learning, Random Forests, regression of various kinds, model selection, and predictive modelling, and to formulate real-world questions as a series of analyses, all the way through the basics of interpreting results. Throughout the series, students will analyze real-world data from various fields, such as finance, education, public health, and genetics, using the freely available programming languages R and Python.

The Data Science course series is organized into a three-year program. Below are some learning highlights from each course.

Data Science I (Year 1)

In year one, students learn to program in R while focusing primarily on different types of statistical analysis, using a variety of datasets for context and interest. The statistical focus includes comparison vs. parametric statistics. Students learn about:

Problem formulation, different types of statistical models for analyzing data, and how to determine which model is most applicable to the problem posed. Models to be explored include:

linear regression

linear mixed effect model

logistic regression & survival analysis

prediction modeling

generalized linear regression

Analyzing outputs will require students to learn a series of analysis essentials and techniques, including:

The t-test and the theory behind it, as well as how to interpret other values (r, ANOVA)

Hazard Ratio

Kaplan-Meier curve

Mann-Whitney U test

Fisher’s test

Kolmogorov-Smirnov test

Basics of Bayesian analysis and the theory behind it, including:

Bayesian method

Regression as a Bayesian problem

Acceptance-rejection method

Basics of Markov chains and/or random walks

Basics of the Markov chain Monte Carlo (MCMC) method

Missing data imputation

MICE method
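As an illustration of one Year 1 topic, the acceptance-rejection method listed above can be sketched in a few lines. This is a minimal sketch, not course material: the Beta(2, 2) target density, the uniform proposal, the bound of 1.5, and the function names are all assumptions chosen for the example, and it is written in Python rather than the R used in year one.

```python
import random


def target_density(x):
    """Assumed target: the Beta(2, 2) density on [0, 1], f(x) = 6x(1 - x)."""
    return 6.0 * x * (1.0 - x)


def rejection_sample(n, rng, bound=1.5):
    """Draw n samples from target_density using a Uniform(0, 1) proposal.

    The bound must satisfy f(x) <= bound * g(x) for every x. Here the
    proposal density is g(x) = 1 and the maximum of f is 1.5 (at x = 0.5).
    """
    samples = []
    while len(samples) < n:
        x = rng.random()  # candidate drawn from the proposal
        u = rng.random()  # uniform draw for the accept/reject decision
        if u * bound <= target_density(x):
            samples.append(x)  # accepted with probability f(x) / bound
    return samples


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = rejection_sample(5000, rng)
sample_mean = sum(draws) / len(draws)
print(f"sample mean: {sample_mean:.3f} (theoretical mean is 0.5)")
```

With this bound, roughly two out of three candidates are accepted; a looser bound would remain correct but waste more draws.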

Data Science II Preview (Year 2)

In year two, students program in Python, and the course focuses on machine learning, starting with the building blocks: the basics of information theory, Expectation-Maximization (EM) theory, and the Bayesian method as applied to supervised learning. Students continue to explore different types of computational learning methods, e.g., Principal Component Analysis (PCA), network approaches, clustering approaches, ensemble learning, and structural equation modeling (SEM). In so doing, students take the logical learning steps for building the competencies needed to understand machine-learning theory and techniques, including Ridge regression, LASSO, and Elastic Net, through which they explore various tradeoffs and analytical methods.
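One of the tradeoffs behind Ridge regression can be seen in a toy case. The sketch below is an illustration under stated assumptions, not the course's own material: it uses a single feature with no intercept, made-up data lying exactly on y = 2x, and a hypothetical helper named ridge_slope. In that setting the penalized least-squares slope has the closed form sum(x*y) / (sum(x*x) + penalty).

```python
def ridge_slope(xs, ys, penalty):
    """Slope b minimizing sum((y - b*x)**2) + penalty * b**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Toy data lying exactly on the line y = 2x (assumed for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

penalties = (0.0, 10.0, 100.0)
slopes = [ridge_slope(xs, ys, lam) for lam in penalties]

for lam, slope in zip(penalties, slopes):
    print(f"penalty={lam:6.1f}  slope={slope:.4f}")
```

With no penalty the slope is the ordinary least-squares value of 2.0; as the penalty grows, the estimate shrinks toward zero, trading some bias for lower variance.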

The final focus of the course is Deep Learning: basic theory, applications (e.g., understanding TensorFlow, and vision and natural language processing applications), and exploration of areas including:

Convolutional Neural Network

Deep Belief Network

Recurrent Neural Network

By the conclusion of the series, students understand and can use different models, analytical tools, and techniques.