This paper presents an example of how demographic characteristics of patients
influence their susceptibility to certain medical conditions. We investigate
the association between health conditions and patient age in a heterogeneous
population. We show that, besides the symptoms a patient presents with, age has
the potential to aid the diagnostic process in hospitals. Working with
Electronic Health Records (EHR), we show that medical conditions group into
clusters that share distinctive population age densities. We use Electronic
Health Records from Brazil covering a 15-month period from March 2013 to July
2014, comprising 47 million records from 1.7 million patients. The findings
have the potential to help in a setting where an automated system predicts the
condition of a patient given their symptoms and demographic information.
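As a rough illustration of the clustering step (not the exact pipeline used in the paper), each condition can be represented by a normalized age histogram of the patients presenting with it, and those densities can then be clustered; the condition codes, bin width, and data below are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input: one (patient_age, condition_code) pair per record.
ages = np.random.randint(0, 90, size=10000)
codes = np.random.choice(["J11", "I10", "E11", "B34"], size=10000)

bins = np.arange(0, 95, 5)                      # 5-year age bins
conditions = sorted(set(codes))

# Represent each condition by its normalized population age density.
densities = np.array([
    np.histogram(ages[codes == c], bins=bins, density=True)[0]
    for c in conditions
])

# Group conditions whose age densities look alike.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(densities)
for c, l in zip(conditions, labels):
    print(c, "-> cluster", l)
```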

Research on peer effects in sociology has long focused on the power of social
influence in order to investigate the social foundations of social
interactions. This paper extends Xu (2011)'s large-network game model by
allowing for social-influence-dependent peer effects. In a large network, we
use the Katz--Bonacich centrality to measure individuals' social influence. To
address the computational burden that arises when the data come from the
equilibrium of a large network, we extend Aguirregabiria and Mira (2007)'s
nested pseudo likelihood estimation (NPLE) approach to our large-network game
model. Using the Add Health dataset, we investigate peer effects on high school
students' engagement in dangerous behaviors. Our results show that peer effects
are statistically significant and positive. Moreover, a student benefits more
(statistically significant at the 5% level) from her conformity, or
equivalently, pays more for her disobedience in terms of peer pressure, if her
friends have higher social-influence status.
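For reference, one common form of the Katz--Bonacich centrality on an adjacency matrix $A$ with decay parameter $\beta$ is $b = (I - \beta A)^{-1}\beta A \mathbf{1}$, valid when $\beta$ is below the reciprocal of the spectral radius of $A$; the small friendship network below is a hypothetical illustration, not the Add Health network.

```python
import numpy as np

def katz_bonacich(A, beta):
    """Katz--Bonacich centrality b = (I - beta*A)^{-1} beta*A*1.

    Requires beta < 1 / (largest eigenvalue of A) for convergence.
    """
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - beta * A, beta * A @ np.ones(n))

# Hypothetical 4-node friendship network (symmetric, unweighted).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

beta = 0.25  # below 1 / spectral radius (roughly 2.2 for this graph)
print(katz_bonacich(A, beta))
```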

We study the online estimation of the optimal policy of a Markov decision
process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which
exploit the inherent minimax duality of Bellman equations. The SPD methods
update a few coordinates of the value and policy estimates as a new state
transition is observed. These methods use small storage and have low
computational complexity per iteration. The SPD methods find an
absolute-$\epsilon$-optimal policy, with high probability, using
$\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2\sigma^2
}{(1-\gamma)^6\epsilon^2} \right)$ iterations/samples for the infinite-horizon
discounted-reward MDP and $\mathcal{O}\left(\frac{|\mathcal{S}|^4
|\mathcal{A}|^2H^6\sigma^2 }{\epsilon^2} \right)$ for the finite-horizon MDP.

Variation in rates of terrorist activity over time is explained via contagion
or diffusion. Models for social contagion and diffusion are shown to be cases
of the cluster process representation of the Hawkes self-exciting process
model. Contagion and diffusion models exploring variations in endogenous and
exogenous effects are fitted to data from the Global Terrorism Database for
2000--2015. Model selection criteria are shown to differentiate between
contagion and diffusion processes, and events with high fatalities are found to
exert less influence on the probability of future events. The practical applications of
these results include exploratory modelling and forecasting to inform policy
decisions.
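As a reference for the self-exciting structure, a Hawkes process with exponential kernel has conditional intensity $\lambda(t) = \mu + \sum_{t_i < t} \alpha e^{-\beta(t - t_i)}$, whose log-likelihood on $[0, T]$ can be evaluated in closed form; the event times and parameter values below are hypothetical, not fitted Global Terrorism Database values.

```python
import numpy as np

def hawkes_loglik(times, T, mu, alpha, beta):
    """Exact log-likelihood of a Hawkes process with kernel alpha*exp(-beta*t).

    log L = sum_i log(lambda(t_i)) - integral_0^T lambda(t) dt
    """
    times = np.asarray(times, dtype=float)
    loglik = 0.0
    for i, t in enumerate(times):
        excitation = np.sum(alpha * np.exp(-beta * (t - times[:i])))
        loglik += np.log(mu + excitation)
    # Compensator: mu*T + (alpha/beta) * sum_i (1 - exp(-beta*(T - t_i)))
    loglik -= mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - times)))
    return loglik

# Hypothetical event times (e.g. days of incidents) observed on [0, 100].
events = [3.0, 4.5, 5.0, 20.0, 21.5, 60.0]
print(hawkes_loglik(events, T=100.0, mu=0.05, alpha=0.4, beta=1.2))
```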

Bagging is a device intended for reducing the prediction error of learning
algorithms. In its simplest form, bagging draws bootstrap samples from the
training sample, applies the learning algorithm to each bootstrap sample, and
then averages the resulting prediction rules.
We extend the definition of bagging from statistics to statistical
functionals and study the von Mises expansion of bagged statistical
functionals. We show that the expansion is related to the Efron-Stein ANOVA
expansion of the raw (unbagged) functional. The basic observation is that a
bagged functional is always smooth in the sense that the von Mises expansion
exists and is finite, of length one plus the resample size $M$. This holds even if the
raw functional is rough or unstable. The resample size $M$ acts as a smoothing
parameter, where a smaller $M$ means more smoothing.
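The simplest form of bagging described above can be sketched in a few lines; the base learner and data are placeholders, and the resample size $M$ is exposed as the smoothing parameter discussed in the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_new, n_boot=100, M=None, seed=0):
    """Average the predictions of a base learner over bootstrap resamples.

    M is the resample size (defaults to len(y)); a smaller M means more smoothing.
    """
    rng = np.random.default_rng(seed)
    M = len(y) if M is None else M
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=M)        # draw a bootstrap resample
        model = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(model.predict(X_new))
    return np.mean(preds, axis=0)

# Toy data: a rough (unstable) base learner is smoothed by bagging.
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = np.sin(6 * X[:, 0]) + 0.3 * np.random.default_rng(1).normal(size=200)
print(bagged_predict(X, y, X[:5], M=50))
```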

The paper is split into two parts: in the first part, we construct the exact
likelihood for a discretely observed rough differential equation, driven by a
piecewise linear path. In the second part, we use this likelihood in order to
construct an approximation of the likelihood for a discretely observed
differential equation driven by a general class of rough paths. Finally, we
study the behaviour of the approximate likelihood when the sampling frequency
tends to infinity.

Machine learning analysis of neuroimaging data can accurately predict
chronological age in healthy people, and deviations from healthy brain ageing
have been associated with cognitive impairment and disease. Here we sought to
further establish the credentials of "brain-predicted age" as a biomarker of
individual differences in the brain ageing process, using a predictive
modelling approach based on deep learning, and specifically convolutional
neural networks (CNN), and applied to both pre-processed and raw T1-weighted
MRI data. Firstly, we aimed to demonstrate the accuracy of CNN brain-predicted
age using a large dataset of healthy adults (N = 2001). Next, we sought to
establish the heritability of brain-predicted age using a sample of monozygotic
and dizygotic female twins (N = 62). Thirdly, we examined the test-retest and
multi-centre reliability of brain-predicted age using two samples
(within-scanner N = 20; between-scanner N = 11). CNN brain-predicted ages were
generated and compared to a Gaussian Process Regression (GPR) approach, on all
datasets. Input data were grey matter (GM) or white matter (WM) volumetric maps
generated by Statistical Parametric Mapping (SPM) or raw data. Brain-predicted
age represents an accurate, highly reliable, and genetically valid phenotype
that has the potential to be used as a biomarker of brain ageing. Moreover, age
predictions can be accurately generated on raw T1-MRI data, substantially
reducing computation time for novel data, bringing the process closer to giving
real-time information on brain health in clinical settings.
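A minimal 3D CNN age regressor of the general kind described (not the authors' architecture) might be set up as follows in PyTorch; the input size, layer widths, and training targets are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BrainAgeCNN(nn.Module):
    """Small 3D CNN that regresses age from a 1-channel T1-weighted volume."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.regressor = nn.Linear(32, 1)    # single output: predicted age

    def forward(self, x):                    # x: (batch, 1, D, H, W)
        return self.regressor(self.features(x).flatten(1))

model = BrainAgeCNN()
volume = torch.randn(2, 1, 64, 64, 64)       # hypothetical preprocessed volumes
loss = nn.functional.mse_loss(model(volume).squeeze(1), torch.tensor([34.0, 71.0]))
loss.backward()                              # one gradient step of age-regression training
```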

Targeted therapies based on the analysis of genomic aberrations of a tumor have
become a mainstream direction in cancer prognosis and treatment. Regardless of
cancer type, trials that match patients to targeted therapies for their
particular genomic aberrations are well motivated. Therefore, finding
the subpopulation of patients who can most benefit from an aberration-specific
targeted therapy across multiple cancer types is important. We propose an
adaptive Bayesian clinical trial design for patient allocation and
subpopulation identification. We start with a decision theoretic approach,
including a utility function and a probability model across all possible
subpopulation models. The main features of the proposed design and
population-finding methods are that we allow for variable sets of covariates to
be recorded for different patients, adjust for missing data, allow for
high-order interactions of covariates, and adaptively allocate each patient to
a treatment arm using the posterior predictive probability of which arm is best
for that patient. The new method is demonstrated via extensive simulation
studies.
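To illustrate the adaptive-allocation idea in a heavily simplified form (binary outcomes, independent Beta-Binomial arms, no covariates or utility function), the posterior probability that each arm is best can be approximated by Monte Carlo and used to guide the next patient's allocation; everything below is a hypothetical stand-in for the full decision-theoretic design.

```python
import numpy as np

def prob_each_arm_best(successes, failures, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(arm j has the highest response rate | data)."""
    rng = np.random.default_rng(seed)
    # Beta(1 + successes, 1 + failures) posterior draws for each arm.
    draws = rng.beta(1 + np.asarray(successes)[:, None],
                     1 + np.asarray(failures)[:, None],
                     size=(len(successes), n_draws))
    best = np.argmax(draws, axis=0)
    return np.bincount(best, minlength=len(successes)) / n_draws

# Hypothetical interim data for three arms (responders, non-responders).
p_best = prob_each_arm_best(successes=[8, 12, 5], failures=[12, 9, 15])
print("allocation probabilities for the next patient:", p_best)
```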

Monte Carlo methods to evaluate and maximize the likelihood function enable
the construction of confidence intervals and hypothesis tests, facilitating
scientific investigation using models for which the likelihood function is
intractable. When Monte Carlo error can be made small, by sufficiently
exhaustive computation, then the standard theory and practice of
likelihood-based inference applies. As data become larger, and models more
complex, situations arise where no reasonable amount of computation can render
Monte Carlo error negligible. We present profile likelihood methodology to
provide frequentist inferences that take into account Monte Carlo uncertainty.
We demonstrate our methodology in three situations, analyzing nonlinear dynamic
models for spatiotemporal data, panel data, and genetic sequence data.
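A loose sketch of the underlying construction, assuming a noisy Monte Carlo evaluator of the profile log-likelihood and a quadratic smooth of the noisy profile points, is given below; the model and its likelihood evaluator are hypothetical placeholders, and in practice the confidence cutoff must also be widened to account for the Monte Carlo error.

```python
import numpy as np

def mc_loglik(theta, seed=0):
    """Hypothetical noisy Monte Carlo evaluation of a profile log-likelihood."""
    rng = np.random.default_rng(seed + int(1000 * theta))
    true_profile = -0.5 * (theta - 1.0) ** 2 / 0.04    # stand-in for the model
    return true_profile + rng.normal(scale=0.3)        # Monte Carlo error

# Evaluate the noisy profile on a grid and smooth it with a quadratic fit.
grid = np.linspace(0.5, 1.5, 41)
profile = np.array([mc_loglik(t) for t in grid])
a, b, c = np.polyfit(grid, profile, deg=2)

theta_hat = -b / (2 * a)                               # smoothed point estimate
# Nominal 95% profile interval from the 1.92 log-likelihood-unit cutoff applied
# to the smoothed quadratic.
half_width = np.sqrt(-1.92 / a)
print(f"estimate {theta_hat:.3f}, CI ({theta_hat - half_width:.3f}, "
      f"{theta_hat + half_width:.3f})")
```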

Computational approaches to drug discovery can reduce the time and cost
associated with experimental assays and enable the screening of novel
chemotypes. Structure-based drug design methods rely on scoring functions to
rank and predict binding affinities and poses. The ever-expanding amount of
protein-ligand binding and structural data enables the use of deep machine
learning techniques for protein-ligand scoring.
We describe convolutional neural network (CNN) scoring functions that take as
input a comprehensive 3D representation of a protein-ligand interaction. A CNN
scoring function automatically learns the key features of protein-ligand
interactions that correlate with binding. We train and optimize our CNN scoring
functions to discriminate between correct and incorrect binding poses and known
binders and non-binders. We find that our CNN scoring function outperforms the
AutoDock Vina scoring function when ranking poses both for pose prediction and
virtual screening.

Many time series exhibit changes both in level and in variability. Generally,
it is more important to detect a change in the level, and changing or smoothly
evolving variability can confound existing tests. This paper develops a
framework for testing for shifts in the level of a series which accommodates
the possibility of changing variability. The resulting tests are robust both to
heteroskedasticity and serial dependence. They rely on a new functional central
limit theorem for dependent random variables whose variance can change or trend
in a substantial way. This new result is of independent interest as it can be
applied in many inferential contexts applicable to time series. Its application
to change point tests relies on a new approach which utilizes
Karhunen--Lo{\'e}ve expansions of the limit Gaussian processes. After
presenting the theory in the most commonly encountered setting of the detection
of a change point in the mean, we show how it can be extended to linear and
nonlinear regression. Finite sample performance is examined by means of a
simulation study and an application to yields on US treasury bonds.
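For orientation, the classical CUSUM statistic for a change in the mean is sketched below with a plain variance normalization; the tests developed in the paper replace this normalization so as to remain valid under changing variability and serial dependence, so this is only a baseline.

```python
import numpy as np

def cusum_statistic(x):
    """Max of |S_k - (k/n) S_n| / (sigma_hat * sqrt(n)) over k = 1..n-1.

    Uses the plain sample standard deviation; serially dependent or
    heteroskedastic series need a different normalization.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    partial = np.cumsum(x)
    k = np.arange(1, n)
    stat = np.abs(partial[:-1] - (k / n) * partial[-1]) / (np.std(x, ddof=1) * np.sqrt(n))
    return stat.max(), int(k[np.argmax(stat)])  # statistic and estimated change point

# Hypothetical series with a level shift at t = 120.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 120), rng.normal(1.0, 1, 80)])
print(cusum_statistic(y))  # compare the statistic to Kolmogorov-type critical values
```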

Dengue is a mosquito-borne disease that threatens more than half of the
world's population. Despite being endemic to over 100 countries, government-led
efforts and mechanisms to timely identify and track the emergence of new
infections are still lacking in many affected areas. Multiple methodologies
that leverage the use of Internet-based data sources have been proposed as a
way to complement dengue surveillance efforts. Among these, the trends in
dengue-related Google searches have been shown to correlate with dengue
activity. We extend a methodological framework, initially proposed and
validated for flu surveillance, to produce near real-time estimates of dengue
cases in five countries/regions: Mexico, Brazil, Thailand, Singapore and
Taiwan. Our results show that our modeling framework can be used to improve the
tracking of dengue activity in multiple locations around the world.
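A simplified variant of such a framework regresses official case counts on contemporaneous search-query volumes with a penalized linear model and then produces out-of-sample estimates; the query data and time span below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical weekly data: volumes for dengue-related queries and case counts.
rng = np.random.default_rng(0)
n_weeks, n_queries = 150, 10
searches = rng.gamma(2.0, 1.0, size=(n_weeks, n_queries))
cases = 50 + 30 * searches[:, 0] + 10 * searches[:, 3] + rng.normal(0, 5, n_weeks)

# Train on the first two years, then produce near real-time estimates afterwards.
train = slice(0, 104)
model = LassoCV(cv=5).fit(np.log1p(searches[train]), np.log1p(cases[train]))
estimates = np.expm1(model.predict(np.log1p(searches[104:])))
print("estimated cases for the following weeks:", np.round(estimates[:5]))
```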

Providing accurate predictions is challenging for machine learning algorithms
when the number of features is larger than the number of samples in the data.
Prior knowledge can improve machine learning models by indicating relevant
variables and parameter values. Yet, this prior knowledge is often tacit and
only available from domain experts. We present a novel approach that uses
interactive visualization to elicit the tacit prior knowledge and uses it to
improve the accuracy of prediction models. The main component of our approach
is a user model that captures the domain expert's knowledge of the relevance of
different features for a prediction task. In particular, based on the expert's
earlier input, the user model guides the selection of the features on which to
elicit the user's knowledge next. The results of a controlled user study show that
the user model significantly improves prior knowledge elicitation and
prediction accuracy, when predicting the relative citation counts of scientific
documents in a specific domain.

Imputing incomplete medical tests and predicting patient outcomes are crucial
for guiding the decision making for therapy, such as after an Achilles Tendon
Rupture (ATR). We formulate the problem of data imputation and prediction for
ATR-relevant medical measurements as a recommender system task. By applying
MatchBox, a collaborative filtering approach, to a real dataset collected from
374 ATR patients, we aim to offer personalized medical data imputation and
prediction. In this work, we show the feasibility
of this approach and discuss potential research directions by conducting
initial qualitative evaluations.
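MatchBox itself is a Bayesian recommender; as a much simpler stand-in for the collaborative filtering step, the sketch below imputes a patient-by-measurement matrix by alternating least squares on the observed entries only, with hypothetical data and dimensions.

```python
import numpy as np

def als_impute(R, mask, rank=3, n_iter=50, reg=0.1, seed=0):
    """Impute missing entries of R (mask == True where observed) via low-rank ALS."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U, V = rng.normal(size=(n, rank)), rng.normal(size=(m, rank))
    for _ in range(n_iter):
        for i in range(n):                        # update patient factors
            obs = mask[i]
            A = V[obs].T @ V[obs] + reg * np.eye(rank)
            U[i] = np.linalg.solve(A, V[obs].T @ R[i, obs])
        for j in range(m):                        # update measurement factors
            obs = mask[:, j]
            A = U[obs].T @ U[obs] + reg * np.eye(rank)
            V[j] = np.linalg.solve(A, U[obs].T @ R[obs, j])
    return U @ V.T                                # dense reconstruction = imputations

# Hypothetical 374-patient by 20-measurement matrix with 40% missing values.
rng = np.random.default_rng(1)
truth = rng.normal(size=(374, 5)) @ rng.normal(size=(5, 20))
mask = rng.random(truth.shape) > 0.4
R = np.where(mask, truth, 0.0)
print("RMSE on missing entries:",
      np.sqrt(np.mean((als_impute(R, mask)[~mask] - truth[~mask]) ** 2)))
```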

We consider the problem of predicting the next observation given a sequence
of past observations. We show that for any distribution over observations, if
the mutual information between past observations and future observations is
upper bounded by $I$, then a simple Markov model over the most recent
$I/\epsilon$ observations can obtain KL error $\epsilon$ with respect to the
optimal predictor with access to the entire past. For a Hidden Markov Model
with $n$ states, $I$ is bounded by $\log n$, a quantity that does not depend on
the mixing time. We also demonstrate that the simple Markov model cannot really
be improved upon: First, a window length of $I/\epsilon$ ($I/\epsilon^2$) is
information-theoretically necessary for KL error ($\ell_1$ error). Second, the
$d^{\Theta(I/\epsilon)}$ samples required to accurately estimate the Markov
model when observations are drawn from an alphabet of size $d$ is in fact
necessary for any computationally tractable algorithm, assuming the hardness of
strongly refuting a certain class of CSPs.
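Concretely, the short-window predictor discussed above is an order-$k$ Markov model estimated from counts over the most recent $k$ symbols; a minimal version with add-one smoothing follows, with an illustrative window length and alphabet.

```python
from collections import Counter, defaultdict

def fit_markov(sequence, k):
    """Count transitions from each length-k context to the next symbol."""
    counts = defaultdict(Counter)
    for i in range(k, len(sequence)):
        counts[tuple(sequence[i - k:i])][sequence[i]] += 1
    return counts

def predict(counts, context, alphabet):
    """Add-one smoothed distribution over the next symbol given the last k symbols."""
    c = counts.get(tuple(context), Counter())
    total = sum(c.values()) + len(alphabet)
    return {s: (c[s] + 1) / total for s in alphabet}

alphabet = "ab"
data = "ababbababbabab"             # hypothetical observation sequence
counts = fit_markov(data, k=2)      # the window length k plays the role of I/epsilon
print(predict(counts, data[-2:], alphabet))
```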

Given a sufficient statistic for a parametric family of distributions, one
can estimate the parameter without access to the data itself. However, the
memory or code size for storing the sufficient statistic may nonetheless still
be prohibitive. Indeed, for $n$ independent data samples drawn from a
$k$-nomial distribution with $d=k-1$ degrees of freedom, the length of the code
scales as $d\log n+O(1)$. In many applications though, we may not have a useful
notion of sufficient statistics (e.g., when the parametric family is not an
exponential family) and also may not need to reconstruct the generating
distribution exactly. By adopting a Shannon-theoretic approach in which we
allow a small error in estimating the generating distribution, we
construct various notions of {\em approximate sufficient statistics} and show
that the code length can be reduced to $\frac{d}{2}\log n+O(1)$. We also note
that the locality assumption that is used to describe the notion of local
approximate sufficient statistics when the parametric family is not an
exponential family can be dispensed with. We consider errors measured according
to the relative entropy and variational distance criteria. For the code
construction parts, we leverage Rissanen's minimum description length
principle, which yields a non-vanishing error measured using the relative
entropy. For the converse parts, we use Clarke and Barron's asymptotic
expansion for the relative entropy of a parametrized distribution and the
corresponding mixture distribution. The limitation of this method is that only
a weak converse for the variational distance can be shown. We develop new
techniques to achieve vanishing errors and we also prove strong converses for
all our statements. The latter means that even if the code is allowed to have a
non-vanishing error, its length must still be at least $\frac{d}{2}\log n$.

This research evaluates the performance of an Artificial Neural Network based
prediction system that was employed on the Shanghai Stock Exchange for the
period 21-Sep-2016 to 11-Oct-2016. It is a follow-up to a previous paper in
which the prices were predicted and published before September 21. Stock market
price prediction remains an important quest for investors and researchers. This
research used an Artificial Neural Network, a feedforward multi-layer
perceptron trained with error backpropagation, for prediction, unlike other
methods such as technical, fundamental, or time series analysis. While those
alternative methods tend to indicate trends rather than exact likely prices,
neural networks have the ability to predict actual price values, as was done in
this research. Nonetheless, determining suitable network parameters remains a
challenge in neural network design; this research settled on a 5:21:21:1
configuration with 80% training data, or four years of training data, as a good
enough model for stock prediction, as determined in previous research by the
author. The comparative results indicate that a neural network can predict
typical stock market prices with mean absolute percentage errors as low as
1.95% over the ten prediction instances studied in this research.
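A 5:21:21:1 feedforward network of the kind described (5 inputs, two hidden layers of 21 units, one output) can be set up with standard libraries; the lagged-price features and synthetic prices below are hypothetical, not the study's actual inputs.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical daily closing prices; use the previous 5 days to predict the next.
rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0, 1, 1200)) + 500
X = np.array([prices[i:i + 5] for i in range(len(prices) - 5)])
y = prices[5:]

split = int(0.8 * len(y))                            # 80% training data
model = MLPRegressor(hidden_layer_sizes=(21, 21),    # 5:21:21:1 architecture
                     activation="logistic", solver="adam",
                     max_iter=2000, random_state=0)
model.fit(X[:split], y[:split])

pred = model.predict(X[split:])
mape = np.mean(np.abs((y[split:] - pred) / y[split:])) * 100
print(f"MAPE on held-out days: {mape:.2f}%")
```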

The recently proposed Sequence-to-Sequence (seq2seq) framework advocates
replacing complex data processing pipelines, such as an entire automatic speech
recognition system, with a single neural network trained in an end-to-end
fashion. In this contribution, we analyse an attention-based seq2seq speech
recognition system that directly transcribes recordings into characters. We
observe two shortcomings: overconfidence in its predictions and a tendency to
produce incomplete transcriptions when language models are used. We propose
practical solutions to both problems achieving competitive speaker independent
word error rates on the Wall Street Journal dataset: without separate language
models we reach 10.6% WER, while together with a trigram language model, we
reach 6.7% WER.

Missing data are universal, and methods to deal with them range from simply
ignoring them to complex modelling strategies such as multiple imputation and
maximum likelihood estimation. To date, missing data have been effectively
imputed only by machines via statistical/machine learning models. In this paper
we set out to answer an important question: "Can humans fill in missing data
reasonably well, given information about the dataset?" We do so in a
crowdsourcing framework, where we first translate our missing data problem into
a survey question, which can then be easily completed by crowdworkers. We
address challenges that are inherent to crowdsourcing in our context and
present an evaluation on a real dataset. We compare human-powered multiple
imputation outcomes with state-of-the-art model-based imputation.

A typical viral marketing model identifies influential users in a social
network to maximize a single product adoption assuming unlimited user
attention, campaign budgets, and time. In reality, multiple products need
campaigns, users have limited attention, convincing users incurs costs, and
advertisers have limited budgets and expect the adoptions to be maximized soon.
Facing these user, monetary, and timing constraints, we formulate the problem
as a submodular maximization task in a continuous-time diffusion model under
the intersection of a matroid and multiple knapsack constraints. We propose a
randomized algorithm estimating the user influence in a network
($|\mathcal{V}|$ nodes, $|\mathcal{E}|$ edges) to an accuracy of $\epsilon$
with $n=\mathcal{O}(1/\epsilon^2)$ randomizations and
$\tilde{\mathcal{O}}(n|\mathcal{E}|+n|\mathcal{V}|)$ computations. By
exploiting the influence estimation algorithm as a subroutine, we develop an
adaptive threshold greedy algorithm achieving an approximation factor $k_a/(2+2
k)$ of the optimal when $k_a$ out of the $k$ knapsack constraints are active.
Extensive experiments on networks of millions of nodes demonstrate that the
proposed algorithms achieve the state-of-the-art in terms of effectiveness and
scalability.
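As a simplified reference for the greedy step (ignoring the continuous-time diffusion estimator and the matroid constraint), a cost-benefit greedy under a single knapsack budget repeatedly adds the seed with the largest estimated marginal influence per unit cost; the influence oracle and graph below are hypothetical stand-ins for the paper's randomized estimator.

```python
import numpy as np

def greedy_knapsack(candidates, costs, budget, influence):
    """Greedily add the seed with the largest marginal influence per unit cost."""
    selected, spent = [], 0.0
    while True:
        best, best_ratio = None, 0.0
        for v in candidates:
            if v in selected or spent + costs[v] > budget:
                continue
            gain = influence(selected + [v]) - influence(selected)
            if gain / costs[v] > best_ratio:
                best, best_ratio = v, gain / costs[v]
        if best is None:
            return selected
        selected.append(best)
        spent += costs[best]

# Hypothetical influence oracle: coverage of a random "reach" set per node.
rng = np.random.default_rng(0)
reach = {v: set(rng.choice(100, size=rng.integers(5, 30), replace=False))
         for v in range(10)}
influence = lambda seeds: len(set().union(*(reach[v] for v in seeds))) if seeds else 0
costs = {v: 1.0 + 0.5 * v for v in range(10)}
print(greedy_knapsack(list(range(10)), costs, budget=6.0, influence=influence))
```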

We present a framework to understand GAN training as alternating density
ratio estimation and approximate divergence minimization. This provides an
interpretation for the mismatched GAN generator and discriminator objectives
often used in practice, and explains the problem of poor sample diversity. We
also derive a family of generator objectives that target arbitrary
$f$-divergences without minimizing a lower bound, and use them to train
generative image models that target either improved sample quality or greater
sample diversity.

We provide some new insights for analyzing dynamics of optimization
algorithms, which are popular in machine learning, based on differential
equation approaches. Our analysis reveals a natural connection between
optimization algorithms and physical systems, and is applicable to more general
algorithms and optimization problems beyond general convexity and strong
convexity.
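The basic correspondence is that gradient descent with step size $\eta$ is the forward-Euler discretization of the gradient flow $\dot{x} = -\nabla f(x)$; the toy quadratic below is purely illustrative.

```python
import numpy as np

grad = lambda x: 2 * x             # gradient of f(x) = x^2

# Gradient descent: x_{k+1} = x_k - eta * grad(x_k)  (forward Euler with step eta).
eta, x_gd = 0.1, 1.0
for _ in range(50):
    x_gd -= eta * grad(x_gd)

# Gradient flow: integrate dx/dt = -grad(x) with a much finer step over the same "time".
dt, x_ode = 0.001, 1.0
for _ in range(int(50 * eta / dt)):
    x_ode -= dt * grad(x_ode)

print(f"gradient descent: {x_gd:.5f}, gradient flow at t = {50 * eta}: {x_ode:.5f}")
# Both approach the minimizer 0; the ODE view makes the continuous-time limit explicit.
```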

In this paper, we study the problem of author identification under
double-blind review setting, which is to identify potential authors given
information of an anonymized paper. Different from existing approaches that
rely heavily on feature engineering, we propose to use network embedding
approach to address the problem, which can automatically represent nodes into
lower dimensional feature vectors. However, there are two major limitations in
recent studies on network embedding: (1) they are usually general-purpose
embedding methods, which are independent of the specific tasks; and (2) most of
these approaches can only deal with homogeneous networks, where the
heterogeneity of the network is ignored. Hence, the challenges faced here are
twofold: (1) how to embed the network under the guidance of the author
identification task, and (2) how to select the best type of information due to
the heterogeneity of the network.
To address the challenges, we propose a task-guided and path-augmented
heterogeneous network embedding model. In our model, nodes are first embedded
as vectors in a latent feature space. Embeddings are then shared and jointly
trained according to task-specific and network-general objectives. We extend
the existing unsupervised network embedding to incorporate meta paths in
heterogeneous networks, and select paths according to the specific task. The
guidance from author identification task for network embedding is provided both
explicitly in joint training and implicitly during meta path selection. Our
experiments demonstrate that by using path-augmented network embedding with
task guidance, our model can obtain significantly better accuracy at
identifying the true authors compared to existing methods.
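To make the meta-path idea concrete, the sketch below generates random walks constrained to follow a fixed meta path (for example Author-Paper-Author) over a small heterogeneous graph; such walks could then feed a skip-gram style embedding objective. The graph and meta path are hypothetical.

```python
import random

# Hypothetical heterogeneous graph: node -> (type, neighbors).
graph = {
    "a1": ("A", ["p1", "p2"]), "a2": ("A", ["p1"]), "a3": ("A", ["p2", "p3"]),
    "p1": ("P", ["a1", "a2"]), "p2": ("P", ["a1", "a3"]), "p3": ("P", ["a3"]),
}

def metapath_walk(start, metapath, length, rng=random.Random(0)):
    """Random walk whose i-th step moves to a neighbor of type metapath[i mod len]."""
    walk = [start]
    while len(walk) < length:
        wanted = metapath[len(walk) % len(metapath)]
        options = [v for v in graph[walk[-1]][1] if graph[v][0] == wanted]
        if not options:
            break
        walk.append(rng.choice(options))
    return walk

# Repeating the "AP" pattern yields Author-Paper-Author-... walks (the APA meta path).
print(metapath_walk("a1", metapath="AP", length=7))
```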