First Australian and New Zealand Users Group meeting: Abstracts

Sunday, 10 October 2004

Programming for further processing of postestimation results

Ian Watson, Australian Centre for Industrial Relations Research and Training, University of Sydney

Abstract

Do you find yourself regularly cutting and pasting your postestimation
results, such as regression coefficients, into a spreadsheet? If so, you
should consider programming these steps wherever possible. Using Stata's matrix
commands, this presentation will show you how to process postestimation
results, for example, by manipulating a vector of regression coefficients. By crafting
your own small ado-file, you can save yourself from the tedious and repetitive
job of using spreadsheets.

The presentation will be illustrated with an example based on decomp,
an ado-file by Ian Watson that decomposes earnings results. The spreadsheet
approach will be contrasted with the ado approach.

Australia's firm-level productivity — a new perspective

Robert Breunig, Research School of Social Sciences, Australian National University
Marn-Heong Wong, Australia–Japan Research Centre, Australian National University

Abstract

Not all firms contributed to Australia's impressive productivity growth in the
1990s. Some performed better than others, and entrants arrived even as
incumbents exited. If firms make decisions on input demand and liquidation
based on their productivity, the latter known to them but unobserved by the
econometrician, this gives rise to simultaneity and selection problems that
bias the traditional estimators of production function coefficients. We apply
a semiparametric technique that endogenizes input choices and firm exit
decisions to obtain production function estimates on Australian firms.
Estimation is carried out using the Business Longitudinal Survey,
Australia's only business longitudinal micro-dataset that tracks firm entry
and exit.

Simulating two- and three-generation pedigree data for genetic epidemiology research

Jisheng S. Cui, Department of Public Health, University of Melbourne, and Mathematical and
Information Sciences Division, Commonwealth Scientific and Industrial Research
Organisation, Clayton, Victoria

Abstract

Alongside real pedigree data, simulated pedigree data are very important in
genetic epidemiology research. Simulated data can be used to compare the
efficiency of different statistical models and to investigate questions that
cannot be answered with real data. Simulating pedigree data in Stata has
advantages over using general-purpose programming languages (e.g., C++ or
Fortran) because random numbers from common probability distributions are
easily generated by the software. Here we
introduce two Stata programs, simuped2 and simuped3, which can
be used to simulate two- and three-generation pedigree data, respectively.
Variables generated by these programs include family ID, individual ID,
generation, age, gender, and genotype.
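
As a rough illustration of the kind of simulation described, the following Python sketch generates two-generation pedigrees with the variables listed above. It is not the simuped2 program, whose internals are not given in this abstract. The function names, the single biallelic locus, the allele frequency, and the age ranges are all hypothetical choices made for the example.

```python
import random

def draw_genotype(rng, allele_freq):
    """Founder genotype: number of risk alleles (0, 1, or 2),
    assuming Hardy-Weinberg equilibrium (an assumption of this sketch)."""
    return sum(rng.random() < allele_freq for _ in range(2))

def transmit(rng, genotype):
    """Allele passed from a parent to a child: 1 = risk allele, 0 = other."""
    if genotype == 2:
        return 1
    if genotype == 0:
        return 0
    return int(rng.random() < 0.5)  # heterozygote passes either allele

def simulate_family(rng, family_id, n_children=2, allele_freq=0.1):
    """Return one record per person for two parents and their children."""
    rows, parent_genotypes, parent_ages = [], [], []
    for ind, gender in ((1, "M"), (2, "F")):
        genotype = draw_genotype(rng, allele_freq)
        age = rng.randint(45, 70)  # hypothetical parental age range
        parent_genotypes.append(genotype)
        parent_ages.append(age)
        rows.append({"famid": family_id, "id": ind, "generation": 1,
                     "age": age, "gender": gender, "genotype": genotype})
    for k in range(n_children):
        # Mendelian inheritance: one allele from each parent
        genotype = sum(transmit(rng, g) for g in parent_genotypes)
        age = min(parent_ages) - rng.randint(20, 30)
        rows.append({"famid": family_id, "id": 3 + k, "generation": 2,
                     "age": age, "gender": rng.choice("MF"),
                     "genotype": genotype})
    return rows

rng = random.Random(2004)  # fixed seed so the run is reproducible
pedigree = [row for fam in range(1, 4) for row in simulate_family(rng, fam)]
for row in pedigree:
    print(row)
```

A three-generation version would apply the same transmission step once more, treating second-generation members and their spouses as the parents.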

Modeling intensive care unit outcome in a large data base: analysis of the institutional effect

Within the intensive care environment, large databases exist, recording
patient, ICU, and hospital details. Over the last 20 years, a number of
competing algorithms have been developed to generate risk-adjusted outcomes
for patients; the best known is the APACHE II (Acute Physiology and
Chronic Health Evaluation) algorithm. Standardized mortality ratios (SMRs)
for individual ICUs have subsequently been generated (the "league-tables"
paradigm). The method of calculating the SMR using, say, the APACHE II
algorithm, whereby "mortality ratios are calculated by projecting the APACHE
II score-specific mortalities of the total group on case mix ...of individual
ICUs", amounts to an indirect standardization, which (quoting Yule and
Rothman) "is not fully a method of standardization at all". It has been
recommended (Fidler 1997) to use direct standardization by either (a)
logistic regression ... with separate intercepts for each ICU. The intercepts
are simply the logits of directly standardized mortality rates and can be used
for rankings. This approach assumes constant slopes for all ICUs... and can be
tested; or (b) modeling the differences between ICUs as random effects (DeLong
et al. 1997).
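
To make the indirect standardization concrete, here is a minimal Python sketch of the calculation as quoted: the score-specific mortality rates of the total group are projected onto one ICU's case mix to give expected deaths, and the ratio is observed over expected. The function name, the three score bands, and all numbers are hypothetical and are not taken from the ANZICS data.

```python
def indirect_smr(icu_counts, icu_deaths, reference_rates):
    """Indirectly standardized mortality ratio for one ICU.

    icu_counts      -- patients in this ICU, by score band
    icu_deaths      -- observed deaths in this ICU
    reference_rates -- mortality rate of the total group, by score band
    """
    expected = sum(n * reference_rates[band] for band, n in icu_counts.items())
    return icu_deaths / expected

# Hypothetical score bands and rates, for illustration only
reference_rates = {"low": 0.05, "mid": 0.20, "high": 0.50}
icu_counts = {"low": 100, "mid": 50, "high": 20}

smr = indirect_smr(icu_counts, icu_deaths=30, reference_rates=reference_rates)
print(round(smr, 3))  # expected deaths are 5 + 10 + 10 = 25, so the ratio is 30 / 25
```

Because each ICU's expected count is weighted by its own case mix, ratios computed this way for different ICUs are not strictly comparable with each other, which is the objection raised in the quotation.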

The above matters will be addressed using data from the ANZICS (Australia and
New Zealand Intensive Care Society) national database, 1993–2003, recording
APACHE II data and hospital outcomes for 280,000 patients in 201 ICUs.
Implications for the use of Stata will be illustrated.

Generalized partially linear models

Roberto Gutierrez, StataCorp

Abstract

Partially linear models are linear regression models where one component is
allowed to vary nonparametrically. Generalized partially linear models
generalize this case from linear regression to the quasi-likelihood setting of
standard GLIMs, thus encompassing a larger class of models including logistic,
Poisson, and gamma regression. Although estimation for these models is
possible in official Stata via fractional polynomials, the approach presented
here is entirely nonparametric and uses a local-linear smooth to estimate the
"nonlinear" component. The Stata command gplm for fitting generalized
partially linear models is discussed and demonstrated.

The effect of missing data on covariates in survival analysis

Irit Aitkin,
Department of Psychology, University of Melbourne

Abstract

We consider the problem of missing data on covariates in the context of
survival analysis. More specifically, we examine the factors affecting the
duration of breastfeeding in Western Australia. Duration was studied in 556
women delivering at two maternity hospitals in Perth, Australia. The study was
carried out over the period September 1992 to April 1993. Of these women, 466
were breastfeeding when they left the hospital. In a previous analysis, the Cox proportional
hazards model was fitted to determine the factors affecting duration of
breastfeeding. However, because of missing data, a covariate known to be
important, smoking, could not be used as it would have resulted in a loss of
almost 50% of the available sample. In this analysis, we incorporate the
incomplete data on smoking omitted from the previous analysis.

We deal with the missing data on covariates in survival analysis in two
ways — the first is by maximum likelihood and the second by multiple
imputation.

Direct maximization of the likelihood with missing data is complicated, and
most methods that perform maximum likelihood estimation (for example, the EM
algorithm) use some form of data augmentation, which augments the observed
data with latent (unobserved) data, so that very complicated calculations are
replaced by much simpler ones given the "complete data".

The distribution of response time for cases with smoking missing is no longer
a Cox model but a mixture of two such models, in proportions given by the
population proportions of smokers and non-smokers. The likelihood function is
therefore different for complete and incomplete cases, and so maximizing it is
more complicated in having to allow for this difference.
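
The difference between the two kinds of likelihood contribution can be sketched in symbols. The notation is introduced here for illustration only and is not from the talk: $t_i$ is the observed duration and $\delta_i$ the event indicator for woman $i$, $x_i$ her fully observed covariates, $s_i \in \{0, 1\}$ her smoking status, $\pi$ the population proportion of smokers, and $f$ and $S$ the density and survival functions implied by the proportional hazards model. The sketch also assumes that smoking is independent of the other covariates and is missing at random.

```latex
L_i =
\begin{cases}
\pi^{s_i}(1-\pi)^{1-s_i}\,
  f(t_i \mid x_i, s_i)^{\delta_i}\, S(t_i \mid x_i, s_i)^{1-\delta_i},
  & s_i \text{ observed},\\[6pt]
\pi\, f(t_i \mid x_i, 1)^{\delta_i}\, S(t_i \mid x_i, 1)^{1-\delta_i}
  + (1-\pi)\, f(t_i \mid x_i, 0)^{\delta_i}\, S(t_i \mid x_i, 0)^{1-\delta_i},
  & s_i \text{ missing}.
\end{cases}
```

The second line is the two-component mixture described above; the first line reduces to the usual complete-data contribution multiplied by the probability of the observed smoking status.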

We carried out the ML analysis in Stata using GLLAMM (Generalized Linear
Latent And Mixed Models) routines (Rabe-Hesketh, Pickles, and Skrondal 2001).
In the GLLAMM procedure, a latent smoking variable is defined for the cases
with smoking missing, and the breastfeeding durations are regressed on the
explanatory variables and smoking — the covariate when it is observed and
the latent variable when not. The model for the smoking covariate is a
"measurement model" when the covariate is observed and a "structural model"
when it is not.

We compared ML using GLLAMM with multiple imputation using the program written
by J. L. Schafer, mainly for S-Plus/R. That program is based on the data
augmentation algorithm (Tanner and Wong 1987).

Tools for using multiple imputation for missing data in Stata

John Carlin,
Departments of Paediatrics & Public Health, University of Melbourne, and
Murdoch Childrens Research Institute, Melbourne
Philip Greenwood, John Galati, and Joe Schafer,
Departments of Paediatrics & Public Health, University of Melbourne, and
Murdoch Childrens Research Institute, Melbourne

Abstract

A major analytic challenge in epidemiological studies is the threat to
validity and precision of conclusions raised by missing data. It is still
commonly accepted practice to analyze data containing missing values by
"complete-case" methods, where entire individuals are omitted from the
analysis if they have a missing value on any of the variables required for the
analysis in question. This approach can lead to biases in conclusions, by
excluding individuals in whom patterns of association may be different from
among those retained, and at best leads to loss of precision due to the
reduction in sample size available for analysis. The method of multiple
imputation is gaining popularity as an approach for dealing with missing data.
It involves the production of multiple complete datasets based on a
statistical model for the missing values given the observed data. Each of the
imputed datasets is then analyzed using standard methods, and valid inferences
are obtained by combining these estimates appropriately. Given tools for (a)
imputing the missing values, and (b) analyzing the multiply imputed datasets,
the method offers great flexibility. In this talk I will review currently
available tools for task (a), ranging from fully model-based methods provided
in software developed by Schafer and now available in packages such as SAS and
S-PLUS to more pragmatic but flexible techniques such as the use of chained
equations. Stata commands for performing the latter technique have recently
been developed by Patrick Royston, and we are working to develop Stata
interfaces for some of Schafer's methods. Tools for task (b) have been fairly
limited, but we have recently published a flexible package of commands in
Stata, which allows a wide range of data manipulations as well as combined
analyses to be performed on multiply imputed datasets with minimal effort. We
have used multiple imputation to address missing data problems in the
Victorian Adolescent Health Cohort Study (VAHCS), which began in 1992 with
participants aged 15 and has recently completed an eighth wave of data
collection. Analyses of data from this study will be used in the talk to
illustrate the methods and to highlight outstanding issues, both statistical
and computational.
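
The abstract does not spell out how the estimates are combined. A common choice is the set of combining rules usually attributed to Rubin, and the Python sketch below assumes those rules for a single scalar parameter. The function name and the five estimates are hypothetical; this is not the published Stata package.

```python
from math import sqrt
from statistics import mean, variance

def combine_mi(estimates, variances):
    """Combine results from m imputed datasets (assumed: Rubin's rules).

    estimates -- point estimate of the parameter from each dataset
    variances -- squared standard error of that estimate from each dataset
    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)

# Hypothetical results from m = 5 imputed datasets
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.010, 0.010, 0.010, 0.010]

pooled, se = combine_mi(estimates, variances)
print(round(pooled, 3), round(se, 4))
```

The between-imputation term is what a complete-case or single-imputation analysis leaves out: it carries the extra uncertainty due to the missing values.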

Analyzing multiply imputed datasets: separate or stacked

The method of multiple imputation provides an attractive approach to handling
missing data in large studies. A variety of software is now available to
produce multiply imputed (MI) datasets, and we have published a set of Stata
commands ("MI tools") that facilitate the manipulation and analysis of MI
datasets. MI datasets can be either a set of separate data files or a single
(stacked) data file with some extra information to index the datasets. For the
purpose of writing Stata commands to analyze these data, what are the benefits
of each format? The stacked format seems to offer greater efficiency and
elegance and can make better use of existing syntax structures. However,
separate data files seem to offer greater overall flexibility and some
important tasks can only be implemented in that format. It seems that a
combined approach might give the best of both worlds. This talk will describe
our current work on a revised version of MI tools.

Using plugins and COM servers in Stata for handling multiple datasets

Computational efficiency and flexibility in a statistical package may be
enhanced by enabling the package to communicate directly with other programs.
A model of particular interest at the moment is the component object model
(COM). This model provides a uniform mechanism for programs running under
Microsoft Windows to share data and functionality. Recently, statistical
routines for imputing values in multiple datasets have been packaged by Joe
Schafer as COM servers, making them available to a wide variety of statistical
analysis packages. (The routines themselves were also originally written by
Joe.) In the first part of this talk, I will discuss using Stata plugins to
access these multiple imputation routines from within Stata. Techniques for
handling missing data invariably involve processing multiple datasets. Since
Stata is fundamentally geared towards processing a single dataset at any
given time, a natural question that arises is how best to handle multiple
datasets in Stata in a general, flexible, and efficient manner. In the
remainder of the talk, I will discuss using COM servers and Stata plugins for
this purpose, and I will highlight the advantages of this approach from the
perspective of computational efficiency, flexibility, and elegance.

Propensity score matching using -psmatch-

In observational studies, the researcher has no control over treatment
assignment. Control and intervention groups are therefore often unbalanced
with respect to confounding variables, and even covariate adjustment doesn't
always fully eliminate bias. The propensity score is the conditional
probability of being in the treatment group given the covariates, and it can
be used to balance the covariates in the two groups. The score is derived
from a logistic regression model of treatment group on the covariates, with
the propensity score being the predicted probability of being in the treated
group.

Once calculated, the propensity score can be used to reduce bias by matching,
stratification, or by using it as a covariate in the regression model. In this
presentation, I will briefly present some of the theory behind the use of
propensity scores, and demonstrate the Stata procedure psmatch, which
facilitates propensity score matching.
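
As an illustration of the matching step only, the Python sketch below pairs each treated unit with the nearest unmatched control on the propensity score, subject to a caliper. It assumes the scores have already been estimated (for example, from the logistic regression described above). The greedy one-to-one rule, the caliper value, and all names and scores are hypothetical, and the sketch does not claim to reproduce what psmatch does.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, controls -- dicts mapping a unit id to its propensity score
    caliper           -- largest score difference accepted for a match
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)
    pairs = []
    # Process treated units in order of increasing score
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs

# Hypothetical propensity scores
treated = {"t1": 0.62, "t2": 0.35, "t3": 0.90}
controls = {"c1": 0.30, "c2": 0.60, "c3": 0.33, "c4": 0.10}

print(match_nearest(treated, controls))
```

In this example t3 is left unmatched because no remaining control lies within the caliper. Discarding such units trades sample size for better balance.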

Simulating a control pool to economize control recruitment in a matched case–control study

Rory Wolfe, Monash University, Melbourne

Abstract

Control recruitment in case–control studies is problematic if no register
for the study base exists. If random telephone contact is used and study base
members comprise a relatively small proportion of the population, then control
recruitment can be resource-intensive. In a matched case–control study
with prospective case recruitment, the study base is accessed repeatedly for
control recruitment, and in this context we propose a dynamic pool to economize
control recruitment. A study base member who is contacted but does not match
the current case is added to the pool. The pool is then accessed for
future cases before resorting to random telephone contact again.

Using Stata, we simulate the operation of a control pool to quantify the
possible economies under a range of likely scenarios. These simulation results
are compared with early experience in the Farm Injury Risk among Males study,
which found modest efficiency gains (4 controls recruited from the pool at a
saving of approximately 90 telephone contacts per control).
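
The operation of such a pool can be sketched in a few lines. The Python code below is an illustrative simulation written for this summary, not the authors' Stata code. It assumes one control per case, a single categorical matching factor with equally likely levels, and a fixed probability that a telephone contact reaches a study base member. The function name and all parameter values are hypothetical.

```python
import random

def simulate_recruitment(n_cases, n_strata, p_eligible, use_pool, seed=1):
    """Count telephone contacts needed to recruit one control per case."""
    rng = random.Random(seed)
    pool = [0] * n_strata   # pooled study base members, by matching stratum
    calls = 0
    from_pool = 0
    for _ in range(n_cases):
        case_stratum = rng.randrange(n_strata)
        if use_pool and pool[case_stratum] > 0:
            pool[case_stratum] -= 1      # control taken from the pool
            from_pool += 1
            continue
        while True:                      # random telephone contact
            calls += 1
            if rng.random() >= p_eligible:
                continue                 # person is not in the study base
            stratum = rng.randrange(n_strata)
            if stratum == case_stratum:
                break                    # matched control recruited
            if use_pool:
                pool[stratum] += 1       # keep the non-match for later cases
    return {"calls": calls, "from_pool": from_pool}

without = simulate_recruitment(200, n_strata=6, p_eligible=0.1, use_pool=False)
with_pool = simulate_recruitment(200, n_strata=6, p_eligible=0.1, use_pool=True)
print("contacts without pool:", without["calls"])
print("contacts with pool:", with_pool["calls"],
      "controls from pool:", with_pool["from_pool"])
```

Varying the number of strata and the eligibility probability gives a feel for when a pool is worthwhile. The sketch ignores refusals and the ageing of pool members, both of which would reduce the saving in practice.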

Risk ratio estimation with the logistic model

Leigh Blizzard, Menzies Research Institute, University of Tasmania
David W. Hosmer, School of Public Health and Health Sciences, University of Massachusetts

Abstract

The log-binomial model (the generalized linear model with binomial errors and
log link) makes it possible to directly estimate the relative risk from cohort
follow-up data, or the prevalence ratio from cross-sectional data, with
adjustment for confounders. One of the problems with the use of this model is
that the iterative estimation algorithm may fail to converge. Schouten et al.
recognized this problem and proposed a clever solution to it. Their approach
involves defining a dichotomous outcome variable (D) coded as D=1 for
occurrence and D=0 for non-occurrence, and augmenting the original data by
replicating the observations on subjects with the outcome (D=1) but with the
outcome variable coded as D=0 in the second instance. (In the language of a
case–control study, each case is included both as a case and as a control.)
Schouten et al. show that a logistic regression model fitted to the
expanded data set has the same parameters as the log-binomial model. They
derive a consistent "information sandwich" estimator of the covariance matrix
of the estimated coefficients that, with some data manipulation, can be
obtained from the output of the logistic regression. The problem is that while
a solution for the parameter vector can be obtained from nearly any set of
data, it is not guaranteed to be admissible for the log-binomial model. We use
Stata to demonstrate the method of Schouten et al., including the calculations
required to obtain standard error estimates, and describe the frequency of
inadmissible solutions in simulated data.
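
The data augmentation can be checked by hand in the simplest case of one binary exposure, where the fitted odds ratio from a logistic regression equals the sample odds ratio. The Python sketch below uses a hypothetical cohort and hypothetical function names, and it covers only the point estimate, not the sandwich variance estimator.

```python
from fractions import Fraction

def augment(records):
    """Append a copy of every D=1 record, recoded as D=0.

    records -- list of (exposed, d) pairs, each coded 0 or 1
    """
    return list(records) + [(x, 0) for x, d in records if d == 1]

def risk(records, exposed):
    """Proportion with the outcome among subjects at this exposure level."""
    outcomes = [d for x, d in records if x == exposed]
    return Fraction(sum(outcomes), len(outcomes))

def odds(records, exposed):
    """Outcome odds (D=1 versus D=0) among records at this exposure level."""
    events = sum(1 for x, d in records if x == exposed and d == 1)
    non_events = sum(1 for x, d in records if x == exposed and d == 0)
    return Fraction(events, non_events)

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed have the outcome
cohort = [(1, 1)] * 30 + [(1, 0)] * 70 + [(0, 1)] * 10 + [(0, 0)] * 90
expanded = augment(cohort)

risk_ratio = risk(cohort, 1) / risk(cohort, 0)
odds_ratio_original = odds(cohort, 1) / odds(cohort, 0)
odds_ratio_expanded = odds(expanded, 1) / odds(expanded, 0)

print(risk_ratio, odds_ratio_original, odds_ratio_expanded)
```

In the expanded data, the odds within each exposure group are cases divided by all subjects, which is the risk itself, so the odds ratio from the expanded data equals the risk ratio from the original cohort. With continuous covariates, nothing constrains the fitted odds to stay at or below 1, which is where inadmissible solutions arise.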