2015 London Stata Users Group Meeting - Abstracts

Bayesian Analysis using Stata

Stata 14 provides a suite of commands for performing Bayesian analysis. Bayesian analysis is a statistical paradigm that answers research questions about unknown parameters using probability statements. For example, what is the probability that a person accused of a crime is guilty? What is the probability that there is a positive effect of schooling on wage? What is the probability that the odds ratio is between 0.3 and 0.5? And many more. In my presentation, I will describe Stata's Bayesian suite of commands and demonstrate its use in various applications.

Log-linear models for cross-tabulations using Stata

Log-linear models for cross-tabulations are models for describing and testing patterns in cross-tabulations. These cross-tabulations could have two dimensions (e.g. father’s occupation versus son’s occupation) or more than two dimensions (e.g. father’s occupation versus son’s occupation for different cohorts and different countries). A wide range of patterns can be investigated and tested with these models. For example, one can investigate whether the dimensions are independent (e.g. father’s occupation has no relevance for the son’s occupation) or whether the dimensions are independent except for the diagonals (e.g. sons are more likely to enter the occupation of their father, but the father has no influence once the son chooses a different occupation). One can also assume that the categories are ordinal, estimate a scale for each dimension, and summarize the strength of the association with one number, which can be compared across cohorts or countries. The purpose of this talk is to give an overview of this family of models, discuss how to trick Stata (in particular, poisson and gsem) into estimating these models, and show how to get interpretable parameters out of these models.
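
As a rough illustration of the simplest pattern mentioned above, independence, the following Python sketch computes the expected counts and the likelihood-ratio statistic (G²) for a two-way table. It is not the talk's Stata code: the function name and the counts are hypothetical. The same fit corresponds to a Poisson regression of the cell counts on row and column indicators only.

```python
import math

def independence_fit(table):
    """Expected counts, G^2 and degrees of freedom under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence, expected count = row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Likelihood-ratio statistic; empty cells contribute zero
    g2 = 2.0 * sum(
        obs * math.log(obs / exp)
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
        if obs > 0
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, g2, df

# Hypothetical father-by-son occupation counts (rows: father, columns: son)
mobility = [[50, 30, 20], [30, 40, 30], [20, 30, 50]]
expected, g2, df = independence_fit(mobility)
print(round(g2, 2), df)
```

A large G² relative to its degrees of freedom suggests that the independence pattern does not describe the table, which is where the richer patterns (diagonal effects, ordinal scales) come in.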

Social Network Analysis Using Stata

The field of social network analysis is one of the most rapidly growing fields of the social sciences. Social network analysis focuses on the relationships that exist between individuals (or other units of analysis) such as friendship, advice, trust, or trade relationships. Network analysis is concerned with the visualization and analysis of network structures, as well as with the importance of networks for individuals’ propensities to adopt different kinds of behaviors. Up until now such analyses have only been possible to perform using specialized software for network analysis. This tutorial introduces the so-called nwcommands, a software suite with over 80 Stata commands for social network analysis. The software includes commands (and dialog boxes) for importing, exporting, loading, saving, handling, manipulating, replacing, generating, visualizing, and animating networks. It also includes commands for measuring various properties of the networks and the individual nodes, for detecting network patterns and measuring the similarity of different networks, as well as advanced statistical techniques for network analysis including MR-QAP and ERGM.

A large-scale application of Stata’s forecast suite: challenges and potential

Stata 13 added a very important feature for macroeconomists: the forecast suite of commands that implements the definition of a model, consisting of a number of estimated equations and potentially nonlinear identities. Stata’s features include model solution, dynamic forecasting, scenario analysis and stochastic simulation. I report on my attempt to apply the forecast suite to a well-known large-scale macroeconomic model. I discuss the challenges related to use of these features in a much more complex context than that illustrated in the manual’s examples. I will also suggest a number of enhancements that would improve forecast’s capabilities in comparison to other popular forecasting tools.

Logistic regression: Why we often can do what we think we can do

There is increasing criticism of the ways in which the raw coefficients and odds ratios from logistic regression have been used. The argument is that logistic regression models a latent propensity of success and that the scale of that latent variable is fixed by fixing the variance of the error term. If one adds a variable to a model, the variance of the residual is likely to decrease, and the scale of the dependent variable thus changes. Comparing models with and without that additional variable thus becomes problematic. Similarly, a comparison of models in groups that are likely to have different residual variances will also be problematic. However, I will argue that logistic regression has an unusual dependent variable: a probability, which measures how certain we are that an event of interest happens. This degree of certainty is a function of how much information we have, which in the case of logistic regression is captured by the variables we add to the model. If the dependent variable is interpreted in that way, many of the problems with logistic regression turn out to be desirable properties of the logistic regression model.

Agents may consider information and other signals from their peers (especially close peers) when making their spatial site choices. However, the presence of other agents in a spatial location may generate congestion or agglomeration effects. Disentangling potential peer effects from congestion is difficult, since it is hard to ascertain whether the observed congestion effects are a result of observing others’ behavior or of peer effects within the same network encouraging a fisherman to visit a site even in the presence of congestion. The research develops an empirical framework to decompose both motivations in a spatial discrete choice model in an effort to synthesize the congestion/agglomeration literature with the peer effects literature. Using Monte Carlo analysis, we investigate the robustness of our proposed estimation routine relative to the conventional random utility model (RUM), which ignores both peer and congestion/agglomeration effects, and to the spatial sorting equilibrium model, which ignores peer effects. Our results indicate that both the RUM and sorting equilibrium models can be used to successfully investigate the presence of peer effects. However, the estimates of congestion effects are poor because of ignored correlated random effects. Recent literature has largely used Bayesian methods for this hard problem. We also explore the use of Fixed Effects Multinomial Logit estimates to first estimate the base model, and then extract generalized residuals to estimate the peer effects.

Use of simulation with ipdpower in designing a randomised cluster study of an oral health intervention in care homes

This paper illustrates the use of a recently developed Stata procedure ipdpower (Kontopantelis, E.) in designing a cluster randomised trial. The trial was required to compare pre- to post-intervention change between intervention and non-intervention care homes. Forty-nine residential care homes ranging in size from 3 to 112 beds (median 27 beds) were available to take part. Primary outcome measures were tooth cleaning (a dichotomy) and the Geriatric Oral Health Assessment Index (GOHAI, a continuous score). As is common in this situation, it was required to explore the effect on sample size and power of a range of values of cluster sizes, within cluster correlation, between group variation, and intraclass correlation. Ranges of parameter values for a number of runs of the simulation procedure were obtained from published results of studies with similar features, transformed where necessary through standard formulae. The final design resulted in a recommendation to use 16 homes, with estimated statistical power of 80% for comparison of intervention with non-intervention participants, adjusting for baseline values. Simulation can be recommended as a valuable approach since it takes account of all features of the design, it facilitates communication among members of the study team in balancing design features, and it provides a clear sense of the size required for the necessary statistical power.

Between and beyond: irregular series, interpolation, variograms and smoothing

Time series (and similar one-dimensional series) are more often irregularly spaced than many methods texts or courses admit. Even with a plan of regular measurements, gaps can arise for many human or inhuman reasons, while some series are naturally irregular. Interpolation of values between known values is a centuries-old need, but one neglected by official Stata, which offers only linear interpolation and cubic spline interpolation (in Mata). I review additional user-written commands for interpolation, including those for cubic, nearest neighbour and piecewise cubic Hermite methods available from SSC. Beyond interpolation of irregular series lie the questions of characterising the structure of such series and smoothing in various ways. One useful tool standard in spatial statistics is the variogram, which relates dissimilarity as squared differences between values to their separation in time or distance in space. Diggle and others have shown uses for variograms in time series and longitudinal data analysis. I discuss user-written Stata commands for variogram calculation, plotting and use in relation to exploratory data analysis on the one hand and smoothing on the other.
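
To make the variogram idea concrete, here is a minimal Python sketch for an irregularly spaced series. It is not one of the user-written Stata commands discussed in the talk. The function name, the data and the bin width are hypothetical, and the factor of one half (the usual semivariogram convention) is an assumption about scaling.

```python
from collections import defaultdict

def empirical_variogram(times, values, bin_width):
    """Half the mean squared difference between pairs, grouped by time lag.

    Returns {bin index: (variogram estimate, number of pairs)}, where bin k
    covers lags in [k * bin_width, (k + 1) * bin_width).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(times)):
        for j in range(i + 1, len(times)):
            lag = abs(times[j] - times[i])
            b = int(lag // bin_width)
            sums[b] += 0.5 * (values[j] - values[i]) ** 2
            counts[b] += 1
    return {b: (sums[b] / counts[b], counts[b]) for b in sorted(sums)}

# Hypothetical irregularly spaced measurements
times = [0.0, 1.0, 2.5, 4.0, 7.0]
values = [1.0, 2.0, 2.0, 4.0, 3.0]
vg = empirical_variogram(times, values, bin_width=2.0)
for b, (gamma, pairs) in vg.items():
    print(b, round(gamma, 3), pairs)
```

Because every pair of observations contributes regardless of spacing, no regular grid is needed, which is why the tool suits irregular series.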

rscore: a Stata module to compute responsiveness scores

This paper presents rscore, a Stata module to compute unit responsiveness scores using an iterated random coefficient regression (RCR). The basic econometrics of this model can be found in Wooldridge (2002, pp. 638-642). The model estimated by rscore starts from a classical regression of Y, the target variable, on a series of factors X (the regressors), by assuming a different reaction (or responsiveness) of each unit to each factor contained in X. This is done by using a random coefficient regression (RCR), an approach in which the usual regression coefficients vary across units. The application of such an approach can convey new and interesting analytical findings compared to the traditional regression approach. In particular, by measuring a unit-specific regression coefficient for each regressor, this model allows for: (i) ranking units according to the level of the responsiveness score obtained; (ii) detecting factors that are more influential in driving unit performance; (iii) studying, more generally, the distribution (variety) of the factors’ responsiveness scores across units. The knowledge of these idiosyncratic scores can also be exploited to test the presence of increasing, constant, or decreasing returns of Y to X in a straightforward and graphically easy-to-read way.

Fast Bayesian modelling in Stan using the StataStan program

Over the last three years, a new package for Bayesian modelling called Stan (after Stanislaw Ulam, co-author of the Metropolis algorithm) has been developing quickly and making an impact on computing for complex Bayesian models. By translating the model into C++ and then compiling that, it can run much faster than BUGS. A particular benefit is for simulation studies, because the model only needs to be compiled once. Furthermore, it includes a much faster and better mixing algorithm (NUTS: the No U-Turn Sampler), especially for correlated parameters that Gibbs samplers like BUGS cope with badly. I present a program StataStan, which sends your data and specifications to Stan, displays results, and can read the chains of samples back into Stata. There are also specific commands to run the commonly used models in the BUGS and Stan user manuals with your own data, avoiding the need to write the Stan model.

Efficient multivariate normal distribution calculations in Stata

The normal distribution holds significant importance in statistics. Much gathered real world data either is, or is assumed to be, normally distributed. Today though, a considerable amount of statistical analysis performed is not univariate, but multivariate in nature. Consequently, the multivariate normal distribution is of increasing importance. However, the complexity of this distribution makes computational analysis almost certainly necessary, and thus much research has been conducted into developing efficient algorithms for its numerical analysis. Here we discuss our implementation of a certain choice of algorithm in Mata that allows its distribution function and equi-coordinate quantiles to be identified seamlessly for any choice of location vector and positive semi-definite covariance matrix. Moreover, we detail new commands to efficiently compute its density and to generate pseudo-random variables. We then discuss the performance of our commands relative to the presently available alternatives, and present how they provide greater generalisation and improved computational speed. Finally, through the example of designing a group sequential clinical trial, we demonstrate how our commands can be used easily to solve real-world problems facing Stata users.

A new Stata command for computing and graphing percentile shares

Percentile shares provide an intuitive and easy-to-understand way to analyze income or wealth distributions. A celebrated example is the top income shares reported in the works of Thomas Piketty and colleagues. Moreover, series of percentile shares, defined as differences between Lorenz ordinates, can be used to visualize whole distributions or changes in distributions. In this talk I present a new command called pshare that computes and graphs percentile shares (or changes in percentile shares) from individual level data. The command also provides confidence intervals and supports survey estimation.
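
The definition of percentile shares as differences between Lorenz ordinates can be sketched in a few lines of Python. This is an illustration only and not the pshare implementation: the function names and incomes are hypothetical, and the linear interpolation within an observation is an assumption. Weights, confidence intervals and survey design are omitted.

```python
def lorenz_ordinate(sorted_incomes, p):
    """Share of total income held by the poorest fraction p of units."""
    n = len(sorted_incomes)
    total = sum(sorted_incomes)
    position = p * n
    whole = int(position)
    cumulative = sum(sorted_incomes[:whole])
    if whole < n:
        # Interpolate linearly within the next observation
        cumulative += (position - whole) * sorted_incomes[whole]
    return cumulative / total

def percentile_shares(incomes, cutoffs):
    """Shares of total income falling between consecutive cutoffs."""
    ordered = sorted(incomes)
    ordinates = [lorenz_ordinate(ordered, p) for p in cutoffs]
    return [hi - lo for lo, hi in zip(ordinates, ordinates[1:])]

# Hypothetical incomes: bottom 20%, middle 60% and top 20% shares
incomes = [10, 20, 30, 40, 100]
shares = percentile_shares(incomes, [0.0, 0.2, 0.8, 1.0])
print([round(s, 3) for s in shares])
```

When the cutoffs run from 0 to 1, the shares sum to one, so the same output can be read as a decomposition of the whole distribution.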

Using MICE to investigate loss to follow up in a 10 year cohort of HIV positive patients in Haiti

Loss to follow-up is unavoidable in many public health studies. Tracing all subjects may be impractical or prohibitively expensive. Traditional methods, including Kaplan-Meier analysis and inverse probability weighting (IPW), produce biased estimates if loss is not independent of survival. Multiple imputation with chained equations (MICE) provides an acceptable, robust and cost saving solution to this problem for HIV research in developing countries with limited resources. To illustrate utility, we applied MICE to ascertain outcome status of people who were lost to follow-up within a cohort of N=910 HIV positive people followed for ten years in Port au Prince, Haiti. Of these, 17% (n = 156) were lost to follow-up and 8% (n = 71) transferred facilities. Contact tracing was performed and 45 of the 156 subjects identified as lost to follow-up were found: 37 alive and 8 deceased. Analysis using IPW based on the traced subjects predicted that 63% of all subjects were alive at 10 years (95% CI 0.59-0.67). Results from MICE predicted that within 6 months 12% (95% CI 0.10-0.14) of those who were lost to follow-up or transferred were dead and 88% were alive (95% CI 0.86-0.90). At 10 years, 33% were predicted to be dead (95% CI 0.29-0.36) and 67% (95% CI 0.64-0.71) were predicted to be alive. We found MICE to be more robust in predicting status as it allowed us to impute missing data so that we had the maximum number of observations to perform regression analyses. Additionally, the results were easier to interpret, less likely to be biased, and provided an interesting insight into a problem that is often commented upon in the extant literature. Overall, MICE is a useful cost saving method for studying survival compared to contact tracing for HIV research in developing countries.

Big Data in Stata

With more and more data being stored by organizations across industries – from academia, to health care, to banking – along with plummeting storage and RAM costs, there is a growing need for tools to analyze “big data”. The world is moving from needing to analyze megabytes of data to needing to analyze many gigabytes. While Stata is very user-friendly, many of the most basic commands – summarize, sample, collapse, encode, etc. – are not optimized for speed. These commands – as of Stata 14 – all rely on sorting, making them tens, or even hundreds (in the case of sample), of times slower than what is possible with better algorithms. In this presentation I illustrate alternative algorithms along with coded examples in Stata, Mata, and C++ plugins which may be used to more quickly analyze big data. fastsample and fastcollapse are available from the SSC.
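
As one example of a sort-free alternative, the Python sketch below draws a fixed-size random sample in a single pass using classical selection sampling. This is a textbook technique chosen for illustration. Whether fastsample uses this or a different algorithm is not stated in the abstract, and the function name and sizes are hypothetical.

```python
import random

def sample_without_sorting(n_obs, n_keep, rng):
    """Selection sampling: one pass, returns exactly n_keep of n_obs indices."""
    chosen = []
    remaining_to_pick = n_keep
    for i in range(n_obs):
        remaining_obs = n_obs - i
        # Keep observation i with probability remaining_to_pick / remaining_obs
        if rng.random() * remaining_obs < remaining_to_pick:
            chosen.append(i)
            remaining_to_pick -= 1
            if remaining_to_pick == 0:
                break
    return chosen

rng = random.Random(2015)
kept = sample_without_sorting(1_000_000, 10, rng)
print(len(kept))
```

The work grows linearly with the number of observations, whereas an approach that attaches a random number to every observation and sorts on it grows faster than linearly.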

How used are user-released commands? Introducing ssccount

Statisticians and econometricians developing new methods are keen for their methods to be adopted, and releasing user-friendly software plays an important role in uptake. Methods that were not initially applied much, and became so after software implementations, include Cox’s proportional-hazards model, multiple imputation and propensity score matching. It is easy to release packages to the Stata community via the Boston College Statistical Software Components (SSC) archive, but gauging the uptake can be difficult. Stata’s ssc hot command lists the number of hits for a recent month for packages available on SSC. The new ssccount command goes further, obtaining monthly files of hits (from July 2007 when records began) for specified authors and packages, and optionally plots the number of hits over time. This can give authors an impression of how well their commands are being used. Funders are increasingly asking for evidence of impact, and thus ssccount provides a useful soft measure.

Somers’ D: A common currency for associations

Somers’ D(Y|X) is an asymmetric measure of ordinal association between two variables Y and X, on a scale from –1 to 1. It is defined as the difference between the conditional probabilities of concordance and discordance between two randomly-sampled (X,Y)-pairs, given that the two X-values are ordered. The somersd package enables the user to estimate Somers’ D for a wide range of sampling schemes, allowing clustering and/or sampling-probability weighting and/or restriction to comparisons within strata. Somers’ D has the useful feature that a larger D(Y|X) cannot be secondary to a smaller D(W|X) with the same sign, enabling us to make scientific statements that the first ordinal association cannot be caused by the second. An important practical example, especially for public-health scientists, is the case where Y is an outcome, X an exposure, and W a propensity score. However, an audience accustomed to other measures of association may be culture-shocked, if we present associations measured using Somers’ D. Fortunately, under some commonly-used models, Somers’ D is related monotonically to an alternative association measure, which may be more clearly related to the practical question of how much good we can do. These relationships are nearly linear (or log-linear) over the range of Somers’ D values from –0.5 to 0.5. We present examples with X and Y binary, with X binary and Y a survival time, with X binary and Y conditionally Normal, and with X and Y bivariate Normal. Somers’ D can therefore be used as a common currency for comparing a wide range of associations between variables, not limited to a particular model.
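
The definition at the start of this abstract can be written out directly. The Python sketch below computes the sample Somers’ D(Y|X) by brute force over all pairs. It is not the somersd package: the function name and data are hypothetical, and clustering, weights, strata and confidence intervals are left out.

```python
def somers_d(y, x):
    """Sample Somers' D(Y|X): (concordant - discordant) / pairs with unequal X."""
    concordant = discordant = untied_x = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[j] - x[i]
            if dx == 0:
                continue  # pairs tied on X are excluded from the denominator
            untied_x += 1
            product = dx * (y[j] - y[i])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
            # pairs tied on Y count in the denominator only
    return (concordant - discordant) / untied_x

# Hypothetical binary exposure X and ordinal outcome Y
exposure = [0, 0, 0, 1, 1, 1]
outcome = [1, 2, 3, 2, 4, 5]
d = somers_d(outcome, exposure)
print(round(d, 3))
```

With a binary X, as here, the value is the difference between the probability that an exposed unit has the larger outcome and the probability that an unexposed unit does.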

Irene Petersen, Department of Primary Care and Population Health, University College London, UK, i.petersen@ucl.ac.uk

Ethnicity is an important factor to be considered in many epidemiological studies because of its association with inequality in disease prevalence and the utilisation of healthcare. Ethnicity recording has been incorporated in primary care electronic health records, and therefore is available in a number of large UK primary care databases such as The Health Improvement Network (THIN). However, since primary care data are routinely collected to serve clinical purposes, a large amount of data relevant for research purposes, including ethnicity, is often missing. A popular approach is to use multiple imputation, but standard multiple imputation does not give plausible estimates of the ethnicity distribution in THIN compared to the general UK population. However, census data can be utilised to form weights to use in multiple imputation such that the correct ethnicity distribution is recovered. I will describe how the method of weighted multiple imputation of missing data is implemented using Stata’s mi impute suite, note some issues, and introduce a new procedure to implement the method for multiple incomplete variables which require different imputation weights. Finally, I will give an example showing how the method works when ethnicity is used as an explanatory variable in a cohort study.

Robust covariance estimation for quantile regression

Quantile regression is increasingly used by practitioners, but there are still some misconceptions about how difficult it is to obtain valid standard errors in this context. In this presentation I discuss the estimation of the covariance matrix of the quantile regression estimator, focusing special attention on the case where the regression errors may be heteroskedastic and/or “clustered”. Specification tests to detect heteroskedasticity and intra-cluster correlation are discussed, and small simulation studies illustrate the finite sample performance of the tests and of the covariance matrix estimators. The presentation concludes with a brief description of qreg2, which is a wrapper for qreg that implements all the methods discussed in the presentation.

Influence functions at work

This presentation illustrates three practical uses of influence functions (IF) in Stata. First (and most obviously), inspection of IFs helps detect influential sample observations. I show how this can be done in practice and how similar this is to examining jackknife replicates. Second, IFs make it easy to calculate (asymptotic) standard errors and confidence intervals for a wide range of statistics. I illustrate how this can be done in Stata with the total command so as to account for complex survey design easily. Third and finally, application of ‘recentered influence function (RIF) regression’ has recently been advocated to approximate the impact of covariates on (unconditional) distribution statistics. I demonstrate this use of IFs in Stata and discuss interpretation of RIF regression model coefficients. Empirical applications are to income distribution analysis. Several user-written utilities and commands are illustrated along the way.
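
The first two uses can be illustrated with the simplest statistic, the sample mean, whose influence function is each observation's deviation from the mean. The Python sketch below is an illustration only and not the presenter's Stata code. The names and incomes are hypothetical, and simple random sampling is assumed, with no survey design.

```python
import math

def se_from_influence(if_values):
    """Asymptotic standard error: sqrt(sample variance of IF values / n)."""
    n = len(if_values)
    centre = sum(if_values) / n
    variance = sum((v - centre) ** 2 for v in if_values) / (n - 1)
    return math.sqrt(variance / n)

# Hypothetical incomes with one very large value
incomes = [12.0, 15.0, 9.0, 30.0, 14.0, 100.0]
n = len(incomes)
mean = sum(incomes) / n

# Influence function of the mean: IF_i = x_i - mean
if_mean = [x - mean for x in incomes]

# Use 1: the observation with the largest absolute IF value is the most influential
most_influential = max(range(n), key=lambda i: abs(if_mean[i]))

# Use 2: a standard error computed from the IF values alone
se_mean = se_from_influence(if_mean)
print(most_influential, round(se_mean, 3))
```

For the mean this reproduces the familiar standard deviation divided by the square root of n. The same recipe applies to other statistics once their influence function values have been computed.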

Who has won Rugby Union World Cups, and why? A sequential approach based on multinomial probit

Final economic outcomes are often determined over consecutive process stages. The most prevalent approach is to model inter-nodal transition/event probabilities using techniques such as sequential logit. Transition success for survivors at each stage is then regressed on explanatory variables using standard logit (allowing for correlation in the error terms). This seemingly unrelated approach benefits from methodological convenience. It crucially depends, however, on the assumption that, at each stage, any unobservable factors are independent. We believe that error term independence may often be an excessively strong assumption. We propose an alternative approach based on multinomial probit that does not rely on that very restrictive assumption. Implementation is no more demanding. We describe the procedure using Stata 13. To illustrate the usefulness of the method, we estimate the determinants of success for each stage at the Rugby World Cup.

Alexander Zlotnik, Department of Electronic Engineering, Technical University of Madrid, azlotnik@die.upm.es

The integration of Stata with web applications can be of great use in some contexts. One such scenario is to make user-written Stata commands available directly through a webpage from any web-enabled device, such as a smartphone, tablet computer, personal digital assistant (PDA) or any personal computer with a web browser. This would allow reaching a large and diverse audience. Another scenario is the integration of subroutines written in Stata or Mata in web applications, which is desirable in organizations where statistical applications are developed by one team with Stata, while the rest of the business logic and front-end applications are developed by another team using different technologies. If Stata programs can be used directly, the often costly translation from Stata into other programming languages can be avoided, thus saving development resources and time, and eliminating the errors and discrepancies due to translation mistakes and limitations of target languages. We demonstrate an approach for executing user-written commands on Stata IC, Stata SE and Stata MP through a web application based on the WAMP stack (Microsoft Windows, Apache, MySQL, PHP). Then, we introduce the adjustments needed for other operating systems, web servers and server-side scripting programming languages. We describe the requirements for Stata user-written commands accessible through web applications, their limitations, the bidirectional communication between Stata and generic web applications, possible solutions for concurrent execution scenarios, as well as the transformation of Stata dialog box (.dlg) files into web-ready HTML / CSS / JavaScript interfaces. Finally, we mention web application security principles, Stata-based web services and software licensing approaches.