Complex Survey

9:05–9:25

Approaches to imputing missing data in complex survey data

Abstract:
Complex survey data collected by government agencies are both
expensive and valuable. Producing a complete dataset is
important, but missing data in complex survey data pose some
unique challenges. Commonly used statistical software packages
such as Stata, SAS, and SUDAAN each have a procedure to impute
the missing data. However, unlike the procedures for describing
and analyzing complex survey data, the procedures implemented by
these three software programs are fundamentally different. The
three approaches will be described, and an example will show the
similarities and differences. Recent developments in this
area at the Census Bureau will also be discussed.

Abstract:
Calibration is a method for adjusting the sampling weights and is often used to
account for nonresponse and underrepresented groups in the population.
Another benefit of calibration is smaller variance estimates compared with
estimates using unadjusted weights.
Stata implements two methods for calibration: the raking-ratio method
and the generalized regression method.
Stata supports calibration for the estimation of totals, ratios, and
regression models.
Calibration is also supported by each survey variance-estimation method
implemented in Stata.
In this presentation, I will show how to use calibration in survey data
analysis using Stata.
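
To make the raking-ratio idea concrete, here is a minimal sketch in Python rather than Stata. It is not Stata's implementation. The function names, the six-unit sample, and the control totals are all hypothetical, chosen only to show weights being rescaled margin by margin until the weighted totals match known population totals.

```python
# Raking-ratio calibration sketch (iterative proportional fitting).
# All data, names, and control totals below are hypothetical.

def rake(weights, groups, targets, tol=1e-10, max_iter=100):
    """Rescale weights until weighted totals match the targets on each margin.

    weights: starting (design) weights, one per sampled unit
    groups:  dict margin name -> list of category labels, one per unit
    targets: dict margin name -> dict category -> known population total
    """
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for margin, labels in groups.items():
            # Current weighted total in each category of this margin.
            totals = {}
            for wi, lab in zip(w, labels):
                totals[lab] = totals.get(lab, 0.0) + wi
            # Multiply each weight by (target total / current total).
            for i, lab in enumerate(labels):
                factor = targets[margin][lab] / totals[lab]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        # Stop once a full pass leaves every weight essentially unchanged.
        if max_change < tol:
            break
    return w


def weighted_totals(w, labels):
    """Sum the weights within each category."""
    totals = {}
    for wi, lab in zip(w, labels):
        totals[lab] = totals.get(lab, 0.0) + wi
    return totals


design_w = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
groups = {
    "sex": ["f", "f", "f", "m", "m", "m"],
    "region": ["n", "s", "s", "n", "n", "s"],
}
targets = {
    "sex": {"f": 36.0, "m": 30.0},
    "region": {"n": 33.0, "s": 33.0},
}
raked_w = rake(design_w, groups, targets)
print(weighted_totals(raked_w, groups["sex"]))
print(weighted_totals(raked_w, groups["region"]))
```

The control totals for the two margins must sum to the same population size (66 here); otherwise no set of weights can satisfy both margins at once.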

Causal Inference, Endogeneity, and Data Science

Abstract:
Contactless credit cards are a payment innovation combining the
speed and convenience of paying cash with desirable features of
credit card payments, for example, enhanced record keeping and the
ability to earn rewards. There have been several attempts to
measure the impact that contactless credit card adoption has on
consumers' use of cash for making point-of-sale transactions. Fung,
Huynh, and Sabetti (2014) use data from the Bank of Canada's 2009
Methods-of-Payment survey to estimate that contactless adoption
results in a decline of 10% for the volume share of purchases made
with cash. This analysis was undertaken when use and acceptance of
contactless payment was still nascent. Chen, Felt, and Huynh (2017),
by contrast, find no impact on the cash share. Their work exploited
panel-data structure to better control for unobserved heterogeneity
across consumers. Part of the difficulty in measuring the impact of
contactless adoption on cash usage is the obvious endogeneity issue:
it is unclear whether adoption of contactless technology lowers cash
usage or whether cash intensive consumers are less likely to adopt
contactless, perhaps for other reasons, for example, a preference for anonymity.
Huynh, Schmidt-Dengler, and Stix (2014) show that merchant acceptance
also plays a crucial role in cash usage, further complicating the
causality issue as contactless terminals, while increasing over time,
are certainly not ubiquitous. Recent work by Nam (2016) using an
approach developed by Wooldridge (2014) allows us to address this
problem and provide a more robust model of payment choice and
contactless adoption. We utilize data from the Bank of Canada's 2013
Methods-of-Payment survey. The survey included a three-day payments
diary that tracks respondents' purchases over the course of three
days; this allows us to calculate cash, debit, and credit shares.
These shares have an obvious dependence: an increase in the
cash share will necessarily lead to a decrease in either debit or
credit because the shares must add to one. Nam's estimator allows us
to model this effect while simultaneously accounting for the
endogenous contactless adoption decision, hence providing more reliable
estimates of the impact on cash. We implement the estimator in Stata
and provide a method for bootstrapping error estimates.

Abstract:
Causal inference generally relies on strong assumptions
of exogeneity and selection on observables. Applied
researchers regularly make these assumptions but are
often concerned that their results may be sensitive to
small violations of them. This presentation will describe
one approach to this problem: inferring the correlation
between the treatment and unobservables from the observed
correlation between the treatment and the observable/control
variables. I will also describe implementation of this method
in Stata and some practical considerations in its use.

Abstract:
Machine learning techniques are utilized in this presentation
to improve upon the selection correction procedure of
Dahl (2002). Dahl's nonparametric method is widely
used in the empirical economics literature to control
for selection bias; however, it relies on a strong
identification assumption. This single index sufficiency
assumption (SISA) imposes restrictions on the error terms
of the selection equation that are likely violated in many
applications. This contribution establishes a modified
correction procedure that uses variable selection techniques
to relax this assumption. Identification in this alternative
procedure relies on a restriction that is data driven and
is a relaxation of the SISA. Variable selection is performed
by employing the post-double-lasso estimator of Belloni,
Chernozhukov, and Hansen (2014). This is implemented in Stata
using lassopack, a set of community-contributed commands
by Ahrens, Hansen, and Schaffer. I perform a numerical
experiment that establishes that this method is preferable to
traditional correction procedures in all cases, except where
researchers have strong a priori reasons to suspect that the
SISA holds. Machine learning methods, combined with the
insights of Lee (1983), can therefore be used to control for
selection bias, while overcoming the curse of dimensionality,
without the imposition of overly strong distributional assumptions.

Abstract:
Random Forest is a statistical machine-learning algorithm
for prediction and classification under supervised learning.
Our Stata command randomforest implements this algorithm
through a plugin to the WEKA library. randomforest is
available for Windows/Mac/Linux. We will review the algorithm
and illustrate randomforest with two
examples: 1) prediction of the election outcomes for individual
constituencies of the 2017 British Election Study data and
2) prediction of household income from the 2016 US Consumer
Finance Survey data.

Learning Tools

1:15–2:05

Efficient dynamic documents using Stata

Abstract:
Stata 15 includes three new commands for producing dynamic
documents: dyndoc, putdocx, and putpdf.
These commands have generated much interest in the user
community; this has led to a large amount of
community-contributed software. In this presentation, I'll give some
tips about how to use the commands efficiently both with official
Stata software and with some of these community-contributed tools.

Abstract:
geotools is a community-contributed set of tools for exporting data
from Stata datasets in ubiquitous ShapeFile and GeoJSON
formats. These formats are supported by numerous online and
offline GIS systems, including ESRI's ArcView/ArcGIS products,
Google API, and other GIS and data-visualization systems. The
input data may come from the user's own data collection, such as with
the use of GPS sensors in the growing segment of CAPI data
collection software, or it can be a product of geospatial data
analysis in Stata. The produced output can be utilized as layers
in composite multilayer maps, as interactive maps, etc. geotools
does not require online access or other software to produce its
output. In the presentation, I will give an overview of the functionality
and options of geotools and establish relations with other
community-contributed Stata modules related to GIS capabilities/file
formats.

Abstract:
I present instructional aids using Stata that I have found
useful for an introductory course on biostatistics taught
at the University of Toronto. Particularly useful tools
include CDF graphs that highlight the fact that treatment
effects in logit and other binary response models depend on
the variance of the latent underlying continuous variable;
animations that show the relationship between hypothesis
tests on a parameter value and the corresponding confidence
interval; and a slightly generalized form of the power-by-simulation
Stata program developed by A. H. Feiveson.

Upgrading business statistics curriculum to meet the needs of knowledge workers

Abstract:
Business faculties are often the largest units in universities
and colleges in North America. During 2013–14, over 300,000
graduate and undergraduate business degrees were conferred by
North American business schools accredited by the Association
to Advance Collegiate Schools of Business (AACSB). For the same
period, over 1.1 million students were enrolled in graduate and
undergraduate programs in business faculties. At the undergraduate
level, most business students are required to take at least one,
and in most cases two, courses in business statistics and analytics.
A quick review of course outlines and the table of contents of
popular textbooks in business statistics will reveal that not much
has changed over the past few decades in the way statistics is
taught to business students. Despite the emergence of big data,
advances in computing power, availability of open-source software and
open data, business statistics curricula still follow the learning
paths established before the successive revolutions in computing.
Thus, students are still taught how to conduct a battery of
inferential tests, while most tests could be replaced with regression
models. Consider that instructors continue to spend one or more
lectures introducing t-tests in undergraduate courses, while the
same output could be readily obtained from a regression model with
a continuous dependent and a categorical explanatory variable. In
this presentation, I highlight the need to update the curriculum for
courses in business statistics. I make the case to replace inferential
tests, for example, t-tests and correlation tests, with regression
models and introduce regression-driven inferential statistics earlier
in the course rather than at the very end, which continues to be the case
today. I also highlight the need to introduce basic machine-learning
algorithms to the curriculum so that one can narrow the gap between
the analytic skills desired by businesses and the statistical training
imparted to business students.
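
The claim that a t-test can be read off a regression is easy to check numerically. The sketch below uses Python rather than Stata, with made-up data and hypothetical function names. It compares the equal-variance two-sample t statistic with the slope t statistic from an ordinary least-squares regression of the outcome on a 0/1 group indicator, using classical standard errors.

```python
import math

# Hypothetical data: a continuous outcome observed in two groups.
group0 = [4.1, 5.0, 3.8, 4.6, 5.2]
group1 = [5.9, 6.4, 5.1, 6.8, 6.0, 5.5]


def mean(xs):
    return sum(xs) / len(xs)


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    na, nb = len(a), len(b)
    ssa = sum((v - mean(a)) ** 2 for v in a)
    ssb = sum((v - mean(b)) ** 2 for v in b)
    s2 = (ssa + ssb) / (na + nb - 2)  # pooled variance estimate
    return (mean(b) - mean(a)) / math.sqrt(s2 * (1 / na + 1 / nb))


def ols_slope_t(x, y):
    """t statistic for b in y = a + b*x + e, with classical standard errors."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return b / se_b


# Stack the two groups and code group membership as a 0/1 indicator.
y = group0 + group1
d = [0] * len(group0) + [1] * len(group1)

t_test = pooled_t(group0, group1)
t_reg = ols_slope_t(d, y)
print(t_test, t_reg)
```

The two statistics agree up to floating-point error because the fitted values of the regression are the two group means, so the residual variance equals the pooled variance. The equivalence is with the equal-variance t-test. The unequal-variance (Welch) version corresponds to a different standard error.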

Clustering

3:20–3:50

Inference with clustered data

Abstract:
This article introduces clusteff, a new Stata command for
checking the severity of cluster heterogeneity in cluster–robust
analyses. Cluster heterogeneity can cause a size
distortion leading to under-rejection of the null hypothesis.
Carter, Schnepel, and Steigerwald (2015) develop the effective
number of clusters to reflect a reduction in the degrees of
freedom, thereby mirroring the distortion caused by assuming
homogeneous clusters. clusteff generates the effective number
of clusters. We provide a decision tree for cluster–robust
analysis, demonstrate the use of clusteff, and recommend
methods to minimize the size distortion.

Abstract:
The Stata package boottest implements a wide variety of
bootstrap tests, including tests for linear regression
models that are robust to one-way or multiway clustering.
I explain how these tests work and provide empirical examples.
In the one-way case, the program can generate the bootstrap
data in two different ways, using the wild bootstrap or the
wild cluster bootstrap. In the two-way case, it can do so in
four different ways, using the wild bootstrap or three variants
of the wild cluster bootstrap. For each method, four different
p-values can be calculated to handle all types of
one-sided and two-sided tests.
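
As a rough illustration of the one-way wild cluster bootstrap, the Python sketch below tests a zero slope in a simple regression. It is not the boottest implementation. It assumes Rademacher (plus or minus one) weights drawn once per cluster, imposes the null hypothesis when generating the bootstrap data, uses a cluster-robust variance without small-sample correction, and reports a symmetric two-sided p-value. The simulated data and all function names are hypothetical.

```python
import math
import random


def cluster_t(x, y, cluster):
    """Slope t statistic in y = a + b*x + e with a cluster-robust variance
    (no small-sample correction)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    xd = [xi - xbar for xi in x]
    sxx = sum(v * v for v in xd)
    b = sum(v * yi for v, yi in zip(xd, y)) / sxx
    a = ybar - b * xbar
    # Sum the scores (demeaned x times residual) within each cluster.
    score = {}
    for v, xi, yi, g in zip(xd, x, y, cluster):
        score[g] = score.get(g, 0.0) + v * (yi - a - b * xi)
    var_b = sum(s * s for s in score.values()) / sxx ** 2
    return b / math.sqrt(var_b)


def wild_cluster_p(x, y, cluster, reps=999, seed=1):
    """Two-sided bootstrap p-value for H0: b = 0, with the null imposed."""
    rng = random.Random(seed)
    t_obs = cluster_t(x, y, cluster)
    ybar = sum(y) / len(y)
    resid0 = [yi - ybar for yi in y]  # residuals from the restricted fit
    ids = sorted(set(cluster))
    hits = 0
    for _ in range(reps):
        # One Rademacher draw per cluster, shared by all its observations.
        sign = {g: rng.choice((-1.0, 1.0)) for g in ids}
        y_star = [ybar + sign[g] * r for g, r in zip(cluster, resid0)]
        if abs(cluster_t(x, y_star, cluster)) >= abs(t_obs):
            hits += 1
    return hits / reps


# Simulated data: 10 clusters of 20 observations with a shared cluster shock.
data_rng = random.Random(42)
x, y, cluster = [], [], []
for g in range(10):
    shock = data_rng.gauss(0, 1)
    for _ in range(20):
        xi = data_rng.gauss(0, 1)
        x.append(xi)
        y.append(1.0 + 0.5 * xi + shock + data_rng.gauss(0, 1))
        cluster.append(g)

p_value = wild_cluster_p(x, y, cluster)
print(p_value)
```

Flipping the sign of a whole cluster's residuals at once keeps the within-cluster dependence intact in every bootstrap sample, which is the point of the cluster variant.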

Abstract:
Stata developers present will carefully and cautiously
consider wishes and grumbles from Stata users in the audience.
Questions, and possibly answers, may concern reports of
present bugs and limitations or requests for new features in
future releases of the software.