Bicycling and walking are two methods of transportation that have significant benefits over cars or even public transportation. The costs of supporting the necessary infrastructure and individuals' costs of travel are lower, and higher activity level is liked to better health and wellbeing. It has become common in the last decade for urban neighborhoods to promote themselves with a walk score, indicating how easy it is to live in that neighborhood and walk as a primary mode of transportation. These scores typically rely on the availability and accessibility of sidewalks, shops, grocery stores, parks, etc.

This project aims to understand the socioeconomic factors that lead people to choose bicycling and use that knowledge to develop a "bikability" score for neighborhoods in the city of Chicago.

The American Community Survey provides tabular data on a number of socioeconomic features at a very small spatial scale, with over 2500 "block groups" within the Chicago city limits. This data includes a question on the method used for commuting to work, which includes by bicycle. The ACS report Modes Less Traveled - Bicycling and Walking to Work in the United States explores this data on a national level, and uncovers some very useful trends that we will look for in the Chicago-centric data:

Men commute by bicycle more often than women (0.8% to 0.3%, respectively)

The rate of commuting steadily drops with age from ~1% from 16-24 to 0.3% for workers 55 and older

Workers with a graduate or professional degree were 3 times as likely to commute by bicycle than workers with only a high school diploma.

Households with income less than \$25,000 were 2-3 times more likely to commute by bicycle

This suggests two different populations of bicycle commuters: those with low education and income where bicycling proves an economical choise, and those with high education and income. In both cases young men in households without children predominate.

Initial data exploration

In this section we will load the commuting dataset for Chicago and perform some simple cleaning and analysis to ensure it is suitable for this project. The ACS is downloadable at the census block group level for individual counties, in this case Cook county which contains virrtually all of Chicago. We use the GEOID encoding to map block groups to census tracts and this helpful mappping of the 2010 census tracts to Chicago community area (slightly larger than a neighborhood).

In [2]:

defload_acs2013_commuting():""" Load the ACS 2013 survey on method of commuting to work for all of Cook county and restrict to those census blocks located within the city limits"""# load all block groups in Cook countyacs=pd.read_csv("../ACS/ACS_13_5YR_B08301_with_ann.csv",header=1,index_col=0,usecols=["Id2","Estimate; Total:","Margin of Error; Total:","Estimate; Bicycle","Margin of Error; Bicycle"])# rename columnsacs.columns=["total","total_error","bicycle","bicycle_error"]# parse tract codes from Id2 (GEOID) for collationacs['tract']=[int(str(i)[5:11])foriinacs.index]# load list of Chicago tracts and associated communitiescommunity=pd.read_csv("../communities/chicago_census_tract_to_community.csv",header=0,index_col=0,comment="#")community.columns=['community_id','community']# eliminate tracts not in Chicago city boundariesacs=acs.join(community['community'],on='tract',how='inner')# remove handful of census blocks with no working population# (generally non-residential locations like O'Hare and Midway airports)acs=acs[acs['total']>0]# Convert from ACS 90% margin of error to 68% 1-sigma S/N# erf(1.645/sqrt(2)) ~ 0.9acs[['total_error','bicycle_error']]/=1.645returnacs

In [3]:

acs=load_acs2013_commuting()

In [4]:

defaggregate_error(x):""" A margin of error of 11 represents the sample error when the estimate is 0. ACS recommends adding only one such value when aggregating see ACS "Multiyear Accuracy of the Data", however it appears to underestimate uncertainty for community level aggregates """#if (x == 11).any():# return np.sqrt((x[x != 11]**2).sum() + 11**2)#else:returnnp.sqrt(x.dot(x))defratio_error(X,Y,Xerr,Yerr):""" Compute the error of the ratio of two population statistics where X is a subset of Y, and Xerr and Yerr are the sample margins of error on X and Y respectively."""if(np.asarray(Y)==0).any():raiseValueError("Zero value detected in numerator of ratio estimate")YXerr=Y**2*Xerr**2XYerr=X**2*Yerr**2# ACS recommends using standard error for proportion when# X is subset of Y, and normal ratio form if any terms are# negative (Eqs 3 & 4 in ACS "Multiyear Accuracy of the Data")if(XYerr>YXerr).any():returnnp.sqrt(YXerr+XYerr)/Y**2else:returnnp.sqrt(YXerr-XYerr)/Y**2

This figure shows that a large fraction of the block groups have no bicycle commuters within the margin of error, which is unsurprising given the city-wide average of \(1.3\%\). It's tempting to simply remove those blocks from the sample, however that will introduce a correlation with the population of each block (in that a smaller total population increases the likelihood of no detection). There are also correlations between the \(S/N\) and the fraction at the detection threshold, with some of the largest fractions having \(S/N \sim 1\); these may simply be outliers.

Because the data is so noisy at the census block level, we explore the possibility of aggregating geographical regions with low \(S/N\) to increase their signal.

In [7]:

# aggregate by Census tractacs_tract=acs.groupby('tract').agg({"total":"sum","total_error":aggregate_error,"bicycle":"sum","bicycle_error":aggregate_error,"community":"first"})acs_tract['bicycle_frac']=acs_tract['bicycle']/acs_tract['total']acs_tract['bicycle_frac_error']=ratio_error(acs_tract['bicycle'],acs_tract['total'],acs_tract['bicycle_error'],acs_tract['total_error'])acs_tract['bicycle_SN']=acs_tract['bicycle']/acs_tract['bicycle_error']# aggregate by Community (name)acs_community=acs.groupby('community').agg({"total":"sum","total_error":aggregate_error,"bicycle":"sum","bicycle_error":aggregate_error})acs_community['bicycle_frac']=acs_community['bicycle']/acs_community['total']acs_community['bicycle_frac_error']=ratio_error(acs_community['bicycle'],acs_community['total'],acs_community['bicycle_error'],acs_community['total_error'])acs_community['bicycle_SN']=acs_community['bicycle']/acs_community['bicycle_error']print"Chicago communities with no measured bicycle commuters:"acs_community[['total']][acs_community['bicycle']==0]

Chicago communities with no measured bicycle commuters:

Out[7]:

total

community

Ashburn

18317

Auburn Gresham

14878

Avalon Park

3845

Burnside

809

Chatham

10919

Chicago Lawn

18504

Fuller Park

683

Garfield Ridge

15385

Hegewisch

3961

O'Hare

6770

Pullman

2460

Riverdale

1321

Washington Park

2992

West Pullman

8668

By aggregating into the much larger Chicago communities, we have reduced the number of data points without bicycle commuter samples to 14 (out of 77), and a by-eye inspection shows they are mostly in outlying, suburban areas where we would expect very low rates of bicycle commuters. Plotting the distribution of fractions and \(S/N\) for the Census tract and Chicago community aggregations we also see significant improvement for the aggregated samples. The \(S/N\) for Chicago communities in particular shows a nice linear trend with fraction, as you would expect if the uncertainty was approximately constant.

It would be a shame to coarsen the ACS data to the community scale in regions where the measurements are statistically significant at census tract or even block scales. Instead we keep data at the census tract level when all tracts in a community have data, and from the block level when all blocks in a tract have data. In principle we should make a further restriction on the \(S/N\) of those regions.

Note, this aggregation is dangerous. If the underlying distribution of bicycle commuters is not linear in the aggregated variables, this will bias any recovered trends. Preferentially aggregating regions with low probability also creates a bias (known as Malmquist bias in astronomy) where low counts of bicycle commuters are preferentially selected from census block groups with higher than average bicycle commuting rates (allowing them to remain in the sample). This can be somewhat ameliorated by fitting not raw counts but population fractions, which also helps eliminate the underlying correlations between variables proportional to the total population.

The final binned sample is computed below.

In [9]:

divide_communities=acs_community[acs_tract.groupby('community')['bicycle'].apply(lambdax:x.all())].indexkeep_communities=acs_community[~acs_community.index.isin(divide_communities)]print"Number of communities:",len(keep_communities)divide_tracts=acs_tract[(acs.groupby('tract')['bicycle'].apply(lambdax:x.all()))&(acs_tract['community'].isin(divide_communities))].indexkeep_tracts=acs_tract[~acs_tract.index.isin(divide_tracts)&acs_tract['community'].isin(divide_communities)].drop(['community'],axis=1)print"Number of census tracts:",len(keep_tracts)keep_blocks=acs[acs['tract'].isin(divide_tracts)].drop(['tract','community'],axis=1)print"Number of census blocks:",len(keep_blocks)print"Total regions included in sample:",len(keep_blocks)+len(keep_tracts)+len(keep_communities)acs_sample=pd.concat([keep_blocks,keep_tracts,keep_communities])

Number of communities: 69
Number of census tracts: 41
Number of census blocks: 109
Total regions included in sample: 219

Before proceeding to analysis, it's always nice (and a good idea) to plot the data after cleaning. The census kindly provides cartographic boundary shapefiles for census blocks in KML format. The following uses pyKML to parse the boundary database for IL by GEOID (identifying unique census block groups), then merges them into census tract and Chicago community level boundaries using shapely. Finally, we plot all regions kept in the sample, colored by the sampled bicycle commuter fraction.

frommatplotlib.collectionsimportPatchCollectionfrommatplotlib.colorsimportNormalizefromdescartesimportPolygonPatchdefplot_regions(regions,fill_value=None,plot_communities=False,cm=plt.cm.Blues,vmin=None,vmax=None,label=None,ec='k',lw=0.5,plot_bike_routes=False):ifnotfill_valueisNone:ifnotvmax:vmax=fill_value.max()ifnotvmin:vmin=fill_value.min()norm=Normalize(vmin=vmin,vmax=vmax)# hack to get a colorbar without a mappableZ=[[0,0],[0,0]]fcm=plt.figure()levels=np.linspace(vmin,vmax,100)contours=plt.contourf(Z,levels,cmap=cm)fcm.clf()f=plt.figure(figsize=(12,12))ax=f.add_axes([0,0,.9,1])ax.set_xlim(-87.89,-87.5)ax.set_ylim(41.6,42.05)ax.set_aspect(1)ax.get_xaxis().set_visible(False)ax.get_yaxis().set_visible(False)plt.axis('off')ifplot_communities:patches=[PolygonPatch(polygons[region],fill=0,ec='k')forregioninacs_community.index]else:patches=[]ifnotfill_valueisNone:patches.extend([PolygonPatch(polygons[r],fill=1,ec=ec,lw=lw,fc=cm(norm(fill_value[r])))forrinregions])else:patches.extend([PolygonPatch(polygons[r],fill=0,ec='k')forrinregions])ax.add_collection(PatchCollection(patches,match_original=True))ifplot_bike_routes:routes=load_bike_routes()forrinroutes:x,y=zip(*[(x[0],x[1])forxinr])ax.plot(x,y,color='c',lw=1)ifnotfill_valueisNone:f.colorbar(contours,cmap=cm,shrink=0.6,label=label,norm=norm)plt.close(fcm)plot_regions(acs_sample.index,fill_value=100.*acs_sample['bicycle_frac'],vmin=0.,label="% Workers Commute by Bicyle")print"Largest observed bicycle commuter fraction: {:.2f}%".format(100.*acs_sample['bicycle_frac'].max())

Largest observed bicycle commuter fraction: 13.72%

A major question of the remaining analysis is whether the observed spatial variation at block and tract scales is due mostly to noise or whether it is predictable from some measureable socioeconomic factors.

We naturally look to other population characteristics as measured in the ACS 2013 dataset as they are directly comparable, both in methodology and geographic breakdown. We also expect from the ACS report on national trends that age, education level, income and sex are all likely factors that we should include in our model. The ACS data are available as separate tables in CSV format, which we read and join with the bicycle commuter fraction dataset.

B01001 - Sex by Age

This table includes both sex and age characteristics. While nationwide data shows a significant difference in commuter fraction between Male and Female commuters, the region-by-region differences are minimal (no more than 10%), and so we find the sex alone is not a useful feature for characterising bicycle commuter fraction. We combine both sexes to obtain population counts by age, and divide into approximate thirds, 18-29, 30-54, and 55+.

C17002 - Ratio of income to poverty level

This table is currently not used in the analysis as more than 2/3 of the population falls in the last bin (2.00 and over). See B19001 for household income.

In [15]:

defload_acs2013_poverty_level(data_expression="fraction",sample=acs_sample):acs_poverty=load_acs2013_dataset("C17002",acs_sample)ifdata_expression=="fraction":poverty_groups={"below":[".50 to .99","Under .50"],"above":["1.00 to 1.24","1.25 to 1.49","1.50 to 1.84","1.85 to 1.99","2.00 and over"]}forgroup,poverty_rangesinpoverty_groups.items():compute_proportion(acs_poverty,poverty_ranges,"Poverty",group)returnacs_poverty.filter(like="Frac")else:# data_expressionacs_poverty.drop(["Estimate; Total:","Margin of Error; Total:"],axis=1,inplace=True)returnacs_poverty

B19001 - Household income

This table breaks down total household income (expressed in 2013 dollars) into a number of ranges from less than \$10,000 to more than \$200,000. In Chicago, approximate 1/3 have incomes < \$35k, \$35k-\$75k, and \$75k+.

In [16]:

defload_acs2013_household_income(data_expression="fraction",sample=acs_sample):acs_income=load_acs2013_dataset("B19001",sample)ifdata_expression=="fraction":income_groups={"$0-35k":["Less than $10,000","$10,000 to $14,999","$15,000 to $19,999","$20,000 to $24,999","$25,000 to $29,999","$30,000 to $34,999"],"$35k-75k":["$35,000 to $39,999","$40,000 to $44,999","$45,000 to $49,999","$50,000 to $59,999","$60,000 to $74,999"],"$75k+":["$75,000 to $99,999","$100,000 to $124,999","$125,000 to $149,999","$150,000 to $199,999","$200,000 or more"]}forgroup,income_rangesinincome_groups.items():compute_proportion(acs_income,income_ranges,"Income",group)returnacs_income.filter(like="Frac")else:# data_expressionacs_income.drop(["Estimate; Total:","Margin of Error; Total:"],axis=1,inplace=True)returnacs_income

B25063 - Gross rent for renter occupied households

The number of households in the region broken down by rent paid. Approximate thirds are located at up to \$799, \$800-\$1249, and \$1250+.

In [17]:

defload_acs2013_rent(data_expression="fraction",sample=acs_sample):acs_rent=load_acs2013_dataset("B25063",sample)ifdata_expression=="fraction":rent_groups={"$0-799":["Less than $100","$100 to $149","$150 to $199","$200 to $249","$250 to $299","$300 to $349","$350 to $399","$400 to $449","$450 to $499","$500 to $549","$550 to $599","$600 to $649","$650 to $699","$700 to $749","$750 to $799"],"$800-1249":["$800 to $899","$900 to $999","$1,000 to $1,249"],"$1250+":["$1,250 to $1,499","$1,500 to $1,999","$2,000 or more"]}forrent,rent_rangesinrent_groups.items():compute_proportion(acs_rent,rent_ranges,"Rent",rent,"With cash rent")returnacs_rent.filter(like="Frac")else:acs_rent.drop(["Estimate; Total:","Margin of Error; Total:","Estimate; With cash rent:","Margin of Error; With cash rent:","Estimate; No cash rent","Margin of Error; No cash rent"],axis=1,inplace=True)returnacs_rent

B15003 - Education level

The number of residents broken down by maximum education attained. Approximately 1/3 each of the Chicago population have no schooling through a high school diploma, GED through an associate's degree, and a bachelor's or professional degree.

Regression Analysis

In this section we use the ACS socioeconomic factors to model the bicycle commuting fraction through standard regression techniques.

"Traditional" linear modeling

In this section we perform regressions on data that is binned in both dependent and independent variables. This ensures the statistical significance of all measured points and allows us to use methods that account for measurement uncertainty. We select the top and bottom third of each independent variable, which provides sensitivity to trends with those variables.

The results of the fit are as expected from the national trends. Positive indicators of bicycle commuting are young, low income, low rent, and low education, as well as high education. Older populations, and those with higher income and higher rent are negative indicators of bicycle commuting.

We plot the distribution of residuals from the model and compare to a simple citiwide average. It clearly shows improvement in the model predictions, however a number of regions are mis-predicted by several times their measurement error.

This map shows where the model particularly fails (those regions where the prediction fails by more than twice the measurement error). In particular the model seems to fail along a line running from NW-SE in the northern half of the city. The neighborhoods along this line tend to have large numbers of young bicyclists, and have bike routes that lead directly to the downtown.

Modeling using the full dataset

The previous analysis loses some information by aggregating geographically and by binning the independent variables. In this section we create a model based on the entire dataset, including low-significance data and non-detections.

In [170]:

acs_raw_sex_by_age=load_acs2013_sex_by_age(data_expression="raw",sample=None)acs_raw_income=load_acs2013_household_income(data_expression="raw",sample=None)acs_raw_rent=load_acs2013_rent(data_expression="raw",sample=None)acs_raw_education=load_acs2013_education(data_expression="raw",sample=None)acs_raw_sample=pd.concat([acs[['bicycle','bicycle_error']],acs_raw_sex_by_age,acs_raw_income,acs_raw_rent,acs_raw_education],axis=1,join="inner")print"Description of the full sample:"printlen(acs_raw_sample.filter(like="Estimate;").columns),"features,",len(acs_raw_sample),"data points"

Description of the full sample:
107 features, 2165 data points

The full set of features contains significant covariance, as can be seen in the plot below. Areas of high-correlation in the figure can be attributed to different subgroups, including college students, young families, and professionals who stop renting and buy their home or move to the suburbs in their late 30s.

Standard regression techniques have difficult with these highly-correlated, colinear features, and we explore ways of mitigating this before fitting a model.

Using NMF to reduce dimensionality

Non-negative matrix factorization is a technique similar to principle component analysis, but with the useful property that all components are strictly positive. This is advantagous for our dataset which is comprised of populations counts for which negative values are non-sensical.

This plot shows the error in the reconstructed data matrix \(||X - WH||^2\) as a function of the number of components, where \(X\) is the original matrix, and \(W\) and \(H\) are the matrix decompositions provided by NMF. The graph is relatively flat, showing only a few components are necessary to describe the data's coarsest features. Only when the number of components nears the number of features does the error drop steeply, at which point the decomposition is as complex as the original data.

In this analysis we use 5 components from the NMF decomposition and a linear model. We account for regions with no detected bicycle commuters by adding their upper limits to the likelihood. These points have a significant impact on the model, which struggles to jointly fit both sets of data.

These plots show the distribution of residuals from the best-fit linear model. The histogram is bimodal, showing that the model tends to over-estimate those regions where no bicycle commuters were measured (within the error), while under-estimating in regions where the data is non-zero.

However, when we plot the residuals as a function of the measured bicycle commuters, we see that the trend exists in the full dataset, not only in the non-detected points. We take this as evidence that the model lacks a necessary degree of freedom to completely explain the data.

In the map below we show the normalized error in the model fit and overlay the map of chicago bike routes. Visually we see a higher density of bike routes in blue areas, where the model underpredicts the number of commuters.

Looking for additional features

Since both our previous analyses show evidence for an additional, hidden variable, and we have some hint that there is some spatial component, we broaden our feature set.

Bike routes

As we showed in the previous sections, there seems to be a connection between the availability of bike routes and the ability of the model to predict bicycle commuters. Naturally people are more likely to commute by bicycle when they have a safe means for doing so.

A complete list of Chicago bike routes, both on and off street, and shared lanes, are provided by the Chicago Data Portal in KML format. We match each bike route to our city regions by stepping along the route at 1m resolution. We use a tree-based method to ensure the calculation is tractable, and the resulting map, overlayed with the original bike routes, is shown below.

In [187]:

frommathimportsin,cos,pi,degrees,radians,asin,sqrtfromitertoolsimportizipfromshapely.geometryimportPointimportrtreeimportcsvimportosEARTH_RADIUS=6.371e6# in metersdefhaversine(angle_radians):returnsin(angle_radians/2.0)**2definverse_haversine(h):return2*asin(sqrt(h))# radiansdefdistance_between_points(lon1,lat1,lon2,lat2):# all args are in degrees# WARNING: loss of absolute precision when points are near-antipodallat1=radians(lat1)lat2=radians(lat2)dlat=lat2-lat1dlon=radians(lon2-lon1)h=haversine(dlat)+cos(lat1)*cos(lat2)*haversine(dlon)returnEARTH_RADIUS*inverse_haversine(h)defbike_route_lengths(sample,datafile="../chicagodata/region_bike_route_lengths.csv"):ifnotos.path.exists(datafile):routes=load_bike_routes()lengths={region:0forregioninpolygons}# use libspatialindex to speed lookupsindex=rtree.index.Index()fori,regioninenumerate(polygons):index.insert(i,polygons[region].bounds,obj=region)forrinroutes:foriinrange(len(r)-1):d=distance_between_points(r[i][0],r[i][1],r[i+1][0],r[i+1][1])# sample points at 1m resolution along segmentforpinizip(np.linspace(r[i][0],r[i+1][0],d),np.linspace(r[i][1],r[i+1][1],d)):pt=Point(p)forregioninindex.intersection((p[0],p[1],p[0],p[1]),objects="raw"):ifpolygons[region].contains(pt):lengths[region]+=1# save dictionary to filewithopen(datafile,"w")asf:writer=csv.writer(f)forrowinlengths.items():writer.writerow(row)# initialize from cached datalengths={}withopen(datafile,"r")asf:reader=csv.reader(f)forrowinreader:lengths[row[0]]=int(row[1])# initialize new columnsample["bike route lengths"]=np.array([lengths[str(region)]forregioninsample.index])