Findings

We analyzed recent population trends (2010 to 2016) in New York City and the greater metropolitan area using the US Census Bureau’s Population Estimates to study components of population change (births, deaths, domestic and international migration) and the IRS Statistics of Income division’s county to county migration data to study domestic migration flows.

Here are the main findings:

The population of New York City and the New York Metropolitan Area increased significantly between 2010 and 2016, but annually growth has slowed due to greater domestic out-migration.

Compared to other large US cities and metro areas, New York’s population growth depends heavily on foreign immigration and natural increase (the difference between births and deaths) to offset losses from domestic out-migration.

Between 2011 and 2015 the city had few relationships where it was a net receiver of migrants (receiving more migrants than it sends) from other large counties. The New York metro area had no net-receiver relationships with any major metropolitan area.

The city was a net sender (sending more migrants than it received) to all of its surrounding suburban counties and to a number of large urban counties across the US. The metro area was a net sender to metropolitan areas throughout the country.

For the domestic migration portion of the analysis we were interested in seeing the net flows between places. For example, the NYC metro area sends migrants to and receives migrants from the Miami metro. What is the net balance between the two – who receives more versus who sends more?

The answer is: the NYC metro is a net sender to most of the major metropolitan areas in the country, and has no significant net receiver relationships with any other major metropolitan area. For example, for the period from 2011 to 2015 the NYC metro’s largest net sender relationship was with the Miami metro. About 88,000 people left the NYC metro for metro Miami while 58,000 people moved in the opposite direction, resulting in a net gain of 30,000 people for Miami (or in other words, a net loss of 30k people for NYC). The chart below shows the top twenty metros where the NYC metro had a deficit in migration (sending more migrants to these areas than it received). A map of net out-migration from the NYC metro to other metros appears at the top of this post. In contrast, NYC’s largest net receiver relationship (where the NYC metro received more migrants than it sent) was with Ithaca, New York, which lost a mere 300 people to the NYC metro.

Process

For the IRS data we used the county to county migration SQLite database that Janine meticulously constructed over the course of the last year, which is freely available on the Baruch Geoportal. Anastasia employed her Python and Pandas wizardry to create Jupyter notebooks that we used for doing our analysis and generating our charts, all of which are available on github. I used an alternate approach with Python and the SQLite and prettytable modules to generate estimates independently of Anastasia, so we could compare the two and verify our numbers (we were aggregating migration flows across years and geographies from several tables, and calculating net flows between places).

One of our goals for this project was to use modern tools and avoid the clunky use of email. With the Jupyter notebooks, git and github for storing and syncing our work, and ShareLaTeX for writing the paper, we avoided using email for constantly exchanging revised versions of scripts and papers. Ultimately I had to use latex2rtf to convert the paper to a word processing format that the publisher could use. This post helped me figure out which bibliography packages to choose (in order for latex2rtf to interpret citations and references, you need to use the older natbib & bibtex combo and not biblatex & biber).

The staff at the Population Division at NYC City Planning take the limitations of the American Community Survey (ACS) data seriously. Census estimates for tract-level data tend to be unreliable; to counter this, they aggregate tracts into larger Neighborhood Tabulation Areas (NTAs) to produce estimates that have better precision. In their Census Factfinder tool, they display but grey-out variables where the margin of error (MOE) is unacceptably large. If users want to aggregate geographies, the Factfinder does the work of re-computing the margins of error.

Now they’ve released a new tool for census mappers. The Map Reliability Calculator is an Excel spreadsheet for measuring the reliability of classification schemes for making choropleth maps. Because each ACS estimate is published with a MOE, it’s possible that certain estimates may fall outside their designated classification range.

For example, we’re 90% confident that 60.5% plus or minus 1.5% of resident workers 16 years and older in Forest Hills, Queens took public transit to work during 2011-2015. The actual value could be as low as 59% or as high as 62%. Now let’s say we have a classification scheme that has a class with a range from 60% to 80%. Forest Hills would be placed in this class since its estimate is 60.5%, but it’s possible that it could fall into the class below it given the range of the margin of error (as the value could be as low as 59%).

The tool determines how good your classification scheme is by calculating the percent of estimates that could fall outside their assigned class, based on each MOE and the break point of the class. On the left of the sheet you paste your estimates and MOEs, and then type the number of classes you want. On the right, the reliability of classifying that data is calculated for equal intervals (equal range of values in each class) and quantiles (equal number of data points in each class). You can see the reliability of each class and the overall reliability of the scheme. The scheme is classified as reliable if: no individual class has more than 20% of its values identified as possibly falling outside the class, and less than 10% of all the scheme’s values possibly fall outside their classes.

I pasted some 5-year ACS data for NYC PUMAs below (the percentage of workers 16 years and older who take public transit to work in 2011-2015) under STEP 1. In STEP 2 I entered 5 for the number of classes. In the classification schemes on the right, equal intervals is reliable; only 6.6% of the values may fall outside their class. Quantiles was not reliable; 11.9% fell outside. If I reduce the number of classes to 4, reliability improves and both schemes fall under 10%; although unreliability for one of the classes for quantiles is high at 18%, but still below the 20% threshold. Equal intervals should usually perform better than quantiles, as the latter scheme can make rather arbitrary breaks that result in small differences in value ranges between classes (in order to insure that each class has the same number of data points).

You can also enter custom-defined schemes. For example let’s say you use natural breaks (classes determined by gaps in value ranges). There’s a 2-step process here; first you classify the data in GIS and determine what the breaks are, and then you enter them in the spreadsheet. If you’re using QGIS there’s a snag in doing this; QGIS doesn’t show you the “true” breaks of your data based on the actual values, and when you classify data it displays clean breaks that overlap. For example, natural breaks of this data with 5 classes appears like this:

24.4 – 29.0
29.0 – 45.9
45.9 – 55.8
55.8 – 65.1
65.1 – 73.3

So, does the value for 29.0 fall in the first class or the second? The answer is, the first (test it by selecting that record in the attribute table and see where it is on the map, and what color it is). So you need to adjust the values appropriately, paying attention to the precision and scale of your numbers. In this case I bump the first value of each class up by .1, except for the bottom class which you leave alone:

24.4 – 29.0
29.1 – 45.9
46.0 – 55.8
55.9 – 65.1
65.2 – 73.3

In the calculator you have to enter the top class value first, and just the first value in the range:

65.2
55.9
46.0
29.1
29.4

In this case only 7.1% of the total values may fall outside their class so things look good – but my bottom class barely makes the minimum class threshold at 19.4%. I can try dropping the classes down to 4 or I can manually adjust this class to see if I can improve reliability.

If you’re unsure if you made the right adjustments to the classes in translating them from QGIS to the calculator, in QGIS turn on the Show Feature Count option for the layer to see how many data points are in each class, and compare that to the class counts in the calculator. If they don’t match, you need to re-adjust.

This is a great tool for census mappers who want or need to account for issues with ACS reliability. It’s an Excel spreadsheet but I used it in LibreOffice Calc with no problem. In addition to the calculator sheet there’s a second sheet with instructions and background info. Download the Map Reliability Calculator here. You can try it out with this test data, workers who commute with mass transit, 2011-2015 ACS for NYC PUMAs.

As I’m updating my presentations and handouts for the new academic year, I’m taking two new census resources for a test drive. I’ll talk about the first resource in this post.

The NYC Department of City Planning has been collating census data and publishing it for the City for quite some time. They’ve created neighborhood tabulation areas (NTAs) by aggregating census tracts, so that they could publish more reliable ACS data for small areas (since the margins of error for census tracts can be quite large) and so that New Yorkers have data for neighborhood-like areas that they would recognize. The City also publishes PUMA-level data that’s associated with the City’s Community Districts, as well as borough and city-level data. All of this information is available in a large series of Excel spreadsheets or PDFs in the form of comparison tables for each dataset.

The Department of Planning also created the NYC Census Factfinder, a web-mapping interface that let’s users explore census tract and NTA level data profiles. You could plug in an address or click on the map and get a 2010 Census profile, or a demographic change profile that showed shifts between the 2000 and 2010 Census.

It was a nice application, but they’ve just made a series of updates that make it infinitely better:

They’ve added the American Community Survey data from 2009-2013, and you can view the four demographic profile tables (demographics, social, economic, and housing) for tracts and NTAs.

Unlike many other sources, they do publish the margin of error for all of the ACS data, which is immensely important. Estimates that have a high margin of error (as defined by a coefficient of variation) appear in grey instead of solid black. While the actual margins are not shown by default, you can simply click the Show radio button to turn on the Reliability data.

Tracts or neighborhoods can be compared to the City as a whole or to an individual borough by selecting the drop down for the column header.

This is especially cool – if you’re viewing census tracts you can use the select pointer and hold down the Control key (Command key on a Mac) to select multiple tracts, and then the data tables will aggregate the tract-level data for you (so essentially you can build your own neighborhoods). What’s noteworthy here is that it also calculates the new margins of error for all of the derived estimates, AND it even calculates new medians and averages with margins of error! This is something that I’ve never seen in any other application.

In addition to searching for locations by address, you can hit the search type drop down and you have a number of additional options like Intersection, Place of Interest, and even Subway Stations.

There are a few quirks:

I had trouble viewing the map in Firefox – this isn’t a consistent problem but something I noticed today when I went exploring. Hopefully something temporary that will be corrected. Had no problems in IE.

If you want to click to select an area on the map, you have to hit the select button first (the arrow beside the zoom slider and print button) and then click on your area to select it. Just clicking on the map without hitting select first won’t do much – it will just highlight the area and tell you it’s name. Clicking the arrow button turns it blue and allows you to select features, clicking it again turns it white and lets you identify features and pan around the map.

The one bummer is that there isn’t a way to download any of the profiles – particularly the ones you custom design by selecting tracts. Hitting the Get Data button takes you out of the Factfinder and back to the page with all of the pre-compiled comparison tables. You can print the table out to a PDF for presentation purposes, but if you want a data-friendly format you’ll have to highlight and select the table on the page, copy, and paste into a spreadsheet.

These are just small quibbles that I’m sure will eventually be addressed. As is stands, with the addition of the ACS and the new features they’ve added, I’ll definitely be integrating the NYC Census Factfinder into my presentations and will be revising my NYC Neighborhood Census data handout to add it as a source. It’s unique among resources in that it provides NTA-level data in addition to tract data, has 2000 and 2010 historical change and the latest 5-year ACS (with margins of error) in one application, and allows you to build your own neighborhoods to aggregate tract data WITH new margins of error for all derived estimates. It’s well-suited for users who want basic Census demographic profiles for neighborhood-like areas in NYC.