
As I’m four chapters into writing my book on the census, I’m paying close attention to what’s going to happen in 2020. Now that the Census Bureau has outlined the subjects and the specific wording of the questions for the 2020 census and future ACS, here’s my summary of what’s changing, and what’s not. NPR has been doing an excellent job covering every aspect of this, and I link to their articles throughout this post.

What Changes

Residency for the military. Up until now, members of the military who were temporarily deployed overseas were counted at their US address at their time of enlistment. Many communities that are home to military bases have argued that this severely impacts their counts, and that the census should count personnel based on where their home military base is (remember, in addition to apportioning Congress, $600 billion in federal aid is distributed annually to state and local governments based on census numbers). The Census Bureau agreed, and this change will be implemented in 2020.

Same sex marriage. The question on relationships will change, so that people can explicitly state whether they have a married or unmarried partner, and whether their spouse or partner is opposite sex or same sex. Up until now, the Census Bureau recoded every household that reported a same-sex marriage to unmarried partner status, and did a special tabulation using the relationship and gender questions to count same sex partnerships. Now that same sex marriage is recognized under federal law (as of 2015), the ambiguity of counting people by relationship can be resolved.

Ethnicities for white and black. For the first time, white and black people will be able to write in a specific nationality or ethnicity under their race, just as people of other races (Asian, Native American, Hawaiian or Pacific Islander) and of Hispanic and Latino ethnicity have long been able to do.

Citizenship. This is a pretty controversial one that may change by the time we get to 2020, but due to Justice Department lobbying and pressure from the Executive branch, there will be a question that asks whether a person is a US citizen or not. Ostensibly, this data is necessary to combat voter fraud, although there is absolutely no evidence that widespread voter fraud exists. I’ll address this in more detail at the end of this post.

What Stays the Same

Residency for people in prison. Parallel to the military discussion, many have argued that people who are incarcerated should be counted at their home address, and not where they are incarcerated. This practice has an adverse impact on communities that suffer from crime and need political representation and funding, while artificially inflating the populations of rural communities where large prisons are typically located. The Census Bureau disagreed, maintaining that prison is the usual place of residence for people who are incarcerated, and therefore they should be counted there.

Measuring gender. The census asks people to identify their sex as male or female, and specifically uses the term sex instead of gender to emphasize that they are asking about simple biology and chromosomes, and not about sexual identification, expression, or preference. Many (including several federal agencies and members of Congress) have lobbied for a question on gender identification and sexual preferences, arguing that it is necessary under civil rights legislation. No such questions will be added.

Hispanic and Latino and race. The Census Bureau conducted almost a decade of studies and tests on revising the race and ethnicity questions, and the biggest suggestion was to make Hispanic and Latino a race and not a separate ethnicity. Their studies showed that society, and people of Hispanic ethnicity, largely view Hispanic and Latino as a race. A large percentage of Hispanics choose Other as their race, since Hispanic is not an option under the race question. The Office of Management and Budget chose to simply ignore these suggestions, and since approval from OMB is necessary (as data on race is collected across the federal government) the questions will remain the same. The second major suggestion, also ignored, was to create a new racial category for Middle Eastern and North African (MENA) people who are currently counted as white. With the exception of the optional ethnic write-ins for whites and blacks, the racial and ethnic categories for 2020 will be the same ones used in 2000 and 2010.

The Citizenship Question Controversy

Is asking about citizenship really new? The oft-quoted “fact” in the media is that this question hasn’t been asked since the 1950 census, but that’s only partially true. This could be the first time since 1950 that the question is asked as part of the short census form that 100% of the population is asked to complete. The short form (from 1960 to 2000) and the only form (2010 to present) is designed to record just the basic demographic characteristics of the population for the purpose of reapportioning seats in Congress, redrawing legislative boundaries, and providing fundamental baseline numbers on which other statistical products are based. Citizenship had previously been asked on the sample long form (from 1960 to 2000, sent to 1 in 6 households) and is currently asked on the American Community Survey (2005 to present, sent to 3.5 million addresses annually). The long form and ACS were designed for a different purpose: to measure the broad socio-economic characteristics of the country. Both the ten-year census and the ACS ask questions that are required under federal law to meet different legislative obligations.

Why is this controversial? Not only is the citizenship question unnecessary for fulfilling the basic requirements of the decennial census, it’s actually detrimental. The Census Bureau is charged with counting every single person in the United States, regardless of their voting eligibility, citizenship status, or legality. As long as a person isn’t a visitor or tourist, they are counted as a resident. None of these characteristics (with the sole exception of counting slaves as three-fifths of a person prior to the Civil War) has ever been a factor in whether a person is counted in the census or not. Congressional seats have always been apportioned based on total population, and legislative districts have long been delineated using total counts. This was the intent of the Founding Fathers and has been upheld by the Supreme Court numerous times (as recently as 2016).

Given the fear that many immigrants have of government officials, especially from this current administration which has shown unbridled hostility, it’s likely that many non-citizens (legal residents and undocumented alike) will not fill out the census form, thus resulting in an undercount and a possible loss of political power and federal aid for states that have high immigrant populations (which tend to be blue states, but not exclusively so). In order for the Census Bureau to ensure confidence in the counting process, and to get the most accurate count (which is their primary mission), they have to assure people that their personal, individual data will not be published or shared, and they must make it clear that they have no relationship with regulatory branches of the government that would use this data against them.

But even though individual responses are kept confidential for 72 years (the Census only publishes data summarized by population groups and geography), data from the ten year census is available at the census block level, which is the smallest and finest level of geographic detail. This data could potentially be used (by the Border Patrol or ICE) to identify and target small clusters of areas that have a high percentage of non-citizens. If you think this sounds far-fetched, see this article from the Washington Post: it’s essentially what happened to Japanese American citizens who were rounded up for internment during WWII.

Obviously, it’s not in the Census Bureau’s best interest to add this question. The announcement that it was being added was made by the head of the Commerce Department, of which the Census Bureau is part, and not by the Bureau itself. The Census Bureau meticulously studies and tests every question that gets added to the form years in advance, and it seems clear that this was something that was tacked on at the last minute.

Congress could move to strip this question from the form; it’s pretty unlikely this will happen now, but it could if the midterm elections flip the legislature to the Democrats. There are also a number of lawsuits from different states to try and stop the question from being added. Whether the question stays or goes, I think the damage is done and the count will be negatively impacted, given the sad state of our government. If the question remains it could end up being a useless data point, as people may still fill out the form and skip that question. While you’re required by law to complete the census, it seems unlikely that the Census Bureau will be able to chase after millions of people who refuse to answer a single question.

Findings

We analyzed recent population trends (2010 to 2016) in New York City and the greater metropolitan area using the US Census Bureau’s Population Estimates to study components of population change (births, deaths, domestic and international migration) and the IRS Statistics of Income division’s county to county migration data to study domestic migration flows.

Here are the main findings:

The population of New York City and the New York Metropolitan Area increased significantly between 2010 and 2016, but annual growth has slowed due to greater domestic out-migration.

Compared to other large US cities and metro areas, New York’s population growth depends heavily on foreign immigration and natural increase (the difference between births and deaths) to offset losses from domestic out-migration.

Between 2011 and 2015 the city had few relationships where it was a net receiver of migrants (receiving more migrants than it sent) from other large counties. The New York metro area had no net-receiver relationships with any major metropolitan area.

The city was a net sender (sending more migrants than it received) to all of its surrounding suburban counties and to a number of large urban counties across the US. The metro area was a net sender to metropolitan areas throughout the country.

For the domestic migration portion of the analysis we were interested in seeing the net flows between places. For example, the NYC metro area sends migrants to and receives migrants from the Miami metro. What is the net balance between the two – who receives more versus who sends more?

The answer is: the NYC metro is a net sender to most of the major metropolitan areas in the country, and has no significant net receiver relationships with any other major metropolitan area. For example, for the period from 2011 to 2015 the NYC metro’s largest net sender relationship was with the Miami metro. About 88,000 people left the NYC metro for metro Miami while 58,000 people moved in the opposite direction, resulting in a net gain of 30,000 people for Miami (or in other words, a net loss of 30k people for NYC). The chart below shows the top twenty metros where the NYC metro had a deficit in migration (sending more migrants to these areas than it received). A map of net out-migration from the NYC metro to other metros appears at the top of this post. In contrast, NYC’s largest net receiver relationship (where the NYC metro received more migrants than it sent) was with Ithaca, New York, which lost a mere 300 people to the NYC metro.

Process

For the IRS data we used the county to county migration SQLite database that Janine meticulously constructed over the course of the last year, which is freely available on the Baruch Geoportal. Anastasia employed her Python and Pandas wizardry to create Jupyter notebooks that we used for doing our analysis and generating our charts, all of which are available on github. I used an alternate approach with Python and the SQLite and prettytable modules to generate estimates independently of Anastasia, so we could compare the two and verify our numbers (we were aggregating migration flows across years and geographies from several tables, and calculating net flows between places).
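
As a rough illustration of the net flow calculation (the actual notebooks are on github), here’s a minimal sketch using Python’s built-in sqlite3 module; the database file name and the flows table with its origin, destination, year, and migrants columns are hypothetical stand-ins for the real schema in the Geoportal database:

# Minimal sketch of aggregating flows across years and computing net flows
# between county pairs. The file name and the "flows" table schema
# (origin, destination, year, migrants) are hypothetical placeholders.
import sqlite3

con = sqlite3.connect('county_migration.sqlite')  # hypothetical file name

query = """
SELECT f1.origin, f1.destination,
       SUM(f1.migrants) AS outflow,
       SUM(f2.migrants) AS inflow,
       SUM(f2.migrants) - SUM(f1.migrants) AS net
FROM flows AS f1
JOIN flows AS f2
  ON f1.origin = f2.destination
 AND f1.destination = f2.origin
 AND f1.year = f2.year
WHERE f1.year BETWEEN 2011 AND 2015
GROUP BY f1.origin, f1.destination
ORDER BY net;
"""
for row in con.execute(query):
    print(row)
con.close()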

One of our goals for this project was to use modern tools and avoid the clunky use of email. With the Jupyter notebooks, git and github for storing and syncing our work, and ShareLaTeX for writing the paper, we avoided using email for constantly exchanging revised versions of scripts and papers. Ultimately I had to use latex2rtf to convert the paper to a word processing format that the publisher could use. This post helped me figure out which bibliography packages to choose (in order for latex2rtf to interpret citations and references, you need to use the older natbib & bibtex combo and not biblatex & biber).

I’ve been cranking away on my census book these past few months. I’m almost finished revising the preface and chapter 1 to submit as part of the book proposal review process, and a rough draft of chapter 2 is also complete.

I wrote a short piece on the history of the US census while I was doing background research last fall. It was recently published in Metropolitics, which is an on-line academic journal that specializes in short pieces on cities, urban politics, and urban economics. The journal was originally created in Paris as Metropolitiques, and a separate English-language version with a New York-based editorial board was created more recently. Each version is published independently and a select number of articles are translated from one version and published in the other, giving it a unique international flair in terms of content and contributors. A new article in the English version is published every Tuesday.

I wrote the piece as a lead-in to the 2020 census, which we’re starting to hear more about in the news. After the Bureau spent a decade researching new and improved categories for the race question, the OMB and White House decided not to accept its proposal, and thus we’re keeping the same categories from 2010 and 2000. Meanwhile, the Justice Department is lobbying to add a citizenship question, which already appears in the American Community Survey and has not appeared in the short form of the decennial census since 1950. Read the article and follow the links to the references to see what the repercussions could be. The questions and categories must be finalized by March 31st…

US Census data isn’t “big data” in the technical sense, as it’s not being captured and updated in real time and it isn’t fine-grained enough to pinpoint specific coordinates. But it’s big in the conventional sense: it consists of many different datasets that record a variety of aspects about the entire population at many scales, and it’s relational and flexible in nature (tables can be joined, new data can be added or modified). And, there’s a LOT of data!

So which census dataset do you choose for a particular application? In this post I provide a summary overview of what I consider to be the big five: what they are, how they’re constructed, and what’s available. I’ll be describing summary data here, which is data that’s aggregated and published by geographic area and population groups. I won’t be addressing sample microdata (individual responses to census questions) which are available for the decennial census, American Community Survey, and Current Population Survey.

ALL of these datasets are available through the American Factfinder, and via the Census Bureau’s APIs. The smaller datasets can also be downloaded directly from the individual program pages (I note this when it’s available). Outside of the Census Bureau, the Census Reporter is a nice tool for exploring data, while the Missouri Census Data Center and the NHGIS are good alternatives for generating summaries and downloading data in bulk.
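
As a quick illustration of the API route, here’s a minimal sketch that requests the total population of every county in New York State from the ACS 5-year endpoint; the endpoint path and the variable code B01003_001E are written from memory, so verify them against the Census Bureau’s API documentation (an API key is recommended for regular use):

# Sketch of a Census API request for ACS 5-year data. The endpoint path and
# variable code (B01003_001E, total population) should be verified against
# the Census Bureau's API documentation before relying on them.
import json
import urllib.request

url = ('https://api.census.gov/data/2016/acs/acs5'
       '?get=NAME,B01003_001E&for=county:*&in=state:36')
with urllib.request.urlopen(url) as response:
    data = json.load(response)  # the first row returned is the header

for row in data[1:5]:
    print(row)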

The Decennial Census (DEC)

When people think of “the census” the decennial census (DEC) is the dataset that typically comes to mind. It’s the 100% count of the population that’s conducted every ten years on April 1st in years ending with a zero. Required by the Constitution and taken since 1790, its primary purpose is to provide detailed population counts that are used to re-apportion seats in the US House of Representatives. It’s also used to study population distribution and change at the smallest geographic levels, and serves as baseline data for many of the other Census Bureau statistical programs.

The modern census (from the year 2010 forward) collects just basic demographic variables about the population and housing units: sex, age, race, household relationships, group quarters, occupied and vacant units, owner and renter units. This data is published in a series of collections; the primary one is summary file 1 (SF1), but there’s also a summary file 2 (SF2) that contains more detailed cross-tabs. The Redistricting Data file (PL 94-171) is always the first to be released, and contains just the basics.

From the year 2000 back, the DEC had additional summary files that included detailed socio-economic characteristics of the population that were captured on a longer sample form sent to one in six households. The on-going American Community Survey has since replaced it, so if you are looking for anything beyond the basics you need to look at the ACS. For older DEC data, you can find the 2000 census in the American Factfinder but if you want to go back further in time use the NHGIS.

For a sample of what’s included in the DEC, look at the demographic profile table (DP-1), which contains a good cross-section of variables.

Use the DEC when:

You need 100% counts of the population

You need to use the smallest geographies available (census blocks, block groups, tracts)

You don’t need anything more than basic demographic variables

You’re studying very small population groups in a given area

You’re making historical comparisons with earlier DEC data

The American Community Survey (ACS)

The American Community Survey (ACS) was launched in 2005 to provide more timely data about the US population on an on-going basis. In addition to the basic demographic variables captured in the DEC, the ACS also captures all the detailed socio-economic statistics that the older census used to capture, such as: employment, marital status, educational enrollment and attainment, veteran status, income, poverty, place of origin, housing value and rent, housing characteristics, and much more.

The ACS is a rolling sample survey that’s conducted each month, and is sent to 3.5 million households annually. The data is published as 1-year averages for any geographical area (state, county, place, etc) that has more than 65k people. 5-year averages are published for all geographies down to the census tract level (some block group level data is available, but it’s highly unreliable). The 5-year average is updated each year by adding a new year of data and dropping the oldest year.

ACS estimates are published at a 90% confidence interval with a margin of error that indicates the possible range of the estimate. For example, if the population for an area is 20,000 people plus or minus 1,000, that means we’re 90% confident that the population is between 19,000 and 21,000 people, and there’s a 10% chance the true population falls outside this range. The timeliness, geographic depth, and variety of variables make the ACS an essential dataset. However, it’s more complicated to work with compared to the simple counts in the DEC, and as a researcher you must pay close attention to the margins of error; estimates for small areas and small population groups can be highly unreliable. To manage this, you can aggregate the data into larger geographies or into fewer population groups.
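
The arithmetic behind those ranges and aggregations is straightforward; the sketch below follows the Census Bureau’s published guidance of dividing an MOE by 1.645 to get a standard error at the 90% level, and approximating the MOE of a sum as the square root of the sum of the squared MOEs (the tract numbers are made up):

# Sketch of working with ACS margins of error at the 90% confidence level.
# Made-up estimates and MOEs for three tracts being combined into one area.
import math

tracts = [(1200, 150), (890, 110), (1500, 180)]  # (estimate, MOE) pairs

total = sum(est for est, moe in tracts)
# Approximate MOE of the sum: square root of the sum of the squared MOEs
moe_total = math.sqrt(sum(moe ** 2 for est, moe in tracts))
# Standard error at the 90% level
se_total = moe_total / 1.645

print('Combined estimate:', total, '+/-', round(moe_total))
print('90% interval:', total - round(moe_total), 'to', total + round(moe_total))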

The 1-year averages are available for all states and metropolitan areas, and for statistical areas called PUMAs that are designed to have 100k people. 1-year averages are available for large counties or places (cities and towns), but since many of these areas have fewer than 65k people, coverage will not be complete. Use the 5-year averages if you need complete coverage of all counties or places in an area, or if you need small areas like census tracts and ZCTAs. When making historical comparisons, it’s only appropriate to compare five year periods that do not overlap. For example, comparing 2007-2011 to 2012-2016 would be appropriate.

For a thorough sample of what’s included in the ACS, look at the demographic profile tables for social (DP02), economic (DP03), housing (DP04), and demographic (DP05) variables.

Use the ACS when:

You need detailed socio-economic indicators about the population

You need the most recent data for these indicators

You’re not working with data below the census tract level

You can live with the margins of error associated with the estimates

Use the 1-year averages when you are looking at just large places with more than 65k people and large population groups

Use 5-year averages to study all areas of a given type, small areas and population groups, and to reduce the size of the margin of error for larger areas and groups

Population Estimates Program (PEP)

The Population Estimates Program (PEP) is used to create basic, annual estimates of the US population for large areas. Using the latest DEC as a starting point, the Bureau takes data on births, deaths, and domestic and international migration to estimate what the population is the following year, and then creates a new estimate each year on July 1st. The estimates are created at the county level, and are rolled up to states and metropolitan areas and disaggregated down to census places (cities and towns). Once a new DEC is taken, the Bureau will go back to the previous decade and issue a set of revised estimates to approximate what actually happened.
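
In essence this is the demographic balancing equation: start with the base population, add births, subtract deaths, and add net migration. A toy sketch with made-up numbers:

# Toy sketch of the demographic balancing equation behind the estimates;
# all of the numbers below are made up.
base_population = 2500000   # count from the last decennial census
births = 34000
deaths = 21000
net_domestic_migration = -18000
net_international_migration = 12000

estimate = (base_population + births - deaths
            + net_domestic_migration + net_international_migration)
print('July 1 estimate:', estimate)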

Besides the total population, estimates are created for age, gender, race, and housing units. The PEP is also a source for the components of population change for each place (births, deaths, migration), which the Census Bureau compiles from other sources. Since this is a much smaller dataset compared to the DEC or ACS, PEP data can be downloaded in pre-compiled spreadsheets directly from the PEP website, in addition to the American Factfinder and APIs.

Use the PEP when:

You just need basic demographic variables for large geographic areas

You’re interested in annual population change

You want a simple dataset to work with

You’re interested in the components of population change

Current Population Survey (CPS)

The Current Population Survey (CPS) is a monthly survey of 60,000 households that’s sponsored by the Census Bureau and the Bureau of Labor Statistics (BLS). It’s designed to provide national estimates for a variety of demographic and labor force indicators on a regular basis. The same household is: interviewed for 4 consecutive months, not interviewed again for 8 months, interviewed again for 4 consecutive months, and then is removed from the survey.

Some questions like employment and unemployment are asked repeatedly each month, other questions are asked only during certain months at regular intervals (for example, every two years in November a question is asked about voter registration and participation), and other questions are special topics that are asked on a one-time or limited basis.

Some of the most important indicators are tabulated and published as annual estimates directly on the CPS website, while many of the labor force statistics are published monthly or annually on the BLS website. Given the small sample size of the CPS relative to the ACS, it’s more common that researchers will manipulate the raw CPS data (the individual responses) to create their own estimates and cross-tabulations. The CPS website has some tools for doing this, and IPUMS USA is a popular tool as well.

Given the size of the sample, CPS estimates tabulated by the Census or BLS are published at a national or regional level, and in limited cases at the state level. Aggregated, summarized data is published with margins of error at a 90% confidence interval.

Use the CPS when:

You need monthly or annual, detailed demographic or labor force data for the entire country or for large regions

You’re looking for special topic data that’s not published in any other dataset

You’re comfortable working with microdata if the data you’re looking for has not been aggregated or summarized

Business Patterns and Economic Census

The previous four sources provide data on population and housing units. If you’re looking for data on businesses, here are the two most common options.

The County and ZIP Code Business Patterns provides annual counts of the number of businesses for states, metropolitan areas, counties, and ZIP Codes that includes the number of employees, establishments, and wages. It also provides counts of establishments classified by the North American Industrial Classification System (NAICS). You can look at broad (i.e. manufacturing, retail, finance & insurance) or narrow (auto parts manufacturers, department stores, commercial banks) NAICS summaries. If a particular area has fewer than 3 businesses of a specific type the data isn’t disclosed for confidentiality reasons. The data is generated from the Business Register, which is a government master file of businesses that’s updated on an on-going basis from several sources.

The Economic Census is conducted every five years in years that end in two or seven. It is an actual count of businesses that captures all of the fields that are published in the Business Patterns, but it also: captures sales as well as wages, provides place-level data (cities and towns), and is published in a variety of topical as well as geographic summaries. Because there is quite a time-lag between the collection and publication of this data (several years), the Economic Census is better suited for studying the economy in retrospect. The USDA publishes a Census of Agriculture which covers farming in more detail.

For business statistics:

Use the Business Patterns for basic counts of the latest data

Use the Economic Census for more detailed information that’s a bit older

The staff at the Population Division at NYC City Planning take the limitations of the American Community Survey (ACS) data seriously. ACS estimates at the census tract level tend to be unreliable; to counter this, they aggregate tracts into larger Neighborhood Tabulation Areas (NTAs) to produce estimates with better precision. In their Census Factfinder tool, they display but grey out variables where the margin of error (MOE) is unacceptably large. If users want to aggregate geographies, the Factfinder does the work of re-computing the margins of error.

Now they’ve released a new tool for census mappers. The Map Reliability Calculator is an Excel spreadsheet for measuring the reliability of classification schemes for making choropleth maps. Because each ACS estimate is published with a MOE, it’s possible that certain estimates may fall outside their designated classification range.

For example, we’re 90% confident that 60.5% plus or minus 1.5% of resident workers 16 years and older in Forest Hills, Queens took public transit to work during 2011-2015. The actual value could be as low as 59% or as high as 62%. Now let’s say we have a classification scheme that has a class with a range from 60% to 80%. Forest Hills would be placed in this class since its estimate is 60.5%, but it’s possible that it could fall into the class below it given the range of the margin of error (as the value could be as low as 59%).

The tool determines how good your classification scheme is by calculating the percent of estimates that could fall outside their assigned class, based on each MOE and the break point of the class. On the left of the sheet you paste your estimates and MOEs, and then type the number of classes you want. On the right, the reliability of classifying that data is calculated for equal intervals (equal range of values in each class) and quantiles (equal number of data points in each class). You can see the reliability of each class and the overall reliability of the scheme. The scheme is classified as reliable if: no individual class has more than 20% of its values identified as possibly falling outside the class, and less than 10% of all the scheme’s values possibly fall outside their classes.
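
To make the check concrete, here’s a rough sketch of the same idea in Python (the calculator itself is a spreadsheet; this is just my approximation of its core test, with made-up data): an estimate is flagged if its value plus or minus its MOE crosses a class break point.

# Rough approximation of the reliability check: flag an estimate if its
# estimate +/- MOE range crosses a break point of the classification scheme.
# The breaks and data below are illustrative, not the calculator's test data.
def share_flagged(data, breaks):
    """data: list of (estimate, moe); breaks: ascending interior break points."""
    flagged = 0
    for est, moe in data:
        for brk in breaks:
            # The range straddles a break, so the value could land in
            # either the class above or below it.
            if est - moe < brk < est + moe:
                flagged += 1
                break
    return flagged / len(data)

data = [(60.5, 1.5), (42.0, 2.1), (71.3, 0.9), (58.7, 1.2)]
breaks = [40.0, 50.0, 60.0, 70.0]  # interior break points of the scheme
print(round(share_flagged(data, breaks) * 100, 1), '% of estimates flagged')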

I pasted some 5-year ACS data for NYC PUMAs below (the percentage of workers 16 years and older who take public transit to work in 2011-2015) under STEP 1. In STEP 2 I entered 5 for the number of classes. In the classification schemes on the right, equal intervals is reliable; only 6.6% of the values may fall outside their class. Quantiles was not reliable; 11.9% fell outside. If I reduce the number of classes to 4, reliability improves and both schemes fall under 10%; unreliability for one of the quantile classes is high at 18%, but that’s still below the 20% threshold. Equal intervals should usually perform better than quantiles, as the latter scheme can make rather arbitrary breaks that result in small differences in value ranges between classes (in order to ensure that each class has the same number of data points).

You can also enter custom-defined schemes. For example let’s say you use natural breaks (classes determined by gaps in value ranges). There’s a 2-step process here; first you classify the data in GIS and determine what the breaks are, and then you enter them in the spreadsheet. If you’re using QGIS there’s a snag in doing this; QGIS doesn’t show you the “true” breaks of your data based on the actual values, and when you classify data it displays clean breaks that overlap. For example, natural breaks of this data with 5 classes appears like this:

24.4 – 29.0
29.0 – 45.9
45.9 – 55.8
55.8 – 65.1
65.1 – 73.3

So, does the value for 29.0 fall in the first class or the second? The answer is: the first (test it by selecting that record in the attribute table and seeing where it falls on the map, and what color it is). So you need to adjust the values appropriately, paying attention to the precision and scale of your numbers. In this case I bump the first value of each class up by 0.1, except for the bottom class, which is left alone:

24.4 – 29.0
29.1 – 45.9
46.0 – 55.8
55.9 – 65.1
65.2 – 73.3

In the calculator you have to enter the top class value first, and just the first value in the range:

65.2
55.9
46.0
29.1
24.4

In this case only 7.1% of the total values may fall outside their class, so things look good – but my bottom class just squeaks under the 20% per-class threshold at 19.4%. I can try dropping the classes down to 4, or I can manually adjust this class to see if I can improve reliability.
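
If you do this conversion often, it’s easy to script. Here’s a small sketch that assumes your values have one decimal place (so each interior lower bound is bumped up by 0.1) and returns the list top class first, the way the calculator expects it:

# Sketch of converting QGIS natural-breaks upper bounds into the
# non-overlapping lower bounds the calculator expects, assuming values
# with one decimal place (so adjacent classes are separated by 0.1).
def calculator_breaks(minimum, upper_bounds):
    """minimum: lowest data value; upper_bounds: ascending class maximums."""
    lower_bounds = [minimum] + [round(u + 0.1, 1) for u in upper_bounds[:-1]]
    return list(reversed(lower_bounds))  # calculator wants the top class first

print(calculator_breaks(24.4, [29.0, 45.9, 55.8, 65.1, 73.3]))
# [65.2, 55.9, 46.0, 29.1, 24.4]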

If you’re unsure if you made the right adjustments to the classes in translating them from QGIS to the calculator, in QGIS turn on the Show Feature Count option for the layer to see how many data points are in each class, and compare that to the class counts in the calculator. If they don’t match, you need to re-adjust.

This is a great tool for census mappers who want or need to account for issues with ACS reliability. It’s an Excel spreadsheet but I used it in LibreOffice Calc with no problem. In addition to the calculator sheet there’s a second sheet with instructions and background info. Download the Map Reliability Calculator here. You can try it out with this test data, workers who commute with mass transit, 2011-2015 ACS for NYC PUMAs.

I’ve been reviewing a lot of literature over the past year in preparation for writing my book, so note taking is at the forefront of my mind. Grad students occasionally ask me for suggestions on how to effectively take notes, so I’ll share some pointers here. I’ll begin with my quest to find the right note taking software, followed by my actual process for taking notes.

Finding the Right Tool

Ten years ago, I suddenly found myself back in a position where I needed to write academic papers, something I hadn’t done since I wrote my master’s thesis about eight years before that. At that time, I was still using the techniques I had learned in high school (a much longer time ago…). Back then, you were either an index card person or a binder person. The card people would write one note on each card, while binder people kept a ledger of notes and would add additional pages as needed. You’d classify your notes as summaries, paraphrases, or quotations.

I assumed that my high school methods must be outdated by now, so I cast around to see what note taking software was available. I knew I wanted to go open source, as I didn’t want my notes tethered to a specific tool and stored in a proprietary format. There were a lot of options, and I quickly became bogged down and frustrated with trying them all. I felt that much of the software forced me to conform to it, and I was spending too much time fidgeting and figuring things out.

I abandoned the search and recorded my notes in a simple text (aka notepad) document. I had always been a binder person, so the single document approach appealed to me. I could copy and paste, use spell check, and search for keyword terms that I assigned (the Linux editors like gedit, leafpad, and xed are lightweight but more robust than MS Notepad). This worked fine for a stand-alone paper and I still use this approach for small projects. But as my research became on-going I needed to rely on these notes for many future projects. The single notepad document grew unwieldy and browsing and searching became difficult.

A few years later, I made a second attempt at searching for note taking software, and this time I broadened the search to include more general-purpose options. My solution: use a wiki! With a wiki, every single source can have its own page, the sources can be grouped together under thematic categories, you can assign tags, and you can search across all the pages. I could also add links between pages and out to the web, and could link the notes to the source documents. The wiki was so open ended that I didn’t feel constrained in writing my notes to fit a particular interface, nor did I have to waste a lot of time sifting through buttons and tools.

I opted for a desktop wiki called Zim, which has been actively maintained since 2008. All of the pages in Zim are saved as individual text files in a basic wiki mark-up, which ensures that they can be accessed outside the program. Pages are stored in a notebook, which is essentially just a folder. If you create hierarchies of pages, these categories become folders and sub-folders. Zim has a ton of extra plugins for spell checking, concept mapping, formulas, calendars, and more. You can also export your entire notebook or portions of it as HTML or LaTeX files.

Most importantly, the wiki solved one of my most vexing problems. I found that a lot of the note taking software was geared towards just taking notes, and couldn’t handle keeping track of citations. Citation software is its own genre, and I found that those packages were poor for taking notes. With Zim, I create a page dedicated to each source, and at the top of each page I embed some BibTeX code for storing the citation data. BibTeX is a format that’s used for creating LaTeX bibliographies, but it has become a common standard and can be used by word processors too. I have a template page (see below) with several BibTeX document types that I just copy and paste when I have a new source to add. Since the pages are saved as plain text, I wrote a short Python script (appears at the end of this post) that loops through my note pages, scrapes out the BibTeX records, and creates a BibTeX file that I can use in LaTeX. Within the BibTeX record I store a link to the source: either to a PDF I have locally, or a web page (if it’s a site), or a WorldCat catalog record (if it’s a book). So all my notes, citations, and the source material are kept together in one place!

Zim is desktop software that you have to download and install locally. Since the notebook consists of text files in folders, it’s easy to back it up into Box or DropBox or whatever you use. Zim doesn’t save histories or have version control, but there’s a plugin that lets you sync your files with Git and other systems.

Relying on Tried and True Methods

While the right tool is important, it’s really the method that counts. I learned that I had to jettison the idea that the note taking process has to be 100% efficient. While you certainly don’t want to flail around and waste time, note taking is not supposed to be quick and easy. The only way you can truly learn new material is to spend time with it: reading, re-reading, taking notes, and reading the notes. The process of note taking is equally if not more important than the actual notes themselves, as the process is what helps you to synthesize and learn the material. While I left the binder and note cards behind, the actual note taking process was similar to what I did in high school.

I always download articles and bookmark websites or catalog records as I’m doing my searches. Once I complete a series of searches and have gathered material from the web and library databases, I sift through the files and rename them using the first author’s last name and the year of publication (i.e. Jones2017). I’ll also use this file name as the BibTeX key that uniquely identifies the article. I create a documents folder with sub-folders for articles, books, and reports, and I keep these folders in the same location as the ZIM notebook. There’s no reason to create lots of topical or thematic folders, as you can use the wiki to categorize and tag the notes, and the wiki becomes the vehicle for searching or browsing through the documents.

As I sort through the sources I identify what’s essential and what’s ancillary. High priority sources will be read thoroughly and covered in detail, while the low-priority stuff will be skimmed and summarized. High priority sources are critical to your research and include touchstone articles in your field, excellent case studies, relevant background material, and any past research that remotely resembles what you’re working on. Low priority sources may have one important fact or concept that you need to remember; these materials are more tangential to your work and ultimately you might cite them in passing, or even not at all.

I always print out the high-priority articles. I’ll read an article first, and then go back and do a second read, marking passages with a highlighter. Next, I’ll create and type notes directly into the wiki. I might read and mark up a couple articles before I start note taking, but I don’t wait too long as I want the articles fresh in my memory. For essential books, I’ll read a chapter or two at a time and mark passages with little sticky flags. Then I’ll go back and take notes in a paper notebook, and will keep doing that until I finish the book. Then I transcribe all the notes for the book onto my laptop. This takes longer, but once again it’s not all about efficiency. I get to spend more time with the material and it helps me absorb it. This approach also separates the computer from the reading, which cuts down distractions and provides more flexibility in terms of where I can work. Reading in a comfortable chair or outside is preferable to reading while sitting at a table with a laptop.

I never print out or mark up low-priority articles; I skim through the digital copies and write a summary directly in the wiki. For books, I read the book in one go and may use sticky flags here and there, and when I’m done I type the notes directly into the wiki.

For the notes themselves, each note page has the title and author prominently at the top followed by a summary of the source, and then the BibTeX citation (see below). Low-priority materials usually get nothing but a summary and a citation. High-priority materials get detailed notes. Each note is written as a bullet point, and can represent one important fact or insight, or can be a summary of a paragraph or several pages, or even a summary of a chapter. It depends on how important the material is relative to my work.

Taking notes is not like writing a book report. I’m not writing an even or objective summary of the material in its entirety. Instead, I’m picking out the pieces that are of interest to me and to the work I’ll be doing, and I skip the rest. Sometimes I’ll editorialize (this is great, or this stinks) but I write in such a way that my thoughts are distinct from what the author is saying. This is where efficiency comes into the picture: identify sources that are high versus low priority, and summarize the source and identify just the specific details that are relevant to you. You’re writing these notes with specific research goals in mind, so don’t waste time writing a generic book report.

I always summarize or paraphrase the material as I take notes, putting concepts in my own words. Doing this forces you to wrestle with the concepts and internalize them, which improves your understanding of the material and your memory for it. It also helps guard against plagiarism; once you start writing the paper, you’ll know your notes are already in your own words and you can use them freely. If I do quote something directly, I always surround it with quotation marks. Lastly, at the end of my note I provide the page numbers to indicate what’s been summarized, so I can go back if need be.

Note taking is an idiosyncratic process. What works for you may not work for someone else and vice versa. The key is to figure out what works best for you; create a system, try it out, and once you’re happy go with it. You can always tweak things as you move along. The notes will help you when it comes time to pull your ideas together into a cohesive paper, but it’s the reading and note taking process that helps you to become proficient with the subject matter.

As I was re-learning how to take notes, I found the handouts from the University of Melbourne’s Academic Skills Unit to be particularly valuable. This is their latest version of Taking Notes From Texts, and this is the older version that I stumbled on years ago.

(Python code for scraping BibTeX records out of wiki notes to create a bibliography is posted below).

#Parse notes stored in zim wiki to extract all bibtex records and write them
#to a new bibtex file named with today's date.
#Script must be stored directly above the notes folder where the wiki data
#is stored. It will ignore the empty bibtex template files and will only
#read wiki files stored as .txt.
#Within the wiki, all bibtex records in the notes are enclosed in a bibtex tag.
#The script reads each line and ignores them until it finds the open <bibtex>
#tag. Then it starts writing each line until it reads the close </bibtex> tag.
#A line return is appended so records are separated in the output file.
#A list and count of extracted records is provided as a diagnostic
import os, datetime
now=datetime.date.today()
path='.'
outfile='sources_'+str(now)+'.bib'
writefile=open(outfile,'w')
counter=0
titles=[]
for (subdir, dirs, files) in os.walk(path):
    # Skip the empty bibtex template pages and the folder of source documents
    if 'Templates' in dirs:
        dirs.remove('Templates')
    if 'documents' in dirs:
        dirs.remove('documents')
    for file in files:
        if file[-4:] == '.txt':
            readfile = open(os.path.join(subdir, file), 'r')
            # Skip lines until the opening <bibtex> tag is found
            for line in readfile:
                if line.startswith('<bibtex>'):
                    break
            # Copy lines until the closing </bibtex> tag is found
            for line in readfile:
                if line.startswith('</bibtex>'):
                    titles.append(file)
                    writefile.write('\n')
                    counter = counter + 1
                    break
                else:
                    writefile.write(line)
            readfile.close()
writefile.close()
titles = [t.replace('_', ' ') for t in titles]
titles = [t[:-4] for t in titles]  # drop the .txt extension
titles.sort()
print('Bibliographic records have been extracted for the following sources:', '\n')
for title in titles:
    print('*', title)
print('\n')
print(counter, 'bibliographic records have been parsed and written to', outfile)