As part of a multi-year project on language shift in contemporary Indonesia, Abby Cohn, Maya Ravindranath, and I have been using the incredible census data provided by IPUMS to study what factors determine whether Indonesians speak Indonesian at home. The data are remarkable in that they comprise a 1% sample of the 2010 Indonesian census, which means that our sample size is 2,358,774 individuals. And better yet, anyone can access these data.

One thing that sociolinguists know is that urbanization leads to language shift in multilingual societies. In the Indonesian context, this means that speakers of ethnic languages like Javanese, Sundanese, Batak, and so forth will shift to speaking Bahasa Indonesia, the country’s national language.

But what’s going on here? Is this a consequence of urbanization itself, and the accompanying “modernization” of everyday life that (1) exposes you to media in the national language and (2) shifts identities away from the regional or ethnic and toward the national? Or is it a consequence of the ethnic diversity found in urban areas, which leads speakers of different languages to encounter one another more regularly and thus increases the benefits of speaking a common national language? In principle these two processes are distinct: you could have urban areas without ethnic diversity, or rural areas that are highly diverse. The neat thing about Indonesia is that it is so big and heterogeneous that we have both urban and rural districts of each kind, some homogenous and some diverse. This allows us to distinguish the two effects from one another.

Because we know the district (kabupaten or kota) in which every individual lives, and we know his/her ethnic group, we can calculate a district-level measure of ethnic diversity (using a so-called Ethnic Fractionalization index). We also know whether each individual is classified as living in an urban residence, so we can use that to calculate the fraction of each district that is urban. Both of these measures range from 0 to 1. Comparing the two measures across the 494 districts recorded in the 2010 census, here is what we find.

The good news is just how varied Indonesian districts are. There are ethnically homogenous, wholly urban districts (Kota Blitar, on Java) as well as ethnically homogenous, entirely rural districts (Nias Barat, off the coast of Sumatra). And looking to the right side of this scatterplot, we see a range of incredibly diverse districts, all of which are on Papua, that range from highly urban to highly rural.
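For readers who want to see the mechanics, here is a minimal sketch of how the two district-level measures could be computed from person-level records. It assumes the common Herfindahl-style definition of ethnic fractionalization (one minus the sum of squared group shares); the function name, the record layout, and the toy data are all invented for illustration and are not taken from our actual code or from the IPUMS files.

```python
from collections import Counter, defaultdict


def district_measures(records):
    """Compute ethnic fractionalization and urban share per district.

    `records` is an iterable of (district, ethnic_group, is_urban) tuples,
    one per person. This layout is an assumption made for the sketch.
    """
    ethnic_counts = defaultdict(Counter)  # district -> ethnic group -> count
    urban_counts = Counter()              # district -> number of urban residents
    totals = Counter()                    # district -> number of people

    for district, ethnic_group, is_urban in records:
        ethnic_counts[district][ethnic_group] += 1
        urban_counts[district] += int(bool(is_urban))
        totals[district] += 1

    measures = {}
    for district, n in totals.items():
        # Sum of squared population shares of each ethnic group.
        sum_sq = sum((c / n) ** 2 for c in ethnic_counts[district].values())
        measures[district] = {
            "ef": 1.0 - sum_sq,                       # 0 = one group only
            "urban_share": urban_counts[district] / n,
        }
    return measures


# Toy data, invented for illustration: a homogeneous, fully urban district
# and a diverse, mostly rural one.
toy = [
    ("district_A", "group_1", True), ("district_A", "group_1", True),
    ("district_A", "group_1", True), ("district_A", "group_1", True),
    ("district_B", "group_1", True), ("district_B", "group_2", False),
    ("district_B", "group_3", False), ("district_B", "group_4", False),
]
measures = district_measures(toy)
for name, m in sorted(measures.items()):
    print(name, m)
```

Under this definition the index can be read as the probability that two randomly drawn residents of a district belong to different ethnic groups, which is why a district with a single group scores 0.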

From there, we fit a hierarchical/multilevel logistic regression model in which we predict whether or not an individual speaks Indonesian at home as a function of a range of individual-level characteristics (age and its square, gender, religion, education, etc.) as well as district-level urbanization, ethnic diversity, and their interaction. We then use the results of that model to predict the probability that an individual speaks Indonesian at home as a function of their district’s ethnic diversity, holding district urbanization at its 10th, 50th, and 90th percentiles. Here is what we find.

If you live in an ethnically homogenous district, the likelihood that you speak Indonesian at home is very low, no matter how urban that district is. But as ethnic diversity increases, so does the likelihood of speaking Indonesian, and especially so in urban districts. This shows very clearly that the relationship between urbanization and language shift in a diverse country like Indonesia really does depend on whether or not urbanization comes with increasing ethnic diversity. And although the relationship between ethnic diversity and language shift is strongest in urban districts, it is substantively quite large in rural districts too.
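To make the prediction step concrete, here is a rough sketch of how predicted probabilities follow from a logistic model with a diversity-by-urbanization interaction. The coefficient values and the three urbanization levels below are placeholders made up for illustration; they are not our estimates. The sketch also assumes the district random intercept is set to zero and folds all individual-level covariates into a single `other_terms` value.

```python
import math


def inv_logit(x):
    """Map a linear predictor on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_probability(ef, urban, coefs, other_terms=0.0):
    """Probability of speaking Indonesian at home for given district values.

    `ef` is district ethnic diversity and `urban` is the district urban
    share, both on a 0-1 scale. `other_terms` stands in for the
    contribution of individual-level covariates held at fixed values.
    """
    eta = (
        coefs["intercept"]
        + coefs["ef"] * ef
        + coefs["urban"] * urban
        + coefs["ef_x_urban"] * ef * urban  # the interaction term
        + other_terms
    )
    return inv_logit(eta)


# PLACEHOLDER coefficients, invented for this sketch (not study results).
coefs = {"intercept": -4.0, "ef": 3.0, "urban": 0.5, "ef_x_urban": 2.0}

# PLACEHOLDER urbanization levels standing in for the 10th, 50th, and
# 90th percentiles of district urban share.
urban_levels = {"p10": 0.05, "p50": 0.30, "p90": 0.90}

# One curve per urbanization level, traced over diversity from 0 to 1.
for label, urban in urban_levels.items():
    curve = [predicted_probability(step / 10, urban, coefs) for step in range(11)]
    print(label, [round(p, 3) for p in curve])
```

With a positive interaction coefficient, the curves start close together at low diversity and fan out as diversity rises, which is the general pattern described above.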

Note, though, that to reach such a conclusion, you need a really diverse country like Indonesia that allows you to separate urbanization from ethnic diversity empirically. Thanks, Indonesia.