Nationality Classification Using Name Embeddings

Publication

Nov 6, 2016

Abstract

Nationality identication unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support ne-grained classication. We exploit the phenomena of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality which is readily applicable to building classiers and other systems. rough our analysis of 57M contact lists from a major Internet company, we are able to design a ne-grained nationality classier covering 39 groups representing over 90% of the world population. In an evaluation against other published systems over 13 common classes, our F1 score (0.795) is substantial beer than our closest competitor Ethnea (0.580). To the best of our knowledge, this is the most accurate, ne-grained nationality classier available. As a social media application, we apply our classiers to the followers of major Twier celebrities over six dierent domains. We demonstrate stark dierences in the ethnicities of the followers of Trump and Obama, and in the sports and entertainments favored by dierent groups. Finally, we identify an anomalous political gure whose presumably inated following appears largely incapable of reading the language he posts in