Significance

This study compares the accuracy of personality judgment—a ubiquitous and important social-cognitive activity—between computer models and humans. Using several criteria, we show that computers’ judgments of people’s personalities based on their digital footprints are more accurate and valid than judgments made by their close others or acquaintances (friends, family, spouse, colleagues, etc.). Our findings highlight that people’s personalities can be predicted automatically and without involving human social-cognitive skills.

Abstract

Judging others’ personalities is an essential skill in successful social living, as personality is a key driver behind people’s interactions, behaviors, and emotions. Although accurate personality judgments stem from social-cognitive skills, developments in machine learning show that computer models can also make valid judgments. This study compares the accuracy of human and computer-based personality judgments, using a sample of 86,220 volunteers who completed a 100-item personality questionnaire. We show that (i) computer predictions based on a generic digital footprint (Facebook Likes) are more accurate (r = 0.56) than those made by the participants’ Facebook friends using a personality questionnaire (r = 0.49); (ii) computer models show higher interjudge agreement; and (iii) computer personality judgments have higher external validity when predicting life outcomes such as substance use, political attitudes, and physical health; for some outcomes, they even outperform the self-rated personality scores. Computers outpacing humans in personality judgment presents significant opportunities and challenges in the areas of psychological assessment, marketing, and privacy.

Perceiving and judging other people’s personality traits is an essential component of social living (1, 2). People use personality judgments to make day-to-day decisions and long-term plans in their personal and professional lives, such as whom to befriend, marry, trust, hire, or elect as president (3). The more accurate the judgment, the better the decision (2, 4, 5). Previous research has shown that people are fairly good at judging each other’s personalities (6–8); for example, even complete strangers can make valid personality judgments after watching a short video presenting a sample of behavior (9, 10).

Although it is typically believed that accurate personality perceptions stem from social-cognitive skills of the human brain, recent developments in machine learning and statistics show that computer models are also capable of making valid personality judgments by using digital records of human behavior (11–13). However, the comparative accuracy of computer and human judgments remains unknown; this study addresses this gap.

Personality traits, like many other psychological dimensions, are latent and cannot be measured directly; various perspectives exist regarding the evaluation criteria of judgmental accuracy (3, 5). We adopted the realistic approach, which assumes that personality traits represent real individual characteristics, and the accuracy of personality judgments may be benchmarked using three key criteria: self-other agreement, interjudge agreement, and external validity (1, 5, 7). We apply those benchmarks to a sample of 86,220 volunteers,* who filled in the 100-item International Personality Item Pool (IPIP) Five-Factor Model of personality (14) questionnaire (15), measuring traits of openness, conscientiousness, extraversion, agreeableness, and neuroticism.

Computer-based personality judgments, based on Facebook Likes, were obtained for 70,520 participants. Likes were previously shown to successfully predict personality and other psychological traits (11). We used LASSO (Least Absolute Shrinkage and Selection Operator) linear regressions (16) with 10-fold cross-validations, so that judgments for each participant were made using models developed on a different subsample of participants and their Likes. Likes are used by Facebook users to express positive association with online and offline objects, such as products, activities, sports, musicians, books, restaurants, or websites. Given the variety of objects, subjects, brands, and people that can be liked and the number of Facebook users (>1.3 billion), Likes represent one of the most generic kinds of digital footprint. For instance, liking a brand or a product offers a proxy for consumer preferences and purchasing behavior; music-related Likes reveal music taste; and liked websites allow for approximating web browsing behavior. Consequently, Like-based models offer a good proxy of what could be achieved based on a wide range of other digital footprints such as web browsing logs, web search queries, or purchase records (11).

Human personality judgments were obtained from the participants’ Facebook friends, who were asked to describe a given participant using a 10-item version of the IPIP personality measure. To compute self-other agreement and external validity, we used a sample of 17,622 participants judged by one friend; to calculate interjudge agreement, we used a sample of 14,410 participants judged by two friends. A diagram illustrating the methods is presented in Fig. 1.

Fig. 1. Methodology used to obtain computer-based judgments and estimate the self-other agreement. Participants and their Likes are represented as a matrix, where entries are set to 1 if there exists an association between a participant and a Like and 0 otherwise (second panel). The matrix is used to fit five LASSO linear regression models (16), one for each self-rated Big Five personality trait (third panel). A 10-fold cross-validation is applied to avoid overfitting: the sample is randomly divided into 10 equal-sized subsets; 9 subsets are used to train the model (step 1), which is then applied to the remaining subset to predict the personality score (step 2). This procedure is repeated 10 times to predict personality for the entire sample. The models are built on participants having at least 20 Likes. To estimate the accuracy achievable with fewer than 20 Likes, we applied the regression models to random subsets of 1–19 Likes for all participants.
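The cross-validated procedure described in this caption can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the coordinate-descent solver, the function names (`fit_lasso`, `cross_val_predict`, `pearson`), the penalty value `alpha=0.01`, and the synthetic 200-by-8 Like matrix are all hypothetical choices made for the example.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso(X, y, alpha, n_iter=50):
    """Coordinate descent for (1/2n)*||y - b0 - Xw||^2 + alpha*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]  # residuals while all weights are zero
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        shift = sum(resid) / n  # unpenalised intercept update
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a Like that nobody in the training set has
            # Covariance of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                w[j] = new_w
    return b0, w

def cross_val_predict(X, y, alpha, k=10, seed=0):
    """Predict each row with a model trained on the other k-1 folds."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[f::k] for f in range(k)]
    preds = [0.0] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        b0, w = fit_lasso([X[i] for i in train], [y[i] for i in train], alpha)
        for i in held_out:
            preds[i] = b0 + sum(wj * xj for wj, xj in zip(w, X[i]))
    return preds

def pearson(a, b):
    """Pearson product-moment correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Synthetic stand-in data: 200 users, 8 Likes, trait driven by Likes 0 and 1.
rng = random.Random(42)
likes = [[rng.randint(0, 1) for _ in range(8)] for _ in range(200)]
trait = [1.0 * row[0] - 0.8 * row[1] + rng.gauss(0, 0.3) for row in likes]

predicted = cross_val_predict(likes, trait, alpha=0.01, k=10)
accuracy = pearson(predicted, trait)  # analogue of self-other agreement
print(f"out-of-fold accuracy r = {accuracy:.2f}")
```

Because every prediction comes from a model that never saw that participant, the resulting correlation is not inflated by overfitting. In practice one model per Big Five trait would be fitted, and the penalty would be tuned rather than fixed.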

Results

Self-Other Agreement.

The primary criterion of judgmental accuracy is self-other agreement: the extent to which an external judgment agrees with the target’s self-rating (17), usually operationalized as a Pearson product-moment correlation. Self-other agreement was determined by correlating participants’ scores with the judgments made by humans and computer models (Fig. 1). Since self-other agreement varies greatly with the length and context of the relationship (18, 19), we further compared our results with those previously published in a meta-analysis by Connelly and Ones (20), including estimates for different categories of human judges: friends, spouses, family members, cohabitants, and work colleagues.

To account for the questionnaires’ measurement error, self-other agreement estimates were disattenuated using the scales’ Cronbach’s α reliability coefficients. The measurement error of the computer model was assumed to be 0, resulting in lower (conservative) estimates of self-other agreement for computer-based judgments. Disattenuation also allowed for direct comparisons of human self-other agreement with the estimates reported by Connelly and Ones (20), who followed the same procedure.
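The disattenuation step can be illustrated with the standard correction for attenuation, in which an observed correlation is divided by the square root of the product of the two measures' reliabilities. The sketch below assumes that standard formula; the function name `disattenuate` and all numeric values are hypothetical and are not taken from the study.

```python
import math

def disattenuate(r_observed, reliability_a, reliability_b=1.0):
    """Correct an observed correlation for measurement error in both measures.

    reliability_b defaults to 1.0, which corresponds to assuming that the
    second measure (here, the computer model) has no measurement error.
    """
    if not (0.0 < reliability_a <= 1.0 and 0.0 < reliability_b <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_a * reliability_b)

# Hypothetical values for illustration only.
human = disattenuate(0.40, reliability_a=0.80, reliability_b=0.80)
computer = disattenuate(0.40, reliability_a=0.80)  # judge reliability set to 1

print(f"human judge:    {human:.3f}")
print(f"computer model: {computer:.3f}")
```

With the same observed correlation, setting the judge's reliability to 1 yields the smaller corrected value, which is why treating the computer model as error-free gives a conservative estimate.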

The results presented in Fig. 2 show that computers’ average accuracy across the Big Five traits (red line) steadily grows with the number of Likes available on the participant’s profile (x axis). Computer models need only 100 Likes to outperform an average human judge in the present sample (r = 0.49; blue point).† Compared with the accuracy of various human judges reported in the meta-analysis (20), computer models need 10, 70, 150, and 300 Likes, respectively, to outperform an average work colleague, cohabitant or friend, family member, and spouse (gray points). Detailed results for human judges can be found in Table S1.

Fig. 2. Computer-based personality judgment accuracy (y axis), plotted against the number of Likes available for prediction (x axis). The red line represents the average accuracy (correlation) of computers’ judgment across the five personality traits. The five-trait average accuracy of human judgments is positioned onto the computer accuracy curve. For example, the accuracy of an average human individual (r = 0.49) is matched by that of the computer models based on around 90–100 Likes. The computer accuracy curves are smoothed using a LOWESS approach. The gray ribbon represents the 95% CI. Accuracy was averaged using Fisher’s r-to-z transformation.
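Averaging correlations through Fisher's r-to-z transformation means converting each r to z = atanh(r), averaging the z values, and converting the mean back with tanh. A minimal sketch follows; the function name `average_correlations` and the five example values are hypothetical, and the unweighted mean is an assumption, since the text does not say whether traits were weighted.

```python
import math

def average_correlations(correlations):
    """Average correlations on Fisher's z scale and map the mean back to r."""
    zs = [math.atanh(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical per-trait accuracies, for illustration only.
trait_accuracies = [0.65, 0.52, 0.55, 0.50, 0.48]
mean_r = average_correlations(trait_accuracies)
print(f"Fisher-averaged accuracy: {mean_r:.3f}")
```

The z-scale average is slightly higher than the plain arithmetic mean whenever the inputs differ, because the transformation stretches values near 1.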

How accurate is the computer, given an average person? Our recent estimate of an average number of Likes per individual is 227 (95% CI = 224, 230),‡ and the expected computer accuracy for this number of Likes equals r = 0.56. This accuracy is significantly better than that of an average human judge (z = 3.68, P < 0.001) and comparable with an average spouse, the best of human judges (r = 0.58, z = −1.68, P = 0.09). The peak computer performance observed in this study reached r = 0.66 for participants with more than 500 Likes. The approximately log-linear relationship between the number of Likes and computer accuracy, shown in Fig. 2, suggests that increasing the amount of signal beyond what was available in this study could further boost the accuracy, although gains are expected to be diminishing.
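A z statistic for the difference between two correlations, like those reported in this paragraph, is commonly obtained by comparing the correlations on Fisher's z scale. The sketch below assumes two independent samples, which may not match the exact test the authors used; the function name `compare_correlations` and the sample sizes in the example are hypothetical.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Two-sided z test for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Correlations from the text; the sample sizes are placeholders.
z_stat, p_value = compare_correlations(0.56, 1000, 0.49, 1000)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

Larger samples shrink the standard error, so the same difference in r produces a larger z; reproducing the reported statistics would require the actual sample sizes and any dependence between the two sets of judgments.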

Why are Likes diagnostic of personality? Exploring the Likes most predictive of a given trait shows that they represent activities, attitudes, and preferences highly aligned with the Big Five theory. For example, participants with high openness to experience tend to like Salvador Dalí, meditation, or TED talks; participants with high extraversion tend to like partying, Snookie (reality show star), or dancing.

Self-other agreement estimates for individual Big Five traits (Fig. 2) reveal that the Likes-based models are more diagnostic of some traits than of others. Especially high accuracy was observed for openness—a trait known to be otherwise hard to judge due to low observability (21, 22). This finding is consistent with previous findings showing that strangers’ personality judgments, based on digital footprints such as the contents of personal websites (23), are especially accurate in the case of openness. As openness is largely expressed through individuals’ interests, preferences, and values, we argue that the digital environment provides a wealth of relevant clues presented in a highly observable way.

Interestingly, it seems that human and computer judgments capture distinct components of personality. Table S2 lists correlations and partial correlations (all disattenuated) between self-ratings, computer judgments, and human judgments, based on a subsample of participants (n = 1,919) for whom both computer and human judgments were available. The average consensus between computer and human judgments (r = 0.37) is relatively high, but it is mostly driven by their correlations with self-ratings, as represented by the low partial correlations (r = 0.07) between computer and human judgments. Substantial partial correlations between self-ratings and both computer (r = 0.38) and human judgments (r = 0.42) suggest that computer and human judgments each provide unique information.
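The partial correlations in this paragraph follow the usual first-order formula, which removes from the correlation between two variables the part explained by a third. The sketch below assumes that formula; the function name `partial_correlation` and the input values are illustrative, and the example is not intended to reproduce the figures in Table S2.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after controlling for z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator

# Illustrative inputs: x = computer judgment, y = human judgment, z = self-rating.
r_computer_human = 0.30
r_computer_self = 0.55
r_human_self = 0.50
unique_overlap = partial_correlation(r_computer_human, r_computer_self, r_human_self)
print(f"computer-human agreement beyond self-ratings: {unique_overlap:.3f}")
```

When the two judgments agree with each other only to the extent that each agrees with the self-rating, the numerator approaches zero, which is the pattern the low partial correlation in the text describes.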

Interjudge Agreement.

Another indication of the judgment accuracy, interjudge agreement, builds on the notion that two judges that agree with each other are more likely to be accurate than those that do not (3, 24–26).

The interjudge agreement for humans was computed using a subsample of 14,410 participants judged by two friends. As the judgments were aggregated (averaged) on collection (i.e., we did not store judgments separately for the judges), a formula was used to compute their intercorrelation (SI Text). Interjudge agreement for computer models was estimated by randomly splitting the Likes into two halves and developing two separate models following the procedure described in the previous section.
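The split-half step for computer models can be sketched as follows. This assumes one possible reading of the text, namely that the set of Like columns is divided at random into two disjoint halves, each of which then feeds its own model; the function names `split_likes` and `restrict` and the toy matrix are hypothetical.

```python
import random

def split_likes(n_likes, seed=0):
    """Randomly divide Like column indices into two disjoint halves."""
    columns = list(range(n_likes))
    random.Random(seed).shuffle(columns)
    half = n_likes // 2
    return sorted(columns[:half]), sorted(columns[half:])

def restrict(matrix, columns):
    """Keep only the given Like columns of a user-by-Like matrix."""
    return [[row[j] for j in columns] for row in matrix]

# Toy user-by-Like matrix: 3 users, 6 Likes.
likes = [
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
]
first_half, second_half = split_likes(6, seed=1)
likes_a = restrict(likes, first_half)
likes_b = restrict(likes, second_half)
print(first_half, second_half)
```

Each reduced matrix would then go through the same cross-validated modelling procedure, and the two resulting sets of predictions would be correlated to give the interjudge agreement between the two "computer judges".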

The average consensus between computer models, expressed as the Pearson product-moment correlation across the Big Five traits (r = 0.62), was much higher than the estimate for human judges observed in this study (r = 0.38, z = 36.8, P < 0.001) or in the meta-analysis (20) (r = 0.41, z = 41.99, P < 0.001). All results were corrected for attenuation.

External Validity.

The third measure of judgment accuracy, external validity, focuses on how well a judgment predicts external criteria, such as real-life behavior, behaviorally related traits, and life outcomes (3). Participants’ self-rated personality scores, as well as humans’ and computers’ judgments, were entered into regression models (linear or logistic for continuous and dichotomous variables respectively) to predict 13 life outcomes and traits previously shown to be related to personality: life satisfaction, depression, political orientation, self-monitoring, impulsivity, values, sensational interests, field of study, substance use, physical health, social network characteristics, and Facebook activities (see Table S3 for detailed descriptions). The accuracy of those predictions, or external validity, is expressed as Pearson product-moment correlations for continuous variables, or area under the receiver-operating characteristic curve (AUC) for dichotomous variables.§
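For dichotomous outcomes, the AUC can be computed directly from its pairwise definition: the share of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below is a minimal illustration; the function name `auc` and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of the two classes."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0  # pair ranked correctly
            elif pos == neg:
                wins += 0.5  # tie counts as half
    return wins / (len(positives) * len(negatives))

# Toy example: model scores for a dichotomous outcome.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auc(scores, labels))  # 3 of the 4 pairs are ranked correctly: 0.75
```

A model whose scores carry no information about the outcome ranks about half of the pairs correctly, which is why 0.50 serves as the chance baseline.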

As shown in Fig. 3, the external validity of the computer judgments was higher than that of human judges in 12 of the 13 criteria (except life satisfaction). Furthermore, computer models’ external validity was even better than self-rated personality in 4 of the 13 criteria: Facebook activities, substance use, field of study, and network size; and comparable in predicting political attitudes and social network characteristics. Because most of the outcome variables are self-reports, the high external validity of personality self-ratings is to be expected. It is therefore striking that Likes-based judgments were still better at predicting variables such as field of study or self-rated substance use, even though those variables share more method variance with self-ratings of personality. In addition, the computer-based models were aimed at predicting personality scores and not life outcomes. In fact, Likes-based models, directly aimed at predicting such variables, can achieve even higher accuracy (11).

Discussion

Our results show that computer-based models are significantly more accurate than humans in a core social-cognitive task: personality judgment. Computer-based judgments (r = 0.56) correlate more strongly with participants’ self-ratings than average human judgments do (r = 0.49). Moreover, computer models showed higher interjudge agreement and higher external validity (computer-based personality judgments were better at predicting life outcomes and other behaviorally related traits than human judgments). The potential growth in both the sophistication of the computer models and the amount of the digital footprint might lead to computer models outperforming humans even more decisively.

According to the Realistic Accuracy Model, the accuracy of the personality judgment depends on the availability and the amount of the relevant behavioral information, along with the judges’ ability to detect and use it correctly (1, 2, 5). Such conceptualization reveals a couple of major advantages that computers have over humans. First, computers have the capacity to store a tremendous amount of information, which is difficult for humans to retain and access. Second, the way computers use information—through statistical modeling—generates consistent algorithms that optimize the judgmental accuracy, whereas humans are affected by various motivational biases (27). Nevertheless, human perceptions have the advantage of being flexible and able to capture many subconscious cues unavailable to machines. Because the Big Five personality traits only represent some aspects of human personality, human judgments might still be better at describing other traits that require subtle cognition or that are less evident in digital behavior. Our study is limited in that human judges could only describe the participants using a 10-item-long questionnaire on the Big Five traits. In reality, they might have more knowledge than what was assessed in the questionnaire.

Automated, accurate, and cheap personality assessment tools could affect society in many ways: marketing messages could be tailored to users’ personalities; recruiters could better match candidates with jobs based on their personality; products and services could adjust their behavior to best match their users’ characters and changing moods; and scientists could collect personality data without burdening participants with lengthy questionnaires. Furthermore, in the future, people might abandon their own psychological judgments and rely on computers when making important life decisions, such as choosing activities, career paths, or even romantic partners. It is possible that such data-driven decisions will improve people’s lives.

However, knowledge of people’s personalities can also be used to manipulate and influence them (28). Understandably, people might distrust or reject digital technologies after realizing that their government, internet provider, web browser, online social network, or search engine can infer their personal characteristics more accurately than their closest family members. We hope that consumers, technology developers, and policy-makers will tackle those challenges by supporting privacy-protecting laws and technologies, and giving the users full control over their digital footprints.

Popular culture has depicted robots that surpass humans in making psychological inferences. In the film Her, for example, the main character falls in love with his operating system. By curating and analyzing his digital records, his computer can understand and respond to his thoughts and needs much better than other humans, including his long-term girlfriend and closest friends. Our research, along with developments in robotics (29, 30), provides empirical evidence that such a scenario is becoming increasingly likely as tools for digital assessment come to maturity. The ability to accurately assess psychological traits and states, using digital footprints of behavior, marks an important milestone on the path toward more social human-computer interactions.

Acknowledgments

We thank John Rust, Thore Graepel, Patrick Morse, Vesselin Popov, Winter Mason, Jure Leskovec, Isabelle Abraham, and Jeremy Peang-Meth for their critical reading of the manuscript. W.Y. was supported by the Jardine Foundation; D.S. was supported by a grant from the Richard Benjamin Trust; and M.K. was supported by Microsoft Research, Boeing Corporation, the National Science Foundation, the Defense Advanced Research Projects Agency, and the Center for the Study of Language and Information at Stanford University.

Footnotes

Author contributions: W.Y. and M.K. designed research; W.Y., M.K., and D.S. performed research; W.Y. and M.K. contributed new reagents/analytic tools; W.Y. and M.K. analyzed data; and W.Y., M.K., and D.S. wrote the paper.

Conflict of interest statement: D.S. received revenue as the owner of the myPersonality Facebook application.

This article is a PNAS Direct Submission. D.F. is a guest editor invited by the Editorial Board.

Data deposition: The data used in the study are shared with the academic community at mypersonality.org.

*The sample used in this study was obtained from the myPersonality project. myPersonality was a popular Facebook application that offered its users psychometric tests and feedback on their scores. Since the data are secondary and anonymized, were previously published in the public domain, and were originally gathered with an explicit opt-in consent for reuse for research purposes beyond the original project, no IRB approval was needed. This was additionally confirmed by the Psychology Research Ethics Committee at the University of Cambridge.

†This figure is very close to the average human accuracy (r = 0.48) found in Connelly and Ones’s meta-analysis (20).

‡Estimate based on a 2014 sample of n = 100,001 Facebook users collected for a separate project. The sample used in this study was recorded in the years 2009–2012.

§AUC is equivalent to the probability of correctly classifying two randomly selected participants, one from each class, such as liberal vs. conservative political views. Note that for dichotomous variables, the random guessing baseline corresponds to an AUC = 0.50.

(2012) Do people hold a humanoid robot morally accountable for the harm it causes? Proceedings of the Seventh Annual ACM/IEEE International Conference on Human–Robot Interaction (Association for Computing Machinery, New York), pp 33–40
