In this project, we use a Random Forest to model, for each county (municipio) in Mexico, the probability of identifying a hidden grave (fosa clandestina) from a set of independent variables. In short, the model works -- more on this below.

This work is a collaboration between HRDAG and two partners. The Programa de Derechos Humanos of the Universidad Iberoamericana created the database of clandestine graves, and they wrote about this analysis in a blog post here. Data Cívica developed the database of predictor variables, and they have written a complementary analysis in a post here. Their reports provide substantially more context -- this post is primarily technical.

In the model below, we're predicting whether one or more clandestine graves will be discovered in a county. Known graves are necessarily a subset of all graves, because some graves are certainly never discovered. It is therefore possible that graves with different perpetrators or different causes would not be predicted by this model. This means, for example, that if predictions like these were the only guide for future searches, searchers would likely miss graves that are unlike those found before but are probably still out there.

This is an epistemological challenge. The interesting question is "where are the hidden graves?" But what the model answers is "in which counties will we discover graves that are like the graves we have found in the past?"

In this context, similarity -- graves like the ones we have found in the past -- means that the social production of knowledge about graves found in the past will resemble that about graves found in the future. The question of "findability" should therefore be central to thinking about these results. How were these graves found? Is there something about these counties, or these graves, that makes them findable, while other graves remain unfindable? Do these graves hold the victims of perpetrators who share some methods, while other perpetrators are more successful at hiding graves? We do not address those questions here.

The very high precision of the prediction may mean that the counties in which graves become known have wildly different values on many variables relative to the counties thought to have very low probabilities of graves. This suggests a direction for future research; more on this below.

We can predict the counties (municipios) that have known clandestine graves (fosas clandestinas) in 2013 and 2014. Here are the steps we took:

The dependent variable, organized by the team from the Universidad Iberoamericana, contains data from 2013 and 2014 for municipalities in which graves were found and for municipalities the team defined as very unlikely to have graves (2013: 46 yes, 94 no; 2014: 80 yes, 94 no; it is a coincidence that the two "no" counts are equal). I call the counties for which the Ibero team defined whether a grave was found or not the "counties with known grave status."
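As a sketch of the shape of this dependent variable, the 2013 labels look like the following. This is a hypothetical illustration, not the Ibero data; the total municipio count here is only approximate.

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of the 2013 dependent variable: 46 municipios labeled 1
# (a grave was found), 94 labeled 0 (defined as very unlikely to have graves),
# and every other municipio left unlabeled (NaN).
n_munis = 2457
fosas = pd.Series(np.nan, index=range(n_munis), name='fosas')
fosas.iloc[:46] = 1     # counties with known graves
fosas.iloc[46:140] = 0  # counties defined as very unlikely to have graves

counts = fosas.value_counts(dropna=False)
```

Only the 140 labeled municipios are used for training and testing; the remaining rows are what the fitted model later scores.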

A number of predictor variables were organized by the team from Data Cívica. We will not interpret these variables here. In a predictive analysis, unlike an inferential one, the individual predictor variables are not the focus; instead, the quality of the model is measured by how close the predicted values are to the observed values. The ten most important variables are listed below for reference.

For both years, the model was built by dividing the counties with known grave status randomly into two groups. Two-thirds of the counties were randomly selected to be used to train a Random Forest model, and the remaining one-third of the counties were held out and used for testing. I repeated the random division, training, and testing 500 times.
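One iteration of that split can be sketched like this -- a toy version of the thresholding scheme, using the 140 labeled counties from 2013:

```python
import numpy as np

# Draw one uniform random number per labeled county; counties at or below the
# training fraction go to the training set, the rest are held out for testing.
rng = np.random.RandomState(0)  # seeded here only so the sketch is repeatable
n_labeled = 140                 # 2013: 46 "yes" + 94 "no"
train_pct = 0.66

rand = rng.random_sample(n_labeled)
train_idx = np.where(rand <= train_pct)[0]
test_idx = np.where(rand > train_pct)[0]
```

Repeating this 500 times gives 500 different train/test partitions, each roughly two-thirds training and one-third testing.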

For both 2013 and 2014, in each of the 500 iterations, the testing data was predicted perfectly. That is, looking only at the counties that were hidden from the model training, the model was able to predict which counties would have known graves, and which counties the Ibero team defined as having very low probability of a grave.

By perfect prediction, I mean no false positives, no false negatives, AUC = 1.0, and Brier loss < 0.01, in every iteration. Keep in mind that this result is on held-out testing data.
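Those criteria can be checked with scikit-learn. Here is a toy example -- invented probabilities, not the model's output -- showing what they look like when prediction is perfect:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, confusion_matrix, roc_auc_score

# Five held-out toy counties: every grave county gets a higher predicted
# probability than every non-grave county, so all the criteria are met.
y_true = np.array([0, 0, 0, 1, 1])
y_prob = np.array([0.01, 0.02, 0.00, 0.97, 0.99])
y_pred = (y_prob >= 0.5).astype(int)

cf = confusion_matrix(y_true, y_pred)
false_pos, false_neg = cf[0, 1], cf[1, 0]
auc = roc_auc_score(y_true, y_prob)
brier = brier_score_loss(y_true, y_prob)
```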

In the final step, I used the model trained on the 2013 data, together with the 2014 predictor data, to predict the grave status of all counties in 2014. The prediction was almost perfect: all counties with graves were correctly predicted, and only a few counties without graves had predictions slightly greater than zero. This finding implies that a model from a previous year can successfully predict where graves will be discovered in the current year. This is the most exciting result, and it is worth additional research.
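The cross-year step amounts to fitting on one year's labels and scoring the next year's predictors. A minimal sketch with synthetic data (the real run uses the 2013 labels and the 2014 predictor table):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins: 140 labeled "2013" counties with 5 predictors, and a
# "2014" predictor table with the same columns. One variable carries the signal.
rng = np.random.RandomState(0)
X_2013 = rng.normal(size=(140, 5))
y_2013 = (X_2013[:, 0] > 0).astype(int)
X_2014 = rng.normal(size=(200, 5))

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_2013, y_2013)
prob_2014 = model.predict_proba(X_2014)[:, 1]  # P(grave found) for each county
```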

The most urgent application is to map the clandestine graves from 2016, and retest the model for that year. If it is as accurate as it is for 2013 and 2014, we could use its predictions for 2017 to guide the search for additional graves.

It would be interesting to continue identifying counties in which there is a very low probability of finding a grave. This would be particularly useful for 2015 and 2016. These counties enable the model to distinguish between patterns of predictor variables associated with finding graves, and patterns associated with a low probability of graves. The more counties we can define as non-grave finding, the more generalized the model can become, and the better its predictions will be about which counties are the best targets for additional searches.

import numpy as np
import pandas as pd
import sklearn.metrics
from IPython.display import clear_output
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import Imputer


def make_dcfo(targ_year):
    ''' read and merge datasets, impute missing values, and define fields_to_use '''
    # prep DataCivica independent variables
    dc = pd.read_csv("../input/final_ibero.csv.gz", encoding='latin1')
    dc = dc[dc.year == targ_year]
    # drop variables that are all NULL
    dc.dropna(axis=1, how='all', inplace=True)
    dc.set_index(['inegi'], inplace=True, drop=False)
    # convert to integer
    dc.fedlib = dc.fedlib.map(lambda f: {'SI': 1, 'NO': 0}[f]).astype(int)
    # convert categorical to dummies
    carr = pd.get_dummies(dc.carretera, prefix='carr')
    dc = pd.merge(dc, carr, left_index=True, right_index=True)
    dc.drop(['carretera'], axis=1, inplace=True)
    print("\nfound {} recs with {} non-null independent vars for year == {}".format(
        len(dc), len(dc.columns), targ_year))

    # prep Ibero's list of municipios with known fosas
    fo = pd.read_excel("../input/fosas_patrick.xlsx", sheetname=str(targ_year))
    fo.rename(columns={'Clave de municipio (INEGI)': 'inegi',
                       'Presencia de fosas': 'fosas'}, inplace=True)
    if targ_year == 2013:
        assert len(fo[fo.fosas == 1]) == 46
    elif targ_year == 2014:
        assert len(fo[fo.fosas == 1]) == 80
    # keep only the muni code and the fosa count
    fo = fo[['inegi', 'fosas']]
    fo.set_index(['inegi'], inplace=True, drop=False)
    print(fo.info())

    # set these for the rest of the run
    fields_to_exclude = set(['year', 'inegi', 'munname', 'entidad', 'fosas'])
    fields_to_use = [f for f in dc.columns if f not in fields_to_exclude]

    # merge Ibero's fosas data with DC's independent variables
    dcfo = pd.merge(dc, fo, how='left', on='inegi')
    # collapse the fosa counts to a binary indicator, and mark the
    # municipios with unknown fosa status as -1
    dcfo.loc[dcfo.fosas >= 1, 'fosas'] = 1
    dcfo.fosas.fillna(-1, inplace=True)
    print(dcfo.fosas.value_counts(dropna=False))

    # fill in all the missing values. RandomForest cannot use municipios
    # with missing values, so we replace missing values with the column's
    # median value. This is a standard approach for missing data with
    # RandomForests and I don't think it makes much difference here.
    # However, if we can improve the predictor variables to reduce the
    # missing data, that would be slightly better.
    imputer = Imputer(missing_values="NaN", strategy="median", axis=0)
    dcfo_X = imputer.fit_transform(dcfo[fields_to_use])
    dcfo_X = pd.DataFrame(dcfo_X, columns=fields_to_use)
    dcfo.drop(fields_to_use, axis=1, inplace=True)
    dcfo = pd.concat([dcfo, dcfo_X], axis=1)
    return dcfo, fields_to_use, fields_to_exclude

In [4]:

def dcfo_split(dcfo, train_pct=0.5, verbose=False):
    ''' divide dcfo with known fosa state randomly by train_pct '''
    dcfo['rand'] = np.random.random(len(dcfo))
    labeled = dcfo[pd.notnull(dcfo.fosas)]
    train = labeled[(labeled.rand <= train_pct) & (labeled.fosas >= 0)]
    test = labeled[(labeled.rand > train_pct) & (labeled.fosas >= 0)]
    dcfo.drop(['rand'], inplace=True, axis=1)
    del labeled
    if verbose:
        print("train:")
        print(train.fosas.value_counts())
        print("test:")
        print(test.fosas.value_counts())
    print("test and training ready.")
    return train, test

In [5]:

def train_and_chk(train, test, fields_to_use, tol=1, verbose=False, auc=False):
    ''' create and test a RF model '''
    def predict_and_chk(model, df, tol=1, verbose=verbose, auc=auc):
        ''' convenience function to chk training and testing classification '''
        Y = model.predict(df[fields_to_use])
        cf = sklearn.metrics.confusion_matrix(df.fosas, Y)
        assert (cf[0, 1] + cf[1, 0]) <= tol
        if verbose:
            print(cf)
        if verbose and auc:
            Y = model.predict_proba(df[fields_to_use])[:, 1]
            fpr, tpr, thresholds = sklearn.metrics.roc_curve(
                df.fosas, Y, pos_label=1)
            auc_score = sklearn.metrics.auc(fpr, tpr)
            print('AUC:', auc_score)
            brier = sklearn.metrics.brier_score_loss(df.fosas, Y, pos_label=1)
            print('brier:', brier)

    # This is the algorithm
    cls = RandomForestClassifier(n_estimators=200)
    # This trains the model
    model = cls.fit(train[fields_to_use], train.fosas)
    # this checks the training: it should be close to perfect
    predict_and_chk(model, train, tol=tol)
    predict_and_chk(model, test, tol=tol, auc=auc)
    return model


# example use
print("2013:")
dcfo, fields_to_use, fields_to_exclude = make_dcfo(2013)
train, test = dcfo_split(dcfo, verbose=True, train_pct=0.66)
model = train_and_chk(train, test, fields_to_use, tol=1, verbose=True, auc=True)

print("\n2014:")
dcfo, fields_to_use, fields_to_exclude = make_dcfo(2014)
train, test = dcfo_split(dcfo, verbose=True, train_pct=0.66)
model = train_and_chk(train, test, fields_to_use, tol=1, verbose=True, auc=True)

del dcfo, fields_to_use, fields_to_exclude, train, test, model

The code in the train_and_chk function above estimates a Random Forest with the training data, checks that the training data has been properly classified, and then checks that the testing data has also been properly classified.

In each case, "properly classified" is measured by the confusion matrix (also see the simpler explanation here, but note that the Yes/No rows and columns are sometimes interchanged: both of those explanations put TP in the upper left and TN in the lower right, which is the opposite of the layout used here, sorry). In the layout used here, the upper-left cell counts the municipios that the Ibero team classified as having no fosas and that the model also classifies as having no fosas; these are the "true negatives" (TN). The upper-right cell counts the municipios classified by the Ibero team as having no fosas but by the model as having them; these are the "false positives" (FP). The lower-left cell counts the municipios that the Ibero team identified as having fosas but that the model predicts as not having them; these are the "false negatives" (FN). The lower-right cell counts the municipios that the Ibero team identified as having fosas and for which the model predicts fosas; these are the "true positives" (TP).

TN FP
FN TP

The confusion matrices shown here have all the cases in the upper-left and lower-right cells, which indicates perfect prediction. This is tested in each iteration.
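For reference, scikit-learn's confusion_matrix puts true labels on the rows and predicted labels on the columns, so with labels (0, 1) it produces exactly this layout. A toy example:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: one "no fosas" county is predicted as having fosas, so it lands
# in the upper-right (FP) cell; everything else is on the diagonal.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]
cf = confusion_matrix(y_true, y_pred)
# cf == [[2, 1],
#        [0, 2]]
```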

In [6]:

# this cell runs the split and train functions some number of times
# (500 is set here).
def iterate_RF(targ_year):
    dcfo, fields_to_use, fields_to_exclude = make_dcfo(targ_year)
    print('{} fields to use:\n'.format(len(fields_to_use)), fields_to_use)
    importances = pd.DataFrame(fields_to_use, columns=['field'])
    print(importances.info())
    preds = dcfo[['inegi', 'munname', 'fosas']].copy()
    pfields = list()
    loops = list()
    for i in range(500):
        if i % 10 == 0:
            loops.append(str(i))
            clear_output()
            print(','.join(loops))
        pfield = 'p{:02}'.format(i)
        pfields.append(pfield)
        train, test = dcfo_split(dcfo, train_pct=0.66)
        model = train_and_chk(train, test, fields_to_use,
                              tol=0, verbose=False, auc=False)
        importances[pfield] = model.feature_importances_
        preds[pfield] = model.predict_proba(dcfo[fields_to_use])[:, 1]

    preds['prob'] = preds[pfields].mean(axis=1)
    preds['std'] = preds[pfields].std(axis=1)
    preds.sort_values(['prob'], ascending=False, inplace=True)
    print(preds[preds.fosas == -1][['inegi', 'munname', 'prob', 'std']].head(10))

    importances['prob'] = importances[pfields].mean(axis=1)
    importances['std'] = importances[pfields].std(axis=1)
    importances.sort_values(['prob'], ascending=False, inplace=True)
    print(importances[['field', 'prob', 'std']].head(10))
    print('done.')

This function divides the data into training and testing sets and estimates a Random Forest model 500 times. The primary goal of the function is to make sure that across all the random divisions of the data into training and testing sets, the testing set is always predicted perfectly.

As a secondary benefit, the ten counties classified as most likely to have graves are listed. The probabilities estimated are not very high -- they are all below 0.5 -- so this is not a strong indication that there will be graves found in these counties. Nonetheless, these are the counties most similar to the counties in which graves were found.
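The averaging behind those listed probabilities works like this -- made-up values for three hypothetical municipios over three iterations rather than 500:

```python
import pandas as pd

# Each iteration contributes one predicted-probability column; the final score
# per municipio is the mean across iterations, with the std as a stability check.
preds = pd.DataFrame({
    'p00': [0.10, 0.45, 0.30],
    'p01': [0.12, 0.40, 0.34],
    'p02': [0.08, 0.50, 0.32],
}, index=['muni_a', 'muni_b', 'muni_c'])

pfields = ['p00', 'p01', 'p02']
preds['prob'] = preds[pfields].mean(axis=1)
preds['std'] = preds[pfields].std(axis=1)
top = preds.sort_values('prob', ascending=False)
```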

The next table describes the ten variables which have the greatest impact on the model. This paper explains how to interpret the variable importances; for the purpose of this analysis, we only want to know approximately which kinds of measures are most interesting for the prediction.
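A sketch of how such a table is produced, using synthetic data and invented column names (the real fields come from the Data Cívica table):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a forest where only one (hypothetical) variable carries signal, then
# rank the fields by the model's impurity-based feature importances.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=['pop', 'roads', 'homicides', 'noise'])
y = (X['homicides'] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
imp = pd.Series(model.feature_importances_, index=X.columns)
top = imp.sort_values(ascending=False)
```

The importances sum to one across all fields, so they indicate relative, not absolute, contribution.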

I am very grateful to my HRDAG colleague Dr. Kristian Lum for helpful suggestions about how to explain this work. I want to thank the teams from the Universidad Iberoamericana and from Data Cívica for their work organizing the data and thinking about what it means.