We’ve made a set of utilities especially for this dataset,
enron_utils. We’ll include these as well.

We downloaded the data and preprocessed it as suggested by
Ishiguro et al. (2012).
The results of the script are stored in results.p.

enron_crawler.py in the kernels repo is the script that creates
results.p.

import enron_utils

Let’s load the data and make a binary matrix to represent email
communication between individuals.

In this matrix, \(X_{i,j} = 1\) if and only if person\(_{i}\)
sent an email to person\(_{j}\).

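import itertools as it
import pickle

import numpy as np

# Load the preprocessed (sender, receivers) pairs; binary mode is required on Python 3
with open('results.p', 'rb') as fp:
    communications = pickle.load(fp)

def allnames(o):
    # Yield each sender together with that sender's receivers
    for k, v in o:
        yield [k] + list(v)

# Assign every distinct name a row/column index, in sorted order
names = set(it.chain.from_iterable(allnames(communications)))
names = sorted(names)
namemap = {name: idx for idx, name in enumerate(names)}

N = len(names)
# communications_relation[i, j] is True iff person i sent an email to person j
communications_relation = np.zeros((N, N), dtype=bool)
for sender, receivers in communications:
    sender_id = namemap[sender]
    for receiver in receivers:
        receiver_id = namemap[receiver]
        communications_relation[sender_id, receiver_id] = True

print("%d names in the corpus" % N)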
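
To see what this construction produces without needing results.p, here is a minimal sketch that runs the same steps on a tiny made-up dataset. The names in toy_communications and the helper build_relation are hypothetical, introduced only for illustration. The sketch assumes the data is an iterable of (sender, receivers) pairs, which is the shape the loop above expects.

```python
import itertools as it

import numpy as np

# Hypothetical toy data: (sender, receivers) pairs, mirroring the shape used above
toy_communications = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("carol", []),
]

def build_relation(communications):
    """Return (sorted names, boolean N x N matrix) for (sender, receivers) pairs."""
    # Collect every name that appears as a sender or a receiver
    names = sorted(set(it.chain.from_iterable(
        [sender] + list(receivers) for sender, receivers in communications)))
    namemap = {name: idx for idx, name in enumerate(names)}
    relation = np.zeros((len(names), len(names)), dtype=bool)
    for sender, receivers in communications:
        for receiver in receivers:
            # Row index is the sender, column index is the receiver
            relation[namemap[sender], namemap[receiver]] = True
    return names, relation

toy_names, toy_relation = build_relation(toy_communications)
print(toy_names)
print(toy_relation.astype(int))
```

In this toy case the matrix is not symmetric: alice sent an email to carol, so the entry in alice's row and carol's column is 1, while the entry in carol's row and alice's column stays 0.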