Maybe I'll add some breakouts of individual episodes later today if I get some time, but here are the overall word clouds like the ones I made for Downton Abbey. Mad Men has noticeably fewer outliers towards the top:

And the ones that are there are actually appropriate. (My dissertation actually has a bit on the origins of focus groups in the 1940s.)

Tuesday, March 6, 2012

In my last post, I looked at first names as a rough gauge of author gender to see who is missing from libraries. This method has two obvious failings as a way of finding gender:

1) People use pseudonyms that can be of the opposite gender. (More often women writing as men, but sometimes men writing as women as well.)

2) People publish using initials. It's pretty widely known that women sometimes publish under their initials to avoid making their gender obvious.

The first problem is basically intractable without specific knowledge. (I can fix George Eliot by hand, but no other way.) The second we can actually get some data on, though. Authors are identified by their first initial alone in about 10% of the books I'm using (1905-1922, Open Library texts). It turns out we can actually figure out a little bit about what gender they are. If this is a really important phenomenon in the data, then it should show up in other ways.
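To make the initials check concrete, here is a minimal Python sketch of how one might flag authors listed by a first initial alone and compute their share of a catalog. This is not the code behind the numbers in this post: the sample names, the "Last, First" format, and the helper names (`first_name_token`, `is_initial_only`, `initial_share`) are all assumptions made up for illustration.

```python
import re

# Hypothetical sample of author strings in "Last, First" form.
# These are illustrative only, not records from the Open Library data.
authors = [
    "Eliot, George",
    "Wharton, Edith",
    "Wells, H. G.",
    "Nesbit, E.",
    "James, Henry",
]

def first_name_token(author):
    """Return the first token after the comma, or '' if there is none."""
    _, _, given = author.partition(",")
    tokens = given.split()
    return tokens[0] if tokens else ""

def is_initial_only(token):
    """True when the token is a single letter, optionally followed by a period."""
    return re.fullmatch(r"[A-Za-z]\.?", token) is not None

def initial_share(names):
    """Fraction of names whose first given-name token is just an initial."""
    if not names:
        return 0.0
    flagged = sum(is_initial_only(first_name_token(n)) for n in names)
    return flagged / len(names)

share = initial_share(authors)
```

On a real catalog the name field is much messier than this (titles, missing commas, multiple authors), so a check like this would only be a starting point.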

I just saw that various Digital Humanists on Twitter were talking about representativeness, exclusion of women from digital archives, and other Big Questions. I can only register my general agreement with most of the comments.

But now that I see some concerns about gender biases in big digital corpora, I do have a bit to say. Partly that I have seen nothing to make me think social prejudices played into the scanning decisions at all. Rather, Google Books, Hathi Trust, the Internet Archive, and all the other similar projects are pretty much representative of the state of academic libraries. (With strange exceptions, of course.) You can choose where to vacuum, but not what gets sucked up into the machine; likewise the companies.