Statistical Interests in Large Cities

I always thought that there were some kind of schools in statistics, areas (not to say universities or laboratories) where people had common interest in term of statistical methodology. Like people with strong interest in extreme values, or in Lévy Processes. I wanted to check this point so I did extract information about articles puslished in about 35 journals in statistics, probability and econometrics. I got all the information in files extracted from http://scopus.com/

I decided to focus on 90 locations. Each time I have a string which is the same as the name of one of my 90 cities, I keep it. So if there is a Prof. Ann Arbor, I will consider that person as a city. Here is the graph of all locations, with the number of “articles“. Or contributors. If four people in San Francisco published toegher an article, the article appears four times in my dataset. I did spend some time with Cambridge, and I decided to move Cambridge, MA to Boston, MA. Just for convenience.

I did spend some time on some cities, such as Paris, or London, where zip code was sometimes attached to the city name. I also had to fix some problems… But after a few minuts, I was able to locate those cities.

Then, I wanted to extract information about all publications. Keywords are interesting, but over 266,567 “publications“, it is hard to use (sometimes it is not file, somethimes it is extremely general, or extremely specialized). So I decided to extract words from the title of the contribution.

For bayesian statistics (publication with the word bayesian in the title)

For nonparametric statistics (publication with the word nonparametric in the title)

For stochastic processes (publication with the word processes in the title)

(the problem here is that we cannot visualize the red circles: if in a city, no one published on a given topic, it would be strong red, but tiny, or even null… so we won’t see it). It decided to keep the top 250 words that appeaared in titles, I removed standard common words, such as it, the, of, etc.

On the graph below, the green level is the theoretical counts of each word, under some independence assumption. The dark line is the observed one. For instance, in San Francisco, on top, we have words that were not used a lot (e.g. processes: given the total number of publications, it would make sense to have 6 or 7 publications with the word processes, but there were 0 publications actually), and below words that were intensively used. Intensively (such as method and structure, the last one was expected two or three times, but it appeared in 25 publications) compared with the other cities,

In Boston, MA, we got

In New York City, NY

In Paris (France),

But to be honest, I was disapointed. I mean, yes, I can see on the previous graph, for instance, that there are a lot of people working on stochastic processes, with the words Brownian and Markov. But for most cases, I can hardly get an interpretation…

I tried a finaly graph, on interconnexions between authors. The first point is that it is common to have joint publications with colleagues, in the same city. The largest the point, the more joint papers,

But we can add here cross publications: the thinner the line, the less joint publications between two places,

We can see that I missed in the first part the Cambridge-Boston distinction, since Cambridge should now stand for Cambridge, UK. But the line is clearly too large to be explained here by collaboration betweem Cambridge, UK, and Boston, MA. But still. a lot of them can be explained, with Hong-Kong and Shanghai, or Mexico and Guanajuato.

If someone has better ideas to import properly the locations (or affiliations, it might be fun to focus on universities) and perhaps the abstract (more than the title), I’d be glad to try the same study in Economic journals…