Wednesday, November 17, 2010

It was a grand plan. I was going to learn all sorts of new things: Pentaho, R, Processing. All sorts of new things. In the end, I got caught up in getting something done and learned none of those things. Again.

Using trusty old Python with the beautiful NetworkX module and the shockingly fast--if a bit rough around the edges--graph visualization tool Gephi, I pulled Crunchbase data to create a social network map of how venture investors coinvest. You can skip straight on down if you want; playing with it is more fun than reading about it (and more fun than learning Pentaho, it turns out).

I can't believe Crunchbase didn't rate limit me*, but most of the data is from their excellent database. I augmented it with info individuals have made available on AngelList. I didn't include any non-public info, even though I know of several excellent angels who didn't make the map because they've kept their activities under the radar.

I then had far too many nodes for any visualization to make sense. So I did two things. First, any person mentioned as an investor who was also a venture firm employee was folded into the firm. Second, I made some fixes I knew of (merging my friend Roger Ehrenberg's IA Capital into IA Ventures, for instance). I know some venture partners invest as angels outside their firms, but since this is a map of social connections, I think the step not only makes sense but also weights individuals more accurately. (Roger Ehrenberg, for instance, would not get the weight his activities deserve if his investing activity were split among three entities.)
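For the curious, the folding step can be sketched in a few lines of Python. This is a minimal illustration, not the code I actually ran: the alias table, the function names and the placeholder company names are all made up for the example, and the only real mapping in it is the Roger Ehrenberg / IA Capital / IA Ventures merge described above.

```python
# Hypothetical alias table: raw investor name -> entity it is folded into.
# Only the IA Ventures merge comes from the post; the structure is an assumption.
ALIASES = {
    "Roger Ehrenberg": "IA Ventures",
    "IA Capital": "IA Ventures",
}


def canonical(name, aliases=ALIASES):
    """Return the entity a raw investor name should be counted under."""
    return aliases.get(name, name)


def fold_investments(pairs, aliases=ALIASES):
    """Collapse (investor, company) pairs onto canonical investor names.

    Using a set means the same company reached through two aliases
    is counted once for the merged entity.
    """
    return {(canonical(investor, aliases), company) for investor, company in pairs}


# Placeholder companies, purely illustrative.
raw_pairs = [
    ("Roger Ehrenberg", "ExampleCo A"),
    ("IA Capital", "ExampleCo A"),
    ("IA Ventures", "ExampleCo B"),
]
folded = fold_investments(raw_pairs)
```

With the three entities merged, the firm shows up as one node with two investments instead of three nodes with one each.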

Then, again to make it manageable, I took out any investors with fewer than five investments. Ran it through Fruchterman-Reingold. Colored venture firms red, people green and others (corporates, incubators) blue. Made node size proportional to number of investments.
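For anyone who wants to try the filtering and styling steps in NetworkX, here is a minimal sketch, not the actual code behind the map. It assumes the data is available as (investor, company) pairs plus a lookup of investor types; the function name, the `kinds` lookup, the choice to count distinct companies as "investments", and the sample investors are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Color scheme from the post: venture firms red, people green, others blue.
COLORS = {"firm": "red", "person": "green", "other": "blue"}


def build_coinvestment_graph(pairs, kinds, min_investments=5):
    """Build a weighted co-investment graph from (investor, company) pairs.

    kinds maps investor name -> "firm", "person" or "other" (hypothetical lookup).
    Investors with fewer than min_investments distinct companies are dropped.
    """
    portfolio = defaultdict(set)  # investor -> companies backed
    backers = defaultdict(set)    # company -> investors in it
    for investor, company in pairs:
        portfolio[investor].add(company)
        backers[company].add(investor)

    keep = {inv for inv, cos in portfolio.items() if len(cos) >= min_investments}

    g = nx.Graph()
    for inv in keep:
        # "investments" can drive node size in Gephi; "color" is the type color.
        g.add_node(
            inv,
            investments=len(portfolio[inv]),
            color=COLORS[kinds.get(inv, "other")],
        )

    # One edge per pair of surviving co-investors; weight = shared companies.
    for investors in backers.values():
        for a, b in combinations(sorted(investors & keep), 2):
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
            else:
                g.add_edge(a, b, weight=1)
    return g


# Made-up sample data: only Firm X and Angel Y clear the five-investment bar.
pairs = (
    [("Firm X", f"company-{i}") for i in range(6)]
    + [("Angel Y", f"company-{i}") for i in range(5)]
    + [("Corp Z", f"company-{i}") for i in range(2)]
)
kinds = {"Firm X": "firm", "Angel Y": "person", "Corp Z": "other"}
graph = build_coinvestment_graph(pairs, kinds)
```

The sketch only builds and annotates the graph; the layout itself can be left to Gephi. If you would rather stay in Python, NetworkX's `spring_layout` implements the Fruchterman-Reingold algorithm.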

The result is below, in Zoom.it. Some things that stand out:

- The network is incredibly connected. If you go into the "core", where the Sand Hill Road firms are, there are so many edges that they are indistinguishable. Generally, in this visualization, the drawn edges are more or less decorative, because there are too many of them to make sense.

- Because of the dense interconnectivity, there are not many noticeable subnetworks, from 50,000 feet. Here's a map key, such as it is, showing some areas that are distinguishable. The separation between biotech and the core is no more noticeable, to my eye, than that between web 2.0 and the core. I do find that the further I get from my own node, the less I know about the investors.

Map key:

I should note the usual caveats. Crunchbase data is not a complete record of investment activity; in fact, it tends to be severely self-selecting. I assume both non-US and non-Internet-tech investments are underrepresented. I know non-VC investment is underrepresented. Also, my few fixes are not all-encompassing. This was a project I had time for because of a couple of long train rides. I do have the raw dataset (as Gephi, GraphML and pickled NetworkX graphs) for the entire network. If you want them, let me know.
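If you end up with one of those files and want to move between formats, something like the following should work. It is a rough sketch with a made-up helper name, made-up file names and a toy two-node graph. Note that NetworkX does not write Gephi's own project files; GEXF is the exchange format Gephi opens directly, so that is what the sketch writes.

```python
import os
import pickle
import tempfile

import networkx as nx


def export_graph(g, directory, stem="coinvestors"):
    """Write g as GraphML, GEXF and a pickle; return the paths written.

    The helper name and the file stem are hypothetical.
    """
    paths = {
        "graphml": os.path.join(directory, stem + ".graphml"),
        "gexf": os.path.join(directory, stem + ".gexf"),
        "pickle": os.path.join(directory, stem + ".pickle"),
    }
    nx.write_graphml(g, paths["graphml"])
    nx.write_gexf(g, paths["gexf"])
    with open(paths["pickle"], "wb") as fh:
        pickle.dump(g, fh)
    return paths


# Toy graph standing in for the real network.
g = nx.Graph()
g.add_node("Investor A", investments=7, color="red")
g.add_node("Investor B", investments=5, color="green")
g.add_edge("Investor A", "Investor B", weight=2)

with tempfile.TemporaryDirectory() as tmp:
    paths = export_graph(g, tmp)
    # Read the GraphML back to check that nodes and attributes survive.
    reloaded = nx.read_graphml(paths["graphml"])
```

Only unpickle graph files from people you trust, since loading a pickle can run arbitrary code; GraphML and GEXF are the safer formats to pass around.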

Drag and zoom. Find your friends.

------
* Or maybe because I was hitting their API while on the Acela, they figured it couldn't possibly be programmatic. In any case, to my fellow train passengers, I apologize for hogging the bandwidth.

Extantproject--Thanks, will check it out. The main thing I would have liked in a visualization program is a way to export to a format that's interactive (beyond drag and zoom) in the browser. Seadragon is pretty cool and it looks like it might be extensible to do that (not by me, though...). Does Cytoscape have that capability, or is it moving in that direction? Some of the stuff they're doing at Malariapedia (http://mbw.molgen.mpg.de) is amazing. Want.

Sachin--

Would be pretty easy to generate from the database I now have if given a list of companies to filter by. Not sure how much info it would give you, though. It would also be pretty easy to generate with both investors and companies, but it would probably be somewhat similar to my previous foray into data visualization, the chart Brent Halliburton put together with data I generated (here: http://www.adexchanger.com/venture-capital/ad-technology-funding/ ). Happy to share data (the raw data is in a mysql database so if you let me know what format you want it in, it would be pretty easy to generate), or re-run either way if you have a list of companies.

Also, to anyone: as I look at the visualization, I wish it had more explanatory power. If there's anyone out there who actually knows what they're doing with this sort of thing (this is my first go), I'd be happy to give you data, generate data, whatever. I think that the social network of investors is an important factor in what gets funded and what does not... maybe there's something to be gleaned from it.

Hi Jerry - The data visualization you and Brent put together is kind of what I was envisioning. I had not seen that before. A MySQL data dump would work fine. Would like to play around with it with some visualization tools.

Funny Sebastien was here first :). I'm a linkfluence cofounder, and we would be interested in working with you on this dataset and maybe publishing an interactive version of it (like this map: http://maps.linkfluence.net/blogopole/2009/).

Happy to share. Let me know what you want and how to get it to you. There's a lot more data than was represented in the graph posted.

I believe I tried iGraph (python) a year or a bit more ago. Managed to overwhelm my puny little macbook pro because the graph was too big. That's why I was excited to find networkx; it's pretty lightweight, it seems. At least, it worked.

Thanks for offering the data! What are my options? I am comfortable using Python to work with large text files/.csv files etc. So if you could email me the data in a compressed folder of some kind I would be perfectly happy with that...

Thank you so much for publishing all this amazing work! I would be most grateful if you could share with me the gephi file or the raw data. You can find my email address on LinkedIn - happy to connect with you.