Ramblings on social science, social networks, statistics, data analysis, computing, game theory and the like

Three days ago Nature published a note commenting on the recent heated social media discussion about whether MS Word is better than LaTeX for writing scientific papers. The note refers to a PLOS ONE article by Knauff & Nejasmic reporting a study on word-processor use. The overall result of that study is that participants who used Word took less time and made fewer mistakes in reproducing the probe text than those who used LaTeX.

I find it rather funny that Nature picked up the topic. Such discussions have always seemed rather futile to me (de gustibus non disputandum est, and the fact that some solution A is better or more “efficient” than B does not necessarily lead to A becoming accepted, as the QWERTY vs Dvorak keyboard layouts demonstrate) and far away from anything scientific.

As for myself, I like neither Word nor its Linux counterparts (LibreOffice, AbiWord, etc.); let’s call them WYSIWYGs. First and foremost, because I believe they are very poor text editors (compared to Vim or Emacs): it is cumbersome to navigate and search longer texts. The fact that it is convenient to read a piece of text in, say, Times New Roman does not mean that it is convenient to write in it. Second, when writing in WYSIWYGs I always have the impression that I am handcrafting something: formatting, styles and so on. It is like sculpting: if you don’t like the result, you need to get another piece of wood and start from the beginning. All that seems to counter the main purpose for which computers were developed in the first place, which is taking over “mechanistic” tasks and leaving “creative” ones to the user.

I like that the Nature note referred to Markdown as an emerging technology for writing [scientific] texts. If you do not know it, Markdown is a lightweight plain-text format, not unlike Wikipedia markup. Texts written in Markdown can be processed to PDF, HTML, MS Word and so on. More and more people are using it for writing articles or even books. It is simple (plain text) and lets you focus on the writing.

Last, the note still repeats the popular misconception that one of the downsides of LaTeX is the lack of a spell checker…


A parallel coordinates plot is one of the tools for visualizing multivariate data. Every observation in a dataset is represented by a polyline that crosses a set of parallel axes corresponding to the variables in the dataset. You can create such plots in R using the function parcoord from package MASS. For example, we can create such a plot for the built-in dataset mtcars:
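Something along these lines (the blue-to-red ramp is built with colorRampPalette; its 100-step resolution is an arbitrary choice of mine):

```r
library(MASS)

# Build a 100-color blue-to-red ramp and assign each car a color
# according to which bin its mpg value falls into
k <- 100
ramp <- colorRampPalette(c("blue", "red"))(k)
mpg_bin <- cut(mtcars$mpg, breaks = k, labels = FALSE)

# One polyline per car, crossing one vertical axis per variable
parcoord(mtcars, col = ramp[mpg_bin])
```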

This produces the plot below. The lines are colored using a blue-to-red color ramp according to the miles-per-gallon variable.

What to do if some of the variables are categorical? One approach is to use polylines of different widths. Another is to add some random noise (jitter) to the values. The Titanic data is a cross-classification of Titanic passengers according to class, gender, age, and survival status (survived or not). Consequently, all variables are categorical. Let’s try the jittering approach. After converting the cross-classification (an R table) to a data frame, we “blow it up” by repeating observations according to their frequency in the table.
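In code, the “blow up” and jitter steps might look like this (the jitter amount of 0.3 is an arbitrary choice):

```r
library(MASS)

# One row per passenger: repeat each cell of the table `Freq` times
tit <- as.data.frame(Titanic)
d <- tit[rep(seq_len(nrow(tit)), tit$Freq), 1:4]

# Recode the factors as numbers and add uniform noise so the lines separate
m <- apply(sapply(d, as.numeric), 2, jitter, amount = 0.3)

# Red lines for passengers who did not survive
parcoord(m, col = ifelse(d$Survived == "No", "red", "grey"))
```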

This produces the following (red lines are for passengers who did not survive):

It is not so easy to read, is it? Did the majority of 1st-class passengers (bottom category on the leftmost axis) survive or not? Most of the women from that class certainly did, but in aggregate?

At this point it would be nice, instead of drawing a bunch of lines, to draw segments for different groups of passengers. Later I learned that such a plot exists and even has a name: an alluvial diagram. They seem to be related to Sankey diagrams, blogged about on R-bloggers recently, e.g. here. What is more, I was not alone in thinking about how to create such a thing with R; see for example here. Later I found that what I need is a “parallel sets” plot, as it was called, and implemented, on CrossValidated here. That looks terrific to me; nevertheless, I would still prefer:

The axes to be vertical. If the variables correspond to measurements at different points in time, we then get nice flows from left to right.

If only the segments could be smooth curves, e.g. splines or Bezier curves…

The function accepts data as a (collection of) vectors or data frames. The xw argument specifies the position of the xspline knots relative to the axes. If positive, the knot is further away from the axis, which makes the stripes run horizontally longer before turning towards the other axis. Argument gap.width specifies the distances between categories on the axes.
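For illustration, a call along these lines (assuming the function is alluvial from the alluvial package; the particular values of xw and gap.width are arbitrary):

```r
# A small two-axis example: Class vs Survived, aggregated from Titanic
tit <- as.data.frame(Titanic)
tab <- aggregate(Freq ~ Class + Survived, data = tit, sum)

if (requireNamespace("alluvial", quietly = TRUE)) {
  alluvial::alluvial(
    tab[, c("Class", "Survived")], freq = tab$Freq,
    xw = 0.2,        # push the xspline knots away from the axes
    gap.width = 0.1  # spacing between categories on each axis
  )
}
```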

Another example shows the whole Titanic data, with red stripes for those who did not survive.
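Something like this (again assuming the alluvial function from the alluvial package):

```r
tit <- as.data.frame(Titanic)

if (requireNamespace("alluvial", quietly = TRUE)) {
  # All four axes: Class, Sex, Age, Survived; red stripes for non-survivors
  alluvial::alluvial(
    tit[, 1:4], freq = tit$Freq, border = NA,
    col = ifelse(tit$Survived == "No", "red", "grey")
  )
}
```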


These are the slides from the very first SER meeting – an R user group in Warsaw – which took place on February 27, 2014. I talked about various “lifehacking” tricks for R and focused on how to use R with GNU make effectively. I will post some detailed examples in forthcoming posts.

Here are the slides from my Sunbelt 2014 talk on collaboration in science. I talked about:

Some general considerations regarding collaboration, or the lack of it. I have an impression that we are quite good at formulating arguments explaining why people would want to collaborate. It is much less well understood why we do not observe as much collaboration as those arguments might suggest.

Some general considerations about potential data sources and their utility for studying collaboration and other types of social processes among scientists. In particular, I believe this can be usefully framed as a network boundary problem (Laumann & Marsden, 1989).

Finally, I showed some preliminary results from studying the co-authorship network of employees of the University of Warsaw. Among other things, we see considerable differences between departments in the propensity to co-author (also depending on the type of co-authored work) and in network transitivity.

Comments welcome.


And so I wrote a post on the Future of the ___ PhD yesterday. Today I learned about this shocking story of a political science PhD looking to be employed as an assistant professor at the University of Wrocław and facing the shady realities of (parts of) Polish higher education… Share and beware.


Fill in the blank in the title of this post with the name of a scientific discipline of your choice. The Nov 1 issue of the NYT features a piece, “The Repurposed Ph.D. Finding Life After Academia — and Not Feeling Bad About It”. The gloomy state of affairs described in the article mostly applies to the humanities and social sciences, at least in the U.S., but I’m sure it applies to other countries as well, Poland included. More and more people are entering the job market with a PhD (at least in Poland, as the evidence shows). At the same time, available positions are scarce and the pay is low. It is somewhat heart-warming to know that people are self-organizing into groups like “Versatile Ph.D.” to support each other in such a difficult situation.

The article links to several interesting pieces, including “The Future of the Humanities Ph.D. at Stanford”, which discusses ways of modifying humanities PhD programs so that humanities training remains relevant in today’s society and economy. Definitely a worthy read for higher-education administrators and decision makers in Poland.


Google Reader was one of my main ways of reading the Internet. It was great for reading news and updates from many websites. For example, I had my own “R bloggers” folder within Google Reader long before Tal Galili created R-bloggers.com. Unfortunately, Google is killing Reader on July 1. There are several alternatives; just search for “google reader alternative”. Meanwhile, I switched to Feedly. It’s pretty cool, although a couple of things annoy me a lot, e.g. too many content (feed/item) recommendations, and keyboard shortcuts that differ from Google Reader’s. The mobile app (I use Android) is also great, although a bit heavy for my Samsung Ace. Nice features include being able to (1) push feed items to Instapaper or Evernote, and (2) save selected items for later reading.

And so, I just browsed my Feedly “Saved for later” folder, and here are a couple of interesting items from the last 30 days:

A recent issue of Science brings a very cool paper by Luís M. A. Bettencourt explaining the scaling properties of cities: how things like GDP, crime, traffic congestion, etc. depend on city size. Descriptively, the relationships seem to follow a simple power law (see this presentation by Geoffrey West). However, as the paper shows, explaining this is not so simple and involves considering many types of interactions and interdependencies.

To finish on a somewhat less geeky note, the Warsaw National Museum has a temporary exhibition of Mark Rothko featuring his works from the National Gallery of Art in Washington DC, which is the first Polish exhibition of Rothko’s works ever. Accompanying the exhibition, there is a lovely children’s guide by Zosia Dzierżawska.


R has a built-in collection of 657 colors that you can use in plotting functions by referring to their names. There are also various facilities for selecting color sequences more systematically:

Color palettes and ramps available in packages RColorBrewer and colorRamps.

R base functions colorRamp and colorRampPalette that you can use to create your own color sequences by interpolating a set of colors that you provide.

R base functions rgb, hsv, and hcl that you can use to generate (almost) any color you want.
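A quick tour of these facilities:

```r
# 657 built-in color names
length(colors())

# Interpolate your own ramp between a set of colors you provide
ramp <- colorRampPalette(c("white", "steelblue"))
ramp(5)   # five colors from white to steelblue

# Generate (almost) any color directly
rgb(0.2, 0.4, 0.6)            # "#336699"
hsv(0.6, 0.5, 0.9)            # hue-saturation-value specification
hcl(h = 120, c = 50, l = 70)  # perceptually-based hue-chroma-luminance
```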

When producing data visualizations, the choice of proper colors is often a compromise between the requirements of the visualization itself and the overall style and color of the article/book/report that the visualization is going to be part of. Choosing an optimal color palette is not so easy, and it’s handy to have some reference. Inspired by this sheet by Przemek Biecek, I created a variant of an R color reference sheet showing different ways in which you can use and call colors in R when creating visualizations. The sheet fits A4 paper (two pages). The first page shows a matrix of all the 657 colors with their names. The second page shows, on the left, all the palettes from the RColorBrewer package and, on the right, selected color ramps available in base R (package grDevices) and in the contributed package colorRamps. Miniatures below:

Below is a gist with the code creating the sheet as a PDF, “rcolorsheet.pdf”. Instead of directly reusing Przemek’s code, I have rewritten the parts that produce the first page (built-in color names) and the part with the ramps using the image function. I think it is much simpler, involves less low-level for-looping, and is a bit more extensible. For example, it is easy to extend the collection of color ramps by adding a function name in the form packagename::functionname to the funnames vector (any extra package has to be loaded at the top of the script).
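The mechanism can be sketched as follows (the two ramp functions listed below are illustrative, not the gist’s actual funnames vector):

```r
# Resolve "packagename::functionname" strings to functions and draw
# each ramp as a horizontal strip with image()
funnames <- c("grDevices::heat.colors", "grDevices::terrain.colors")

op <- par(mfrow = c(length(funnames), 1), mar = c(1, 1, 2, 1))
for (fn in funnames) {
  f <- eval(parse(text = fn))   # turn the string into the function it names
  pal <- f(16)                  # 16 colors from this ramp
  image(matrix(seq_along(pal)), col = pal, axes = FALSE, main = fn)
}
par(op)
```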