Friday, August 28, 2009

I tip my hat to Will for showing me this little command line trick. PLINK's output looks nice when you print it to the screen, but it can be a pain to load the output into Excel or a MySQL database because all the fields are separated by a variable number of spaces. This trick will convert a variable-space delimited PLINK output file to a comma delimited file.

You need to be on a Linux/Unix machine to do this. Here's the command. I'm looking at results from Mendelian errors here. Replace "mendel" with the results file you want to reformat, and put this all on one line.

cat mendel.txt | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > mendel.csv

You'll have created a new file called mendel.csv that you can now open directly in Excel or load into a database more easily than you could with the default output.

If you're interested in the details of what this is doing here you go:

First, you cat the contents of the file and pipe it to a command called sed. The thing between the single quotes in the sed command is a substitution built around a regular expression, which works much like a find-and-replace in MS Word. It searches for the pattern between the first and second slashes and replaces it with whatever is between the second and third slashes. The -r option turns on extended regular expressions (so the + works without a backslash), the "s" before the first slash means substitute, and the "g" after the last slash means replace every match on the line rather than just the first one.

/^\s+// is the first regular expression. \s is special and it means search for a whitespace character. \s+ means search for one or more whitespace characters in a row. The ^ means only look for it at the beginning of the line. Notice there is nothing between the second and third slashes, so it will replace that leading whitespace with nothing. This part trims any whitespace from the beginning of the line, which is important because in the next part we're turning any remaining whitespace into a comma, and we don't want the line to start with a comma.

/\s+/,/ is the second regular expression. Again we're searching for a variable amount of whitespace but this time replacing it with a comma.
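
If you want to see it work before pointing it at real results, here is a minimal sketch. The file names demo.txt and demo.csv and the values in them are made up for illustration, and it assumes GNU sed (the one on most Linux machines). It folds both substitutions into one sed call, and adds a third one that is not in the command above: it strips trailing whitespace, in case your output lines end in spaces, since those would otherwise become a trailing comma.

```shell
# Build a tiny sample file padded with spaces the way PLINK output is
# (hypothetical values, for illustration only).
printf '   FID    IID   CODE \n     1      1      2 \n    10      3      1 \n' > demo.txt

# One sed call, three substitutions applied in order:
#   1. strip whitespace at the start of the line
#   2. strip whitespace at the end of the line (an addition, not in the original command)
#   3. turn every remaining run of whitespace into a single comma
sed -r -e 's/^\s+//' -e 's/\s+$//' -e 's/\s+/,/g' demo.txt > demo.csv

# Show the result
cat demo.csv
```

Each line of demo.csv should come out with single commas between the fields and nothing extra at either end.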

Tuesday, August 25, 2009

'It’s stunning to see a genetic modification like this,' developmental geneticist Douglas Mortlock of Vanderbilt University in Nashville, Tenn., says of the new study, published online July 16 in Science. 'This is the gene that makes wiener dogs short-legged.'

Thursday, August 13, 2009

While not directly related to genetics, this is an excellent example of well-designed data representation. The New York Times reports the results of a survey of average time spent on various activities through the day by different groups of people.

The graphic is essentially a stacked density plot with time (24 hours) on the X-axis. Clicking on a different group of individuals provides a very smooth transition to the new density distribution, allowing an animated visual comparison. In some ways, this animated version provides an easier comparison than showing multiple versions of the same figure. Furthermore, there is just something compelling about this figure that begs you to examine it more closely...

Logan recently emailed me an article in the New York Times about single-molecule DNA sequencing and I realized I knew next to nothing about the new and emerging technology that will change the way we do association studies (that is, if we're still even trying to find genetic associations in the first place). The Wellcome Trust posted a news feature a few weeks back giving brief explanations and short videos on DNA sequencing, starting with the old Sanger method, then the second generation 454 and Illumina (Solexa) technologies. They also give a quick overview of, and links to, some of the 3rd generation technologies in the pipeline, including Pac Bio, Oxford Nanopore, and Complete Genomics.

Wednesday, August 12, 2009

The Systems Biology Graphical Notation (SBGN) project is an effort to standardize the graphical notation used in diagrams of pathways, biochemical processes, and cellular processes studied in systems biology.

SBGN defines a comprehensive set of symbols with precise semantics, together with detailed syntactic rules defining their use and how diagrams are to be interpreted. By standardizing the visual notation, SBGN can serve as a bridge between different communities in research, education, publishing, and more. The real payoff will come when researchers are as familiar with the notation as electronics engineers are familiar with the notation of circuit schematics. If researchers are saved the time and effort required to familiarize themselves with different notations, they can spend more time thinking about the biology being depicted.

Tuesday, August 11, 2009

Hadley Wickham, creator of the previously mentioned R plotting system ggplot2 and author of a forthcoming book from Springer, is teaching a workshop in data visualization using R, ggplot2, and GGobi. Unfortunately this workshop conflicts with IGES and ASHG this year, but he mentioned the possibility of holding a workshop here at Vanderbilt if there is enough interest. Leave a comment or email me if you'd be interested in attending this workshop if it is held at Vanderbilt.

Thursday, August 6, 2009

That's the title of a good article published yesterday in the New York Times about the huge demand for statistics skills in the job market, with statistics becoming "the sexy job in the next 10 years" as Google's chief economist puts it. Now I just need to find one of these "don't drink and derive" t-shirts...

Wednesday, August 5, 2009

I've used this a little bit recently. Pubget indexes essentially everything that PubMed does, except you get the PDF you're looking for right away. It has lots of other useful tools as well. I sent one email to the Pubget team and CC'd the biomedical library, and a few days later they had worked it out so Pubget recognizes Vanderbilt's subscriptions. If you're at Vanderbilt, go to http://vanderbilt.pubget.com/. Otherwise just use http://pubget.com/ and select your institution from the dropdown list, or email them if it's not there.

The one thing I've found is that they don't index things as quickly as PubMed, so you might have a hard time finding Advance Online Publications using Pubget.