Meta

Month / November 2014

James Watson is seriously at risk of going broke. After his 2007 comments on the link between race and intelligence, when he claimed that Africans were genetically less intelligent than Caucasians, the American molecular biology pioneer suddenly fell into isolation, drawing the contempt of public opinion and academia. Now his budget is dangerously low, and he has decided to auction his Nobel Prize medal to replenish his finances and make a couple of donations. Evidently, despite his advanced age, the need to clean up his public profile is still very strong. As I read this on The Guardian, my mind went back to 2007, when I was an undergraduate staring in disbelief at such unbelievable comments from the man whose discoveries led me, and thousands of students like me, to join biology.

As extensively explained in The Independent, Watson proposed that IQ tests conducted on African Americans confirmed a significant racial divide in intelligence, and discussed some implications for welfare policies. He claimed that the genes responsible for determining human intelligence could be found within a decade, providing experimental support for his statements.

While the controversy at the time understandably focused on racism, I have always found it quite curious that intelligence could be “written in our genes”. Before any consideration of the social implications, we should reflect on the scientific basis of what Watson said: can DNA determine how smart we are? During the past decade, as sequencing capacity grew exponentially, the belief that any possible answer in biology could be found in the DNA became dominant. Enthusiasts, and molecular biology advisors, eagerly celebrated the golden age of genomics, proposing a bright future made up of genome-wide screenings, personalised medicine, and other disturbing GATTACA-like scenarios.

Everyone seemed pretty sure that any phenotype could find its direct counterpart in the genetic code, firmly trusting the neo-Darwinian commandment that a simple relationship exists between genotype and phenotype. According to this view, even a very complex and hard-to-determine phenotypic trait such as intelligence must be the effect of some gene. Everything is thus very easy: one day we will discover the genes controlling intelligence, creativity, love and even football addiction. You don’t need a degree to understand how improbable this is. Luckily, the application of complex systems theory to molecular biology and evolution is telling a different story, and the current challenge is to understand how the phenotype is determined by independent contributions at the genetic, protein, cellular and macroscopic levels.

The very first mistake James Watson made was not his racist outburst, but his giving in to the lure of reductionism. Intelligence is the result of complex interactions at the neuronal level, and the human brain’s huge plasticity is our winning strategy in evolution. Over the years, no convincing proof of the existence of genes controlling intelligence has been provided, and the main trend in brain research is to focus on the brain’s impressive ability to change and improve. Moreover, IQ tests are highly controversial, because their ability to predict the potential of a mind is anything but demonstrated. The American molecular biologist applied a reductionist approach to a pretty complex matter, using a very weak indicator: there are no known genes controlling intelligence, and the IQ score itself is close to pointless.

Watson’s creepy positions are thus the direct consequence of a kind of “genomic delirium of omnipotence”. This confirms that in science, and in life itself, terrible things may happen if you choose the simplest route, indulging in simple answers to hard questions and leaning on shallow descriptors of complex phenomena.

Next week, December 1st–4th, big names in computing will be online to discuss coding and software development. hack.summit() is an initiative aimed at supporting non-profit coding, promoted by the online consulting service hack.hands(). On their website you can scroll through the list of participants, and the names are truly impressive.

Google Glass creator Tom Chi, BitTorrent inventor Bram Cohen, Brian Fox, who created the GNU Bash shell, Python Software Foundation director Alex Gaynor and even Microsoft Executive Vice-President Qi Lu will be online and available to talk with attendees. You’ll have the chance to interact with the speakers and ask them questions, getting in touch with an impressive cohort of top-level developers.

Registration is quite simple. You can decide whether to donate a small amount or use the Pay With a Tweet system. Revenues will be used to support nonprofit projects. The initiative’s goals are in fact to raise money for coding nonprofits, educate programmers of all languages and skill sets, and encourage mentorship among software developers.

I will definitely attend, hoping to hear some good tips from the guy at the foundation supporting my favourite language, Python.

So far, the initiative has broken a new record, attracting more than 36k participants, and registrations are still growing. So do not miss this chance: join this amazing meeting and meet your programming idols.

How do you become a bioinformatician? Many people, at different career stages, are trying to answer this question, looking for the best path to acquire the knowledge required to become a bioinformatician. Many academic institutions offer bioinformatics degrees at undergraduate and postgraduate level, and you can easily find free courses on the internet that give a fair introduction to bioinformatics. However, the fast pace of development in this field, and the huge diversity of its applications, tend to weaken the effectiveness of official courses. If science were a war, bioinformatics would be guerrilla warfare: a merciless battle where you are alone, fighting in a jungle of algorithms, big data, statistics and software solutions.

One of the blogs I enjoy most is Guillaume Filion’s “The Grand Locus“. Guillaume is a group leader at the CRG in Barcelona, and has linked his group’s informal page to a personal blog, where he writes mostly about bioinformatics and biostatistics, in an amazing and very effective mix of scientific insight and personal experience. In one of his latest posts, Guillaume tries to answer a question that is becoming very common as interest in bioinformatics grows: “How to become a bioinformatician?”. The answer given on The Grand Locus is pretty unexpected, since no particular path, language to learn, skill set, course or strategy is suggested, just three simple tips on how to change your mindset before starting your “journey in bioinformatics”. A great point indeed. You really have to get out of your “comfort zone”, and try over and over with a lot of things you used to ignore. You must also understand the importance of collaboration and community, and you definitely need to become addicted.

Still, some more practical tips may be needed, and a discussion about the very first things to learn when taking your first steps in bioinformatics could be useful to many. Honestly, I am not the best person to give suggestions. I have been in bioinformatics for only three years, am still submitting my first papers, and most probably need some good hints myself. That is why I want to share a couple of ideas with you, and ask you to bring some good criticism to the quick recipe I am going to write down.

I fear I must apologize to the experienced users who get to read this. I will go into detail, describing a lot of things that are well known to computational biologists, and I understand that this could be a bit boring.

First ingredient: the minimal biological knowledge.

Computational biology is a strongly interdisciplinary field, attracting the interest of scientists with radically diverse backgrounds. Also, the number of possible applications is pretty high, since biology is a vast and heterogeneous area of study. Trivially, the very first thing to do is to acquire the required biological knowledge. Scientists with a non-biological profile will need to train on the basics of biology. The trick is to focus on the key concepts of biology (cell structure, molecular biology, genome organization, evolution…) without being overly fussy. In the long run, anyone working in biology tends to reach a very high level of expertise. At the beginning, you are not really expected to know a great deal, but you need to understand the sense of what you are going to do. Take some internet courses, read some introductory books (cell biology books are great for this, because they summarize the key concepts of biology at the subcellular and histological level) and surf Wikipedia (which is rather despised, but still very useful if you use it properly).

This may apply to biologists as well. Of course, if you are just changing your role within your lab, you will hopefully be quite aware of your work. Still, during my long application round, I have understood that it is quite common for a bioinformatician to range over different projects and subjects. IMHO, one of the main points a bioinformatician should work on is the ability to get up to speed rapidly in a totally new biological field.

Second ingredient: get yourself to love statistics.

Math in bioinformatics is very important. Although I somewhat reject the idea that statistics is the only branch of math you will need (complex systems, fractal geometry and algorithm design may be very much needed in evolutionary studies), we can definitely regard statistics and data analysis as the common denominator of almost all bioinformatics projects. This requires an effort, since many projects involve Bayesian statistics, which is an advanced topic, classically taught at the end of any statistics course. The best option is to attend a good course, or to patiently work through a big, heavy, but up-to-date statistics book. I am exploring DeGroot’s Probability and Statistics, considered by many the most complete book around.
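Since so much of this hinges on Bayesian statistics, here is a minimal worked example of Bayes’ theorem in Python; all the numbers are invented purely for illustration:

```python
# Toy Bayes' theorem example: P(disease | positive test).
# Every number here is made up for the sake of the example.
p_disease = 0.01            # prior: 1% prevalence
p_pos_given_disease = 0.95  # test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# Total probability of a positive test (law of total probability).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: posterior probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # roughly 0.161
```

The counter-intuitive result (a positive test means only a ~16% chance of disease) is exactly the kind of reasoning a good statistics course trains you for.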

Third ingredient: the software quartet.

On the computer side, there are four elements to keep in mind: scripting, Unix, R and databases. Honestly, the very first thing you need to do, most likely before adding any other ingredient, is to focus strongly on a scripting language. A scripting language is what will ease and speed up your work; it is the thing that will free you from Excel, a real Swiss Army knife you won’t be able to do without. Scripting languages are high-level programming languages designed for quick application development. Their syntax is pretty easy, and they are quite fast to learn. I started with and love Python, but Perl is also widely used, and you could even consider Ruby, which is fairly widespread in Asia.
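To give a taste of what a scripting language buys you, here is the kind of five-minute Python sketch that replaces manual spreadsheet work; the data and column names are invented for the example:

```python
# Summarise per-gene expression values from CSV data without opening
# a spreadsheet. The file contents below are invented for illustration.
import csv
from io import StringIO
from statistics import mean

raw = StringIO("gene,sample1,sample2,sample3\n"
               "actb,10.0,12.0,11.0\n"
               "gapdh,8.0,7.0,9.0\n")

means = {}
for row in csv.DictReader(raw):
    values = [float(row[s]) for s in ("sample1", "sample2", "sample3")]
    means[row["gene"]] = mean(values)

print(means)  # {'actb': 11.0, 'gapdh': 8.0}
```

In real use you would read from a file with `open()` instead of `StringIO`, but the principle is the same: three lines of logic, and the computation is documented and repeatable.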

The second member of the quartet is R. As many will surely know, R is a development environment dedicated to statistics. One year ago, I was pretty sure that R was not needed if you knew Python and its mathematical and statistical libraries (NumPy and pandas). After joining a project involving NGS data analysis, I had to change my mind. R comes with a huge set of packages, the Bioconductor suite, which are very useful in, and deemed standards for, NGS and experimental data analysis. Actually, one could also consider using R as a scripting language itself. Very personal opinion: not the case. I am still a beginner with R, and I may be biased since I love Python, but I think a scripting language is still the best tool to process information. Also, consider that many software distributions and online databases provide APIs (Application Programming Interfaces) that let your scripts extend their functionality. For instance, Ensembl has a Perl API, and UniProt can be queried programmatically from Python, Perl and even Java. In structural biology, Rosetta comes with a Python interface (PyRosetta), and PyMOL is written in Python and allows the creation of plugins.

Third comes the operating system. If you play it cool and are geeky enough, you are most likely viewing this post on a Mac. Good point, though maybe not the best. In bioinformatics, Unix-based systems are widely used, and many job adverts will require experience in a Unix-like environment. Mac OS is a Unix-like environment, and you can easily learn how the filesystem works and practice Bash, the standard command-line shell on Unix. Unfortunately, Mac OS is not really optimized for programmers, and you may experience some bad trips. For instance, I am trying over and over to get this huge iMac in my lab to understand that I need to fucking link MySQL to Python, in order to install a library that fetches genomic sequences from the internet. No way. I warmly suggest you take your courage in both hands and install Linux. Ubuntu, user-friendly and always beautiful, is the best choice.
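As a minimal taste of everyday shell work, here is the kind of one-liner combination you end up using daily; the toy FASTA file is invented for the example:

```shell
# Write a toy FASTA file, then inspect it with standard Unix tools.
cat > /tmp/toy.fasta <<'EOF'
>seq1
ACGTACGT
>seq2
GGCCTTAA
EOF

# Count the sequences: each FASTA record starts with a '>' header line.
grep -c '^>' /tmp/toy.fasta

# Show only the sequence lines, dropping the headers.
grep -v '^>' /tmp/toy.fasta
```

Chaining small tools with pipes like this is the core Unix habit; once it clicks, a surprising amount of data wrangling never needs a dedicated program at all.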

The last member of the quartet is the database. You need to learn how to deal with the main online databases, and you will probably need to create your own to explore and analyze your data. It is very important to understand how a DB is organized, and how the management system works. Boldly, we can say that the dominant database form is the relational database. Consider learning a bit of SQL (Structured Query Language) and practicing with SQL-based software such as MySQL or SQLite. Relevantly, you can connect your scripts in Perl, Python, Ruby and even R to an SQL-based database. Very useful indeed.
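A minimal sketch of what this looks like in practice, using Python’s built-in sqlite3 module; the table and its values are invented for the example:

```python
# A toy relational database: create a table of genes, insert a couple
# of (invented) rows, and query it with SQL, all from Python.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE genes (name TEXT, chromosome TEXT, length INTEGER)")
conn.executemany("INSERT INTO genes VALUES (?, ?, ?)",
                 [("actb", "7", 3454), ("tp53", "17", 19149)])

# SQL query: which genes sit on chromosome 17?
rows = conn.execute(
    "SELECT name FROM genes WHERE chromosome = '17'").fetchall()
print(rows)  # [('tp53',)]
conn.close()
```

Swapping `":memory:"` for a filename gives you a persistent database, and the same SQL carries over almost unchanged to MySQL.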

Fourth ingredient: stay very tuned

Bioinformatics is a rapidly developing science, and novelties come pretty often. Together with a basic knowledge of the fundamental algorithms (FASTA, BLAST, Clustal…) and file formats, you must improve your ability to find new algorithms and decide which is the most appropriate for your work. On this blog, I try to share and review new software, because this is of great interest in bioinformatics. Basically, you should work on your geekery and interactivity. Search a lot, discuss on the internet, rummage around the web. This will help you quite a lot.
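On the file-format side, here is a tiny FASTA parser in plain Python, a sketch of the kind of format-handling routine worth having in your toolbox (not a production-grade parser):

```python
# Minimal FASTA parser: yields (header, sequence) pairs from
# FASTA-formatted lines, joining wrapped sequence lines.
def parse_fasta(lines):
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header: emit the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

records = list(parse_fasta([">seq1", "ACGT", "acgt", ">seq2", "GGCC"]))
print(records)  # [('seq1', 'ACGTacgt'), ('seq2', 'GGCC')]
```

For real projects a library such as Biopython already does this (and handles the edge cases), but writing a parser once is a good way to internalize the format.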

Fifth ingredient: the computer awesomeness

Many people I talk with, wet-lab guys in particular, are surprised at how I am able to spend so much time at the computer. Terminal work can be wearisome, and you need to find the best way of dealing with the machine. Over time, I have realized that I have developed a set of habits that speed up my computer work. This is very personal, as anyone will find his or her own best way to work on a PC. I can say that I have found it very useful to train myself to use the keyboard over the mouse, to keep my desktop beautiful, my screen clean and my files ordered, plus a couple of tricks to stay comfortable and zen while working. The point here is that you should not limit yourself to learning new notions, but should also improve your habits to become more effective.

The legendary Russian chess player Garry Kasparov used to say that in a game strategy it is very important to bring all your heavy pieces to the center of the chessboard. This way, you have full control of the game and of your opponent’s moves. This is a very good idea in bioinformatics too. I am fairly sure that the first thing to do is to put yourself at the center of the chessboard. Even considering the huge diversity of bioinformatics applications, and the breakneck speed at which they evolve, a proper set of skills will help you not to drown. After so many words, the main point is simply: keep calm and learn to program.

The Rosetta project’s success can be considered the greatest achievement of space exploration in European history, a victory for mankind and for anyone supporting European unification, as I do. Unfortunately, not everyone seems to appreciate, or to understand, the scale of this event.

What I am going to report is just shocking. The Italian channel Rete4 played a crucial and shameful role in the rise and establishment of Berlusconi’s twenty-year rule. Owned by the Mediaset group, the news channel used to faithfully and slavishly spread conservative propaganda, with a long list of mystifications and continuous attacks on minorities and homosexuals. As Berlusconi turned his foreign policy into a deeply anti-European and pro-Russian course, his media followed him with unjustified and false attacks on the European Union. They never dealt much with science either, often supporting creationism and funding cuts to research. After the Rosetta landing, Rete4 decided to give its opinion. What follows is a literal translation of the Rete4 commentary.

If we ever found a truly fascinating celestial body, even more so than the Moon, it would be a comet, with its shiny tail, visible even during the day. (…) Even if almost nobody knew it, since 2004 the ESA (European Space Agency) has been working to spoil this image for us. Ten years ago the Rosetta probe was launched, and after a very long and lonely voyage it has now reached comet 67P, more than 800 million km away from the Sun. Here are the pictures, the crude pictures, of a big dusty stone. On this rock a robot will land in a few hours, and it will drill a hole in the surface. It almost hurts to know that the drill was built in Italy. A stone, nothing more than a stone, without even the dark fascination of the terrible comet in the movie Armageddon. Scientists are most likely the only ones excited about this. They explain that the Rosetta mission will bring us insights into the origins of life, which probably arrived on Earth right on the tail of a comet. The mission has cost, so far, more than 100 million euros. Honestly too much, even to retrieve an archaeological relic of the universe.

There are three main things I want to point out about this creepy commentary. The first is evident: the low cultural level of the person who wrote it. The superficiality with which the whole event is reported, and the poor language, prove that the editorial staff ignores even the basics of science expected at school level. The second point is also relevant, even if less evident. The commentary contrasts the beauty of art and religion with the cruel realism and rationality of science. This is, unfortunately, very common in Italy. For historical reasons, science has always been deemed a minor culture compared with the dominant humanistic culture. The third thing to underline is the money. They point out that the project cost more than 100 million euros, but they “forget” to mention that the amount was spent over ten years and that, doing the math, each European paid about 3.50 euros to get this extraordinary result.

I understand that if I keep reporting all the defects of my country, readers will get bored, but I think there are some points of interest in this story. The “war on science” is a common feature of conservative politics all around the world, and Sarah Palin’s awkward statement on Drosophila research is just one of many possible examples. In any context, the war on science proceeds with the same strategy. Trust in science is weakened in public opinion by spreading misconceptions, superstitions and conspiracy theories. This effort is reinforced by systematic cuts to public education. This provides suitable ground for decimating public research funding. The ultimate aim is to kill critical thinking, in order to better control public opinion and ease populist and racist campaigns.

Unlike in the US, where the importance of research funding is still well understood at government level, in Italy this attack has worked all too well. Schools are in the red, public research funding has practically ceased to exist, our school students are the most ignorant in Europe, and racism is spreading with alarming rapidity.

Luckily, things are changing now that Berlusconi’s rule has gone belly up. Most commenters on social networks are literally rising up against Mediaset’s TV channel, whose commentary didn’t kill the interest in the Rosetta project’s historic achievement.

A “machine learning” algorithm is defined as an algorithm able to change its structure and functioning according to the data submitted to it. In other words, a machine learning algorithm is capable of learning from data and being refined after implementation. Nowadays, many structural biology (e.g. PSIPRED, JPred), bioinformatics (HMM-based software) and systems biology (network analysis and database comparison) algorithms rely on machine learning methods, and an insight into the basic principles underlying them is very useful to anyone working on software development. Unfortunately, an extensive study of such an advanced topic may be pretty tough for someone with a biological background.
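To make the “learning from data” idea concrete, here is a deliberately minimal sketch of one of the simplest possible classifiers, a nearest-centroid model; the 2-D data points are invented for the example:

```python
# Nearest-centroid classifier: "training" computes one centroid per
# class; prediction assigns the class of the closest centroid.
def train(points, labels):
    centroids = {}
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroids[label] = (sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members))
    return centroids

def predict(centroids, point):
    # Squared Euclidean distance is enough for comparing distances.
    return min(centroids,
               key=lambda c: (centroids[c][0] - point[0]) ** 2
                           + (centroids[c][1] - point[1]) ** 2)

model = train([(0, 0), (1, 0), (9, 9), (10, 10)], ["a", "a", "b", "b"])
print(predict(model, (8, 8)))  # 'b'
```

The structure of the model (the centroids) is entirely determined by the data it was shown, which is the defining property described above; real methods like HMMs or neural networks elaborate enormously on this same idea.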

Surfing YouTube, I was really pleased to find the mathematicalmonk channel. Actually, I have no clue who this guy is, but I am pretty sure he did a good job with his tutorials. Along with other advanced mathematical topics, machine learning is explained in a 160-video playlist, where the author presents the basic concepts of machine learning with simplicity and great clarity. The course goes through all the major topics needed for an introduction to machine learning methods, and it’s a perfect point from which to start your exploration of machine learning.

Above this post, you can play the introductory video to get an idea of the topic and the kind of lessons proposed. The whole playlist can be found by following this link.

The paper I am going to share today caught my attention because it merges two fundamental topics that are rather neglected in evolutionary biology: plant genomics and the role of cis-regulatory elements in evolution. Plant biology provides an excellent framework for studies in genome evolution. Even if the knowledge acquired is applied mostly in agricultural biotechnology, which is a field that thrills me a lot, plants represent a perfect setting in which to understand generally valid principles of genome evolution. The role of cis-regulatory elements, and their contribution to organism differentiation, is generally understood to be very relevant, but I sense that this topic remains quite overlooked and a bit obscure.

That is why I particularly enjoyed reading this brand new paper, authored by Zachary H. Lemmon and co-workers, and developed in a collaboration between the University of Wisconsin and Cornell University (Ithaca, NY). In plant biology, one of the main points of interest is obviously the process of domestication, and its analysis from a molecular point of view. In this paper, maize domestication is analyzed through a genomic comparison of domesticated and non-domesticated species within the Zea genus.

To examine the differences in gene regulation arising during the domestication of maize from its wild progenitor, teosinte, an allele-specific expression analysis is performed on pure lineages and hybrids under different trans and cis regulatory regimes. The investigation focuses on three tissues (ear, leaf and stem) from different developmental stages. RNA-seq analysis confirms a consistent cis-regulatory divergence in genes significantly overlapping those under selection during domestication and crop improvement. This suggests an important role for cis-regulatory elements in maize evolution.

While the authors argue for the relevance of this result in plant biology, we can also see that this study highlights the importance of the regulatory genome in evolution, and the great potential that plant biology has as a framework for evolutionary biology.

How do you keep your work reproducible and replicable? Once you have finished your genome-wide analysis, NGS data mining, coding, homology modeling or biostatistics, how can you make your entire job available and testable by other people? The need for a proper strategy to guarantee the reproducibility of research is a major question in almost any branch of science, and it becomes dramatic in computational research. The large number of methods available, and the massive quantity of information produced, tend to frustrate our efforts to keep our work replicable and reproducible.

Even if anyone working with computers develops very personal working habits, there are a trick or two that can improve them, in order to make your work more reproducible. Broadly speaking, this is what is pointed out in a very recent paper published in PLOS Computational Biology. More than a couple of tricks, a real decalogue is proposed to improve the reproducibility of your work: Ten Simple Rules for Reproducible Computational Research, which the authors argue is pretty effective. While I suggest you read this very good paper carefully, I will just list and discuss each rule.

Rule 1: For Every Result, Keep Track of How It Was Produced. Annotations are fundamental. Very often, one ends up tagging data quickly, just so as not to forget where they came from. But an extensive and explanatory legend will help your co-workers and reviewers understand what you have done.

Rule 2: Avoid Manual Data Manipulation Steps. Take your time, be patient and write down a couple of lines of code. Manual data manipulations are the first source of human error and reduce the verifiability of your work.
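As a trivial illustration of this rule, here is a filtering step written as a Python function rather than done by hand in a spreadsheet; the column names and threshold are invented for the example:

```python
# Scripted instead of manual: the filtering criterion is written down,
# documented, and applied identically every time the script is re-run.
def filter_expressed(rows, threshold=5.0):
    """Keep only rows whose 'expression' value reaches the threshold."""
    return [r for r in rows if r["expression"] >= threshold]

data = [{"gene": "actb", "expression": 11.0},
        {"gene": "dusty", "expression": 0.4}]
print(filter_expressed(data))  # [{'gene': 'actb', 'expression': 11.0}]
```

Unlike deleting rows by hand, the criterion (threshold 5.0) is now explicit and versionable, and re-running the analysis on updated data takes seconds.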

Rule 3: Archive the Exact Versions of All External Programs Used. Boring and perhaps overly meticulous, but sometimes fundamental.

Rule 4: Version Control All Custom Scripts. This is something people tend to underestimate, but it is still very important. I actually need to improve on this part too; to get started, you can have a look here.

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats. A fair tab-separated file, or a CSV, is always an act of love towards your collaborators.

Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds. I have never had to use randomness myself, but recorded seeds make stochastic analyses repeatable.
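A minimal Python illustration of why noting the seed matters: fixing and recording it makes “random” results repeatable run after run.

```python
# Fixing the random seed makes a stochastic analysis repeatable:
# the same seed always yields the same sequence of draws.
import random

random.seed(42)          # record this value alongside the results
sample_a = [random.randint(0, 100) for _ in range(5)]

random.seed(42)          # same seed -> identical "random" draws
sample_b = [random.randint(0, 100) for _ in range(5)]

print(sample_a == sample_b)  # True
```

Without the recorded seed, nobody (including your future self) could regenerate exactly the same subsampling or permutation test.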

Rule 7: Always Store Raw Data behind Plots. Figures are summaries; the raw numbers behind them should stay available so anyone can redraw and check them.

Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected. Do not share summarized data only, but let your reviewers retrace all the steps you took.

Rule 9: Connect Textual Statements to Underlying Results. Results and their interpretation must be clearly connected.

Rule 10: Provide Public Access to Scripts, Runs, and Results. In short: keep your work transparent, and no one gets hurt.

When I was in high school, my Italian literature professors used to teach me that “a text is good if it is self-explanatory”. This means that readers must be able to understand your writing even if you are not there to explain it. This is more or less the simple principle one can adopt to improve the reproducibility of computational analyses.

An improvement in working habits can definitely help, even if it is not going to be enough. The role of journals, and the need to set up shared rules imposing greater transparency and reproducibility, is also widely discussed. Ultimately, as clearly pointed out in this article in Science, success in improving reproducibility will come from collaboration between scientists and journals: more sustainable working habits, and greater transparency in published results.