The Nutritious Rice for the World (Rice) project, a World Community Grid BOINC project, ended a few weeks ago. BOINC (Berkeley Open Infrastructure for Network Computing) is a non-commercial program and infrastructure which allows volunteers to donate their computers’ spare computing resources to very interesting, computing-intensive scientific projects. Many people around the world contributed their CPU resources to help figure out the structure of the proteins of the most common strains of rice. In the end, some 25,761 years of CPU time were contributed to the project. IBM heavily supported this project through their World Community Grid (WCG) program, offering Rice a massive userbase and community.

Rice is one of the most common foods in many parts of the world. It’s in the interest of us all to find varieties and breeds of rice which are most nutritious or resistant to pests; the project’s goal is to find out which varieties of rice should be interbred with others to give the best results, so that we’ll get new strains of rice which are harder, better, faster, stronger.

Dr. Ram Samudrala

A lot of BOINC users who contributed to the project (like myself) are now asking themselves a lot of questions. Who are the people behind the scenes? How much work is necessary to get a project like this into operation? What was IBM’s role? What will happen with the contributed results? And finally, who will benefit from the project?

Tell us a little about yourself and how you got involved in the Rice project.

Ram: I’m a professor researching computational biology at the University of Washington, Seattle. My overarching interest has been to understand and model how the genome of an organism (genotype) specifies its behaviour and characteristics (phenotype). We develop computational algorithms to this end that are applied to whole genomes, and we work on many organisms. Rice was specifically chosen since our collaborators at the Beijing Genomics Institute had just finished sequencing it (and we annotated the refined version), and I also got a $1.9 million grant from the US National Science Foundation (NSF) to predict the structures and functions of all proteins encoded by the rice genome. We developed algorithms to do this and applied them to all rice proteins. Then IBM came along and offered us the means to redo some of our calculations on the most difficult proteins using the WCG, and then we ported our code over to work on the Grid.

When was the first time you considered using voluntary distributed computing for your project?

Ram: Since the days of SETI@home, and since we built our own local clusters to do structural computational biology, but porting our code to BOINC was always an inertial challenge.

Did you consider using other DC infrastructures besides BOINC, like distributed.net? If yes, why did you decide to use BOINC?

Ram: No, we used BOINC since it was what was supported by IBM WCG.

Have you considered asking the NCSA for computing resources?

Ram: Yep, but it’s a cumbersome process, like applying for a grant, and again, it means porting software to work on different architectures. The barrier is that we get grant money to do research, not to develop software. I have used NIST supercomputing resources in the past.

You said you would need 200 years of computing time using your available resources. Besides voluntary distributed computing and the University of Washington, were there other universities or institutes directly contributing computing resources to your project?

Ram: Not for this project, no.

Rice BOINC Splashscreen

You were using algorithms from the Protinfo website. Which one did you actually use, and how much effort did you put into customizing it for use with BOINC? Can you tell us whether those algorithms and their implementation are released under a free license?

Ram: It’s the Protinfo AB algorithm, which is our ab initio or de novo simulation protocol. IBM spent a fair amount of time porting the code to work with BOINC. The original algorithms/software are all freely available without any claim of copyright (i.e., in the public domain).

Could you explain “de novo” and “ab initio” for non-scientists, please?

Ram: “De novo” and “ab initio” are generally translated to mean “from first principles”. In the old days, this used to mean using pure physics energy potentials for protein folding. These days, to us, it means any set of general principles that is not biased towards a particular protein or organism.

If the algorithms you used are under a free license, did you already manage to publish the modifications, if there are any?

Ram: The modifications involving the porting are with IBM and they are unpublished.

(Ed. note: Since the software was released in the public domain there’s no requirement to publish the modifications.)

IBM helped you out in customizing the protein-prediction algorithms for various platforms. Can you tell us how much they contributed?

Ram: All the customisation was done by IBM engineers. We just gave them the original software and ran sanity checks on the output. I’m a strong free software and anti-IP proponent, to the degree that I encourage commercial use without restrictions on the software (people can always use the public domain versions if they want to).

Rice Terraces

How much time did you save by using the World Community Grid’s infrastructure compared to setting it all up on your own, like other projects do?

Ram: IBM took about six months or so to port our software, so I presume it would’ve required that kind of an investment. Keep in mind that they had a lot of prior experience with BOINC. IBM now maintains the code, does the PR, and runs the predictions for us. I’d say this would be a full-time programmer/sysadmin type of person, and if I had that extra money, I’d rather spend it on someone doing the basic research.

If there are flaws in BOINC, which would you like to see addressed first?

Ram: I can’t think of any in the way we did it with IBM, but without IBM, the PR machine has to be powerful to get people on board. It’s more than just recruiting people, but also motivating them, as IBM does with badges, giving them a sense of community, and providing a support infrastructure. This is hard for a research lab to do on its own (it can be done, but whether it’s really the best use of our talents is the question).

Programming and debugging is an iterative process. Looking at your source-code repository, how many releases of the software were necessary until you got the cow flying?

Ram: For this case, internally we probably had about 10 or so iterations in total, but the basic science part of the software is something that has evolved over 18 years.

How did you do beta-testing – did you use the publicly available beta-projects at WCG? Or were you actually just doing it in your lab?

Ram: It was mostly in our group. We just submitted sequences for which we knew the answers, and we did a dry run initially with the same sequences.

I’m curious there – were these structures predicted by other algorithms, or was that done the hard way, using X-ray crystallography?

Ram: These were done the hard way, at the bench. These are our gold standard for when we know we’re right or wrong, so we benchmark our methods against all this. When we did the rice project, we did sequences with known answers to see how well things would work and to make sure there was no chance of anything going wrong.

Dr. Ling-Hong Hung

What was it like getting in touch with the community? Was the feedback helpful? How many people from your team were actually dealing with the community?

Ram: At its peak, we had three people dealing with the community: our sysadmin and project lead Michal Guerquin, our programmer and scientist Ling-Hong Hung, and myself. Opening our software to the Grid and the community definitely presented some challenges, which I believe will be the focus of our first paper. An interesting tangent of that is that we’ve had to port some of our analysis software to work on GPUs so we could handle all this data. So there are some good technological developments here that we’ll be writing about shortly.

Michal Guerquin

A lot of people are concerned about “Frankenfood”. Your project’s website explicitly states that this is not about genetic engineering, but about finding the most nutritious rice strains for interbreeding with other rice crops. Is there anything you’d like to explain to people who are still concerned?

Ram: We’re simply extending what farmers have been doing for millennia in a more rational way, and also what has been going on in nature for billions of years. The problem to us is scientific, and all knowledge that is produced (which from our end will be completely free and transparent) can be used in various ways according to the will of the people. But we have governments and politicians to handle the deeper societal implications. What I mean by this is that people should petition their representatives, as they are doing successfully in many parts of the world, to decide where to go with genetically modified organisms, which I see as ultimately having a socioeconomic/political solution.

Your project is one of the very few with a fixed end; almost all other projects are handing out work-units for new phases. How come you’re finished now? Is everything in the rice genome now analyzed from a computational point of view, with nothing else left to do?

Ram: Not at all. We obtained a huge amount of data and we’re now pressed to analyse it. I can honestly say that we were overwhelmed by this data. My goal as a scientist, though, is not just to develop technical tools and produce large tables and graphs, but to try to come up with something tangible that is prioritised and can be tested at the bench, that really changes the makeup of rice in a desired manner. The computations and the Grid are the means by which we arrived at this step, but our job now is to figure out where the best low-hanging fruit is, in collaboration with rice researchers (which we are doing with researchers around the world, including IRRI, Philippines). [Ed. note: IRRI, International Rice Research Institute]

Focussing on the data: Now that you know what those proteins really look like, where do you draw a line and say “this protein is more nutritious than others”? My basic understanding is that the nutritious parts of rice are actually carbohydrates (starch), proteins and some fat. How should I imagine this analysis?

Ram: So the proteins we’re talking about are gene products, which carry out almost all the functions in rice (or any other organism). So we use “protein” to refer to a molecule that does this, rather than the nutritional use of the word “protein”, which refers to these biological molecules broken down and aggregated (see “Protein” and “Protein (nutrient)” in Wikipedia).

By nutrition we mean anything that leads to a higher level of bioavailable substances such as dietary minerals and vitamins. In rice, examples include elements like iron or organics like vitamin A. Incidentally, the “golden rice” GMO is a product of Monsanto that has higher beta-carotene, a precursor of vitamin A (“Golden Rice” at Wikipedia). We’d like to get to something like that by crossbreeding, without the use of genetic engineering, working on both micro- and macronutrients.

So in the end, we need to be able to create a rice strain that has enriched nutrients and is perhaps better than current strains in terms of yield and/or hardiness. Before we go off and start crossing rice, there are a number of molecular biology bench experiments that can be done to tell whether the predictions we make about the activity of certain proteins are correct, so we’d do those first.

Do you plan to publish all your results in an Open Access Journal?

Ram: Yep, that would be the ideal. Publishing in Open Access Journals also sometimes costs money. I’m not a big fan of the “pay to publish” model – it’s not a lot of money, and some scientists have grants to do this, but it’s not a good principle.

Thank you very much for this interview!

Ram: Thanks; I enjoyed the questions!

Dr. Ram Samudrala is a tenured Professor at the University of Washington, Seattle. He’s the head of the Nutritious Rice For The World project and one of the inventors of the protein-prediction algorithms it used. He’s a prolific contributor of scientific papers and generally a very nice guy I’d like to buy a drink.

This work and its content are licensed under a Creative Commons license.
The rice picture is copyrighted and CC-BY-SA by Flickr-user kadaoor.
The rice-paddy picture is copyrighted and CC-BY by Flickr-user ~MVI~.
The pictures of the team members were used by permission of the Rice team.
The BOINC splashscreen is copyrighted by IBM and the World Community Grid and was used with permission.

The NCSA wrote a very easy-to-understand, yet quite complete article explaining David Baker’s Rosetta project, a theoretical approach to deducing a protein’s structure using computer simulations.

Things I learned from this article:

The code does not start with a “flat” protein molecule which it then wiggles around, but with a “homologous known protein structure” as a starting point. I don’t understand whether that’s good or bad, but it limits the number of permutations to be checked.

David created a portal known as Robetta, where other biologists can submit their models to be crunched.

The Great Internet Mersenne Prime Search reports that they’ve probably found the 44th Mersenne prime:

“On September 4, 2006, a computer reported finding the 44th known Mersenne prime. Verification will begin shortly, probably taking a week or so to complete. If it is verified, this will be GIMPS’ tenth prime!”
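The verification GIMPS mentions relies on the Lucas–Lehmer test, the standard primality test for Mersenne numbers. Here is a minimal sketch; real verification runs use heavily optimized FFT-based multiplication rather than this naive version:

```python
def lucas_lehmer(p: int) -> bool:
    """True if the Mersenne number 2**p - 1 is prime (p must be an odd prime)."""
    m = (1 << p) - 1             # the Mersenne number M_p
    s = 4                        # s_0 = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m      # s_{i+1} = s_i^2 - 2  (mod M_p)
    return s == 0                # M_p is prime iff s_{p-2} == 0

# 2^11 - 1 = 2047 = 23 * 89 is composite; the other exponents give Mersenne primes.
print([p for p in (3, 5, 7, 11, 13) if lucas_lehmer(p)])  # → [3, 5, 7, 13]
```

A full verification of a candidate like the 44th prime repeats exactly this iteration, just on a number millions of digits long, which is why it takes about a week even on fast hardware.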

Probably the most pressing request from users in the last BOINC user survey was definitely “introduce a fairer credit system”. It’s still kind of frustrating that some projects hand out lots of credits per CPU-hour while others are more close-fisted with their credits. And there’s also the issue that we have “calibrating” BOINC clients which sail around the known credit issues and manipulate the claimed credit for a work-unit.

Some people consider this cheating; others claim it’s self-defence – their argument being: “why should we get less credit in total even though we crunch more data per day?”

Both have a point, so some projects finally decided to move away from the naïve BOINC credit scheme (which is based on the internal benchmarking algorithm) and create their own, CPU-hour-based scheme.
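The difference between the two schemes can be sketched like this; the 200-credits-per-day reference rate and the flat 10-credits-per-CPU-hour rate are illustrative assumptions for this sketch, not any project’s actual numbers:

```python
SECONDS_PER_DAY = 86_400

def benchmark_credit(cpu_seconds: float, whetstone_mflops: float,
                     dhrystone_mips: float) -> float:
    """Benchmark-based claim: scales with how fast the host *says* it is,
    so a 'calibrating' client that inflates its benchmarks also inflates
    its claimed credit."""
    speed = (whetstone_mflops + dhrystone_mips) / 2 / 1000.0
    return cpu_seconds / SECONDS_PER_DAY * 200 * speed  # assumed 200/day reference

def fixed_credit(cpu_seconds: float, per_cpu_hour: float = 10.0) -> float:
    """Project-defined flat rate per CPU-hour: immune to client-side
    benchmark manipulation."""
    return cpu_seconds / 3_600 * per_cpu_hour

# One day of CPU time on a host reporting doubled benchmark scores:
print(benchmark_credit(SECONDS_PER_DAY, 2000, 2000))  # → 400.0 (inflated claim)
print(fixed_credit(SECONDS_PER_DAY))                  # → 240.0 (unchanged)
```

The point of the sketch: the second scheme removes the client-reported benchmark from the equation entirely, which is exactly why projects fed up with calibrating clients switched to it.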

Why would someone who’s into science care about the credit system anyway? There are several reasons: motivation and individual success are the absolute basis for public voluntary distributed computing, something some people out there still haven’t understood. If you want to build up and maintain a large user base, you need to give them incentives: credits, public blessings, and – importantly – constant reports about the project’s success which show more than just what percentage of the project is already done, like RC5 does. (OK, I have to admit, there isn’t much to report in the RC5 project, but you get my point, don’t you?)

And that’s the reason why you have to care about how many credits you issue and how you discuss the credit issue in public – never underestimate the so-called “credit whores” – they’re your user base and might wander off to projects which hand out more credits. Once you’ve lost a user, you’ll most probably never get him back.

Be opportunistic and go for the high-performers, even if they’re just after the credits. Be nice to your users and give them real reports every couple of weeks. Participate in the forums and give your users feedback. If possible, organize parties to meet your users (no one ever said you have to pay). Optimize your science application and be as fair as possible with the credits. Take rants and criticism seriously. If people start optimizing your science application: embrace the changes and let them take part in the validation process.

Even if tuning your science application to be as efficient as possible takes a lot of effort, remember: your users will thank you because they can crunch more data, and you push your project onto a new level.

Corrections:

(1)

Bernd Machenschalk from the Einstein@Home project correctly pointed out that they did not change their credit system, but only the calibration of the system they introduced with the S5 run.

Sony and folding@home have started working together to develop a native folding@home application for the PlayStation 3 which uses the Cell CPU. The BBC claims that 10,000 PlayStation 3 consoles would deliver a raw processing power of “a thousand trillion calculations per second” – read: 1 PetaFLOPS.
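The BBC’s figure is easy to sanity-check with back-of-the-envelope arithmetic: a thousand trillion is 10^15, so each console would have to sustain about 100 GFLOPS.

```python
total_flops = 1e15       # "a thousand trillion calculations per second"
consoles = 10_000
per_console = total_flops / consoles
print(f"{per_console / 1e9:.0f} GFLOPS per PS3")  # → 100 GFLOPS per PS3
```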

“A graphical interface, also being developed between Sony and FAH, will eventually allow users and the scientists to look at the protein from different angles as it folds in real-time.
The graphics application is currently undergoing tests and is expected to be finished by September.
When the program is released to PS3 owners, the scientists say they will be able to “address questions previously considered impossible to tackle computationally”.

Ben Owen of the Einstein@home science team posted an update about the ongoing efforts; the posting was made in the Science forum, not on the front page, where one would expect it.

He reports that the National Science Board officially certified the project as having “reached the initial design goal”, which means “we’re officially in business”. He also points out that the S5 raw data is twice as good as S4 was; they had some problems with the precision of their interferometers, mostly due to construction work outside the L1 site.

RALPH@home is the official alpha test project for Rosetta@home. New application versions, work units, and updates in general will be tested here before being used for production. The goal for RALPH@home is to improve Rosetta@home.

So if you’ve got any spare CPU cycles and want to help improve cutting-edge Rosetta@home applications, go and sign up!