1) Single-model and quasi-single-model methods are at a disadvantage compared to clustering methods on big sets of models.
2) The unfiltered CASP dataset may not be optimal for the evaluation of QA methods (the performance of clustering methods is overestimated when models are widespread in quality).
3) Structure evaluation is domain-based while QA is target-based. How can we switch to domain-based evaluation for QA1?

In order to address these issues in the coming CASP10 evaluation, we suggest the following changes to the QA prediction procedure.

1.1. The Prediction Center (PC) ranks all server models submitted on a target with the naïve consensus method within 2 days after closing the server prediction window on the target. As the analysis of CASP9 results shows, the correlation of the naive_consensus score with GDT_TS is expected to be very high on the whole set of server models (up to 0.97 on average), and therefore we can expect the ranking to reflect quite adequately the real quality of models. The ranking from the naïve consensus method will not be released to predictors but will be used as guidance for the PC's preparation of model test sets.

1.2. For each target, the PC sorts server models into 30 quality bins and a) releases for quality assessment 30 (or/and 60) models, one (or/and two) from each bin. This way the released representative models will cover the whole range of model accuracies in the full model set. Alternatively, b) the PC releases 30 (or/and 60) randomly picked server models. One way or the other, we will release for quality estimation a subset of models that is small enough to eliminate the advantage of clustering methods over single-model ones. The prediction window for 1.2 will be open for 3 days for all groups (server- and regular-deadline).

1.3. After closing stage 1.2, we will release the best 150 models (according to the naïve method's ranking) submitted on the target. This way really bad models will be eliminated from the dataset, and all QA methods will receive an input dataset that is likely more similar to the datasets from real-life applications. The prediction window for 1.3 will be open for another 3 days for all groups.

1.4. After closing stage 1.3, we will release all server models submitted on the target. These models will not be further used for QA prediction, but may be used by regular TS predictors.
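For concreteness, the stage-1.2 selection could look roughly like the following sketch. The equal-width binning over the consensus-score range and the keep-first tie handling are my own assumptions, not the PC's actual procedure:

```python
# Hypothetical sketch of stage 1.2: sort server models by their naive
# consensus score into 30 equal-width bins and release one model per bin.

def select_representatives(scored_models, n_bins=30):
    """scored_models: list of (model_id, consensus_score) pairs.
    Returns up to n_bins representative ids spanning the score range."""
    lo = min(score for _, score in scored_models)
    hi = max(score for _, score in scored_models)
    width = (hi - lo) / n_bins or 1.0   # guard against all-equal scores
    bins = {}
    for model_id, score in scored_models:
        b = min(int((score - lo) / width), n_bins - 1)
        bins.setdefault(b, model_id)    # keep the first model seen per bin
    return [bins[b] for b in sorted(bins)]

# 100 toy models with scores 0.00 .. 0.99; the selection covers the range.
models = [("m%02d" % i, i / 100.0) for i in range(100)]
print(select_representatives(models))
```

Empty bins are simply skipped, which is why the function returns "up to" 30 models; option b) in the proposal (random picking) would avoid binning entirely.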

To address the evaluation issue (3), we suggest calculating a global per-domain quality score 0 ≤ S ≤ 1 from the per-residue distance deviations d_i using the S-score technique (Levitt and Gerstein, PNAS 1998):

S = (1/|D|) * Σ_{i∈D} [ 1 / (1 + (d_i/d_0)^2) ],

where d_i is the predicted distance error for residue i, d_0 is a parameter, and D is the evaluation domain. We plan to use this score in addition to the global scores as submitted by the predictors in evaluations of whole targets.
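A minimal sketch of this score, assuming the predicted per-residue errors come as a plain list; the d_0 value here is illustrative, since the actual parameter choice is not specified above:

```python
# Per-domain S-score from predicted per-residue distance errors d_i.
# d0 is a free parameter (5.0 here is an illustrative choice, not prescribed).

def s_score(errors, d0=5.0):
    """Global quality score 0 <= S <= 1 over a domain's residues.

    errors: predicted distance errors d_i (in Angstroms) for the
            residues of the evaluation domain D.
    """
    if not errors:
        raise ValueError("domain must contain at least one residue")
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in errors) / len(errors)

# A perfect model (all errors 0) scores 1; large errors drive S toward 0.
print(s_score([0.0, 0.0]))  # -> 1.0
print(s_score([5.0]))       # -> 0.5 (an error equal to d0 contributes 1/2)
```

Each residue contributes between 0 and 1, so the average is automatically bounded in [0, 1] regardless of how large the predicted errors are.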

Please let us know what you think about the suggested changes by posting your comments.

Dear Organizer: I have two questions about these changes. 1) From your article, I understand that a server group has 8 days for one target in a complete cycle: 2 days (prediction window), then 3 days (1.2), and finally 3 days (1.3). Is this understanding correct? 2) After 1.1, you will sort models into 30 bins; these bins come from the naive consensus ranking and clustering. Is that your meaning? Looking forward to your reply and hoping for your help. Best wishes~

1) QA server (and regular) groups will have 6 days in a complete prediction cycle. Phase 1.1 is not a prediction period but rather the time needed for the Prediction Center to run the naive consensus method and prepare different sets of models for release. Actual quality assessment prediction, i.e. stage 1.2, for each target will start on (target release date + 3 days for server tertiary structure prediction + 2 days for preparation of datasets). Then it is 3 days for 1.2, and an additional 3 days for 1.3. 2) This is correct.

Obviously these changes will not make any big difference in separating consensus and other QA methods. Anyone who wants to do well will just collect a sufficiently large set of their own predictions, add them to the set of models to be evaluated, and then consensus methods will work very well. (They probably work very well anyhow, as long as the Prediction Center provides randomly selected models.)

This will of course add a burden to all web servers, as many people will try to gather their own predictions.

What I think is a much better solution is actually to change the target of QA evaluation, by (1) focusing on the selection of the best models (here consensus methods do quite badly) and (2) only checking the correlation on the top 25% of models.

Firstly, I agree with Arne's points, but I accept the new rules should level the playing field somewhat.

Also, what is the rationale behind omitting the target sequence from the server submission data? Many methods (both single and consensus) make use of the target sequence to evaluate model quality, for instance when checking secondary structures. Indeed, simple target sequence coverage is an important aspect of gauging model quality. It is also useful for guiding the formatting of QMODE 2 prediction output. The target sequence can be pieced together from the models, but it's handy to have the original sequence the model was built for, so why leave it out of the submission? In reality, anyone with a 3D model will also have the sequence, so what are we testing by leaving it out?

Arne: 1. We do realize that if one wants to cheat here, it is potentially possible, but I guess it would not be a trivial task. First, I am not that sure that padding a model test set with your own predictions will help much. Second, collecting a lot of predictions from different servers may be problematic during the CASP season because of time restrictions. And finally, based on the analysis published in the CASP9 evaluation paper, we expect the results to be in specific ranges for different types of QA methods and different dataset sizes. If someone's results stick out of the pack, we will request the server software from the group leader and recheck the accuracy of the method by running it locally at the Prediction Center. Results of this recheck will be publicly announced at the meeting. 2. Concerning the checking of correlation on a selected percentage of top models: we did this already in CASP9, and the proposed change to the testing procedure aims at doing this more rigorously.

Liam: I am not sure where "omitting the target sequence from the server submission data" comes from. We never planned to change the format of the submission data.

Can I just confirm that you will be binning in uniform-width GDT-TS bins and not uniformly in terms of the number of models per bin. If you don't do it this way, or if you sample randomly from the whole population of models, then (I think) a trivial way to game the system will simply be to find the most similar pair of structures and then assume that these are the two most similar structures to whatever structure you originally selected as reference. Then you would simply calculate the similarity of the other 28 models to either or both of these models to get a good correlation with your original ranking.
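A toy illustration of this concern, using points in 3-D as stand-in "structures"; the geometry, the spread of model qualities, and the Euclidean similarity measure are all assumptions for illustration, not real model comparisons:

```python
# Toy sketch (not a real QA method) of the gaming concern: when models are
# spread in quality, the most similar pair tends to sit near the consensus
# reference, and distances to that pair recover the hidden quality ranking.
import numpy as np

rng = np.random.default_rng(0)
# Toy "structures": 30 points in 3-D; the true reference is the origin, and
# a model's quality is (inversely) its distance from the reference.
models = rng.normal(size=(30, 3)) * np.linspace(0.2, 3.0, 30)[:, None]
true_quality = -np.linalg.norm(models, axis=1)          # higher = better

dist = np.linalg.norm(models[:, None] - models[None, :], axis=-1)
np.fill_diagonal(dist, np.inf)
i, j = np.unravel_index(np.argmin(dist), dist.shape)    # most similar pair
guessed_quality = -np.minimum(dist[i], dist[j])         # closeness to pair

# The guessed ranking correlates with the true one without ever seeing it.
print(np.corrcoef(true_quality, guessed_quality)[0, 1])
```

The effect depends on the quality spread: if all released models were equally far from the reference, the most similar pair would carry little information about it.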

Actually, thinking about it more, even uniform GDT-TS bins won't completely eliminate the problem, because the volume of conformational space increases rapidly as the RMSD from a single reference point increases. So two models that are both 3.0 A RMSD from a reference structure will be expected to be less similar than two models 1.0 A RMSD from the same reference. This is basically why clustering works in the first place.

The only way to eliminate the signal completely would be to, for example, sample the models such that the two nearest-neighbour distances for all models are close to a constant. I suspect the ideal sampling would be based on having equal-volume cells after a Voronoi polyhedron construction. Or you could just give out 2 models per target!
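One simple heuristic that pushes nearest-neighbour spacings toward a constant is greedy farthest-point sampling; this is an assumption on my part, a cheap stand-in rather than the Voronoi construction suggested above:

```python
# Greedy farthest-point sampling: repeatedly pick the point farthest from
# everything already chosen, which spreads the sample roughly evenly and
# suppresses tight near-neighbour clusters.
import numpy as np

def farthest_point_sample(points, k, seed=0):
    """Return indices of k points with roughly uniform mutual spacing."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(points)))]
    # d[p] = distance from point p to the nearest already-chosen point
    d = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))          # farthest from the chosen set
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    return chosen

pts = np.random.default_rng(1).normal(size=(200, 3))
print(farthest_point_sample(pts, 30))
```

This only equalizes spacing approximately; equal-volume Voronoi cells would be the more principled (and more expensive) version of the same idea.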