WARNING!!! All merging statistics included in this tutorial have been calculated using POINTLESS and AIMLESS. These CCP4 programs are
periodically subject to small changes. It is therefore possible that some of the merging statistics contained in this tutorial
appear slightly different from what the user obtains with his/her version of CCP4. These differences are very minimal and should be considered
as primarily due to different numerical or algorithmical approximations.

DATA SETS INCLUDED WITH THIS TUTORIALS

For convenience all tutorials can be carried out inside a same directory and specific results
can be collected in separate sub-directories. We will assume that "blend_tutorial" is the name of
the overall directory and that it has been created in the $HOME area ("/home/james" in our case). Different directory
names and locations can be used, but appropriate modifications will have to be applied to the paths
displayed in this document.
Data downloaded using the above links, when unpacked, yield a directory named "data". For the rest of
the tutorials this directory can be moved inside "blend_tutorial".
There are three groups of data in the sub-directory “data”.
One group is formed by crystals of insulin, the second by crystals of
lysozyme and the third by lysozyme soaked with bromine.

The first tutorial will be illustrated with and without use of the CCP4
interface. Other tutorials will be illustrated either with or without
interface, but the user
should have no problem re-running them as it suits her / him best.

TUTORIAL 1.

14 DATA SETS OF CRYO-COOLED INSULIN CRYSTALS

The CCP4 interface associates a project directory to an
already-existing physical directory in your computer file system. Thus we
need to create a directory before anything else. Let's call this directory “insulin”
(it is assumed you are in the blend_tutorials directory):

mkdir insulin

Next we need to associate a
project directory called “BLEND_TUTORIAL_INSULIN” to the insulin
directory just created. You can do this using “Directories &
ProjectDir” in ccp4i. Now that the project directory has been
created let us change to the “BLEND_TUTORIAL_INSULIN” one using
“Change Project” in ccp4i. We are ready to start
practising with BLEND.

Insulin data have been collected at the Diamond Light Source
synchrotron and integrated using MOSFLM. Angular rotation
range varies from data set to data set, as these have been collected
from different crystals. Please, explore some of these integrated
data clicking “View Any File” in ccp4i and selecting the “mtz
MTZ” field in “File type”.

In the CCP4 GUI, BLEND can be found either in the “Data Reduction and
Analysis” or the “Program List” section on the left. Once initiated,
the BLEND section looks like the following:

To
begin with we run BLEND in analysis mode, with the main goal of
creating clusters of data that can, potentially, merge well together.
After having filled the blank fields with the appropriate paths and
keywords, the above interface should resemble this:

All keywords with their meaning are explained in BLEND documentation.
After the run in analysis mode, we can look at the output using ccp4i
“View Files from Job → View Job Results (new style)”:

Cell
sides, cell angles, cell volume and mosaicity can be viewed for all
the 14 data sets included in this insulin tutorial. Numerical values
can be explored within the same view. More files have been produced
by this run of BLEND in analysis mode. These are listed by selecting
“View Files from Job” → “blend_1”:

“blend_1” is a directory under the “insulin” directory,
containing all files produced by this run of BLEND. The most
important among these files is possibly the dendrogram (files
“tree.ps” and “tree.png”). Both PNG and PS files
display the same dendrogram. They can be viewed using any graphical
tool, but, at present, only the PS file can be visualised using the
CCP4 GUI:

The
numbers in red are the Linear Cell Variation (LCV) values (and the
associated absolute LCV in Å) for
the top five clusters. They give an indication of cell similarity
among all crystals included in the specific cluster; thus,
ultimately, they can be associated with isomorphism between different
data sets. In the above dendrogram we can see two clusters with
similar cell variability (0.44% and 0.48%, corresponding to 0.36 Å and
0.39 Å respectively).The variability is
more than doubled when these clusters merge into the overall cluster
containing all 14 data sets; this is probably an indication of some
minor form of non-isomorphism. Tests done so far with BLEND show that
some structural dissimilarities can be noticed for values of LCV
higher than 1 – 1.5%.

LCV
values for all clusters are reported in the file “CLUSTERS.txt”:

This
is, essentially, a numerical version of the dendrogram and, as we
will see, is often useful for executions of BLEND in synthesis mode.
Another
convenient file in directory “blend_1” is “BLEND_SUMMARY.txt”,
which is essentially a table reporting cell parameters and other
quantities for each data set.
The unique serial number assigned by BLEND to each data set can be found
in the “FINAL_list_of_files.dat” file. For us this file looks
like the following:

The
first column contains full path to all input data files; the second
contains the serial number assigned to input files; the fourth and
fifth column list initial and final image numbers, while the third
lists the last image BLEND will include for all subsequent scaling
and merging jobs. Numbers in these column have been calculated
through procedures to get rid of radiation damaged images. In the
specific case the only data set where images will be discarded is
data set 9, for which BLEND suggests to remove from image 108 to
image 122. The last column include resolution cuts suggested by BLEND
for all subsequent scaling and merging jobs. These are worked out
using intensity averages decay with resolution, and are controlled by
keyword ISIGI.

All
input data files were assigned a serial number; this is not always
the case, because sometimes individual data sets are excluded as they
contain multiple wedges or other irregular features, and BLEND does
not accept them in the group to be further analysed. There might also
be reasons for which the user does not want to consider some of the
files included in the data directory (for example if they belong to
an unwanted experiment). In this case the user could consider
file “mtz_names.dat”, also included in directory “blend_1”,
to re-run BLEND using less input data. File “mtz_names.dat” looks
like the following:

In order to show how to make use of this file
suppose we are not any longer interested in the last
4 data sets. We first copy “mtz_names.dat” into a file called
“original.dat”, in a newly-created directory called
“amended_insulin”:

At
this point the last 4 files can be deleted from “original.dat”
file which, now, looks like the following:

Next,
we associate this directory to a new project directory in ccp4i,
called “BLEND_TURORIAL_AMENDED_INSULIN”, and select this project
directory. BLEND can now be executed again in analysis mode, this
time selecting the “original.dat” file as starting point, rather
than the whole directory containing all 14 insulin data sets:

The
usual files will be produced (still in directory “blend_1”, but
inside directory “amended_insulin”), although this time BLEND
will have dealt with 10, rather than 14 data sets, and the dendrogram
will look different from the one previously obtained:

As
numbering of the 10 data sets has not been modified, one can see that
clusters for these data include the same elements of corresponding
clusters in the previous example, considering that data sets
numbered 11 to 14 have been excluded from the analysis.

Having
shown how BLEND in analysis mode can be re-executed using a
subset of data, let us go back to the previous working directory
selecting “BLEND_TUTORIAL_INSULIN” project directory in ccp4i.
Let's try and increase or decrease image cutting due to radiation
damage, by using the RADFRAC keyword. An input value of 0.85 means
that we would like to keep only images for which intensities have, on
average, been dampened by less than 15% of their initial value. Input
for RADFRAC can be included via ccp4i as shown here:

This
time “FINAL_list_of_files.dat” shows some difference in the
number of images discarded. For data set 9 discarded images range 102
to 122, while before it ranged 108 to 122, and images 93 to 100 are
now discarded from data set 7, while no images were previously discarded:

Let's
modify RADFRAC once more and use 0.65, rather than 0.85. As a result:

Images
discarded are now less because we have allowed for intensities to be
dampened, on average, by up to 35% of their starting value.
A similar data exclusion can be applied in resolution terms, by
changing value to the keyword ISIGI. For instance, let's re-run BLEND
in analysis mode (with RADFRAC 0.75 - default value as the box
"Tolerated fraction of damaged average intensity" is not selected),
fixing ISIGI to
3:

Results
should be produced in directory “blend_4”. The
“FINAL_list_of_files.dat” looks like the following:

Resolution
cutoffs have now changed. For the first data set the highest is
2.304, while previously it was 1.920. We expect an increase if the
value of ISIGI is lowered. This is normally the case. But in the
specific insulin example data are exceptionally good, and their
signal-to-noise ratio is quite high to the very highest resolution.
Thus, results will not change here if values lower than 1.5 are used
for ISIGI.

Let's
pay a closer look at the results of cluster analysis, as described in
file “CLUSTERS.txt”:

Cluster
numbers and corresponding data sets are listed in column 1 and column
6 respectively. The second column reports how many data sets form the
specific cluster, while in column 3 it is found the exact height at
which data sets merge to form the cluster. Columns 4 and 5 lists all LCV and aLCV
values. The dendrogram encourages people to look for grouping of
individual data sets. Larger groups attract one's attention more than
smaller groups. In this sense the presence of two clusters, cluster
(1 2 3 4 14) and cluster (5 6 7 8 9 10 11 12 13), is immediately
perceived. Very soon, though, we realise that the second cluster is
made up of two smaller clusters, cluster (5 6 8 10) and cluster
(7 9 11 12 13). Finer details and even smaller clusters can be singled
out by proceeding to the lower part of the dendrogram. This way of
proceeding through the dendrogram interpretation can have some
validity, but is highly subjective. In BLEND it is found that
a more logical, straightforward and objective way of interpreting the
dendrogram is to assume that each node can represent a valid data
set. Subsequent operations of space group assignment (POINTLESS)
and scaling (AIMLESS) determine the degree of validity. This
interpretation is accomplished with BLEND synthesis
execution mode. From “CLUSTERS.txt” we read that everything
merges at height 13.670. If the user wishes to produce results for
all nodes under a certain height, it will suffice to specify this in
the GUI. For instance, to produce results for all nodes in the
insulin example, the BLEND section of the GUI will look like this:

After
execution, several files will be produced in directory “blend_5”
(this is the fifth job for the GUI under project directory
“BLEND_TUTORIAL_INSULIN”), including another directory packed
with files, called “merged_files”:

In
this directory one can find unscaled (produced by POINTLESS) and
scaled (produced by AIMLESS) files corresponding to all nodes
selected in the dendrogram. Final overall merging statistics can be
found in the “MERGING_STATISTICS.info” file:

Values
are sorted on increasing completeness, followed by lower Rmeas.
Higher completeness is often associated with high redundancy
(multiplicity), but also with poor isomorphism, because many more
data sets are needed for both completeness and multiplicity to
increase. For example, cluster 12 is 100% complete, as it is cluster
4, but this last cluster has lower Rmeas, quite certainly because it
is the union of just two data sets, while cluster 12 includes nine
data sets. A good result is represented by cluster 2, which has Rmeas
= 0.088 and it is 99.7% complete. Cluster 5 is an even better choice;
it has same Rmeas than cluster 2, but a much lower Rpim. It is a
nearly complete set and has high multiplicity. This is what should
probably be used. A visual version of the above table is provided by
the “Rmeas_vs_Cmpl.png” or “Rmeas_vs_Cmpl.ps” plot:

From
the overall view provided by this plot we can see that 2 and 5 are
the best data sets in terms of statistics and completeness (they are
barely visible because they overlap on top of each other). Unscaled
and scaled files related to all results, in the “merged_files”
directory, can be inspected with the usual CCP4 tools (try, for
instance, to view “scaled_005.mtz”). POINTLESS and AIMLESS logs
for all unscaled and scaled files are also included in this
directory. For instance, the final part of the AIMLESS log
file for cluster 5 shows:

There might be reasons for
which the user wishes to calculate results for certain nodes only,
rather than for all nodes of the dendrogram. This can be done using
two numerical values, rather than one, in BLEND synthesis.
Suppose we only want to calculate data for cluster (5 6 8 10). This
cluster merges at a height of 0.744 (check “CLUSTERS.txt”). Two
numerical values in between which 0.744 and no other nodes is
included are 0.745 and 0.743. Thus, to produce a synthesis just for
this cluster, one can run BLEND as follows:

Statistics and related files
for only one data set are contained within the “merged_files”
directory. This job could have been executed more easily in combination
mode, simply selecting the specific cluster in the graphical interface
(see later).
Suppose we want to extend
resolution of the previous result from 1.93 to 1.80. BLEND can
be again executed in synthesis mode using the appropriate
"RESOLUTION" keyword:

Check statistics under the
new “merged_files” directory, they should have changed to include
the desired resolution cutoff.
"MERGING_STATISTICS.info" allows people to gain a better
understanding of the ability individual data sets have to form
larger, more complete and isomorphous groups. For example, in job
number 5, it clearly emerges that cluster 5, composed of data sets 7,
9, 13, has good merging statistics and it is very complete. The two
additional data sets joining 7, 9, 13 and forming cluster 10, i.e.
data sets 11 and 12, make the situation worse, in terms of merging
statistics.

It is possible to find out which data sets make the statistics worse by
running BLEND in combination mode.
For example, we might want to investigate whether data set 11 or data set
12 make statistics for cluster 5 worse. First we try with 11.
The group formed by data sets 7, 9, 11 and 13 is equivalent to cluster 10
without data set 12. This can be specified in the graphical interface, to
save typing time (this option becomes very useful when handling several
data sets). As we would like to compare results with what has been computed
for cluster 10 in BLEND job 5, we also fix the highest resolution to the same value
used in that run, 1.92 &Aring:

Merging statistics for this specific combination can be found in the sub-directory
"combined_files" of directory "blend_8" (as Job 8 was the BLEND job just executed).
Next we try with 12. If the "earlier directory" from which the new BLEND job will
be executed coincides with the output directory of Job 8, all results will be collated
in a same directory. This becomes very useful when many different groups are to be
investigated. The interface for the new job is, thus:

Checking
file "MERGING_STATISTICS.info" in sub-directory "combined_files" of directory "blend_9"
leads to the conclusion that
statistics are better when using data set 11 (group 1 because we ran combination job
for group 7, 9, 11, 13 first):

Thus we can decide to
"filter out" data set 12 from cluster 10, in order to achieve a
better composite data set.

b) Without CCP4I

Like any other CCP4 program, BLEND
can be executed without using ccp4i, simply typing commands at the
keyboard in an interactive command shell. This, quite often, makes
the program more flexible.
Here we will briefly repeat all
steps previously carried out with ccp4i. First a new directory, named
for example "insulin_no_ccp4i" will be created in order to keep
this part of the tutorial separate from the rest:

mkdir insulin_no_ccp4
cd insulin_no_ccp4

Next we run BLEND in analysis mode
using all 14 files in the "data/insulin" directory:

blend -a ../data/insulin

After start the program
immediately halts, waiting for keywords input; in this specific
instance we simply push the "Enter" key once, thus forcing BLEND
to use default values. At completion the program produces the
following files:

whose meaning and content has been previously explained.
To show how "mtz_names.dat"
can be re-used for a different run of BLEND in analysis mode, and
similarly to what done previously in this tutorial, let's create a
directory called "amended_insulin" and copy "mtz_names.dat" in it:

Then "original.dat" is edited
to remove the last four lines, thus creating a different input,
consisting only of 10 out of 14 data sets. Next, the program is
executed using "original.dat" as input:

blend -a original.dat

Results of this run are similar to
what described before. To carry on let's move back one level to
"insulin_no_ccp4i":

cd ../

If a log file is needed, then a
keyword file has to be prepared before execution. Let's call this
file "bkeys.dat", and play both with RADFRAC and ISIGI
keywords, changing from their default values 0.75 and 1.5 to the new
values 0.65 and 3.0.
The file looks like this:

RADFRAC &nbsp 0.65
ISIGI &nbsp 3.0

Then BLEND is executed again using
"mtz_names.dat" as input:

blend -a mtz_names.dat < bkeys.dat > blend_analysis.log

After completion,
"blend_analysis.log" can be either read using any editor, or
viewed graphically with the CCP4 program "logview":

logview blend_analysis.log

To run BLEND in synthesis mode and
produce scaled data for all nodes of the dendrogram, type:

blend -s 14 < bkeys.dat > blend_synthesis.log

At completion the result can,
again, be viewed using "logview":

logview blend_synthesis.log

All unscaled and scaled mtz files,
and all POINTLESS and AIMLESS logs, are contained in the
"merged_files" directory. Logs can be viewed using "logview"
or "loggraph". For example:

loggraph merged_files/aimless_005.log

In order to execute BLEND in
combination mode, so to scale groups of data sets not corresponding
to any cluster, it is handy to copy and paste numeric strings from
the sixth column in file "CLUSTERS.txt". Another, perhaps quicker, possibility is
to use the special syntax created to summarize groups of data sets (see
documentation).

For instance, the group of data sets (7, 8, 9, 10, 13) is simply the union of
clusters 4 and 5. This can be represented with the two cluster numbers within
square brackets, "[4,5]". The command line to run BLEND in combination mode over
this group would be, in this case (using same keywords file as before):

blend -c [4,5] < bkeys.dat

Or, group (7, 9, 11, 13) is essentially cluster 10 with the exclusion of
data set 12. This can be summarized with the following string, "[10] [[12]]",
where the double-squared brackets include all data sets to be excluded during
execution:

blend -c [10] [[12]] < bkeys.dat

Many more runs in combination mode can be executed within the same directory.
All files produced can be found in directory "combined_files".

TUTORIAL 2.

28 SWEEPS OF LYSOZYME CRYSTALS

How to obtain complete data sets, while preserving isomorphism with BLEND

These
data come from hen egg lysozyme with two different procedures. The
first group of 12 10-degree sweeps has been collected with pure
lysozyme crystals at room temperature from crystallisation plates.
The second group of 16 10-degree sweeps has been obtained from
crystals soaked in a solution of sodium bromide; data have been
collected at the Diamond Light Source synchrotron with a wavelength
close to the bromide peak. Data collection went further than 10
degrees for each data set, but we have limited sweeps in order to
replicate similar situations met when radiation damage acts fast. For
this tutorial in situ data have been stacked with data collected at
low temperature and from a slightly different structure (additional
bromide content) in order to demonstrate the interplay between
completeness and non-isomorphism.

Let's
create a new directory, called "lysozyme", in the
"blend_tutorial" directory, and associate the new project
directory "BLEND_TUTORIAL_LYSOZYME" to "lysozyme":

mkdir lysozyme_test

Next,
we run BLEND in analysis mode using all mtz files in
"data/lysozyme":

In
the group of files produced by this run the following dendrogram
(file "tree.ps" or "tree.png") is found:

The
overall Linear Cell Variation (LCV) has a high value. This is
certainly caused by the couple of crystals in the isolated branch on
the left of the dendrogram, as it can be confirmed by looking at cell
parameters plots in BLEND log file (new style):

Data
consist of many short sweeps (not useful to form a complete data
set), therefore we can get rid of sweeps 19 and 21 without loosing
much data and re-run BLEND in analysis mode. One way of
executing BLEND using the same data without sweeps 19 and 21
is by renaming file "mtz_names.dat" as "original.dat" and
deleting the lines pointing to these sweeps:

As
we can see, the new LCV is much smaller, although the value 1.68% (and the absolute LCV value 1.54 Å)
still indicates some non-isomorphism among all crystals. We know this
is the case because crystals have been assembled from
differently-prepared structures. Indeed, LCV values computed
separately for the left branch and for the right branch of the
dendrogram return 0.69% and 0.72%, respectively. Incidentally, data
sets group 1 – 12 has different features from group 13 – 26
(remember, two sweeps missing from the initial group!) because it
corresponds to a different collection from a different group of
crystals. This coarse division, observed in the two separate
dendrogram's branches, can also be detected by looking at the different
mosaicity trends in BLEND log file:

Scaled
data for all nodes of the new dendrogram can be computed running
BLEND in synthesis mode. We know that all data should belong
to space group P 43 21 2. We also would like to
use the same resolution for all data, say 1.9 Å.
Accordingly, the program is executed using the following keywords:

CHOOSE SPACEGROUP P 43 21 2
TOLERANCE 100
RESO HIGH 1.9

Keyword
TOLERANCE is set to a high value to bypass the default tendency of
POINTLESS to stop execution when cell parameters differ too
much. Considering now the dendrogram, using only one value for the
height means that the program would process all nodes below this
value. In the "CLUSTERS.txt" file it can be seen that these have
values smaller or equal to 28.861; thus any value larger than this
produces the desired scaled data:

This
run will take a while to complete (many clusters!). At the end of
execution a new "merged_files" directory is created with scaled
files and merging statistics tabulated. Here is a view of the final
statistics in BLEND log:

Scaled
files corresponding to these results are included in the
"merged_files" directory, and they are ready for further
investigations. Some of the statistics can be improved by filtering
out certain sweeps from specific clusters. A visualization of the
above table is provided in the following picture,
where clusters with completeness around 90% or better are represented
by grey circles and the corresponding Rmeas value is typed
underneath:

Clusters
21, 22, 23 and 25 have clearly high Rmeas values,
while values for clusters 15 and 24 can, perhaps, be improved. The
main way to improve statistics at this point is by running BLEND
in combination mode. Let's try, for instance, to improve the
statistics for cluster 15. As the Rmeas value jumps from
0.111 for cluster 11 to 0.154 for cluster 15 and as cluster 15 is
simply cluster 11 with the addition of sweep 23, it is
straightforward to impute the reason for the increase to the bad
quality of sweep 23. But there might also be other possibilities. For
example sweep 23 does not match well some of the sweeps composing
cluster 11; by excluding these sweeps, then, statistics might
improve. In conclusion we can try all combinations of data sets forming
cluster 15, excluding one data set at a time. For all
combinations after the first we will have to make sure that the path
to an earlier BLEND directory points to the results directory of the
previous combination. So for the first combination:

For
the second combination:

To
summarize, for all combinations tried to improve statistics for
cluster 15, we have:

Final
statistics can be viewed in the "MERGED_STATISTICS.info" file
under "lysozyme/blend_8/combined_files" or simply using the
interface:

The
lowest Rmeas is the one for group 3. Thus, discarding sweep 23 from
cluster 15 causes Rmeas to improve from 0.154 to 0.111, while
completeness decrease just of a little from 95.8% to 93.9%. This is essentially
cluster 11; our initial guess pointing at data set 23 as the one making things worst
turned out to be correct.
Similar combinations can be tried on all other clusters. Results are shown in
the following picture:

Values
shown in red correspond to those cases where filtering out one or
more sweeps has made it possible to lower Rmeas.

Try
to reproduce the same results by executing BLEND in
combination mode. As a hint you can find it useful to know that the
sweeps discarded are 1, 15, 19, 22 and 23.
To conclude this tutorial, use BLEND in combination mode with
just one element at a time to reproduce scaled data and statistics
for individual sweeps. It is interesting to observe that the 4 data
sets excluded from the previous filtering job do not always
correspond to those having worst Rmeas values.

A
final remark: all scaled mtz files do not contain structure factors,
but only scaled intensities. To create structure factors and proceed
with phasing, model building and refinement, you need to run the CCP4
program TRUNCATE or any other software that does an equivalent
job.

TUTORIAL 3.

12 CRYSTALS OF LYSOZYME SOAKED WITH BROMINE

Each one of these data sets is not complete. The anomalous signal produced by the heavy bromine atoms can be detected
even with partial data sets, but effective phasing can only happen reliably with good anomalous completeness. For this
reason it is necessary to merge data from these multiple crystals to obtain full anomalous completeness. The resulting
high redundancy makes the anomalous signal even stronger, provided isomorphism is conserved.
In this tutorial we will use BLEND to first explore resolution and completeness of data and then to obtain complete
data sets that will be subsequently used to find the heavy atom substructure. The CCP4 GUI will not be used, so that the
whole tutorial is simplified, but the GUi should be used if you are unfamiliar with command line execution.

First let us create a new directory, called "lysBr" inside the "blend_tutorial" directory:

mkdir lysBr

Data in this case are stored in directory "lysBr" under "blend_tutorial/data". But they are spread over 12 different directories.
In order to prepare data for BLEND we have two options; either copy all data into a single directory, or edit a file with paths
for all 12 input files. In this case the second option is better because all individual input files have been produced by XDS
and have identical name, "INTEGRATE.HKL". A quick way to prepare the input file for BLEND is to use the unix command "find". First
let us move inside the "lysBr" directory (our working directory, in this case):

cd lysBr

Next, use "find" starting from "data/lysBr" directory and re-direct the results into a new file called "original.dat". This file
is the initial input for BLEND (the "sort" command after the pipe symbol "|" is to sort directory names; it's not needed in general,
as input file order really does not matter in BLEND, but here it helps the tutorial description):

BLEND can now be executed in analysis mode using "original.dat" as input (remember to hit the "enter" key twice, as we hav not prepared
any keywords file):

blend -a original.dat

The tree produced should look like the following:

The linear cell variation for all crystals has a small 0.69% value, which means all crystals are fairly isomorphous. Among all files produced
by BLEND in the present working directory the new "xds_files" directory and "xds_lookup_table.txt" file have been produced. This happens every
time input data come from XDS integration files. BLEND makes use of POINTLESS to turn them into mtz files. All new files,
including POINTLESS log files, are included in the "xds_files" directory. The "xds_lookup_table.txt" helps users to keep track of
file origin, because all XDS files are re-named in the process. In this specific case, for example, directory "xds_files" includes
the following files:

In order to estimate data resolutions we can run BLEND in combination mode on each individual data set. Strictly speaking
this is not needed, and merged data corresponding to all clusters can be calculated straight away, but exploring individual
data sets (in those instances where data wedges and symmetry allow scaling) can help to figure out data quality. We know that
the space group for these samples of lysozyme is P 43 21 2 and can fix this in the keywords file. Thus,
a new file, named "bkeys.dat" can be edited to include the following line:

All files produced by these 12 runs of BLEND are contained in a directory called "combined_files". Let us rename this directory as
"individual_files" because it includes process data from individual data sets:

mv combined_files individual_files

File "MERGING_STATISTICS.info" inside "individual_files" is a summary of overall statistics:

Group

Comple-

Multi-

Reso

Reso

Reso

Number

Rmeas

Rpim

teness

plicity

CC1/2

Mn(I/sd)

Max

1

0.072

0.044

69.80

1.80

1.47

1.47

1.47

2

0.073

0.038

67.30

3.00

1.40

1.40

1.40

3

0.299

0.114

92.60

5.40

1.64

1.75

1.43

4

0.090

0.049

74.80

2.50

1.50

1.50

1.50

5

0.070

0.046

75.70

1.70

1.39

1.39

1.39

6

0.214

0.133

85.20

2.00

1.64

1.83

1.52

7

0.193

0.117

70.70

2.10

1.50

1.60

1.50

8

0.055

0.033

60.90

2.10

1.48

1.48

1.48

9

0.123

0.083

80.00

1.70

1.49

1.49

1.49

10

0.558

0.380

75.20

1.60

1.89

2.02

1.49

11

0.084

0.053

72.00

1.90

1.40

1.40

1.40

12

0.078

0.044

83.30

2.60

1.39

1.39

1.39

From the "Completeness" column we can see that the only complete data set is data set 3. Its resolution, as suggested by the
"Reso CC1/2" and "Reso Mn(I/sd)" columns is around 1.65 Å. Resolutions are higher for the other partial data sets and it
seems reasonable to push resolution for combined data at 1.6 Å. Let us, therefore, run BLEND in synthesis mode to produce
merged data for all nodes of the dendrogram at this resolution. The corresponding line to add to the BLEND keywords file is
"RESO HIGH 1.6". We also want to turn anomalous handling of data on in AIMLESS, as we would like to use scaled data
for subsequent phasing. In the end, the updated "bkeys.dat" looks like this:

CHOOSE SPACEGROUP P43212
RESO HIGH 1.6
ANOMALOUS ON

Then we execute BLEND as follows (the value 10 is high enough in the dendrogram to include all nodes):

blend -s 10 < bkeys.dat

All results are, this time, dumped in directory "merged_files". We can appreciate overall statistics by viewing the
"MERGING_STATISTICS.info" file inside this directory. These statistics, though, give no information we can use to
judge the quality of data for phasing. This information is still not extracted and organised by BLEND because the
whole issue is currently under investigation. Future addition in "MERGING_STATISTICS.info" will include such information.
For now we can extract specific lines from all AIMLESS logs using the unix "grep" command. For instance, anomalous
completeness is important to judge the quality of anomalous data. We can extract it
with the following unix line:

grep "Anomalous completeness " merged_files/aimless*.log

The output produced is the following:

It is immediately seen that clusters with satisfactory anomalous completeness are 5, 6, 7, 9, 10, 11
(greater than 90%). These are data sets on
which we can try the procedure to find heavy atoms substructure. These files will have to be compared with the only individual
data set with the highest anomalous completeness, data set 3. A "grep" line similar to the one previously used can show this:

grep "Anomalous completeness " individual_files/aimless*.log

Resulting output:

The value of 76.9% for the anomalous completeness of data set 3 is nowhere near an acceptable 90%, but it's the only decent one
obtainable with indvidual data sets, and it is with this one that all merged ones will have to be compared. The message
to be taken home is that when only partial data are available it is still possible to combine them and obtain a complete data set,
rather than wait for new crystals and another data collection. Furthermore, the redundancy implied by merging data together can
increase the anomalous signal and, thus, strengthen the phasing power.
In what follows we have used the SHELX family of programs to prepare data, measure anomalous signal and find the heavy
atoms substructure (Sheldrick (2010)). These programs work with input files of a specific format ("sca" files).
In order to prepare these files we need, first, to turn scaled intensities into structure factors using the CCP4 program TRUNCATE.
Then we use "MTZ2SCA" to turn mtz files into sca files. Everything will be executed inside a new directory called
"anoproc". In this directory we need to create seven more directories, one for each data set to be processed. The whole series of
unix commands to be typed is:

The procedure to produce sca files ready to be used in SHELX will be now shown only for one of the data sets involved, data set 6,
but it is meant to be repeated in exactly the same way for all other data sets. First we move to directory "lysBr_006". Then we run
TRUNCATE, using command lines, in the following way:

After having entered this line execution halts as TRUNCATE expects some keywords. We can use default values and
just type "END" (and hit the "Enter" key). After execution a file "truncate_006.mtz" should appear in the directory. To translate this
file into a sca file we use MTZ2SCA in a very simple way:

mtz2sca truncate_006.mtz

Now also "truncate_006.sca" should be present in the directory. As mentioned before, the procedure just applied on data set 6 should be
repeated for the other 6 data sets in the "anoproc" directory. Once all sca files have been obtained, programs SHELXC and
SHELXD can be executed to obtain the heavy atoms substructure. The heavy atom for this specific case is bromine (BR - 8 sites) and we have
tried 1000 trials in SHELXD. Results are tabulated as follows:

Data set

Sufficient d"/sig ?

Solution ?

CCall

CCweak

Cfom

ind_003

yes (?)

yes

30.37

15.47

45.85

005

yes

yes

43.64

26.23

69.87

006

yes

yes

37.24

22.07

59.31

007

yes

yes

43.05

25.97

69.02

009

yes

yes

34.96

18.59

53.54

010

yes

yes

44.68

27.25

71.92

011

yes

yes

42.59

25.57

68.16

The phasing power of bromine in a stable structure like lysozyme is clear from the relatively high CCall (> 30)
for data set "ind_003". It is also clear, though, that the anomalous signal becomes much more effective when multiple
isomorphous crystals are merged together.