Any file starting with data_xxxxxx is now guessed to be a cif file (before, only if xxxxxx was a pdb code). Now also checking that the readFieldsTitles method finds some fields, otherwise throwing an exception. Phenix cif files will now be read but will fail with a file format exception.

Introducing chain clusters and interface clusters. We group chains in the same way as before (based on 100% seq identity) but now we have a more flexible scheme where we could introduce chain clustering based on seq-alignment/rmsd when reading from PDB file (not implemented yet). Interface clustering is implemented based on rmsd of superposition of both interface sides.

Important change: now interface search goes up to 6th shell of neighbouring cells. This solves bug CRK-101 where no interfaces were found for 1was. It also solves the case of many PDB entries that need 3rd shell (e.g. 24 entries in a test of ~2000 pdbs) and rare but occurring cases that need 5th shell (e.g. 2 entries in test of ~2000 pdbs). It seems that PISA uses 3rd shell and thus they miss interfaces occurring at >3 neighbours. Tested that this doesn't add any overhead at all to the runtime (the overlap check is very fast).
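To make the shell terminology concrete, here is a minimal sketch (not the project's actual search code) that enumerates the unit-cell translations up to a given shell. The class and method names are hypothetical, and it assumes that shell n means all integer translations (i,j,k) with max(|i|,|j|,|k|) = n.

```java
import java.util.ArrayList;
import java.util.List;

class CellShells {

    /**
     * Returns all integer cell translations (i,j,k), excluding (0,0,0),
     * with max(|i|,|j|,|k|) <= numShells.
     */
    static List<int[]> translationsUpToShell(int numShells) {
        List<int[]> translations = new ArrayList<int[]>();
        for (int i = -numShells; i <= numShells; i++) {
            for (int j = -numShells; j <= numShells; j++) {
                for (int k = -numShells; k <= numShells; k++) {
                    if (i == 0 && j == 0 && k == 0) {
                        continue; // the original cell itself is not a neighbour
                    }
                    translations.add(new int[] {i, j, k});
                }
            }
        }
        return translations;
    }
}
```

Under that assumption the number of candidate cells grows as (2n+1)^3 - 1, so a cheap overlap check before doing any real work on a translated cell is what keeps a 6-shell search affordable.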

Moved to core.structure.io the classes that belong there (except for the parsers themselves, we can't move them yet because there are some protected methods needed in core.structure). Removed some warnings from other files.

Moved geometry related calculations to the GeometryTools class: rmsd, screw component, axis type, axis-angle etc. That way the calculations can be used from anywhere. Introduced the OptSuperposition class as output of the rmsd calculation.
Improved the rmsd calculation in PdbChain (and corrected some bugs). Also greatly improved the rmsd unit testing.
All tests in PdbChainTest are now done with cif files and not anymore with pdbase.
Implemented an optimal superposition method in ChainInterface and some output for that in enumerateInterfaces with the intention of finding the pseudo-symmetric relationship between NCS related chains in an interface.
All tests passed.

Fixed important bug: transfToCrystal(Matrix4d) was not transforming the translation vectors properly. Found yet again another duplicate method for the transformation of vectors, now removed. Hopefully there are no other bugs lurking around...

Fixed bug with SIFTS file parsing: after a minor re-formatting of the SIFTS file in August 2013, the parser was crashing. Made it a bit more robust: it should work with both pre and post August 2013 formats.

Fixed bad bug (but with little consequences): the matrix for calculating eigenvectors (for calculation of rotation axes) was transposed, resulting in a few operators in a few space groups having wrong axes (for instance in H 3 2 or P 32 2 1). Fortunately we weren't using the axes calculation for anything other than output in test programs (that's why it didn't have consequences). Added testing for comparison of axis vectors versus screw translation components (both must be in same direction).

Fixed bug in screw/glide calculation: we were only using the original space group transforms (with crystal translations 0,0,0) as the reference for the rotation character (screw/non-screw), all transforms with same id and different crystal translations were considered to be of same character. That's wrong: in order to calculate the screw character one needs to also take into account the crystal translations. This problem affected any space group containing rotation axes on diagonals.

New feature: now we can classify the operators into their different types, including all screws with correct screw translation components, improper rotations, glides with translational components. New enum class introduced for that.
Many improvements in SpaceGroup and its testing.

Fixing bug in jaligner: the NeedlemanWunschGotoh method used by owl.core.sequence.PairwiseSequenceAlignment had a bug causing it to hang forever on very long sequences. The issue would happen when either of the 2 sequences to align was longer than 32768 (16 bit) and the other sequence matched it after the 32768 region. A test case is included in this commit to demonstrate it. The problem was an overflow: an array used for the traceback procedure was declared as short[] (16 bit). After changing it to int[] (32 bit) the issue disappears.
As the development in jaligner seems to be stopped I've simply downloaded the source, fixed the issue there (plus removing the loggers) and repackaged in a jar (jaligner-bugfixed.jar) that includes the sources.
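The kind of overflow described above can be illustrated with a minimal sketch. It is not jaligner's code; the class and method names are hypothetical and only show what happens when a position beyond 32767 is stored in a short.

```java
class ShortOverflowDemo {

    /** Stores a position in a 16-bit short, as a short[] traceback array would. */
    static short storeAsShort(int position) {
        return (short) position; // narrowing conversion keeps only the low 16 bits
    }

    /** Stores a position in a 32-bit int, as after the fix. */
    static int storeAsInt(int position) {
        return position;
    }

    public static void main(String[] args) {
        System.out.println(storeAsShort(32767)); // prints 32767, the largest short
        System.out.println(storeAsShort(32768)); // prints -32768, wrapped around
        System.out.println(storeAsInt(32768));   // prints 32768
    }
}
```

A traceback that follows such wrapped values can end up revisiting positions, which is consistent with the hang reported here.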

1) Added a Javaversion variable in build.xml so that the version can be changed easily.
2) Pymol exited with error status 139 because of a very long selection string that listed every residue individually (3+4+5+8+9+10). Fixed by writing the selection as ranges (3-5+8-10).
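The range form can be produced with something like the following sketch. The class and method names are hypothetical, and it assumes the residue numbers are sorted, distinct and non-negative (negative numbers would need escaping in a PyMOL selection).

```java
class ResidueSelection {

    /** Compresses sorted, distinct residue numbers into a "3-5+8-10" style string. */
    static String compress(int[] sortedResidues) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < sortedResidues.length) {
            int start = sortedResidues[i];
            int end = start;
            // extend the current run while the next number is consecutive
            while (i + 1 < sortedResidues.length && sortedResidues[i + 1] == end + 1) {
                end = sortedResidues[++i];
            }
            if (sb.length() > 0) {
                sb.append('+');
            }
            sb.append(start);
            if (end > start) {
                sb.append('-').append(end);
            }
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(compress(new int[] {3, 4, 5, 8, 9, 10})); // prints 3-5+8-10
    }
}
```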

Fixed bug CRK-128: in a non-crystallographic entry with cofactors it was crashing at getCofactors because we weren't checking for non-crystallographic entry there. The crash was a "cannot invert matrix" error, happening because the cell parameters were 0,0,0. Funnily enough this would not happen on my local machine (OpenJDK java 1.6.0_27), but it did happen on merlin (OpenJDK java 1.6.0_24)

Fixed bug: was null pointing on pdb files with a TER record positioned just before a HETATM record in the middle of a poly chain. Now added more checks for duplicated PDB chain code assignments, which will catch that problem plus possibly others

Fixed 'I 1 2 1' issue: it was actually a bug rather than a "feature". We were parsing everything in symop.lib, but some space groups happen to have alternative names and 'I 1 2 1' is the alternative name of 'I 2', which was the one we were allowing. Now parsing and storing both names in the lookup map.
Also fixed a bug by which the last space group in the file was not added to the lookup-by-name map

Several fixes:
- bug fix: was always throwing exception in pdb parser for no secondary structure when not protein
- new reasonable-crystal-cell check when loading PDB entries
- safer pdb file parsing: now all substrings should be checked. For instance now swiss-model output files are parseable

Fixed issue: now setting unknown atom type whenever we can't guess the element. If we read from a PDB file and there is an unguessable atom that we then have to write out again, the code would fail because in writeAtoms we assume we have atom types for everything.

Fixed bug in ASA calculation: the neighbour-finding procedure was taking the default vdw element radius rather than the one set in setRadius. NOTE: this slightly changes the area values we are getting.
Fixed issue in parsing of atoms when no atom type can be guessed: now whenever we can't guess we write a warning, also setRadius will use default radius if element not known (instead of null pointing)

Fixed minor issue: we were using residue 0 in order to create an empty selection but since now we use PDB residue numbers residue 0 is valid. We now use the "none" keyword used in pymol to create empty selections

Fixed bug: was null pointing in parsing 2bi6 mmCIF file. It had a multi-line quoted value in a _struct_conf non-loop element. We weren't handling that correctly.
Also added a new peptide-linked 3-letter code het residue

Fixed bug: very strange PDB entries can have no atoms observed in a chain at all (e.g. 1oax) or all atoms of a chain in one alt location only (e.g. 3nji). We throw now exceptions for them as in my opinion they don't follow the data-modelling standards properly.

Bug fix: now checking that the method is crystallographic before getting the unit cell. There can be errors in mmCIF files where a cell is defined but the space group is not, for non-crystallographic methods. Hence the check is needed!

Small fix to be sure that the uniprot ver reported in error message is the actual one that the japi server returned, at the moment they seem to be having issues and 2 calls to the server return different versions!

Fixed important issues with searching and alignment of homologs:
- ids and coverage were calculated wrongly, based on all BlastHsps for a hit. Now based on single Hsps
- fixed bug in BlastHsp.getQueryCoverage: was miscalculating the coverage by 1 unit
- the useHspsOnly parameter in alignment didn't make much sense. We need always to use the hsp segments only, extending to the full sequence is dangerous and in any case does not make sense at all if the clustering is done in hsps only
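The commit does not say in which direction the coverage was off by one. As a hedged sketch, assuming BLAST's 1-based inclusive HSP coordinates, the covered length is end - start + 1; the class and method names below are hypothetical, not the actual BlastHsp API.

```java
class HspCoverage {

    /**
     * Fraction of the query covered by a single HSP.
     * queryStart and queryEnd are 1-based and inclusive (assumption).
     */
    static double queryCoverage(int queryStart, int queryEnd, int queryLength) {
        int coveredLength = queryEnd - queryStart + 1; // +1 because both ends are included
        return (double) coveredLength / queryLength;
    }
}
```

With these conventions an HSP spanning positions 1 to 100 of a 100-residue query has coverage exactly 1.0, whereas dropping the +1 would give 0.99.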

Now also checking the translation component of the SCALE matrix, and if different from (0,0,0) throwing an exception. Some entries still have a shifted origin (e.g. 1hga, 1hgb, 1hgc). Let's hope they fix this in next remediation.
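A minimal sketch of such a check, assuming the three translation components (the U1, U2, U3 columns of the SCALE1,2,3 records) have already been parsed into doubles. The class name, the tolerance parameter and the exception type are assumptions for illustration, not the project's actual code.

```java
class ScaleCheck {

    /**
     * Throws if the translation component of the SCALE matrix differs
     * from (0,0,0) by more than the given tolerance.
     */
    static void checkScaleTranslation(double u1, double u2, double u3, double tolerance) {
        if (Math.abs(u1) > tolerance || Math.abs(u2) > tolerance || Math.abs(u3) > tolerance) {
            throw new IllegalArgumentException(
                "SCALE matrix has a non-zero translation (" + u1 + "," + u2 + "," + u3
                + "): entry is not in the standard crystal frame");
        }
    }
}
```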

Important fix to interface calculation: finally total symmetry redundancy elimination is working!
- fixed a bug present since rev 1573: translation vectors of transforms were not stored properly and thus we weren't comparing the right operators
- extracted TransformIdTransform private class to new class CrystalTransform. Rationalise some of the code thanks to that (still more to be done)
- better debug output in interface calculation to show duplicates info

Another important optimisation: now the ASA calculation for the list of interfaces is a lot more efficient by only calculating uncomplexed ASAs once per chain code per transform id.
Not supporting anymore calculation of interfaces areas with NACCESS

Important change in interface calculation:
- another massive gain in run-time and memory usage: we first use the bounding boxes translations to see if it is worth translating the whole unit cells
- thanks to that we now go to 2nd neighbors! (still performance is better than before with only 1st neighbors!). With that we fixed the long-standing bug by which we miss interfaces beyond the 1st neighbor
- included a few examples of 2nd neighbor problem in test
- redid the symmetry redundancy elimination: it is still there but has very little effect on overall performance. Now we skip an operator only when we are very sure its partner was in the right place. We could in effect get rid of it because it wouldn't really affect performance, but we keep it as a reminder that there is a lot of symmetry redundancy

IMPORTANT changes and fixes in interface calculation:
- very important bug fix: BoundingBox was not always telling properly whether 2 boxes overlap. This not only fixes the important bug but also boosts the speed of interface calculation
- optimisation of PdbChain.getAICGraph: it now returns an empty graph straight away whenever no overlap of boxes is found (instead of generating an empty distance matrix and checking its emptiness). This alone gives the biggest performance boost to interface calculation, around 5x faster
- fine optimisation of max/min calculation in BoundingBox
- made sure that bounds caching is done properly
- introduced the right solution for symmetry redundancy elimination: operators multiplication is the identity. Before we had an ad-hoc solution that wasn't comprehensive at all
- in any case the symmetry redundancy elimination is right now almost turned off because we introduced new conditions to re-check equivalent operators to fix cases like 2gsg, 3ka0, 1vzi, 1g3p and 1eaq, where some interfaces are missed due to an equivalent operator making the molecule fall in a cell where there's no contacts
All in all this totally changes the interface calculation: 1) by a massive boost in performance 2) by making it more correct, surely there were cases with missing interfaces, we were just lucky not to find them in testing
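For reference, an axis-aligned box overlap test of the kind mentioned above can be sketched as follows. This is not the project's BoundingBox class: the Box class, its fields and the cutoff parameter are assumptions for illustration.

```java
class Box {
    final double xmin, xmax, ymin, ymax, zmin, zmax;

    Box(double xmin, double xmax, double ymin, double ymax, double zmin, double zmax) {
        this.xmin = xmin; this.xmax = xmax;
        this.ymin = ymin; this.ymax = ymax;
        this.zmin = zmin; this.zmax = zmax;
    }

    /**
     * True if the two boxes come within cutoff of each other along every axis.
     * If this returns false, no pair of points (one from each box) can be
     * closer than cutoff, so the expensive distance calculation can be skipped.
     */
    boolean overlaps(Box o, double cutoff) {
        return xmin <= o.xmax + cutoff && o.xmin <= xmax + cutoff
            && ymin <= o.ymax + cutoff && o.ymin <= ymax + cutoff
            && zmin <= o.zmax + cutoff && o.zmin <= zmax + cutoff;
    }
}
```

The test is conservative: it can report an overlap for boxes whose contents are in fact further apart than the cutoff, but it never rejects a pair that has real contacts, which is the property a pre-filter needs.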

Fixed bug in PDB file parser: checking for TER records was done with a trailing space "TER ", that was fine for PDB deposited files but not for others (for instance phenix files). Because of that it was failing to parse properly some multi non-poly chains phenix files. Now removed the trailing space and problem seems to be gone. PDB parsers test passed
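The difference between the two checks can be sketched as below. The class and method names are hypothetical, not the parser's actual code.

```java
class TerRecord {

    /** Old check: requires a space after the record name, so a bare "TER" line is missed. */
    static boolean isTerOld(String line) {
        return line.startsWith("TER ");
    }

    /** New check: accepts both "TER" on its own and the full-width PDB form. */
    static boolean isTer(String line) {
        return line.startsWith("TER");
    }
}
```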

Bug fix: was null pointing when a Deuterium atom was present in a standard amino acid (not clear anyway whether that is standard PDB practice). Now checking for unknown atoms in standard amino acids when getting their radii. Will output error and continue with their standard vdw radius.

Now checking the SCALE1,2,3 PDB records when loading PDB data, in order to detect PDB entries that are not in the standard crystal frame (after remediation of 2011 only 148 non-virus entries are in nonstandard frame, marked with REMARK 285)

Fixed bug: for a uniref100 cluster member, was taking the representative's uniprot id/tax id instead of the member's. Database was not correctly modelled -> need also the tax ids for cluster members (and then to query the tax ids from members and not from representative)

Fixed bug in uniref xml parser: was not taking the right uniprot representative for old style (e.g. ver 1.0) xml files. They also contain several uniprot ids sometimes: first one being the active representative and the remaining being the inactive ones.

Fixed issue: was only considering axes in a, b, c. Now we properly check whether two operators are in same axis, by storing in a HashMap all operator ids per axis. Was affecting some exotic space groups like P 2 3.

Implemented some basic symmetry redundancy elimination in interfaces search. Moved interfaces-search to a separate class.
New pymol output file in enumerateInterfaces with all interfaces in a single session

Introduced alternative parsing mode for PDB files: if no SEQRES present now two modes of parsing possible, one where the sequence is padded with Xs based on numbering and one where sequence is taken as is in ATOM lines whatever the numbering

Introduced some more experimental code for enumeration of assemblies. BEWARE many things are UNFINISHED, namely in InterfaceGraph class the code to find induced and symmetry related interfaces is just a draft. There are many fundamental issues not solved yet. Be careful especially with implementations of equals and hashCode methods in InterfaceGraph, SubunitId, InterfaceGraphEdge etc. They alter "quietly" the behaviour of everything.

Fixed bugs in DaliRunner
- it wouldn't run from cmview when the two chains had same pdb code and pdb chain code
- it wouldn't work with newer versions of DaliLite as apparently the output file names and formats changed

Fixed bug: was calculating ASAs/BSAs with Hydrogen atoms when they were present. Calculating without them is more common practice and, most importantly, makes proteins with and without hydrogens more comparable. Now removing Hydrogens before interfaces calculation

Fixed a few bugs:
- PdbChain: copy() was missing some important fields, PDB file writing method was printing SEQRES for nonpoly chains too
- naccess BSA calculation had a few problems
- PdbAsymUnit: a few important issues with PDB/CIF chain codes

Important change: for maximum compatibility in no-SEQRES case we are now back to using the existing numbering if there's nothing wrong with it (no negatives or ins codes). If there's something wrong with numbering then we set the SEQRES to be the observed sequence and renumber.
Now behaviour is as it was before revision 1329. With the difference that a lot more PDB files can be read.

Another improvement to PdbfilePdb parsing: now we can also detect some ambiguous cases in alignments by checking the contiguity of the 3D chain. This fixes 2 cases from our test data set: 1dki and 2ofz. Now only 2 warnings for the whole dataset (both because of ambiguous assignment in HIS tags, safe to ignore. Could even be that CIF got them wrong)

Now the re-numbering of residue serials in PDB files has been improved. Instead of aligning the sequences whenever the serials don't match SEQRES, we first try to see if there is a shift in the serials with respect to SEQRES (happens in many cases). Only if we can't find a shift that fully matches SEQRES we proceed to re-aligning. We also treat properly now the ins codes. Mapping is as close to CIF as possible. The parsers test passes for the whole cullpdb20. For some entries like 2ofz the mapping does not coincide 100% with CIF as there are ambiguities in the alignment.
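The shift search can be sketched as follows. This is an illustration under simplifying assumptions (one-letter residue codes, no insertion codes, first matching shift wins), with hypothetical class and method names rather than the parser's actual code.

```java
class SerialShift {

    /**
     * Looks for a constant shift such that every observed residue with PDB
     * serial s matches SEQRES position s - shift (1-based).
     * Returns the shift, or null if no single shift matches all residues,
     * in which case the caller would fall back to re-aligning.
     */
    static Integer findShift(String seqres, int[] pdbSerials, char[] observed) {
        if (pdbSerials.length == 0) {
            return null;
        }
        // try placing the first observed residue at each SEQRES position
        for (int pos = 1; pos <= seqres.length(); pos++) {
            int shift = pdbSerials[0] - pos;
            if (matches(seqres, pdbSerials, observed, shift)) {
                return shift;
            }
        }
        return null;
    }

    private static boolean matches(String seqres, int[] pdbSerials, char[] observed, int shift) {
        for (int i = 0; i < pdbSerials.length; i++) {
            int pos = pdbSerials[i] - shift; // 1-based SEQRES position
            if (pos < 1 || pos > seqres.length()) {
                return false;
            }
            if (seqres.charAt(pos - 1) != observed[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Unobserved residues are handled naturally because only the observed serials are checked. More than one shift can match for short or repetitive chains, which is the kind of ambiguity mentioned above for entries like 2ofz.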

Important change: now the PDB file reader will try to read and fix the numbering of PDB files. Whenever the alignment is wrong it will realign and renumber using the jaligner package. The result of reading original PDB files now will be the same as that of reading CIF files, including proper mapping of classical PDB numbers to SEQRES residue serials (as in CIF).
There is one change of behaviour in comparison to before: when no SEQRES present the sequence is taken to be that of ATOM lines instead of padding it with Xs.
Still in this version the re-alignment is not perfect as there are some times when ambiguities occur and they are not solved (e.g. in 2nwr where alignment in an unobserved loop can be at two possible places for a GLY). That is anyway a rather minor problem (coordinates are still fine, just the chain is not ordered correctly at 1 or 2 points) and rare (~1% of files)

A few bug fixes and improvements in pdb data parsing.
- bug fix: in some cases exptl method field has more than 1 value (e.g. 2krl). In cif files this was causing a null pointer. We now parse it properly (taking first one as the exptl method) in both cif and pdb files
- improved very slightly the cif file parser moving out of loops the index getters
- drop fullLength as a field in Pdb, now we have only getFullLength()
- pdb file atom parsing is now column based and not regex based (hopefully will make it slightly faster)
- now parsing the element column of pdb files: now the atom type detection comes from the appropriate field in pdb/cif/pdbase. If in pdb file and not present we still try to guess it as before

Fixed bug: was not checking whether the uniprot japi was actually returning all requested records for ids given when using getMultipleEntries. Now checking, logging it and removing the not-found ids from the homolog list.

Overwriting default contactTypes.dat with the one needed for CMView release versions (only contains standard contact types). The old version with all esoteric contact types is renamed to contactTypes.all.dat

Bringing back the PISA classes for interface, molecule and residues. We store things first into them and only then convert to our own ChainInterface if needed. This is a much better approach less prone to errors. Tests pass.

Fixed some bugs with parsing symop.lib and conversion of transformations. Properly testing SpaceGroups and parsing now. Still one test (transpose equal inverse) does not work for trigonal and hexagonal groups.

Implemented some parallelisation in ASA calculation. It doesn't scale very well (could measure x2 speed-up with 4 CPUs in a Core2 Quad) but at least we have it, now this is starting to be better than NACCESS!

Finally ASA/BSA comparison to PISA test passes. Had to exclude one case where slightly different interface area values led to different sorting, so the comparison doesn't work. This will still be a problem whenever that happens (whenever PISA's sorting differs from ours)

Fixed bug: NACCESS (weirdly) calculates slightly different ASA values for same molecule in different orientation (i.e. it depends on the choice of axes)!! That produces then some strange BSA values (including some negative) and slight discrepancies. Thus now we run NACCESS always for each of the 2 isolated molecules of the interface and then the complex (and so losing efficiency).

Fixed bug: the bsa values of residues of the different interfaces were being updated on the same references, we have to copy before creating ChainInterface objects in getAllInterfaces.
Implemented interface pdb file outputting in enumerateInterfaces
Still interfaces tests don't pass because of 2 problems: a) order of first/second molecule not necessarily same as PISA (and in case of same chain codes can't catch it with chain codes), b) in many cases there are many minor discrepancies, usually associated to a lot of small negative bsa values.

Fixed bug: bsa values were wrongly calculated.
Now sort of interfaces is always descending.
Some more testing for individual residues asa and bsa values.
Still an issue: individual bsas are not matching pisa's even though total interface areas are fine.

Now PISA interfaces using same classes as our own calculated interfaces. Removed all Pisa specific classes (except for connector and parser). Introduce new class ChainInterfaceList which also simplifies some things.
PdbAsymUnitTest passes.

Now elimination of redundant interfaces not based only on count of edges, but also on identity of each of the contacts (atom's serials, atom codes, residue serials, residue types). Added more test cases. Seems to be working fine.
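The idea of comparing contacts by identity rather than by count can be sketched as below. Each contact is reduced to a string key; the key format, class and method names are assumptions for illustration and not the project's actual classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class InterfaceContacts {

    /** Builds a key from the identifying fields of the two atoms in a contact. */
    static String contactKey(int atomSerial1, String atomCode1, int resSerial1, String resType1,
                             int atomSerial2, String atomCode2, int resSerial2, String resType2) {
        return atomSerial1 + ":" + atomCode1 + ":" + resSerial1 + ":" + resType1
            + "|" + atomSerial2 + ":" + atomCode2 + ":" + resSerial2 + ":" + resType2;
    }

    /** True only if both interfaces contain exactly the same set of contacts. */
    static boolean sameContacts(String[] contactsA, String[] contactsB) {
        Set<String> a = new HashSet<String>(Arrays.asList(contactsA));
        Set<String> b = new HashSet<String>(Arrays.asList(contactsB));
        return a.equals(b);
    }
}
```

Two interfaces with the same number of contacts but different atoms involved are then kept as distinct, which a count-only comparison cannot guarantee.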

First fully working version (this time hopefully for real...) of interface enumeration. Works for many examples, still there are problems in many others due to area discrepancies with pisa, but those are minor things. The elimination of duplicates is based on chain codes and number of contacting atoms, it is of course possible that 2 different interfaces happen to have the same chain codes and same number of contacting atoms (but unlikely). Would need a more fine grained comparison.
New test class to compare automatically to PISA output.

Committing a version that 1st) properly places the symoped generated units of the first unit cell in the first unit cell (and not elsewhere like before) 2nd) does not miss any interfaces because of just doing half of the neighbours (e.g. in 7odc it wouldn't work). Now no interfaces are missing but lots of duplicates are found. So not a fully working version yet.

Fixed bug: using the negative side of the 26 translations was causing an incorrect number of interfaces reported for 1pmm and 1pmo. Not sure why but for the moment reverting to doing the positive side.
Changed cutoff to 5.9, then both 1pmm and 1pmo match pisa.
Better reporting of symops (now based on matrices converted to algebraic notation), before was not totally correct.
Still problems with the code as it does not match pisa for some examples (probably many), e.g. for 7odc reports 4 instead of 6 interfaces.

Changed default cutoff to coincide with PISA, entry 1pmm was a good clue (one interface has a 1 atom contact with 6 cutoff).
Now going to the negative side (instead of positive) when translating the unit cell to coincide with PISA operators.

First implementation of enumeration of interfaces (based on crystal transformations), equivalent to PISA's. This is still a very rough implementation (far too slow) and not totally correct: some of the interfaces are duplicated. Needs more work but still good as a starting point.

Implementation of class for KendallsTau Correlation Calculation
Implementation of class for Check for Correctness of Casp server model files
Implementation of class for Calculation of geometric scoring

Now catching the case when the best translation contains stop codons. This is usually due to a wrong genetic code, but for the moment it's still very difficult to know the genetic code from encoding organelle+organism taxonomy, so this is the temporary solution.

Fixed bug: not allowing anymore the presence of gaps when choosing the representative CDS to uniprot match. Now the nucleotide alignments should be always correct (before they were shifted if there were gaps in CDS-to-uniprot)

Now returning null for the representative CDS when the gene encoding organelle is not nucleus/plasmid. Before the whole program would stop, which wasn't ideal. Eventually we will need to use the proper genetic code when needed.

Fixed bug: when retrieving (or reading from cache) embl cds sequences and embl dbfetch doesn't have a certain identifier, then we were adding nulls to the list of emblcds sequences of the UniprotHomologList (resulting in a null pointer down the line)

Fixed bug: getNucleotideAlignment was nullpointing when encountering a null return from getRepresentativeCDS() (which happens whenever there is no representative CDS for the particular UniprotEntry). Now checking for nulls before using the homolog in the alignment.

Fixed bug: was nullpointing at retrieveUniprotKBData because a single uniprot id can have multiple blast hits and we were using the uni ids as unique identifiers => the lookup map was failing. Now the lookup map is from uniIds to lists of homologs (hits)
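A lookup map from ids to lists of hits can be sketched as follows. Hits are represented here as plain strings and the class and method names are hypothetical; the real code would hold homolog objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HomologLookup {

    /**
     * Groups hits by uniprot id. Each element of hits is a pair {uniId, hitLabel}.
     * Several hits sharing the same id end up in the same list instead of
     * overwriting each other.
     */
    static Map<String, List<String>> groupByUniId(String[][] hits) {
        Map<String, List<String>> lookup = new HashMap<String, List<String>>();
        for (String[] hit : hits) {
            String uniId = hit[0];
            List<String> list = lookup.get(uniId);
            if (list == null) {
                list = new ArrayList<String>();
                lookup.put(uniId, list);
            }
            list.add(hit[1]);
        }
        return lookup;
    }
}
```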

Fixed bug in TcoffeeRunner. The t_coffee process was hanging after spawning. This didn't happen before, it can be related to the version of t_coffee (I'm using 8.14). Basically t_coffee seems to be doing something weird with stderr and stdout, java doesn't like it and makes the process hang (as usual java has a lot of problems in this area, see StreamGobbler class). The way to solve it is use the -quiet switch of t_coffee that can be used to redirect output to a file or have a complete quiet terminal.

Fixed blast parsing bug: was parsing the Hit_def tag instead of the Hit_id tag for the subject id. When formatdb is run with option "-o T" then indices are properly generated and Hit_id contains the correct tag.

New SiftsConnection and SiftsFeature classes for SIFTS pdb to uniprot mapping. Implemented the HasFeatures interface in Pdb. In the future we can hopefully use that mechanism for all features (sec. structure, scop and so on)

Bugfix: hard-coded paths to files aapairsBounds.dat and contactTypes.dat had to be updated following the refactoring (this was already fixed by previous commit); introduced proper error messages if files are not found

New package runners for Runner classes. At the moment only contains the new class NaccessRunner (moved out of Pdb class)
Introduced a config file for tests so that one can set externally the executable paths and any other necessary data to run tests (at the moment only implemented in PdbTest.java). PdbTest now should run anywhere (if you have a local pdbase installation!)
Moved the consurf parser out of Pdb to its own class ConsurfConnection in the connection package.

Adding new package owl.connections with connector classes to online databases and services. Starting with PhosphoSitePlusConnection for accessing a database of posttranslational modifications and PDomainsConnection for accessing a very useful webservice which calculates structural domains using a variety of methods including Scop, Cath, DomainParser, ProteinDomainParser and others.

Reorganised the project with a src folder for java source files.
Added a jars dir with all jars needed for the project.
Added .project and .classpath pointing to relative path of jars.
The project should now work out of the box after a check-out with eclipse. No need to setup external jars or anything.