Saturday, July 06, 2013

FASTX-Toolkit is a useful set of tools for quality clipping, computing quality statistics, etc. Run with just the -i and -o parameters, it will complain if your file is in Sanger FASTQ format. There is an undocumented parameter, -Q, that takes the quality offset. For instance, to run on a Sanger FASTQ file, do the following:
./fastx_quality_stats -i tmp -Q 33 -o tmp.out
But if the file is in Solexa format, you don't have to specify the -Q option.
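If you are not sure which offset a file uses, a crude heuristic is to look for quality characters below ';' (ASCII 59), which cannot occur in Phred+64 data. Here is a sketch demonstrated on a toy record (the file name and record are made up for illustration):

```shell
# Heuristic sketch: guess the FASTQ quality offset from the quality characters.
printf '@r1\nACGT\n+\n!#5I\n' > example.fastq   # toy Sanger-encoded record
awk 'NR % 4 == 0 {
       for (i = 1; i <= length($0); i++)
         if (substr($0, i, 1) < ";") { print "Phred+33 (Sanger): use -Q 33"; found = 1; exit }
     }
     END { if (!found) print "Phred+64 (Solexa/old Illumina) likely" }' example.fastq
```

This only inspects what is present in the file, so a short file of uniformly high-quality reads can still be misclassified as Phred+64.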

However, you always need a script in your toolbox that can convert one FASTQ format to another. I found a nice script online, probably written by Heng Li.

It is slow: running single-threaded on the head node of a cluster with 49 GB of memory, it took approximately one hour for a 25 GB input file.

Sunday, June 23, 2013

Trouble comes as a package when you start your lab. In my effort to resurrect our old database, Eumicrobedb.org, I face new challenges every day. What earlier seemed to be a cakewalk does not look that easy now. I installed so many open-source programs in my last stint that I lost count of them. Running this software, I realize that many more dependencies need to be installed. Not to mention my troubles dealing with a crashed server and the subsequent loss of all installations :((((

Recently I was stuck for more than three days on what should have been easy installations: samtools-0.1.19 and bedtools-2.17.0. Earlier versions compiled fine on my SUSE Linux machine, but the same sources give trouble on Red Hat Linux.

While compiling BEDTools, it complained about "undefined reference to `gzopen64'". Searching the forums, I figured out that it was complaining about zlib. I checked; it was there on my server and also in the path. Finally, changing the Makefile as follows got the program to compile.

Look for the line in the Makefile that says 'export LIBS' and change it to:

export LIBS = YOUR_PATH/libz.so.1.2.7 -lz

Then do a make clean and make. It should compile fine.

Samtools: With samtools, my problem lasted a very long time. It always exited with the error:

bam_import.c:76: undefined reference to `gzopen64'

I tried re-installing zlib, adding the zlib path to LD_LIBRARY_PATH, and installing the latest version of zlib; nothing worked. Finally I changed the line in the Makefile that says:

CFLAGS = -g -Wall -O2

to:

CFLAGS = -g -Wall -O2 -L /usr/local/lib

(this is where my zlib libraries are located). Voilà... it got installed.

NOTE: Install as a regular user. Later you can copy the binaries to the system directories if you wish.

Sunday, May 19, 2013

Installing CEGMA can be a bit of a challenge for most users, including me! I had installed it earlier, but this time it really gave me a hard time on RHEL 6.2. Apart from installing a set of prerequisites, the installation of GeneWise is a bit of a challenge, especially since its status is now archival. The INSTALL instructions that come with the package are not very useful.

Go to wise-2.4.1/src/HMMer2, then replace 'getline' with 'my_getline' in sqio.c. Also replace 'isnumber' with 'isdigit' in src/models/phasemodel.c.

Then check the makefile in each directory under src, and replace 'glib-config --libs' with 'pkg-config --libs glib-2.0' and 'glib-config --cflags' with 'pkg-config --cflags glib-2.0'.
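The makefile substitution can be scripted rather than done by hand. A sketch, assuming GNU sed, demonstrated here on a toy makefile; in the real tree, run it from wise-2.4.1/ so it hits every makefile under src/:

```shell
# Demonstrate the glib-config -> pkg-config substitution on a toy makefile.
mkdir -p src/demo
printf 'CFLAGS = `glib-config --cflags`\nLIBS = `glib-config --libs`\n' > src/demo/makefile
find src -name 'makefile' -exec sed -i \
    -e 's/glib-config --cflags/pkg-config --cflags glib-2.0/g' \
    -e 's/glib-config --libs/pkg-config --libs glib-2.0/g' {} +
cat src/demo/makefile   # both calls now go through pkg-config
```

Check whether your tree names the files 'makefile' or 'Makefile' and adjust the -name pattern accordingly.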

Thursday, April 18, 2013

The Network File System (NFS) is a great way to use your resources efficiently, especially when your machines are connected over a network. There is a great article here [http://www.tldp.org/HOWTO/NFS-HOWTO/] that guides you through how to achieve this. For the impatient, I am putting a shorter step-by-step version here, so that you can get it done quickly.

As a prerequisite, you need root access on all the servers you are trying to connect.

1. Edit the /etc/exports file to have the following entries:
/usr/data nnn.n.n.n(rw,sync)
/usr/data mmm.m.m.m(rw,sync)

where /usr/data on the local machine is the directory you want to share with the machines at IPs nnn.n.n.n and mmm.m.m.m. Here rw means read-write, and sync means writes are committed synchronously.
You can change this file on all the servers you want to share items with.

2. Make entries in your /etc/hosts file, which should read something like:
127.0.0.1 xxx.localdomain localhost
nnn.n.n.n yyy.localdomain
where your machine has the name 'xxx', the IP address of the machine you are trying to bring into the network is nnn.n.n.n, and that machine's name is 'yyy'. This way you can keep adding as many machines to the network as you wish.

3. On newer machines you can make the following entries in your /etc/hosts.deny file:
portmap:ALL
lockd:ALL
mountd:ALL
rquotad:ALL
statd:ALL
All these entries are harmless; nothing happens if your system does not run a particular service. But it is a great way to protect your machine.
4. In the /etc/hosts.allow file, add the following lines:
portmap:nnn.n.n.n , mmm.m.m.m
lockd:nnn.n.n.n , mmm.m.m.m
rquotad:nnn.n.n.n , mmm.m.m.m
mountd:nnn.n.n.n , mmm.m.m.m
statd:nnn.n.n.n , mmm.m.m.m
5. Run rpcinfo -p to check whether portmapper, rquotad, mountd, nfs, amd, status, and nlockmgr are running.

(where nnn.n.n.n and mmm.m.m.m are the two other machines on your network to which you wish to give NFS access)

6. Run exportfs -ra to make nfsd re-read /etc/exports.

7. Check /proc/filesystems and see if there is an entry for nfs. If it is there, you are good to go.

8. Finally, mount the file system. Make sure the mount package is installed on your computer.
Suppose you are root on machine xxx.x.x.x and you want the /usr/data drive of machine nnn.n.n.n mounted at /share/data in your local file system. Do the following:
mount nnn.n.n.n:/usr/data /share/data
and keep adding mounts for the other machines as well.

Do the same on the other machines if you want drives from your machine xxx.x.x.x to be mounted elsewhere. Now you are done!!
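Note that mounts made this way disappear on reboot. A sketch of a client-side /etc/fstab entry that makes the mount persistent, using the same placeholder address and paths as above (the rw,sync,hard options are a common starting point, not the only sensible choice):

```
# /etc/fstab on the client machine: remount the NFS share at boot
nnn.n.n.n:/usr/data  /share/data  nfs  rw,sync,hard  0  0
```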

Friday, March 22, 2013

Lately, multiple people have asked me to solve the problem of training Augustus on their organism's data. Although I have done it before, this time it took me an unusually long time to sort out. Augustus wants you to provide a training dataset in either GenBank or GFF format. It typically wants users to remove redundant genes (proteins with more than 70% identity) to avoid overfitting, and also to remove any overlapping genes from the training files. If you cannot find a script that does this efficiently, please contact me directly and I will provide one (sorry, I have yet to set up my source-code page).

I generated a GenBank file from my GFF and annotation files (again, the script can be sent upon request). However, I had overlooked a few things: my scripts were leaving empty lines between two genomic LOCUS entries. If empty lines are left in, the randomSplit.pl script that splits the GenBank training file into two random parts fails: you get either all your data in the .test file or all of it in the .train file. Suppose you decide to move on with a .train file containing all the data and an empty .test file. Surprise: you end up with a std::bad_alloc() error from Augustus that has nothing to do with your system's memory. So, if you have an auto-generated GenBank file with empty lines, open the file in vim and remove the empty lines with :g/^$/d. Once done, split your *.gb file with randomSplit.pl. This will do the magic. Now train the program and it should run fine.

Here is a checklist of the commands for training Augustus:

perl scripts/new_species.pl --species=sojae
perl scripts/randomSplit.pl psv5.gb 30
./bin/etraining --species=sojae psv5.gb.train
# mock prediction
./bin/augustus --species=sojae psv5.gb.test | tee firsttest.out
grep -A 22 Evaluation firsttest.out
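If you would rather not open each file interactively, the vim command :g/^$/d has a scriptable equivalent. A sketch, assuming GNU sed, demonstrated on a toy GenBank-like file:

```shell
# Non-interactive equivalent of vim's :g/^$/d (delete all empty lines).
printf 'LOCUS A\n\nLOCUS B\n' > demo.gb   # toy file with one blank line
sed -i '/^$/d' demo.gb
cat demo.gb                               # prints the two LOCUS lines, blank line gone
```

This removes only fully empty lines; lines containing whitespace would need the pattern /^[[:space:]]*$/ instead.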

This is what you get when you run the last command. This is a pretty good indicator that your training is almost good to go.