#If It is in the hash table , We are just gonna take its ID and concatenate with the Id of other sequence that was exaclty the same.

+

#if it is already in the hash table, we're just gonna concatenate the ID of the current sequence to another one that is already in the hash table

else:

else:

sequences[sequence]+="_"+seq_record.id

sequences[sequence]+="_"+seq_record.id

Revision as of 18:43, 9 July 2011

Description

I want to share my script using biopython to clean sequences up , you should know analyzing poor data takes CPU time and interpreting the results from poor data takes people time, so it's always important to make a preprocessing.

Let me call my script as “Sequence_cleaner” and the big idea is to remove duplicate sequences, remove too short sequences ( the user defines the minimum length) and remove sequences which have too many unknown nucleotides (N) ( the user defines the % of N is allows ) and in the end the user can choose if he/she wants to have a file as output or print the result.

Script

from Bio import SeqIO
def sequence_cleaner(fasta_file,min_length=0,por_n=100):
#create our hash table to add the sequences
sequences={}#Using the biopython fasta parse we can read our fasta inputfor seq_record in SeqIO.parse(fasta_file, "fasta"):
#Take the current sequence
sequence=str(seq_record.seq).upper()#Check if the current sequence is according to the user parametersif(len(sequence)>=min_length and(float(sequence.count("N"))/float(len(sequence)))*100<=por_n):
# If the sequence passed in the test "is It clean?" and It isnt in the hash table , the sequence and Its id are going to be in the hashif sequence notin sequences:
sequences[sequence]=seq_record.id#if it is already in the hash table, we're just gonna concatenate the ID of the current sequence to another one that is already in the hash tableelse:
sequences[sequence]+="_"+seq_record.id#Write the clean sequences#Create a file in the same directory where you ran this script
output_file=open("clear_"+fasta_file,"w+")#Just Read the Hash Table and write on the file as a fasta formatfor sequence in sequences:
output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
output_file.close()#sequence_cleaner("Aip_coral.fasta",10,10)"You should call the function 'sequence_cleaner', there are 3 basics parameters:
"#1st: your fasta file " #2nd: the user define the minimum length (default value 0 ( means you dont care to minimum length)
"#3rd: the user define the % of N is allows (default value 100 ( means you dont care to 'N' in your sequences))"FYI If You dont care to the 2nd and the 3rd parameters you just gonna remove the duplicate sequences
"