SRGD - Data and Methods:

SRGD was initially developed by Wang & Brendel (The ASRG database). Using their set of 395 splicing-related genes in Arabidopsis (ASRG395) as a starting point, Chen & Brendel (Identification and Survey of Splicing-related Proteins in 10 Plant Species (unsubmitted)) surveyed 10 plant genomes for splicing-related genes. This page provides links to the data sets, software, and scripts used in that work.

Scripts and Pipeline:

A three-round BLASTp search was used to identify pre-mRNA splicing-related proteins in 10 plant species as follows:

Step 1:

Initially, a comprehensive set of 395 pre-mRNA splicing-related proteins in Arabidopsis was downloaded from the ASRG database. This set will be referred to as AtSRP (ASRG395, available at /SRGD/Atortho/gdna/ASRG395). Complete sets of predicted protein sequences for the 10 plant species, derived from the respective genome annotations, were obtained from the sources listed in the dataset section.

Step 2:

AtSRP was then used as the query in a local BLASTp search against each of the annotated protein sets. All hits with an e-value below 10^-20 were retained for further analysis. The BLASTp results for each species can be downloaded from the following table, where they are named At_** (the BLASTp result of AtSRP against **), with ** standing for one of the 10 plant species.

The command lines are:

$ formatdb -i hugefasta -p T    # build a BLAST database from protein sequences (-p T = protein)

$ blastall -i infile -d hugefasta -p blastp -o out -m 8    # -m 8 = tabular output
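The blastall command above does not set an e-value cutoff, so the 10^-20 threshold has to be applied to the tabular output afterwards. The following is a minimal sketch of one way to do that, assuming the standard 12-column -m 8 layout with the e-value in column 11. The helper name filter_hits, the file names, and the sequence IDs are made up for illustration and are not part of the SRGD scripts.

```shell
#!/bin/sh
# filter_hits is a hypothetical helper, not one of the SRGD scripts.
# $1: tabular (-m 8) BLAST output, $2: e-value cutoff such as 1e-20
filter_hits() {
    # "+ 0" forces a numeric comparison of the e-value in column 11
    awk -F '\t' -v cutoff="$2" '($11 + 0) < (cutoff + 0)' "$1"
}

# Made-up sample rows in -m 8 column order:
# query subject %id length mismatches gaps qstart qend sstart send evalue bitscore
printf 'AtQ1\tOsP1\t62.5\t320\t110\t4\t1\t318\t5\t322\t3e-95\t345\n'  >  sample.m8
printf 'AtQ1\tOsP2\t31.0\t150\t95\t6\t10\t155\t20\t165\t2e-08\t52.1\n' >> sample.m8
printf 'AtQ2\tOsP3\t48.2\t410\t200\t9\t3\t405\t1\t402\t0.0\t780\n'     >> sample.m8

# Keep hits below the cutoff, then list the unique subject IDs (column 2)
filter_hits sample.m8 1e-20 > sample.filtered.m8
cut -f 2 sample.filtered.m8 | LC_ALL=C sort -u > sample.hit_ids.txt
```

The list of unique subject IDs is the kind of input needed when the hits are pooled for a further search round.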

Step 3:

To identify potential additional homologs missed in the initial search, all hits from the first round were retrieved, pooled, and used as the query in a second local BLASTp search against the combined set of annotated proteins from all 10 species. New hits at an e-value cutoff of 10^-20 were added to the set of candidate plant pre-mRNA splicing-related proteins. The BLASTp result is listed in the following table as all_all.
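The pooling step can be sketched as follows: pull the sequences of the first-round hits out of the combined protein FASTA file to form the second-round query, then report which second-round subjects were not already in the pool. This is only an illustration under stated assumptions. The helper name extract_fasta, all file names, and the sequence IDs are hypothetical, and sequence IDs are assumed to be the first word after ">" in each FASTA header.

```shell
#!/bin/sh
# extract_fasta is a hypothetical helper, not one of the SRGD scripts.
# $1: file with one sequence ID per line, $2: FASTA file
extract_fasta() {
    awk -v ids="$1" '
        BEGIN { while ((getline line < ids) > 0) want[line] = 1 }
        /^>/  { id = substr($1, 2); keep = (id in want) }
        keep' "$2"
}

# Made-up combined protein set and pooled first-round hit IDs (sorted)
printf '>OsP1 desc\nMKV\nLLA\n>OsP2\nMTT\n>ZmP9\nMAA\n' > all_proteins.fa
printf 'OsP1\nZmP9\n' > pooled_ids.txt

# Query for the second BLASTp round
extract_fasta pooled_ids.txt all_proteins.fa > round2_query.fa

# Made-up subject IDs of second-round hits that passed the e-value cutoff
printf 'OsP1\nSbP4\nZmP9\nSbP4\n' > round2_subject_ids.txt

# New candidates = second-round subjects absent from the first-round pool
LC_ALL=C sort -u round2_subject_ids.txt |
    LC_ALL=C comm -13 pooled_ids.txt - > new_ids.txt
```

comm expects both inputs to be sorted in the same collation order, which is why both sides are sorted under LC_ALL=C in this sketch.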

The following table provides links to the BLAST output files for each species:

Step 4:

All candidate splicing-related proteins were searched against themselves with BLASTp to obtain the pairwise sequence similarities required as input for OrthoMCL. The BLASTp output from step 3 (which is also the OrthoMCL input) and the OrthoMCL output are provided via the following links:

The result, all_orthoMCL, can be found under the OrthoMCL directory and contains all genes from the all2all_blastp.out result. A CSV version is also available: all_orthoMCL.csv
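For readers who want to flatten a cluster file into one row per gene, the sketch below shows one possible conversion. It assumes the OrthoMCL 1.4-style layout, in which each line looks like "ORTHOMCL0(3 genes,2 taxa):" followed by members written as gene(taxon). The actual layout of all_orthoMCL and the columns of all_orthoMCL.csv may differ. The helper name orthomcl_to_csv, the file names, and the gene IDs are made up for illustration.

```shell
#!/bin/sh
# orthomcl_to_csv is a hypothetical helper, not one of the SRGD scripts.
# $1: cluster file in the assumed "NAME(n genes,m taxa): gene(taxon) ..." layout
orthomcl_to_csv() {
    awk 'BEGIN { print "cluster,gene,taxon" }
    {
        cluster = $1
        sub(/\(.*/, "", cluster)          # "ORTHOMCL0(3" -> "ORTHOMCL0"
        sub(/^[^:]*:[ \t]*/, "")          # drop the header up to the colon
        for (i = 1; i <= NF; i++) {       # remaining fields are the members
            gene = $i; taxon = ""
            if (match(gene, /\([^()]*\)$/)) {
                taxon = substr(gene, RSTART + 1, RLENGTH - 2)
                gene  = substr(gene, 1, RSTART - 1)
            }
            print cluster "," gene "," taxon
        }
    }' "$1"
}

# Made-up two-cluster example
printf 'ORTHOMCL0(3 genes,2 taxa):\t At1g01010(ath) Os01g0100100(osa) At2g02020(ath)\n' >  sample_orthomcl.txt
printf 'ORTHOMCL1(2 genes,2 taxa):\t At3g03030(ath) Zm00001(zma)\n'                     >> sample_orthomcl.txt

orthomcl_to_csv sample_orthomcl.txt > sample_orthomcl.csv
```

One row per gene keeps the cluster membership easy to join against other per-gene tables.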

Step 5:

For each gene cluster, CIWOG was used to identify the common intron positions and types. For each cluster, two files were built for processing with the CIWOG software. One file contained the MUSCLE alignment of the proteins in the cluster, and the other contained the information CIWOG requires in its own format, including gene names, gene structures, transcription start and stop sites, translation start and stop codons, and genome sequences.

Perl scripts were written to process the annotation file, the genome file, and the GFF file to generate the CIWOG information file (for Mt, the genome sequences are already included in the GFF file). These files can be downloaded from the dataset section and placed in the same folder to run the Perl script.

Because the gene names for Lj differ between the annotation file and PlantGDB, only 9 plant species are included in the CIWOG result.

(1) The GFF file and XML file were used to generate the CIWOG information file, which was further formatted for each cluster based on the information in PlantGDB.

(2) MUSCLE was used to generate the alignment file for each cluster.

(3) For each cluster, the alignment file and the CIWOG information file were saved in a single directory named after the cluster number. Both files carry the same name as the directory but have different suffixes. For example, for cluster 1 the directory is named 1 and consists of two files: 1.aln (alignment file) and 1.ciwog (CIWOG information file), which can also be downloaded here.
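The naming convention described in (3) can be checked before running CIWOG. The following is a minimal sketch, assuming all cluster directories sit under a parent directory called clusters. That directory name, the helper name check_cluster_dir, and the placeholder file contents are hypothetical and not part of the SRGD scripts.

```shell
#!/bin/sh
# check_cluster_dir is a hypothetical helper, not one of the SRGD scripts.
# It succeeds only if directory N holds non-empty N.aln and N.ciwog files.
check_cluster_dir() {
    dir=${1%/}                    # strip a trailing slash, if any
    name=$(basename "$dir")       # the directory name doubles as the file stem
    [ -s "$dir/$name.aln" ] && [ -s "$dir/$name.ciwog" ]
}

# Made-up layout: cluster 1 is complete, cluster 2 lacks its .ciwog file
mkdir -p clusters/1 clusters/2
printf 'placeholder alignment\n'  > clusters/1/1.aln
printf 'placeholder CIWOG info\n' > clusters/1/1.ciwog
printf 'placeholder alignment\n'  > clusters/2/2.aln

# Report the state of every cluster directory
for d in clusters/*/; do
    if check_cluster_dir "$d"; then
        echo "$(basename "$d") ok"
    else
        echo "$(basename "$d") incomplete"
    fi
done > cluster_report.txt
```

A check like this only verifies names and non-empty files. It says nothing about whether the contents are valid MUSCLE or CIWOG input.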