miP3.py Version 1
Installation
------------
miP3.py does not require installation. Unpack the zipped file, install the requirements listed in "Dependencies", set up the databases as instructed in "Create the result database" and "Create BLAST reference databases", and then follow the "Example project" outlined below.
What is it?
-----------
miP3.py is a Python program that finds homologous proteins lacking any of the domains specified by the user. Magnani et al.[1] used it to predict possible microProteins. It has to be run from the command line.
Cite
----
If you want to cite this program, use
Magnani, Enrico, Niek de Klein, Hye-In Nam, Jung-Gun Kim, Kimberly Pham, Elisa Fiume, Mary Beth Mudgett, and Seung Yon Rhee. "A comprehensive analysis of microProteins reveals their potentially widespread mechanism of transcriptional regulation." Plant Physiology (2014): pp-114.
Authors
-------
Niek de Klein, Sue Rhee, Enrico Magnani
Contributors
------------
Michael Banf
Contact
------
niekdeklein@gmail.com
Dependencies
------------
Internet connection
Python 2.7.x - obtainable from http://www.python.org/download/releases/2.7/
Biopython - obtainable from http://biopython.org/wiki/Download
BLAST+ - obtainable from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.
Because BLAST+ does not work when it is run from Python and the file path contains spaces, the program only works when called from the directory in which it is located.
MySQL community server and Workbench - http://dev.mysql.com/downloads/mysql/
MySQLdb - http://sourceforge.net/projects/mysql-python/
If you are on a Mac and you get "EnvironmentError: mysql_config not found", take a look at http://stackoverflow.com/a/20468916/651779 and http://stackoverflow.com/questions/7475223/mysql-config-not-found-when-installing-mysqldb-python-interface?answertab=votes#tab-top
Also, don't forget to start MySQL. On Unix, you can start it by typing
mysql.server start
in the command prompt.
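To check that the server really is up before running miP3.py, a small script like the one below can help. It is only a sketch: it assumes MySQL listens on its default TCP port 3306 on the local machine, and the function name mysql_port_is_open is made up for this example. It only tests that something accepts connections on that port, not that your username and password are correct.

```python
import socket


def mysql_port_is_open(host="127.0.0.1", port=3306, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    Assumption: the MySQL server uses its default port 3306.
    """
    try:
        conn = socket.create_connection((host, port), timeout)
    except socket.error:
        # Connection refused, timed out, or host unreachable.
        return False
    conn.close()
    return True
```

Calling mysql_port_is_open() with no arguments tests 127.0.0.1:3306. If it returns False, start MySQL as described above and try again.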
Create the result database
--------------------------
Before running the programs you have to make the database. Open MySQL Workbench and click New Connection. In version 6.1 of MySQL Workbench this is the + next to MySQL Connections. Choose a hostname, username and password. Open the newly made connection by double clicking on it. This opens the query screen. Click on File in the upper left and select Open SQL Script. Open the file "MiP prediction SQL FEB 27.sql" provided with miP3.py. Finally, execute the query by clicking on the lightning bolt icon above the script.
Create BLAST reference databases
--------------------------------
Before running the program you have to set up the BLAST reference databases. If you installed all the dependencies, you will have installed the BLAST+ tools. One of the BLAST+ tools is makeblastdb. Open the terminal and change directory to the bin folder of the BLAST+ tools:
cd path/to/ncbi_blast_2.2.29+/bin
Now we can run makeblastdb. We need two databases, one containing all proteins, and one containing the small proteins. We can download all Arabidopsis thaliana proteins from ftp://ftp.arabidopsis.org/home/tair/Proteins/TAIR10_protein_lists/, and we save it as all_proteins.fasta. Then we run (add ./ at the start if it does not recognize the program)
makeblastdb -dbtype prot -out "path/to/outfile" -in "path/to/all_proteins.fasta"
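If you prefer to build the databases from a Python script, the same command can be assembled as an argument list and passed to subprocess. This is a minimal sketch, not part of miP3.py. The function names are hypothetical, and the only makeblastdb options used are the three shown in the command above (-dbtype, -out, -in).

```python
import subprocess


def makeblastdb_command(fasta_path, out_prefix, executable="makeblastdb"):
    """Build the makeblastdb argument list for a protein database."""
    return [executable, "-dbtype", "prot", "-out", out_prefix, "-in", fasta_path]


def build_protein_db(fasta_path, out_prefix, executable="makeblastdb"):
    """Run makeblastdb; raises subprocess.CalledProcessError on failure."""
    subprocess.check_call(makeblastdb_command(fasta_path, out_prefix, executable))
```

Passing the arguments as a list avoids shell quoting problems. If the program is not on your path, pass executable="./makeblastdb" while you are in the BLAST+ bin folder.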
We also need a small proteins database. You can retrieve the small proteins yourself, or you can use all_proteins_to_small_proteins.py provided with miP3.py. If you are in the miP3 folder, you can run it from the terminal with (on Windows, use C:/path/to/python/python.exe instead of python)
python all_proteins_to_small_proteins.py -p path/to/all_proteins.fasta -o outfile/path/small_proteins.fasta -s 200
Here we wrote all proteins of 200 amino acids or shorter from all_proteins.fasta to small_proteins.fasta. Now we can use small_proteins.fasta to make a small proteins reference database with (add ./ at the start if it does not recognize the program)
./makeblastdb -dbtype prot -out "path/to/outfile" -in "path/to/small_proteins.fasta"
makeblastdb will create three files, a .phr, .pin and .psq file. Make sure that you keep these together.
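If you would rather select the small proteins yourself, the length filter described earlier in this section can be pictured with the sketch below. It is not the source of all_proteins_to_small_proteins.py. The function names read_fasta and small_proteins are invented for this example, and it uses plain Python rather than Biopython to stay self-contained.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def small_proteins(records, max_length=200):
    """Keep only records whose sequence has at most max_length residues."""
    for header, sequence in records:
        if len(sequence) <= max_length:
            yield header, sequence
```

To use it on a file, pass an open file handle to read_fasta, for example small_proteins(read_fasta(open("all_proteins.fasta")), 200), and write each kept record back out as a ">" header line followed by its sequence.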
Lastly, we need to set up a Pfam database for searching protein domains. Download the Pfam reference database (Pfam_LE.tar.gz) from http://www.biowebdb.org/pub/cdd/little_endian/ and extract it wherever you want. If you want to make your own Pfam database, use ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release and makeprofiledb from the BLAST+ tools.
Now that we have set up all our reference databases, we can run the program.
*NOTE*
A downside of makeblastdb (and the other BLAST+ tools) is that, on Windows, file paths cannot contain spaces. So if the fasta file you want to use to make the database is in a path that contains spaces, you will have to copy it to a path that does not contain any spaces.
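The copy step from the note above can also be scripted. The helper below is a sketch under the assumption that you pick a destination folder yourself. The name copy_to_space_free_dir is hypothetical, and the function refuses to copy if the resulting path would still contain a space (for example, because the file name itself has one).

```python
import os
import shutil


def copy_to_space_free_dir(src_path, dest_dir):
    """Copy src_path into dest_dir and return the new path.

    Raises ValueError if the resulting path would still contain a space.
    """
    dest_path = os.path.join(os.path.abspath(dest_dir), os.path.basename(src_path))
    if " " in dest_path:
        raise ValueError("destination still contains a space: %r" % dest_path)
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    shutil.copy(src_path, dest_path)
    return dest_path
```

The returned path can then be given to makeblastdb with -in.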
Example project
---------------
After installing all the dependencies (see "Dependencies"), creating the database for the results (see "Create the result database") and setting up the BLAST databases (see "Create BLAST reference databases"), miP3 can be run from the command line using python.
Open your terminal if you use Unix, or the command prompt if you use Windows. Change directory to the directory where miP3.py is located. In my case, this is C:\Users\example\miP3
cd C:\Users\example\miP3
Then, run the program using python. On Unix this can usually be done simply with
sudo python miP3.py
Windows does not always add python automatically to the environment path. If this is the case, use the complete path to the python executable. In my case, this is C:\Python\python.exe.
C:\Python\python.exe miP3.py
This starts the miP3.py program, which will ask for several inputs. Here we go through them one by one.
What is the host of your database (Default: 127.0.0.1):
If you haven't created the database yet, follow the steps from "Create the result database". There, you had to fill in a host for the database. The default host in MySQL is 127.0.0.1 (the local machine). If you did not change the host name in MySQL, you need the default and can just press Enter.
What is the user of your database (Default: root):
When making the database you also had to fill in the username. root is the default in MySQL. If you did not change that, press Enter.
What is the password of your database: *******
Type the password of the database and press Enter. If you didn't give a password when creating your MySQL database, just press Enter.
What is the name of your database (Default: MiP):
If you followed the steps in "Create the result database", the database name is MiP prediction. If you changed the database name, fill in your changed name. Otherwise, press Enter.
If the file is in the same folder: insert name of the file with all the proteins of the organism, else insert the whole pathway of the database with all the proteins(e.g. C:\\exampleFolder\\all_proteins.fasta) and press enter: all_proteins.fasta
I downloaded all Arabidopsis thaliana proteins from ftp://ftp.arabidopsis.org/home/tair/Proteins/TAIR10_protein_lists/ and saved it in the same folder as miP3.py under the name all_proteins.fasta, so I type "all_proteins.fasta" (without quotes) and Enter.
If the file is in the same folder: insert name of the file with all the proteins of the organism, else insert the whole pathway of the database with all the proteins(e.g. C:\\exampleFolder\\small_proteins.fasta) and press enter: small_proteins.fasta
You can make a fasta file containing only the small proteins yourself, or you can run the command line program all_proteins_to_small_proteins.py as described in "Create BLAST reference databases".
If the file is in the same folder: insert name of the database with all the proteins of the organism, else insert the whole pathway of the database with all the proteins(e.g. C:\exampleFolder\allProtDatabaseExample): allprot
If you followed the steps in "Create BLAST reference databases", you can select the all proteins database. I saved the database files in the same folder as miP3.py under the name allprot, so I type "allprot" (without the quotes) and press Enter.
If the file is in the same folder: insert name of the small proteins database file, else insert the whole pathway of the small proteins database file (e.g. C:\exampleFolder\smallProteinDatabaseExample) and press enter: smallprot
The same as with the all proteins database, I saved the smallprot database in the same folder as miP3.py, so I type "smallprot" (without quotes) and Enter.
If the file is in the same folder: insert name of the PFAM database file, else insert the whole pathway of the Pfam database file and press enter: Pfam_LE/Pfam
The same as with the all proteins database, I saved the Pfam database folder in the same folder as miP3.py, so I type "Pfam_LE/Pfam" (without quotes) and Enter.
If the file is in the same folder: insert name of the transcription factor file, else insert the whole pathway of the transcription factor file(e.g. C:\exampleFolder\transcriptionFactorExampleFile.fasta) and press enter: transcription_factor.fasta
Here we need the file containing the proteins of interest. For Arabidopsis thaliana, we can download all transcription factors from here: http://planttfdb.cbi.pku.edu.cn/download.php. We call this file "transcription_factor.fasta" and save it in the same folder as miP3.py. Then we type "transcription_factor.fasta" (without quotes) and press Enter.
Proteins with which Pfam database domains do you want to select out? These are selected out by default: [list of PFAM domains]
If you do not want to add any domains, press enter. If you do want to add domains, add the ID found here http://pfam.sanger.ac.uk/family/browse?browse=a. When adding several domains seperate them with a ,[space]. Example: MAD, MadL, MAGE. If you want to delete the default list and make your own list write delete, [ID] e.g. delete, MAD, MadL, MAGE:
This includes most of the DNA binding domains, so we just press Enter.
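The three kinds of answer the prompt accepts (plain Enter, extra IDs, or "delete" followed by a new list) can be summarised in code. The function below is an illustrative sketch of that behaviour as the prompt describes it, not the code miP3.py actually runs. The name apply_domain_answer is invented for this example, and DefaultA and DefaultB are placeholders rather than real entries from the default list.

```python
def apply_domain_answer(answer, default_domains):
    """Return the Pfam domain IDs to select out after the user's answer."""
    tokens = [token.strip() for token in answer.split(",") if token.strip()]
    if not tokens:
        # Plain Enter: keep the default list unchanged.
        return list(default_domains)
    if tokens[0] == "delete":
        # "delete, A, B": discard the defaults and use only the new IDs.
        return tokens[1:]
    # "A, B": add the new IDs to the defaults, skipping duplicates.
    combined = list(default_domains)
    for token in tokens:
        if token not in combined:
            combined.append(token)
    return combined


# Placeholder defaults, for illustration only.
defaults = ["DefaultA", "DefaultB"]
print(apply_domain_answer("MAD, MadL, MAGE", defaults))
print(apply_domain_answer("delete, MAD, MadL, MAGE", defaults))
```

The first call prints the two placeholders followed by MAD, MadL and MAGE. The second prints only MAD, MadL and MAGE.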
For this example we only use RAV (AT1G51120.1), bHLH (AT1G01260.1) and TALE (AT1G23380.1), as running the program with all transcription factors can take some time. Lastly, we need a file with InterPro domain names that we do not want. The file containing all the DNA binding names can be downloaded from ?. The Pfam names can be extracted from the third column. We save this in Pfam.txt. We save all the files in C:\Users\example\miP3.
We downloaded the BLAST program from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. The folder was saved in the same location as miP3.py.
Now we can run the program from the command line (if we are in the same directory as miP3.py) with:
python miP3.py -p all_proteins.fasta -i example_tfs.fasta -f Pfam.txt -o miP_output.csv -b ncbi_blast_2.2.29+/bin/
The result is now written to the folder in which miP3.py is located. It is a tab-delimited file. The first column is the name of the predicted miP. The second column lists the transcription factors that are homologous to the predicted miP. The third column lists the domains that the predicted miP contains. The final column is the length of the predicted miP.
This shows that our example search found KNATM (AT1G14760.2), a homologue of TALE (AT1G23380.1), one of the confirmed microProteins.
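To inspect the output in Python, the four columns described above can be read with a few lines of code. This is a sketch that assumes exactly the layout given in the text (tab-separated, four columns, length last). The function name parse_mip_output and the dictionary keys are made up for this example, and any line whose last field is not a whole number, such as a possible header row, is skipped.

```python
def parse_mip_output(lines):
    """Parse tab-delimited result lines into a list of dictionaries."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue
        try:
            length = int(fields[3])
        except ValueError:
            # Not a data row (for example a header line); skip it.
            continue
        rows.append({
            "mip": fields[0],             # name of the predicted miP
            "homologous_tfs": fields[1],  # homologous transcription factors
            "domains": fields[2],         # domains found in the predicted miP
            "length": length,             # length of the predicted miP
        })
    return rows
```

For example, parse_mip_output(open("miP_output.csv")) returns one dictionary per predicted miP, which can then be sorted or filtered by length.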
Type in the official name of the organism (e.g. Arabidopsis thaliana): Arabidopsis thaliana
Our example is using Arabidopsis thaliana, so we type that and Enter.
With what e-value do you want to BLAST the transcription factors against all the proteins? (default 0.01):
We use the default value, so we press Enter.
With what e-value do you want to BLAST the transcription factors against the small (<200aa) proteins? (default 1):
We use the default value, so we press Enter.
What is the pathway of the directory that contains blastp and rpsblast?: path/to/blast_directory/bin/
This depends on where the BLAST tools are installed. blastp and rpsblast (and the other BLAST tools) are found in the bin/ directory. Note that on Windows this does not work if the file path contains spaces, so you will have to copy the directory to a path without spaces. On Windows, you also have to use \ as the path separator.
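Before answering this prompt, it can save time to confirm that the directory really contains both programs. The check below is a sketch. The function name missing_blast_tools is hypothetical, and it assumes that on Windows the executables carry an .exe extension.

```python
import os


def missing_blast_tools(bin_dir, tools=("blastp", "rpsblast")):
    """Return the tools not found in bin_dir (an empty list if all are there)."""
    missing = []
    for tool in tools:
        # Accept both the bare name (Unix) and name + ".exe" (assumed for Windows).
        candidates = [os.path.join(bin_dir, tool),
                      os.path.join(bin_dir, tool + ".exe")]
        if not any(os.path.isfile(path) for path in candidates):
            missing.append(tool)
    return missing
```

An empty list means the directory is a usable answer to the prompt. Otherwise the returned names tell you which program is missing.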
Now the program will run. It will write a couple of files to the folder in which miP3.py is located. Two of these files are MiPsequences.fasta and TFsequences.fasta. These contain the fasta sequences of the putative miPs and the TFs, respectively. To filter the microProteins further, we have to run InterProScan on these proteins. Instructions for installing a local copy of InterProScan can be found here: http://code.google.com/p/interproscan/wiki/InterProScan5RC3#Obtaining_a_copy_of_InterProScan_5. InterProScan only works on Linux.
iprscan -cli -i ~/MiPsequences.fasta -iprlookup -goterms -o path/to/outfile/MiPsequences.fasta.ipr_raw -format raw -verbose >~/ipr.run 2> ~/ipr.errors
iprscan -cli -i ~/TFsequences.fasta -iprlookup -goterms -o path/to/outfile/TFsequences.fasta.ipr_raw -format raw -verbose >~/ipr.run 2> ~/ipr.errors
This will write MiPsequences.fasta.ipr_raw and TFsequences.fasta.ipr_raw to the folder path/to/outfile/. Now run the filter script interproScanHandle.py
c:\Python\python.exe interproScanHandle.py
This will also ask for the database connection and several input files. For the database connection, give the same answers as you gave to miP3.py. This updates the results database with domain information. Finally, we can run the script that makes the results table, MakeCompleteMiPtable.py.
c:\Python\python.exe MakeCompleteMiPtable.py
Again, this will ask for the connection information of the result database. Give the same information as previously. It writes the result to MiP_table.csv, which can be found in the same folder as where miP3.py is located.
1. Magnani, Enrico, Niek de Klein, Hye-In Nam, Jung-Gun Kim, Kimberly Pham, Elisa Fiume, Mary Beth Mudgett, and Seung Yon Rhee. "A comprehensive analysis of microProteins reveals their potentially widespread mechanism of transcriptional regulation." Plant Physiology (2014): pp-114.