Pipeline for the prediction of nuclear localisation

About

Predicting whether a protein localises to the nucleus helps to infer function and can assist in protein annotation.
It can also be used to determine whether secreted parasitic proteins could translocate to the nucleus and interfere with gene expression and other nuclear processes.

This pipeline combines several commonly used nuclear localisation predictors with one alternative method to produce a consensus on whether a protein is likely to be nuclear.

Initially, proteins are assessed with TMHMM to determine whether they have a transmembrane domain and are thus unlikely to be nuclear localised. Next, three methods are used to predict nuclear localisation. Method 1 uses a combination of PredictNLS and ACCpro to detect surface-exposed nuclear localisation signals; Method 2 uses NucPred to ascertain whether a protein spends some time in the nucleus; and Method 3 mines InterProScan-derived Gene Ontology (GO) terms for terms linked to a nuclear-associated function.
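
As a rough illustration of the workflow described above, the following Python sketch filters on the TMHMM result and then counts how many of the three methods support nuclear localisation. It is only a sketch under stated assumptions: how nuc_loc.pl actually combines the results is not described here, and the names Evidence and nuclear_support are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Per-protein results; all field names are hypothetical."""
    has_tm_domain: bool      # TMHMM predicts a transmembrane domain
    exposed_nls: bool        # Method 1: PredictNLS + ACCpro
    nucpred_positive: bool   # Method 2: NucPred
    nuclear_go_term: bool    # Method 3: InterProScan-derived GO terms


def nuclear_support(ev):
    """Return None for transmembrane proteins, otherwise the number
    of methods (0 to 3) that support nuclear localisation.

    Assumption: a simple vote count stands in for the consensus step.
    """
    if ev.has_tm_domain:
        return None
    return sum([ev.exposed_nls, ev.nucpred_positive, ev.nuclear_go_term])


if __name__ == "__main__":
    print(nuclear_support(Evidence(False, True, False, True)))
    print(nuclear_support(Evidence(True, True, True, True)))
```

Returning a count rather than a yes/no answer leaves the choice of threshold (any method, a majority, or all three) to the user.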

Software Requirements

PredictNLS

  1. Download predictNLS from the Rost Lab: https://rostlab.org/owiki/index.php/Packages
    Ideally download either the Ubuntu/Debian or RHEL/SUSE/CentOS packages, as the source package takes a bit of configuring to get working
  2. Install as per the provided instructions; the binary should then be on your path (default /usr/bin/)
  3. If you want to add NLS patterns of your own, add them to /usr/share/predictnls/data/ in the following format:

    KVKRx{13}KKPK Potential 0 0 0

    Make sure the trailing 0 fields are tab-separated, or you will get the error:

    unexpected line from file 'xxxxx.txt': 'inputSequence KVKRx{13}KKPK Potential
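
Because tabs and spaces look alike in most editors, it can help to check a custom pattern line before installing it. The Python sketch below tests only the tab-separated 0 fields mentioned above; the function name ends_with_tab_zeros is hypothetical, and treating exactly three trailing fields as required is an assumption based on the example line.

```python
def ends_with_tab_zeros(line, n_zeros=3):
    """Return True if the last n_zeros tab-separated fields are all '0'."""
    fields = line.rstrip("\n").split("\t")
    # There must be at least one field before the zeros (the pattern itself).
    return len(fields) > n_zeros and all(f == "0" for f in fields[-n_zeros:])


if __name__ == "__main__":
    with_tabs = "KVKRx{13}KKPK\tPotential\t0\t0\t0"
    with_spaces = "KVKRx{13}KKPK Potential 0 0 0"
    for candidate in (with_tabs, with_spaces):
        print(repr(candidate), ends_with_tab_zeros(candidate))
```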

NucPred

  1. Download the NucPred tar archive from the Stockholm Bioinformatics Centre: http://www.sbc.su.se/~maccallr/nucpred/source/
  2. Untar the archive as follows to a directory of your choice:

    tar -xzvf nucpred-1.1.tar.gz

  3. Add the path of the extracted directory to the $nucpred_root variable in the nuc_loc.pl script

ACCpro

  1. Download v4 of ACCpro from UCI: http://download.igb.uci.edu/
    Note that v5 is bundled with the SCRATCH-1D package, but because the whole SCRATCH suite has to be run at once, runtimes are quite long, so we have chosen to stick with v4
  2. Untar the archive as before to a directory of your choice
  3. Check the permissions of the extracted folder
  4. Follow the instructions in readme.txt
  5. Make sure that glibc.i686 and compat-libstdc++.i686 (provides libstdc++.so.5) are installed
  6. Add the installation path to the $ACCpro_root variable in the nuc_loc.pl script

InterProScan

  1. Download InterProScan from: ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan
  2. Untar the archive as before to a directory of your choice
  3. Follow the installation instructions. The default databases are sufficient for nuc_loc to run.
  4. Add the installation path to the $interproscan_root variable in the nuc_loc.pl script

TMHMM

  1. Download TMHMM from http://www.cbs.dtu.dk/cgi-bin/sw_request?tmhmm
  2. Untar the archive as before to a directory of your choice
  3. Follow the instructions in README
  4. Add the installation path to the $tmhmm_root variable in the nuc_loc.pl script

Gene Ontology Database


You can either use the public Gene Ontology database with the connection details at http://www.geneontology.org/GO.database.shtml#online or install your own as follows:
  1. Download and install the MySQL client and server from http://dev.mysql.com (include the development packages if installing via yum, dpkg, etc.)
  2. Download the full MySQL GO database (12GB) from http://www.geneontology.org/GO.downloads.database.shtml
  3. Follow the installation instructions at http://archive.geneontology.org/latest-full/README
  4. Create a MySQL user to access the database
  5. Add the connection details to the relevant variables in the nuc_loc.pl script

GO_link

  1. Download GO_link from http://bioinformatics.childhealthresearch.org.au/software/go_link
  2. Follow the installation instructions on the site above (note that you will have already installed the Gene Ontology Database)
  3. Perform an analysis using the GO term for nucleus (GO:0005634) or use the nucleus_go_terms.tsv file in the download
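
The idea behind Method 3, checking each protein's GO terms against a set of nucleus-linked terms, can be sketched in Python as follows. This is an illustration rather than the pipeline's own code: the protein identifiers and the function name nuclear_by_go are hypothetical, and the small hand-written term set stands in for the full list that GO_link or nucleus_go_terms.tsv would supply (the layout of that file is not described here, so it is not parsed).

```python
# GO:0005634 (nucleus) is the term named above; the other two are
# nucleus-related GO terms included purely for illustration.
NUCLEAR_TERMS = {
    "GO:0005634",  # nucleus
    "GO:0005730",  # nucleolus
    "GO:0005654",  # nucleoplasm
}


def nuclear_by_go(protein_terms, nuclear_terms=NUCLEAR_TERMS):
    """Map each protein ID to the sorted nucleus-linked GO terms it carries.

    Proteins with no nucleus-linked term are left out of the result.
    """
    hits = {}
    for protein_id, terms in protein_terms.items():
        shared = set(terms) & set(nuclear_terms)
        if shared:
            hits[protein_id] = sorted(shared)
    return hits


if __name__ == "__main__":
    # Hypothetical InterProScan-derived annotations.
    annotations = {
        "prot_A": ["GO:0005634", "GO:0005737"],
        "prot_B": ["GO:0005886"],
    }
    print(nuclear_by_go(annotations))
```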

Perl Modules


Install the following modules from CPAN:
Bio::SeqIO
Cwd
GO::AppHandle
Getopt::Long
Pod::Usage
Venn::Chart
threads
Thread::Semaphore
threads::shared
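
Before running the pipeline it can be useful to confirm that every module in the list loads. The Python sketch below relies on the standard `perl -MModule -e 1` idiom, which exits with a non-zero status when a module cannot be loaded; the function names are hypothetical, and the script assumes that perl is on your path.

```python
import subprocess

MODULES = [
    "Bio::SeqIO", "Cwd", "GO::AppHandle", "Getopt::Long", "Pod::Usage",
    "Venn::Chart", "threads", "Thread::Semaphore", "threads::shared",
]


def perl_check_command(module):
    """Build the command that asks perl to load a single module."""
    return ["perl", f"-M{module}", "-e", "1"]


def missing_modules(modules, run=subprocess.run):
    """Return the modules that perl fails to load.

    The run argument defaults to subprocess.run and can be swapped
    for a stand-in when perl is not available.
    """
    missing = []
    for module in modules:
        result = run(perl_check_command(module), capture_output=True)
        if result.returncode != 0:
            missing.append(module)
    return missing
```

Calling missing_modules(MODULES) returns an empty list when everything is installed; any names it does return still need to be installed from CPAN.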

Code

You can download the two Perl scripts that run this analysis here.
Untar the archive to a directory of your choice and set the configuration variables in nuc_loc.pl as described above.

Example

Run perl nuc_loc.pl --help for usage instructions.
Example data is provided in the example directory; running the pipeline on it should reproduce the results in the example/out directory.

Contact

The nuc_loc pipeline was written by Richard Francis as part of his PhD in Bioinformatics at the University of Western Australia.