GRASP

ABSTRACT

  • Protein sequences predicted from metagenomic datasets are annotated by identifying their homologs via sequence comparisons with reference or curated proteins. However, a majority of metagenomic protein sequences are partial-length, arising as a result of identifying genes on sequencing reads or on assembled nucleotide contigs, which themselves are often very fragmented. The fragmented nature of metagenomic protein predictions adversely impacts homology detection and, therefore, the quality of the overall annotation of the dataset. Here we present a novel algorithm called GRASP that accurately identifies the homologs of a given reference protein sequence from a database consisting of partial-length metagenomic proteins. Our homology detection strategy is guided by the reference sequence, and involves the simultaneous search and assembly of overlapping database sequences. GRASP was compared to three commonly used protein sequence search programs (BLASTP, PSI-BLAST and FASTM). Our evaluations using several simulated and real datasets show that GRASP has a significantly higher sensitivity than these programs while maintaining a very high specificity. GRASP can be a very useful program for detecting and quantifying taxonomic and protein family abundances in metagenomic datasets. GRASP is implemented in GNU C++.


    README

    GRASP is written in GNU C++ (g++ 4.8.1) and has been tested on

    a 64-bit GNU/Linux system. GRASP contains two components: 'Build' and

    'Search'. 'Build' creates a one-time index of the reads against

    which the reference is searched. 'Search' performs the

    actual assembly and search using the pre-built index files.

     

    NOTE: GRASP currently supports only reads of up to

    255 amino acids in length.
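
    Because of this limit, it can help to screen the read file before
    running 'Build'. The following is a minimal sketch in Python 3, not
    part of GRASP: the function names are hypothetical, and it assumes the
    reads are in plain FASTA format with one '>' header per read. It reports
    reads by their 0-based order in the file.

```python
MAX_READ_LENGTH = 255  # limit stated in the GRASP README


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_overlong_reads(lines, max_length=MAX_READ_LENGTH):
    """Return (index, header, length) for each read longer than max_length."""
    return [
        (index, header, len(seq))
        for index, (header, seq) in enumerate(read_fasta(lines))
        if len(seq) > max_length
    ]


# Made-up example data: the second read exceeds the limit.
example = [">read_a", "MKVLA", ">read_b", "A" * 300]
overlong = find_overlong_reads(example)
```

    To check a real file, pass an open file handle instead of the list,
    e.g. find_overlong_reads(open("./example/reads.faa")).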

                                                                                                                                    

    Assume that you are in the GRASP directory:

     

    --RUNNING "Build":

      "./bin/Build ./example/reads.faa --work_space=./WorkSpace"

     

      The indexing files are to be written under the specified work space.

     

    --RUNNING "Search":

      "./bin/Search ./example/query.faa ./example/reads.faa --work_space=./WorkSpace --result_space=./Results --write_certificate=1"

     

      The indexing files created in the work space will be read,

      and the results will be written to the result space. By adding

      "--write_certificate=1", GRASP will output the alignments

      between the query sequence and the assembled sequences.

     

      NOTE: if you need to use an alternative alphabet or seed length,

      please rerun the 'Build' program with the new parameters.
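
      The 'Build' and 'Search' invocations shown above can be chained from
      a script. The following is a minimal sketch in Python 3; it is not the
      bundled ./scripts/GRASP.py, and the function names are hypothetical.
      It uses only the options shown in this README. Creating the
      directories beforehand is a precaution, not a documented requirement.

```python
import os
import subprocess


def build_command(reads, work_space, binary="./bin/Build"):
    """Argument list for the 'Build' step, as shown in the README."""
    return [binary, reads, f"--work_space={work_space}"]


def search_command(query, reads, work_space, result_space,
                   write_certificate=False, binary="./bin/Search"):
    """Argument list for the 'Search' step, as shown in the README."""
    cmd = [
        binary,
        query,
        reads,
        f"--work_space={work_space}",
        f"--result_space={result_space}",
    ]
    if write_certificate:
        cmd.append("--write_certificate=1")
    return cmd


def run_grasp(query, reads, work_space="./WorkSpace",
              result_space="./Results", write_certificate=False):
    """Run 'Build' then 'Search'; raises CalledProcessError on failure."""
    # Assumption: pre-creating the directories is harmless.
    os.makedirs(work_space, exist_ok=True)
    os.makedirs(result_space, exist_ok=True)
    subprocess.run(build_command(reads, work_space), check=True)
    subprocess.run(
        search_command(query, reads, work_space, result_space,
                       write_certificate),
        check=True,
    )
```

      Run from the GRASP directory, for example:
      run_grasp("./example/query.faa", "./example/reads.faa", write_certificate=True).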

     

    --RESULTS INTERPRETATION:

      The results can be found in the specified result space, or

      by default "./Results". A default search will create a read

      recruitment file that ends with '.reads'. Using the "--write_certificate=1"

      option will create an additional certificate file that ends with '.aln'.

     

      The recruitment file that ends with ".reads" contains reads that

      have been identified by the search, which are deemed to come

      from homologous sequences of the query. The first column

      indicates the query sequence ID (order in the query file,

      begins with 0). The second column indicates the recruited read ID

      (order in the read file, begins with 0). The third column

      indicates the best alignment E-value achieved by the assembly

      of the corresponding read.
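
      The three columns described above can be loaded and summarized with
      a short script. The following is a minimal sketch in Python 3, not
      part of GRASP: the function names are hypothetical, and it assumes the
      columns are separated by whitespace, which this README does not state.
      The sample lines are made up for illustration.

```python
from collections import defaultdict, namedtuple

Recruitment = namedtuple("Recruitment", ["query_id", "read_id", "e_value"])


def parse_recruitment(lines):
    """Parse '.reads' lines into Recruitment records.

    Assumes whitespace-separated columns: query ID, read ID, E-value.
    Lines that do not fit this layout are skipped.
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            record = Recruitment(int(fields[0]), int(fields[1]),
                                 float(fields[2]))
        except ValueError:
            continue  # e.g. a header or comment line
        records.append(record)
    return records


def count_reads_per_query(records, max_e_value=None):
    """Count distinct recruited reads per query, optionally E-value filtered."""
    reads_by_query = defaultdict(set)
    for rec in records:
        if max_e_value is not None and rec.e_value > max_e_value:
            continue
        reads_by_query[rec.query_id].add(rec.read_id)
    return {query: len(reads) for query, reads in reads_by_query.items()}


# Made-up sample lines in the assumed layout.
sample = ["0 12 1e-20", "0 45 3.5e-08", "1 12 0.002"]
records = parse_recruitment(sample)
counts = count_reads_per_query(records)
strict_counts = count_reads_per_query(records, max_e_value=1e-5)
```

      Since both IDs are 0-based positions in their input files, they can
      be mapped back to sequence names by indexing the query and read files
      in the same order.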

     

      The certificate file that ends with ".aln" contains the detailed

      alignments between the query and the assembled sequences. Each

      alignment segment corresponds to a unique assembled sequence.

      The positions at which the individual reads contribute to the

      assembly are listed at the end of each alignment segment (included reads).

      This list contains one tuple per read; each tuple gives

      a read ID and its starting position in the assembled sequence.

     

    --PYTHON WRAPPER SCRIPT

      "./scripts/GRASP.py --query=./example/query.faa --target=./example/reads.faa"

     

      A wrapper script written in Python can be found in "./scripts/GRASP.py".

      The script runs "Build" and "Search" together based on the input parameters.

      It requires Python version 2.7 or higher.

     

     


     

    If you find GRASP useful, please cite the following paper:

     

    Cuncong Zhong, Youngik Yang, and Shibu Yooseph, "GRASP: Guided

    Reference-based Assembly of Short Peptides", manuscript in preparation.

     

    Contact: Cuncong Zhong (czhong@jcvi.org) or Shibu Yooseph (syooseph@jcvi.org)

     


     

    The authors would like to acknowledge that GRASP includes (part of) the

    BOOST library (http://www.boost.org) and the libdivsufsort library

    (http://code.google.com/p/libdivsufsort/).

    DOWNLOADS