GRASPx

ABSTRACT

  • Background: Metagenomics is a cultivation-independent approach that enables the study of the genomic composition of microbes present in an environment. Metagenomic samples are routinely sequenced using next-generation sequencing technologies that generate short nucleotide reads. Proteins identified from these reads are mostly of partial length. On the other hand, de novo assembly of a large metagenomic dataset is computationally demanding and the assembled contigs are often fragmented, so the protein sequences identified from them are also partial and incomplete. Annotation of an incomplete protein sequence often proceeds by identifying its homologs in a database of reference sequences. Identifying the homologs of incomplete sequences is challenging and can result in substandard annotation of proteins from metagenomic datasets. To address this problem, we recently developed a homology detection algorithm named GRASP (Guided Reference-based Assembly of Short Peptides) that identifies the homologs of a given reference protein sequence in a database of short peptide metagenomic sequences. GRASP implements a simultaneous alignment and assembly algorithm for annotating short peptides identified on metagenomic reads. The program achieves a significantly improved recall rate at the cost of computational efficiency. In this article, we adopt three techniques to speed up the original version of GRASP: the pre-construction of extension links, local assembly of individual seeds, and the implementation of query-level parallelism.

    Results: The resulting new program, GRASPx, achieves a >30X speedup compared to its predecessor GRASP. At the same time, we show that the performance of GRASPx is consistent with that of GRASP, and that both significantly outperform other popular homology-search tools, including the BLAST and FASTA suites. GRASPx was also applied to a human saliva metagenome dataset and shows superior performance in both recall and precision.

    Conclusions: In this article we present GRASPx, a fast and accurate homology-search program implementing a simultaneous alignment and assembly framework. GRASPx can be used for more comprehensive and accurate annotation of short peptides.

    GRASPx: Efficient homolog-search of short peptide metagenome database through simultaneous alignment and assembly.

  • Full Article

    README

    GRASPx is written in GNU C++ (g++ 4.8.1) and has been tested on
    a 64-bit GNU/Linux system. GRASPx contains three components:
    'graspx-build', 'graspx-assemble', and 'graspx-map'. 'graspx-build'
    indexes the reads and needs to be run only once for
    each sequencing data set. 'graspx-assemble' performs the actual
    assembly and outputs contigs that are homologous to the queries.
    'graspx-map' performs the post-mapping that aligns individual reads
    back to the predicted contigs.

     

    NOTE: GRASPx currently supports only individual read lengths of up to
    255 amino acids and up to 40 million reads.
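
Because of the limits in the note above, it can be useful to check a read file before indexing it. The following is a minimal sketch, not part of GRASPx: the function name `check_read_limits` and the demo reads are hypothetical, and it assumes the reads are in plain FASTA format with sequences possibly wrapped over several lines.

```python
import io

MAX_READ_LENGTH = 255        # amino acids per read, taken from the NOTE above
MAX_READ_COUNT = 40_000_000  # reads per data set, taken from the NOTE above


def check_read_limits(fasta_lines, max_len=MAX_READ_LENGTH, max_reads=MAX_READ_COUNT):
    """Scan FASTA lines and return (read_count, too_long_headers, count_ok).

    too_long_headers lists the headers of reads longer than max_len;
    count_ok is True when the number of reads does not exceed max_reads.
    """
    count = 0
    too_long = []
    header = None
    length = 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: finish the previous one first.
            if header is not None and length > max_len:
                too_long.append(header)
            header = line[1:]
            length = 0
            count += 1
        elif header is not None:
            # Sequence lines may be wrapped, so lengths are accumulated.
            length += len(line)
    if header is not None and length > max_len:
        too_long.append(header)
    return count, too_long, count <= max_reads


# Hypothetical in-memory example: the second read is 300 residues long.
demo = io.StringIO(">read_1\nMKVLA\n>read_2\n" + "A" * 300 + "\n")
count, too_long, count_ok = check_read_limits(demo)
print(count, too_long, count_ok)
```

For a real data set, the same function could be called on an open file handle, for example `check_read_limits(open("./examples/sample_reads.faa"))`.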

     


     

    To run GRASPx:

     

    Assume that you are in the GRASPx directory
    (e.g. /home/your_name/GRASPx_release):

     

      (1) Run 'graspx-build' to generate the index files:

        "./bin/graspx-build ./examples/sample_reads.faa"

        The index files are written under './WorkSpace' by default. You can
        specify a different output directory for the index files if necessary.

     

      (2) Run 'graspx-assemble' to assemble contigs:

        "./bin/graspx-assemble ./examples/sample_query.faa ./examples/sample_reads.faa out_contigs"

        The index files created under './WorkSpace' are required.
        The assembled contigs will be written to 'out_contigs' as a
        multi-fasta file.

     

      (3) Run 'graspx-map' to align reads to the contigs:

        "./bin/graspx-map ./examples/sample_reads.faa out_contigs out_mappings"

        The read mappings will be written to 'out_mappings' as a
        tab-delimited file with the columns 'read_ID, read_header, contig_header'.
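
The mapping file described in step (3) can be summarized with a few lines of Python. This is only a sketch under the assumption that each line holds exactly the three tab-separated columns named above (read_ID, read_header, contig_header) with no header row; the function name `reads_per_contig` and the demo rows are hypothetical.

```python
import io
from collections import defaultdict


def reads_per_contig(mapping_lines):
    """Group read IDs by contig header.

    Each input line is expected to look like:
        read_ID<TAB>read_header<TAB>contig_header
    Lines with fewer than three columns are skipped.
    """
    groups = defaultdict(list)
    for line in mapping_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # blank or malformed line
        read_id, contig_header = fields[0], fields[2]
        groups[contig_header].append(read_id)
    return dict(groups)


# Hypothetical rows standing in for the contents of 'out_mappings'.
demo = io.StringIO(
    "0\tread_a\tcontig_1\n"
    "1\tread_b\tcontig_1\n"
    "2\tread_c\tcontig_2\n"
)
for contig, read_ids in reads_per_contig(demo).items():
    print(contig, len(read_ids))
```

Counting the reads recruited by each contig in this way gives a quick view of how much of the data set supports each predicted homolog.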

     

    Note: please use '--help' to check out more advanced options.
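
The three steps above can be chained from a small driver script. The sketch below only uses the positional arguments shown in steps (1) to (3); the helper names `graspx_commands` and `run_pipeline` and the `bin_dir` parameter are hypothetical, and no additional options are assumed.

```python
import shlex
import subprocess


def graspx_commands(reads, query, contigs_out, mappings_out, bin_dir="./bin"):
    """Build the argument lists for the build, assemble, and map steps."""
    return [
        [f"{bin_dir}/graspx-build", reads],
        [f"{bin_dir}/graspx-assemble", query, reads, contigs_out],
        [f"{bin_dir}/graspx-map", reads, contigs_out, mappings_out],
    ]


def run_pipeline(commands):
    """Run each step in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = graspx_commands(
    "./examples/sample_reads.faa",
    "./examples/sample_query.faa",
    "out_contigs",
    "out_mappings",
)
# Print the commands for inspection; call run_pipeline(commands) from the
# GRASPx directory to actually execute them.
for argv in commands:
    print(shlex.join(argv))
```

Since 'graspx-build' needs to run only once per data set, the first command can be dropped from the list when the index files already exist.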

     


     

    If you find GRASPx useful, please cite the following papers:

    Cuncong Zhong, Youngik Yang, and Shibu Yooseph, "GRASP: Guided
    Reference-based Assembly of Short Peptides", Nucleic Acids Research (2014)

    and

    Cuncong Zhong, Youngik Yang, and Shibu Yooseph, "GRASPx: an Efficient
    Reference-guided Short Peptide Assembler", manuscript in preparation

     

    Contact: Cuncong Zhong (czhong@jcvi.org) or Shibu Yooseph (syooseph@jcvi.org)

     


     

    The authors would like to acknowledge that GRASP includes (part of) the
    BOOST library (http://www.boost.org) and the libdivsufsort library
    (http://code.google.com/p/libdivsufsort/).
