HMM-GRASPx

ABSTRACT

Analyses of metagenome data (MG) and metatranscriptome data (MT) are often challenged by a paucity of complete reference genome sequences and the uneven/low sequencing depth of the constituent organisms in the microbial community, which respectively limit the power of reference-based alignment and de novo sequence assembly. These limitations make accurate protein family classification and abundance estimation challenging, which in turn hampers downstream analyses such as abundance profiling of metabolic pathways, identification of differentially encoded/expressed genes, and de novo reconstruction of complete gene and protein sequences from the protein family of interest. The profile hidden Markov model (HMM) framework enables the construction of very useful probabilistic models for protein families that allow for accurate modeling of position specific matches, insertions, and deletions. We present a novel homology detection algorithm that integrates the banded Viterbi algorithm for profile HMM parsing with an iterative simultaneous alignment and assembly computational framework. The algorithm searches a given profile HMM of a protein family against a database of fragmentary MG/MT sequencing data and simultaneously assembles complete or near-complete gene and protein sequences of the protein family. The resulting program, HMM-GRASPx, demonstrates superior performance in aligning and assembling homologs when benchmarked on both simulated marine MG and real human saliva MG datasets. On real supragingival plaque and stool MG datasets that were generated from healthy individuals, HMM-GRASPx accurately estimates the abundances of the antimicrobial resistance (AMR) gene families and enables accurate characterization of the resistome profiles of these microbial communities. For real human oral microbiome MT datasets, using the HMM-GRASPx estimated transcript abundances significantly improves detection of differentially expressed (DE) genes. Finally, HMM-GRASPx was used to reconstruct comprehensive sets of complete or near-complete protein and nucleotide sequences for the query protein families.

Full Article

README

1: Running individual programs

'graspxp-build' constructs the index.
Assuming that you are in the directory '/HMMGRASPx_home/',
simply type './bin/graspxp-build ./Examples/mix3fams.fa'.
The build program will create a set of index files and put them into '/HMMGRASPx_home/WorkSpace'.
'graspxp-assemble' performs targeted assembly.
Assuming that you are in the directory '/HMMGRASPx_home/'
and that you have finished the previous build step, simply type
'./bin/graspxp-assemble ./Examples/mix3fams.hmm ./Examples/mix3fams.fa raw_contigs.fa'.
The program will generate the file 'raw_contigs.fa', which is in FASTA format
and contains all contigs assembled in this stage.
'graspxp-map' aligns the short peptides against the assembled contigs.
Assuming that you are in the directory '/HMMGRASPx_home/'
and that you have finished the previous assembly step, simply type
'./bin/graspxp-map ./Examples/mix3fams.fa raw_contigs.fa mapping.list'.
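
The three commands above can be chained in a small wrapper. The following is only a sketch: the names GRASPX_HOME, DRY_RUN, run and graspx_pipeline are introduced here for illustration and are not part of the release. By default it only prints the commands it would run; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch only: chains the build, assemble and map steps described above.
# GRASPX_HOME and DRY_RUN are hypothetical names, not options of HMM-GRASPx.
GRASPX_HOME="${GRASPX_HOME:-/HMMGRASPx_home}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

run() {
    # Echo the command, then execute it unless this is a dry run.
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

# The body is a subshell, so the 'cd' does not affect the caller.
# Arguments: profile HMM, peptide reads, output contigs, output mapping table.
graspx_pipeline() (
    run cd "$GRASPX_HOME" &&
    run ./bin/graspxp-build "$2" &&
    run ./bin/graspxp-assemble "$1" "$2" "$3" &&
    run ./bin/graspxp-map "$2" "$3" "$4"
)

graspx_pipeline ./Examples/mix3fams.hmm ./Examples/mix3fams.fa raw_contigs.fa mapping.list
```

Each step runs only if the previous one succeeded, which matches the requirement that the index exists before assembly and that the contigs exist before mapping.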

IMPORTANT: This step maps the short peptides directly against the raw contigs generated in the previous step. We suggest first verifying the raw contigs by HMMER3 re-alignment (e.g. between './Examples/mix3fams.hmm' and 'raw_contigs.fa') before doing the mapping. This verification is optional but important for reducing false-positive predictions.
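
One possible way to do the re-alignment by hand is sketched below. It assumes that 'hmmsearch' from HMMER3 is on your PATH and that the raw contigs are peptide sequences. The E-value cutoff, the function names and the output file names are illustrative choices, not values taken from HMM-GRASPx or its driver script.

```shell
#!/bin/sh
# Sketch only: keep the raw contigs that re-align to the profile HMM.
# EVALUE is an illustrative cutoff, not a value recommended by HMM-GRASPx.
EVALUE="${EVALUE:-1e-3}"

# Re-align contigs ($2) against the profile HMM ($1); write a hit table to $3.
verify_contigs() {
    hmmsearch --noali -E "$EVALUE" --tblout "$3" "$1" "$2" > /dev/null
}

# Print the unique target names (first column) of a --tblout table ($1).
passing_ids() {
    awk '!/^#/ { print $1 }' "$1" | sort -u
}

# Print only the FASTA records of $2 whose ID is listed in the ID file $1.
filter_fasta() {
    awk 'NR == FNR { keep[$1] = 1; next }
         /^>/      { id = substr($1, 2); on = (id in keep) }
         on' "$1" "$2"
}

# Example usage (run from /HMMGRASPx_home/ after the assembly step):
#   verify_contigs ./Examples/mix3fams.hmm raw_contigs.fa hits.tbl
#   passing_ids hits.tbl > verified.ids
#   filter_fasta verified.ids raw_contigs.fa > verified_contigs.fa
```

The filtered file could then be passed to 'graspxp-map' in place of 'raw_contigs.fa'.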

Below you will find a driver script that automatically includes the verification process. The resulting mapping output is a table that contains three fields, namely

the ID of the read in the data set;
the header/name of the read; and
the header/name of the contig that the read is mapped onto.
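
As an illustration of how this table might be summarized, the sketch below counts the mapped reads per contig. It assumes that the three fields are separated by tabs, which is not stated above, so check your own 'mapping.list' first. The function name reads_per_contig is made up for this example.

```shell
#!/bin/sh
# Sketch only: count mapped reads per contig from the mapping table.
# Assumption: the three fields are tab-separated.
reads_per_contig() {
    awk -F '\t' '
        NF >= 3 { n[$3]++ }                       # field 3 = contig header
        END     { for (c in n) printf "%s\t%d\n", c, n[c] }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Example usage:
#   reads_per_contig mapping.list > contig_read_counts.tsv
```

The output lists one contig per line with its read count, sorted from the most to the least covered contig.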

2: Using the "RunHMMGRASPx.pl" script to streamline your analysis

The script requires HMMER3 (http://hmmer.janelia.org/). The current release should contain a copy of the software in the '/HMMGRASPx_home/ThirdParty' folder.
You need to compile HMMER3. First go to '/HMMGRASPx_home/ThirdParty/hmmer-3', and then type './configure'.
In the same directory, type 'make'. If the compilation is successful, you should find that all executables have been put into the directory
'/HMMGRASPx_home/ThirdParty/hmmer-3/binaries'.
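
The build and a quick sanity check can be sketched as follows. HMMER_DIR, build_hmmer and check_hmmer are names made up for this example, and checking only for 'hmmsearch' is an assumption about which executable matters most.

```shell
#!/bin/sh
# Sketch only: compile the bundled HMMER3 and check that it produced executables.
# HMMER_DIR is a hypothetical variable; adjust it to your installation.
HMMER_DIR="${HMMER_DIR:-/HMMGRASPx_home/ThirdParty/hmmer-3}"

# Configure and compile inside a subshell so the caller's directory is unchanged.
build_hmmer() {
    ( cd "$HMMER_DIR" && ./configure && make )
}

# Succeed only if an executable 'hmmsearch' exists in directory $1.
check_hmmer() {
    if [ -x "$1/hmmsearch" ]; then
        echo "HMMER3 executables found in $1"
    else
        echo "missing: $1/hmmsearch" >&2
        return 1
    fi
}

# Example usage:
#   build_hmmer && check_hmmer "$HMMER_DIR/binaries"
```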
After compiling HMMER3, you should be able to use the "RunHMMGRASPx.pl" script. To analyze the data given in '/HMMGRASPx_home/Examples',
simply type (assuming you are in '/HMMGRASPx_home/')
'perl Scripts/RunHMMGRASPx.pl --hmm=Examples/mix3fams.hmm --seq=Examples/mix3fams.fa --out=TestResults --home=./ --index=WorkSpace/ --param=Settings/param'
To find out the meaning of the options, please type 'perl Scripts/RunHMMGRASPx.pl'

DOWNLOADS

HMMGRASPx-release_v0.0.3_x86-64.tar.gz
HMMGRASPx-release_v0.0.2_x86-64.tar.gz
HMMGRASPx-release_v0.0.1_x86-64.tar.gz
README

HighResFigure
Figure2.pdf

SupplementaryData
Benchmark.zip
README.txt
OralSearchResults.tgz
Saliva.faa.tgz
OralGroundTruth.tgz
RESFAMResults.tgz
SimMarineGroundTruth.tgz
OralAbundanceCorrelation.tgz
OralDEResults.tgz
OralTargetedAssembly.tgz
SalivaAssembledContigs.tgz