ABSTRACT
Metagenomics is the study of all genomic content present in a given microbial community. Metagenomic functional analysis aims to quantify protein families and reconstruct metabolic pathways from the metagenome, and it plays a central role in understanding the interaction between the microbial community and its host or environment. De novo functional analysis, which allows the discovery of novel protein families, remains challenging for high-complexity communities. Currently, de novo nucleotide assembly, gene calling, and peptide assembly are the three approaches used to recover novel genes or proteins. Unfortunately, their informational connections and dependencies are little recognized, and each has been formulated as an independent problem. In this work, we develop a workflow called integrated Metagenomic Protein Predictor (iMPP) that combines these three operations. iMPP leverages their informational dependencies with three novel modules: a hybrid assembly graph generation module, a graph-based gene calling module, and a peptide assembly-based refinement module. iMPP significantly improved gene calling sensitivity on unassembled, fragmented reads compared with existing methods, achieving a recall of 92%-97% at high precision (>90%). iMPP further allowed for more sensitive and accurate peptide assembly, recovering more reference proteins and delivering more hypothetical protein sequences. With this performance, iMPP has the potential to provide a more comprehensive and unbiased view of the microbial communities under investigation. iMPP is freely available from https://github.com/Sirisha-t/iMPP
DATA DOWNLOAD
Subsampled datasets (DS1-6)
DS1.reads.fq.gz
DS2.reads.fq.gz
DS3.reads.fq.gz
DS4.reads.fq.gz
DS5.reads.fq.gz
DS6.reads.fq.gz
Ground Truth References (Subsampled datasets)
DS1_ref.tar.gz
DS2_ref.tar.gz
DS3_ref.tar.gz
DS4_ref.tar.gz
DS5_ref.tar.gz
DS6_ref.tar.gz
Simulated datasets (SDS1-3)
SDS1.2x.reads.fq.gz
SDS1.5x.reads.fq.gz
SDS2.2x.reads.fq.gz
SDS2.5x.reads.fq.gz
SDS3.reads.fq.gz
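
As a quick sanity check after downloading, the read files listed above can be inspected with a few lines of Python. This is a minimal sketch, not part of iMPP: it assumes that each `.fq.gz` file is a gzip-compressed FASTQ file with standard four-line records (as the extension suggests), and the helper names `read_fastq` and `summarize` are hypothetical.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a gzipped FASTQ file.

    Assumes four-line records: '@id', sequence, '+', quality.
    """
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (or a trailing blank line)
            if not header.startswith("@"):
                raise ValueError(f"unexpected FASTQ header: {header!r}")
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header[1:], seq, qual


def summarize(path: str) -> Tuple[int, int]:
    """Return (number of reads, total number of bases) for one file."""
    n_reads = 0
    n_bases = 0
    for _, seq, _ in read_fastq(path):
        n_reads += 1
        n_bases += len(seq)
    return n_reads, n_bases
```

For example, calling `summarize("DS1.reads.fq.gz")` on a downloaded file would return its read count and total base count, which can be compared across datasets before running the workflow.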