Summary: In this work, we identified all genes in three large metagenomes (Great Prairie Soil Metagenome Grand Challenge, Global Ocean Sampling, Human Gut Microbiome) that lack detectable homology to the conserved domain databases (CDD, Pfam) and to the NCBI non-redundant (nr) database. These ORFan sequences were then subjected to remote-homology detection to identify potential distant homologs among protein families from the Protein Data Bank (PDB).
|CDSs removed containing conserved domain matches (Pfam+CDD)||2,480,274||4,542,071||4,674,912|
|Spurious (singleton, short and repetitive) CDSs removed||2,758,146||11,458,304||6,603,567|
|CDSs removed with BLAST matches to nr database (orphans)||282,869||951,863||1,071,580|
|Candidate functional ORFans||85,422||251,857||146,842|
|ORFan CD-HIT clusters||33,013||73,428||32,078|
|Annotated (HHsuite) ORFan CDSs||21,358||38,900||13,638|
|Annotated (HHsuite) ORFan CD-HIT clusters||7,848||10,973||3,119|
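The counts above can be turned into per-dataset annotation rates. The following is a minimal sketch that copies the values from the "Candidate functional ORFans" and "Annotated (HHsuite) ORFan CDSs" rows; because the table columns are not labeled here, the datasets are referred to only by column position, and the variable names are illustrative.

```python
# Counts copied from the table rows, in the table's column order.
# The column-to-metagenome mapping is not stated in the table itself,
# so columns are identified by position only.
candidate_orfans = [85_422, 251_857, 146_842]   # "Candidate functional ORFans"
annotated_orfans = [21_358, 38_900, 13_638]     # "Annotated (HHsuite) ORFan CDSs"

# Fraction of candidate ORFans that received an HHsuite annotation.
fractions = [
    annotated / candidate
    for annotated, candidate in zip(annotated_orfans, candidate_orfans)
]

for column, fraction in enumerate(fractions, start=1):
    print(f"column {column}: {fraction:.1%} of candidate ORFans annotated")
```

The same calculation applies to the two CD-HIT cluster rows if cluster-level rates are preferred.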
Metalloprotease ORFan predictions: The following files contain predicted ORFans that show remote homology to metalloprotease families and also carry an HExxH (zinc-binding) catalytic motif. The corresponding raw sequence data (FASTA) can be found in the files above.
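For readers who want to check the motif in the FASTA files themselves, the sketch below scans protein sequences for HExxH, where x is any residue. It is an illustration only: the description does not say how the motif was detected in this work, and the function names `read_fasta` and `find_hexxh` are hypothetical helpers, not part of the dataset.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# H, E, any two residues, H. The lookahead lets overlapping motifs be reported.
HEXXH = re.compile(r"(?=(HE[A-Z]{2}H))")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_hexxh(sequence: str) -> List[int]:
    """Return 0-based start positions of every HExxH match in a protein sequence."""
    return [match.start() for match in HEXXH.finditer(sequence.upper())]


# Made-up example records, not sequences from the dataset.
example = [">seq1", "MKHELGHAAA", ">seq2", "MKAAAGG"]
hits = {name: find_hexxh(seq) for name, seq in read_fasta(example)}
```

To scan a real file, pass an open file handle to `read_fasta` in place of the example list.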
"Orphans": metagenomic proteins that have BLAST matches to homologs in the NCBI nr database but could not be functionally annotated.