Supplementary Data For Lobb et al. (2015)

Summary: In this work, we identified all genes in three large metagenomes (Great Prairie Soil Metagenome Grand Challenge, Global Ocean Sampling, Human Gut Microbiome) that lack detectable homology to entries in conserved domain databases (CDD, Pfam) and in the NCBI non-redundant (nr) database. These ORFan sequences were then subjected to remote-homology detection to identify distant relationships to protein families represented in the Protein Data Bank (PDB).

Numbers of CDSs and ORFans at key stages of metagenomic ORFan identification
Predicted CDSs                                                 5,606,711  17,204,095  12,496,901
CDSs removed containing conserved domain matches (Pfam+CDD)    2,480,274   4,542,071   4,674,912
Spurious (singleton, short and repetitive) CDSs removed        2,758,146  11,458,304   6,603,567
CDSs removed with BLAST matches to nr database (orphans)         282,869     951,863   1,071,580
Candidate functional ORFans                                       85,422     251,857     146,842
ORFan CD-HIT clusters                                             33,013      73,428      32,078
Annotated (HHsuite) ORFan CDSs                                    21,358      38,900      13,638
Annotated (HHsuite) ORFan CD-HIT clusters                          7,848      10,973       3,119
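
The rows of the table above are internally consistent: subtracting the three "removed" rows from the predicted CDSs gives the candidate functional ORFan counts. The short Python sketch below checks this arithmetic and reports the share of candidate ORFans that received an HHsuite annotation. The variable and function names are illustrative only and are not part of the supplementary files; the lists follow the table's left-to-right column order.

```python
# Counts copied from the table above, one value per metagenome column,
# in the same left-to-right order as the table.
predicted = [5_606_711, 17_204_095, 12_496_901]
removed = {
    "conserved domain matches (Pfam+CDD)": [2_480_274, 4_542_071, 4_674_912],
    "spurious (singleton, short, repetitive)": [2_758_146, 11_458_304, 6_603_567],
    "BLAST matches to nr (orphans)": [282_869, 951_863, 1_071_580],
}
reported_orfans = [85_422, 251_857, 146_842]
annotated_orfans = [21_358, 38_900, 13_638]


def remaining_after_filters(start, removed_by_stage):
    """Subtract each filtering stage from the starting counts, column by column."""
    remaining = list(start)
    for counts in removed_by_stage.values():
        remaining = [r - c for r, c in zip(remaining, counts)]
    return remaining


candidates = remaining_after_filters(predicted, removed)
print(candidates)  # [85422, 251857, 146842], matching the reported ORFan row

# Share of candidate ORFans that received an HHsuite annotation, per column.
for total, annotated in zip(reported_orfans, annotated_orfans):
    print(f"{annotated / total:.1%}")
```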


ORFan datasets (fasta files)

ORFan functional annotations

Metalloprotease ORFan predictions: The following files contain predicted ORFans that have remote homology to metalloprotease families and also contain an HExxH (zinc-binding) catalytic motif. The corresponding raw sequence data (FASTA) can be found in the files above.
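
As an illustration of the HExxH criterion, the following Python sketch scans protein sequences in FASTA format for the motif (His, Glu, any two residues, His). It is a minimal regular-expression example and not necessarily the procedure used to produce the files listed here; the function names and the demo sequences are hypothetical.

```python
import re

# HExxH: His, Glu, any two residues, His. The lookahead lets overlapping
# occurrences be reported as separate hits.
HEXXH = re.compile(r"(?=HE[A-Z]{2}H)")


def parse_fasta(text):
    """Return {sequence_id: sequence} from FASTA-formatted text."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts).upper() for name, parts in records.items()}


def find_hexxh(sequences):
    """Return {sequence_id: [1-based motif start positions]} for sequences with a hit."""
    hits = {}
    for name, seq in sequences.items():
        positions = [m.start() + 1 for m in HEXXH.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits


# Made-up demo sequences, not taken from the ORFan datasets.
demo = ">orfan_1 hypothetical\nMKTAHELGHALGLKH\n>orfan_2\nMSTNPKPQRKTKRN\n"
print(find_hexxh(parse_fasta(demo)))  # {'orfan_1': [5]}
```

A motif match alone is weak evidence, since HExxH is short and can occur by chance; in the files here it is combined with remote homology to metalloprotease families.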


"Orphans": metagenomic proteins that have BLAST matches to homologs in the NCBI nr database but could not be functionally annotated.

Orphan datasets

Orphan functional annotations