Earth Microbiome Project
The Earth Microbiome Project (EMP) is an initiative founded by Rob Knight in 2010 to collect natural samples and analyze microbial communities around the globe.
Microbes are highly abundant and diverse, and they play an important role in the ecological system. For example, the ocean contains an estimated 1.3 × 10²⁸ archaeal cells, 3.1 × 10²⁸ bacterial cells, and 1 × 10³⁰ virus particles.[1][2] The bacterial diversity, a measure of the number of types of bacteria in a community, is estimated to be about 160 for a mL of ocean water, 6,400–38,000 for a g of soil, and 70 for a mL of sewage works.[2] Yet as of 2010, it was estimated that the total global environmental DNA sequencing effort had produced less than 1 percent of the total DNA found in a liter of seawater or a gram of soil,[3] and the specific interactions between microbes are largely unknown.
The EMP aims to process as many as 200,000 samples from different biomes, generating a complete database of microbes on Earth in order to characterize environments and ecosystems by microbial composition and interaction. Using these data, new ecological and evolutionary theories can be proposed and tested.[4]
Actors
The non-governmental international project was launched in 2010. As of January 2018, it listed 161 institutions, all of them universities and university-affiliated institutions, except for IBM Research and the Atlanta Zoo. Funding has come from the John Templeton Foundation, the W. M. Keck Foundation, the Argonne National Laboratory of the U.S. Department of Energy, the Australian Research Council, the Tula Foundation, and the Samuel Lawrence Foundation. Companies that have provided in-kind support include MO BIO Laboratories, Luca Technologies, Eppendorf, Boreal Genomics, Illumina, Roche and Integrated DNA Technologies.[5]
Goals
The primary goal of the Earth Microbiome Project (EMP) has been to survey microbial composition in many environments across the planet, across time as well as space, using a standard set of protocols. The development of standardized protocols is vital, because variations in sample extraction, amplification, sequencing and analysis introduce biases that would invalidate comparisons of microbial community structure.[6]
Another important goal is to determine how reconstruction of microbial communities is affected by analytic biases. The rate of technological advance is rapid, and it is necessary to understand how data using updated protocols will compare with data collected using earlier techniques. Information from this project will be archived in a database to facilitate analysis. Other outputs will include a global atlas of protein function and a catalog of reassembled genomes classified by their taxonomic distributions.[6]
Methods
Standard protocols for sampling, DNA extraction, 16S rRNA amplification, 18S rRNA amplification, and "shotgun" metagenomics have been developed or are under development.[7]
Sample collection
Samples will be collected using appropriate methods from various environments including deep ocean, fresh water lakes, desert sand, and soil. Standardized collection protocols will be used when possible, so that the results are comparable. Microbes from natural samples cannot always be cultured. Because of this, metagenomic methods will be employed to sequence all the DNA or RNA in a sample in a culture-independent fashion.
Wet lab
The wet lab usually needs to perform a series of procedures to select and purify the microbial portion of the samples. The purification process may differ considerably depending on the type of sample: DNA may be extracted from soil particles, or microbes may be concentrated using a series of filtration techniques. In addition, various amplification techniques may be used to increase DNA yield. For example, some researchers prefer multiple displacement amplification, which is not PCR-based. DNA extraction, the choice of primers, and PCR protocols must all follow carefully standardized protocols in order to avoid bias.[6]
Sequencing
Depending on the biological question, researchers can choose between two main approaches to sequencing a metagenomic sample. If the question is which types of organisms are present and in what abundance, the preferred approach is to target and amplify a specific gene that is highly conserved among the species of interest. The 16S ribosomal RNA gene for bacteria and the 18S ribosomal RNA gene for protists are often used as target genes for this purpose. The advantage of targeting a specific gene is that the gene can be amplified and sequenced at very high coverage. This approach, called "deep sequencing", allows rare species to be identified in a sample. However, it does not enable assembly of whole genomes, nor does it provide information on how organisms may interact with each other. The second approach is shotgun metagenomics, in which all the DNA in the sample is sheared and the random fragments are sequenced. In principle, this approach allows for the assembly of whole microbial genomes and the inference of metabolic relationships. However, if most of the microbes in a given environment are uncharacterised, de novo assembly will be computationally expensive.[8]
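The coverage difference between the two approaches can be illustrated with a rough calculation. The sketch below assumes reads are spread evenly over the target and uses the simple estimate coverage = number of reads × read length / target length. The read count, read length, and 5 Mbp genome size are hypothetical values chosen for illustration; the 1,500 bp figure is the approximate length of a 16S rRNA gene. In a real community, reads are divided among many organisms, so actual per-organism coverage would be lower.

```python
def mean_coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Average number of times each base of the target is sequenced,
    assuming reads are distributed evenly over the target."""
    return num_reads * read_length / target_length


# Hypothetical sequencing run: 100,000 reads of 150 bp each.
NUM_READS = 100_000
READ_LENGTH = 150

# Targeted approach: all reads fall on a single ~1,500 bp marker gene.
amplicon_coverage = mean_coverage(NUM_READS, READ_LENGTH, 1_500)

# Shotgun approach: the same reads spread over a hypothetical 5 Mbp genome.
shotgun_coverage = mean_coverage(NUM_READS, READ_LENGTH, 5_000_000)

print(f"amplicon: {amplicon_coverage:.0f}x, shotgun: {shotgun_coverage:.0f}x")
```

Under these assumptions the same sequencing effort covers the marker gene thousands of times but a whole genome only a few times, which is why the targeted approach is better suited to detecting rare species.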
Data analysis
EMP proposes to standardize the bioinformatics aspects of sample processing.[6]
Data analysis usually includes the following steps: 1) Data clean-up, a preprocessing step that removes reads with low quality scores and any sequences containing "N" or other ambiguous nucleotides; and 2) taxonomy assignment, which is usually done using tools such as BLAST[9] or RDP.[10] Very often, novel sequences are discovered that cannot be mapped to the existing taxonomy. In this case, taxonomy is derived from a phylogenetic tree built from the novel sequences and a pool of closely related known sequences.[11]
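The clean-up step can be sketched in a few lines. This is a minimal illustration, not the EMP pipeline: the read representation (identifier, sequence, per-base Phred quality scores), the function names, and the mean-quality threshold of 20 are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical read representation: (identifier, sequence, per-base Phred scores).
Read = Tuple[str, str, List[int]]


def is_clean(sequence: str, qualities: List[int], min_mean_quality: float = 20.0) -> bool:
    """Return True if the read has only A/C/G/T bases and an acceptable mean quality."""
    if not sequence or len(sequence) != len(qualities):
        return False
    # Reject reads containing "N" or any other ambiguous nucleotide code.
    if any(base not in "ACGT" for base in sequence.upper()):
        return False
    return sum(qualities) / len(qualities) >= min_mean_quality


def clean_reads(reads: List[Read], min_mean_quality: float = 20.0) -> List[Read]:
    """Keep only the reads that pass both filters."""
    return [read for read in reads if is_clean(read[1], read[2], min_mean_quality)]


example_reads: List[Read] = [
    ("read1", "ACGTACGT", [30] * 8),  # passes both filters
    ("read2", "ACGNACGT", [30] * 8),  # contains an ambiguous base
    ("read3", "ACGTACGT", [10] * 8),  # mean quality too low
]
kept = clean_reads(example_reads)
print([read[0] for read in kept])
```

Real pipelines typically also trim low-quality read ends rather than discarding whole reads; that refinement is omitted here.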
Depending on the sequencing technology and the underlying biological question, additional methods may be employed. For example, if the sequenced reads are too short to yield useful information, an assembly will be required. An assembly can also be used to construct whole genomes, which provide useful information on the species. Furthermore, if the metabolic relationships within a microbial metagenome are to be understood, DNA sequences need to be translated into amino acid sequences, for example by using gene prediction tools such as GeneMark[12] or FragGeneScan.[13]
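The translation step itself can be illustrated with the standard genetic code. The sketch below is a naive translation of a single reading frame and assumes the coding region and frame are already known; it is not how GeneMark or FragGeneScan work, since those tools also predict where genes start and end. The function name and the use of "X" for unrecognized codons are choices made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate one reading frame of a DNA sequence, stopping at the first stop codon.

    Codons containing ambiguous bases are reported as "X".
    """
    dna = dna.upper()
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

For unannotated metagenomic reads, all six reading frames (three per strand) would normally have to be considered, which is part of what gene prediction tools handle.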
Project output
The four key outputs from the EMP have been:[14]
- All primary data generated from the Earth Microbiome Project, regardless of their degree of conclusiveness, will be stored in a centralized database called the "Gene Atlas" (GA). The GA will hold sequence data, annotations and environmental metadata. Both known and unknown sequences, the latter referred to as "Dark Matter", will be included in the hope that, given time, the unknown sequences may eventually be characterized.
- Assembled genomes, annotated using an automated pipeline, will be stored in "Earth Microbiome Assembled Genomes" (EM-AG) in public repositories. These will enable comparative genomic analysis.
- Interactive visualizations of the data will be provided through the "Earth Microbiome Visualization Portal" (EM-VIP), which will allow the relationship between microbial makeup, environmental parameters, and genomic function to be viewed.
- Reconstructed metabolic profiles will be offered through "Earth Microbiome Metabolic Reconstruction" (EMMR).
Challenges
The large amounts of sequence data generated from analyzing diverse microbial communities are a challenge to store, organize and analyse. The problem is exacerbated by the short reads produced by the high-throughput sequencing platform that will be the standard instrument used in the EMP. Improved algorithms, improved analysis tools, huge amounts of computer storage, and access to many thousands of hours of supercomputer time will be necessary.[8]
Another challenge will be the large number of sequencing errors that are expected. Next-generation sequencing technologies provide enormous throughput but lower accuracies than older sequencing methods. When sequencing a single genome, the intrinsic lower accuracy of these methods is far more than compensated for by the ability to cover the entire genome multiple times in opposite directions from multiple start points, but this capability provides no improvement in accuracy when sequencing a diverse mixture of genomes. The question will be, how can sequencing errors be distinguished from actual diversity in the collected microbial samples?[8]
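The way redundant coverage compensates for errors in a single genome can be shown with a majority-vote consensus. The sketch below assumes the reads are already aligned, have equal length, and all come from the same genome; the example reads are invented, each carrying one simulated error at a different position.

```python
from collections import Counter
from typing import List


def consensus(aligned_reads: List[str]) -> str:
    """Return the most common base at each position of pre-aligned, equal-length reads."""
    columns = zip(*aligned_reads)  # one tuple of bases per alignment position
    return "".join(Counter(column).most_common(1)[0][0] for column in columns)


# Hypothetical reads covering the same region of one genome ("ACGTACGT").
reads = [
    "ACGTACGT",
    "ACGAACGT",  # simulated error at position 3
    "ACGTACGT",
    "TCGTACGT",  # simulated error at position 0
    "ACGTACGA",  # simulated error at position 7
]
print(consensus(reads))
```

Each error is outvoted because the other reads agree at that position. In a diverse mixture of genomes this reasoning breaks down: a minority base may be a genuine variant from a less abundant organism, so majority voting alone cannot separate error from diversity.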
Despite the issuance of standard protocols, systematic biases from lab to lab are expected. The need to amplify DNA from samples with low biomass will introduce additional distortions of the data. Assembly of genomes of even the dominant organisms in a diverse sample of organisms requires gigabytes of sequence data.[8]
The EMP must avoid a problem that has become prevalent in the public sequence databases. With the advancement of high-throughput sequencing technologies, many sequences are entering public databases with no experimentally determined function, annotated only on the basis of observed homology with a known sequence. A known sequence is used to annotate a first unknown sequence, but that first unknown sequence is then used to annotate a second unknown sequence, and so on, so that annotations drift further from any experimental evidence. Sequence homology is only a modestly reliable predictor of function.[15]
Notes
- Suttle, C. A. (2007). "Marine viruses — major players in the global ecosystem". Nature Reviews Microbiology. 5 (10): 801–812. doi:10.1038/nrmicro1750. PMID 17853907. S2CID 4658457.
- Curtis, T. P.; Sloan, W. T.; Scannell, J. W. (2002). "Estimating prokaryotic diversity and its limits". Proceedings of the National Academy of Sciences. 99 (16): 10494–10499. doi:10.1073/pnas.142680199. PMC 124953. PMID 12097644.
- Gilbert, J. A.; Meyer, F.; Antonopoulos, D.; Balaji, P.; Brown, C. T.; Brown, C. T.; Desai, N.; Eisen, J. A.; Evers, D.; Field, D.; Feng, W.; Huson, D.; Jansson, J.; Knight, R.; Knight, J.; Kolker, E.; Konstantindis, K.; Kostka, J.; Kyrpides, N.; MacKelprang, R.; McHardy, A.; Quince, C.; Raes, J.; Sczyrba, A.; Shade, A.; Stevens, R. (2010). "Meeting Report: The Terabase Metagenomics Workshop and the Vision of an Earth Microbiome Project". Standards in Genomic Sciences. 3 (3): 243–248. doi:10.4056/sigs.1433550. PMC 3035311. PMID 21304727.
- Gilbert, J. A.; O'Dor, R.; King, N.; Vogel, T. M. (2011). "The importance of metagenomic surveys to microbial ecology: Or why Darwin would have been a metagenomic scientist". Microbial Informatics and Experimentation. 1 (1): 5. doi:10.1186/2042-5783-1-5. PMC 3348666. PMID 22587826.
- The Earth Microbiome Project is a systematic attempt to characterize global microbial taxonomic and functional diversity for the benefit of the planet and humankind. Earth Microbiome Project 2018, retrieved 3 January 2018
- Gilbert, J.A.; Meyer, F. (2012). "Modeling the Earth Microbiome". Microbe Magazine. 7 (2): 64–69. doi:10.1128/microbe.7.64.1.
- "Earth Microbiome Project / Standard Protocols". Archived from the original on 2012-03-16. Retrieved 2012-03-07.
- Jansson, Janet (2011). "Towards "Tera-Terra": Terabase Sequencing of Terrestrial Metagenomes". Microbe Magazine. 6 (7): 309–15. doi:10.1128/microbe.6.309.1.
- "BLAST: Basic Local Alignment Search Tool".
- "Ribosomal Database Project". Retrieved 2012-03-06.
- Meyer, F.; Paarmann, D.; d'Souza, M.; Olson, R.; Glass, E. M.; Kubal, M.; Paczian, T.; Rodriguez, A.; Stevens, R.; Wilke, A.; Wilkening, J.; Edwards, R. A. (2008). "The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes". BMC Bioinformatics. 9: 386. doi:10.1186/1471-2105-9-386. PMC 2563014. PMID 18803844.
- "GeneMark - Free gene prediction software". Retrieved 2012-03-06.
- "FragGeneScan". Retrieved 2012-03-06.
- "Earth Microbiome Project / Defining the Tasks". Archived from the original on 2012-03-16. Retrieved 2012-03-07.
- Gilbert, J. A.; Dupont, C. L. (2011). "Microbial Metagenomics: Beyond the Genome". Annual Review of Marine Science. 3: 347–371. Bibcode:2011ARMS....3..347G. doi:10.1146/annurev-marine-120709-142811. PMID 21329209.