Metagenoms: Difference between revisions

Dpuiu (talk | contribs)
= Web sites =


* [http://dnaresearch.oxfordjournals.org/cgi/reprint/dsm018v1  Kurokawa article]
* [http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi NCBI Bacterial Genomes] 647 Complete, 1067 in Progress
* [http://www.sciencemag.org/cgi/content/abstract/312/5778/1355 Gill article]
* [http://fames.jgi-psf.org/ JGI Fidelity of Analysis of Metagenomic Samples (FAMeS)]
* [http://www.nature.com/nmeth/journal/v4/n6/pdf/nmeth1043.pdf]
* [http://fames.jgi-psf.org/cgi-bin/dataset_desc.pl?dataset=soil JGI Simulated High Complexity (SIMHC)]
* [http://nihroadmap.nih.gov/hmp/ Human Microbiome Project (HMP) @ NIH]
* [http://genome.wustl.edu/pub/organism/Microbes/Human_Gut_Microbiome/ WUSTL Human Gut Microbiome (HGMI)] 41 genomes
* [http://www.hgsc.bcm.tmc.edu/microbiome-index.xsp Baylor HMP] list of 353 genomes targeted for sequencing by the 4 centers
* [http://www.jcvi.org/cms/research/projects/hmp/overview/ JCVI HMP] 50 genomes (oral,skin,vagina)
* [http://img.jgi.doe.gov/cgi-bin/pub/main.cgi JGI Integrated Microbial Genomes(IMG)]
* [http://img.jgi.doe.gov/m/doc/about_index.html JGI Integrated Microbial Genomes for Metagenomics (IMG/M)]
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16729 HGMI at NIH]
* [ftp://ftp.ncbi.nih.gov/pub/TraceDB/16729_uncultured_bacteria/ HGMI TA] ~150K Sanger traces; trimming points are given
* [http://en.wikipedia.org/wiki/Superfamily Wikipedia Taxonomic Ranks]
  domain 
  phylum 
  class 
  order 
  family 
  genus 
  species
  strain
* Example: Pseudomonas aeruginosa
  domain: Bacteria
  phylum: Proteobacteria
  class:  Gammaproteobacteria
  order:  Pseudomonadales
  family: Pseudomonadaceae
  genus:  Pseudomonas
  species:Pseudomonas aeruginosa group
  strain: Pseudomonas aeruginosa


= Articles =
* [http://dnaresearch.oxfordjournals.org/cgi/reprint/dsm018v1 Kurokawa]
* [http://www.sciencemag.org/cgi/content/abstract/312/5778/1355 Gill]
* [http://www.nature.com/nmeth/journal/v4/n6/pdf/nmeth1043.pdf JGI]
* [http://www.nature.com/nature/journal/v449/n7164/full/nature06244.html HMP Nature Oct 2007]
* [http://www.genome.gov/Pages/Research/Sequencing/SeqProposals/HMPP_Proposal.pdf Human Microbiome Pilot Project (HMPP)]
* [http://www.genome.gov/Pages/Research/Sequencing/SeqProposals/HGMISeq.pdf Human Gut Microbiome Initiative (HGMI)]  Need for more RefSeqs; sequence the genomes of 100 cultured representatives of the phylogenetic diversity in the human gut microbiota
* [http://www.nature.com/nrmicro/journal/v6/n6/pdf/nrmicro1901.pdf Nature Reviews: Microbiology in the post-genomic era]
* [http://nar.oxfordjournals.org/cgi/reprint/gkm846v1 IMG NAR 2007]
* [http://www.nature.com/nmeth/journal/v4/n1/pdf/nmeth976.pdf Accurate phylogenetic classification of variable-length DNA fragments, Nature Methods Jan 2007]
 
= HMP =
 
NIH Roadmap
# Sequence the genomes of 200 microbes that have been isolated from the human body;
# Recruit a set of healthy donors and obtain samples from a set of body regions
# Perform initial 16S rDNA gene metagenomic sequence analyses to estimate the complexity of the microbiota at these sites.
Centers: Baylor, Broad, JCVI, WUSTL
 
* ~10 times more bacterial cells than human cells in the body
* small-subunit (16S) ribosomal RNA gene-sequence-based surveys:  
   * found in all microorganisms  
   * has enough sequence conservation for accurate alignment  
   * has enough variation for phylogenetic analyses.
* sites surveyed: skin, mouth, oesophagus, stomach, colon and vagina
* largest reported data sets are for the gut
* most of the 10–100 trillion microorganisms in the human gastrointestinal tract live in the colon.
* more than 90% of all phylogenetic types (phylotypes) of colonic bacteria belong to just 2 of the 70 known divisions (phyla) in the domain Bacteria: the Firmicutes and the Bacteroidetes.
Firmicutes (Gram-positive bacteria) : 639 Genome Sequences
    * Bacilli    472
    * Clostridia    106
    * Erysipelotrichi    1
    * Mollicutes    60
    * Thermolithobacteria 
    * unclassified Firmicutes sensu stricto 
    * environmental samples   
Bacteroidetes : 53  Genome Sequences
    * Bacteroidetes (class)    22
    * Flavobacteria    21
    * Sphingobacteria    8
    * unclassified Bacteroidetes    1
    * environmental samples    1
Actinobacteria
* In the colon, the differences between individuals are greater than the differences between sampling sites within one individual
* Communities are usually stable over time
= SIMHC =
== Data ==
=== Online ===
* 113 reference genomes : 89 complete, 24 incomplete
                    #elem  min    max    mean    median  n50    sum
  NC_chromosomes    103    943016  8264687 3569251 3481691 4326849 367,632,882
  NC_plasmids      70      3361    821788  147207  96488  300758  10,304,479
  NC_*              173    3361    8264687 2184609 1966858 4317977 377,937,361  # come from 89 genomes & 70 plasmids ; some genomes contain multiple chromosomes
  NZ_*              3505    185    1802798 26389  9051    73891  92,494,763    # come from 24 genomes
  N*_total          3678    185    8264687 127904  9940    3561584 470,432,124
* 118084 Sanger reads from 3 insert libs (small,med,large)
  #reads  min    max    mean    median  n50    sum
  116771  43      3754    950    968    982    110875383
* More than 50% of the reads are from Proteobacteria (72634 of 116771, about 62%), notably Gammaproteobacteria
Phyla:
 
  Proteobacteria          72634
  Firmicutes              15021
  Actinobacteria          10744
  Cyanobacteria          6877
  Chlorobi                4982
  Euryarchaeota          3266
  Chloroflexi            1277
  Bacteroidetes          1161
  Deinococcus-Thermus    809
  Total                  116771
* Organisms
  Burkholderia      12431 # Which one? B. cenocepacia AU 1054 , B. AMMD, B. sp. 383, B. xenovorans LB400, B. cenocepacia HI2424, B. vietnamiensis G4
  Shewanella        9613
  Rhodopseudomonas  5279
  ...
  Ferroplasma      471
  Pediococcus      456
  Oenococcus        422    # Firmicutes, Bacilli, Lactobacillales
* read coverage of each organism: 0.09X to 0.53X
  0.53X: Moorella thermoacetica ATCC 39073 (1426 reads; NC_007644 2,628,784bp 55.79%GC) (Firmicutes)
  0.09X: Xylella fastidiosa Ann-1
=== Local ===
* 113 reference genomes : 89 complete, 24 incomplete
* 116771 Sanger reads from 3 insert libs (small,med,large) 
* 118084 - 116771 = 1313 reads from Moorella and Xylella were discarded
* read coverage of each organism ...
  0.23X: Moorella thermoacetica ATCC 39073 (740 reads; NC_007644 2,628,784bp 55.79%GC) (Firmicutes)
== Assemblies ==
Contig stats
                      #elem  min    max    mean    median  n50    sum            singl
  phrap-ctg            23398  73      8603    1289    1194    1341    30,163,430      66524
  arachne-ctg          578    240    6300    1878    1822    1985    1,085,508      115300
  jazz-scaff          860    1000    39837  3236    1105    7278    2,783,247      109080
  CA-scaff            4327    1000    48861  1682    1342    1456    7,278,044      76270    # OBT trimming of the reads
  CA-scf-unambiguous  4327    1000    26070  1473    1342    1403    6,374,650
  CA-ctg              4491    1000    5252    1419    1340    1388    6,374,863                # 4407 CONTAINED in references
  CA-deg              11611  66      5920    850    833    898    9,865,902
  AMOS-ctg            95372  72      7426    975    916    964    92,959,055      5661      # alignment-based trimming of the reads; casm-layout options "-S -r" were used
  AMOS-ctg-plasmids    2078    159    7426    1007    911    972    2,092,115
  minimus2-ctg        15211  85      4699    1257    1205    1305    19,119,413      82535    # alignment-based trimming of the reads
CA:
* Longest scaff: scf7180000043951 48861bp(26070 unambiguous) comes from NC_007968      41221  38.26  Psychrobacter cryohalolentis K5 plasmid 1
* Longest ctg:  ctg7180000029682 5252bp                    comes from NC_008499      35595  38.51  Lactobacillus brevis ATCC 367 plasmid 2
* Longest degen: ctg7180000030915 5920bp                                NC_008608      30722  56.32  Pelobacter propionicus DSM 2379 plasmid pPRO2
minimus2 on AMOScmp & CA (ctg+degen): most of the CA assemblies were contained in AMOScmp contigs; the stats are about the same as for AMOScmp

Latest revision as of 19:04, 9 June 2008

Web sites

 domain  
 phylum  
 class   
 order   
 family  
 genus   
 species 
 strain
  • Example: Pseudomonas aeruginosa
 domain: Bacteria
 phylum: Proteobacteria
 class:  Gammaproteobacteria
 order:  Pseudomonadales 
 family: Pseudomonadaceae
 genus:  Pseudomonas
 species:Pseudomonas aeruginosa group
 strain: Pseudomonas aeruginosa

Articles

HMP

NIH Roadmap

  1. Sequence the genomes of 200 microbes that have been isolated from the human body;
  2. Recruit a set of healthy donors and obtain samples from a set of body regions
  3. Perform initial 16S rDNA gene metagenomic sequence analyses to estimate the complexity of the microbiota at these sites.

Centers: Baylor, Broad, JCVI, WUSTL

  • ~10 times more bacterial cells than human cells in the body
  • small-subunit (16S) ribosomal RNA gene-sequence-based surveys:
 * found in all microorganisms 
 * has enough sequence conservation for accurate alignment 
 * has enough variation for phylogenetic analyses.
  • sites surveyed: skin, mouth, oesophagus, stomach, colon and vagina
  • largest reported data sets are for the gut
  • most of the 10–100 trillion microorganisms in the human gastrointestinal tract live in the colon.
  • more than 90% of all phylogenetic types (phylotypes) of colonic bacteria belong to just 2 of the 70 known divisions (phyla) in the domain Bacteria: the Firmicutes and the Bacteroidetes.

Firmicutes (Gram-positive bacteria) : 639 Genome Sequences

   * Bacilli    472
   * Clostridia    106
   * Erysipelotrichi    1
   * Mollicutes    60
   * Thermolithobacteria   
   * unclassified Firmicutes sensu stricto   
   * environmental samples    

Bacteroidetes : 53 Genome Sequences

   * Bacteroidetes (class)    22
   * Flavobacteria    21
   * Sphingobacteria    8
   * unclassified Bacteroidetes    1
   * environmental samples    1

Actinobacteria

  • In the colon, the differences between individuals are greater than the differences between sampling sites within one individual
  • Communities are usually stable over time

SIMHC

Data

Online

  • 113 reference genomes : 89 complete, 24 incomplete
                   #elem   min     max     mean    median  n50     sum
 NC_chromosomes    103     943016  8264687 3569251 3481691 4326849 367,632,882
 NC_plasmids       70      3361    821788  147207  96488   300758  10,304,479
 NC_*              173     3361    8264687 2184609 1966858 4317977 377,937,361   # come from 89 genomes & 70 plasmids ; some genomes contain multiple chromosomes
 NZ_*              3505    185     1802798 26389   9051    73891   92,494,763    # come from 24 genomes
 N*_total          3678    185     8264687 127904  9940    3561584 470,432,124


  • 118084 Sanger reads from 3 insert libs (small,med,large)
 #reads  min     max     mean    median  n50     sum
 116771  43      3754    950     968     982     110875383
  • More than 50% of the reads are from Proteobacteria (72634 of 116771, about 62%), notably Gammaproteobacteria

Phyla:

 Proteobacteria          72634
 Firmicutes              15021
 Actinobacteria          10744
 Cyanobacteria           6877
 Chlorobi                4982
 Euryarchaeota           3266
 Chloroflexi             1277
 Bacteroidetes           1161
 Deinococcus-Thermus     809
 Total                   116771
  • Organisms
 Burkholderia      12431 # Which one? B. cenocepacia AU 1054 , B. AMMD, B. sp. 383, B. xenovorans LB400, B. cenocepacia HI2424, B. vietnamiensis G4
 Shewanella        9613
 Rhodopseudomonas  5279
 ...
 Ferroplasma       471
 Pediococcus       456
 Oenococcus        422    # Firmicutes, Bacilli, Lactobacillales
  • read coverage of each organism: 0.09X to 0.53X
 0.53X: Moorella thermoacetica ATCC 39073 (1426 reads; NC_007644 2,628,784bp 55.79%GC) (Firmicutes)
 0.09X: Xylella fastidiosa Ann-1
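
The per-organism coverage figures above can be approximated as (number of reads × read length) / genome length. The following is a minimal sketch, assuming the overall mean read length of 950 bp from the read stats table is representative of each organism; the function name `read_coverage` is hypothetical, and the page does not say how the reported values were actually computed.

```python
def read_coverage(num_reads, mean_read_len, genome_len):
    """Estimate fold coverage as total read bases divided by genome length.

    This is an approximation: it uses one mean read length for all reads
    rather than the actual (possibly trimmed) length of each read.
    """
    if genome_len <= 0:
        raise ValueError("genome_len must be positive")
    return num_reads * mean_read_len / genome_len


# Moorella thermoacetica ATCC 39073 (NC_007644, 2,628,784 bp), values from this page.
# 950 bp is the overall mean read length, used here as an assumption.
moorella_online = read_coverage(1426, 950, 2_628_784)
moorella_local = read_coverage(740, 950, 2_628_784)
print(f"online: {moorella_online:.2f}X  local: {moorella_local:.2f}X")
```

The estimates land close to, but not exactly on, the reported 0.53X and 0.23X, which would be expected if the reported values were derived from actual per-read lengths rather than a global mean.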

Local

  • 113 reference genomes : 89 complete, 24 incomplete
  • 116771 Sanger reads from 3 insert libs (small,med,large)
  • 118084 - 116771 = 1313 reads from Moorella and Xylella were discarded
  • read coverage of each organism ...
 0.23X: Moorella thermoacetica ATCC 39073 (740 reads; NC_007644 2,628,784bp 55.79%GC) (Firmicutes)

Assemblies

Contig stats

                      #elem   min     max     mean    median  n50     sum             singl
 phrap-ctg            23398   73      8603    1289    1194    1341    30,163,430      66524
 arachne-ctg          578     240     6300    1878    1822    1985    1,085,508       115300
 jazz-scaff           860     1000    39837   3236    1105    7278    2,783,247       109080
 CA-scaff             4327    1000    48861   1682    1342    1456    7,278,044       76270     # OBT trimming of the reads
 CA-scf-unambiguous   4327    1000    26070   1473    1342    1403    6,374,650
 CA-ctg               4491    1000    5252    1419    1340    1388    6,374,863                 # 4407 CONTAINED in references
 CA-deg               11611   66      5920    850     833     898     9,865,902
 AMOS-ctg             95372   72      7426    975     916     964     92,959,055      5661      # alignment-based trimming of the reads; casm-layout options "-S -r" were used
 AMOS-ctg-plasmids    2078    159     7426    1007    911     972     2,092,115
 minimus2-ctg         15211   85      4699    1257    1205    1305    19,119,413      82535     # alignment-based trimming of the reads
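
The columns in the stats tables on this page (#elem, min, max, mean, median, n50, sum) can be reproduced from a plain list of sequence lengths. The sketch below assumes the common N50 definition: the length L such that sequences of length L or longer contain at least half of all bases. The script that produced these tables is not shown here, so the function names `n50` and `summarize`, and the use of an integer mean, are illustrative assumptions.

```python
import statistics


def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total number of bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def summarize(lengths):
    """Compute the per-assembly columns used in the tables on this page."""
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) // len(lengths),  # integer mean (assumption)
        "median": statistics.median(lengths),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }


# Toy contig lengths, not taken from any assembly on this page.
print(summarize([10, 20, 30, 40]))
```

Because N50 is weighted by bases rather than by sequence count, a few long contigs can raise it well above the median, which is why both columns are reported.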

CA:

  • Longest scaff: scf7180000043951 48861bp(26070 unambiguous) comes from NC_007968 41221 38.26 Psychrobacter cryohalolentis K5 plasmid 1
  • Longest ctg: ctg7180000029682 5252bp comes from NC_008499 35595 38.51 Lactobacillus brevis ATCC 367 plasmid 2
  • Longest degen: ctg7180000030915 5920bp NC_008608 30722 56.32 Pelobacter propionicus DSM 2379 plasmid pPRO2

minimus2 on AMOScmp & CA (ctg+degen): most of the CA assemblies were contained in AMOScmp contigs; the stats are about the same as for AMOScmp