Dpuiu Assemblathon: Difference between revisions

From Cbcb

Revision as of 15:15, 21 December 2010

Links

Assemblers

* CA             /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/Linux-amd64/bin/runCA 
* Newbler
* Velvet
* SOAPdenovo
* Maq

CBCB genomes

  • A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
  • A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
  • Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.

If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.

Argentine ant

Bee, Bombus impatiens

Data

  • 497,318,144 Illumina 124bp reads
  • 8 libraries; inserts:
    • 400bp
    • 3k (outie)
    • 8k (outie)
  • Traces

Adapters: in 3k & 8k libraries

C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
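
A minimal sketch of how reads from the 3k and 8k libraries could be clipped at one of the adapters listed above. It is only an illustration: the function name `clip_adapter` and the `min_overlap` parameter are hypothetical, matching is exact (no sequencing errors tolerated), and it is not the procedure that produced the adapter-free files on this page.

```python
# First adapter sequence from the list above (labelled "C" in the source).
ADAPTER = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"


def clip_adapter(read, adapter=ADAPTER, min_overlap=10):
    """Return the read truncated at the first exact adapter occurrence.

    min_overlap is a hypothetical threshold: the shortest adapter prefix
    that is still clipped when the adapter runs off the 3' end of the read.
    """
    # Case 1: the whole adapter is contained in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only a prefix of the adapter fits before the read ends.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: leave the read unchanged.
    return read


if __name__ == "__main__":
    print(clip_adapter("ACGTACGTAC" + ADAPTER + "TTTT"))
    print(clip_adapter("ACGTACGTAC" + ADAPTER[:15]))
```

A real pipeline would also need to allow mismatches and to check the other two adapter sequences and their reverse complements.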

Location:

/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt                             # original fastq files

/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)
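
The bracket patterns in the paths above are shell-style character classes. As a rough sketch of which file names the original-fastq pattern selects, the following uses Python's `fnmatch` on a list of candidate names; the candidate list (lanes 1 to 9, reads 1 and 2) is an assumption made for illustration and is not a listing of the actual directory.

```python
import fnmatch

# File-name part of the original fastq pattern given above.
PATTERN = "s_[12356789]_[12]_sequence.txt"


def matching(names, pattern=PATTERN):
    """Return the names that match the shell-style pattern, sorted."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, pattern))


# Hypothetical candidates: one file per lane (1-9) and read (1-2).
candidates = [
    f"s_{lane}_{read}_sequence.txt"
    for lane in range(1, 10)
    for read in (1, 2)
]

selected = matching(candidates)

if __name__ == "__main__":
    for name in selected:
        print(name)
```

Under that assumption the pattern selects 16 files (8 lanes, 2 reads each) and skips lane 4.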

Bacterium, Staph aureus USA300

 Complete genome        : NC_010079       2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
 In progress genome     : NZ_AASB00000000 2810505bp Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs
 454 FLX                : Staphylococcus aureus subsp. aureus USA300_TCH959 HMP0023  http://www.ncbi.nlm.nih.gov/sra/SRX002327?report=full 
 Illumina 101bp paired  : Staphylococcus aureus subsp. aureus USA300_TCH1516         http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full

454FLX

  • Data
                       reads   min    q1     q2     q3     max        mean       n50        sum            
 sff(original)         334241  36     201    256    277    362        235        264        78432227      
 sffToCA(adaptor free) 325555  51     103    148    208    358        157        185        51225615   # 58723 mates
  • DeNovo
 # ctg stats
 .                     ctgs min  q1     q2      q3      max     mean       n50      sum      
 CA.6.1.bog            18   238  21014  135971  239466  567548* 155836     277888*  2805055
 newbler.2.5p1.deNovo  100  103  295    3287    39467   229053  27879      78379    2787870
 # scf stats
 CA.6.1.bog            6    284  21014  173065  1032129 1458733* 467554    1458733* 2805325
 newbler.2.5p1.deNovo  8    2475 20731  110137  1030785 1408642  349895    1408642  2799157
  • Reference based (S. aureus USA300)
  .                    ctgs min  q1     q2      q3      max     mean       n50      sum      
 newbler.2.3.refMapper 206  103  556    3098    15366   117487  12749      40687    2626469
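
For reference, a minimal sketch of how the mean and n50 columns in the tables above can be computed from a list of contig or scaffold lengths. It assumes the common N50 definition (the length of the contig at which the running sum of lengths, sorted longest first, reaches half the total); the script that produced these tables is not shown on this page and may use a slightly different convention.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    Lengths are sorted longest first; the N50 is the length at which
    the running sum first reaches at least half of the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def mean_length(total_bp, count):
    """Mean length as reported in the tables: sum divided by count, rounded."""
    return round(total_bp / count)


if __name__ == "__main__":
    # Toy lengths, not taken from any assembly on this page.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
    # CA.6.1.bog contig row above: sum 2805055 over 18 contigs.
    print(mean_length(2805055, 18))
```

The mean column can be checked directly from the sum and ctgs columns; the n50 column cannot be rechecked without the individual contig lengths.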

Illumina101.cor

  • Comments:
 The first 2 bases and the last base show composition bias; should we trim them?
  • Data
                    reads       min    max        mean  sum            cvg
 Total              30,597,352  101    101        101   3090332552     1065
 Sampled            2,295,176   101    101        101   231812776      78
 Corrected(paired)  1,241,488   64     101        96    119212428      41.5
  • DeNovo
 # ctg stats
              elem  min   q1     q2     q3     max     mean   n50    sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       73    1716  14503  26431  52086  148524* 39043  58038* 2850157  31075  7700   31075   75      14            6  
 SOAPdenovo   9382  32    32     52     63     85850   347    16726  3259522  34960  722    37358   24      0             0  
 velvet       453   61    85     224    3279   137163  6297   36496  2852552  53075  10478  53410   156     17            6  
 # scf stats
              elem  min   q1     q2     q3     max     mean   n50    sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       67    1716  14869  32929  58038  148524* 42541  65383* 2850277  31075  7700   31075   75      20            7               # 1 large rearrangement :  scf120001252361 : NC_010079:622381-671743
 SOAPdenovo   186   100   333    1528   17623  144079  15625  55558  2906207  61878  26070  62049   65      453           6  
 velvet       427   61    82     177    2982   137163  6685   37874  2854649  53268  10478  53623   156     42            8
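
On the trimming question raised in the comments for this data set: if the first 2 bases and the last base were trimmed, the operation per read would look roughly like the sketch below. The function name `trim_read` and its `head`/`tail` parameters are hypothetical, and whether trimming should be done at all is still open.

```python
def trim_read(seq, qual, head=2, tail=1):
    """Drop `head` bases from the start and `tail` bases from the end.

    The sequence and quality strings are trimmed identically so that
    they stay the same length, as FASTQ requires.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq) - tail
    return seq[head:end], qual[head:end]


if __name__ == "__main__":
    seq, qual = trim_read("ACGTACGT", "IIIIIIII")
    print(seq, qual)
```

Applied to the untrimmed 101bp reads, this would leave 98bp per read.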

Bacterium, E coli

 Complete genome        : NC_000913      4639675bp  Escherichia coli str. K-12 substr. MG1655

 454 FLX                : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX000348?report=full 
 Illumina 101bp paired  : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX016044?report=full

Illumina101.cor

  • Data
                    reads       min    max        mean  sum            cvg
 Total              20,635,060  101    101        101   2084141060     449
 Sampled            3,591,676   101    101        101   362759276      78
 Corrected(paired)  1,556,316   30     101        79    122785294      26
  • DeNovo
 # ctg stats
 .                    elem       min    q1     q2     q3     max        mean       n50        sum         0cvgSum snps  breaks rearrangements   
 # scf stats
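
The cvg column in the E. coli Data table above appears to be the total number of bases divided by the length of the complete genome (4639675bp). A minimal sketch under that assumption, using the sum values from the table:

```python
# Length of the complete E. coli K-12 MG1655 genome listed above.
ECOLI_GENOME_BP = 4639675


def coverage(total_bases, genome_bp):
    """Fold coverage: sequenced bases divided by genome length."""
    return total_bases / genome_bp


# "sum" column of the Illumina101.cor Data table for E. coli.
sums = {
    "Total": 2084141060,
    "Sampled": 362759276,
    "Corrected(paired)": 122785294,
}

cvg = {name: round(coverage(bases, ECOLI_GENOME_BP)) for name, bases in sums.items()}

if __name__ == "__main__":
    for name, value in cvg.items():
        print(f"{name:<20}{value}")
```

Rounded to whole numbers, this reproduces the 449, 78 and 26 shown in the table.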

Human, a single chromosome, medium-sized.