Dpuiu Assemblathon

From Cbcb
Revision as of 18:28, 12 January 2011 by Dpuiu (talk | contribs) (→‎GAGE)

Links

GAGE

  • Location
 http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage

Assemblers

* CA             /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/Linux-amd64/bin/runCA 
* Newbler
* Velvet
* SOAPdenovo
* Maq

CBCB genomes

  • A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
  • A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
  • Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.

If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.

Argentine ant

Bee, Bombus impatiens

Data

  • 497,318,144 Illumina 124bp reads
  • 8 libraries; inserts:
    • 400bp
    • 3k (outie)
    • 8k (outie)
  • Traces

Adapters: in 3k & 8k libraries

C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
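
A minimal sketch of how reads could be clipped at the adapters listed above. It assumes exact substring matching and keeps only the part of the read before the earliest adapter hit. The function name clip_at_adapter and the min_len cutoff are hypothetical, and these notes do not say which tool actually produced the adaptor-free reads. Partial adapters at the read end and adapters containing sequencing errors are not handled.

```python
# Adapter sequences copied from the list above (labels C, 3, 5 as given there).
ADAPTERS = {
    "C": "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA",
    "3": "CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
    "5": "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG",
}


def clip_at_adapter(seq, adapters=ADAPTERS, min_len=30):
    """Return the prefix of seq before the earliest exact adapter hit.

    Returns None when the remaining prefix is shorter than min_len
    (min_len=30 is an assumed cutoff, not taken from the notes).
    """
    cut = len(seq)
    for adapter in adapters.values():
        pos = seq.find(adapter)
        if pos != -1:
            cut = min(cut, pos)
    clipped = seq[:cut]
    return clipped if len(clipped) >= min_len else None


if __name__ == "__main__":
    # Toy read: 40 bases of insert followed by the "C" adapter and a short tail.
    read = "A" * 40 + ADAPTERS["C"] + "T" * 10
    print(clip_at_adapter(read))
```

For the outie (3k and 8k) libraries, whether the sequence after the adapter should also be kept is a separate decision that this sketch leaves open.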

Location:

/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt                             # original fastq files

/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)

Bacterium, Staph aureus USA300

 Complete genome        : NC_010079       2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
 In progress genome     : NZ_AASB00000000 2810505bp Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs
 454 FLX                : Staphylococcus aureus subsp. aureus USA300_TCH959 HMP0023  http://www.ncbi.nlm.nih.gov/sra/SRX002327?report=full 
 Illumina 101bp paired  : Staphylococcus aureus subsp. aureus USA300_TCH1516         http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full

454FLX

  • Data
                       reads   min    q1     q2     q3     max        mean       n50        sum            
 sff(original)         334241  36     201    256    277    362        235        264        78432227      
 sffToCA(adaptor free) 325555  51     103    148    208    358        157        185        51225615   # 58723 mates
  • DeNovo
 # ctg stats
 .                     ctgs min  q1     q2      q3      max     mean       n50      sum      
 CA.6.1.bog            18   238  21014  135971  239466  567548* 155836     277888*  2805055
 newbler.2.5p1.deNovo  100  103  295    3287    39467   229053  27879      78379    2787870
 # scf stats
 CA.6.1.bog            6    284  21014  173065  1032129 1458733* 467554    1458733* 2805325
 newbler.2.5p1.deNovo  8    2475 20731  110137  1030785 1408642  349895    1408642  2799157
  • Reference based(Saureus USA300)
  .                    ctgs min  q1     q2      q3      max     mean       n50      sum      
 newbler.2.3.refMapper 206  103  556    3098    15366   117487  12749      40687    2626469
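
The columns in the ctg and scf tables above (min, q1, q2, q3, max, mean, n50, sum) can be computed from a list of contig or scaffold lengths. The sketch below is one way to do it, not necessarily the script used for these tables. It assumes N50 is the length of the sequence at which the running sum, taken from longest to shortest, first reaches half of the total, and it uses a simple index-based quartile convention. The function names are hypothetical.

```python
def n50(lengths):
    """Length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


def length_summary(lengths):
    """Summary statistics for a list of sequence lengths.

    Quartiles are taken by index into the sorted list (an assumed convention).
    """
    s = sorted(lengths)
    n = len(s)
    if n == 0:
        raise ValueError("no lengths given")
    total = sum(s)
    return {
        "count": n,
        "min": s[0],
        "q1": s[n // 4],
        "q2": s[n // 2],
        "q3": s[(3 * n) // 4],
        "max": s[-1],
        "mean": round(total / n),
        "n50": n50(s),
        "sum": total,
    }


if __name__ == "__main__":
    # Toy lengths, not taken from any assembly on this page.
    print(length_summary([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```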

Illumina101.78X.cor

  • Comments:
 mated reads; library mean/std (CA estimates) = 170/21
 The first 2 bases and the last base show composition bias; should we trim them?
 
  • Data
                    reads       min    max        mean  sum            cvg
 Total              30,597,352  101    101        101   3090332552     1065
 Sampled            2,295,176   101    101        101   231812776      78
 Corrected(paired)  1,479,510   30     101        89    131973249      45.9
  • DeNovo
 # ctg stats
              elem  min   q1     q2     q3     max     mean   n50    sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       73    1716  14503  26431  52086  148524* 39043  58038* 2850157  31075  7700   31075   75      14            6  
 SOAPdenovo   9382  32    32     52     63     85850   347    16726  3259522  34960  722    37358   24      0             0  
 velvet       453   61    85     224    3279   137163  6297   36496  2852552  53075  10478  53410   156     17            6  
 # scf stats
              elem  min   q1     q2     q3     max     mean   n50    sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       67    1716  14869  32929  58038  148524* 42541  65383* 2850277  31075  7700   31075   75      20            7               # 1 large rearrangement :  scf120001252361 : NC_010079:622381-671743
 SOAPdenovo   186   100   333    1528   17623  144079  15625  55558  2906207  61878  26070  62049   65      453           6  
 velvet       427   61    82     177    2982   137163  6685   37874  2854649  53268  10478  53623   156     42            8
  • Location
 /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100.78X/
 /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Assembly//Illuminap100.78X.cor/
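
The comment in this section asks whether the first 2 bases and the last base should be trimmed. The sketch below shows one way to tabulate base composition per read position and to trim a fixed number of bases from each end of a read, if trimming is chosen. The function names and the head/tail defaults are illustrative assumptions; the notes do not record whether any trimming was actually done.

```python
from collections import Counter


def position_composition(reads):
    """Count bases at each read position; returns a list of Counters."""
    counts = []
    for seq in reads:
        for i, base in enumerate(seq):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    return counts


def trim_read(seq, qual, head=2, tail=1):
    """Drop `head` bases from the start and `tail` bases from the end.

    The quality string is trimmed the same way so both stay the same length.
    """
    end = len(seq) - tail
    return seq[head:end], qual[head:end]


if __name__ == "__main__":
    # Toy reads, not real data.
    reads = ["ACGTACGT", "ACGTTCGA"]
    for pos, counter in enumerate(position_composition(reads)):
        print(pos, dict(counter))
    print(trim_read("ACGTACGT", "IIIIHHHH"))
```

Comparing the counts at the first and last positions with those in the middle of the read is a quick way to see how strong the bias is before deciding on head and tail.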

Illumina101.150X.cor

  • Comments:
 The first 2 bases and the last base show composition bias; should we trim them?
  • Data
                    reads       min    max        mean  sum            cvg
 Total              30,597,352  101    101        101   3090332552     1065
 Sampled            4,404,626   101    101        101   444867226      154
 Corrected(paired)  2,815,584   30     101        89    252331825      87
  • DeNovo
 # ctg stats
              elem  min   q1     q2     q3     max     mean   n50     sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       47    1866  11740  37171  88747  388257* 60272  121078* 2832762  39703  28204  39775   142     12            9  
 SOAPdenovo   15229 32    33     48     63     74601   232    13773   3530251  31690  269    36416   17      0             0  
 velvet       429   61    86     206    3147   134034  6650   40448   2852695  52754  9400   52940   152     24            9  
 # scf stats
              elem  min   q1     q2     q3     max     mean   n50     sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       46    1866  11740  39650  88747  388257* 61582  129426* 2832782  39703  28204  39775   142     13            10  
 SOAPdenovo   158   101   202    1249   20910  150305  18353  74149   2899760  64028  30903  64392   52      506           7   
 velvet       409   61    85     180    2854   142341  6978   42466   2853854  52940  9400   52751   150     43            12

Bacterium, E coli

 Complete genome        : NC_000913      4639675bp  Escherichia coli str. K-12 substr. MG1655

 454 FLX                : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX000348?report=full 
 Illumina 101bp paired  : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX016044?report=full

Illumina101.cor

  • Data
 mated reads; library mean/std (CA estimates) = 160/24
                    reads       min    max        mean  sum            cvg
 Total              20,635,060  101    101        101   2084141060     449
 Sampled            3,591,676   101    101        101   362759276      78
 Corrected(paired)  1,556,316   30     101        79    122785294      26
  • DeNovo
 #ctg stats
                      elem       min    q1     q2     q3     max        mean       n50        sum         q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog               677        1006   2265   4527   8853   45500      6604       10186      4471053     189441  162537  190445  170     16            4  
 SOAPdenovo           4000       32     35     57     101    43816      1172       9327       4687158     63179   3026    78607   67      1             0  
 velvet               711        61     166    2135   8579   57936*     6375       16400*     4532895     113612  30252   114128  166     31            5  
 # scf stats
                      elem       min    q1     q2     q3     max        mean       n50        sum         q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog               658        1006   2343   4621   9033   45500      6795       10483      4471433     189441  162537  190445  170     35            4   
 SOAPdenovo           270        100    269    3431   24735  219248*    17273      53882*     4663670     111651  71505   112331  128     826           2   
 velvet               480        61     92     1147   13028  114907     9523       32233      4571240     113936  30253   114192  161     262           11
  • Location
 /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Data/Illuminap100/
 /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Assembly/Illuminap100.cor/
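
The cvg column in the E. coli Data table above is consistent with total bases divided by the reference length (4639675bp for NC_000913). The sketch below reproduces that arithmetic and also estimates how many reads of a given length are needed for a target coverage. The function names are hypothetical, and the notes do not say how the sampled read count was actually chosen.

```python
import math

ECOLI_GENOME_SIZE = 4639675  # NC_000913 length as listed above


def coverage(total_bases, genome_size):
    """Average depth of coverage: total sequenced bases over genome length."""
    return total_bases / genome_size


def reads_for_coverage(target_cvg, genome_size, read_len):
    """Smallest number of fixed-length reads giving at least target_cvg."""
    return math.ceil(target_cvg * genome_size / read_len)


if __name__ == "__main__":
    # "sum" values taken from the Total, Sampled, and Corrected(paired) rows above.
    for label, total in [("Total", 2084141060),
                         ("Sampled", 362759276),
                         ("Corrected(paired)", 122785294)]:
        print(label, round(coverage(total, ECOLI_GENOME_SIZE), 1))
    print(reads_for_coverage(78, ECOLI_GENOME_SIZE, 101))
```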

Human, a single chromosome, medium-sized

  • Data
 Human NA12878 Genome on Illumina 
 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
 ginko:/scratch1/Human_NA12878_on_Illumina/100x
  • Comments
    • Human chromosome 14. The chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and SOAPdenovo. We've downloaded the data, and Todd is going to create a data set representing just chr 14 to make the assembly feasible. We'll then try to assemble that data with all 3 assemblers: CA, SOAPdenovo, and Allpaths-LG.