Revision as of 18:22, 28 January 2011

Links

http://www.genomeweb.com/informatics/us-european-teams-launch-parallel-challenges-improve-computational-methods-genom

GAGE

Location

 http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage

Assemblers

* CA             /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/Linux-amd64/bin/runCA 
* Newbler
* Velvet
* SOAPdenovo
* Maq

CBCB genomes

A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.

If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.

Argentine ant

Bee, Bombus impatiens

Data

497,318,144 Illumina 124bp reads
8 libraries; inserts:
- 400bp
- 3k (outie)
- 8k (outie)
Traces

Adapters: in 3k & 8k libraries

C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
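
The notes do not say which tool removed these adapters from the 3k and 8k libraries. As a minimal sketch only, assuming exact (mismatch-free) matching, the idea can be illustrated in Python. The function name `trim_adapter` and the `min_overlap` parameter are hypothetical; the adapter string is the one labelled "C" above.

```python
def trim_adapter(read, adapter, min_overlap=8):
    """Cut `read` at the first full adapter match, or at a partial match
    where a prefix of the adapter runs off the 3' end of the read.
    Exact matching only; real trimmers usually tolerate mismatches."""
    # Case 1: the whole adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the first k bases of the adapter fit before the read ends.
    max_len = min(len(adapter) - 1, len(read))
    for k in range(max_len, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    # No adapter found: return the read unchanged.
    return read


# Adapter labelled "C" in the notes above.
ADAPTER_C = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"

# Made-up read: 8 genomic bases followed by the first 20 adapter bases.
example = "ACGTTGCA" + ADAPTER_C[:20]
print(trim_adapter(example, ADAPTER_C))
```

The `min_overlap` threshold is a design choice: a small value removes more adapter remnants but also clips more genuine sequence that happens to match by chance.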

Location:

/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt                             # original fastq files

/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)
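
The bracket expressions in the paths above are shell-style character classes. As a small illustration, Python's `fnmatch` module can show which lane numbers each pattern selects. The file names generated here are hypothetical (built only to exercise the patterns), and the helper `lanes_matching` is not part of any pipeline described in these notes.

```python
from fnmatch import fnmatchcase

# Basename patterns copied from the paths above.
ORIGINAL = "s_[12356789]_[12]_sequence.txt"
LONG_INSERT = "s_[129]_[012]_sequence.cor.rev.txt"
SHORT_INSERT = "s_[35678]_[012]_sequence.cor.txt"


def lanes_matching(pattern, suffix, mates):
    """Return the lane numbers 1-9 for which a generated name matches."""
    lanes = set()
    for lane in range(1, 10):
        for mate in mates:
            name = f"s_{lane}_{mate}_{suffix}"  # hypothetical file name
            if fnmatchcase(name, pattern):
                lanes.add(lane)
    return sorted(lanes)


original = lanes_matching(ORIGINAL, "sequence.txt", (1, 2))
long_ins = lanes_matching(LONG_INSERT, "sequence.cor.rev.txt", (0, 1, 2))
short_ins = lanes_matching(SHORT_INSERT, "sequence.cor.txt", (0, 1, 2))
print(original, long_ins, short_ins)
```

Under these patterns the original files span eight lanes (lane 4 is absent), and the long-insert and short-insert patterns split those same eight lanes between them.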

Bacterium, Staph aureus USA300

 Complete genome        : NC_010079       2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
 In progress genome     : NZ_AASB00000000 2810505bp Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs

 454 FLX                : Staphylococcus aureus subsp. aureus USA300_TCH959 HMP0023  http://www.ncbi.nlm.nih.gov/sra/SRX002327?report=full 
 Illumina 101bp paired  : Staphylococcus aureus subsp. aureus USA300_TCH1516         http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full

454FLX

Data

                       reads   min    q1     q2     q3     max        mean       n50        sum            
 sff(original)         334241  36     201    256    277    362        235        264        78432227      
 sffToCA(adaptor free) 325555  51     103    148    208    358        157        185        51225615   # 58723 mates

DeNovo

 # ctg stats
 .                     ctgs min  q1     q2      q3      max     mean       n50      sum      
 CA.6.1.bog            18   238  21014  135971  239466  567548* 155836     277888*  2805055
 newbler.2.5p1.deNovo  100  103  295    3287    39467   229053  27879      78379    2787870

 # scf stats
 CA.6.1.bog            6    284  21014  173065  1032129 1458733* 467554    1458733* 2805325
 newbler.2.5p1.deNovo  8    2475 20731  110137  1030785 1408642  349895    1408642  2799157
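
The script that produced the min/max/mean/n50/sum columns is not given in these notes. As a rough sketch, assuming the common definition of N50 (the length L such that sequences of length at least L contain half or more of the total bases), the columns could be computed as follows. The function names and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of
    the total bases (assumed definition; the notes do not state theirs)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summary(lengths):
    """Count, min, max, integer mean, N50 and total for a list of lengths."""
    if not lengths:
        return {"elem": 0, "min": 0, "max": 0, "mean": 0, "n50": 0, "sum": 0}
    total = sum(lengths)
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": total // len(lengths),
        "n50": n50(lengths),
        "sum": total,
    }


# Toy contig lengths, not taken from any assembly in these notes.
print(summary([100, 200, 300, 400, 500]))
```

Because N50 depends on the total assembled length, it is only directly comparable between assemblies whose sums are similar, as they roughly are in the tables above.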

Reference based(Saureus USA300)

 .                     ctgs min  q1     q2      q3      max     mean       n50      sum
 newbler.2.3.refMapper 206  103  556    3098    15366   117487  12749      40687    2626469

Illumina101.78X.cor

Comments:

 mated reads; lib mean/std (CA estimates)=170/21
 the first 2 bases and the last base show composition bias; should we trim them?

Data

                    reads       min    max        mean  sum            cvg
 Total              30,597,352  101    101        101   3090332552     1065
 Sampled            2,295,176   101    101        101   231812776      78
 Corrected(paired)  1,479,510   30     101        89    131973249      45.9

DeNovo

 # ctg stats
              elem  min   q1     q2     q3     max     mean   n50    sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       73    1716  14503  26431  52086  148524* 39043  58038* 2850157  31075  7700   31075   75      14            6  
 SOAPdenovo   9382  32    32     52     63     85850   347    16726  3259522  34960  722    37358   24      0             0  
 velvet       453   61    85     224    3279   137163  6297   36496  2852552  53075  10478  53410   156     17            6

 # scf stats
              elem  min   q1     q2     q3     max     mean   n50    sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       67    1716  14869  32929  58038  148524* 42541  65383* 2850277  31075  7700   31075   75      20            7               # 1 large rearrangement :  scf120001252361 : NC_010079:622381-671743
 SOAPdenovo   186   100   333    1528   17623  144079  15625  55558  2906207  61878  26070  62049   65      453           6  
 velvet       427   61    82     177    2982   137163  6685   37874  2854649  53268  10478  53623   156     42            8

Location

 /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100.78X/
 /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Assembly//Illuminap100.78X.cor/

Illumina101.150X.cor

Comments:

 the first 2 bases and the last base show composition bias; should we trim them?

Data

                    reads       min    max        mean  sum            cvg
 Total              30,597,352  101    101        101   3090332552     1065
 Sampled            4,404,626   101    101        101   444867226      154
 Corrected(paired)  2,815,584   30     101        89    252331825      87

DeNovo

 # ctg stats
              elem  min   q1     q2     q3     max     mean   n50     sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       47    1866  11740  37171  88747  388257* 60272  121078* 2832762  39703  28204  39775   142     12            9  
 SOAPdenovo   15229 32    33     48     63     74601   232    13773   3530251  31690  269    36416   17      0             0  
 velvet       429   61    86     206    3147   134034  6650   40448   2852695  52754  9400   52940   152     24            9

 # scf stats
              elem  min   q1     q2     q3     max     mean   n50     sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       46    1866  11740  39650  88747  388257* 61582  129426* 2832782  39703  28204  39775   142     13            10  
 SOAPdenovo   158   101   202    1249   20910  150305  18353  74149   2899760  64028  30903  64392   52      506           7   
 velvet       409   61    85     180    2854   142341  6978   42466   2853854  52940  9400   52751   150     43            12

Bacterium, E coli

 Complete genome        : NC_000913      4639675bp  Escherichia coli str. K-12 substr. MG1655

 454 FLX                : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX000348?report=full 
 Illumina 101bp paired  : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX016044?report=full

Illumina101.cor

Data

 mated reads; lib mean/std (CA estimates)=160/24

                    reads       min    max        mean  sum            cvg
 Total              20,635,060  101    101        101   2084141060     449
 Sampled            3,591,676   101    101        101   362759276      78
 Corrected(paired)  1,556,316   30     101        79    122785294      26
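
The cvg column in the table above is consistent with total bases divided by the length of the NC_000913 reference. A minimal sketch of that arithmetic follows, together with the inverse calculation (how many reads to sample for a target coverage). The function names are illustrative, and the sampling procedure actually used for these data sets is not described in the notes.

```python
GENOME_SIZE = 4639675  # NC_000913, from the genome line above


def coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases per reference base."""
    return total_bases / genome_size


def reads_for_coverage(target_cvg, genome_size, read_len):
    """Smallest number of fixed-length reads reaching the target coverage."""
    needed_bases = target_cvg * genome_size
    return -(-needed_bases // read_len)  # integer ceiling division


# "sum" values copied from the Total, Sampled and Corrected(paired) rows.
for label, bases in [("Total", 2084141060),
                     ("Sampled", 362759276),
                     ("Corrected(paired)", 122785294)]:
    print(label, round(coverage(bases, GENOME_SIZE)))

# Reads of 101bp needed for roughly 78X.
print(reads_for_coverage(78, GENOME_SIZE, 101))
```

The rounded values reproduce the 449, 78 and 26 shown in the cvg column.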

DeNovo

 # ctg stats
                      elem       min    q1     q2     q3     max        mean       n50        sum         q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog               677        1006   2265   4527   8853   45500      6604       10186      4471053     189441  162537  190445  170     16            4  
 SOAPdenovo           4000       32     35     57     101    43816      1172       9327       4687158     63179   3026    78607   67      1             0  
 velvet               711        61     166    2135   8579   57936*     6375       16400*     4532895     113612  30252   114128  166     31            5

 # scf stats
                      elem       min    q1     q2     q3     max        mean       n50        sum         q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog               658        1006   2343   4621   9033   45500      6795       10483      4471433     189441  162537  190445  170     35            4   
 SOAPdenovo           270        100    269    3431   24735  219248*    17273      53882*     4663670     111651  71505   112331  128     826           2   
 velvet               480        61     92     1147   13028  114907     9523       32233      4571240     113936  30253   114192  161     262           11

Location

 /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Data/Illuminap100/
 /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Assembly/Illuminap100.cor/

Human, a single chromosome, medium-sized

Latest online assembly

 ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
  NC_000014.8    107349540

Human bowtie indexes

  /fs/szdata/bowtie_indexes/h_sapiens_37_asm

Illumina data

 Human NA12878 Genome on Illumina 
 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
 ginko:/scratch1/Human_NA12878_on_Illumina/

 #Fragment (mean insert size: 155bp, SD 26), 101 bp read length
 Lib          #Spots  #Bases  #Reads     #Mates     ReadLen  InsMea  InStd  InsMin  InsMax   TrimReadLen
 SRR067787    82.4M   16.6G   652448124  324283604  101      155     26     77      458     
 SRR067789    82.6M   16.7G   654133372  324876520  101      155     26     77      458     
 SRR067780    83.3M   16.8G   660001672  328021140  101      155     26     77      458     
 SRR067791    83.0M   16.8G   657963460  327205952  101      155     26     77      458     
 SRR067793    77.0M   15.5G   609634756  303094956  101      155     26     77      458     
 SRR067784    83.3M   16.8G   660118460  328244560  101      155     26     77      458     
 SRR067785    81.6M   16.5G   646350512  321174108  101      155     26     77      458     
 SRR067792    83.8M   16.9G   663997828  330084304  101      155     26     77      458     
 SRR067577    46.3M   9.3G    367673108  183472948  101      155     26     77      458     
 SRR067579    46.0M   9.3G    365743380  182532676  101      155     26     77      458     
 SRR067578    46.5M   9.4G    369557476  184410788  101      155     26     77      458     
 
 #Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
 SRR067771    81.5M   16.5G   644846296  320822716  101      2283    221    1620    2586    
 SRR067777    82.6M   16.7G   653163608  325232944  101      2283    221    1620    2586    
 SRR067781    82.1M   16.6G   649748720  323656576  101      2283    221    1620    2586    
 SRR067776    79.9M   16.1G   632590344  315165892  101      2283    221    1620    2586    
 
 #Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
 SRR067773    93.1M   18.8G   736456192  366884512  101      2803    271    1990    3106    
 SRR067779    94.0M   19.0G   743564440  370214028  101      2803    271    1990    3106    
 SRR067778    97.3M   19.6G   767984324  381879652  101      2803    271    1990    3106    
 SRR067786    94.6M   19.1G   747631104  372002548  101      2803    271    1990    3106    
 
 #Fosmid1  (mean insert size: 35295bp, SD 2703), 76 bp read length
 SRR068214    13.1M   2.0G    104505420  52087176   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
 SRR068211    4.8M    736.9M  38612196   19252408   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
 
 #Fosmid2 (mean insert size: 35318bp, SD 2759),  101 bp read length
 SRR068335    67.4M   13.6G   533805860  265481252  101      35318   2759   27041   35621   61(trim 20bp at 5',20bp at 3')
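
The TrimReadLen values for the fosmid libraries follow from removing 20bp at each end: 76 - 40 = 36 and 101 - 40 = 61. A small sketch of such fixed-length trimming is shown here; the function name `trim_ends` is hypothetical and the tool actually used is not named in these notes. For FASTQ input the same slice would be applied to the quality string.

```python
def trim_ends(seq, trim5=20, trim3=20):
    """Remove `trim5` bases from the 5' end and `trim3` from the 3' end.
    Returns an empty string if nothing would be left."""
    if trim5 + trim3 >= len(seq):
        return ""
    return seq[trim5:len(seq) - trim3]


# Placeholder reads of the two fosmid read lengths listed above.
print(len(trim_ends("A" * 76)), len(trim_ends("A" * 101)))
```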

Comments
- Human chromosome 14. The choice of chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and SOAP. We've downloaded the data and Todd is going to create a data set representing just chr 14, to make the assembly feasible. We'll then try to assemble that data with all 3 assemblers: CA, SOAP, Allpaths-LG.

Dpuiu Assemblathon: Difference between revisions