Dpuiu Assemblathon


Links

GAGE

  • Location
 http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage

Assemblers

* CA             /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/Linux-amd64/bin/runCA 
* Newbler
* Velvet
* SOAPdenovo
* Maq

CBCB genomes

  • A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
  • A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
  • Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.

If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.

Argentine ant

Bee, Bombus impatiens

Data

  • 497,318,144 Illumina 124bp reads
  • 8 libraries; inserts:
    • 400bp
    • 3k (outie)
    • 8k (outie)
  • Traces

Adapters: in 3k & 8k libraries

C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
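A minimal sketch of how reads from the 3k and 8k libraries could be clipped at these adapters. This is not the tool that produced the adaptor-free files listed below; the function name clip_adapter, the exact-match strategy, and the min_match seed length are assumptions made for illustration.

```python
# Adapter sequences copied from the list above.
ADAPTERS = [
    "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA",
    "CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
    "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG",
]


def clip_adapter(read, adapters=ADAPTERS, min_match=12):
    """Cut the read at the first exact hit of any adapter's leading bases.

    min_match (hypothetical default) is the number of leading adapter
    bases that must match exactly; mismatches are not tolerated here.
    """
    cut = len(read)
    for adapter in adapters:
        pos = read.find(adapter[:min_match])
        if pos != -1:
            cut = min(cut, pos)
    return read[:cut]


if __name__ == "__main__":
    read = "ACGTACGTAC" + ADAPTERS[0][:20]
    print(clip_adapter(read))
```

A real trimmer would also allow mismatches and partial adapter hits at the read end; exact matching is used here only to keep the example short.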

Locations:

/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt                             # original fastq files

/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/                               # best Celera Assembly
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/                             # best SOAPdenovo assembly

Bacterium, Staph aureus USA300

 Complete genome        : NC_010079       2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
 In progress genome     : NZ_AASB00000000 2810505bp Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs
 454 FLX                : Staphylococcus aureus subsp. aureus USA300_TCH959 HMP0023  http://www.ncbi.nlm.nih.gov/sra/SRX002327?report=full 
 Illumina 101bp paired  : Staphylococcus aureus subsp. aureus USA300_TCH1516         http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full

454FLX

  • Data
                       reads   min    q1     q2     q3     max        mean       n50        sum            
 sff(original)         334241  36     201    256    277    362        235        264        78432227      
 sffToCA(adaptor free) 325555  51     103    148    208    358        157        185        51225615   # 58723 mates
  • DeNovo
 # ctg stats
 .                     ctgs min  q1     q2      q3      max     mean       n50      sum      
 CA.6.1.bog            18   238  21014  135971  239466  567548* 155836     277888*  2805055
 newbler.2.5p1.deNovo  100  103  295    3287    39467   229053  27879      78379    2787870
 # scf stats
 CA.6.1.bog            6    284  21014  173065  1032129 1458733* 467554    1458733* 2805325
 newbler.2.5p1.deNovo  8    2475 20731  110137  1030785 1408642  349895    1408642  2799157
  • Reference based(Saureus USA300)
  .                    ctgs min  q1     q2      q3      max     mean       n50      sum      
 newbler.2.3.refMapper 206  103  556    3098    15366   117487  12749      40687    2626469
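The columns in these stats tables (elem/ctgs, min, q1, q2, q3, max, mean, n50, sum) can be computed from a list of contig or scaffold lengths roughly as follows. This is a sketch, not the script used for the tables: the quartile convention (index-based, no interpolation) and the integer mean are assumptions, and length_stats is a hypothetical name.

```python
def length_stats(lengths):
    """Summary statistics for a list of sequence lengths."""
    s = sorted(lengths)
    n = len(s)
    total = sum(s)

    # N50: walking from the longest sequence down, the length at which
    # the running sum first reaches half of the total bases.
    running = 0
    n50 = 0
    for length in reversed(s):
        running += length
        if 2 * running >= total:
            n50 = length
            break

    return {
        "elem": n,
        "min": s[0],
        "q1": s[n // 4],         # assumed quartile convention
        "q2": s[n // 2],
        "q3": s[(3 * n) // 4],
        "max": s[-1],
        "mean": total // n,      # assumed integer mean
        "n50": n50,
        "sum": total,
    }


if __name__ == "__main__":
    print(length_stats([2, 3, 4, 5, 6]))
```

When one sequence holds more than half of the bases, n50 equals max, as in the CA and Newbler scaffold rows above.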

Illumina101.78X.cor

  • Comments:
 mated reads; lib mean/std (CA estimates)=170/21
 first 2 and last base show composition bias; should we trim them???
 
  • Data
                    reads       min    max        mean  sum            cvg
 Total              30,597,352  101    101        101   3090332552     1065
 Sampled            2,295,176   101    101        101   231812776      78
 Corrected(paired)  1,479,510   30     101        89    131973249      45.9
  • DeNovo
 # ctg stats
              elem  min   q1     q2     q3     max     mean   n50    sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       73    1716  14503  26431  52086  148524* 39043  58038* 2850157  31075  7700   31075   75      14            6  
 SOAPdenovo   9382  32    32     52     63     85850   347    16726  3259522  34960  722    37358   24      0             0  
 velvet       453   61    85     224    3279   137163  6297   36496  2852552  53075  10478  53410   156     17            6  
 # scf stats
              elem  min   q1     q2     q3     max     mean   n50    sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       67    1716  14869  32929  58038  148524* 42541  65383* 2850277  31075  7700   31075   75      20            7               # 1 large rearrangement :  scf120001252361 : NC_010079:622381-671743
 SOAPdenovo   186   100   333    1528   17623  144079  15625  55558  2906207  61878  26070  62049   65      453           6  
 velvet       427   61    82     177    2982   137163  6685   37874  2854649  53268  10478  53623   156     42            8
  • Location
 /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100.78X/
 /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Assembly/Illuminap100.78X.cor/

Illumina101.150X.cor

  • Comments:
 first 2 and last base show composition bias; should we trim them???
  • Data
                    reads       min    max        mean  sum            cvg
 Total              30,597,352  101    101        101   3090332552     1065
 Sampled            4,404,626   101    101        101   444867226      154
 Corrected(paired)  2,815,584   30     101        89    252331825      87
  • DeNovo
 # ctg stats
              elem  min   q1     q2     q3     max     mean   n50     sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       47    1866  11740  37171  88747  388257* 60272  121078* 2832762  39703  28204  39775   142     12            9  
 SOAPdenovo   15229 32    33     48     63     74601   232    13773   3530251  31690  269    36416   17      0             0  
 velvet       429   61    86     206    3147   134034  6650   40448   2852695  52754  9400   52940   152     24            9  
 # scf stats
              elem  min   q1     q2     q3     max     mean   n50     sum      q.0cvg r.0cvg 1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog       46    1866  11740  39650  88747  388257* 61582  129426* 2832782  39703  28204  39775   142     13            10  
 SOAPdenovo   158   101   202    1249   20910  150305  18353  74149   2899760  64028  30903  64392   52      506           7   
 velvet       409   61    85     180    2854   142341  6978   42466   2853854  52940  9400   52751   150     43            12

Bacterium, E coli

 Complete genome        : NC_000913      4639675bp  Escherichia coli str. K-12 substr. MG1655

 454 FLX                : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX000348?report=full 
 Illumina 101bp paired  : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX016044?report=full

Illumina101.cor

  • Data
 mated reads; lib mean/std (CA estimates)=160/24
                    reads       min    max        mean  sum            cvg
 Total              20,635,060  101    101        101   2084141060     449
 Sampled            3,591,676   101    101        101   362759276      78
 Corrected(paired)  1,556,316   30     101        79    122785294      26
  • DeNovo
 # ctg stats
                      elem       min    q1     q2     q3     max        mean       n50        sum         q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog               677        1006   2265   4527   8853   45500      6604       10186      4471053     189441  162537  190445  170     16            4  
 SOAPdenovo           4000       32     35     57     101    43816      1172       9327       4687158     63179   3026    78607   67      1             0  
 velvet               711        61     166    2135   8579   57936*     6375       16400*     4532895     113612  30252   114128  166     31            5  
 # scf stats
                      elem       min    q1     q2     q3     max        mean       n50        sum         q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
 CA.bog               658        1006   2343   4621   9033   45500      6795       10483      4471433     189441  162537  190445  170     35            4   
 SOAPdenovo           270        100    269    3431   24735  219248*    17273      53882*     4663670     111651  71505   112331  128     826           2   
 velvet               480        61     92     1147   13028  114907     9523       32233      4571240     113936  30253   114192  161     262           11
  • Location
 /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Data/Illuminap100/
 /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Assembly/Illuminap100.cor/
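The cvg column in the E. coli Data table is consistent with total bases divided by the reference genome size (4639675bp for NC_000913). A small sketch of that arithmetic, plus the read fraction one would sample to reach a target coverage; the function names are hypothetical and the sampling helper is an assumption about how a subsample could be sized, not a record of how it was done.

```python
ECOLI_GENOME_SIZE = 4639675  # NC_000913, from the table above


def coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases per reference base."""
    return total_bases / genome_size


def sampling_fraction(total_bases, genome_size, target_cvg):
    """Fraction of reads to keep so the sample reaches target_cvg."""
    return min(1.0, target_cvg * genome_size / total_bases)


if __name__ == "__main__":
    for label, bases in [("Total", 2084141060),
                         ("Sampled", 362759276),
                         ("Corrected(paired)", 122785294)]:
        print(label, round(coverage(bases, ECOLI_GENOME_SIZE), 1))
    print(round(sampling_fraction(2084141060, ECOLI_GENOME_SIZE, 78), 3))
```

The three rows come out at roughly 449X, 78X and 26X, in line with the table.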

Human, a single chromosome, medium-sized

Data

  • Latest online assembly
 ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
  NC_000014.8    107,349,540
  • Human bowtie indexes
  /fs/szdata/bowtie_indexes/h_sapiens_37_asm
  • Illumina reads (all genome)
 Human NA12878 Genome on Illumina 
 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
 ginko:/scratch1/Human_NA12878_on_Illumina/
 #Fragment (mean insert size: 155bp, SD 26), 101 bp read length
 Lib          #Spots  #Bases  #Reads     #Mates     ReadLen  InsMea  InStd  InsMin  InsMax   TrimReadLen
 SRR067787    82.4M   16.6G   652448124  324283604  101      155     26     77      458     
 SRR067789    82.6M   16.7G   654133372  324876520  101      155     26     77      458     
 SRR067780    83.3M   16.8G   660001672  328021140  101      155     26     77      458     
 SRR067791    83.0M   16.8G   657963460  327205952  101      155     26     77      458     
 SRR067793    77.0M   15.5G   609634756  303094956  101      155     26     77      458     
 SRR067784    83.3M   16.8G   660118460  328244560  101      155     26     77      458     
 SRR067785    81.6M   16.5G   646350512  321174108  101      155     26     77      458     
 SRR067792    83.8M   16.9G   663997828  330084304  101      155     26     77      458     
 SRR067577    46.3M   9.3G    367673108  183472948  101      155     26     77      458     
 SRR067579    46.0M   9.3G    365743380  182532676  101      155     26     77      458     
 SRR067578    46.5M   9.4G    369557476  184410788  101      155     26     77      458     
 
 #Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
 SRR067771    81.5M   16.5G   644846296  320822716  101      2283    221    1620    2586    
 SRR067777    82.6M   16.7G   653163608  325232944  101      2283    221    1620    2586    
 SRR067781    82.1M   16.6G   649748720  323656576  101      2283    221    1620    2586    
 SRR067776    79.9M   16.1G   632590344  315165892  101      2283    221    1620    2586    
 
 #Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
 SRR067773    93.1M   18.8G   736456192  366884512  101      2803    271    1990    3106    
 SRR067779    94.0M   19.0G   743564440  370214028  101      2803    271    1990    3106    
 SRR067778    97.3M   19.6G   767984324  381879652  101      2803    271    1990    3106    
 SRR067786    94.6M   19.1G   747631104  372002548  101      2803    271    1990    3106    
 
 #Fosmid1  (mean insert size: 35295bp, SD 2703), 76 bp read length
 SRR068214    13.1M   2.0G    104505420  52087176   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
 SRR068211    4.8M    736.9M  38612196   19252408   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
 
 #Fosmid2 (mean insert size: 35318bp, SD 2759),  101 bp read length
 SRR068335    67.4M   13.6G   533805860  265481252  101      35318   2759   27041   35621   61(trim 20bp at 5',20bp at 3')
  • Comments
    • Human chromosome 14. The chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and Soap. We've downloaded the data and Todd is going to create a data set representing just chr 14, to make it feasible. We'll then try to assemble that data w/all 3 assemblers: CA, SOAP, Allpaths-LG.
  • Illumina chr14 reads (aligned with bowtie & corrected)
 /fs/szattic-asmg8/treangen/*fastq
 hard to align: bowtie -5 20 -3 20 -e 1000 ...
 jumping reads: only the pairs aligned within the correct mean, stdev were selected; these libraries usually have a high % of short inserts!!!
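The two steps noted above can be sketched as follows: trimming 20 bases from each read end before alignment (what bowtie's -5 20 -3 20 does) and keeping only jumping pairs whose aligned insert size falls near the library mean. The window width k is an assumption, since the notes do not say how many standard deviations were allowed, and both function names are hypothetical.

```python
def trim_ends(seq, five=20, three=20):
    """Drop `five` bases from the 5' end and `three` from the 3' end."""
    return seq[five:len(seq) - three]


def insert_in_window(insert_size, mean, sd, k=3.0):
    """True if the insert size is within k standard deviations of the mean.

    k=3.0 is an assumed default, not a value taken from the notes.
    """
    return abs(insert_size - mean) <= k * sd


if __name__ == "__main__":
    # 76bp and 101bp reads shrink to 36bp and 61bp, as in TrimReadLen.
    print(len(trim_ends("A" * 76)), len(trim_ends("A" * 101)))
    # Jumping1 library: mean 2283, SD 221.
    print(insert_in_window(2300, 2283, 221))  # typical jumping pair
    print(insert_in_window(300, 2283, 221))   # short-insert contaminant
```

Pairs with short inserts fail the window test, which is how the high fraction of short inserts in these libraries gets filtered out.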

quake

  • Read counts
                             orig       cor               cor(paired)   cor(>=64bp good qual)    cor(>=64bp good qual,paired)
 chr14_fragment_12.fastq     36504800   34027168(93.21)   32621862      31103950                 38090442
 chr14_shortjump_12.fastq    22669408   17693564(78.05)   14054994      13695195                 8287760
 chr14_longjump_12.fastq     2405064    2140481 (88.99)   2009674       1727197                  1446566
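The percentages in the cor column are the corrected read count relative to the original count. They appear to be truncated to two decimals rather than rounded (the longjump row only matches 88.99 under truncation); the sketch below assumes that, and pct_retained is a hypothetical name.

```python
def pct_retained(orig, kept):
    """Percentage of reads kept, truncated (not rounded) to two decimals."""
    return (kept * 10000 // orig) / 100


if __name__ == "__main__":
    rows = [
        ("chr14_fragment_12.fastq", 36504800, 34027168),
        ("chr14_shortjump_12.fastq", 22669408, 17693564),
        ("chr14_longjump_12.fastq", 2405064, 2140481),
    ]
    for name, orig, cor in rows:
        print(name, pct_retained(orig, cor))
```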

Allpaths-lg Assembly (Daniela)

  • Read counts
                             orig       cor               cor(paired,all >64bp)
 chr14_fragment_12.fastq     36504800   35571477(97.44%)  34268444(10+bp ovl F/R)
 chr14_shortjump_12.fastq    22669408   11255320(49.64%)  11255320
 chr14_longjump_12.fastq     2405064    187398   (7.79%)  187398  
  • Assembly stats:
 .          elem  min    q1     q2     q3      max       mean     n50       sum       
 scf        418   96     131    256    1236    81646936  209781   81646936  87688255  
 scf10K+    17    10330  11780  26536  269876  81646936  5135452  81646936  87302692  
 ctg        4722  96     2342   9101   24174   240773    17887    38359     84461065  
  • Locations
 /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/allpaths             # orig
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpaths     # partial copy

CA (Tanja)

  • Assembly stats:
 .          elem    min    q1     q2     q3     max     mean   n50    sum       
 scf        7912    100    5192   7193   12981  178086  11077  15594  87637684  
 scf10k+    2776    10004  12458  16572  25284  178086  21432  23847  59496415  
 ctg        13460   65     2062   3222   6428   178086  6371   12631  85747353  
 deg        207300  64     90     101    102    15353   111    101    22941304  
  • Locations
 mulberry:/scratch2/tmagoc/chr14/sang/                   # orig
 /fs/szattic-asmg4/tmagoc/GAGE/human/CA

SOAPdenovo (Tanja)

  • Assembly stats
 .          elem     min    q1     q2     q3     max        mean       n50        sum            
 scf        7874     100    160    441    1542   1447999    13609      246690     107159938   
 scf10K+    816      10135  33755  54468  153184 1447999    123427     275605     100716127    
 ctg        1001208  32     32     35     49     19444      114        974        114298188      
  • Location
 /fs/szattic-asmg4/tmagoc/GAGE/human/soapDenovo/k31gapClosed/assembly.summary