Dpuiu Assemblathon: Difference between revisions

== Staph aureus USA300 ==


=== Data ===
   Complete genome        : NC_010079      2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
   In progress genome    : NZ_AASB00000000 2810505bp Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs
   Illumina 101bp paired  : Staphylococcus aureus subsp. aureus USA300_TCH1516        http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full


=== Illumina101.78X.cor ===
=== Assemblies ===


* Comments:
  [[Media:Staphylococcus_aureus.genome.summary|Staphylococcus_aureus.genome.summary]]
  mated reads; library insert mean/std (CA estimates) = 170/21
  the first 2 bases and the last base show composition bias; should we trim them?
 
* Data
                     reads       min  max  mean  sum         cvg
  Total              30,597,352  101  101  101   3090332552  1065
  Sampled            2,295,176   101  101  101   231812776   78
  Corrected(paired)  1,479,510   30   101  89    131973249   45.9
 
* DeNovo
  # ctg stats
               elem    min    q1      q2      q3      max       mean    n50       sum       q.0cvg   r.0cvg   1.0cvg   1.snps   1.breaks.all   1.breaks.1k+
  CA.bog       73      1716   14503   26431   52086   148524*   39043   58038*    2850157   31075    7700     31075    75       14             6
  SOAPdenovo   9382    32     32      52      63      85850     347     16726     3259522   34960    722      37358    24       0              0
  velvet       453     61     85      224     3279    137163    6297    36496     2852552   53075    10478    53410    156      17             6
 
  # scf stats
               elem    min    q1      q2      q3      max       mean    n50       sum       q.0cvg   r.0cvg   1.0cvg   1.snps   1.breaks.all   1.breaks.1k+
  CA.bog       67      1716   14869   32929   58038   148524*   42541   65383*    2850277   31075    7700     31075    75       20             7              # 1 large rearrangement : scf120001252361 : NC_010079:622381-671743
  SOAPdenovo   186     100    333     1528    17623   144079    15625   55558     2906207   61878    26070    62049    65       453            6
  velvet       427     61     82      177     2982    137163    6685    37874     2854649   53268    10478    53623    156      42             8
 
* Location
  /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100.78X/
  /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Assembly//Illuminap100.78X.cor/
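
The cvg column in the Data table appears to be total bases divided by genome length. A minimal Python sketch of that arithmetic, assuming the 2,872,915 bp NC_010079 reference as the denominator; the page does not say which genome size was actually used, and not every row reproduces exactly with this value. The function name is illustrative only.

```python
# Assumption: coverage = total sequenced bases / genome length.
# GENOME_LEN is the NC_010079 length listed in the Data section; the
# genome size actually used for the table is not stated on this page.
GENOME_LEN = 2_872_915


def coverage(total_bases, genome_len=GENOME_LEN):
    """Return fold coverage for a read set with the given total base count."""
    return total_bases / genome_len


if __name__ == "__main__":
    # "sum" values copied from the Illumina101.78X.cor Data table.
    rows = {
        "Total": 3_090_332_552,
        "Sampled": 231_812_776,
        "Corrected(paired)": 131_973_249,
    }
    for name, total in rows.items():
        print(f"{name:18s} {coverage(total):8.1f}X")
```

The corrected-reads row comes out close to the 45.9 reported in the table; the other rows land a few percent away, which suggests a slightly different genome-size estimate was used for them.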
 
=== Illumina101.150X.cor ===
 
* Comments:
  the first 2 bases and the last base show composition bias; should we trim them?
 
* Data
                     reads       min  max  mean  sum         cvg
  Total              30,597,352  101  101  101   3090332552  1065
  Sampled            4,404,626   101  101  101   444867226   154
  Corrected(paired)  2,815,584   30   101  89    252331825   87
 
* DeNovo
  # ctg stats
               elem    min    q1      q2      q3      max       mean    n50       sum       q.0cvg   r.0cvg   1.0cvg   1.snps   1.breaks.all   1.breaks.1k+
  CA.bog       47      1866   11740   37171   88747   388257*   60272   121078*   2832762   39703    28204    39775    142      12             9
  SOAPdenovo   15229   32     33      48      63      74601     232     13773     3530251   31690    269      36416    17       0              0
  velvet       429     61     86      206     3147    134034    6650    40448     2852695   52754    9400     52940    152      24             9
 
  # scf stats
               elem    min    q1      q2      q3      max       mean    n50       sum       q.0cvg   r.0cvg   1.0cvg   1.snps   1.breaks.all   1.breaks.1k+
  CA.bog       46      1866   11740   39650   88747   388257*   61582   129426*   2832782   39703    28204    39775    142      13             10
  SOAPdenovo   158     101    202     1249    20910   150305    18353   74149     2899760   64028    30903    64392    52       506            7
  velvet       409     61     85      180     2854    142341    6978    42466     2853854   52940    9400     52751    150      43             12
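
For reference, the elem/min/max/mean/n50/sum columns in the ctg and scf tables can be recomputed from a list of sequence lengths. The sketch below uses the common N50 definition (the length at which the cumulative sum of lengths, sorted longest first, reaches half the assembly total). The page does not say which tool or exact definition produced the tables, so treat this as an assumption; `n50` and `summarize` are hypothetical helper names.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("empty length list")


def summarize(lengths):
    """Return a subset of the columns used in the stats tables."""
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(sum(lengths) / len(lengths)),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }


if __name__ == "__main__":
    # Toy contig lengths, not taken from any assembly on this page.
    print(summarize([2, 2, 2, 3, 3, 4, 8, 8]))
```

Some stats scripts compute N50 against a known genome size instead of the assembly total; with that variant the numbers would differ whenever the assembly sum is not equal to the genome length.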


== Human, a single chromosome, medium-sized ==

Revision as of 17:30, 4 March 2011

Links

GAGE

  • Location
 http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage

Assemblers

* Allpaths-LG    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/allpaths3-35218/
* CA             /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/
* Velvet         /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/velvet_1.0.13/
* SOAPdenovo     /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/SOAPdenovo_Release1.04

CBCB genomes

  • A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
  • A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
  • Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.

If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.

Argentine ant

Bombus impatiens

Data

  • 497,318,144 Illumina 124bp reads
  • 8 libraries; inserts:
    • 400bp
    • 3k (outie)
    • 8k (outie)
  • Traces

Adapters: in 3k & 8k libraries

C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
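
The page does not record how the adapters were removed from the 3k and 8k libraries. As a rough illustration only, the sketch below cuts a read at the first exact occurrence of any of the three sequences listed above. Exact matching is an assumption made for brevity (real reads usually need mismatch-tolerant and partial-adapter matching), and `trim_at_adapter` is a hypothetical helper, not a tool used here.

```python
# Adapter sequences copied from the list above; the C/3/5 labels are kept as given.
ADAPTERS = {
    "C": "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA",
    "3": "CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
    "5": "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG",
}


def trim_at_adapter(read, adapters=None):
    """Return the read truncated at the earliest exact adapter match, if any."""
    if adapters is None:
        adapters = ADAPTERS.values()
    cut = len(read)
    for adapter in adapters:
        pos = read.find(adapter)
        if pos != -1:
            cut = min(cut, pos)
    return read[:cut]


if __name__ == "__main__":
    # Toy read: 8 bases of insert followed by the "C" adapter and trailing bases.
    read = "ACGTACGT" + ADAPTERS["C"] + "TTTT"
    print(trim_at_adapter(read))
```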

Locations:

/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt                             # original fastq files

/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)

CA (best)

  • Stats
 .      elem   min  q1   q2    q3     max      mean    n50      sum        
 scf    1896   76   150  4044  67922  4021294  151761  1017298  287738041  
 ctg    92307  63   100  119   186    297795   2613    24781    241197400  

Location:

/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/                               # best Celera Assembly

SOAPdenovo

  • Stats
 .      elem       min    q1     q2     q3     max        mean       n50        sum
 scf    11178      100    111    135    390    5655980    23014      1205321    257251549
 ctg    10856652   31     .      .      .      85850      57         43         627095607
 ctg100 106741     100    .      .      .      85850**    2165       6939       231167576

Location:

/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/                             # best SOAPdenovo assembly

Staph aureus USA300

Data

 Complete genome        : NC_010079       2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
 In progress genome     : NZ_AASB00000000 2810505bp Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs
 Illumina 101bp paired  : Staphylococcus aureus subsp. aureus USA300_TCH1516         http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full

Assemblies

Staphylococcus_aureus.genome.summary

Human, a single chromosome, medium-sized

Data

  • Latest online assembly
 ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
  NC_000014.8    107,349,540  # total, with telomeric N's 
                  88,289,540  # clean
  • Human bowtie indexes
  /fs/szdata/bowtie_indexes/h_sapiens_37_asm
  • Illumina reads (all genome)
 Human NA12878 Genome on Illumina 
 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
 ginko:/scratch1/Human_NA12878_on_Illumina/
 #Fragment (mean insert size: 155bp, SD 26), 101 bp read length
 Lib          #Spots  #Bases  #Reads     #Mates     ReadLen  InsMea  InStd  InsMin  InsMax   TrimReadLen
 SRR067787    82.4M   16.6G   652448124  324283604  101      155     26     77      458     
 SRR067789    82.6M   16.7G   654133372  324876520  101      155     26     77      458     
 SRR067780    83.3M   16.8G   660001672  328021140  101      155     26     77      458     
 SRR067791    83.0M   16.8G   657963460  327205952  101      155     26     77      458     
 SRR067793    77.0M   15.5G   609634756  303094956  101      155     26     77      458     
 SRR067784    83.3M   16.8G   660118460  328244560  101      155     26     77      458     
 SRR067785    81.6M   16.5G   646350512  321174108  101      155     26     77      458     
 SRR067792    83.8M   16.9G   663997828  330084304  101      155     26     77      458     
 SRR067577    46.3M   9.3G    367673108  183472948  101      155     26     77      458     
 SRR067579    46.0M   9.3G    365743380  182532676  101      155     26     77      458     
 SRR067578    46.5M   9.4G    369557476  184410788  101      155     26     77      458     
 
 #Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
 SRR067771    81.5M   16.5G   644846296  320822716  101      2283    221    1620    2586    
 SRR067777    82.6M   16.7G   653163608  325232944  101      2283    221    1620    2586    
 SRR067781    82.1M   16.6G   649748720  323656576  101      2283    221    1620    2586    
 SRR067776    79.9M   16.1G   632590344  315165892  101      2283    221    1620    2586    
 
 #Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
 SRR067773    93.1M   18.8G   736456192  366884512  101      2803    271    1990    3106    
 SRR067779    94.0M   19.0G   743564440  370214028  101      2803    271    1990    3106    
 SRR067778    97.3M   19.6G   767984324  381879652  101      2803    271    1990    3106    
 SRR067786    94.6M   19.1G   747631104  372002548  101      2803    271    1990    3106    
 
 #Fosmid1  (mean insert size: 35295bp, SD 2703), 76 bp read length
 SRR068214    13.1M   2.0G    104505420  52087176   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
 SRR068211    4.8M    736.9M  38612196   19252408   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
 
 #Fosmid2 (mean insert size: 35318bp, SD 2759),  101 bp read length
 SRR068335    67.4M   13.6G   533805860  265481252  101      35318   2759   27041   35621   61(trim 20bp at 5',20bp at 3')
  • Comments
    • Human chromosome 14. The chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and Soap. We've downloaded the data and Todd is going to create a data set representing just chr 14, to make it feasible. We'll then try to assemble that data w/all 3 assemblers: CA, SOAP, Allpaths-LG.
  • Illumina chr14 reads (aligned with bowtie & corrected)
 /fs/szattic-asmg8/treangen/*fastq
 hard to align: bowtie -5 20 -3 20 -e 1000 ...
 jumping reads: only the ones aligned within the correct mean and stdev were selected; these libraries usually have a high % of short inserts!!!
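
Two pieces of arithmetic are implied by the notes above: trimming 20 bp from each end (bowtie `-5 20 -3 20`) turns 101 bp reads into 61 bp and 76 bp reads into 36 bp, matching the TrimReadLen column, and jumping pairs are kept only if their aligned insert size falls in the expected range. A minimal sketch, assuming the InsMin/InsMax columns of the library table are the accepted window (the page does not state the exact cutoff used); both function names are hypothetical.

```python
def trimmed_length(read_len, trim5=20, trim3=20):
    """Read length left after trimming trim5 bases from the 5' end and trim3 from the 3' end."""
    return max(read_len - trim5 - trim3, 0)


def within_insert_window(observed_insert, ins_min, ins_max):
    """True if an aligned pair's insert size lies inside [ins_min, ins_max]."""
    return ins_min <= observed_insert <= ins_max


if __name__ == "__main__":
    print(trimmed_length(101), trimmed_length(76))

    # Jumping1 library window taken from the InsMin/InsMax columns.
    # The observed insert sizes below are made-up values for illustration.
    ins_min, ins_max = 1620, 2586
    observed = [300, 1700, 2283, 2600]
    kept = [x for x in observed if within_insert_window(x, ins_min, ins_max)]
    print(kept)
```

Pairs with very small observed inserts (like the 300 bp example) are the short-insert contamination mentioned in the note, and they are the ones this filter drops.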

Allpaths-lg Assembly (Daniela)

  • Read counts
                             orig       cor               cor(paired,all >64bp)
 chr14_fragment_12.fastq     36504800   35571477(97.44%)  34268444(10+bp ovl F/R)
 chr14_shortjump_12.fastq    22669408   11255320(49.64%)  11255320
 chr14_longjump_12.fastq     2405064    187398   (7.79%)  187398  
  • Assembly stats:
 .          elem  min    q1     q2     q3      max       mean     n50       sum       
 scf        418   96     131    256    1236    81646936  209781   81646936  87688255  
 scf10K+    17    10330  11780  26536  269876  81646936  5135452  81646936  87302692  
 ctg        4722  96     2342   9101   24174   240773    17887    36530     84461065  
  • Runtime 1104299.893u 126549.756s 18:50:05.80 1815.2% 0+0k 0+0io 8463pf+0w
 18hr 50min :                   multiprocessor
 1104299/(3600*24)=12.78 days : singleprocessor
  • Locations
 /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/allpaths             # original
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpaths     # final contigs, scaff
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor  # corrected reads
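
The runtime figures above can be re-derived from the shell `time` line. The sketch below assumes the usual csh-style field meaning (user CPU seconds, system CPU seconds, elapsed wall-clock time, CPU utilization); the variable and function names are illustrative.

```python
# Values copied from the Runtime line: 1104299.893u 126549.756s 18:50:05.80 1815.2%
USER_S = 1104299.893                      # user CPU seconds
SYS_S = 126549.756                        # system CPU seconds
ELAPSED_S = 18 * 3600 + 50 * 60 + 5.80    # wall-clock time 18:50:05.80 in seconds


def cpu_days(cpu_seconds):
    """Convert CPU seconds to days, i.e. the single-processor equivalent."""
    return cpu_seconds / (3600 * 24)


def cpu_utilization_pct(user_s, sys_s, elapsed_s):
    """Percent CPU utilization: (user + system) / elapsed * 100."""
    return 100.0 * (user_s + sys_s) / elapsed_s


if __name__ == "__main__":
    print(f"single-processor estimate (user time only): {cpu_days(USER_S):.2f} days")
    print(f"wall-clock: {ELAPSED_S / 3600:.2f} hours")
    print(f"utilization: {cpu_utilization_pct(USER_S, SYS_S, ELAPSED_S):.1f}%")
```

The 12.78-day figure counts user time only, as in the note above. Utilization is roughly 1815%, or about 18 cores busy on average.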

All stats