Dpuiu Assemblathon: Difference between revisions

Line 84:
** K40+ too large: no "valley" in the kmerFreq histogram
   paste SOAPdenovo.K??.quakeCor.k18/genome.K??.kmerFreq | nl0 | head
    0   K23      K31      K35      K47      K63      K91
    1   1215557  1324588  1345050  1341251  1267566  1320413
    2   99008    114112   121642   154462   271530*  607742
    3   42016*   63476*   77549*   142340*  294349   365241
    4   49699    79492    98636    177143   316061*  199863
    5   68867    104994   127443   209888   310508   103830
    6   92256    133034   156821   229779   281005   52742
    7   115782   157642   178836   232521*  240082   26005
    8   136882   175040   191909   225207   200114   12960
    9   153819   183152   195194*  206206   166881   6688
    10  162669   183863*  190133   181641   139384   3658
    11  167123*  179571   178403   159411   113550   2502
    12  164594   164750   160745   139853   94505    1912
    13  156888   150408   144201   122557   78537    1589
    14  146575   135817   129665   107723   61636    1259
    15  132814   122688   115605   94830    49006    1107
    16  122214   109744   104458   83563    38171    899
    17  110636   98653    92573    73674    28860    765

   paste SOAPdenovo.K??.allpathsCor/genome.K??.kmerFreq | nl0 | more
    0   K23      K31      K35      K47      K63      K91
    1   8739     10732    11912    17072    36392    551062
    2   8787     11401    13290    22170    60715    591437*
    3   12234    16630    19450    34113    102041   491309
    4   16256    22470    26838    52586    149043   347252
    5   22106    31615    39184    77664    194089   226048
    6   31106    46089    56270    107253   225484   140047
    7   43196    63267    76838    134323   240196*  81910
    8   57224    82197    98380    160232   238399   47334
    9   73715    101814   118827   175541   223600   26993
    10  90461    119701   136018   185207   203221   15402
    11  105636   135515   150537   185381*  176277   9011
    12  119236   144979   156871   175924   155002   5521
    13  128641   149954*  156996*  164873   133640   3513
    14  135628   149639   153938   150354   114657   2584
    15  137244*  145976   147385   136342   98152    1890
    16  135666   138605   136978   121743   84865    1550

Revision as of 18:02, 14 March 2011

Links

GAGE

  • Location
 http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage

Assemblers

* Allpaths-LG    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/allpaths3-35218/
* CA             /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/
* Velvet         /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/velvet_1.0.13/
* SOAPdenovo     /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/SOAPdenovo_Release1.04

CBCB genomes

  • A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
  • A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
  • Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.

If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.

Argentine ant

Bombus impatiens

Data

  • 497,318,144 Illumina 124bp reads
  • 8 libraries; inserts:
    • 400bp
    • 3k (outie)
    • 8k (outie)
  • Traces

Adapters: in 3k & 8k libraries

C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
  • Locations:
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt                             # original fastq files

/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)
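
The page does not say how the adapter-free reads were produced. As a rough illustration only, the sketch below clips a read at an exact adapter match, or at a partial adapter hanging off the 3' end. The function name `clip_adapter` and the `min_overlap` parameter are hypothetical, and exact matching is an assumption; a real run would need to tolerate sequencing errors.

```python
def clip_adapter(read, adapter, min_overlap=8):
    """Return the read truncated where the adapter begins.

    Handles two cases (exact matches only, an assumption of this sketch):
      1. the full adapter occurs somewhere inside the read;
      2. the read ends with a prefix of the adapter of at least
         min_overlap bases.
    Reads with no adapter hit are returned unchanged.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    shortest = max(min_overlap, 1)  # guard against a zero-length slice
    longest = min(len(read), len(adapter) - 1)
    # Try the longest possible overlap first.
    for n in range(longest, shortest - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read


# Adapter "C" from the list above.
ADAPTER_C = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"

full_hit = clip_adapter("ACGTACGTAC" + ADAPTER_C + "TTT", ADAPTER_C)
partial_hit = clip_adapter("ACGTACGTAC" + ADAPTER_C[:12], ADAPTER_C)
no_hit = clip_adapter("ACGTACGT", ADAPTER_C)
```

Trying the longest overlap first avoids leaving adapter bases behind when a short prefix also happens to match.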

Assembly

/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/                               # best Celera Assembly
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/                             # best SOAPdenovo assembly

Staph aureus USA300

Data

  • Complete genome  : NC_010079 2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
 SRP001086 Staphylococcus aureus Sequencing on Illumina
 SRX007714 pair lib
 SRX007711 jumping lib
  • Locations:
 /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/
 /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/

Assembly

 ~dpuiu/GAGE/Staphylococcus_aureus/
 /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X
 /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/
  • SOAPdenovo v1.05 :
    • new quake version did not help much (quake-0.2.2 vs davek44-error_correction-28dbe11)
    • SOAPdenovo map -K 37+ : fails on quakeCor.k18 corrected reads
    • according to the kmerFreq histograms, -K values above 47 should probably be avoided
    • longer kmer => longer scaffolds
    • longer kmer => shorter contigs
    • K40+ too large: no "valley" in the kmerFreq histogram
 paste SOAPdenovo.K??.quakeCor.k18/genome.K??.kmerFreq | nl0 | head
 paste SOAPdenovo.K??.allpathsCor/genome.K??.kmerFreq | nl0 | more
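
The "valley" test can be sketched in a few lines. The example assumes each kmerFreq column is a list in which entry i is the number of k-mers seen i+1 times; `find_valley_and_peak` is a hypothetical helper, not part of SOAPdenovo. The two lists are the K23 and K91 columns of the quakeCor.k18 histogram recorded for this page.

```python
def find_valley_and_peak(counts):
    """Locate the error/genomic boundary in a k-mer frequency histogram.

    counts[i] is the number of k-mers observed (i + 1) times.
    Returns (valley, peak) as 1-based multiplicities, where the valley is
    the first local minimum and the peak is the highest bin at or after it.
    Returns None when the histogram has no local minimum.
    """
    for i in range(1, len(counts) - 1):
        if counts[i] < counts[i - 1] and counts[i] < counts[i + 1]:
            peak = max(range(i, len(counts)), key=lambda j: counts[j])
            return i + 1, peak + 1
    return None


# quakeCor.k18 histogram, K23 column (multiplicities 1..17)
k23 = [1215557, 99008, 42016, 49699, 68867, 92256, 115782, 136882, 153819,
       162669, 167123, 164594, 156888, 146575, 132814, 122214, 110636]
# quakeCor.k18 histogram, K91 column (multiplicities 1..17)
k91 = [1320413, 607742, 365241, 199863, 103830, 52742, 26005, 12960, 6688,
       3658, 2502, 1912, 1589, 1259, 1107, 899, 765]

k23_result = find_valley_and_peak(k23)  # valley at 3, peak at 11
k91_result = find_valley_and_peak(k91)  # strictly decreasing, so None
```

A `None` result means error k-mers and genomic k-mers are not separated at that k-mer size, which is the situation the bullet above describes for large K.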

Rhodobacter sphaeroides

Data

  • Complete genome: Rhodobacter sphaeroides 2.4.1 : 2 chromosomes, 5 plasmids
 CP000143    3188609  
 CP000144    943016   
 DQ232586    114045   
 CP000145    114178   
 CP000146    105284   
 CP000147    100828   
 DQ232587    37100    
 Total       4603060
    • SRX033397 pair lib ; readLen=101 ; insMea=180
    • SRX016063 jumping lib ; readLen=101 ; insMea~=3455; ~15% of the mates are short inserts (~250bp)
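
As a quick consistency check, the replicon lengths listed above can be summed and compared with the stated total. This is a minimal sketch; the dictionary name `replicons` is just an illustrative choice.

```python
# Replicon lengths (bp) for Rhodobacter sphaeroides 2.4.1, as listed above.
replicons = {
    "CP000143": 3188609,
    "CP000144": 943016,
    "DQ232586": 114045,
    "CP000145": 114178,
    "CP000146": 105284,
    "CP000147": 100828,
    "DQ232587": 37100,
}

genome_size = sum(replicons.values())

# Share of the genome carried by each replicon, largest first.
for accession, length in sorted(replicons.items(), key=lambda kv: -kv[1]):
    print(f"{accession}  {length:>8}  {100 * length / genome_size:5.2f}%")
print(f"Total     {genome_size:>8}")
```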

Human, a single chromosome, medium-sized

Data

  • Latest online assembly
 ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
  NC_000014.8    107,349,540  # total, with telomeric N's 
                  88,289,540  # clean
  • Human bowtie indexes
  /fs/szdata/bowtie_indexes/h_sapiens_37_asm
  • Illumina reads (all genome)
 Human NA12878 Genome on Illumina 
 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
 ginko:/scratch1/Human_NA12878_on_Illumina/
 #Fragment (mean insert size: 155bp, SD 26), 101 bp read length
 Lib          #Spots  #Bases  #Reads     #Mates     ReadLen  InsMea  InStd  InsMin  InsMax   TrimReadLen
 SRR067787    82.4M   16.6G   652448124  324283604  101      155     26     77      458     
 SRR067789    82.6M   16.7G   654133372  324876520  101      155     26     77      458     
 SRR067780    83.3M   16.8G   660001672  328021140  101      155     26     77      458     
 SRR067791    83.0M   16.8G   657963460  327205952  101      155     26     77      458     
 SRR067793    77.0M   15.5G   609634756  303094956  101      155     26     77      458     
 SRR067784    83.3M   16.8G   660118460  328244560  101      155     26     77      458     
 SRR067785    81.6M   16.5G   646350512  321174108  101      155     26     77      458     
 SRR067792    83.8M   16.9G   663997828  330084304  101      155     26     77      458     
 SRR067577    46.3M   9.3G    367673108  183472948  101      155     26     77      458     
 SRR067579    46.0M   9.3G    365743380  182532676  101      155     26     77      458     
 SRR067578    46.5M   9.4G    369557476  184410788  101      155     26     77      458     
 
 #Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
 SRR067771    81.5M   16.5G   644846296  320822716  101      2283    221    1620    2586    
 SRR067777    82.6M   16.7G   653163608  325232944  101      2283    221    1620    2586    
 SRR067781    82.1M   16.6G   649748720  323656576  101      2283    221    1620    2586    
 SRR067776    79.9M   16.1G   632590344  315165892  101      2283    221    1620    2586    
 
 #Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
 SRR067773    93.1M   18.8G   736456192  366884512  101      2803    271    1990    3106    
 SRR067779    94.0M   19.0G   743564440  370214028  101      2803    271    1990    3106    
 SRR067778    97.3M   19.6G   767984324  381879652  101      2803    271    1990    3106    
 SRR067786    94.6M   19.1G   747631104  372002548  101      2803    271    1990    3106    
 
 #Fosmid1  (mean insert size: 35295bp, SD 2703), 76 bp read length
 SRR068214    13.1M   2.0G    104505420  52087176   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
 SRR068211    4.8M    736.9M  38612196   19252408   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
 
 #Fosmid2 (mean insert size: 35318bp, SD 2759),  101 bp read length
 SRR068335    67.4M   13.6G   533805860  265481252  101      35318   2759   27041   35621   61(trim 20bp at 5',20bp at 3')
  • Comments
    • Human chromosome 14. The chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and Soap. We've downloaded the data and Todd is going to create a data set representing just chr 14, to make it feasible. We'll then try to assemble that data w/all 3 assemblers: CA, SOAP, Allpaths-LG.
  • Illumina chr14 reads (aligned with bowtie & corrected)
 /fs/szattic-asmg8/treangen/*fastq
 hard to align: bowtie -5 20 -3 20 -e 1000 ...
 jumping reads: only pairs aligned within the correct mean and stdev were selected; these libraries usually have a high % of short inserts!!!
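
The jumping-read selection described above can be sketched as an insert-size window around the library mean. The cutoff of 3 standard deviations is an assumption (the page does not state the window used), and `within_insert_range` and `filter_pairs` are hypothetical names. The mean and SD are those listed for the Jumping1 library.

```python
def within_insert_range(insert_size, mean, sd, n_sd=3.0):
    """True if the observed insert size lies within mean +/- n_sd * sd."""
    return abs(insert_size - mean) <= n_sd * sd


def filter_pairs(pairs, mean, sd, n_sd=3.0):
    """Keep (name, insert_size) pairs whose insert size is in range."""
    return [p for p in pairs if within_insert_range(p[1], mean, sd, n_sd)]


# Jumping1 library: mean insert 2283 bp, SD 221 bp.
# With n_sd=3 the accepted window is 1620..2946 bp.
aligned_pairs = [
    ("pair_a", 2300),  # typical jumping insert, kept
    ("pair_b", 250),   # short-insert contaminant, dropped
    ("pair_c", 3100),  # too long, dropped
    ("pair_d", 1700),  # inside the window, kept
]
kept = filter_pairs(aligned_pairs, mean=2283, sd=221)
kept_names = [name for name, _ in kept]
```

A filter of this kind removes the short-insert pairs (around 250 bp) that would otherwise be treated as long-range links during scaffolding.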

Assembly

Allpaths-lg

  • Read counts
                             orig       cor               cor(paired,all >64bp)
 chr14_fragment_12.fastq     36504800   35571477(97.44%)  34268444(10+bp ovl F/R)
 chr14_shortjump_12.fastq    22669408   11255320(49.64%)  11255320
 chr14_longjump_12.fastq     2405064    187398   (7.79%)  187398  
  • Assembly stats:
 .          elem  min    q1     q2     q3      max       mean     n50       sum       
 scf        418   96     131    256    1236    81646936  209781   81646936  87688255  
 scf10K+    17    10330  11780  26536  269876  81646936  5135452  81646936  87302692  
 ctg        4722  96     2342   9101   24174   240773    17887    36530     84461065  
  • Runtime 1104299.893u 126549.756s 18:50:05.80 1815.2% 0+0k 0+0io 8463pf+0w
 18hr 50min :                   multiprocessor
 1104299/(3600*24)=12.78 days : singleprocessor
  • Locations
 /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/allpaths             # original
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpaths     # final contigs, scaff
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor  # corrected reads
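
The Allpaths-LG runtime figures above (wall-clock time, single-processor equivalent, CPU utilization) follow directly from the `time` line. A minimal sketch of that arithmetic, with the helper name `parse_elapsed` chosen for illustration:

```python
def parse_elapsed(text):
    """Convert an elapsed time such as '18:50:05.80' (h:m:s) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values from the time line: 1104299.893u 126549.756s 18:50:05.80 1815.2%
user_s = 1104299.893
sys_s = 126549.756
elapsed_s = parse_elapsed("18:50:05.80")

# Single-processor equivalent, counting user CPU time only (as on the page).
user_cpu_days = user_s / (3600 * 24)

# Average number of cores kept busy; the time line reports this as a percentage.
mean_cores = (user_s + sys_s) / elapsed_s

print(f"wall clock    : {elapsed_s / 3600:.2f} h")
print(f"user CPU time : {user_cpu_days:.2f} days")
print(f"mean cores    : {mean_cores:.2f}")
```

The percentage on the `time` line includes system time, while the 12.78-day figure uses user time only.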