Dpuiu Assemblathon
Revision as of 16:25, 4 March 2011
Links
GAGE
- Location
http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage
Assemblers
* Allpaths-LG /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/allpaths3-35218/
* CA /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/
* Velvet /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/velvet_1.0.13/
* SOAPdenovo /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/SOAPdenovo_Release1.04
CBCB genomes
- A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
- A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
- Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.
If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.
Argentine ant
Bee, Bombus impatiens
Data
- 497,318,144 Illumina 124bp reads
- 8 libraries; inserts:
- 400bp
- 3k (outie)
- 8k (outie)
- Traces
Adapters: in 3k & 8k libraries
C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
Locations:
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt # original fastq files
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt # corrected reads (short inserts)
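The long-insert (3k and 8k) libraries are listed as "outie", and their corrected files carry a `.rev` suffix. Assuming that suffix means the reads were reverse-complemented to bring the pairs into the usual inward-facing orientation (the page does not say so explicitly), the operation can be sketched as follows. The function names are hypothetical and are not taken from any script used here.

```python
# Minimal sketch: reverse-complement a read and reverse its quality string.
# Assumption: ".rev" files hold reverse-complemented reads; this is not confirmed by the notes.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_record(seq, qual):
    """Reverse-complement the bases and reverse the per-base qualities to match."""
    return reverse_complement(seq), qual[::-1]
```

The quality string is only reversed, not complemented, so each quality value stays attached to its base.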
Assemblies
CA (best)
- Stats
.    elem   min  q1   q2    q3     max      mean    n50      sum
scf  1896   76   150  4044  67922  4021294  151761  1017298  287738041
ctg  92307  63   100  119   186    297795   2613    24781    241197400
Location:
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/ # best Celera Assembly
SOAPdenovo
- Stats
.       elem      min  q1   q2   q3   max      mean   n50      sum
scf     11178     100  111  135  390  5655980  23014  1205321  257251549
ctg     10856652  31   .    .    .    85850    57     43       627095607
ctg100  106741    100  .    .    .    85850**  2165   6939     231167576
Location:
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/ # best SOAPdenovo assembly
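The n50 column in the stats tables above is the usual assembly contiguity measure. As a reminder of what it means, here is a minimal sketch of the standard definition (the largest length L such that sequences of length L or more hold at least half of the total bases). It is not the script that produced the tables, and the function name is hypothetical.

```python
# Minimal sketch of the standard N50 definition; not the tool used for the tables above.

def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 for an empty input)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0
```

For example, `n50([10, 8, 6, 4, 2])` walks 10, then 10 + 8 = 18, which passes half of 30, so it returns 8.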
Bacterium, Staph aureus USA300
Complete genome : NC_010079 2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
In progress genome : NZ_AASB00000000 2810505bp Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs
Illumina 101bp paired : Staphylococcus aureus subsp. aureus USA300_TCH1516 http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full
Illumina101.78X.cor
- Comments:
mated reads; library insert mean/std (CA estimates) = 170/21
first 2 and last base show composition bias; should we trim them???
- Data
.                  reads       min  max  mean  sum         cvg
Total              30,597,352  101  101  101   3090332552  1065
Sampled            2,295,176   101  101  101   231812776   78
Corrected(paired)  1,479,510   30   101  89    131973249   45.9
- DeNovo
# ctg stats
.           elem  min   q1     q2     q3     max      mean   n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      73    1716  14503  26431  52086  148524*  39043  58038*  2850157  31075   7700    31075   75      14            6
SOAPdenovo  9382  32    32     52     63     85850    347    16726   3259522  34960   722     37358   24      0             0
velvet      453   61    85     224    3279   137163   6297   36496   2852552  53075   10478   53410   156     17            6
# scf stats
.           elem  min   q1     q2     q3     max      mean   n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      67    1716  14869  32929  58038  148524*  42541  65383*  2850277  31075   7700    31075   75      20            7
# 1 large rearrangement : scf120001252361 : NC_010079:622381-671743
SOAPdenovo  186   100   333    1528   17623  144079   15625  55558   2906207  61878   26070   62049   65      453           6
velvet      427   61    82     177    2982   137163   6685   37874   2854649  53268   10478   53623   156     42            8
- Location
/fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100.78X/
/fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Assembly//Illuminap100.78X.cor/
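The cvg column in the Data table for this read set is presumably total bases divided by genome length. The sketch below assumes the complete NC_010079 length (2872915bp) as the denominator. The page does not say which genome size was actually used, so some table values may differ by a percent or two from what this gives. The constant and function names are hypothetical.

```python
# Minimal sketch: fold coverage = total sequenced bases / genome length.
# Assumption: the denominator is the NC_010079 length listed on this page.

REFERENCE_LENGTH = 2872915  # bp, complete genome NC_010079


def coverage(total_bases, genome_length=REFERENCE_LENGTH):
    """Return fold coverage for a read set with the given total base count."""
    return total_bases / genome_length


# Corrected(paired) row: 131973249 bases
print(round(coverage(131973249), 1))
```

With this denominator the Corrected(paired) row reproduces the 45.9 listed in the table.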
Illumina101.150X.cor
- Comments:
first 2 and last base show composition bias; should we trim them???
- Data
.                  reads       min  max  mean  sum         cvg
Total              30,597,352  101  101  101   3090332552  1065
Sampled            4,404,626   101  101  101   444867226   154
Corrected(paired)  2,815,584   30   101  89    252331825   87
- DeNovo
# ctg stats
.           elem   min   q1     q2     q3     max      mean   n50      sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      47     1866  11740  37171  88747  388257*  60272  121078*  2832762  39703   28204   39775   142     12            9
SOAPdenovo  15229  32    33     48     63     74601    232    13773    3530251  31690   269     36416   17      0             0
velvet      429    61    86     206    3147   134034   6650   40448    2852695  52754   9400    52940   152     24            9
# scf stats
.           elem  min   q1     q2     q3     max      mean   n50      sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      46    1866  11740  39650  88747  388257*  61582  129426*  2832782  39703   28204   39775   142     13            10
SOAPdenovo  158   101   202    1249   20910  150305   18353  74149    2899760  64028   30903   64392   52      506           7
velvet      409   61    85     180    2854   142341   6978   42466    2853854  52940   9400    52751   150     43            12
Human, a single chromosome, medium-sized
Data
- Latest online assembly
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
NC_000014.8
107,349,540 # total, with telomeric N's
88,289,540 # clean
- Human bowtie indexes
/fs/szdata/bowtie_indexes/h_sapiens_37_asm
- Illumina reads (all genome)
Human NA12878 Genome on Illumina
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
ginko:/scratch1/Human_NA12878_on_Illumina/
#Fragment (mean insert size: 155bp, SD 26), 101 bp read length
Lib        #Spots  #Bases  #Reads     #Mates     ReadLen  InsMea  InStd  InsMin  InsMax  TrimReadLen
SRR067787  82.4M   16.6G   652448124  324283604  101      155     26     77      458
SRR067789  82.6M   16.7G   654133372  324876520  101      155     26     77      458
SRR067780  83.3M   16.8G   660001672  328021140  101      155     26     77      458
SRR067791  83.0M   16.8G   657963460  327205952  101      155     26     77      458
SRR067793  77.0M   15.5G   609634756  303094956  101      155     26     77      458
SRR067784  83.3M   16.8G   660118460  328244560  101      155     26     77      458
SRR067785  81.6M   16.5G   646350512  321174108  101      155     26     77      458
SRR067792  83.8M   16.9G   663997828  330084304  101      155     26     77      458
SRR067577  46.3M   9.3G    367673108  183472948  101      155     26     77      458
SRR067579  46.0M   9.3G    365743380  182532676  101      155     26     77      458
SRR067578  46.5M   9.4G    369557476  184410788  101      155     26     77      458
#Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
SRR067771  81.5M   16.5G   644846296  320822716  101      2283    221    1620    2586
SRR067777  82.6M   16.7G   653163608  325232944  101      2283    221    1620    2586
SRR067781  82.1M   16.6G   649748720  323656576  101      2283    221    1620    2586
SRR067776  79.9M   16.1G   632590344  315165892  101      2283    221    1620    2586
#Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
SRR067773  93.1M   18.8G   736456192  366884512  101      2803    271    1990    3106
SRR067779  94.0M   19.0G   743564440  370214028  101      2803    271    1990    3106
SRR067778  97.3M   19.6G   767984324  381879652  101      2803    271    1990    3106
SRR067786  94.6M   19.1G   747631104  372002548  101      2803    271    1990    3106
#Fosmid1 (mean insert size: 35295bp, SD 2703), 76 bp read length
SRR068214  13.1M   2.0G    104505420  52087176   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
SRR068211  4.8M    736.9M  38612196   19252408   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
#Fosmid2 (mean insert size: 35318bp, SD 2759), 101 bp read length
SRR068335  67.4M   13.6G   533805860  265481252  101      35318   2759   27041   35621   61(trim 20bp at 5',20bp at 3')
- Comments
- Human chromosome 14. The chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and Soap. We've downloaded the data and Todd is going to create a data set representing just chr 14, to make it feasible. We'll then try to assemble that data w/all 3 assemblers: CA, SOAP, Allpaths-LG.
- Illumina chr14 reads (aligned with bowtie & corrected)
/fs/szattic-asmg8/treangen/*fastq
hard to align: bowtie -5 20 -3 20 -e 1000 ...
jumping reads: only the pairs that aligned within the correct mean and stdev were selected; these libraries usually have a high % of short inserts!!!
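The jumping-read selection described above (keep only pairs whose aligned insert size is consistent with the library mean and stdev) can be sketched as follows. The cutoff `k` and both function names are assumptions for illustration; the notes do not state what threshold was actually applied.

```python
# Minimal sketch: keep read pairs whose aligned insert size is within k standard
# deviations of the library mean. The value of k is an assumption, not from the notes.

def within_insert_range(insert_size, mean, stdev, k=3.0):
    """True if the insert size lies within mean +/- k*stdev."""
    return abs(insert_size - mean) <= k * stdev


def filter_pairs(pairs, mean, stdev, k=3.0):
    """Filter (pair_id, insert_size) tuples, dropping inserts outside the range."""
    return [p for p in pairs if within_insert_range(p[1], mean, stdev, k)]


# Jumping1 library parameters from the table above: mean 2283bp, SD 221
pairs = [("a", 2300), ("b", 300), ("c", 1620)]
print(filter_pairs(pairs, mean=2283, stdev=221))
```

A filter like this removes the short-insert pairs (such as the 300bp one in the example) that the notes say are common in jumping libraries.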
Allpaths-lg Assembly (Daniela)
- Read counts
.                         orig      cor               cor(paired,all >64bp)
chr14_fragment_12.fastq   36504800  35571477(97.44%)  34268444(10+bp ovl F/R)
chr14_shortjump_12.fastq  22669408  11255320(49.64%)  11255320
chr14_longjump_12.fastq   2405064   187398(7.79%)     187398
- Assembly stats:
.        elem  min    q1     q2     q3      max       mean     n50       sum
scf      418   96     131    256    1236    81646936  209781   81646936  87688255
scf10K+  17    10330  11780  26536  269876  81646936  5135452  81646936  87302692
ctg      4722  96     2342   9101   24174   240773    17887    36530     84461065
- Runtime
1104299.893u 126549.756s 18:50:05.80 1815.2% 0+0k 0+0io 8463pf+0w
18hr 50min : multiprocessor
1104299/(3600*24)=12.78 days : singleprocessor
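The runtime arithmetic above can be re-derived from the timing line. This is a minimal sketch that assumes the line is standard csh-style `time` output (user seconds, system seconds, wall clock, CPU percentage); the function names are hypothetical.

```python
# Minimal sketch: re-derive the figures from the csh-style timing line above.
# Assumption: fields are user seconds, system seconds, elapsed h:mm:ss, CPU %.

USER_S = 1104299.893
SYS_S = 126549.756
WALL_S = 18 * 3600 + 50 * 60 + 5.80  # 18:50:05.80


def cpu_days(user_seconds):
    """User CPU time expressed in days, i.e. the single-processor equivalent."""
    return user_seconds / (3600 * 24)


def cpu_utilization(user_seconds, sys_seconds, wall_seconds):
    """Percent CPU: (user + system) / wall clock * 100."""
    return 100 * (user_seconds + sys_seconds) / wall_seconds


print(round(cpu_days(USER_S), 2))
print(cpu_utilization(USER_S, SYS_S, WALL_S))
```

The first value matches the 12.78 days quoted above, and the second comes out at roughly 1815%, which corresponds to about 18 cores kept busy on average.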
- Locations
/scratch1/dpuiu/HTS/Homo_sapiens/Assembly/allpaths # original
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpaths # final contigs, scaff
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor # corrected reads