Dpuiu Assemblathon
Links
GAGE
- Location
http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage
Assemblers
* Allpaths-LG /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/allpaths3-35218/
* CA /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/
* Velvet /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/velvet_1.0.13/
* SOAPdenovo /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/SOAPdenovo_Release1.04
CBCB genomes
- A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
- A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
- Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.
If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.
Argentine ant
Bombus impatiens
Data
- 497,318,144 Illumina 124bp reads
- 8 libraries; inserts:
- 400bp
- 3k (outie)
- 8k (outie)
- Traces
Adapters: in 3k & 8k libraries
C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
- Locations:
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt # original fastq files
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt # corrected reads (short inserts)
Assembly
- Bombus_impatiens.assembly.summary
- Locations:
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/ # best Celera Assembly
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/ # best SOAPdenovo assembly
Staph aureus USA300
Data
- Complete genome : NC_010079 2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
SRP001086 Staphylococcus aureus Sequencing on Illumina
SRX007714 pair lib
SRX007711 jumping lib
- Locations:
/nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/
/nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/
Assembly
- Staphylococcus_aureus.genome.summary
- Locations
~dpuiu/GAGE/Staphylococcus_aureus/
/fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X
/fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/
- SOAPdenovo v1.05 :
- new quake version did not help much (quake-0.2.2 vs davek44-error_correction-28dbe11)
- SOAPdenovo map -K 37+ : fails on quakeCor.k18 corrected reads
- "according" to kmerFreq, should probably not use -K > 47
- longer kmer => longer scaffolds
- longer kmer => shorter contigs
- K40+ too large: no "valley" in the kmerFreq histogram
paste SOAPdenovo.K??.quakeCor.k18/genome.K??.kmerFreq | nl0 | head
0   K23      K31      K35      K47      K63      K91
1   1215557  1324588  1345050  1341251  1267566  1320413
2   99008    114112   121642   154462   271530*  607742
3   42016*   63476*   77549*   142340*  294349   365241
4   49699    79492    98636    177143   316061*  199863
5   68867    104994   127443   209888   310508   103830
6   92256    133034   156821   229779   281005   52742
7   115782   157642   178836   232521*  240082   26005
8   136882   175040   191909   225207   200114   12960
9   153819   183152   195194*  206206   166881   6688
10  162669   183863*  190133   181641   139384   3658
11  167123*  179571   178403   159411   113550   2502
12  164594   164750   160745   139853   94505    1912
13  156888   150408   144201   122557   78537    1589
14  146575   135817   129665   107723   61636    1259
15  132814   122688   115605   94830    49006    1107
16  122214   109744   104458   83563    38171    899
17  110636   98653    92573    73674    28860    765
paste SOAPdenovo.K??.allpathsCor/genome.K??.kmerFreq | nl0 | more
0   K23      K31      K35      K47      K63      K91
1   8739     10732    11912    17072    36392    551062
2   8787     11401    13290    22170    60715    591437*
3   12234    16630    19450    34113    102041   491309
4   16256    22470    26838    52586    149043   347252
5   22106    31615    39184    77664    194089   226048
6   31106    46089    56270    107253   225484   140047
7   43196    63267    76838    134323   240196*  81910
8   57224    82197    98380    160232   238399   47334
9   73715    101814   118827   175541   223600   26993
10  90461    119701   136018   185207   203221   15402
11  105636   135515   150537   185381*  176277   9011
12  119236   144979   156871   175924   155002   5521
13  128641   149954*  156996*  164873   133640   3513
14  135628   149639   153938   150354   114657   2584
15  137244*  145976   147385   136342   98152    1890
16  135666   138605   136978   121743   84865    1550
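The "valley" check on a kmerFreq column can be sketched in a few lines of Python. This is only an illustration, not the script used for these notes: the function name `find_valley_and_peak` is made up here, and it assumes each column is a plain list of counts where position i holds the number of distinct k-mers seen i+1 times. It reports the first interior local minimum and the highest point after it, or None when the histogram never dips.

```python
def find_valley_and_peak(counts):
    """counts[i] = number of distinct k-mers observed (i + 1) times.

    Returns (valley_multiplicity, peak_multiplicity), or None when the
    histogram has no interior local minimum. Hypothetical helper.
    """
    for i in range(1, len(counts) - 1):
        if counts[i] < counts[i - 1] and counts[i] <= counts[i + 1]:
            # Peak = largest count to the right of the valley.
            peak = max(range(i + 1, len(counts)), key=lambda j: counts[j])
            return i + 1, peak + 1
    return None


# K23 and K91 columns of the quakeCor.k18 table (multiplicities 1..17).
k23 = [1215557, 99008, 42016, 49699, 68867, 92256, 115782, 136882, 153819,
       162669, 167123, 164594, 156888, 146575, 132814, 122214, 110636]
k91 = [1320413, 607742, 365241, 199863, 103830, 52742, 26005, 12960, 6688,
       3658, 2502, 1912, 1589, 1259, 1107, 899, 765]

print(find_valley_and_peak(k23))  # (3, 11)
print(find_valley_and_peak(k91))  # None
```

For K23 this lands on the two rows marked with * in the quakeCor.k18 table; for K91 the counts only decrease, so no valley is reported.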
Human, a single chromosome, medium-sized
Data
- Latest online assembly
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
NC_000014.8
107,349,540 # total, with telomeric N's
88,289,540 # clean
- Human bowtie indexes
/fs/szdata/bowtie_indexes/h_sapiens_37_asm
- Illumina reads (all genome)
Human NA12878 Genome on Illumina
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
ginko:/scratch1/Human_NA12878_on_Illumina/
#Fragment (mean insert size: 155bp, SD 26), 101 bp read length
Lib        #Spots  #Bases  #Reads     #Mates     ReadLen  InsMea  InStd  InsMin  InsMax  TrimReadLen
SRR067787  82.4M   16.6G   652448124  324283604  101      155     26     77      458
SRR067789  82.6M   16.7G   654133372  324876520  101      155     26     77      458
SRR067780  83.3M   16.8G   660001672  328021140  101      155     26     77      458
SRR067791  83.0M   16.8G   657963460  327205952  101      155     26     77      458
SRR067793  77.0M   15.5G   609634756  303094956  101      155     26     77      458
SRR067784  83.3M   16.8G   660118460  328244560  101      155     26     77      458
SRR067785  81.6M   16.5G   646350512  321174108  101      155     26     77      458
SRR067792  83.8M   16.9G   663997828  330084304  101      155     26     77      458
SRR067577  46.3M   9.3G    367673108  183472948  101      155     26     77      458
SRR067579  46.0M   9.3G    365743380  182532676  101      155     26     77      458
SRR067578  46.5M   9.4G    369557476  184410788  101      155     26     77      458
#Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
SRR067771  81.5M   16.5G   644846296  320822716  101      2283    221    1620    2586
SRR067777  82.6M   16.7G   653163608  325232944  101      2283    221    1620    2586
SRR067781  82.1M   16.6G   649748720  323656576  101      2283    221    1620    2586
SRR067776  79.9M   16.1G   632590344  315165892  101      2283    221    1620    2586
#Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
SRR067773  93.1M   18.8G   736456192  366884512  101      2803    271    1990    3106
SRR067779  94.0M   19.0G   743564440  370214028  101      2803    271    1990    3106
SRR067778  97.3M   19.6G   767984324  381879652  101      2803    271    1990    3106
SRR067786  94.6M   19.1G   747631104  372002548  101      2803    271    1990    3106
#Fosmid1 (mean insert size: 35295bp, SD 2703), 76 bp read length
SRR068214  13.1M   2.0G    104505420  52087176   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
SRR068211  4.8M    736.9M  38612196   19252408   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')
#Fosmid2 (mean insert size: 35318bp, SD 2759), 101 bp read length
SRR068335  67.4M   13.6G   533805860  265481252  101      35318   2759   27041   35621   61(trim 20bp at 5',20bp at 3')
- Comments
- Human chromosome 14. The chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and Soap. We've downloaded the data and Todd is going to create a data set representing just chr 14, to make it feasible. We'll then try to assemble that data w/all 3 assemblers: CA, SOAP, Allpaths-LG.
- Illumina chr14 reads (aligned with bowtie & corrected)
/fs/szattic-asmg8/treangen/*fastq
hard to align: bowtie -5 20 -3 20 -e 1000 ...
jumping reads: only the pairs that aligned within the correct mean and stdev were selected; these libraries usually have a high % of short inserts!!!
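The jumping-read selection could look roughly like the Python sketch below. The exact window used for the chr14 set is not recorded here, so the bounds are an assumption: the example borrows the InsMin/InsMax values listed for the Jumping1 libraries (1620 and 2586). The function name `select_by_insert` and the sample pairs are hypothetical.

```python
def select_by_insert(pairs, ins_min, ins_max):
    """Keep (pair_id, insert_size) tuples with ins_min <= insert_size <= ins_max.

    Hypothetical helper; the real selection window is an assumption.
    """
    return [(pid, ins) for pid, ins in pairs if ins_min <= ins <= ins_max]


# Made-up aligned pairs: (pair id, observed insert size in bp).
pairs = [("p1", 2310), ("p2", 180), ("p3", 2590), ("p4", 1620)]

kept = select_by_insert(pairs, ins_min=1620, ins_max=2586)
print(kept)  # [('p1', 2310), ('p4', 1620)]
```

The short-insert pair ("p2") is the kind of pair that gets dropped in bulk from jumping libraries, which would explain a large loss of reads at this step.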
Assembly
Allpaths-lg
- Read counts
                          orig      cor               cor(paired,all >64bp)
chr14_fragment_12.fastq   36504800  35571477(97.44%)  34268444(10+bp ovl F/R)
chr14_shortjump_12.fastq  22669408  11255320(49.64%)  11255320
chr14_longjump_12.fastq   2405064   187398(7.79%)     187398
- Assembly stats:
.        elem  min    q1     q2     q3      max       mean     n50       sum
scf      418   96     131    256    1236    81646936  209781   81646936  87688255
scf10K+  17    10330  11780  26536  269876  81646936  5135452  81646936  87302692
ctg      4722  96     2342   9101   24174   240773    17887    36530     84461065
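For reference, the n50 column can be reproduced from a list of sequence lengths with a short Python sketch. This uses the common definition (the length at which the running sum of lengths, sorted longest first, reaches half the total); the stats script behind the table may use a slightly different convention, and the function name `n50` is just illustrative.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy example: total = 20, and 6 + 5 = 11 is the first running sum >= 10.
print(n50([2, 3, 4, 5, 6]))  # 5
```

This also shows why scf n50 equals the max in the table: a single 81646936 bp scaffold already covers more than half of the 87688255 bp total.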
- Runtime
1104299.893u 126549.756s 18:50:05.80 1815.2% 0+0k 0+0io 8463pf+0w
18hr 50min : multiprocessor
1104299/(3600*24)=12.78 days : singleprocessor
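The runtime arithmetic can be checked with a small Python sketch. It assumes the timing line is in the csh `time` layout (user seconds, system seconds, wall clock as h:mm:ss); the helper name `parse_wall_clock` is made up for this example.

```python
def parse_wall_clock(text):
    """Convert 'h:mm:ss.ss' (or 'mm:ss.ss') to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


user_s = 1104299.893   # user CPU seconds from the timing line
sys_s = 126549.756     # system CPU seconds from the timing line
wall_s = parse_wall_clock("18:50:05.80")

cpu_days = user_s / (3600 * 24)          # single-processor equivalent
parallelism = (user_s + sys_s) / wall_s  # average number of busy cores

print(f"{cpu_days:.2f} days on one processor")
print(f"about {parallelism:.0f} cores busy on average")
```

The ratio of CPU time to wall time comes out at about 18, in line with the ~1815% figure in the timing line.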
- Locations
/scratch1/dpuiu/HTS/Homo_sapiens/Assembly/allpaths # original
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpaths # final contigs, scaff
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor # corrected reads