Dpuiu Assemblathon
Revision as of 14:42, 7 April 2011
Links
GAGE
- Location
http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage
Assemblers
* Allpaths-LG  /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/allpaths3-35218/
* CA           /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/
* Velvet       /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/velvet_1.0.13/
* SOAPdenovo   /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/SOAPdenovo_Release1.04
CBCB genomes
- A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
- A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
- Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.
If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.
Argentine ant
Bombus impatiens
Data
- 497,318,144 Illumina 124bp reads
- 8 libraries; inserts:
- 400bp
- 3k (outie)
- 8k (outie)
- Traces
Adapters: in 3k & 8k libraries
C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
- Read directories:
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt                          # original fastq files
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt     # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt       # corrected reads (short inserts)
- Original read files:
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_2_1_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_2_2_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_3_1_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_3_2_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_5_1_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_5_2_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_6_1_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_6_2_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_7_1_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_7_2_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_8_1_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_8_2_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_9_1_sequence.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_9_2_sequence.txt
- Quake corrected files:
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_1_1_sequence.cor.rev.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_1_2_sequence.cor.rev.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_2_1_sequence.cor.rev.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_2_2_sequence.cor.rev.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_3_1_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_3_2_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_5_1_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_5_2_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_6_1_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_6_2_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_7_1_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_7_2_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_8_1_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_8_2_sequence.cor.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_9_1_sequence.cor.rev.txt
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_9_2_sequence.cor.rev.txt
- k_unitig corrected files: (in progress --Dpuiu 10:38, 5 April 2011 (EDT))
Assembly
- Bombus_impatiens.assembly.summary
- Assembly directories:
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/          # best Celera Assembly
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/        # best SOAPdenovo assembly (2010)
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.K47.s_1-9.cor     # best SOAPdenovo assembly (2011)
Staph aureus USA300
Data
- Complete genome : NC_010079 2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
SRP001086  Staphylococcus aureus Sequencing on Illumina
SRX007714  pair lib
SRX007711  jumping lib
- Read directories:
/nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/
/nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/
- Original read files:
/nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/frag_1.fastq
/nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/frag_2.fastq
/nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/short_1.fastq
/nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/short_2.fastq
- Quake corrected files:
/nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/frag_1.cor.fastq
/nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/frag_2.cor.fastq
/nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/short_1.cor.fastq
/nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/short_2.cor.fastq
- Allpaths-LG corrected files:
/fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/frag_1.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/frag_2.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/short_1.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/short_2.cor.fasta
- k_unitig corrected files:
/nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/frag_1.cor.seq
/nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/frag_2.cor.seq
/nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/short_1.cor.seq
/nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/short_2.cor.seq
Assembly
- Staphylococcus_aureus.genome.summary
- Assembly directories:
~dpuiu/GAGE/Staphylococcus_aureus/
/fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X
/fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/
- SOAPdenovo v1.05 :
- new quake version did not help much (quake-0.2.2 vs davek44-error_correction-28dbe11)
- SOAPdenovo map -K 37+ : fails on quakeCor.k18 corrected reads
- "according" to kmerFreq , should probably not use -K >47
- longer kmer => longer scaffolds (K=63 : largest N50scf)
- longer kmer => shorter contigs (K=31 : largest N50ctg)
- K40+ too large: no "valley" in the kmerFreq histogram
paste SOAPdenovo.K??.quakeCor.k18/genome.K??.kmerFreq | nl0 | head
paste SOAPdenovo.K??.allpathsCor/genome.K??.kmerFreq | nl0 | more
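The "valley" check mentioned above can be sketched in Python. This is only an illustration, not the procedure actually used here: it assumes each kmerFreq file holds one count per line, where line i is the number of distinct k-mers seen i times, and the function names are hypothetical. It looks for the first local minimum (the valley between error k-mers and genomic k-mers) and the highest count after it.

```python
def parse_kmer_freq(lines):
    """Parse histogram lines into a list of counts.

    Assumption: one integer per line; line i (1-based) is the number of
    distinct k-mers that occur i times. Blank lines are skipped.
    """
    return [int(line.split()[0]) for line in lines if line.strip()]


def find_valley(counts):
    """Return (valley_freq, peak_freq) as 1-based frequencies, or None.

    The valley is the first local minimum of the histogram; the peak is
    the highest count at or after the valley. None means the histogram
    never turns upward, i.e. there is no valley to separate errors from
    real k-mers.
    """
    for i in range(1, len(counts) - 1):
        if counts[i] <= counts[i - 1] and counts[i] < counts[i + 1]:
            peak = max(range(i, len(counts)), key=counts.__getitem__)
            return i + 1, peak + 1
    return None


if __name__ == "__main__":
    # Toy histogram, not real data from these assemblies.
    toy = parse_kmer_freq(["1000", "400", "100", "50", "80", "200", "500", "300", "100"])
    print(find_valley(toy))
```

A real histogram is noisy, so a smoothed version of the counts would likely be needed before trusting the first local minimum.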
Rhodobacter sphaeroides
Data
- Complete genome: Rhodobacter sphaeroides 2.4.1 : 2 chromosomes, 5 plasmids
CP000143   3188609
CP000144    943016
DQ232586    114045
CP000145    114178
CP000146    105284
CP000147    100828
DQ232587     37100
Total      4603060

SRX033397  pair lib    ; readLen=101 ; insMea=180
SRX016063  jumping lib ; readLen=101 ; insMea~=3455; ~15% of the mates are short inserts (~250bp)
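As a quick arithmetic check, the replicon sizes can be summed and turned into a rough read-pair budget. The sketch below is an assumption-laden illustration: it reads the "45X" in the directory names as a target coverage per library, which is not stated on this page, and `pairs_for_coverage` is a hypothetical helper.

```python
import math

# Replicon sizes (bp) as listed above.
REPLICONS = {
    "CP000143": 3188609,
    "CP000144": 943016,
    "DQ232586": 114045,
    "CP000145": 114178,
    "CP000146": 105284,
    "CP000147": 100828,
    "DQ232587": 37100,
}


def pairs_for_coverage(genome_size, coverage, read_len):
    """Number of read pairs needed to reach `coverage` (two reads per pair)."""
    return math.ceil(genome_size * coverage / (2 * read_len))


if __name__ == "__main__":
    genome_size = sum(REPLICONS.values())
    print(genome_size)
    # Assumption: 45X target coverage, 101 bp reads.
    print(pairs_for_coverage(genome_size, 45, 101))
```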
- Original read files:
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminap/frag_1.fastq
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminap/frag_2.fastq
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminaj/short_1.fastq
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminaj/short_2.fastq
- Quake corrected read files:
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/frag_1.cor.fastq
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/frag_2.cor.fastq
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/short_1.cor.fastq
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/short_2.cor.fastq
- QuakeIter2 corrected read files:
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/frag_1.cor.fastq
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/frag_2.cor.fastq
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/short_1.cor.fastq
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/short_2.cor.fastq
- Allpaths-LG corrected files:
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/frag_1.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/frag_2.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/short_1.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/short_2.cor.fasta
- k_unitig corrected files:
/nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/frag_1.cor.seq
/nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/frag_2.cor.seq
/nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/short_1.cor.seq
/nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/short_2.cor.seq
Assembly
- Rhodobacter_sphaeroides.assembly.summary
- runCA.filter.txt modified CA run
- Assembly directories:
/fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/
Human, a single chromosome, medium-sized
Data
- Latest online assembly
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
NC_000014.8
107,349,540   # total, with telomeric N's
 88,289,540   # clean
- Human bowtie indexes
/fs/szdata/bowtie_indexes/h_sapiens_37_asm
- Illumina reads (all genome)
Human NA12878 Genome on Illumina
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
ginko:/scratch1/Human_NA12878_on_Illumina/

#Fragment (mean insert size: 155bp, SD 26), 101 bp read length
Lib        #Spots  #Bases  #Reads     #Mates     ReadLen  InsMea  InStd  InsMin  InsMax  TrimReadLen
SRR067787  82.4M   16.6G   652448124  324283604  101      155     26     77      458
SRR067789  82.6M   16.7G   654133372  324876520  101      155     26     77      458
SRR067780  83.3M   16.8G   660001672  328021140  101      155     26     77      458
SRR067791  83.0M   16.8G   657963460  327205952  101      155     26     77      458
SRR067793  77.0M   15.5G   609634756  303094956  101      155     26     77      458
SRR067784  83.3M   16.8G   660118460  328244560  101      155     26     77      458
SRR067785  81.6M   16.5G   646350512  321174108  101      155     26     77      458
SRR067792  83.8M   16.9G   663997828  330084304  101      155     26     77      458
SRR067577  46.3M   9.3G    367673108  183472948  101      155     26     77      458
SRR067579  46.0M   9.3G    365743380  182532676  101      155     26     77      458
SRR067578  46.5M   9.4G    369557476  184410788  101      155     26     77      458
#Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
SRR067771  81.5M   16.5G   644846296  320822716  101      2283    221    1620    2586
SRR067777  82.6M   16.7G   653163608  325232944  101      2283    221    1620    2586
SRR067781  82.1M   16.6G   649748720  323656576  101      2283    221    1620    2586
SRR067776  79.9M   16.1G   632590344  315165892  101      2283    221    1620    2586
#Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
SRR067773  93.1M   18.8G   736456192  366884512  101      2803    271    1990    3106
SRR067779  94.0M   19.0G   743564440  370214028  101      2803    271    1990    3106
SRR067778  97.3M   19.6G   767984324  381879652  101      2803    271    1990    3106
SRR067786  94.6M   19.1G   747631104  372002548  101      2803    271    1990    3106
#Fosmid1 (mean insert size: 35295bp, SD 2703), 76 bp read length
SRR068214  13.1M   2.0G    104505420  52087176   76       35295   2703   27186   35523   36 (trim 20bp at 5', 20bp at 3')
SRR068211  4.8M    736.9M  38612196   19252408   76       35295   2703   27186   35523   36 (trim 20bp at 5', 20bp at 3')
#Fosmid2 (mean insert size: 35318bp, SD 2759), 101 bp read length
SRR068335  67.4M   13.6G   533805860  265481252  101      35318   2759   27041   35621   61 (trim 20bp at 5', 20bp at 3')
- Comments
- Human chromosome 14. The chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and Soap. We've downloaded the data and Todd is going to create a data set representing just chr 14, to make it feasible. We'll then try to assemble that data w/all 3 assemblers: CA, SOAP, Allpaths-LG.
- Illumina chr14 reads (aligned with bowtie & corrected)
/fs/szattic-asmg8/treangen/*fastq
hard to align: bowtie -5 20 -3 20 -e 1000 ...
jumping reads: only the pairs aligned within the correct mean/stdev were selected; these libraries usually have a high % of short inserts!
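The insert-size selection described for the jumping reads can be sketched as follows. This is a hedged illustration, not the script that was run: the `(name, insert_size)` pair representation, the function names and the 3-SD window (`n_sd`) are all assumptions; the mean and SD in the example come from the Jumping1 library table above.

```python
def within_insert_range(insert_size, mean, sd, n_sd=3.0):
    """True if the observed insert size lies within mean +/- n_sd * sd."""
    return abs(insert_size - mean) <= n_sd * sd


def filter_pairs(pairs, mean, sd, n_sd=3.0):
    """Keep only (name, insert_size) pairs consistent with the library.

    `pairs` is a hypothetical list of (read_name, insert_size) tuples,
    e.g. derived from paired bowtie alignments.
    """
    return [p for p in pairs if within_insert_range(p[1], mean, sd, n_sd)]


if __name__ == "__main__":
    # Toy pairs; the 250 bp one mimics the short-insert contamination.
    toy = [("a", 2283), ("b", 250), ("c", 2900), ("d", 3000)]
    print(filter_pairs(toy, mean=2283, sd=221))
```

Filtering this way drops the short-insert pairs that would otherwise mislead scaffolding, at the cost of also discarding genuine pairs from the tails of the distribution.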
- Original read files:
/fs/szattic-asmg8/treangen/chr14_fragment_1.fastq
/fs/szattic-asmg8/treangen/chr14_fragment_2.fastq
/fs/szattic-asmg8/treangen/chr14_shortjump_1.fastq
/fs/szattic-asmg8/treangen/chr14_shortjump_2.fastq
/fs/szattic-asmg8/treangen/chr14_longjump_1.fastq
/fs/szattic-asmg8/treangen/chr14_longjump_2.fastq
- Quake corrected files:
/fs/szattic-asmg8/treangen/chr14_fragment_1.cor.fastq
/fs/szattic-asmg8/treangen/chr14_fragment_2.cor.fastq
/fs/szattic-asmg8/treangen/chr14_shortjump_1.cor.fastq
/fs/szattic-asmg8/treangen/chr14_shortjump_2.cor.fastq
/fs/szattic-asmg8/treangen/chr14_longjump_1.cor.fastq
/fs/szattic-asmg8/treangen/chr14_longjump_2.cor.fastq
- Allpaths-LG corrected files:
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_fragment_1.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_fragment_2.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_shortjump_1.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_shortjump_2.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_longjump_1.cor.fasta
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_longjump_2.cor.fasta
Assembly
Allpaths-lg
- Read counts
                          orig      cor                cor(paired,all >64bp)
chr14_fragment_12.fastq   36504800  35571477 (97.44%)  34268444 (10+bp ovl F/R)
chr14_shortjump_12.fastq  22669408  11255320 (49.64%)  11255320
chr14_longjump_12.fastq    2405064    187398 (7.79%)     187398
- Assembly stats:
.        elem  min    q1     q2     q3      max       mean     n50       sum
scf      418   96     131    256    1236    81646936  209781   81646936  87688255
scf10K+  17    10330  11780  26536  269876  81646936  5135452  81646936  87302692
ctg      4722  96     2342   9101   24174   240773    17887    36530     84461065
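For reference, the n50 column can be computed from a list of sequence lengths as in the sketch below. It uses the common definition (the length of the sequence at which the running sum, taken from longest to shortest, first reaches half of the total); whether the stats script used here applies exactly this definition is an assumption.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty).

    Walk the lengths from longest to shortest and return the length at
    which the cumulative sum first reaches half of the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy lengths, not the real scaffold list.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

This also explains why the scf rows report n50 equal to max: a single 81,646,936 bp scaffold already holds more than half of the 87,688,255 bp total.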
- Runtime
1104299.893u 126549.756s 18:50:05.80 1815.2% 0+0k 0+0io 8463pf+0w
18hr 50min : multiprocessor
1104299/(3600*24)=12.78 days : singleprocessor
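The runtime figures can be cross-checked with a few lines of Python. This is just a worked version of the arithmetic above, assuming the usual csh `time` layout (user seconds, system seconds, elapsed wall clock, CPU percentage); `parse_elapsed` is a hypothetical helper.

```python
def parse_elapsed(text):
    """Convert an elapsed time such as '18:50:05.80' (h:m:s) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


USER_S = 1104299.893      # user CPU seconds from the time output above
SYSTEM_S = 126549.756     # system CPU seconds
ELAPSED_S = parse_elapsed("18:50:05.80")

# Average CPU utilisation: CPU time divided by wall-clock time.
CPU_PERCENT = 100 * (USER_S + SYSTEM_S) / ELAPSED_S

# Equivalent single-processor time in days (user time only, as above).
CPU_DAYS = USER_S / (3600 * 24)

if __name__ == "__main__":
    print(f"elapsed: {ELAPSED_S:.1f} s")
    print(f"cpu: {CPU_PERCENT:.1f} %")
    print(f"single-processor: {CPU_DAYS:.2f} days")
```

A utilisation of roughly 1800% means the run kept about 18 cores busy on average.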
- Assembly directories
/scratch1/dpuiu/HTS/Homo_sapiens/Assembly/allpaths                    # original
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpaths            # final contigs, scaff
/fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor         # corrected reads