Dpuiu Assemblathon
Links
Assemblers
* CA /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/Linux-amd64/bin/runCA
* Newbler
* Velvet
* SOAPdenovo
* Maq
CBCB genomes
- A bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
- A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
- Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.
If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.
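One way to fix the experiment in advance is to enumerate every assembler/data set/setting combination before running anything. The sketch below is only an illustration: the assembler names come from the list above, the data sets are the candidates mentioned so far, and the `VARIANTS` labels are hypothetical placeholders for however many ways we decide to try each assembler.

```python
from itertools import product

# Assemblers listed on this page.
ASSEMBLERS = ["CA", "Newbler", "Velvet", "SOAPdenovo", "Maq"]
# Candidate data sets discussed above (not yet agreed on).
DATASETS = ["S. aureus USA300", "Bombus impatiens"]
# Hypothetical placeholders for the parameter settings to try per assembler.
VARIANTS = ["default", "alternate"]


def plan_runs(assemblers, datasets, variants):
    """Return one dict per planned run, covering every combination."""
    return [
        {"assembler": a, "dataset": d, "variant": v}
        for a, d, v in product(assemblers, datasets, variants)
    ]


runs = plan_runs(ASSEMBLERS, DATASETS, VARIANTS)
print(len(runs), "runs planned")
```

Writing the list out first makes the total cost visible before committing compute time.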
Argentine ant
Bee, Bombus impatiens
Data
- 497,318,144 Illumina 124bp reads
- 8 libraries; inserts:
- 400bp
- 3k (outie)
- 8k (outie)
- Traces
Adapters: in 3k & 8k libraries
C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
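For reference, a minimal sketch of how an adapter like these could be clipped from a read. It is not the procedure used to produce the adaptor-free files on this page: it only does exact matching (real trimmers tolerate mismatches), it uses the sequence labeled C above as the default, and the function name and `min_overlap` parameter are made up for illustration.

```python
# Adapter sequence copied from the line labeled "C" above.
ADAPTER = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"


def clip_adapter(read, adapter=ADAPTER, min_overlap=8):
    """Clip an exact adapter match (full, or partial at the 3' end)."""
    # Case 1: the whole adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with a prefix of the adapter.
    # Try the longest possible overlap first; min_overlap is a hypothetical
    # cutoff to avoid clipping on short chance matches.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read
```

Usage: `clip_adapter(seq)` returns the part of `seq` before the adapter, or `seq` itself when no exact match is found.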
Location:
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt # original fastq files
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt # corrected reads (short inserts)
Bacterium, Staph aureus USA300
Complete genome : NC_010079 2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
In progress genome : NZ_AASB00000000 2810505bp Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs
454 FLX : Staphylococcus aureus subsp. aureus USA300_TCH959 HMP0023 http://www.ncbi.nlm.nih.gov/sra/SRX002327?report=full
Illumina 101bp paired : Staphylococcus aureus subsp. aureus USA300_TCH1516 http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full
454FLX
- Data
                       reads   min  q1   q2   q3   max  mean  n50  sum
sff(original)          334241  36   201  256  277  362  235   264  78432227
sffToCA(adaptor free)  325555  51   103  148  208  358  157   185  51225615  # 58723 mates
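The length statistics reported on this page (min, q1, q2, q3, max, mean, n50, sum) can be reproduced from a list of read, contig, or scaffold lengths. The sketch below is not the script that generated these numbers. It assumes the usual N50 definition (the length at which the cumulative sum of lengths, sorted longest first, reaches half the total) and uses one of several possible quartile conventions, so quartiles may differ slightly from the ones shown here.

```python
from statistics import mean, quantiles


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one length")


def length_summary(lengths):
    """Summary in the same column order as the tables on this page."""
    # 'inclusive' is an assumed quartile convention; needs >= 2 values.
    q1, q2, q3 = quantiles(lengths, n=4, method="inclusive")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "max": max(lengths),
        "mean": mean(lengths),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }
```

With this definition a single sequence holding more than half of the total makes n50 equal to max, which is what the CA.6.1.bog scaffold row on this page shows.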
- DeNovo
# ctg stats
.                     ctgs  min  q1     q2      q3      max      mean    n50      sum
CA.6.1.bog            18    238  21014  135971  239466  567548*  155836  277888*  2805055
newbler.2.5p1.deNovo  100   103  295    3287    39467   229053   27879   78379    2787870
# scf stats
CA.6.1.bog            6  284   21014  173065  1032129  1458733*  467554  1458733*  2805325
newbler.2.5p1.deNovo  8  2475  20731  110137  1030785  1408642   349895  1408642   2799157
- Reference based(Saureus USA300)
.                      ctgs  min  q1   q2    q3     max     mean   n50    sum
newbler.2.3.refMapper  206   103  556  3098  15366  117487  12749  40687  2626469
Illumina101.78X.cor
- Comments:
mated reads; lib mean/std (CA estimates) = 170/21
first 2 and last base show composition bias; should we trim them???
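If we do decide to trim, clipping a fixed number of bases from each end is simple. A minimal sketch, assuming reads are already parsed into sequence and quality strings; the function name and defaults (2 bases at the start, 1 at the end, matching the positions noted above) are illustrative only:

```python
def trim_biased_ends(seq, qual, head=2, tail=1):
    """Drop `head` bases from the start and `tail` bases from the end.

    The quality string is clipped identically so the two stay in sync.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq) - tail
    return seq[head:end], qual[head:end]
```

With these defaults a 101bp read becomes 98bp.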
- Data
                   reads       min  max  mean  sum         cvg
Total              30,597,352  101  101  101   3090332552  1065
Sampled            2,295,176   101  101  101   231812776   78
Corrected(paired)  1,479,510   30   101  89    131973249   45.9
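The cvg column is presumably total bases divided by genome length. The page does not say which genome length was used, so the sketch below assumes the NC_010079 reference length (2872915bp) listed earlier. With that assumption the Corrected(paired) row comes out at about 45.9; the other rows come out up to a few percent higher than shown, so a slightly different length or rounding may have been used.

```python
def coverage(total_bases, genome_length):
    """Average depth of coverage: sequenced bases per genome base."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_bases / genome_length


# Assumed genome length: NC_010079 (S. aureus USA300_TCH1516), from this page.
SAUREUS_LENGTH = 2872915

# Corrected(paired) row of the table above.
print(round(coverage(131973249, SAUREUS_LENGTH), 1))
```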
- DeNovo
# ctg stats
            elem  min   q1     q2     q3     max      mean   n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      73    1716  14503  26431  52086  148524*  39043  58038*  2850157  31075   7700    31075   75      14            6
SOAPdenovo  9382  32    32     52     63     85850    347    16726   3259522  34960   722     37358   24      0             0
velvet      453   61    85     224    3279   137163   6297   36496   2852552  53075   10478   53410   156     17            6
# scf stats
            elem  min   q1     q2     q3     max      mean   n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      67    1716  14869  32929  58038  148524*  42541  65383*  2850277  31075   7700    31075   75      20            7  # 1 large rearrangement : scf120001252361 : NC_010079:622381-671743
SOAPdenovo  186   100   333    1528   17623  144079   15625  55558   2906207  61878   26070   62049   65      453           6
velvet      427   61    82     177    2982   137163   6685   37874   2854649  53268   10478   53623   156     42            8
- Location
/fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100.78X/
/fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Assembly//Illuminap100.78X.cor/
Illumina101.150X.cor
- Comments:
first 2 and last base show composition bias; should we trim them???
- Data
                   reads       min  max  mean  sum         cvg
Total              30,597,352  101  101  101   3090332552  1065
Sampled            4,404,626   101  101  101   444867226   154
Corrected(paired)  2,815,584   30   101  89    252331825   87
- DeNovo
# ctg stats
            elem   min   q1     q2     q3     max      mean   n50      sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      47     1866  11740  37171  88747  388257*  60272  121078*  2832762  39703   28204   39775   142     12            9
SOAPdenovo  15229  32    33     48     63     74601    232    13773    3530251  31690   269     36416   17      0             0
velvet      429    61    86     206    3147   134034   6650   40448    2852695  52754   9400    52940   152     24            9
# scf stats
            elem  min   q1     q2     q3     max      mean   n50      sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      46    1866  11740  39650  88747  388257*  61582  129426*  2832782  39703   28204   39775   142     13            10
SOAPdenovo  158   101   202    1249   20910  150305   18353  74149    2899760  64028   30903   64392   52      506           7
velvet      409   61    85     180    2854   142341   6978   42466    2853854  52940   9400    52751   150     43            12
Bacterium, E coli
Complete genome : NC_000913 4639675bp Escherichia coli str. K-12 substr. MG1655
454 FLX : Escherichia coli str. K-12 substr. MG1655 http://www.ncbi.nlm.nih.gov/sra/SRX000348?report=full
Illumina 101bp paired : Escherichia coli str. K-12 substr. MG1655 http://www.ncbi.nlm.nih.gov/sra/SRX016044?report=full
Illumina101.cor
- Data
mated reads; lib mean/std (CA estimates) = 160/24
                   reads       min  max  mean  sum         cvg
Total              20,635,060  101  101  101   2084141060  449
Sampled            3,591,676   101  101  101   362759276   78
Corrected(paired)  1,556,316   30   101  79    122785294   26
- DeNovo
# ctg stats
            elem  min   q1    q2    q3    max     mean  n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      677   1006  2265  4527  8853  45500   6604  10186   4471053  189441  162537  190445  170     16            4
SOAPdenovo  4000  32    35    57    101   43816   1172  9327    4687158  63179   3026    78607   67      1             0
velvet      711   61    166   2135  8579  57936*  6375  16400*  4532895  113612  30252   114128  166     31            5
# scf stats
            elem  min   q1    q2    q3     max      mean   n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
CA.bog      658   1006  2343  4621  9033   45500    6795   10483   4471433  189441  162537  190445  170     35            4
SOAPdenovo  270   100   269   3431  24735  219248*  17273  53882*  4663670  111651  71505   112331  128     826           2
velvet      480   61    92    1147  13028  114907   9523   32233   4571240  113936  30253   114192  161     262           11
- Location
/fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Data/Illuminap100/
/fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Assembly/Illuminap100.cor/
Human, a single chromosome, medium-sized
- Data
Human NA12878 Genome on Illumina
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
ginko:/scratch1/Human_NA12878_on_Illumina/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680 -> /scratch1/Human_NA12878_on_Illumina/SRP003680/
cat SRP003680/*/*.libinfo | grep Protocol: | count.pl | sed 's/Protocol: //'
400bp fragment library HiSeq plus Betaine    17
400bp lib HiSeq (plus accuprime)             17
ShARC                                        14
180bp fragment library                       11
Sheared Jumps: 2.5-3, 3-3.5 kb               8
EcoP15I library: 6-7, 7-8, 8-9, 10-12kb      7
400bp fragment library HiSeq minus Betaine   6
400bp fragment lib PCR FREE                  4
ShARC libraries                              4
FOSILL                                       4
180bp fragment library PCR free              4
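The protocol tally above relies on count.pl, a local script not shown on this page. A rough Python equivalent is sketched below, assuming count.pl simply counts identical lines and that each .libinfo file contains lines of the form `Protocol: <description>`; both function names are made up for illustration.

```python
from collections import Counter
import glob


def count_protocols(lines):
    """Tally the text after 'Protocol:' on every matching line."""
    counts = Counter()
    for line in lines:
        if "Protocol:" in line:
            counts[line.split("Protocol:", 1)[1].strip()] += 1
    # Most frequent first, like the listing on this page.
    return counts.most_common()


def read_lines(pattern):
    """Yield all lines from the files matching a glob pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            yield from handle


if __name__ == "__main__":
    for protocol, n in count_protocols(read_lines("SRP003680/*/*.libinfo")):
        print(protocol, n)
```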
grep ^Layout SRP003680/*/*.libinfo | sort -nk3 | head
SRP003680/SRR067577/SRR067577.libinfo:Layout: PAIRED 288 27.85 5'3'-3'5'
...
SRP003680/SRR067818/SRR067818.libinfo:Layout: PAIRED 40000 1021.19 5'3'-3'5'
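To work with the library layouts programmatically, the grep output can be parsed into records and sorted by insert size. In this sketch the two numbers after PAIRED are assumed to be the insert size mean and standard deviation (the page does not label them), and the `Layout` record and `parse_layout` function are illustrative names.

```python
from typing import NamedTuple


class Layout(NamedTuple):
    source: str         # file name prefix added by grep, may be empty
    kind: str           # e.g. PAIRED
    insert_mean: float  # assumed meaning of the first number
    insert_std: float   # assumed meaning of the second number
    orientation: str    # e.g. 5'3'-3'5'


def parse_layout(line):
    """Parse '[file:]Layout: PAIRED <mean> <std> <orientation>'."""
    prefix, sep, rest = line.partition("Layout:")
    if not sep:
        raise ValueError("not a Layout line: %r" % line)
    # Lines with fewer than four fields after 'Layout:' raise ValueError here.
    kind, mean, std, orientation = rest.split()[:4]
    return Layout(prefix.rstrip(":"), kind, float(mean), float(std), orientation)


def sort_by_insert(lines):
    """Return parsed layouts ordered by increasing insert size."""
    return sorted((parse_layout(l) for l in lines), key=lambda x: x.insert_mean)
```

Sorting the parsed records by `insert_mean` mirrors what `sort -nk3` does on the raw text.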