Pseudomonas aeruginosa: Difference between revisions
(→Ssake) |
(→NCBI) |
||
Line 9: | Line 9: | ||
[http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Pseudomonas%20aeruginosa%20PA14 PA14] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CP000438 CP000438] 1 chromosome, 6,537,648 bp, 66.29 %GC : most similar | [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Pseudomonas%20aeruginosa%20PA14 PA14] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CP000438 CP000438] 1 chromosome, 6,537,648 bp, 66.29 %GC : most similar | ||
[http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=16720 PA7] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CP000744 CP000744] 1 chromosome, 6,588,339 bp | [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=16720 PA7] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CP000744 CP000744] 1 chromosome, 6,588,339 bp | ||
5.7Kbp repeat coordinates: | |||
PA14: | |||
732540 738302 + | |||
4951956 4957714 - | |||
5535975 5541728 - | |||
6312434 6318199 - | |||
PAO1: | |||
721550 727325 + | |||
4788516 4794273 - | |||
5264042 5269801 - | |||
6039515 6045289 - | |||
=== CBCB === | === CBCB === |
Revision as of 19:36, 12 February 2008
Strain PA-b1
Data
NCBI
Complete strains:
PAO1 AE004091 1 chromosome, 6,264,404 bp, 66.56 %GC PA14 CP000438 1 chromosome, 6,537,648 bp, 66.29 %GC : most similar PA7 CP000744 1 chromosome, 6,588,339 bp
5.7Kbp repeat coordinates:
PA14: 732540 738302 + 4951956 4957714 - 5535975 5541728 - 6312434 6318199 -
PAO1: 721550 727325 + 4788516 4794273 - 5264042 5269801 - 6039515 6045289 -
CBCB
File location:
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
Files:
s_1_0001_prb.txt s_1_sequence.txt # contains seq & qual; seq names: HWI% ; all reads are 33 bp s_1_tag.txt s_7_0300_prb.txt s_7_sequence.txt # contains seq & qual; seq names HWU% ; all reads are 33 bp s_7_tag.txt wc -l s_1_sequence.txt s_7_sequence.txt 4105993 s_1_sequence.txt 4521907 s_7_sequence.txt 8627900 total
#all reads that contain at least 1 ambiguity grep -c N s_*seq s_1_sequence.seq:37933 s_7_sequence.seq:69669
#all reads that contain at least 2 ambiguities cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 18771 cat s_7_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589 ...
s_1 reads have significantly fewer N's than s_7 reads !!! s_1 reads align better to ref than s_2 reads !!! conflicts with Avg qualities: 28.26 for s_1, 31.05 for s_2 !!!
Assemblies
Untrimmed reads
All contain ref duplications (this section should be deleted eventually) Duplication causes
1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate 2. reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3) 3. The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other; reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced
Trimmed reads
Quality trimming
Art's script: qual-trim.awk
Min_Qual=15
. #elem #elem0 #elem<0 min median max sum mean stdev n50 qualTrim 8627900 1266 0 0 23 33 193999902 22.49 7.43 26
!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate
Alignment based trimming
PA14 ref assembly
Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2 Command: AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
1.a align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (-c 17 -l 17)
8.62M reads, 4.10M HWI 5.32M HWU 7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (-c 17 -l 17) 6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (-c 20 -l 20) 5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads) (-c 33 -l 17) 3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
1.b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14) Combine 1.a & 1.b
2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions
3. assemble them using the AMOScmp pipeline(ALIGN_WIGGLE=2 instead of 15)
Contigs stats: desc #elem min max mean stdev sum all 2053 17 170485 3011.84 11917.53 6183320 200 428 203 170485 14262.09 22852.74 6104175 10K 157 10240 170485 35468.89 26531.33 5568616 Singletons: 1127399
Data accuracy Get all assembled bases with coverage>=20 count_bases=5926977 sum_bases=235619001 sum_errors=2455670
sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error
PAO1 ref assembly
Same method except that 1.b was not used
Location: 2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2
Contigs stats: desc #elem min max mean stdev sum all 2797 17 75626 2161.19 5812.2 6044851 200 865 200 75626 6893.96 8766.63 5963278 10K 204 10016 75626 19016.22 10368 3879309 Singletons: 1592525
PA14 & PAO1 merge
Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies
Filter contigs that contain at least 2 adjacent PA14 merged by PAO1
desc #elem min max mean stdev sum all 1850 17 236472 3400.31 16863.79 6290586 200 306 204 236472 20318.3 37147.91 6217401 10K 113 10520 236472 52647.45 45602.76 5949162 Singletons: 1066226
Ssake
Input: 1066226 singletons
Ssake version 3.0 run with default parameters: -m Minimum number of reads needed to call a base during overhang consensus build up (default -m 16) -o Minimum number of reads needed to call a base during an extension (default -o 2) -r Minimum base ratio used to accept a overhang consensus base (default -r 0.6)
Contigs stats: desc #elem min max mean stdev sum all 19795 34 2866 67.78 102.73 1341825 200 879 200 2866 416.72 308.92 366304
200 bp contigs stats desc. #elem min mean max sum contig_len 879 200 416.72 2866 366304 contig_reads 879 52 414.21 3157 364091 contig_cvg 879 7.91 31.58 146.52 27764.51 Singletons: 702007
Find contigs overlapping 12 bp or more and merge them: desc #elem min max mean stdev sum original 879 200 2866 416.72 308.92 366304 new 670 200 4826 539.32 579.26 361350