Pseudomonas aeruginosa: Difference between revisions

From Cbcb
Jump to navigation Jump to search
Line 258: Line 258:
==== Alignment based trimming ====
==== Alignment based trimming ====


!!! Reduced the duplications significantly
==== PA14 ref assembly ====
 
Solution:


1. align all reads to the reference using nucmer. I initially used  minmatch=17, mincluster=17 (-c 17 -l 17)
1. align all reads to the reference using nucmer. I initially used  minmatch=17, mincluster=17 (-c 17 -l 17)
Line 271: Line 269:
       3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
       3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)


2. trim reads according to their nucmer alignment coordinates
2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions
3. assemble them using the AMOScmp pipeline. A modified version of make-consensus was used (ALIGN_WIGGLE=2 instead of 15)
3. assemble them using the AMOScmp pipeline. A modified version of make-consensus was used (ALIGN_WIGGLE=2 instead of 15)


2008_0109_AMOSCmp-PA14-relaxed-17-nucmer
2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2


   command: /nfshomes/dpuiu/bin/AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
   command: /nfshomes/dpuiu/bin/AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
Line 280: Line 278:
   qc stats: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0109_AMOSCmp-PA14-relaxed-17-nucmer/Pa.qc
   qc stats: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0109_AMOSCmp-PA14-relaxed-17-nucmer/Pa.qc


  All contigs stats:
  Contigs stats:
  #contigs          3137
  #singleton_contigs 1106 (probably kmers that match by chance?)
  min    17
  1quater 17
  median  27
  3quater 114
  max    103575
  sum    6168056
  mean    1966.23
  stdev  7084.82
  n50    22308
 
  Long(>10Kb) contigs stats:
  #contigs  195
  min    10005
  1quater 13763
  median  20558
  3quater 28890
  max    103575
  sum    4782226
  mean    24524.24
  stdev  15171.51
  n50    27496
    
    
Long contigs align at 99.5% id to reference
   desc   #elem   min    max    mean    stdev  sum
Most commons SPN's : a<->g, c<->t 
  all    2053   17     170485 3011.84 11917.53        6183320
  #      total
   200     428     203     170485 14262.09        22852.74        6104175
  ag      6376
   10K    157    10240   170485  35468.89        26531.33        5568616
  ga      6323
  ct      6152
  tc      6017
  cg      2890
  gc      2662
  ac      1169
  ca      1151
  gt      1091
  tg      1075
  c-      679
  -c      623
   ...
 
Singleton contig stats:
   #contigs   1106
  min    17
  1quater 17
  median  17
  3quater 18
  max    33
  sum    20205
  mean   18.27
   stdev  2.88
   n50    17
 
  Positive gap summary:
  #gaps   2883
  1quater 13
  median  44
  3quater 131
  max     3099
  sum     373849
  mean    129.67
  stdev  254.67
  n50     388
 
I aligned the contigs to the reference. There seem to be no duplications in the new assembly !!!
 
---
 
Other alignments:
 
  nucmer_params #alignments
  -l 17 -c 17    7261629
  -l 20 -c 20    6541682
  -l 14 -c 28    5178816
  -l 28 -c 28    4950683   
  -l 14 -c 14    8307164
 
  "-c 28" looks too stringent; many reads don't align
  "-l 14 -c 14" takes a very long time to run. If we decide to use these parameters, in order to speed up the process we could run nucmer twice: once using a larger match length (-l 16..20) and then a smaller match length (-l 14) on the reads which did not align or on the no/low coverage reference regions. The two delta files could be easily merged in the end and used as input for the layout.
 
 
   Sample data set of 8,627 reads (1/1000 of total reads) aligned to ref using nucmer -l 17 -c 17; look for SNP's
 
   cat Pa.sample.snps | cut -f3 | count.pl
  #      total
  ca      606
  ga      280
  tg      277
  ct      221
  tc      180
  ag      158
  gt      142
  ac      120
  cg      117
  gc      102
  ta      46
  at      12
  -c      8
  g-      7
  a-      6
  -a      6
  c-      6
  -t      6
  nc      5
  na      4
  nt      3
  -g      3
  ng      3
  t-      2
  Total  2320

Revision as of 14:39, 12 February 2008

Strain PA-b1

Data

NCBI

Complete strains:

 PAO1 AE004091 1 chromosome, 6,264,404 bp, 66.56 %GC
 PA14 CP000438 1 chromosome, 6,537,648 bp, 66.29 %GC : most similar
 PA7 CP000744  1 chromosome, 6,588,339 bp

CBCB

File location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
 

Files:

 s_1_0001_prb.txt
 s_1_sequence.txt  # contains seq & qual; seq names: HWI% ; all reads are 33 bp
 s_1_tag.txt         
 
 s_7_0300_prb.txt
 s_7_sequence.txt  # contains seq & qual; seq names HWU%  ; all reads are 33 bp
 s_7_tag.txt    
 
 wc -l s_1_sequence.txt s_7_sequence.txt 
  4105993 s_1_sequence.txt
  4521907 s_7_sequence.txt
  8627900 total
 #all reads that contain at least 1 ambiguity
 grep -c N s_*seq
  s_1_sequence.seq:37933
  s_7_sequence.seq:69669
 #all reads that contain at least 2 ambiguities
 cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l  : 18771
 cat s_7_sequence.seq  | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589
 ...
 s_1 reads have significantly fewer N's than s_7 reads !!!
 s_1 reads align better to ref than s_2 reads !!!
 
 conflicts with
 
 Avg qualities: 28.26 for s_1, 31.05 for s_2 !!!

Assemblies

Untrimmed reads: all contain ref duplications (this section should be deleted eventually)

 1. 2007_1203_AMOSCmp-PAO1
 2. 2007_1203_AMOSCmp-PAO1-relaxed
 3. 2007_1204_AMOSCmp-PA14
 4. 2007_1204_AMOSCmp-PA14-relaxed 
 5. 2007_1210_AMOSCmp_PA7-relaxed
 6. 2007_1204_AMOSCmp-PA14-relaxed-noN (not shown): removed all reads that contain N's; same parameters as 2007_1204_AMOSCmp-PA14-relaxed => no big differences compared to 2007_1204_AMOSCmp-PA14-relaxed
 QC stats:
 Assembly                              1              2               3              4            5
 ===================================================================================================
 
 [Info]
 REF                                 PAO1           PAO1           PA14           PA14            PA7
 MINCLUSTER                            20             20             20             20             20
 MINOVL                                10              3             10              3              3
 MAJORITY                              70             50             70             50             50
 MINMATCH                              20             20             20             20             20
 
 [Scaffolds]
 TotalScaffolds                         1              1              1              1              1
 TotalContigsInScaffolds             4412           2197           3226           1828          25419
 MeanContigsPerScaffold              4412           2197           3226           1828          25419
 MinContigsPerScaffold               4412           2197           3226           1828          25419
 MaxContigsPerScaffold               4412           2197           3226           1828          25419
 TotalBasesInScaffolds            6000280        5985597        6137031        6127761        4924485
 MeanBasesInScaffolds             6000280        5985597        6137031        6127761        4924485
 MaxBasesInScaffolds              6000280        5985597        6137031        6127761        4924485
 N50ScaffoldBases                 6000280        5985597        6137031        6127761        4924485
 TotalSpanOfScaffolds             6264430        6264430        6537674        6537674        6588355
 MeanSpanOfScaffolds              6264430        6264430        6537674        6537674        6588355
 MinScaffoldSpan                  6264430        6264430        6537674        6537674        6588355
 MaxScaffoldSpan                  6264430        6264430        6537674        6537674        6588355
 IntraScaffoldGaps                   4411           2196           3225           1827          25418
 2KbScaffolds                           1              1              1              1              1
 2KbScaffoldSpan                  6264430        6264430        6537674        6537674        6588355
 2KbScaffoldPercent                   100            100            100            100            100
 MeanSequenceGapSize                   59            126            123            223             64
 
 [Contigs]
 TotalContigs                        4412           2197           3226           1828          25419
 TotalBasesInContigs              6620035        6602214        6791916        6782045        5334464
 MeanContigSize                      1500           3005           2105           3710            210
 MinContigSize                         33             33             33             33             33
 MaxContigSize                      21576          46268          61425          86811          13454
 N50ContigBases                      3548           8478           6480          14280            410
 
 [BigContigs_greater_10000]
 TotalBigContigs                       44            173            142            229              3
 BigContigLength                   547699        2709766        2127747        4558923          34975
 MeanBigContigSize                  12448          15663          14984          19908          11658
 MinBigContig                       10018          10045          10022          10040          10262
 MaxBigContig                       21576          46268          61425          86811          13454
 BigContigsPercentBases                 8             41             31             67           0.66
 
 [SmallContigs]
 TotalSmallContigs                   4368           2024           3084           1599          25416
 SmallContigLength                6072336        3892448        4664169        2223122        5299489
 MeanSmallContigSize                 1390           1923           1512           1390            209
 MinSmallContig                        33             33             33             33             33
 MaxSmallContig                      9964           9996           9977           9995           8411
 SmallContigsPercentBases              92             59             69             33             99
 
 [Reads]
 TotalReads                       8627900        8627900        8627900        8627900        8627900
 ReadsInContigs                   6218126        6218255        6541160        6541364        3766798
 BigContigReads                    549658        2601986        2115038        4460608          39074
 SmallContigReads                 5668468        3616269        4426122        2080756        3727724
 SingletonReads                   2409774        2409645        2086740        2086536        4861102

 1: nucmer -c 20 -l 20
 2: nucmer -c 19 -l 19
 3: nucmer -c 18 -l 18
 4: nucmer -c 17 -l 17
 5: nucmer -c 16 -l 16 : still running
 Assembly                              1              2               3              4   
 ========================================================================================
 
 [Info]
 REF                                 PA14           PA14           PA14           PA14
 MINCLUSTER                            20             19             18             17
 MINMATCH                              20             19             18             17
 MINOVL                                 3              3              3              3
 MAJORITY                              50             50             50             50
 
 [Scaffolds]
 TotalScaffolds                         1              1              1              1
 TotalContigsInScaffolds             1828           1760           1949           2447
 MeanContigsPerScaffold              1828           1760           1949           2447
 MinContigsPerScaffold               1828           1760           1949           2447
 MaxContigsPerScaffold               1828           1760           1949           2447
 TotalBasesInScaffolds            6127761        6139966        6161019        6202155
 MeanBasesInScaffolds             6127761        6139966        6161019        6202155
 MaxBasesInScaffolds              6127761        6139966        6161019        6202155
 N50ScaffoldBases                 6127761        6139966        6161019        6202155
 TotalSpanOfScaffolds             6537674        6537675        6537676        6537680
 MeanSpanOfScaffolds              6537674        6537675        6537676        6537680
 MinScaffoldSpan                  6537674        6537675        6537676        6537680
 MaxScaffoldSpan                  6537674        6537675        6537676        6537680
 IntraScaffoldGaps                   1827           1759           1948           2446
 2KbScaffolds                           1              1              1              1
 2KbScaffoldSpan                  6537674        6537675        6537676        6537680
 2KbScaffoldPercent                   100            100            100            100
 MeanSequenceGapSize                  223            225            192            136
 
 [Contigs]
 TotalContigs                        1828           1760           1949           2447
 TotalBasesInContigs              6782045        7000867        7473670        8558383
 MeanContigSize                      3710           3978           3835           3498
 MinContigSize                         33             33             33             33
 MaxContigSize                      86811          99026         134098         254819
 N50ContigBases                     14280          20312          30580          52845
 
 [BigContigs_greater_10000]
 TotalBigContigs                      229            218            221            193
 BigContigLength                  4558923        5260728        6380541        7854838
 MeanBigContigSize                  19908          24132          28871          40699
 MinBigContig                       10040          10014          10072          10051
 MaxBigContig                       86811          99026         134098         254819
 BigContigsPercentBases                67             75             85             92
 
 [SmallContigs]
 TotalSmallContigs                   1599           1542           1728           2254
 SmallContigLength                2223122        1740139        1093129         703545
 MeanSmallContigSize                 1390           1128            633            312
 MinSmallContig                        33             33             33             33
 MaxSmallContig                      9995           9978           9849           9820
 SmallContigsPercentBases              33             25             15              8
 
 [Reads]
 TotalReads                       8627900        8627900        8627900        8627900
 ReadsInContigs                   6541364        6744255        6967026        7260781
 BigContigReads                   4460608        5102815        5993784        6716141
 SmallContigReads                 2080756        1641440         973242         544640
 SingletonReads                   2086536        1883645        1660874        1367119
 ::::::::::::::
 contig.summary
 ::::::::::::::
 -c,-l   #elem   #elem0  #elem<0 min     median  max     sum     mean    stdev   n50
 20      1828    0       0       33      251     64109   5521501 3020.51 6134.05 11727
 19      1760    0       0       33      129     94246   6659370 3783.73 9299.94 19361
 18      1949    0       0       33      63      126592  7021203 3602.46 10698.9 29219
 17      2447    0       0       33      52      234167  7908046 3231.73 13190.4 49318
 ::::::::::::::
 contig.10000.summary
 ::::::::::::::
 -c,-l   #elem   #elem0  #elem<0 min     median  max     sum     mean            stdev           n50
 20      178     0       0       10056   14356   64109   3173401 17828.1         9644.73         17994
 19      204     0       0       10026   18436   94246   4868508 23865.24        15842.96        27908
 18      212     0       0       10196   21088   126592  5899614 27828.37        19240.29        35458
 17      190     0       0       10058   27883   234167  7223145 38016.55        30263.25        54897

2007_1204_AMOSCmp-PA14-relaxed

Find adjacent contigs:

1. nucmer contigs, filter adjacent contig end matches

  overlaps haev to be >=20 bp
  => 12 overlaps
 $ nucmer -c 20 -l 20 
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2007_1204_AMOSCmp-PA14-relaxed/nucmer/Pa.fasta  
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2007_1204_AMOSCmp-PA14-relaxed/nucmer/Pa.fasta
 ...
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
   6738     6765  |        1       28  |       28       28  |   100.00  |     6765    41748  |     0.41     0.07  | 1132       1133    [END]
    133      161  |        1       29  |       29       29  |   100.00  |      161    10235  |    18.01     0.28  | 1271       1272    [END]
  11534    11631  |        1       98  |       98       98  |    92.86  |    11631     3390  |     0.84     2.89  | 1409       1410    [END]
    102      122  |        1       21  |       21       21  |   100.00  |      122       46  |    17.21    45.65  | 1464       1465    [END]
    281      301  |        1       21  |       21       21  |   100.00  |      301      183  |     6.98    11.48  | 1623       1624    [END]
    659      988  |        1      303  |      330      303  |    79.15  |      988      418  |    33.40    72.49  | 1626       1627    [END]
    177      416  |        5      241  |      240      237  |    75.93  |      418     1604  |    57.42    14.78  | 1627       1628    [END]
   1296     1344  |        1       49  |       49       49  |   100.00  |     1344    15090  |     3.65     0.32  | 217        218     [END]
    319      652  |        1      212  |      334      212  |    60.30  |      652      322  |    51.23    65.84  | 314        315     [END]
  13265    13285  |        1       21  |       21       21  |   100.00  |    13285    14465  |     0.16     0.15  | 757        758     [END]
    892      995  |        5      176  |      104      172  |    59.30  |     1000     3085  |    10.40     5.58  | 815        816     [END]
   2406     2425  |        1       20  |       20       20  |   100.00  |     2425     5200  |     0.82     0.38  | 93         94      [END]

2. nucmer PA14 contigs to PAO1 contigs

Duplication causes

 1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate
 2. reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
 3. The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other;  reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced

Trimmed reads

Quality trimming

Art's script: qual-trim.awk

Min_Qual=15

 .               #elem   #elem0  #elem<0 min     median  max     sum             mean    stdev   n50
 qualTrim        8627900 1266    0       0       23      33      193999902       22.49   7.43    26

!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate

Alignment based trimming

PA14 ref assembly

1. align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (-c 17 -l 17)

      8.62M reads, 4.10M HWI 5.32M HWU
 
      7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (-c 17 -l 17)
      6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (-c 20 -l 20)
      5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads)  (-c 33 -l 17)
      3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)

2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions 3. assemble them using the AMOScmp pipeline. A modified version of make-consensus was used (ALIGN_WIGGLE=2 instead of 15)

2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2

 command: /nfshomes/dpuiu/bin/AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
 location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0109_AMOSCmp-PA14-relaxed-17-nucmer
 qc stats: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0109_AMOSCmp-PA14-relaxed-17-nucmer/Pa.qc
Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 all     2053    17      170485  3011.84 11917.53        6183320
 200     428     203     170485  14262.09        22852.74        6104175
 10K    157     10240   170485  35468.89        26531.33        5568616